The latest in generative artificial intelligence includes AI agents that can access the web to find answers to questions. While promising, agentic technology is very much a work in progress.
In a paper published last week, OpenAI researchers relate how the company's Deep Research technology, which was built to use the web, does far better than OpenAI's other models when answering web questions. It also does far better than humans on tasks requiring hours of searching.
But Deep Research still stumbles almost half the time.
OpenAI's new test suggests Deep Research can be more tenacious and dogged in pursuit of an answer than human researchers for some tasks, but it still often fails to come up with an answer.
Called BrowseComp, the test is described by authors Jason Wei and team as "a simple yet challenging benchmark for measuring the ability of agents to browse the web."
The premise is that AI agents — meaning, AI models that can browse "thousands of web pages" — could be much more resourceful than humans, who have limited memory, get fatigued browsing the web, and "can only attend to one thing at a time and cannot be parallelized," meaning, they can't direct their brains to operate on data in parallel streams of thought.
"Machine intelligence, on the other hand, has much more extensive recall and can operate tirelessly without getting distracted," write Wei and team.
Wei and team built on their prior work from last year, "SimpleQA," which tests AI models' ability to answer "short, fact-seeking questions." The questions covered TV and movie trivia, science, history, music, video games, politics, and other topics.
The BrowseComp set of 1,266 questions is designed to go beyond simple information retrieval, the authors relate. Instead, they are questions for which it's hard to find the answers — or, as they put it, "challenging because they require searching through a large space of potential answers and matching them to constraints posed in the question," and involve "hard-to-find, deeply entangled information on the web."
For example, one question-and-answer pair is the following:
Identify the title of a research publication published before June 2023 that mentions cultural traditions, scientific processes, and culinary innovations. It is co-authored by three individuals: one of them was an assistant professor in West Bengal and another one holds a Ph.D.
(Answer: The Fundamentals of Bread Making: The Science of Bread)
They emphasize that such a question is easy to verify because the answer is contained in a single phrase that is "self-contained."
The questions and answers were developed by human "trainers," and they were selected as being impossible to solve with just OpenAI's ChatGPT, with or without browsing abilities. The questions were also impossible for an "early version" of Deep Research.
Demonstrating just how weak humans are at searching the web, they first had humans who were "familiar with the dataset" attempt to answer the questions.
The results were not good for the humans. For 70% of the questions, the humans gave up after two hours of effort. They only answered about 30% of the questions, and for 14% of their proposed answers, the humans' suggestions did not match the actual answer.
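Those figures combine in a simple way. A minimal sketch of the arithmetic, using the approximate percentages reported in the article:

```python
# Rough arithmetic on the reported human results (approximate figures
# from the article, not exact numbers from the paper): humans answered
# about 30% of questions, and 14% of those proposed answers were wrong.
answered = 0.30          # fraction of questions humans answered at all
mismatch_rate = 0.14     # fraction of proposed answers that were wrong
correct = answered * (1 - mismatch_rate)
print(f"Overall human accuracy: {correct:.1%}")  # roughly a quarter of all questions
```

So even counting only the questions humans were willing to attempt, overall human accuracy lands near 26%.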
Wei and team hypothesize that humans with greater searching skills could do better: "It is possible that many of the problems that they gave up on would be solvable by experienced professionals (e.g., detectives or investigative journalists) with ample time."
After the humans, they tested Deep Research against OpenAI's GPT-4o (with and without browsing abilities), GPT-4.5, and the o1 model.
The results were abysmal. "GPT-4o and GPT-4.5 achieved near-zero accuracy, highlighting the difficulty of the benchmark," they write. "Without strong reasoning or tool use, models fail to retrieve the kinds of obscure, multi-hop facts BrowseComp targets."
O1 fared better, which "[suggests] that some BrowseComp answers can be surfaced via inference over internal knowledge."
With a score of 51.5%, Deep Research was "significantly better," and "it is particularly effective at answering the niche, non-intuitive questions that require browsing numerous websites," Wei and team write.
However, they also found that GPT-4o using browsing and Deep Research can err by being "overconfident" about wrong answers, which is known as a calibration error.
"Models with browsing capabilities such as GPT-4o with browsing and Deep Research exhibit higher calibration error," they write, "suggesting that access to web tools may increase the model's confidence in incorrect answers. This aligns with observations that Deep Research struggles with confidence calibration and often fails to convey uncertainty accurately at present."
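Calibration error, in general terms, measures the gap between a model's stated confidence and how often it is actually right. A minimal sketch of one common formulation, expected calibration error, using invented data for illustration (this is not code from the paper):

```python
# Expected calibration error (ECE): bin predictions by stated confidence,
# then average the gap between mean confidence and empirical accuracy,
# weighted by bin size. Data below is invented for illustration.
def expected_calibration_error(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# An "overconfident" model: high stated confidence, mostly wrong answers.
confs = [0.9, 0.95, 0.85, 0.9, 0.8]
right = [1, 0, 0, 0, 1]
print(round(expected_calibration_error(confs, right), 3))  # → 0.48
```

A well-calibrated model that says "90% confident" should be right about 90% of the time; the large gap here is exactly the overconfidence pattern the authors describe.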
To correct for calibration error, they conducted another test with Deep Research, in which the model had to output as many as 64 answers to each question. Then, they had the model pick the best of them. When it did so, Deep Research was quite good at choosing the correct answer from among all the proposals.
That, write Wei and team, suggests that "the model often 'knows' when it's right, even if it struggles to express that certainty as a calibrated probability."
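The best-of-n procedure described above can be sketched as follows. Here, `generate_answer` and `judge_confidence` are hypothetical stand-ins for calls to the model, stubbed with random choices so the sketch runs:

```python
import random

# Best-of-n sampling as described in the article: sample many candidate
# answers, then have the model judge which candidate it is most
# confident in. generate_answer and judge_confidence are hypothetical
# stand-ins for model calls, stubbed here for illustration.

def generate_answer(question, rng):
    # Stand-in for one full research rollout on the question.
    return rng.choice(["answer A", "answer B", "answer C"])

def judge_confidence(question, answer, rng):
    # Stand-in for the model scoring its own candidate answer.
    return rng.random()

def best_of_n(question, n=64, seed=0):
    rng = random.Random(seed)
    candidates = [generate_answer(question, rng) for _ in range(n)]
    return max(candidates, key=lambda a: judge_confidence(question, a, rng))

print(best_of_n("Identify the title of a research publication..."))
```

The design choice worth noting is that generation and selection are separate steps: the model's relative preference among its own candidates turns out to be more reliable than the absolute confidence it attaches to any single one.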
They note, too, that the success of Deep Research improves with more computing added to it when it searches the web. Put differently, "performance scales smoothly as a function of the amount of test-time compute used." That squares with an emerging trend of throwing more GPU chips at the task of inference.
Wei and team don't directly offer any hypothesis for why Deep Research fails almost half the time, but the implicit answer lies in the scaling of its ability with more compute. As they run more parallel tasks, and ask the model to evaluate multiple answers, accuracy scales past 75% of the questions answered.
The implication is that it's essential to choose strategies that force the model to evaluate its own efforts rather than simply chasing a single answer. Without that evaluation stage, the model struggles a good deal of the time.
A big hole in BrowseComp, the authors acknowledge, is that it is limited to questions that are easy for the computer to parse, and whose answers are easy to verify. None of the 1,266 questions involved "long responses or ability to resolve ambiguity in user queries."
As a result, BrowseComp, they argue, tests "core" capabilities of AI agents but is not comprehensive. "The model must be very proficient at locating hard-to-find pieces of information, but it's not guaranteed that this generalizes to all tasks that require browsing."
Deep Research is available to users of OpenAI's Plus and Pro subscriptions.