When AI reasoning goes wrong: Microsoft Research shows more tokens can mean more problems

April 16, 2025

Large language models (LLMs) are increasingly capable of complex reasoning through “inference-time scaling,” a set of techniques that allocate more computational resources during inference to generate answers. However, a new study from Microsoft Research reveals that the effectiveness of these scaling methods isn’t universal. Performance boosts vary significantly across different models, tasks and problem complexities.

The core finding is that simply throwing more compute at a problem during inference doesn’t guarantee better or more efficient results. The findings can help enterprises better understand cost volatility and model reliability as they look to integrate advanced AI reasoning into their applications.

Putting scaling methods to the test

The Microsoft Research team conducted an extensive empirical analysis across nine state-of-the-art foundation models. These included “conventional” models such as GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro and Llama 3.1 405B, as well as models specifically fine-tuned for enhanced reasoning through inference-time scaling: OpenAI’s o1 and o3-mini, Anthropic’s Claude 3.7 Sonnet, Google’s Gemini 2 Flash Thinking, and DeepSeek R1.

They evaluated these models using three distinct inference-time scaling approaches:

Standard Chain-of-Thought (CoT): The basic method, in which the model is prompted to answer step by step.
Parallel Scaling: The model generates multiple independent answers to the same question and uses an aggregator (such as majority vote or selecting the best-scoring answer) to arrive at a final result.
Sequential Scaling: The model iteratively generates an answer and uses feedback from a critic (potentially the model itself) to refine the answer in subsequent attempts.
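To make the difference between the parallel and sequential approaches concrete, here is a minimal Python sketch. It is not the researchers’ code: the `generate` and `critic` callables, their signatures, and the toy stand-ins below are all hypothetical placeholders for real model calls, and majority vote is just one of the aggregators mentioned above.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical signatures: a generator maps (question, optional feedback) to an
# answer string; a critic returns feedback text, or None when it is satisfied.
Generator = Callable[[str, Optional[str]], str]
Critic = Callable[[str, str], Optional[str]]


def parallel_scaling(generate: Generator, question: str, n: int = 5) -> str:
    """Sample n independent answers and aggregate them by majority vote."""
    answers = [generate(question, None) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def sequential_scaling(
    generate: Generator, critic: Critic, question: str, max_rounds: int = 3
) -> str:
    """Generate an answer, then refine it with critic feedback for up to max_rounds attempts."""
    answer = generate(question, None)
    for _ in range(max_rounds - 1):
        feedback = critic(question, answer)
        if feedback is None:  # the critic accepts the answer
            break
        answer = generate(question, feedback)
    return answer


# Toy stand-ins for a model and a critic (assumptions for illustration only).
def make_sampler(outputs):
    """Return a generator that replays a fixed list of answers."""
    it = iter(outputs)

    def generate(question: str, feedback: Optional[str] = None) -> str:
        return next(it)

    return generate


def toy_generate(question: str, feedback: Optional[str] = None) -> str:
    # Pretend the model only gets it right after receiving feedback.
    return "42" if feedback else "41"


def toy_critic(question: str, answer: str) -> Optional[str]:
    return None if answer == "42" else "Re-check the arithmetic."


parallel_answer = parallel_scaling(make_sampler(["42", "41", "42"]), "What is 6 * 7?", n=3)
sequential_answer = sequential_scaling(toy_generate, toy_critic, "What is 6 * 7?")
print(parallel_answer, sequential_answer)
```

Both strategies multiply the number of model calls, and therefore the tokens billed, which is why the study tracks accuracy and cost together.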

These approaches were tested on eight challenging benchmark datasets covering a wide range of tasks that benefit from step-by-step problem-solving: math and STEM reasoning (AIME, Omni-MATH, GPQA), calendar planning (BA-Calendar), NP-hard problems (3SAT, TSP), navigation (Maze) and spatial reasoning (SpatialMap).

Several benchmarks included problems with varying difficulty levels, allowing for a more nuanced understanding of how scaling behaves as problems become harder.

“The availability of difficulty tags for Omni-MATH, TSP, 3SAT, and BA-Calendar enables us to analyze how accuracy and token usage scale with difficulty in inference-time scaling, which is a perspective that is still underexplored,” the researchers wrote in the paper detailing their findings.

The researchers evaluated the Pareto frontier of LLM reasoning by analyzing both accuracy and computational cost (i.e., the number of tokens generated). This helps identify how efficiently models achieve their results.
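As a rough illustration of what an accuracy-versus-tokens Pareto frontier means, the sketch below keeps only the models that no other model beats on both axes. The model names and numbers are invented for the example and are not results from the study; the function name `pareto_frontier` is likewise an assumption.

```python
from typing import List, Tuple

# (name, average tokens generated, accuracy). All values are made up.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points that are not dominated, i.e. no other point is at least as
    cheap and at least as accurate. Exact duplicates are collapsed to one."""
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; for equal cost, put the more accurate point first.
    for name, tokens, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:
            frontier.append((name, tokens, accuracy))
            best_accuracy = accuracy
    return frontier


hypothetical_results = [
    ("model_a", 1000, 0.50),
    ("model_b", 2000, 0.40),  # costs more than model_a and is less accurate
    ("model_c", 3000, 0.80),
    ("model_d", 3000, 0.70),  # same cost as model_c but less accurate
]

for name, tokens, accuracy in pareto_frontier(hypothetical_results):
    print(f"{name}: {tokens} tokens, accuracy {accuracy:.2f}")
```

A model that falls off this frontier spends more tokens than some alternative without gaining accuracy in return.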

Inference-time scaling Pareto frontier. Credit: arXiv

They also introduced the “conventional-to-reasoning gap” measure, which compares the best possible performance of a conventional model (using an ideal “best-of-N” selection) against the average performance of a reasoning model, estimating the potential gains achievable through better training or verification techniques.
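The arithmetic behind such a comparison can be sketched as follows. This is an interpretation rather than the paper’s exact formula: the sign convention (reasoning average minus conventional best-of-N), the function names, and the toy run data are all assumptions. Best-of-N here counts a problem as solved if any of the N attempts is correct, which is what an ideal verifier would achieve.

```python
from typing import List

# One inner list per problem; each entry records whether one run was correct.
Runs = List[List[bool]]


def best_of_n_accuracy(runs: Runs) -> float:
    """Share of problems where at least one of the N runs is correct."""
    return sum(any(problem) for problem in runs) / len(runs)


def average_accuracy(runs: Runs) -> float:
    """Mean per-run accuracy, averaged over problems."""
    return sum(sum(problem) / len(problem) for problem in runs) / len(runs)


def conventional_to_reasoning_gap(conventional: Runs, reasoning: Runs) -> float:
    """Assumed sign convention: a positive value means the reasoning model's
    average still beats the conventional model's best-of-N."""
    return average_accuracy(reasoning) - best_of_n_accuracy(conventional)


# Toy data: 4 problems, 2 runs each (invented for illustration).
conventional_runs = [[True, False], [False, False], [False, True], [False, False]]
reasoning_runs = [[True, True], [True, False], [True, True], [False, False]]

gap = conventional_to_reasoning_gap(conventional_runs, reasoning_runs)
print(f"best-of-N (conventional): {best_of_n_accuracy(conventional_runs):.3f}")
print(f"average (reasoning):      {average_accuracy(reasoning_runs):.3f}")
print(f"gap:                      {gap:.3f}")
```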

More compute isn’t always the answer

The study provided several crucial insights that challenge common assumptions about inference-time scaling:

Benefits vary significantly: While models tuned for reasoning generally outperform conventional ones on these tasks, the degree of improvement varies greatly depending on the specific domain and task. Gains often diminish as problem complexity increases. For instance, performance improvements seen on math problems did not always translate equally to scientific reasoning or planning tasks.

Token inefficiency is rife: The researchers observed high variability in token consumption, even between models achieving similar accuracy. For example, on the AIME 2025 math benchmark, DeepSeek-R1 used over five times more tokens than Claude 3.7 Sonnet for roughly comparable average accuracy.

More tokens do not always lead to higher accuracy: Contrary to the intuitive idea that longer reasoning chains mean better reasoning, the study found this isn’t always true. “Surprisingly, we also observe that longer generations relative to the same model can sometimes be an indicator of models struggling, rather than improved reflection,” the paper states. “Similarly, when comparing different reasoning models, higher token usage is not always associated with better accuracy. These findings motivate the need for more purposeful and cost-effective scaling approaches.”

Cost nondeterminism: Perhaps most concerning for enterprise users, repeated queries to the same model for the same problem can result in highly variable token usage. This means the cost of running a query can fluctuate significantly, even when the model consistently provides the correct answer.

Variance in response length (spikes show smaller variance). Credit: arXiv

The potential in verification mechanisms: Scaling performance consistently improved across all models and benchmarks when simulated with a “perfect verifier” (using the best-of-N results).

Conventional models sometimes match reasoning models: By significantly increasing inference calls (up to 50x more in some experiments), conventional models like GPT-4o could sometimes approach the performance levels of dedicated reasoning models, particularly on less complex tasks. However, these gains diminished rapidly in highly complex settings, indicating that brute-force scaling has its limits.

On some tasks, the accuracy of GPT-4o continues to improve with parallel and sequential scaling. Credit: arXiv

Implications for the enterprise

These findings carry significant weight for developers and enterprise adopters of LLMs. The issue of “cost nondeterminism” is particularly stark and makes budgeting difficult. As the researchers point out, “Ideally, developers and users would prefer models for which the standard deviation on token usage per instance is low for cost predictability.”

“The profiling we do in [the study] could be useful for developers as a tool to pick which models are less volatile for the same prompt or for different prompts,” Besmira Nushi, senior principal research manager at Microsoft Research, told VentureBeat. “Ideally, one would want to pick a model that has low standard deviation for correct inputs.”

Models that peak blue to the left consistently generate the same number of tokens on the given task. Credit: arXiv
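A developer could run a similar, much smaller profiling pass on their own prompts. The sketch below is only an illustration under stated assumptions: it reads “low standard deviation for correct inputs” as the spread of token counts over runs that produced a correct answer, and the model names, token counts and helper names are hypothetical.

```python
import statistics
from typing import Dict, List, Optional, Tuple

# Each run is (tokens generated, whether the answer was correct).
Run = Tuple[int, bool]


def token_spread(runs: List[Run]) -> Optional[Tuple[float, float]]:
    """Mean and sample standard deviation of token usage over correct runs.
    Returns None when fewer than two correct runs are available."""
    correct_tokens = [tokens for tokens, is_correct in runs if is_correct]
    if len(correct_tokens) < 2:
        return None
    return statistics.mean(correct_tokens), statistics.stdev(correct_tokens)


def most_predictable(models: Dict[str, List[Run]]) -> Optional[str]:
    """Name of the model with the lowest token-usage standard deviation."""
    spreads = {name: token_spread(runs) for name, runs in models.items()}
    usable = {name: spread for name, spread in spreads.items() if spread is not None}
    if not usable:
        return None
    return min(usable, key=lambda name: usable[name][1])


# Repeated runs of the same prompt on two hypothetical models.
profile = {
    "model_a": [(1000, True), (1100, True), (900, True), (5000, False)],
    "model_b": [(800, True), (3000, True), (1500, True), (700, True)],
}

for name, runs in profile.items():
    print(name, token_spread(runs))
print("lowest cost variance:", most_predictable(profile))
```

In practice the choice would weigh this spread against accuracy and average cost rather than using it alone.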

The study also provides good insights into the correlation between a model’s accuracy and response length. For example, one chart from the study shows that math queries longer than roughly 11,000 tokens have a very slim chance of being correct, and those generations should either be stopped at that point or restarted with some sequential feedback. However, Nushi points out that models allowing these post hoc mitigations also have a cleaner separation between correct and incorrect samples.
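One way to look for such a cutoff in logged generations is sketched below. It is a hypothetical heuristic, not the study’s procedure: the sample data, the candidate thresholds, the `max_accuracy` tolerance and the function names are all assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Each logged sample is (response length in tokens, whether it was correct).
Sample = Tuple[int, bool]


def accuracy_above(samples: List[Sample], threshold: int) -> Optional[float]:
    """Accuracy among generations longer than threshold, or None if there are none."""
    long_runs = [is_correct for length, is_correct in samples if length > threshold]
    if not long_runs:
        return None
    return sum(long_runs) / len(long_runs)


def smallest_cutoff(
    samples: List[Sample], candidates: Sequence[int], max_accuracy: float = 0.05
) -> Optional[int]:
    """Smallest candidate threshold beyond which accuracy is at most max_accuracy."""
    for threshold in sorted(candidates):
        accuracy = accuracy_above(samples, threshold)
        if accuracy is not None and accuracy <= max_accuracy:
            return threshold
    return None


# Invented log of past generations on a math task.
logged = [
    (2000, True), (4000, True), (9000, True),
    (12000, False), (13000, False), (15000, False),
]

cutoff = smallest_cutoff(logged, candidates=[5000, 8000, 11000])
print("stop or restart generations longer than:", cutoff)
```

A generation that crosses the chosen cutoff could then be aborted or retried with feedback, as the paragraph above suggests.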

“Ultimately, it is also the responsibility of model builders to think about reducing accuracy and cost non-determinism, and we expect a lot of this to happen as the methods get more mature,” Nushi said. “Alongside cost nondeterminism, accuracy nondeterminism also applies.”

Another important finding is the consistent performance boost from perfect verifiers, which highlights a critical area for future work: building robust and broadly applicable verification mechanisms.

“The availability of stronger verifiers can have different types of impact,” Nushi said, such as improving foundational training methods for reasoning. “If used efficiently, these can also shorten the reasoning traces.”

Strong verifiers can also become a central part of enterprise agentic AI solutions. Many enterprise stakeholders already have such verifiers in place, such as SAT solvers and logistic validity checkers, which may need to be repurposed for more agentic solutions.

“The questions for the future are how such existing techniques can be combined with AI-driven interfaces and what is the language that connects the two,” Nushi said. “The necessity of connecting the two comes from the fact that users will not always formulate their queries in a formal way, they will want to use a natural language interface and expect the solutions in a similar format or in a final action (e.g. propose a meeting invite).”
