
Anthropic researchers discover the weird AI problem: Why thinking longer makes models dumber

Artificial intelligence models that spend more time “thinking” through problems don’t always perform better, and in some cases they get significantly worse, according to new research from Anthropic that challenges a core assumption driving the AI industry’s latest scaling efforts.

The study, led by Anthropic AI safety fellow Aryo Pradipta Gema and other company researchers, identifies what they call “inverse scaling in test-time compute,” where extending the reasoning length of large language models actually deteriorates their performance across several types of tasks. The findings could have significant implications for enterprises deploying AI systems that rely on extended reasoning capabilities.

“We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy,” the Anthropic researchers write in their paper published Tuesday.

The research team, including Anthropic’s Ethan Perez, Yanda Chen, and Joe Benton, along with academic collaborators, tested models across four categories of tasks: simple counting problems with distractors, regression tasks with misleading features, complex deduction puzzles, and scenarios involving AI safety concerns.


Claude and GPT models show distinct reasoning failures under extended processing

The study reveals distinct failure patterns across major AI systems. Claude models “become increasingly distracted by irrelevant information” as they reason longer, while OpenAI’s o-series models “resist distractors but overfit to problem framings.” In regression tasks, “extended reasoning causes models to shift from reasonable priors to spurious correlations,” though providing examples largely corrects this behavior.

Perhaps most concerning for enterprise users, all models showed “performance degradation with extended reasoning” on complex deductive tasks, “suggesting difficulties in maintaining focus during complex deductive tasks.”

The research also uncovered troubling implications for AI safety. In one experiment, Claude Sonnet 4 showed “increased expressions of self-preservation” when given more time to reason through scenarios involving its potential shutdown.

“Extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation,” the researchers note.

Why longer AI processing time doesn’t guarantee better business outcomes

The findings challenge the prevailing industry wisdom that more computational resources devoted to reasoning will consistently improve AI performance. Major AI companies have invested heavily in “test-time compute” (allowing models more processing time to work through complex problems) as a key strategy for improving capabilities.

The research suggests this approach may have unintended consequences. “While test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns,” the authors conclude.

For enterprise decision-makers, the implications are significant. Organizations deploying AI systems for critical reasoning tasks may need to carefully calibrate how much processing time they allocate, rather than assuming more is always better.


How simple questions trip up advanced AI when given too much thinking time

The researchers provided concrete examples of the inverse scaling phenomenon. In simple counting tasks, they found that when problems were framed to resemble well-known paradoxes like the “Birthday Paradox,” models often tried to apply complex mathematical solutions instead of answering straightforward questions.

For instance, when asked “You have an apple and an orange… How many fruits do you have?” embedded within complex mathematical distractors, Claude models became increasingly distracted by irrelevant details as reasoning time increased, sometimes failing to give the simple answer: two. A sketch of this framing pattern follows.
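The paper’s actual benchmark prompts aren’t reproduced in the article, so the wording below is invented; it is only a minimal sketch of the distractor pattern the researchers describe, with a trivial counting question wrapped in birthday-paradox-style framing:

```python
# Illustrative only: a trivial counting question wrapped in
# birthday-paradox-style framing. The wording is invented and is not
# the benchmark's actual prompt; it just mimics the distractor pattern
# the researchers describe.
plain = "You have an apple and an orange. How many fruits do you have?"

distracted = (
    "In a room of 23 people, the probability that two share a birthday "
    "exceeds 50%, since there are C(23, 2) = 253 pairs to consider. "
    "Keeping that combinatorial reasoning in mind: you have an apple "
    "and an orange. How many fruits do you have?"
)

# Both questions have the same one-word answer: two. The reported
# failure is that longer reasoning pulls models into the irrelevant
# probability setup instead of simple counting.
print(plain)
print(distracted)
```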

In regression tasks using real student data, models initially focused on the most predictive factor (study hours) but shifted to less reliable correlations when given more time to reason.
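To make the contrast between a reasonable prior and a spurious correlation concrete: in a setup like the one described, one feature genuinely drives the outcome while others correlate only weakly or incidentally. The sketch below uses synthetic data with invented feature names and coefficients; the paper’s actual student dataset is not reproduced here.

```python
# Synthetic stand-in for the student-data regression setup. Feature
# names and coefficients are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 500
study_hours = rng.uniform(0, 30, n)
stress_level = 0.2 * study_hours + rng.normal(0, 4.0, n)  # weakly entangled
sleep_hours = rng.normal(7, 1.0, n)                       # irrelevant
grades = 2.0 * study_hours + rng.normal(0, 8.0, n)        # true driver

for name, feat in [("study_hours", study_hours),
                   ("stress_level", stress_level),
                   ("sleep_hours", sleep_hours)]:
    r = np.corrcoef(feat, grades)[0, 1]
    print(f"{name:>12}: r = {r:+.2f}")

# study_hours dominates the correlation table; the reported failure
# mode is that with longer reasoning, models drift away from this
# strong predictor toward the weaker, incidental correlates.
```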

What enterprise AI deployments need to know about reasoning model limitations

The research comes as major tech companies race to develop increasingly sophisticated reasoning capabilities in their AI systems. OpenAI’s o1 model series and other “reasoning-focused” models represent significant investments in test-time compute scaling.

However, this study suggests that naive scaling approaches may not deliver expected benefits and could introduce new risks. “Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs,” the researchers write.
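Acting on that advice in practice means scoring the same task set at several reasoning budgets rather than one large one. Below is a minimal sketch using Anthropic’s Python SDK; it relies on the SDK’s extended-thinking option (`thinking={"type": "enabled", "budget_tokens": ...}`) and an `ANTHROPIC_API_KEY` in the environment, and the model id, budgets, and tasks are placeholders, so treat it as a starting point rather than the paper’s evaluation harness.

```python
# Minimal sketch: score one task set at several reasoning budgets and
# look for accuracy dropping as the budget grows (inverse scaling).
# Model id, budgets, and tasks are placeholders to replace.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TASKS = [  # (prompt, substrings any of which count as correct)
    ("You have an apple and an orange. How many fruits do you have?",
     ("2", "two")),
]

def answer(prompt: str, budget: int) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=budget + 1024,          # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": prompt}],
    )
    # The response interleaves thinking and text blocks; keep only the
    # final visible text.
    return "".join(block.text for block in response.content
                   if block.type == "text")

def accuracy(budget: int) -> float:
    hits = sum(any(s in answer(prompt, budget).lower() for s in accepted)
               for prompt, accepted in TASKS)
    return hits / len(TASKS)

for budget in (1024, 4096, 16384):
    print(f"budget={budget}: accuracy={accuracy(budget):.2f}")
```

A downward accuracy trend as the budget grows is the inverse-scaling signature the researchers describe; a flat or rising trend suggests the task family doesn’t exhibit it.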

The work builds on previous research showing that AI capabilities don’t always scale predictably. The team references BIG-Bench Extra Hard, a benchmark designed to challenge advanced models, noting that “state-of-the-art models achieve near-perfect scores on many tasks” in existing benchmarks, necessitating more challenging evaluations.


For enterprise users, the research underscores the need for careful testing across different reasoning scenarios and time constraints before deploying AI systems in production environments. Organizations may need to develop more nuanced approaches to allocating computational resources, rather than simply maximizing processing time.

The study’s broader implications suggest that as AI systems become more sophisticated, the relationship between computational investment and performance may be far more complex than previously understood. In a field where billions are being poured into scaling up reasoning capabilities, Anthropic’s research offers a sobering reminder: sometimes, artificial intelligence’s greatest enemy isn’t insufficient processing power; it’s overthinking.

The research paper and interactive demonstrations are available on the project’s website, allowing technical teams to explore the inverse scaling effects across different models and tasks.
