Welcome to iinfoai’s regular AI newsletter! We’re going on hiatus for a bit, but you can find all our AI coverage, including my columns, our daily analysis, and breaking news stories, at iinfoai. If you want these stories and much more in your inbox every day, sign up for our daily newsletters here.
This week, billionaire Elon Musk’s AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company’s Grok chatbot apps. Trained on around 200,000 GPUs, the model beats a number of other leading models, including some from OpenAI, on benchmarks for mathematics, programming, and more.
But what do these benchmarks really tell us?
Here at TC, we often reluctantly report benchmark figures because they’re one of the few (relatively) standardized ways the AI industry measures model improvements. Popular AI benchmarks tend to test for esoteric knowledge, and give aggregate scores that correlate poorly with proficiency on the tasks most people actually care about.
As Wharton professor Ethan Mollick pointed out in a series of posts on X after Grok 3’s unveiling Monday, there’s an “urgent need for better batteries of tests and independent testing authorities.” AI companies self-report benchmark results as a rule, as Mollick alluded to, making those results even harder to accept at face value.
“Public benchmarks are both ‘meh’ and saturated, leaving a lot of AI testing to be like food reviews, based on taste,” Mollick wrote. “If AI is critical to work, we need more.”
There’s no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merit is far from a settled matter within the industry. Some AI commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.
This debate may rage until the end of time. Perhaps we should instead, as X user Roon prescribes, simply pay less attention to new models and benchmarks barring major AI technical breakthroughs. For our collective sanity, that might not be the worst idea, even if it does induce some level of AI FOMO.
As mentioned above, This Week in AI is going on hiatus. Thanks for sticking with us, readers, through this roller coaster of a journey. Until next time.
News
OpenAI tries to “uncensor” ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace “intellectual freedom,” no matter how challenging or controversial a topic may be.
Mira’s new startup: Former OpenAI CTO Mira Murati’s new startup, Thinking Machines Lab, intends to build tools to “make AI work for [people’s] unique needs and goals.”
Grok 3 cometh: Elon Musk’s AI startup, xAI, has released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok apps for iOS and the web.
A very Llama conference: Meta will host its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta’s Llama family of generative AI models, the conference is scheduled for April 29.
AI and Europe’s digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between some 20 organizations to build “a series of foundation models for transparent AI in Europe” that preserves the “linguistic and cultural diversity” of all EU languages.
Research paper of the week
OpenAI researchers have created a new AI benchmark, SWE-Lancer, that aims to evaluate the coding prowess of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks that range from bug fixes and feature deployments to “manager-level” technical implementation proposals.
According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scores 40.3% on the full SWE-Lancer benchmark, suggesting that AI still has quite a ways to go. It’s worth noting that the researchers didn’t benchmark newer models like OpenAI’s o3-mini or Chinese AI company DeepSeek’s R1.
Model of the week
A Chinese AI company named Stepfun has released an “open” AI model, Step-Audio, that can understand and generate speech in several languages. Step-Audio supports Chinese, English, and Japanese, and lets users adjust the emotion and even dialect of the synthetic audio it creates, including singing.
Stepfun is one of a number of well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun reportedly closed a recent funding round worth several hundred million dollars from a group of investors that includes Chinese state-owned private equity firms.
Grab bag
Nous Research, an AI research group, has released what it claims is one of the first AI models that unifies reasoning and “intuitive language model capabilities.”
The model, DeepHermes-3 Preview, can toggle long “chains of thought” on and off, trading computational heft for improved accuracy. In “reasoning” mode, DeepHermes-3 Preview, like other reasoning AI models, “thinks” longer on harder problems and shows its thought process as it arrives at an answer.
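The toggle is prompt-level rather than a separate model: per Nous’s model card, reasoning mode is switched on by a special system prompt that tells the model to deliberate inside `<think>` tags before answering. A minimal sketch of how an application might use that pattern (the exact prompt text is elided; the message structure is the standard chat format, and the helper names here are illustrative, not from Nous):

```python
import re

# Per Nous's DeepHermes-3 model card, reasoning mode is enabled by a system
# prompt instructing the model to deliberate inside <think></think> tags.
# The full prompt text is elided here; see the model card.
REASONING_SYSTEM_PROMPT = "You are a deep thinking AI ..."  # elided

def build_messages(question: str, reasoning: bool) -> list[dict]:
    """Build a chat-format request; the system prompt toggles reasoning mode."""
    messages = []
    if reasoning:
        messages.append({"role": "system", "content": REASONING_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": question})
    return messages

def strip_thoughts(response: str) -> str:
    """Drop the <think>...</think> block, keeping only the final answer."""
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
```

With reasoning off, the request is an ordinary chat completion; with it on, the application can either display the `<think>` block as the model’s visible thought process or strip it out before showing the answer.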
Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said such a model is on its near-term roadmap.