What’s better than an AI chatbot that can carry out tasks for you when prompted? AI that can do tasks for you on its own.
AI agents are the newest frontier in the AI space. AI companies are racing to build their own models, and offerings are constantly rolling out to enterprises. But which AI agent is the best?
Galileo Leaderboard
On Wednesday, Galileo launched an Agent Leaderboard on Hugging Face, an open-source AI platform where users can build, train, access, and deploy AI models. The leaderboard is meant to help people learn how AI agents perform in real-world business applications and help teams determine which agent best fits their needs.
📊 Our Agent Leaderboard is live! We built a comprehensive benchmark of which LLMs work best for AI Agents 👀
After evaluating 17 leading LLMs across 14 diverse datasets, we’re excited to share our findings about which models actually excel at tool-calling, and are ready to… pic.twitter.com/Cgw2iWNSA7 (🔭 Galileo, @rungalileo, February 12, 2025)
On the leaderboard, you can find information about a model’s performance, including its rank and score. At a glance, you can also see more basic information about the model, including vendor, cost, and whether it’s open source or private.
The leaderboard currently features “the 17 leading LLMs,” including models from Google, OpenAI, Mistral, Anthropic, and Meta. It is updated monthly to keep up with the frequent pace of new releases.
How models are ranked
To determine the results, Galileo uses benchmarking datasets, including the BFCL (Berkeley Function Calling Leaderboard), τ-bench (Tau benchmark), xLAM, and ToolACE, which test different agent capabilities. The leaderboard then turns this data into an evaluation framework that covers real-world use cases.
“BFCL excels in academic domains like mathematics, entertainment, and education, τ-bench focuses on retail and airline scenarios, xLAM covers data generation across 21 domains, and ToolACE focuses on API interactions in 390 domains,” explains the company in a blog post.
Galileo adds that each model is stress-tested to measure everything from simple API calls to more advanced tasks such as multi-tool interactions. The company also shared its methodology, reassuring users that it uses a standardized method to evaluate all AI agents fairly. The post includes a more technical dive into the model ranking.
The rankings
Google’s Gemini-2.0 Flash is in first place, followed closely by OpenAI’s GPT-4o. Both of these models received what Galileo calls “Elite Tier Performance” status, which is given to models with a score of .9 or higher. Google and OpenAI dominated the leaderboard with their private models, taking the first six positions.
Google’s Gemini 2.0 performed consistently across all of the evaluation categories and balanced that performance with cost-effectiveness, according to the post, at a cost of $0.15/$0.6 per million tokens. Although GPT-4o was a close second, it has a much higher price point at $2.5/$10 per million tokens.
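To put that price gap in concrete terms, here is a rough sketch of the arithmetic. It assumes the two figures in each pair are the input and output prices per million tokens, which the article does not state outright, and the workload size is a made-up example rather than anything from Galileo’s post.

```python
# Prices quoted in the article, in USD per million tokens.
# Assumption: the first figure is the input price, the second the output price.
PRICES = {
    "gemini-2.0-flash": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}


def workload_cost(model, input_tokens, output_tokens):
    """Return the estimated USD cost of a workload for the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 10 million input tokens and 2 million output tokens.
INPUT_TOKENS = 10_000_000
OUTPUT_TOKENS = 2_000_000

for model in PRICES:
    cost = workload_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model}: ${cost:.2f}")
```

Under these assumptions, the example workload comes to about $2.70 on Gemini-2.0 Flash and about $45 on GPT-4o. Because both the input and output prices differ by the same factor, the gap stays at roughly 17 times regardless of how the tokens are split.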
In the “high-performance segment,” the category below the elite tier, Gemini-1.5-Flash came in third place, and Gemini-1.5-Pro in fourth. OpenAI’s reasoning models, o1 and o3-mini, followed in fifth and sixth place, respectively.
Mistral-small-2501 was the first open-source AI model to chart. Its score of .832 placed it in the “mid-tier capabilities” category. The evaluations found its strengths to be its strong long-context handling and tool selection capabilities.
How to access
To view the results, you can visit the Agent Leaderboard on Hugging Face. In addition to the standard leaderboard, you will be able to filter the leaderboard by whether the LLM is open source or private, and by category, which refers to the capability being tested (overall, long context, composite, etc.).