- Research suggests the most popular leaderboards for ranking AI agent systems are the Galileo AI Agent Leaderboard and the Berkeley Function-Calling Leaderboard (BFCL), both updated in early 2025.
- These leaderboards focus on real-world tasks and function-calling accuracy, respectively, and are hosted on trusted platforms (Hugging Face for Galileo, Berkeley's own site for BFCL).
- It seems likely that community engagement and academic backing make them widely used, though exact popularity metrics are hard to gauge.
AI agent systems, which can act autonomously in tasks like gaming or business automation, are ranked on several leaderboards. These rankings help developers and businesses choose the best models for their needs. The evidence leans toward two main leaderboards being popular: the Galileo AI Agent Leaderboard and the Berkeley Function-Calling Leaderboard (BFCL). Both are recent, with updates in 2025, and focus on different aspects of agent performance.
- Galileo AI Agent Leaderboard: This leaderboard evaluates AI agents using a metric called Tool Selection Quality (TSQ), testing them on real-world business scenarios. It is hosted on Hugging Face and updated monthly, making it accessible and current. It covers both proprietary and open-source models, an unexpected detail for lay readers, since it places commercial and community-driven AI side by side.
- Berkeley Function-Calling Leaderboard (BFCL): Focused on how well AI agents call functions or tools, this academic leaderboard was most recently updated in March 2025. It is part of Berkeley's research and ranks models on accuracy, cost, and latency, another unexpected detail, as it accounts for practical deployment factors.
Both are likely popular due to their platforms (Hugging Face for Galileo, academic backing for BFCL) and frequent updates, though exact popularity is inferred from mentions in news and blogs rather than hard metrics.
This note provides a detailed examination of leaderboards for ranking AI agent systems, focusing on popularity, evaluation criteria, and relevance as of April 1, 2025. AI agent systems, defined here as autonomous systems capable of tasks like function-calling, multi-turn interactions, and real-world business applications, are increasingly evaluated through standardized benchmarks. The analysis identifies key leaderboards, their methodologies, and their standing in the community, drawing from recent updates and platform engagement.
Research suggests two primary leaderboards stand out for ranking AI agent systems: the Galileo AI Agent Leaderboard and the Berkeley Function-Calling Leaderboard (BFCL). Additionally, Scale AI’s SEAL Leaderboards were considered, though they appear broader, focusing on large language models (LLMs) with some agent-related tasks.
- Galileo AI Agent Leaderboard: Launched in early 2025, this leaderboard is hosted on Hugging Face at https://huggingface.co/spaces/galileo-ai/agent-leaderboard. It evaluates 17 leading LLMs, both proprietary and open-source, across 14 diverse benchmarks, including BFCL, τ-bench, xLAM, and ToolACE. Updated monthly, it aims to reflect real-world agentic scenarios, making it highly relevant for industry applications. The dataset is open-sourced at https://huggingface.co/datasets/galileo-ai/agent-leaderboard, enhancing community access.
- Berkeley Function-Calling Leaderboard (BFCL): Also known as the Berkeley Tool Calling Leaderboard V3, this is an academic initiative updated as of March 22, 2025, accessible at https://gorilla.cs.berkeley.edu/leaderboard.html. It focuses on function-calling accuracy, a core capability for AI agents, using real-world data. It includes metrics like cost (USD per 1000 function calls) and latency (seconds); a simplified sketch of how such metrics can be computed appears after this list. Evaluation details are covered in blogs linked on the site, such as https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html, and code and data are available at https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard.
- Scale AI's SEAL Leaderboards: Mentioned in a May 2024 update at https://scale.com/leaderboards, these rank LLMs on domains like coding and multi-turn conversations, which include agent tasks. However, the focus is broader, with evaluations by Scale AI's Safety, Evaluations, and Alignment Lab, and it seems less specialized for agents compared to Galileo and BFCL.
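As referenced in the BFCL entry above, here is a minimal sketch of how BFCL-style aggregate metrics can be computed: overall accuracy as an unweighted average of sub-category accuracies, and cost expressed per 1,000 function calls from per-call token usage. This is an illustration under stated assumptions, not BFCL's actual evaluation harness; the category names, token counts, and prices are hypothetical (the $2.5/$10 per million token figures echo the GPT-4o pricing quoted later in this note).

```python
# Illustrative only: BFCL-style aggregate metrics, not the official BFCL harness.
# Category names, token counts, and prices below are placeholder assumptions.

def overall_accuracy(per_category_accuracy: dict[str, float]) -> float:
    """Unweighted average of sub-category accuracies, as BFCL reports overall accuracy."""
    return sum(per_category_accuracy.values()) / len(per_category_accuracy)

def cost_per_1000_calls(input_tokens_per_call: int, output_tokens_per_call: int,
                        usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Approximate USD cost of 1,000 function calls from per-call token usage."""
    per_call = (input_tokens_per_call * usd_per_m_input
                + output_tokens_per_call * usd_per_m_output) / 1_000_000
    return 1000 * per_call

if __name__ == "__main__":
    acc = {"simple": 0.92, "parallel": 0.88, "multi_turn": 0.74}  # hypothetical sub-category scores
    print(f"overall accuracy: {overall_accuracy(acc):.3f}")
    # e.g. 1,200 input / 150 output tokens per call at $2.5/$10 per million tokens
    print(f"cost per 1000 calls: ${cost_per_1000_calls(1200, 150, 2.5, 10.0):.2f}")
```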
Each leaderboard employs distinct evaluation criteria, reflecting different aspects of AI agent performance:
- Galileo AI Agent Leaderboard: Uses Tool Selection Quality (TSQ) as the primary metric, assessing tool selection accuracy and parameter usage. The evaluation process involves GPT-4o with ChainPoll for assessment, multiple judgments per interaction, and a final score computed as the proportion of positive assessments. Datasets are synthesized from BFCL (https://gorilla.cs.berkeley.edu/leaderboard.html), τ-bench (https://github.com/sierra-research/tau-bench), xLAM (https://www.salesforce.com/blog/xlam-large-action-models/), and ToolACE (https://arxiv.org/abs/2409.00920), covering single-turn, multi-turn, error handling, context management, and parameter management. Scoring is an equally weighted average across datasets, and example code for TSQ evaluation using the promptquality library is provided; a simplified sketch of this scoring logic appears after this list.
| Category | Details |
| --- | --- |
| Leaderboard | Evaluates AI agent performance using the Tool Selection Quality (TSQ) metric. Analyzes 17 leading LLMs against 14 diverse benchmarks, including private and open-source models. Updated monthly to sync with new model releases. URL: https://huggingface.co/spaces/galileo-ai/agent-leaderboard. Dataset open-sourced at https://huggingface.co/datasets/galileo-ai/agent-leaderboard. |
| Evaluation Criteria | Metric: Tool Selection Quality (TSQ), assessing tool selection accuracy and parameter usage. Evaluation process: GPT-4o with ChainPoll, multiple judgments, final score as the proportion of positive assessments. Datasets: synthesized from BFCL (https://gorilla.cs.berkeley.edu/leaderboard.html), τ-bench (https://github.com/sierra-research/tau-bench), xLAM (https://www.salesforce.com/blog/xlam-large-action-models/), and ToolACE (https://arxiv.org/abs/2409.00920), covering various agent tasks. Scoring: equally weighted average across all datasets. Capabilities tested: scenario recognition, tool selection, parameter handling, sequential decision making. Example code for TSQ evaluation is provided, using the promptquality library. |
| Popularity | No explicit metrics, but industry relevance is suggested by practical implications for AI engineers. Current state: proprietary models lead, open-source models are improving, and complex scenarios remain challenging. |
| Performance Tiers | Elite Tier (>=0.9): Gemini-2.0-flash (0.938, $0.15/$0.6 per million tokens), GPT-4o (0.900, $2.5/$10 per million tokens). High Performance (0.85-0.9): Gemini-1.5-flash (0.895), Gemini-1.5-pro (0.885, $1.25/$5 per million tokens), o1 (0.876, $15/$60 per million tokens), o3-mini (0.847, $1.1/$4.4 per million tokens). Mid Tier (0.8-0.85): GPT-4o-mini (0.832), mistral-small-2501 (0.832), Qwen-72b (0.817), Claude-sonnet (0.801). Base Tier (<0.8): Claude-haiku (0.765, $0.8/$4 per million tokens), Llama-70B (0.774), Mistral-small (0.750), Ministral-8b (0.689), Mistral-nemo (0.661). |

- Berkeley Function-Calling Leaderboard (BFCL): Evaluates models on function-calling accuracy using real-world data, with metrics including overall accuracy (an unweighted average of sub-categories), cost, and latency. It supports both native function calling and prompt-based text generation, with details in blogs like https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html. The leaderboard is interactive, with a demo on the site, and includes enterprise and open-source functions, as well as multi-turn interactions.
- Scale AI's SEAL Leaderboards: Based on private, curated datasets, it ranks LLMs on domains like generative AI coding, instruction following, math, and multilinguality, with some agent-relevant tasks like multi-turn conversations. It claims neutrality by not disclosing prompts, with methodology at https://scale.com/leaderboard/methodology.
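To make the Galileo scoring description concrete, the sketch below (referenced in the Galileo entry above) implements the TSQ-style aggregation in plain Python: each interaction is judged several times, the interaction score is the proportion of positive verdicts, and the final score is an equally weighted average across datasets. The actual leaderboard performs this with GPT-4o and ChainPoll via the promptquality library; the function names, the dummy judge, and the toy data here are illustrative assumptions, not Galileo's API.

```python
# Illustrative sketch of TSQ-style aggregation; not Galileo's promptquality implementation.
from statistics import mean
from typing import Callable

Judge = Callable[[dict], bool]  # returns True if the tool call is judged correct

def interaction_score(interaction: dict, judge: Judge, n_judgments: int = 5) -> float:
    """Proportion of positive judgments across repeated judge calls (ChainPoll-style)."""
    return sum(judge(interaction) for _ in range(n_judgments)) / n_judgments

def dataset_score(interactions: list[dict], judge: Judge) -> float:
    """Mean interaction score over one benchmark dataset (e.g., synthesized from BFCL or τ-bench)."""
    return mean(interaction_score(x, judge) for x in interactions)

def tsq_score(datasets: dict[str, list[dict]], judge: Judge) -> float:
    """Equally weighted average of dataset scores, as the Galileo leaderboard describes."""
    return mean(dataset_score(xs, judge) for xs in datasets.values())

if __name__ == "__main__":
    # Dummy judge that checks the model picked the expected tool; the real judge is an LLM (GPT-4o).
    dummy_judge: Judge = lambda x: x["selected_tool"] == x["expected_tool"]
    data = {
        "bfcl_synth": [{"selected_tool": "search", "expected_tool": "search"}],
        "tau_bench_synth": [{"selected_tool": "book", "expected_tool": "cancel"}],
    }
    print(f"TSQ-style score: {tsq_score(data, dummy_judge):.3f}")  # 0.500 for this toy data
```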
Determining popularity is challenging without explicit metrics, but both Galileo and BFCL show signs of widespread use. Galileo's presence on Hugging Face, a platform with significant AI community engagement, and mentions in news articles (e.g., ZDNET at https://www.zdnet.com/article/which-ai-agent-is-the-best-this-new-leaderboard-can-tell-you/) suggest industry relevance. BFCL, backed by Berkeley, was updated as recently as March 2025 and is linked to academic research, indicating academic popularity. Attempts to gauge X posts for mentions did not yield results, possibly due to the specificity of the queries, but blog posts and platform activity imply strong community involvement.
Galileo’s leaderboard lists top performers, such as Gemini-2.0-flash (0.938 TSQ) and GPT-4o (0.900 TSQ), with pricing details, suggesting practical utility for engineers. BFCL’s inclusion of cost and latency metrics further enhances its appeal for deployment considerations. Scale AI’s leaderboard, while relevant, seems less focused on agents, with updates from May 2024, making it less current for 2025 trends.
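As an illustration of that practical utility, one can fold the quoted TSQ scores and output-token prices into a rough price-performance comparison. The snippet below is an ad hoc calculation using figures quoted in this note, not a metric that either leaderboard publishes.

```python
# Rough price-performance view from figures quoted in this note: (TSQ score, $ per million output tokens).
models = {
    "Gemini-2.0-flash": (0.938, 0.6),
    "GPT-4o": (0.900, 10.0),
    "Claude-haiku": (0.765, 4.0),
}
# Rank by TSQ per dollar of output tokens (higher is better value for this crude measure).
for name, (tsq, out_price) in sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True):
    print(f"{name:>18}: TSQ={tsq:.3f}, ${out_price}/M output tokens, TSQ per $ = {tsq / out_price:.3f}")
```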
As of April 1, 2025, the Galileo AI Agent Leaderboard and Berkeley Function-Calling Leaderboard (BFCL) are likely the most popular for ranking AI agent systems, given their focus, recent updates, and platform engagement. Galileo excels in real-world business scenarios, while BFCL is strong in academic and function-calling contexts. Scale AI’s SEAL Leaderboards offer broader LLM rankings with some agent relevance but are less specialized. These findings are based on platform activity, news mentions, and academic backing, though exact popularity metrics remain inferred.
- Galileo AI Agent Leaderboard on Hugging Face
- Berkeley Function-Calling Leaderboard V3
- Galileo AI Blog on Agent Leaderboard
- Scale AI SEAL Leaderboards
- ZDNET Article on Galileo AI Agent Leaderboard
- GitHub Repository for Galileo Agent Leaderboard
- Hugging Face Space for Agent Leaderboard
- AI Agents Directory Leaderboard 2025
- Vellum AI LLM Leaderboard 2025
- LLM Stats Comprehensive AI Leaderboard
- ToolACE Research Paper on arXiv
- BFCL Blog on Multi-Turn Interactions
- τ-bench GitHub Repository
- xLAM Salesforce Blog Post