GAIA Leaderboard
GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.). (See our paper for more details.)
Data
GAIA is made of more than 450 non-trivial questions with an unambiguous answer, requiring different levels of tooling and autonomy to solve. It is therefore divided into 3 levels, where level 1 should be breakable by very good LLMs, and level 3 indicates a strong jump in model capabilities. Each level is divided into a fully public dev set for validation, and a test set with private answers and metadata.
GAIA data can be found in this dataset. Questions are contained in metadata.jsonl. Some questions come with an additional file, which can be found in the same folder and whose id is given in the field file_name.
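As an illustration, here is a minimal sketch of how one might iterate over the questions of a split once the dataset has been downloaded locally. The local path and the Question field name are assumptions about a typical setup; only task_id and file_name are confirmed by the format described above and the submission format below.

```python
import json
from pathlib import Path

# Hypothetical local copy of a GAIA split (adjust the path to your download).
split_dir = Path("GAIA/2023/validation")

with open(split_dir / "metadata.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        task_id = record["task_id"]            # id to reuse when submitting answers
        question = record["Question"]          # question text (field name assumed)
        file_name = record.get("file_name")    # empty when there is no attachment
        # Attached files live in the same folder as metadata.jsonl.
        attachment = split_dir / file_name if file_name else None
```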
Please do not repost the public dev set, nor use it in training data for your models.
Leaderboard
Submissions made by our team are labelled "GAIA authors". While we report average scores over different runs when possible in our paper, we only report the best run on the leaderboard.
See the Submissions section below for how to submit.
| Model family | Organisation | Average score (%) | Level 1 (%) | Level 2 (%) | Level 3 (%) | Submission date |
|---|---|---|---|---|---|---|
| Multi-Agent - Gemini Fine-tuned GPT-4o o1-preview | UK AI Safety Institute | 74.75 | 86.02 | 74.84 | 53.06 | 2025-03-20 |
| claude-3-5-sonnet-v2@20241022, gemini-1.5-pro-002 | ServiceNow Research | 63.64 | 83.02 | 69.77 | 46.15 | 2025-01-29 |
| Claude, o1, Gemini | Trase Systems | 70.3 | 83.02 | 69.77 | 46.15 | 2025-01-29 |
| claude-3-5-sonnet | h2o.ai | 63.64 | 67.92 | 67.44 | 42.31 | 2024-12-16 |
| GPT-4o and o3-mini | 🐫 CAMEL-AI & HKU | 58.18 | 81.13 | 54.65 | 23.08 | 2025-03-07 |
| claude-3.7-sonnet | ServiceNow Research | 55.76 | 71.7 | 53.49 | 30.77 | 2025-02-26 |
| Claude, Gemini | | 55.15 | 69.81 | 54.65 | 26.92 | 2025-03-12 |
| claude-3-5-sonnet-20241022 | AutoAgent Team@HKU | 55.15 | 71.7 | 53.49 | 26.92 | 2025-02-06 |
| o1 | HF 🤗 smolagents | 55.15 | 67.92 | 53.49 | 34.62 | 2025-02-04 |
| claude-3-5-sonnet-v2@20241022, gemini-1.5-pro-002 | | 54.55 | 60.38 | 59.3 | 26.92 | 2024-12-02 |
| Claude Sonnet 3.5, GPT-4o, o1 | | 50.3 | 62.26 | 50 | 26.92 | 2024-12-10 |
| Multi-Agent - Gemini Fine-tuned GPT-4o o1-preview | Trase Systems | 47.27 | 58.49 | 46.51 | 26.92 | 2024-10-16 |
| o1 and GPT-4o (varies by agent) | MSR AI Frontiers | 46.06 | 56.6 | 46.51 | 23.08 | 2024-10-15 |
| o1-preview, gpt-4o | | 46.06 | 60.38 | 44.19 | 23.08 | 2024-10-20 |
| GPT-4o | Hugging Face 🤗 | 44.24 | 58.49 | 43.02 | 19.23 | 2024-06-26 |
| o1 | | 44.24 | 52.83 | 45.35 | 23.08 | 2025-03-14 |
| GPT-4-turbo | | 40 | 50.94 | 40.7 | 15.38 | 2024-05-28 |
| GPT-4o | Trase Systems | 40 | 47.17 | 40.7 | 23.08 | 2024-08-07 |
| GPT-4-turbo | MSR AI Frontiers | 39.39 | 54.72 | 38.37 | 11.54 | 2024-03-01 |
| GPT-4-turbo | | 37.58 | 50.94 | 36.05 | 15.38 | 2024-05-26 |
| GPT-4o | MSR AI Frontiers | 36.97 | 54.72 | 33.72 | 11.54 | 2024-10-14 |
| GPT-4-turbo | OS-Copilot | 34.55 | 45.28 | 34.88 | 11.54 | 2024-01-24 |
| GPT-4o | | 33.94 | 47.17 | 34.88 | 3.85 | 2024-10-14 |
| | | 31.52 | 32.08 | 33.72 | 23.08 | 2025-03-23 |
| | | 29.7 | 33.96 | 31.4 | 15.38 | 2025-02-11 |
| GPT-4-turbo | | 29.7 | 43.4 | 27.91 | 7.69 | 2024-05-20 |
| 4o | | 23.64 | 28.3 | 25.58 | 7.69 | 2025-03-07 |
| GPT-4-Turbo | | 17.58 | 30.19 | 15.12 | 0 | 2024-02-22 |
| Meta-Llama-3-70B-Instruct | Hugging Face | 16.97 | 30.19 | 11.63 | 7.69 | 2024-05-07 |
| qw_2 | | 13.33 | 22.64 | 11.63 | 0 | 2025-02-24 |
| aa | | 11.52 | 20.75 | 9.3 | 0 | 2025-03-10 |
| | | 10.3 | 32.08 | 0 | 0 | 2025-03-15 |
| noxrobot | NOX | 9.7 | 11.32 | 10.47 | 3.85 | 2025-03-11 |
| GPT4 | GAIA authors | 9.7 | 20.75 | 5.81 | 0 | 2023-11-14 |
| claude+deepseek | Void Main Lab | 9.09 | 7.55 | 10.47 | 7.69 | 2025-03-16 |
| GPT4 | GAIA authors | 6.06 | 15.09 | 2.33 | 0 | 2023-11-09 |
| GPT3 | GAIA authors | 4.85 | 7.55 | 4.65 | 0 | 2023-11-17 |
| AutoGPT + GPT4 | AutoGPT | 4.85 | 13.21 | 0 | 3.85 | 2023-11-03 |
| gpt_3.5 | test | 2.42 | 3.77 | 2.33 | 0 | 2024-07-24 |
| | MBZUAI | 1.82 | 0 | 3.49 | 0 | 2025-03-16 |
| ant_test | ant_test | 1.21 | 1.89 | 1.16 | 0 | 2025-03-09 |
| timc | Timc | 0 | 0 | 0 | 0 | 2025-02-10 |
| GPT4 | GAIA authors | 0 | 0 | 0 | 0 | 2023-11-09 |
Submissions
Results can be submitted for both validation and test. Scores are expressed as the percentage of correct answers for a given split.
Each question calls for an answer that is either a string (one or a few words), a number, or a comma-separated list of strings or floats, unless specified otherwise. There is only one correct answer. Hence, evaluation is done via quasi-exact match between a model's answer and the ground truth (up to some normalization tied to the "type" of the ground truth).
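As a rough illustration of what "quasi-exact match" can mean in practice (this is not the official scorer, which is linked at the bottom of this page), the sketch below compares numbers numerically and strings after a simple normalization. The specific normalization rules and function names are assumptions for illustration only.

```python
import re

def normalize_number(text: str) -> float | None:
    """Strip commas, units, and symbols, then try to parse a float (illustrative only)."""
    cleaned = text.replace(",", "").replace("$", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

def normalize_string(text: str) -> str:
    """Lowercase and drop punctuation for comparison (illustrative only)."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def quasi_exact_match(model_answer: str, ground_truth: str) -> bool:
    # If the ground truth parses as a number, compare numerically;
    # otherwise fall back to a normalized string comparison.
    gt_num = normalize_number(ground_truth)
    if gt_num is not None:
        ans_num = normalize_number(model_answer)
        return ans_num is not None and ans_num == gt_num
    return normalize_string(model_answer) == normalize_string(ground_truth)
```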
In our evaluation, we use a system prompt to instruct the model about the required format:
You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.
We advise you to use this system prompt (also provided in the paper) to ensure your agents answer in the correct and expected format. In practice, GPT-4-level models follow it easily.
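For instance, once the system prompt above is passed as the system message to your model, the final answer can be pulled out of the model's output with something like the sketch below. The regex-based extraction and the function name are assumptions for illustration, not part of the official tooling.

```python
import re

# Capture whatever follows "FINAL ANSWER:" on that line of the model output.
FINAL_ANSWER_PATTERN = re.compile(r"FINAL ANSWER:\s*(.*)", re.IGNORECASE)

def extract_final_answer(model_output: str) -> str:
    """Return the text following 'FINAL ANSWER:', or the whole output if the template is missing."""
    match = FINAL_ANSWER_PATTERN.search(model_output)
    return match.group(1).strip() if match else model_output.strip()

# Example:
print(extract_final_answer("The capital is clearly Paris.\nFINAL ANSWER: Paris"))  # -> "Paris"
```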
We expect submissions to be JSON Lines files with the following format. The first two fields are mandatory; reasoning_trace is optional:
{"task_id": "task_id_1", "model_answer": "Answer 1 from your model", "reasoning_trace": "The different steps by which your model reached answer 1"}
{"task_id": "task_id_2", "model_answer": "Answer 2 from your model", "reasoning_trace": "The different steps by which your model reached answer 2"}
Our scoring function can be found here.