Gaia2 and ARE: Automating AI Agent Evaluation

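The Problem: How Do You Judge an Agent?
Enter Gaia2: A Benchmark Built for (and by) the Community
The Secret Sauce: What Is the Agent Review Engine (ARE)?
From Gaia1 to Gaia2: What Changed?
Who’s Behind It? The Open Collaboration
Real-World Rules: What Tasks Are In – and What’s Out
Costs, Limitations, and the Road Ahead
How You Can Join the Next Wave
Frequently Asked Questions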

The Problem: How Do You Judge an Agent?

Imagine you’re a teacher. A student hands in a math problem. They solved it in a way you’ve never seen before. But the answer is correct. Do you give them full credit?

Now imagine you have to do this for hundreds of students, each solving problems in wildly different ways. That’s the daily reality for researchers trying to evaluate AI agents. An AI agent is a program that can browse the web, fill out forms, book flights, or search databases on your behalf. Unlike a simple chatbot that just answers questions, an agent takes actions. It clicks buttons, types text, and navigates through websites.

The problem is that there’s no single “right way” to complete a task. An agent asked to find the cheapest flight from New York to London might search Kayak, Google Flights, or even a specific airline’s site. All three are valid. But how do you check whether the agent actually succeeded? In the past, researchers had to watch every single step manually. They’d look at screenshots, read logs, and decide by hand whether the agent did what it was supposed to do.

That manual process is slow, expensive, and hard to repeat. If another lab wants to verify the results, they have to run the whole thing again and re-check everything. It’s like asking every school in the country to grade the same test by hand, with no answer key.

This is exactly the problem that Gaia2 and the Agent Review Engine (ARE) set out to solve. They’re open-source tools that automate the grading process. Instead of a human watching every agent run, a vision-language model reviews the recorded screenshots and actions and decides whether the task was completed correctly. It’s like giving every teacher a detailed rubric that can be applied automatically.

Enter Gaia2: A Benchmark Built for (and by) the Community

Gaia2 is the second version of a popular benchmark for testing AI agents. The first version, Gaia1, launched in 2023. It quickly became a standard. Top labs used it to measure how well their agents could handle real-world web tasks. But Gaia1 had a big weakness: it relied on human judges to verify results. That didn’t scale.

Gaia2 changes that. It’s a dataset of over 1,000 tasks, all designed to be solved by web agents. The tasks are real-world scenarios: finding information on a website, comparing prices, filling out a form, booking a reservation. They’re not toy problems. They’re the kind of things you might ask a human assistant to do online.

What makes Gaia2 special is that it’s built by the community, for the community. It’s not a corporate product locked behind a paywall. It’s hosted on the Hugging Face Hub, a popular platform for sharing AI models and datasets. Anyone can download the tasks, run their own agents, and submit results. The whole process is transparent and reproducible.

Hugging Face, the company behind the platform, designed Gaia2 to be a shared infrastructure. Researchers from different labs can compare their agents on the same set of tasks, using the same evaluation method. That’s a huge step forward. Before Gaia2, every lab had its own private tests and its own way of scoring. It was impossible to know which agent was truly better.

The benchmark focuses on the “open web” – sites that anyone can access without logging in. That means no tasks that require a Facebook or Twitter account. No tasks that use custom test domains built just for the benchmark. The goal is to test agents on the messy, unpredictable internet that real users deal with.

The Secret Sauce: What Is the Agent Review Engine (ARE)?

The Agent Review Engine, or ARE, is the heart of Gaia2. It’s the automated grader that replaces the human judge.

Here’s how it works. When an agent runs a task, ARE records every step. It takes screenshots of the browser at each action. It logs every click, every keystroke, every navigation. Then, after the agent finishes, ARE replays the entire run and checks whether the task was completed successfully.
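To make the recording step concrete, here is a minimal sketch of what a per-run trace could look like. The `Step` and `RunTrace` classes, their field names, and the screenshot naming scheme are all assumptions made for illustration. They are not ARE's actual log format.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Step:
    """One recorded agent action (hypothetical schema, not ARE's real format)."""
    index: int
    action: str                            # e.g. "click", "type", "navigate"
    target: str                            # element selector or URL acted on
    text: Optional[str] = None             # typed text, if any
    screenshot_path: Optional[str] = None  # where this step's screenshot is stored


@dataclass
class RunTrace:
    """Ordered record of everything an agent did on one task."""
    task_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, action: str, target: str, text: Optional[str] = None) -> Step:
        # Number steps in order and derive a screenshot file name from the index.
        index = len(self.steps)
        step = Step(index, action, target, text, f"{self.task_id}/step_{index:03d}.png")
        self.steps.append(step)
        return step

    def replay(self) -> Iterator[Step]:
        # Yield the steps in the order they were recorded, for later review.
        yield from self.steps


trace = RunTrace(task_id="demo-task")
trace.record("navigate", "https://example.com/flights")
trace.record("type", "#destination", text="London")
trace.record("click", "#search")
```

Keeping the steps in an ordered list is what makes a later replay possible: a reviewer, human or automated, can walk through the run in exactly the order it happened.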

But here’s the clever part. It doesn’t just check whether the final output matches a string. That would be too simple. Many tasks can be solved in different ways. The final answer might be the same, but the path to get there could be completely different. So ARE uses a Vision-Language Model (VLM) to reason about the task.

A VLM is a type of AI that can understand both images and text. Think of it as a teacher who can look at a screenshot of the student’s work and understand what’s happening. The VLM doesn’t just check for exact matches. It looks at the context. It asks: did the agent actually find the right information? Did it fill out the form correctly? Did it complete all the required steps?

This is a big improvement over earlier systems that used simple string matching. Those systems could be tricked easily. An agent might type the right answer but in the wrong place, or it might get the right result by accident. ARE’s VLM can catch those mistakes. It provides detailed error logs and screenshots, so researchers can see exactly where the agent went wrong.
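A toy example shows why plain string matching falls short. The fare, the answer strings, and the `exact_match` helper below are made up for illustration and do not come from Gaia2. The sketch only demonstrates the weakness described above, not how ARE itself grades.

```python
def exact_match(expected: str, actual: str) -> bool:
    """Naive grader: pass only when the two strings are identical."""
    return expected == actual


expected = "$412"

# Three ways an agent might report the same (made-up) fare.
answers = ["$412", "412 USD", "The cheapest fare is $412."]
verdicts = [exact_match(expected, answer) for answer in answers]

# The grader sees only text, so it cannot tell which form field received it.
typed_into_wrong_field = {"field": "promo_code", "text": "$412"}
false_pass = exact_match(expected, typed_into_wrong_field["text"])
```

Here only the first answer passes even though all three state the same price, while the value typed into the wrong field still passes. Both failure modes come from ignoring context, which is the gap a screenshot-aware grader is meant to close.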

Hugging Face compares ARE to a teacher grading with a rubric. The rubric defines what success looks like for each task. The VLM applies that rubric automatically. It’s not perfect – for the most creative or open-ended tasks, human judgment may still be needed. But for the vast majority of tasks, ARE works well.
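The rubric idea can be sketched as a list of yes/no criteria handed to a judge. Everything here is an assumption for illustration: the `Criterion` class, the `grade` function, and especially `stub_judge`, which is a simple keyword lookup standing in for a real VLM call. ARE's actual rubric format and model interface may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    name: str
    question: str  # what the judge is asked about the run


# A judge takes (question, evidence) and answers True or False.
Judge = Callable[[str, Dict[str, str]], bool]


def grade(rubric: List[Criterion], evidence: Dict[str, str], judge: Judge) -> Dict[str, bool]:
    """Ask the judge every rubric question; the run succeeds only if all pass."""
    verdicts = {c.name: judge(c.question, evidence) for c in rubric}
    verdicts["success"] = all(verdicts.values())
    return verdicts


def stub_judge(question: str, evidence: Dict[str, str]) -> bool:
    # Stand-in for a VLM call: a keyword lookup over the logged evidence.
    if "destination" in question:
        return evidence.get("destination_field") == "London"
    if "results" in question:
        return evidence.get("final_page") == "search_results"
    return False


rubric = [
    Criterion("destination", "Did the agent enter the correct destination?"),
    Criterion("results", "Did the agent reach the search results page?"),
]

good_run = {"destination_field": "London", "final_page": "search_results"}
bad_run = {"destination_field": "Paris", "final_page": "search_results"}

good_report = grade(rubric, good_run, stub_judge)
bad_report = grade(rubric, bad_run, stub_judge)
```

Passing the judge in as a function keeps the rubric logic separate from whichever model answers the questions, so the same rubric could be applied by a VLM or, for the hard cases, by a human.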

From Gaia1 to Gaia2: What Changed?

Gaia1 was a great start, but it had limitations. The biggest one was the manual evaluation bottleneck. Researchers had to watch agent runs and decide success or failure by hand. That took time and money. It also introduced inconsistency. Two different judges might grade the same run differently.

Gaia2 fixes that with ARE. The VLM-based grading is consistent across all runs. It doesn’t get tired. It doesn’t have bad days. It applies the same standard every time.

Another change is the scale. Gaia1 had fewer tasks. Gaia2 has over 1,000, covering a wider range of scenarios. That makes the benchmark more robust. A score measured over 1,000 tasks is a more reliable estimate of an agent’s ability than one measured over 100, because a handful of lucky or unlucky runs moves the total much less.
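The statistical intuition can be checked with a quick calculation. This sketch uses the standard normal approximation for a pass rate and assumes that tasks are independent and equally weighted, which is a simplification. The 60% pass rate is a made-up figure, not a Gaia2 result.

```python
import math


def success_rate_margin(successes: int, tasks: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate (normal approximation)."""
    p = successes / tasks
    return z * math.sqrt(p * (1 - p) / tasks)


# Same hypothetical 60% pass rate, measured on 100 tasks and on 1,000 tasks.
margin_100 = success_rate_margin(60, 100)
margin_1000 = success_rate_margin(600, 1000)
```

Under these assumptions the margin shrinks from roughly 9.6 percentage points on 100 tasks to roughly 3.0 on 1,000, so small differences between agents become easier to trust.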

Gaia2 also emphasizes reproducibility. Because ARE records every step and provides detailed logs, any other researcher can replay the exact run. They can see exactly what the agent did. This is crucial for scientific integrity. If a result seems too good, someone can check the logs and verify it.
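One simple way to check that a shared log is the same run that was originally graded is to compare fingerprints. The sketch below assumes a log is a list of plain dictionaries. That format and the `log_digest` helper are hypothetical, not something ARE is known to provide.

```python
import hashlib
import json


def log_digest(steps: list) -> str:
    """Stable SHA-256 fingerprint of a run log, independent of dict key order."""
    canonical = json.dumps(steps, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


original = [
    {"action": "navigate", "target": "https://example.com"},
    {"action": "click", "target": "#search"},
]

# A copy that went through a save/load round trip should match exactly.
shared = json.loads(json.dumps(original))

# A log with one altered step should not.
tampered = [dict(step) for step in original]
tampered[1]["target"] = "#buy"

same = log_digest(shared) == log_digest(original)
changed = log_digest(tampered) != log_digest(original)
```

Serializing with sorted keys before hashing means two logs with the same content always produce the same digest, regardless of how each tool happened to order the fields.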

The community involvement is deeper too. Gaia1 was mostly driven by a small group. Gaia2 is a large-scale collaboration with contributions from many organizations. More on that in a moment.

Finally, Gaia2 is designed to be extensible. The dataset can grow as the community adds new tasks. The platform can support new agents as they are developed. It’s not a one-time release. It’s a living benchmark that evolves with the field.

Who’s Behind It? The Open Collaboration

Gaia2 is not the product of a single company. It’s a community effort. Hugging Face led the development, but they worked with a long list of partners.

The list includes Meta, OpenAI, AWS, Google, and Kaggle. It includes top universities: the University of Washington, MIT, UIUC. It includes Apple, NVIDIA, and Salesforce. That’s a who’s who of AI research.

Why did so many organizations join? Because they all face the same problem. They need a reliable way to test their agents. Building your own benchmark is expensive and time-consuming. A shared benchmark is much better. Everyone can contribute tasks, run their agents, and compare results.

Hugging Face emphasizes that this is not a corporate product. It’s a shared infrastructure for the whole community. The dataset, the code, the evaluation engine – all open source. Anyone can use it, modify it, or contribute to it.

This open collaboration is important for fairness. If one company controlled the benchmark, they might design it to favor their own agents. With many contributors, the tasks are more diverse and the evaluation is more neutral.

The collaboration also speeds up progress. When researchers from different labs share their tasks and results, everyone learns faster. The best ideas spread quickly.

Real-World Rules: What Tasks Are In – and What’s Out

Gaia2 tasks are designed to be realistic. They involve the open web – sites that anyone can access without a login. That means no tasks that require a social media account, a subscription, or a corporate login. The goal is to test agents on the internet as most people experience it.

Tasks include things like: find the cheapest price for a specific product on a comparison shopping site. Look up a fact on Wikipedia and fill in a form. Navigate through a government website to find a form. Compare hotel prices on a travel site. These are the kinds of tasks that a human assistant might do.

What’s excluded? Login-gated services. Tasks that require a specific account on a platform like Facebook, Twitter, or LinkedIn. Those are too variable and raise privacy concerns. Also excluded are custom test domains built just for the benchmark. Some older benchmarks used fake websites that were designed to test specific skills. Gaia2 prefers real websites, because that’s where agents will actually be used.
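The inclusion rules above amount to a small filter. In this sketch the task fields (`id`, `url`, `requires_login`) and the list of benchmark-only domains are invented for illustration. Gaia2's real task schema and review process are not described in this article.

```python
from urllib.parse import urlparse

# Hypothetical examples of benchmark-only domains; not taken from Gaia2.
CUSTOM_TEST_DOMAINS = {"shop.benchmark.test", "fake-airline.example"}


def is_open_web_task(task: dict) -> bool:
    """Keep a task only if it needs no login and targets a real public site."""
    if task.get("requires_login", False):
        return False
    host = urlparse(task["url"]).hostname or ""
    if host in CUSTOM_TEST_DOMAINS:
        return False
    return True


tasks = [
    {"id": "wiki-fact", "url": "https://en.wikipedia.org/wiki/London", "requires_login": False},
    {"id": "social-post", "url": "https://www.linkedin.com/feed/", "requires_login": True},
    {"id": "fake-shop", "url": "https://shop.benchmark.test/cart", "requires_login": False},
]

eligible = [task["id"] for task in tasks if is_open_web_task(task)]
```

With these made-up tasks, only the Wikipedia lookup survives: the social media task is dropped for needing an account, and the fake shop is dropped for living on a test domain.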

This focus on the open web makes Gaia2 more challenging. Real websites change all the time. A site might update its layout, move a button, or change a form. Agents have to adapt. They can’t just memorize a fixed path.

It also makes the benchmark fairer. If a task uses a real website, anyone can go and see it. There’s no hidden advantage for labs that have access to special test domains.

Costs, Limitations, and the Road Ahead

Gaia2 and ARE are not perfect. They have real limitations.

First, cost. Each agent run costs roughly $1 in compute. That’s because the VLM-based evaluation requires processing screenshots and logs. If you want to test 100 agents on 1,000 tasks each, that’s 100,000 runs, or about $100,000. That’s not cheap, but it’s much cheaper than paying human judges to do the same work.
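The arithmetic is simple enough to write down. This sketch takes the article's rough figure of $1 per run at face value and assumes every agent attempts every task exactly once. The `evaluation_cost` function is just an illustration.

```python
def evaluation_cost(agents: int, tasks_per_agent: int, cost_per_run_usd: float = 1.0) -> float:
    """Total compute cost when every agent runs every task once."""
    return agents * tasks_per_agent * cost_per_run_usd


full_comparison = evaluation_cost(100, 1000)  # 100 agents on 1,000 tasks each
single_agent = evaluation_cost(1, 1000)       # one agent on the same 1,000 tasks
```

At that rate, a single agent's pass over 1,000 tasks comes to about $1,000, and the 100-agent comparison comes to about $100,000. Repeating runs to reduce noise would multiply the bill accordingly.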

Second, task creativity. For the most open-ended tasks, where there are many valid approaches, the VLM may struggle. Hugging Face acknowledges that for some tasks, human judgment may still be needed. ARE works best when the task has a clear definition of success.

Third, the benchmark currently includes only a few baseline agents. As of the announcement, Gaia2 supports GPT-4o, Gemini 2.0 Flash, and Claude Sonnet 4. These are powerful models, but the community needs more. Hugging Face plans to add more baseline agents in Part 2, with contributions from the community.

Fourth, the benchmark is limited to the open web. That means agents that specialize in tasks behind logins (like managing a social media account) can’t be tested here. But that’s a deliberate choice, not a flaw. The open web is broad enough to test general capabilities.

Looking ahead, the roadmap is clear: more agents, more tasks, and better VLM evaluation. Hugging Face is inviting the community to contribute. If you have an agent that you think performs well, you can submit it to be added as a baseline. If you have a great task idea, you can contribute it to the dataset.

The vision is a transparent, reproducible, and accessible ecosystem for agent research. Instead of each lab working in isolation, they can all share the same testing ground. That should speed up progress and make results more trustworthy.

How You Can Join the Next Wave

Gaia2 is open to everyone. You don’t need to work at a big lab or a famous university. If you’re a developer or a researcher interested in web agents, you can get involved.

First, you can run your own agent on Gaia2. The dataset is on the Hugging Face Hub. You can download the tasks, set up your agent, and submit your results. ARE will evaluate your agent automatically. You’ll get detailed logs and screenshots showing how your agent performed.

Second, you can contribute new tasks. If you have a real-world web task that you think is challenging, you can submit it to the Gaia2 dataset. The community will review it and add it if it fits the criteria. This helps the benchmark grow and stay relevant.

Third, you can submit your agent to be included as a baseline. Hugging Face plans to add more agents in Part 2. If your agent performs well, it could become one of the standard models that everyone compares against.

Fourth, you can help improve the Agent Review Engine. The code is open source. If you have ideas for making the VLM evaluation more accurate, or if you find bugs, you can contribute fixes.

Finally, you can simply use Gaia2 to study agents. The detailed logs and screenshots from ARE are a goldmine for research. You can analyze why agents succeed or fail. You can compare different strategies. You can learn what works and what doesn’t.

This is a community effort, and it’s only as strong as the people who participate. Whether you’re a seasoned researcher or a curious developer, there’s a place for you. The tools are free. The data is open. The invitation is there.

Gaia2 and ARE represent a shift in how we study AI agents. Instead of closed, proprietary benchmarks, we now have an open, automated, community-driven system. It’s a better way to grade agents, and it’s a better way to advance the field. The next wave of agent research is starting now. You can be part of it.

Frequently Asked Questions

What is the main problem with evaluating AI agents?

Evaluating AI agents is difficult because there isn't one single correct way to complete a task. Researchers previously had to manually watch every step an agent took, which was slow, expensive, and hard to repeat.

What are Gaia2 and the Agent Review Engine (ARE)?

Gaia2 and ARE are open-source tools designed to automatically grade AI agents. ARE acts as an automated grader, analyzing an agent's actions and screenshots to determine if a task was completed correctly.

How does the Agent Review Engine (ARE) work?

When an AI agent performs a task, ARE records all its actions and takes screenshots. It then uses a Vision-Language Model (VLM) to review these recordings and assess if the task was successfully completed, considering the context of the actions.

What is a Vision-Language Model (VLM) and why is it important for ARE?

A VLM is an AI that can understand both images and text. ARE uses a VLM to act like a teacher with a rubric, allowing it to understand the context of an agent's actions in screenshots and catch mistakes that simple text matching would miss.

What makes Gaia2 different from Gaia1?

Gaia2 significantly improves upon Gaia1 by integrating the Agent Review Engine (ARE) for automated grading, eliminating the need for manual evaluation. Gaia2 also offers a larger dataset of over 1,000 real-world tasks.

Why is Gaia2 considered a community-driven project?

Gaia2 is an open-source dataset hosted on the Hugging Face Hub, allowing anyone to download tasks, run their agents, and submit results. This transparency and accessibility enable researchers from different labs to compare their agents fairly.

What kind of websites does Gaia2 focus on for its tasks?

Gaia2 focuses on tasks involving the 'open web,' meaning websites that are publicly accessible without requiring a login. This ensures that agents are tested on the unpredictable internet that real users encounter.

References

Gaia2 and ARE: Empowering the community to study agents – Original report (Hugging Face Blog)


TBB Desk
