As Simon has recently pointed out:

I think “agent” may finally have a widely enough agreed upon definition to be useful jargon now

A sign that the industry feels the same way is Hugging Face introducing a new set of GAIA tests called (surprise) Gaia2.

Where GAIA was read-only, Gaia2 is now a read-and-write benchmark, focusing on interactive behavior and complexity management.

In their new framework they approach the whole benchmark as a test of how a human would use agents to achieve their goals, i.e. sending emails, creating calendar events, or simply chatting with other agents. I think this reflects the way we are starting to use LLMs: not just in “read-only” but in “read-write” mode, where agents are becoming more and more interactive.

There is an interesting document where the Hugging Face team describes how the benchmark works, what they are testing, and how you can run the tests yourself. They also present their first results using typical models like Llama 3.3, GPT-4o, and Gemini.
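To make the read-only vs. read-write distinction concrete, here is a minimal, entirely hypothetical Python sketch. The `Environment`, tool functions, and scoring check are my own illustration, not Gaia2’s actual API; the point is that a read-write benchmark scores the state the agent *mutated* in the environment, not just a final text answer.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory "apps" standing in for the benchmark's simulated
# environment: a mailbox and a calendar the agent can both read and write.
@dataclass
class Environment:
    inbox: list = field(default_factory=list)
    calendar: list = field(default_factory=list)

def read_inbox(env: Environment) -> list:
    """Read-only action: GAIA-style information retrieval."""
    return env.inbox

def send_email(env: Environment, to: str, body: str) -> str:
    """Write action: the kind of state change a read-write benchmark scores."""
    env.inbox.append(f"to={to}: {body}")
    return "sent"

def create_event(env: Environment, title: str, when: str) -> str:
    """Another write action: creating a calendar event."""
    env.calendar.append({"title": title, "when": when})
    return "created"

def scripted_agent(env: Environment) -> None:
    """A trivial scripted "agent". In the real benchmark an LLM would choose
    which tools to call; it is hard-coded here to keep the sketch runnable."""
    read_inbox(env)                                    # read
    send_email(env, "alice@example.com", "On my way")  # write
    create_event(env, "Sync with Alice", "2025-10-01T10:00")  # write

def task_solved(env: Environment) -> bool:
    """A Gaia2-style check inspects the final environment state rather than
    grading a single text answer."""
    return any(e["title"] == "Sync with Alice" for e in env.calendar)

env = Environment(inbox=["From Alice: can we meet tomorrow?"])
scripted_agent(env)
print("task solved:", task_solved(env))
```

In a read-only benchmark only `read_inbox` would exist and the score would come from the model’s answer; here the writes themselves are what gets verified, which is exactly the shift Gaia2 is testing.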