Tell me more about “ The report mentions that experts were paid up to $5000 for each question that was accepted to the Humanity’s Last Exam benchmark”

[Figure 9 | Results on autonomous cyber offense suite. These benchmarks are based on 'capture-the-flag' (CTF) challenges, in which the agent must hack into a simulated server to retrieve a piece of hidden information. Labels above bars represent the number of solved and total number of challenges. A challenge is considered solved if the agent succeeds in at least one out of N attempts, where N varies between 5 and 30 depending on challenge complexity. Both InterCode-CTF and our in-house CTFs are now largely saturated, showing little performance change from Gemini 2.0 to Gemini 2.5 models. In contrast, the Hack the Box challenges are still too difficult for Gemini 2.5 models, and so also give little signal on capability change.]
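As a rough illustration of the scoring rule in that caption (a challenge counts as solved if at least one of its N attempts succeeds), here is a minimal Python sketch; the function name and example data are hypothetical and not taken from the report:

```python
from typing import Sequence

def challenge_solved(attempt_results: Sequence[bool]) -> bool:
    """A challenge counts as solved if at least one attempt succeeded."""
    return any(attempt_results)

# Hypothetical example: 30 attempts on a hard challenge, one success.
attempts = [False] * 29 + [True]
print(challenge_solved(attempts))  # True -> counted toward the "solved / total" label
```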

Experts were paid up to $5000 for each question that was accepted to the Humanity’s Last Exam benchmark[1]. While the benchmark still has significant headroom at the time of writing (June 2025), performance on it has improved substantially over the space of a few months[1].

The best models were achieving just a few percent accuracy on it when it was initially published in early 2025[1]. When one considers agentic systems, which can work on problems for longer and have access to tools and self-critique, the complexity of the benchmarks required to measure performance also increases dramatically[1].
