Experts were paid up to $5000 for each question accepted into the Humanity’s Last Exam benchmark[1]. While this benchmark still has substantial headroom at the time of writing (June 2025), performance on it has improved markedly in the space of a few months[1].
The best models achieved only a few percent accuracy on it when it was first published in early 2025[1]. For agentic systems, which can work on problems for longer and have access to tools and self-critique, the complexity of the benchmarks required to measure performance also increases dramatically[1].