Q1. 🤔 On which coding task has Gemini achieved a SoTA (state-of-the-art) score?
- LiveCodeBench
- Aider Polyglot
- SWE-bench Verified
- Humanity’s Last Exam
Answer: Aider Polyglot
Q2. 🧐 What challenge does Gemini face regarding evaluation benchmarks, especially with capable reasoning agents?
- Development of novel and sufficiently challenging evaluation benchmarks has struggled to keep pace with model capability improvements
- Benchmarks are too easy to solve
- Benchmarks are too expensive to create
- Benchmarks do not represent tasks that have economic value
Answer: Development of novel and sufficiently challenging evaluation benchmarks has struggled to keep pace with model capability improvements
Q3. 🤯 What was the payment range for experts contributing accepted questions to the Humanity’s Last Exam benchmark?
- Up to $1000 per question
- Up to $3000 per question
- Up to $5000 per question
- Up to $10,000 per question
Answer: Up to $5000 per question
