Gemini Deep Research's performance on the Humanity's Last Exam benchmark has improved significantly, from 7.95% in December 2024 to a SoTA score of 26.9% (32.4% with higher compute) in June 2025. The report also mentions a 'topological trap' in AI reasoni...
Q1. 🤖 What does Gemini 2.5 Pro excel at?
- Producing interactive web applications
- Producing paper
- Producing music
- Producing vehicles
Answer: Producing interactive web applications
Q2. 🤔 According to the report, experts making questions for the Humanity’s Last Exam benchmark were paid... ...
While the Gemini Plays Pokémon agent was solving the Seafoam Islands dungeon, it encountered a novel bug in the code of Pokémon Red/Blue. It is likely the first AI to find a bug in the game's code. This bug occurred while navigating the multi-level dungeon. The Seafoam Islands dungeon contains five...
Q1. 💰 How much were experts potentially paid for accepted questions in the Humanity’s Last Exam benchmark?
- $1000
- $2500
- $5000
- $7500
Answer: $5000
Q2. 📈 What key factor is essential for scaling evaluations to match the increasing capabilities of AI systems?
- Using synthetic data for trainin...
The Gemini Plays Pokémon (GPP) agent encountered a novel bug in the code of Pokémon Red/Blue. According to the report, GPP is likely the first AI to find this bug in the game's code. This occurred in the Seafoam Islands, which contain 5 floors involving multiple boulder puzzles. These puzzles requi...
Q1. 🎮 What specific game did the Gemini Plays Pokémon agent play?
- Pokémon Yellow
- Pokémon Red
- Pokémon Blue
- Pokémon Green
Answer: Pokémon Blue
Q2. 🧠 What reasoning action is recognized as especially impressive from Gemini 2.5 Pro playing Pokémon?
- Solving spinner puzzles
- Finding long path...
Q1. 🤔 What is the name of the most capable model introduced in the Gemini 2.X model family?
- Gemini 2.5 Lite
- Gemini 2.5 Pro
- Gemini 2.0 Flash
- Gemini 2.0 Pro
Answer: Gemini 2.5 Pro
Q2. 💡 Besides coding and reasoning skills, what is another capability of the Gemini 2.5 Pro model?
- Excel at di...
🤯 AI just reached a new milestone! The Gemini 2.5 family of models is here, pushing the boundaries of what's possible with complex AI [1]. Get ready for the next generation of agentic systems! 🧠 Gemini 2.5 Pro is the most capable model yet! It excels at coding, reasoning, and multimodal understand...
Q1. 🤔 Which models are included in the Gemini 2.X model family?
- Gemini 2.5 Ultra and Gemini 2.5 Micro
- Gemini 2.5 Pro and Gemini 2.5 Flash
- Gemini 2.0 Max and Gemini 2.0 Mini
- Gemini 2.5 Advanced and Gemini 2.5 Basic
Answer: Gemini 2.5 Pro and Gemini 2.5 Flash
Q2. 💡 Besides coding and reasoni...
Gemini 2.5 Pro is the most capable model developed yet. It excels at coding, math, and reasoning tasks and achieves state-of-the-art performance on the Aider Polyglot evaluation....
The Gemini 2.X models are all built to be natively multimodal, support long-context inputs of >1 million tokens, and have native tool use support. This allows them to comprehend vast datasets and handle complex problems from different information sources, including text, audio, images, video and e...
Gemini 2.5 Pro excels at coding tasks and represents a marked improvement over previous models. Performance on LiveCodeBench increased from 30.5% for Gemini 1.5 Pro to 69.0% for Gemini 2.5 Pro, and performance on Aider Polyglot rose from 16.9% to 82.2%. Relative to other large language models, Gemini a...
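To put the coding scores quoted above in perspective, here is a minimal sketch that works out the gain from Gemini 1.5 Pro to Gemini 2.5 Pro in percentage points and as a multiple. It uses only the four scores given in the text; the helper name `gain` and the `scores` dictionary are illustrative, not part of the report.

```python
def gain(before, after):
    """Return (percentage-point gain, relative multiple) for two scores given in percent."""
    return round(after - before, 1), round(after / before, 2)


# Scores as quoted in the text: (Gemini 1.5 Pro, Gemini 2.5 Pro), in percent.
scores = {
    "LiveCodeBench": (30.5, 69.0),
    "Aider Polyglot": (16.9, 82.2),
}

for name, (old, new) in scores.items():
    points, multiple = gain(old, new)
    print(f"{name}: +{points} points, {multiple}x the earlier score")
```

On these figures, LiveCodeBench rises by 38.5 points (about 2.26 times the earlier score) and Aider Polyglot by 65.3 points (about 4.86 times).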