📚 An AI model is presented with a sentence like "She picked up the heavy book and..." and needs to choose the most plausible continuation from several options. Which AI benchmark is designed to evaluate a model's ability to choose the most plausible ending to a given sentence, testing commonsense natural language inference?
Difficulty: Medium
💻 A software development team wants to evaluate an AI's capability to autonomously identify and fix bugs in a large codebase. The AI is given access to GitHub issues and the repository, and its task is to generate and apply code patches. Which benchmark specifically evaluates an LLM's ability to resolve real-world software issues from GitHub by generating code patches for identified problems?
Difficulty: Hard