Which puzzles break current LRM abilities?

Figure 13: Detailed results on reasoning effort (measured in inference thinking tokens) versus problem complexity (N) for three LRMs (DeepSeek-R1, Claude-3.7-Sonnet with thinking, and o3-mini) across four puzzle environments.

Current Large Reasoning Models (LRMs) degrade sharply once puzzle difficulty passes a threshold. Across the study's four puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World), the authors report 'complete accuracy collapse beyond certain complexities' and conclude that the models fail to develop 'generalizable problem-solving capabilities for planning tasks' as problem complexity (N) increases[1].
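
To make "problem complexity (N)" concrete, the sketch below generates and verifies Tower of Hanoi move sequences as the disk count N grows. It is a minimal illustration of the kind of rule-based puzzle simulator such evaluations depend on, not the study's actual harness; the function names and the N range are hypothetical.

```python
# Minimal sketch of a Tower of Hanoi generator/verifier parameterized by problem size N.
# Illustrative only; not the evaluation code used in the cited study [1].

def solve_hanoi(n, src=0, aux=1, dst=2):
    """Return the optimal move list (2^n - 1 moves) for n disks."""
    if n == 0:
        return []
    return (solve_hanoi(n - 1, src, dst, aux)   # move n-1 disks out of the way
            + [(src, dst)]                      # move the largest disk
            + solve_hanoi(n - 1, aux, src, dst))  # stack the n-1 disks back on top

def is_valid_solution(n, moves):
    """Simulate a proposed move sequence and check that it solves the puzzle."""
    pegs = [list(range(n, 0, -1)), [], []]      # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                        # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                        # illegal: larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))     # solved: all disks on the target peg

if __name__ == "__main__":
    for n in range(1, 11):                      # sweep problem complexity N
        moves = solve_hanoi(n)
        print(f"N={n:2d}  optimal moves={len(moves):4d}  valid={is_valid_solution(n, moves)}")
```

Because the optimal solution length grows as 2^N - 1, even modest values of N demand long, fully correct move sequences, which is what makes this family of puzzles a sharp probe of exact multi-step execution.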

LRMs also reason inefficiently. On simpler instances they exhibit an 'overthinking phenomenon,' continuing to explore incorrect alternatives after a correct solution has already appeared in the reasoning trace, while near the collapse point their thinking-token usage actually declines even as the problems get harder[1]. Together, these behaviors underscore how limited LRMs remain at exact computation and consistent multi-step reasoning as scenarios grow more complex.
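
The "reasoning effort versus complexity" view in the figure can be read as a simple aggregation over per-instance results. The sketch below assumes a record format of (problem size N, solved correctly, thinking tokens used) and reduces it to mean accuracy and mean thinking tokens per N, the two quantities such a plot tracks; the records are made-up placeholders, not data from the study.

```python
# Sketch: aggregate per-instance results into accuracy and mean thinking tokens per N.
# The records are illustrative placeholders, not results from the cited study [1].
from collections import defaultdict
from statistics import mean

records = [
    # (problem size N, solved correctly?, thinking tokens spent)
    (3, True, 1_200), (3, True, 1_500),
    (6, True, 4_800), (6, False, 9_300),
    (9, False, 11_000), (9, False, 10_200),
]

by_n = defaultdict(list)
for n, correct, tokens in records:
    by_n[n].append((correct, tokens))

for n in sorted(by_n):
    outcomes = by_n[n]
    accuracy = mean(1.0 if ok else 0.0 for ok, _ in outcomes)
    effort = mean(tok for _, tok in outcomes)
    print(f"N={n}: accuracy={accuracy:.2f}, mean thinking tokens={effort:.0f}")
```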