What triggers reasoning collapse in LRMs?

Figure 12: Density distribution of first failure moves for thinking and non-thinking models across puzzle environments. Top: Claude-3.7-Sonnet comparison; Bottom: DeepSeek-R1 vs DeepSeek-V3.

Reasoning collapse in Large Reasoning Models (LRMs) is triggered by their failure to develop generalizable problem-solving capabilities beyond certain complexity thresholds. The empirical investigation shows that accuracy progressively declines as problem complexity increases until reaching complete collapse, where performance drops to zero beyond a model-specific threshold[1].
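To make the notion of a model-specific threshold concrete, the sketch below shows one way such a collapse point could be located from per-instance evaluation results. It assumes hypothetical records with `complexity` and `correct` fields (these names are illustrative, not from the source), and defines the threshold as the lowest complexity level at which measured accuracy falls to zero.

```python
from collections import defaultdict
from typing import Iterable, Optional


def collapse_threshold(records: Iterable[dict]) -> Optional[int]:
    """Return the smallest complexity level with zero accuracy, if any.

    Each record is assumed to look like:
        {"complexity": 7, "correct": False}
    where `complexity` is the puzzle's compositional depth
    (e.g. number of disks in a Tower of Hanoi instance).
    """
    totals = defaultdict(int)   # attempts per complexity level
    solved = defaultdict(int)   # correct solutions per complexity level

    for r in records:
        totals[r["complexity"]] += 1
        solved[r["complexity"]] += int(r["correct"])

    for level in sorted(totals):
        accuracy = solved[level] / totals[level]
        if accuracy == 0.0:
            return level  # complete collapse begins at this level
    return None  # no collapse observed within the evaluated range
```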

Additionally, there is a counterintuitive reduction in reasoning effort, measured by inference tokens, as models approach this critical complexity point, despite having sufficient computational resources. This indicates inherent limitations in the reasoning capabilities of LRMs, revealing that they do not effectively leverage additional inference time as problem complexity escalates[1].
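The reduction in reasoning effort can be quantified in a similarly simple way. The sketch below, again assuming a hypothetical `thinking_tokens` field on each evaluation record, aggregates the mean inference-token count per complexity level; the counterintuitive pattern described above would appear as token usage rising with complexity and then declining as the collapse threshold is approached.

```python
from collections import defaultdict
from statistics import mean


def reasoning_effort_by_complexity(records):
    """Mean inference-token count per complexity level.

    Each record is assumed to carry a `thinking_tokens` field giving the
    number of tokens the model spent in its reasoning trace for that
    problem instance.
    """
    buckets = defaultdict(list)
    for r in records:
        buckets[r["complexity"]].append(r["thinking_tokens"])
    return {level: mean(tokens) for level, tokens in sorted(buckets.items())}
```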