What do model evaluations reveal?

Figure 1: Main capabilities evaluations. We compare the gpt-oss models at reasoning level high to OpenAI's o3, o3-mini, and o4-mini on canonical benchmarks. gpt-oss-120b surpasses OpenAI o3-mini and approaches OpenAI o4-mini accuracy. The smaller gpt-oss-20b model is also surprisingly competitive, despite being roughly 6 times smaller than gpt-oss-120b.

Model evaluations for gpt-oss reveal that these models, particularly gpt-oss-120b, excel at reasoning-heavy tasks such as math and coding. They perform strongly on benchmarks like AIME, GPQA, and MMLU, with gpt-oss-120b surpassing OpenAI o3-mini and approaching OpenAI o4-mini. For example, on AIME 2025 with tools, gpt-oss-120b achieved 97.9% accuracy, showcasing its advanced reasoning capabilities[1].

However, when evaluated on safety and robustness, the gpt-oss models generally performed similarly to OpenAI o4-mini. They performed well on disallowed-content evaluations, though improvements are still needed, particularly in instruction adherence and robustness against jailbreaks[1].
