What do model evaluations reveal?

Figure 1: Main capabilities evaluations. We compare the gpt-oss models at reasoning level high to OpenAI's o3, o3-mini, and o4-mini on canonical benchmarks. gpt-oss-120b surpasses OpenAI o3-mini and approaches OpenAI o4-mini accuracy. The smaller gpt-oss-20b model is also surprisingly competitive, despite being roughly 6 times smaller than gpt-oss-120b.

Model evaluations for gpt-oss reveal that these models, particularly gpt-oss-120b, excel at reasoning-heavy tasks such as math and coding. They perform strongly on benchmarks like AIME, GPQA, and MMLU, with gpt-oss-120b surpassing OpenAI o3-mini and approaching OpenAI o4-mini. For example, on AIME 2025 with tools, gpt-oss-120b achieved 97.9% accuracy, showcasing its advanced reasoning capabilities[1].

However, when evaluated on safety and robustness, the gpt-oss models generally performed similarly to OpenAI o4-mini. They performed well on disallowed-content evaluations, though improvements are still needed, particularly in instruction adherence and robustness against jailbreaks[1].
