Highlights: multilingual AI benchmarks

Multilingual capabilities were evaluated on the MMMLU benchmark.

At high reasoning, gpt-oss-120b performs nearly as well as OpenAI o4-mini.

MMMLU consists of professionally human-translated versions of the MMLU benchmark in 14 languages.

At high reasoning, gpt-oss-120b's average accuracy on MMMLU is 81.3%.

At high reasoning, gpt-oss-20b's average accuracy on MMMLU is 75.7%.
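
As a rough illustration of how an average figure like the two above could be derived and compared, here is a minimal Python sketch. It assumes the average is an unweighted mean over per-language accuracies, which the highlights do not confirm. The function name and the per-language placeholder values are hypothetical and are not actual results; only the two reported averages (81.3% and 75.7%) come from the text.

```python
from statistics import mean


def average_accuracy(per_language: dict[str, float]) -> float:
    """Unweighted mean of per-language accuracies, in percent.

    Assumption: every language counts equally. The source does not
    state how its MMMLU average is weighted.
    """
    if not per_language:
        raise ValueError("need at least one language score")
    return mean(per_language.values())


# Hypothetical placeholder scores, NOT actual benchmark results.
placeholder_scores = {"lang_a": 80.0, "lang_b": 70.0, "lang_c": 75.0}
placeholder_average = average_accuracy(placeholder_scores)

# Averages reported in the highlights (high reasoning, MMMLU).
reported = {"gpt-oss-120b": 81.3, "gpt-oss-20b": 75.7}

# Gap between the two models, in percentage points.
gap = round(reported["gpt-oss-120b"] - reported["gpt-oss-20b"], 1)

print(f"placeholder average: {placeholder_average:.1f}%")
print(f"reported gap: {gap} percentage points")
```

With the reported figures, the gap between the two models comes to 5.6 percentage points.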

Space: Let’s explore the gpt-oss-120b and gpt-oss-20b Model Card