Which model outperformed others on the OSWorld benchmark?


UI-TARS achieved state-of-the-art results across a variety of standard benchmarks, improving on prior models^[1]. On the OSWorld benchmark, UI-TARS scores 24.6 with a 50-step budget and 22.7 with a 15-step budget, outperforming Claude’s 22.0 and 14.9, respectively^[2].

UI-TARS-72B with a 15-step budget (22.7) is comparable to Claude with a 50-step budget (22.0), which indicates that it reaches similar performance in far fewer steps^[2]. UI-TARS-72B-DPO sets a new SOTA result of 24.6 on OSWorld with a 50-step budget^[2].
