What is the main challenge in training native GUI agents?

One of the primary challenges in training native GUI agents is the data bottleneck[1]. Training an end-to-end agent model requires data that integrates all components in a unified workflow, capturing the interplay between perception, reasoning, memory, and action[1]. Comprehensive, high-quality recordings of expert workflows have historically been scarce, which limits the ability of native agents to generalize across diverse real-world scenarios and hinders their scalability and robustness[1].
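To make the idea of "unified workflow" data more concrete, here is a minimal sketch of what a single training record might look like. It is an illustration only, not the format used by UI-TARS or any other project: the names `Step`, `Trajectory`, `context_for_step`, and `is_complete` are hypothetical, and memory is modelled simply as a window over preceding steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One interaction step (hypothetical schema, for illustration only)."""
    screenshot_path: str  # perception: what the agent saw
    thought: str          # reasoning: why it chose the action
    action: str           # action: what it did, e.g. "click(120, 48)"


@dataclass
class Trajectory:
    """A task and its ordered steps recorded as one unified workflow."""
    task: str
    steps: List[Step] = field(default_factory=list)

    def context_for_step(self, index: int, window: int = 2) -> List[Step]:
        """Return up to `window` preceding steps, standing in for memory."""
        start = max(0, index - window)
        return self.steps[start:index]

    def is_complete(self) -> bool:
        """True only if every step carries perception, reasoning, and action."""
        return bool(self.steps) and all(
            s.screenshot_path and s.thought and s.action for s in self.steps
        )


# Usage: a three-step trajectory for a made-up task.
traj = Trajectory(
    task="Open the settings page",
    steps=[
        Step("shot_0.png", "The menu icon is in the top-left corner.", "click(20, 20)"),
        Step("shot_1.png", "The menu is open; Settings is listed.", "click(60, 180)"),
        Step("shot_2.png", "The settings page is now visible.", "finish()"),
    ],
)
memory = traj.context_for_step(2)  # the two steps preceding the last one
```

A record missing any one field (for example, actions logged without the reasoning behind them) would fail `is_complete`, which is the kind of gap the data bottleneck refers to: recordings that capture all components together are much harder to come by than recordings of isolated ones.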

Another challenge is the high information density of GUI environments, which makes robust agents harder to develop[1]. Native GUI agent models must effectively recognize and interpret user interfaces as they evolve[1].