
UI-TARS is a native GUI agent model designed to operate without manual rules or cascaded modules, perceiving screenshots, applying reasoning, and generating actions autonomously[1]. UI-TARS learns from prior experience, refining its performance using environment feedback[1]. It is an end-to-end model for GUI interaction, with an architecture designed to streamline operations[1].

The architecture of UI-TARS involves a sequential process of observations and actions to accomplish tasks[1]. This process can be expressed as (instruction, (o_1, a_1), (o_2, a_2), ..., (o_n, a_n)), where o_i denotes the observation (a device screenshot) at time step i and a_i the action executed by the agent[1]. At each time step i, UI-TARS takes as input the task instruction, the history of prior interactions (o_1, a_1, ..., o_{i-1}, a_{i-1}), and the current observation o_i[1]. Based on this input, the model outputs an action a_i from a predefined action space[1].
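This interaction loop can be sketched in code. The class and function names below are illustrative assumptions, not the actual UI-TARS API; the toy policy stands in for the model.

```python
# Sketch of the sequential interaction process (hypothetical names; the
# real model interface and action space are not specified in the source).
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    instruction: str
    history: list = field(default_factory=list)  # [(o_1, a_1), ..., (o_{i-1}, a_{i-1})]

    def step(self, observation, policy):
        # Model input: instruction + full interaction history + current observation.
        action = policy(self.instruction, self.history, observation)
        # The executed pair (o_i, a_i) is appended to the history for the next step.
        self.history.append((observation, action))
        return action

# Toy policy standing in for the model: always clicks.
toy_policy = lambda instr, hist, obs: "click"

traj = Trajectory("open settings")
a1 = traj.step("screenshot_1", toy_policy)
a2 = traj.step("screenshot_2", toy_policy)
```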
UI-TARS incorporates core capabilities including perception, action, reasoning (System 1 & 2 thinking), and memory[1]. This design supports effective interaction with graphical user interfaces[1].
To enhance the agent’s reasoning and enable more deliberate decision-making, UI-TARS integrates a reasoning component in the form of 'thoughts' generated before each action, reflecting the reflective nature of 'System 2' thinking[1]. These thoughts act as an intermediary step, guiding the agent to reconsider previous actions and observations before moving forward, which improves decision-making in ambiguous situations[1]. The process can then be formalized as (instruction, (o_1, t_1, a_1), (o_2, t_2, a_2), ..., (o_n, t_n, a_n)), where t_i represents the reasoning thought at step i[1].
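A thought-augmented step can be sketched as follows; the method names (`reason`, `act`) and the toy model are assumptions for illustration only.

```python
# Sketch of a thought-augmented step: a reasoning thought t_i is generated
# before each action a_i (names are illustrative, not the real API).
def think_then_act(instruction, history, observation, model):
    thought = model.reason(instruction, history, observation)   # System 2 reflection
    action = model.act(instruction, history, observation, thought)
    history.append((observation, thought, action))              # (o_i, t_i, a_i)
    return thought, action

class ToyModel:
    def reason(self, instr, hist, obs):
        return f"seen {len(hist)} prior steps; deciding next move for '{instr}'"
    def act(self, instr, hist, obs, thought):
        return "click"

hist = []
t, a = think_then_act("open settings", hist, "screenshot_1", ToyModel())
```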
These core capabilities are detailed below.
GUI perception involves interpreting graphical user interfaces in real time, adapting to changes as the interface evolves[1]. UI-TARS uses structured text from HTML, visual screenshots, and semantic outlines to achieve a holistic understanding[1]. It is trained to identify elements and generate descriptions, enabling the model to understand the interface layout[1]. Because GUIs are dynamic, the agent must be able to interact with and adapt to the current environment in real time[1].
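A minimal sketch of how perception output might be represented, assuming structured per-element descriptions tied to screen regions (the field names and summary format here are assumptions, not the model's actual output schema):

```python
# Hypothetical representation of perceived UI elements.
from dataclasses import dataclass

@dataclass
class UIElement:
    role: str          # e.g. "button", "textbox"
    description: str   # model-generated natural-language description
    bbox: tuple        # (x_min, y_min, x_max, y_max) in pixels

def describe_layout(elements):
    # Combine per-element descriptions into a holistic view of the interface.
    return "; ".join(f"{e.role}: {e.description}" for e in elements)

screen = [
    UIElement("button", "blue 'Submit' button, bottom right", (900, 700, 1000, 740)),
    UIElement("textbox", "empty search field at top", (100, 20, 600, 60)),
]
summary = describe_layout(screen)
```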

Effective action mechanisms within UI-TARS must be versatile, precise, and adaptable to various GUI contexts[1].
UI-TARS uses a unified action space that standardizes actions such as 'click', 'type', 'scroll', and 'drag' across platforms[1]. Actions can be categorized into atomic actions, which execute single operations, and compositional actions, which sequence multiple atomic actions[1]. UI-TARS predicts the coordinates of elements it needs to interact with, normalizing coordinates to maintain consistency across different devices[1].
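Coordinate normalization can be sketched as below. The [0, 1000] range is an assumption for illustration; the source only states that coordinates are normalized for cross-device consistency.

```python
# Sketch of a unified click action with normalized coordinates
# (the scale of 1000 is an illustrative assumption).
def normalize(x, y, width, height, scale=1000):
    # Map pixel coordinates to a resolution-independent range.
    return round(x / width * scale), round(y / height * scale)

def make_click(x_px, y_px, screen_w, screen_h):
    nx, ny = normalize(x_px, y_px, screen_w, screen_h)
    return {"action": "click", "x": nx, "y": ny}

# The same relative position on two different devices yields the same action.
a_phone = make_click(540, 960, 1080, 1920)    # center of a 1080x1920 phone
a_tablet = make_click(800, 600, 1600, 1200)   # center of a 1600x1200 tablet
```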

Reasoning is a key capability that integrates a variety of cognitive functions, separated into two systems[1]. System 1 refers to fast, automatic, and intuitive thinking, typically employed for simple, routine tasks[1]. System 2 encompasses slow, deliberate, and analytical thinking, crucial for solving complex tasks[1].
System 2 Reasoning represents the agent’s ability to handle complex, multi-step tasks through explicit intermediate thinking processes[1]. The System 2 mechanism of UI-TARS uses techniques like task decomposition, long-term consistency, milestone recognition, and reflection[1].
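The interplay of decomposition, milestones, and reflection can be sketched as a control loop. The subtask list and flow below are illustrative assumptions; in UI-TARS the decomposition and reflection would be produced by the model itself.

```python
# Sketch of System 2-style task decomposition with milestone tracking
# and reflection on failure (structure is an illustrative assumption).
def decompose(task):
    # In UI-TARS the model would generate this; hard-coded here for the sketch.
    return ["open settings", "find network section", "toggle wifi"]

def run_with_milestones(task, execute):
    completed = []
    for subtask in decompose(task):
        ok = execute(subtask)
        if not ok:
            # Reflection point: a failed subtask is surfaced for
            # reconsideration rather than blindly continuing.
            return completed, subtask
        completed.append(subtask)  # milestone reached; supports long-term consistency
    return completed, None

done, failed = run_with_milestones("enable wifi", lambda s: True)
```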
The memory components of UI-TARS are used to store knowledge and historical experience that the agent refers to when making decisions[1]. It uses short-term memory for task-specific information and long-term memory for background knowledge[1]. Native agent models encode long-term operational experience within their internal parameters, converting the interaction process into implicit, parameterized storage[1].
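The contrast between the two memory types can be sketched as follows, assuming a simple bounded buffer for short-term memory (the class and capacity are illustrative, not the actual mechanism):

```python
# Sketch contrasting explicit short-term memory with the implicit long-term
# memory of a native agent model (names and capacity are assumptions).
class ShortTermMemory:
    # Task-specific context: recent observations, thoughts, and actions.
    def __init__(self, capacity=5):
        self.capacity = capacity
        self.items = []

    def add(self, item):
        self.items.append(item)
        self.items = self.items[-self.capacity:]  # keep only recent steps

# Long-term memory, by contrast, is not an explicit store to query: in a
# native agent model, operational experience is absorbed into the model's
# parameters during training (implicit, parameterized storage).

stm = ShortTermMemory(capacity=3)
for step in ["o1/a1", "o2/a2", "o3/a3", "o4/a4"]:
    stm.add(step)
```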