How does UI-TARS enhance GUI perception beyond textual inputs?