arxiv:2604.21375

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Published on Apr 23
· Submitted by
Cihang Xie
on Apr 24

Abstract

VLAA-GUI is a modular GUI agent framework that addresses early stopping and repetitive loop issues through integrated components for verification, loop breaking, and search capabilities.

AI-generated summary

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
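The abstract's decision rule for the Completeness Verifier (reject completion claims that lack direct visual evidence) can be illustrated with a small sketch. This is not the authors' implementation: the `Criterion` and `FinishClaim` structures, the `verify_completion` function, and the idea of representing evidence as a text string are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Criterion:
    """One UI-observable success criterion (hypothetical structure)."""
    description: str
    # What the agent says it can see on screen that proves the criterion;
    # an empty string means no direct visual evidence was cited.
    visual_evidence: str = ""


@dataclass
class FinishClaim:
    """A 'task done' claim raised by the agent at a finish step."""
    criteria: List[Criterion] = field(default_factory=list)


def verify_completion(claim: FinishClaim) -> Tuple[bool, List[str]]:
    """Accept the claim only if every criterion cites visual evidence.

    Returns (accepted, descriptions of the criteria that lack evidence).
    """
    if not claim.criteria:
        # A claim with no stated criteria cannot be checked, so reject it.
        return False, ["no success criteria stated"]
    missing = [
        c.description for c in claim.criteria if not c.visual_evidence.strip()
    ]
    return len(missing) == 0, missing


claim = FinishClaim([
    Criterion("document saved", "title bar shows report.odt"),
    Criterion("heading is bold"),  # no evidence cited, so the claim fails
])
accepted, missing = verify_completion(claim)
print(accepted, missing)
```

In this sketch a rejected claim returns the unmet criteria, so the agent could keep working on them rather than stopping early.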

Community

Paper submitter

Autonomous GUI agents suffer from two chronic failure modes: early stopping (declaring success before the task is actually done) and repetitive loops (cycling through the same failing action without recovering). VLAA-GUI is a modular framework with three integrated components that tell the agent when to STOP, RECOVER, and SEARCH:

  1. A mandatory Completeness Verifier enforces UI-observable success criteria at every finish step, double-checked by an independent verifier model.
  2. A mandatory Loop Breaker detects repeated actions / recurring screen states / reflection-signaled stalls and escalates across three tiers.
  3. An on-demand Search Agent queries a search-grounded LLM directly in text, skipping the overhead of browser-based visual search.
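The Loop Breaker in item 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: the `LoopBreaker` class, the thresholds `action_limit` and `screen_limit`, the use of a screen hash to detect recurring states, and the returned signal names are all hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Optional


class LoopBreaker:
    """Hypothetical three-signal loop detector with escalating responses."""

    def __init__(self, action_limit: int = 2, screen_limit: int = 3) -> None:
        self.action_limit = action_limit  # assumed threshold, not from the paper
        self.screen_limit = screen_limit  # assumed threshold, not from the paper
        self.recent_actions: List[str] = []
        self.screen_counts: Counter = Counter()

    def observe(
        self,
        action: str,
        screen_hash: Hashable,
        reflection_stalled: bool = False,
    ) -> Optional[str]:
        """Record one step and return an intervention signal, or None."""
        self.recent_actions.append(action)
        self.screen_counts[screen_hash] += 1

        # A reflection-signaled stall is bound directly to a strategy shift.
        if reflection_stalled:
            return "change_strategy"
        # The same screen state keeps recurring: force a strategy change.
        if self.screen_counts[screen_hash] >= self.screen_limit:
            return "change_strategy"
        # The same action was repeated back to back: switch interaction mode
        # (for example, from mouse clicks to keyboard shortcuts).
        tail = self.recent_actions[-self.action_limit:]
        if len(tail) == self.action_limit and len(set(tail)) == 1:
            return "switch_interaction_mode"
        return None


breaker = LoopBreaker()
for step in range(3):
    print(step, breaker.observe("click(10, 10)", screen_hash="login-page"))
```

With these assumed defaults, repeating one action on an unchanged screen first triggers a mode switch and then, one step later, a forced strategy change, which mirrors the escalation described in the list above.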

Paired with five top-tier backbones, VLAA-GUI reaches 77.5% on OSWorld-Verified with Claude Opus 4.6 and 61.0% on WindowsAgentArena. Three of the five backbones surpass human-level performance (72.4%) on OSWorld in a single pass, and VLAA-GUI with Sonnet 4.6, limited to only 15 action steps, already outperforms the best published 50-step system.



