HiViG Test-Time Framework Improves Long-Horizon GUI Agent Performance


Excited to share ✨HiViG✨, a test-time intervention framework for long-horizon GUI tasks via history state tracking and visually grounded error analysis.

1️⃣ History state tracking: HiViG summarizes past interactions into a compact macro-action history, so the policy can plan over long horizons with a clear view of what has already been done.

2️⃣ Visually grounded error analysis: Instead of overly relying on the policy's textual intents, HiViG verifies raw execution coordinates against the current GUI env screenshot. If an action proposed by the policy is flawed (e.g., visual hallucination, termination misjudgment), it provides corrective guidance before execution.

hyunji amy lee@hyunji_amy_lee

🚨 Introducing HiViG, a test-time intervention framework for long-horizon GUI tasks. By tracking history & verifying actions w/ visual grounding, HiViG boosts performance across diverse GUI environments even for strong policies where existing critics often degrade performance.

At test time, HiViG guides the policy in two crucial phases: 1️⃣ Before proposing an action: it provides the policy with an updated summary of past interactions for better history-aware action generation. 2️⃣ After an action is proposed: it evaluates the proposed action using visually grounded reasoning to intercept any flawed action before execution.
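The two phases above can be sketched as a single intervention step. This is a minimal illustration, not the released implementation: `Verdict`, `intervene_step`, the `(kind, x, y)` action format, and the toy policy and critic are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, int, int]  # (kind, x, y); hypothetical action format


@dataclass
class Verdict:
    approved: bool
    guidance: str = ""


def intervene_step(
    screenshot: str,
    history: List[str],
    propose: Callable[[str, List[str], Optional[str]], Action],
    critique: Callable[[str, List[str], Action], Verdict],
    max_refinements: int = 1,
) -> Action:
    """Return the action to execute after at most `max_refinements` guided retries."""
    # Phase 1: the policy proposes an action conditioned on the history summary.
    action = propose(screenshot, history, None)
    for _ in range(max_refinements):
        # Phase 2: the critic checks the proposal against the screenshot
        # before anything is executed in the environment.
        verdict = critique(screenshot, history, action)
        if verdict.approved:
            break
        # The policy refines its proposal using the corrective guidance.
        action = propose(screenshot, history, verdict.guidance)
    return action


# Toy stand-ins (assumptions): the first proposal misses, the refined one hits.
def toy_propose(screenshot, history, guidance):
    return ("click", 40, 80) if guidance else ("click", 0, 0)


def toy_critique(screenshot, history, action):
    if action[1:] == (40, 80):
        return Verdict(True)
    return Verdict(False, "The click misses the Save button.")


chosen = intervene_step(
    "<screenshot>", ["Opened the Downloads directory"], toy_propose, toy_critique
)
```

With `max_refinements=1` the refined action is returned without a second critique; raising the limit lets the critic re-check each retry.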

Across three long-horizon GUI benchmarks with various environments (WebArenaLitev2 🌐, AndroidLab 📱, WindowsAgentArena 🖥️) on strong base policies (Qwen3-VL-32B-Thinking, Gemini-3-Flash), HiViG improves average overall success rate by 5.8% and 9.0% compared to the strongest critics, showing its effectiveness and generalization across diverse GUI platforms and policies! 💪

🧵👇


hyunji amy lee@hyunji_amy_lee

HiViG improves challenging long-horizon GUI tasks:

➡️ HiViG improves performance on tasks that otherwise remain very difficult: increasing Qwen3-VL-32B-Thinking’s success rate on the WebArenaLitev2 Map category from 3.9% to 23.1%, and Gemini-3-Flash’s success rate on the WindowsAgentArena Office category from 4.7% to 23.3%.

➡️ These results show that HiViG is particularly beneficial for challenging tasks where policies and existing test-time interventions often fail. HiViG-critic guides policies to make decisions anchored in visual content and historical progress.
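As a quick check on the numbers above, the improvements can be computed as absolute percentage-point differences in success rate. The dictionary layout and variable names are only for illustration.

```python
# (before, after) success rates in percent, as reported in the post above.
results = {
    ("Qwen3-VL-32B-Thinking", "WebArenaLitev2 Map"): (3.9, 23.1),
    ("Gemini-3-Flash", "WindowsAgentArena Office"): (4.7, 23.3),
}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {
    key: round(after - before, 1) for key, (before, after) in results.items()
}

for (policy, category), gain in gains.items():
    print(f"{policy} on {category}: +{gain} points")
```

The gains come out to +19.2 and +18.6 points, so the improvements here are percentage points rather than relative percentages.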


Justin Chih-Yao Chen@cyjustinchen

🚨Existing critics for Computer Use Agents can catch some mistakes, but often miss two things that matter most in long-horizon GUI tasks: 1⃣ They are short-sighted, focusing on the current step while losing track of what has already been accomplished. 2⃣ They lack visual grounding, making it difficult to verify whether a proposed action actually targets the correct UI element.

Introducing ✨HiViG✨, our new test-time intervention framework, which helps GUI agents in two ways: • Before action generation: it provides a compact, history-aware summary of completed achievements to support long-horizon planning. • After action generation: it performs a visually grounded critique to verify proposed actions against the current screenshot and intercept mistakes before they happen.

Across WebArenaLitev2 (Web), AndroidLab (Mobile), and WindowsAgentArena (Desktop), HiViG consistently improves strong base policies, including Qwen3-VL-32B-Thinking (+5.8%) and Gemini-3-Flash (+9.0%)!

We also find that: • History awareness helps agents maintain progress and avoid short-sighted decision loops in long-horizon tasks. • Visual grounding enables critics to catch execution-level errors that text-only critics often miss. • Combining both leads to robust gains across all three environments.

🧵👇

Elias Stengel-Eskin@EliasEskin

🚨 Test-time intervention for CUA tasks is hard: history is hard to represent, and actions require visual grounding and verification before execution, not after. HiViG jointly tackles these points, learning to track history and verify actions against the GUI screenshot.

As a test-time method, HiViG is compatible w/ open- and closed-source models and is domain- and model-general: we see 5.8-9% accuracy gains across WebArenaLitev2 (web), AndroidLab (mobile) and WindowsAgentArena (desktop), and across models/model classes (e.g., Qwen3-VL-32B, Gemini-3-Flash), with especially large gains on challenging/long-horizon tasks (+19.2% on WebArenaLitev2 Map, +18.6% on WindowsAgentArena Office).

🧵👇

hyunji amy lee@hyunji_amy_lee

📄http://arxiv.org/abs/2606.11078 🧑‍💻http://github.com/G-JWLee/HiViG 🤗http://huggingface.co/papers/2606.11078

Work led by @jwlee8877 w/ @codezakh, @ArchikiPrasad, @cyjustinchen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, @EliasEskin, @mohitban47 @unccs @unc_ai_group


hyunji amy lee@hyunji_amy_lee

Existing test-time interventions are limited in long-horizon GUI tasks: ❌ Scalar reward models are uninformative when all candidates are poor. ❌ Verbal critics struggle to keep track of task progress. ❌ Verbal critics over-rely on textual intent, erroneously approving visually misaligned actions.

HiViG (History-aware Visually Grounded) test-time intervention addresses these by training a verbal critic (HiViG-critic) with two core abilities:

✅ History state tracking: provides a macro-action history that summarizes past interactions to date (e.g., “Successfully opened the ‘Downloads’ directory and confirmed the ‘SpecialProjects’ folder is empty”).

✅ Visually grounded error analysis: verifies the raw execution coordinates of each action against the actual visual state and provides corrective guidance before the action is executed.
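One way to picture the macro-action history is as a short, ordered list of summaries rendered into the policy's prompt. The sketch below is an assumption about the data structure only: in HiViG the summaries come from the trained critic, whereas here they are passed in as plain strings, and `MacroHistory` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MacroHistory:
    """Ordered list of macro-action summaries (hypothetical container)."""

    entries: List[str] = field(default_factory=list)

    def update(self, summary: str) -> None:
        # Skip an exact repeat of the latest entry to keep the history compact.
        if not self.entries or self.entries[-1] != summary:
            self.entries.append(summary)

    def render(self) -> str:
        # Numbered text block that could be prepended to the policy prompt.
        if not self.entries:
            return "No progress yet."
        return "\n".join(f"{i}. {s}" for i, s in enumerate(self.entries, 1))


history = MacroHistory()
history.update("Successfully opened the 'Downloads' directory")
history.update("Successfully opened the 'Downloads' directory")  # ignored repeat
history.update("Confirmed the 'SpecialProjects' folder is empty")
prompt_block = history.render()
```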


hyunji amy lee@hyunji_amy_lee

Our ablation studies reveal that:

➡️ Deploying either component of HiViG on its own (visually grounded error analysis or history state tracking) outperforms all baselines, and combining the two within HiViG yields the highest performance, showing a strong synergistic effect.

➡️ Combining the two visual grounding strategies (intent masking & visual marker) outperforms using either strategy alone: intent masking breaks text-reliance and forces analysis of the screenshot, while the visual marker provides the spatial anchors needed to verify action execution.
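A toy sketch of the two grounding strategies, under stated assumptions: the screenshot is a small grid of integers instead of a real image, the action is a plain dict with a hypothetical `intent` field, and `mask_intent` and `draw_marker` are illustrative names rather than functions from the HiViG codebase.

```python
from typing import Dict, List


def mask_intent(action: Dict) -> Dict:
    """Return a copy of the action without the policy's textual intent."""
    return {k: v for k, v in action.items() if k != "intent"}


def draw_marker(
    pixels: List[List[int]], x: int, y: int, radius: int = 1, value: int = 9
) -> List[List[int]]:
    """Return a copy of the image with a plus-shaped marker centred on (x, y)."""
    height, width = len(pixels), len(pixels[0])
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("marker coordinates fall outside the screenshot")
    out = [row[:] for row in pixels]  # leave the original screenshot untouched
    for d in range(-radius, radius + 1):
        if 0 <= x + d < width:
            out[y][x + d] = value  # horizontal arm
        if 0 <= y + d < height:
            out[y + d][x] = value  # vertical arm
    return out


screenshot = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
action = {"type": "click", "x": 1, "y": 1, "intent": "Click the Save button"}

critic_action = mask_intent(action)  # the critic no longer sees the intent text
marked = draw_marker(screenshot, action["x"], action["y"])
```

The critic would then receive `critic_action` together with `marked`, so its judgment has to rest on where the marker lands rather than on what the policy said it meant to do.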


hyunji amy lee@hyunji_amy_lee

We construct two distinct datasets to train HiViG-critic for two capabilities: history state tracking and visually grounded error analysis.

1️⃣ History state tracking: Iteratively translates visual state changes into a compact macro-action history to track long-term goal progress.

2️⃣ Visually grounded error analysis: To capture diverse failure modes with visually grounded rationales, we annotate rationales in three steps: ➡️ Step 1: extract ground-truth state transitions ➡️ Step 2: synthesize plausible errors ➡️ Step 3: generate a multi-stage rationale for error analysis that leverages a visual marker and the extracted state transitions.
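Step 2 could look roughly like the sketch below. The thread does not say how errors are synthesized, so everything here is an assumption for illustration: the bounding-box format, the `label` field, and the helpers `inside`, `misplaced_click`, and `premature_stop` are hypothetical. The two error types mirror the failure modes named in the thread (visual hallucination, termination misjudgment).

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); hypothetical bounds


def inside(box: Box, x: int, y: int) -> bool:
    """True if the point (x, y) lies within the element's bounding box."""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def misplaced_click(action: Dict, target: Box, margin: int = 10) -> Dict:
    """Shift a correct click just past the right edge of its target element."""
    right = target[2]
    return {**action, "x": right + margin, "label": "visual_hallucination"}


def premature_stop(action: Dict) -> Dict:
    """Swap a correct action for early termination of the task."""
    return {"type": "terminate", "label": "termination_misjudgment"}


ground_truth = {"type": "click", "x": 50, "y": 20}
save_button: Box = (40, 10, 60, 30)

bad_click = misplaced_click(ground_truth, save_button)
bad_stop = premature_stop(ground_truth)
```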


hyunji amy lee@hyunji_amy_lee

Setup: ➡️ We use a generalized pixel-based action space (adopted from Qwen3) to navigate diverse GUIs (web 🌐, mobile 📱, desktop 🖥️). ➡️ All baselines and HiViG conduct test-time intervention, which serves as a pre-execution action evaluation that intercepts a policy’s proposed action before it alters the environment. ➡️ The policy receives the feedback and either proceeds or refines its action.

HiViG outperforms existing test-time interventions across diverse GUI platforms and policies: ➡️ avg. +7.3% for the open-weight Qwen3-VL-32B-Thinking policy and avg. +9.0% for the frontier closed-source Gemini-3-Flash policy.

➡️ avg. +5.8% over the strongest test-time intervention baseline for the Qwen3-VL-32B policy.

Lack of visual grounding and history conditioning hurts other critics: ➡️ Unlike HiViG, baseline interventions often degrade frontier model performance, yielding at best a 2.1% gain on WindowsAgentArena and at worst an 8.4% drop on WebArenaLitev2 for Gemini-3-Flash.

➡️ In contrast, applying HiViG-critic raises Gemini-3-Flash’s success rate on WebArenaLitev2 from 30.5% to 45.5%.
