/Tech1h ago

HiViG Test-Time Framework Improves Long-Horizon GUI Agent Performance

Original post

Mohit Bansal#258

hyunji amy lee@hyunji_amy_lee

🚨 Introducing HiViG, a test-time intervention framework for long-horizon GUI tasks. By tracking history & verifying actions w/ visual grounding, HiViG boosts performance across diverse GUI environments even for strong policies where existing critics often degrade performance.

At test time, HiViG guides the policy in two crucial phases:

1️⃣ Before proposing an action: it provides the policy with an updated summary of past interactions for better history-aware action generation.

2️⃣ After an action is proposed: it evaluates the proposed action using visually grounded reasoning to intercept any flawed action before execution.
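The two phases can be sketched as a single decision step. This is a minimal illustration, not the released HiViG code: `Action`, `Verdict`, `run_step`, the toy callables, and the `max_refinements` budget are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    kind: str                             # e.g. "click", "type", "terminate"
    xy: Optional[Tuple[int, int]] = None  # raw pixel coordinates, if any
    intent: str = ""                      # the policy's textual description


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


def run_step(policy: Callable, summarize: Callable, verify: Callable,
             screenshot, history: List[str], max_refinements: int = 1) -> Action:
    # Phase 1 (before proposing): give the policy a fresh history summary.
    summary = summarize(history)
    action = policy(screenshot, summary, "")
    # Phase 2 (after proposing): check the action before it is executed.
    # max_refinements is an assumed budget; the last refinement is not re-checked.
    for _ in range(max_refinements):
        verdict = verify(screenshot, action)
        if verdict.ok:
            break
        action = policy(screenshot, summary, verdict.feedback)
    return action


# Toy stand-ins for the policy and the critic (hypothetical).
def toy_policy(screenshot, summary, feedback):
    # Clicks the wrong spot first, then corrects itself once it gets feedback.
    return Action("click", (50, 50) if feedback else (5, 5), "click Save")


def toy_summarize(history):
    return "; ".join(history)


def toy_verify(screenshot, action):
    ok = action.xy == (50, 50)
    return Verdict(ok, "" if ok else "No button at that location.")


chosen = run_step(toy_policy, toy_summarize, toy_verify, None, ["opened Downloads"])
print(chosen.xy)
```

In a real setup the three callables would wrap model calls; the point here is only the ordering: summarize, propose, verify, then execute.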

Across three long-horizon GUI benchmarks with various environments (WebArenaLitev2 🌐, AndroidLab 📱, WindowsAgentArena 🖥️) on strong base policies (Qwen3-VL-32B-Thinking, Gemini-3-Flash), HiViG improves the average overall success rate by 5.8% and 9.0%, respectively, compared to the strongest critics, showing its effectiveness and generalization across diverse GUI platforms and policies! 💪

🧵👇

10:17 AM · Jun 10, 2026 · 971 Views

Posts from X

Jaewoo Lee@jwlee8877

Excited to share ✨HiViG✨, a test-time intervention framework for long-horizon GUI tasks via history state tracking and visually grounded error analysis.

1️⃣ History state tracking: HiViG summarizes past interactions into a compact macro-action history, enabling better history-aware planning of policies over long horizons.

2️⃣ Visually grounded error analysis: Instead of overly relying on the policy's textual intents, HiViG verifies raw execution coordinates against the current GUI env screenshot. If an action proposed by the policy is flawed (e.g., visual hallucination, termination misjudgment), it provides corrective guidance before execution.
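As a rough intuition for checking coordinates rather than stated intent, the sketch below hit-tests a click against known element boxes. This is a geometric stand-in under strong assumptions: HiViG's critic reasons over the screenshot itself, whereas the `elements` dictionary, `element_at`, and `check_click` here are hypothetical helpers.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def element_at(xy: Tuple[int, int], elements: Dict[str, Box]) -> Optional[str]:
    """Return the name of the first element whose box contains xy, if any."""
    x, y = xy
    for name, (left, top, right, bottom) in elements.items():
        if left <= x <= right and top <= y <= bottom:
            return name
    return None


def check_click(xy: Tuple[int, int], intended: str,
                elements: Dict[str, Box]) -> Tuple[bool, str]:
    """Compare where the click actually lands with what the policy says it does."""
    hit = element_at(xy, elements)
    if hit == intended:
        return True, ""
    landed = hit if hit is not None else "empty space"
    return False, f"Click at {xy} lands on {landed}, not '{intended}'."


elements = {"Save": (100, 200, 180, 230), "Cancel": (200, 200, 280, 230)}
print(check_click((240, 215), "Save", elements))
```

The intent says "Save" but the coordinates fall on "Cancel", which is the kind of text-versus-pixels mismatch an intent-only critic would approve.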

hyunji amy lee@hyunji_amy_lee

📄http://arxiv.org/abs/2606.11078 🧑‍💻http://github.com/G-JWLee/HiViG 🤗http://huggingface.co/papers/2606.11078

Work led by @jwlee8877 w/ @codezakh, @ArchikiPrasad, @cyjustinchen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, @EliasEskin, @mohitban47 @unccs @unc_ai_group

hyunji amy lee@hyunji_amy_lee

HiViG improves performance on challenging long-horizon GUI tasks:

➡️ HiViG improves performance on tasks that otherwise remain very difficult: increasing Qwen3-VL-32B-Thinking’s success rate on the WebArenaLitev2 Map category from 3.9% to 23.1%, and Gemini-3-Flash’s success rate on the WindowsAgentArena Office category from 4.7% to 23.3%.

➡️ These results show that HiViG is particularly beneficial for challenging tasks where policies and existing test-time interventions often fail. HiViG-critic guides policies to make decisions anchored in visual content and historical progress.
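For scale, the two category results quoted above work out to absolute gains of 19.2 and 18.6 percentage points, or roughly 5.9x and 5.0x the base success rate. A small check of that arithmetic (the `gain` helper is just illustrative):

```python
# (before, after) success rates in percent, as quoted in the post above.
results = {
    "Qwen3-VL-32B-Thinking, WebArenaLitev2 Map": (3.9, 23.1),
    "Gemini-3-Flash, WindowsAgentArena Office": (4.7, 23.3),
}


def gain(before: float, after: float):
    """Return (absolute gain in points, ratio after/before), rounded to 1 decimal."""
    return round(after - before, 1), round(after / before, 1)


for name, (before, after) in results.items():
    points, ratio = gain(before, after)
    print(f"{name}: +{points} points, {ratio}x")
```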

hyunji amy lee@hyunji_amy_lee

Existing test-time interventions are limited in long-horizon GUI tasks:

❌ Scalar reward models are uninformative when all candidates are poor.

❌ Verbal critics struggle to keep track of task progress.

❌ Verbal critics over-rely on textual intent, erroneously approving visually misaligned actions.

HiViG (History-aware Visually Grounded) test-time intervention addresses these by training a verbal critic (HiViG-critic) with two core abilities:

✅ History state tracking: provides a macro-action history that summarizes past interactions to date (e.g., “Successfully opened the ‘Downloads’ directory and confirmed the ‘SpecialProjects’ folder is empty”).

✅ Visually grounded error analysis: verifies the raw execution coordinates of actions against the actual visual state and provides corrective guidance before the action is executed.
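To make the idea of a compact macro-action history concrete, here is a minimal sketch that folds low-level steps into one entry per subgoal. In HiViG the summary is generated by the trained critic from visual state changes; the rule-based `update_history` function and the `(subgoal, outcome)` step format below are assumptions for illustration only.

```python
from typing import List


def update_history(history: List[str], subgoal: str, outcome: str) -> List[str]:
    """Return a new macro-action history after one more low-level step."""
    entry = f"{subgoal}: {outcome}"
    if history and history[-1].startswith(subgoal + ":"):
        # Same subgoal as the previous step: keep only the latest outcome.
        return history[:-1] + [entry]
    return history + [entry]


steps = [
    ("Open 'Downloads'", "clicked the Files icon"),
    ("Open 'Downloads'", "directory opened"),
    ("Check 'SpecialProjects'", "folder is empty"),
]

history: List[str] = []
for subgoal, outcome in steps:
    history = update_history(history, subgoal, outcome)

print(history)
```

Three low-level steps collapse into two macro-actions, so the summary grows with task progress rather than with the number of clicks.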

hyunji amy lee@hyunji_amy_lee

Our ablation studies reveal that:

➡️ Deploying either component of HiViG on its own (visually grounded error analysis or history state tracking) outperforms all baselines, and combining the two yields the highest performance, showing a strong synergistic effect.

➡️ Combining the two visual grounding strategies (intent masking & visual marker) outperforms using either alone: intent masking breaks text reliance and forces analysis of the screenshot, while the visual marker provides the spatial anchors needed to verify action execution.
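The two grounding strategies can be illustrated with plain Python. This is a sketch under assumptions: the action is a dictionary with hypothetical `intent` and `thought` text fields, the screenshot is a tiny grayscale grid, and the crosshair shape is an arbitrary choice, not necessarily the marker HiViG draws.

```python
from typing import Dict, List, Tuple


def mask_intent(action: Dict) -> Dict:
    """Drop free-text fields so a critic sees only what will be executed."""
    return {k: v for k, v in action.items() if k not in ("intent", "thought")}


def draw_marker(image: List[List[int]], xy: Tuple[int, int],
                radius: int = 1, value: int = 255) -> List[List[int]]:
    """Return a copy of a grayscale image with a crosshair centered at xy."""
    x, y = xy
    out = [row[:] for row in image]  # copy so the original stays untouched
    h, w = len(out), len(out[0])
    for d in range(-radius, radius + 1):
        if 0 <= y < h and 0 <= x + d < w:
            out[y][x + d] = value    # horizontal arm
        if 0 <= x < w and 0 <= y + d < h:
            out[y + d][x] = value    # vertical arm
    return out


action = {"type": "click", "xy": (2, 2), "intent": "click Save"}
image = [[0] * 5 for _ in range(5)]

masked = mask_intent(action)
marked = draw_marker(image, masked["xy"])
print(masked)
```

Masking removes the text the critic could lean on, and the marker shows it where the raw coordinates actually fall on the screen.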

hyunji amy lee@hyunji_amy_lee

We construct two distinct datasets to train HiViG-critic for two capabilities: history state tracking and visually grounded error analysis.

1️⃣ History state tracking: Iteratively translates visual state changes into a compact macro-action history to track long-term goal progress.

2️⃣ Visually grounded error analysis: To capture diverse failure modes with visually grounded rationales, rationale annotation follows three steps:

➡️ Step 1: extract ground-truth state transitions.

➡️ Step 2: synthesize plausible errors.

➡️ Step 3: generate a multi-stage rationale for error analysis that leverages a visual marker and the extracted state transitions.
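One way to picture Step 2 is to corrupt a ground-truth action while keeping its stated intent, so the text and the coordinates disagree. The thread does not describe how errors are actually synthesized; `synthesize_error`, the two error kinds, and the element boxes below are hypothetical and only loosely mirror the failure modes mentioned in this thread (visual hallucination, termination misjudgment).

```python
import random
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def synthesize_error(gt_action: Dict, elements: Dict[str, Box],
                     rng: random.Random) -> Dict:
    """Turn a correct click into a plausible but wrong action (illustrative)."""
    others = [name for name in elements if name != gt_action["target"]]
    kind = rng.choice(["wrong_target", "early_termination"])
    if kind == "early_termination" or not others:
        # The policy declares the task finished too soon.
        return {"type": "terminate", "label": "early_termination"}
    # Keep the original intent but move the click onto a different element.
    left, top, right, bottom = elements[rng.choice(others)]
    return {
        "type": "click",
        "xy": ((left + right) // 2, (top + bottom) // 2),
        "intent": gt_action["intent"],
        "label": "wrong_target",
    }


elements = {"Save": (100, 200, 180, 230), "Cancel": (200, 200, 280, 230)}
gt = {"type": "click", "target": "Save", "xy": (140, 215), "intent": "click Save"}
print(synthesize_error(gt, elements, random.Random(0)))
```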

hyunji amy lee@hyunji_amy_lee

Setup:

➡️ We use a generalized pixel-based action space (adopted from Qwen3) to navigate diverse GUIs (web 🌐, mobile 📱, desktop 🖥️).

➡️ All baselines and HiViG conduct test-time intervention: a pre-execution evaluation that intercepts the policy’s proposed action before it alters the environment.

➡️ The policy receives the feedback and either proceeds or refines its action.

HiViG outperforms existing test-time interventions across diverse GUI platforms and policies:

➡️ avg. +7.3% for the open-weight Qwen3-VL-32B-Thinking policy and avg. +9.0% for the frontier closed-source Gemini-3-Flash policy.

➡️ avg. +5.8% over the strongest test-time intervention baseline for the Qwen3-VL-32B policy.

Lack of visual grounding and history conditioning hurts other critics:

➡️ Unlike HiViG, baseline interventions often degrade frontier model performance: for Gemini-3-Flash they yield at best a 2.1% gain on WindowsAgentArena and at worst an 8.4% drop on WebArenaLitev2.

➡️ In contrast, applying HiViG-critic raises Gemini-3-Flash’s success rate on WebArenaLitev2 from 30.5% to 45.5%.
