EDITH Framework Lets Robots Interpret Human Nonverbal Signals in Real Time

Dongjun Lee @dongjunlie

To process this egocentric context, EDITH employs a hierarchical policy: a high-level policy decides what to do, and a low-level policy executes it.

Hierarchical policy (1/3): a high-level policy

The high-level VLM (Gemini-3.1-flash-lite) periodically monitors the stream of egocentric context and pushes subtasks onto a shared queue.

[4/n]
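
The high-level loop in [4/n] can be pictured roughly as follows. This is a minimal Python sketch, not the authors' code: `propose_subtasks` is a hypothetical stand-in for the VLM call, and the fixed monitoring period and the string observations are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


def monitor_stream(
    stream: Iterable[str],
    propose_subtasks: Callable[[List[str]], List[str]],
    queue: Deque[str],
    period: int = 3,
) -> None:
    """Every `period` observations, ask the high-level model for subtasks
    and append them to the shared queue.

    `propose_subtasks` is a hypothetical placeholder for the VLM call.
    """
    window: List[str] = []
    for observation in stream:
        window.append(observation)
        if len(window) == period:
            # Pass a copy so the callee never sees the window being cleared.
            queue.extend(propose_subtasks(list(window)))
            window.clear()
    # Observations left in a partial window are ignored in this sketch.


def fake_vlm(window: List[str]) -> List[str]:
    # Toy stand-in: emit one subtask named after the latest observation.
    return [f"handle:{window[-1]}"]


shared_queue: Deque[str] = deque()
monitor_stream(["a", "b", "c", "d", "e", "f", "g"], fake_vlm, shared_queue)
print(list(shared_queue))
```

Using a queue decouples the two levels: the monitor can keep adding subtasks while an executor works through earlier ones.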

Dongjun Lee @dongjunlie

This work was made possible by @meta_aria. We used Project Aria smart glasses to build our hardware system, streaming the wearer's egocentric RGB and eye gaze to the robot in real time.

For data collection, a human actor wearing Aria glasses and a robot teleoperator work together interactively: the human actor conveys intent through gaze and gestures while the teleoperator demonstrates the matching robot actions.

Huge thanks to @meta_aria for the research kit. 🙏

Dongjun Lee @dongjunlie

Motivation: Language-conditioned policies support only one human-robot interface: language. But plenty of tasks are hard or tedious to convey in words alone.

That's when people naturally add nonverbal signals (e.g., a glance, a pointing finger) alongside what they say.

[2/n]

Dongjun Lee @dongjunlie

Hierarchical policy (3/3): a low-level policy

A low-level VLA (fine-tuned π₀.₅) produces low-level actions conditioned on each keyframe-augmented subtask. Once a subtask is completed, the policy moves on to the next one in the subtask queue.

[6/n]
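
The queue-draining behavior in [6/n] could look something like the sketch below. It is an illustration under assumptions, not the released implementation: `step` is a hypothetical callback standing in for one control step of the low-level policy, and the per-subtask step budget is invented for the example.

```python
from collections import deque
from typing import Callable, Deque, List


def run_queue(
    subtasks: Deque[str],
    step: Callable[[str], bool],
    max_steps_per_subtask: int = 100,
) -> List[str]:
    """Execute subtasks in FIFO order and return the completed ones.

    `step(subtask)` is a hypothetical callback that performs one action
    and returns True once the subtask is complete.
    """
    completed: List[str] = []
    while subtasks:
        current = subtasks.popleft()
        for _ in range(max_steps_per_subtask):
            if step(current):
                completed.append(current)
                break
        else:
            # Step budget exhausted: put the subtask back and stop.
            subtasks.appendleft(current)
            break
    return completed


calls = {}


def toy_step(subtask: str) -> bool:
    # Toy stand-in: every subtask finishes on its second step.
    calls[subtask] = calls.get(subtask, 0) + 1
    return calls[subtask] >= 2


pending = deque(["pick up muffin", "serve muffin"])
print(run_queue(pending, toy_step))
```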

Dongjun Lee @dongjunlie

To make the robot policy aware of the user's nonverbal signals, we leverage the user's real-time first-person view and eye gaze as additional inputs to the robot policy.

To make this work, we built a hardware pipeline around Project Aria smart glasses that streams the wearer's egocentric RGB and gaze coordinates (i.e., egocentric context) to the policy in real time.

[3/n]
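
The thread does not say how RGB frames and gaze coordinates are paired before reaching the policy. Assuming both carry timestamps and may arrive at different rates, one simple option is nearest-timestamp matching. The Python sketch below shows that idea only; `GazeSample` and `nearest_gaze` are hypothetical names and do not come from the Project Aria tooling.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class GazeSample:
    timestamp: float          # seconds; the unit is an assumption
    xy: Tuple[float, float]   # gaze point in image coordinates (assumed)


def nearest_gaze(samples: Sequence[GazeSample], frame_time: float) -> GazeSample:
    """Return the gaze sample closest in time to `frame_time`.

    `samples` must be sorted by timestamp. On a tie, the earlier sample wins.
    """
    if not samples:
        raise ValueError("no gaze samples available")
    times = [s.timestamp for s in samples]
    i = bisect_left(times, frame_time)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = samples[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda s: abs(s.timestamp - frame_time))


gaze = [
    GazeSample(0.0, (10.0, 20.0)),
    GazeSample(0.1, (12.0, 21.0)),
    GazeSample(0.2, (30.0, 40.0)),
]
print(nearest_gaze(gaze, 0.14))
```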

Dongjun Lee @dongjunlie

People feel the difference. In a user study with 16 external participants, EDITH significantly reduced the effort people needed to convey their intent to the robot, compared to a language-only model (p < 0.001).

[9/n]

Dongjun Lee @dongjunlie

Hierarchical policy (2/3): keyframe as a subtask

Notably, we represent each subtask as a pair of a subtask instruction and a keyframe: a single frame retrieved from the egocentric stream at the moment the user's intent is most clearly expressed.

We adopt this because intent conveyed through nonverbal cues like pointing or gaze is hard to capture in language alone. The keyframe lets each subtask carry that nonverbal intent, which a purely linguistic instruction would fail to express.

[5/n]
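
As a data structure, the (instruction, keyframe) pair in [5/n] might be sketched like this in Python. How EDITH decides when intent is "most clearly expressed" is not described in the thread, so the per-frame `intent_score` below is a purely hypothetical placeholder for that decision, and all names are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Frame:
    timestamp: float
    gaze_xy: Tuple[float, float]
    intent_score: float  # hypothetical: higher means clearer intent


@dataclass(frozen=True)
class Subtask:
    instruction: str  # the linguistic part of the subtask
    keyframe: Frame   # the single frame carrying the nonverbal intent


def make_subtask(instruction: str, frames: Sequence[Frame]) -> Subtask:
    """Pair an instruction with the frame that has the highest intent score."""
    if not frames:
        raise ValueError("need at least one frame to pick a keyframe")
    keyframe = max(frames, key=lambda f: f.intent_score)
    return Subtask(instruction=instruction, keyframe=keyframe)


recent = [
    Frame(0.0, (100.0, 80.0), 0.2),
    Frame(0.5, (240.0, 130.0), 0.9),  # e.g. the user points at an object
    Frame(1.0, (300.0, 60.0), 0.4),
]
subtask = make_subtask("pick up the object the user indicates", recent)
print(subtask.keyframe.timestamp)
```

The instruction alone ("the object the user indicates") is ambiguous; the keyframe is what disambiguates it.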

Dongjun Lee @dongjunlie

EDITH is robust to messy, real human behavior. When the user gets distracted mid-instruction (glancing at a phone, looking away), naive policies latch onto the wrong cue. EDITH tracks the actual intent and holds performance, with a relative drop of merely 0.4% under distraction.

[8/n]

Dongjun Lee @dongjunlie

Across 3 long-horizon interactive tasks (Muffin-Serving, Tumbler-Sorting, Tool-Passing) that require understanding a human's natural nonverbal cues, EDITH achieves a 59.7% average success rate and 84.7% task progress, while language-only baselines stay below a 10% success rate.

[7/n]

Dongjun Lee @dongjunlie

📄 Paper: https://arxiv.org/abs/2606.10276
🌐 Project website: https://project-edith.github.io

Huge thanks to amazing collaborators @_JuheonChoi_ @SindongG18955 @sinjae_kang and adviser @kimin_le2 🙏.

[n/n]