VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
TL;DR Summary
VITA-E introduces a dual-model framework enabling simultaneous seeing, hearing, speaking, and acting, overcoming limitations of existing VLA models. This design supports efficient multitasking while handling real-time interruptions, enhancing interaction fluidity and responsiveness.
Abstract
Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture where two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a "model-as-controller" paradigm, where we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.
In-depth Reading
1. Bibliographic Information
1.1. Title
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
1.2. Authors
- Xiaoyu Liu (Nanjing University)
- Chaoyou Fu (Nanjing University) - Corresponding author
- Chi Yan (Tencent Youtu Lab)
- Chu Wu (Nanjing University)
- Haihan Gao (Tencent Youtu Lab)
- Yi-Fan Zhang (CASIA - Institute of Automation, Chinese Academy of Sciences)
- Shaoqi Dong (Nanjing University)
- Cheng Qian, Bin Luo, Xiuyong Yang, Guanwu Li, Yusheng Cai (Fourier Intelligence Inc.)
- Yunhang Shen, Deqiang Jiang, Xing Sun (Tencent Youtu Lab)
- Haoyu Cao (Tencent Youtu Lab) - Project leader
- Caifeng Shan (Nanjing University)
- Ran He (CASIA)
1.3. Journal/Conference
- Venue: arXiv (Preprint)
- Status: Published on arXiv on October 21, 2025.
- Context: While currently a preprint, the collaboration involves major research institutions (Nanjing University, CASIA) and industry labs (Tencent, Fourier Intelligence), indicating a high-profile study in the field of embodied AI and robotics.
1.4. Publication Year
2025
1.5. Abstract
This paper addresses the limitations of current Vision-Language-Action (VLA) models, which typically operate in a rigid, sequential manner—unable to multitask (see, hear, speak, and act simultaneously) or handle real-time interruptions. The authors propose VITA-E, a new framework featuring a dual-model architecture (Active and Standby models) and a "model-as-controller" paradigm. This allows a robot to concurrently process visual and auditory inputs, generate speech, and execute physical actions, while being interruptible in nearly real-time. Experiments on a Fourier GR2 humanoid robot demonstrate high success rates in emergency stops, speech interruptions, and concurrent tasks.
1.6. Original Source Link
2. Executive Summary
2.1. Background & Motivation
- The Problem: Traditional robots and early VLA models operate like a turn-based game. They receive a command, process it, and then execute it. During execution, they are often "deaf" and "blind" to new inputs. They cannot answer a question while moving an object ("concurrency") and cannot immediately stop or change plans if a user shouts "Stop!" ("interruptibility").
- Importance: For robots to become true collaborative assistants in human environments (like homes), they must behave naturally. Natural interaction implies doing multiple things at once (e.g., explaining a task while doing it) and reacting instantly to safety hazards or changing user intent.
- Gap: Existing methods either lock the robot into a behavior until it finishes or have high latency when switching tasks. They lack a unified framework for handling the high cognitive load of simultaneous perception, reasoning, and low-level control.
2.2. Main Contributions / Findings
- Dual-Model Architecture: The authors introduce a system inspired by the brain's hemispheres, utilizing two parallel VLA instances. One serves as the "Active Model" (focusing on the current task) and the other as the "Standby Model" (monitoring for new instructions or interruptions).
- Model-as-Controller Paradigm: Instead of using an external rule-based system to manage the robot, they fine-tune the Vision-Language Model (VLM) to generate special tokens (e.g., [ACT], [HALT]). These tokens act as direct system commands, tightly coupling the model's reasoning with the robot's hardware control.
- Experimental Success: On a physical humanoid robot, VITA-E achieved a 100% success rate for emergency stops and speech interruptions, and effectively demonstrated the ability to speak and act at the same time.
The following figure (Figure 1 from the original paper) illustrates the VITA-E system in a real-world scenario, where a robot concurrently sees, hears, speaks, and acts while handling interruptions:
The figure is a schematic of a robot interacting with a human in the VITA-E system. The robot is executing a task, holding a green container while a beverage can sits nearby, illustrating the concurrent interaction process.
The following figure (Figure 2 from the original paper) provides examples of the robot's responses to various instructions, showcasing its ability to handle complex scenarios like immediate stops ("HALT") or explaining while acting:
The figure illustrates VITA-E's responses and actions across various interaction scenarios. Driven by different voice commands, the robot performs complex behaviors such as fetching objects, stopping, and answering questions, demonstrating its multitasking and dynamic-interruption capabilities.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand VITA-E, one must grasp the following concepts:
- Vision-Language-Action (VLA) Models: These are AI systems that extend Large Language Models (LLMs). They take images (Vision) and text (Language) as input and output not just text, but also robot control commands (Action).
  - Analogy: Imagine a chatbot that can see through a camera and has hands to move things based on what you type or say.
- Vision-Language Model (VLM): The "brain" of the system. It processes visual and textual information to "understand" the scene (e.g., recognizing an apple on a table). Popular examples include GPT-4V or open-source models like LLaVA.
- Diffusion Policy (Action Expert): A specific type of machine learning model used for generating robot movements.
  - How it works: It learns from many examples of robot movements. To generate a movement, it starts with random noise and iteratively "denoises" it to form a smooth, precise trajectory for the robot's arms. In VITA-E, this is the "muscle memory" that executes the "brain's" commands. A minimal denoising sketch follows this list.
- Fine-tuning: The process of taking a pre-trained AI model (which knows general things) and training it further on a specific dataset to learn a new skill—in this case, learning to act as a robot controller and output special command tokens.
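To make the diffusion-policy idea more concrete, here is a minimal denoising sketch in Python. It assumes a hypothetical `noise_predictor` network and uses a simplified update rule rather than an exact DDPM/DDIM step; it illustrates the concept, not the paper's implementation.

```python
import torch

def denoise_action_trajectory(noise_predictor, obs_embedding,
                              horizon=16, action_dim=26, steps=10):
    """Start from Gaussian noise and iteratively refine it into an action chunk.
    `noise_predictor(actions, t, obs_embedding)` is a hypothetical network that
    predicts the noise in the current trajectory."""
    actions = torch.randn(1, horizon, action_dim)        # pure noise trajectory
    for k in reversed(range(steps)):                     # coarse-to-fine refinement
        t = torch.full((1,), k, dtype=torch.long)
        predicted_noise = noise_predictor(actions, t, obs_embedding)
        actions = actions - predicted_noise / steps      # simplified noise-removal step
    return actions                                       # smooth joint-space trajectory
```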
3.2. Previous Works
- End-to-End VLAs (RT-2, OpenVLA):
These models try to do everything in one network: input image/text -> output motor commands directly.
- Limitation: They often struggle with complex reasoning because the massive effort to learn motor control dilutes their language/reasoning abilities. They are also typically slow and rigid.
- Dual-System Architectures (π0, GR00T):
These split the problem. A "System-2" (VLM) does the high-level thinking ("Pick up the red cup"), and a "System-1" (Action Expert) does the low-level movement.
- Relevance: VITA-E builds on this architecture. It uses a VLM for reasoning and a Diffusion model for action.
- VITA (Previous Version): The authors' previous work focused on a voice interaction system that could handle interruptions (full-duplex voice). VITA-E extends this from just "voice" to "voice + physical action."
3.3. Differentiation Analysis
- Concurrency: Most prior systems (like SayCan or RT-H) are sequential: Plan -> Act. They cannot speak and act simultaneously. VITA-E uses two parallel models to achieve this.
- Interruptibility: In standard models, once an action starts (e.g., a 10-second movement), the robot ignores inputs until finished. VITA-E's "Standby Model" constantly listens, allowing for instant stops or task switches.
4. Methodology
4.1. Principles
The core design philosophy mimics the cooperative mechanism of the human brain's hemispheres.
- Active Hemisphere: Focuses on the task at hand (e.g., moving an arm).
- Standby Hemisphere: Maintains situational awareness, listening for new auditory or visual cues. This parallel processing allows the system to break the "listen-then-act" bottleneck.
4.2. Architecture Overview
The system follows a Dual-System VLA design:
- VLM (System-2): Handles high-level reasoning and dialogue.
- Action Expert (System-1): Handles low-level motor control.
Crucially, VITA-E instantiates two copies of this VLA system running in parallel, orchestrated by a server-client framework.
The following figure (Figure 3 from the original paper) details the logical architecture. It shows how the VLM acts as a controller, transitioning between "Hearing" and "Action" states using special tokens:
The figure is a schematic of the Active Model's logical structure in the VITA-E framework, showing the model's workflow in the "Hearing" and "Action" states. By generating special tokens such as [ACT] and [INST], the model processes user input and executes the corresponding action instructions.
4.3. The Model-as-Controller Paradigm
Instead of hard-coding rules for when to stop or speak, the VLM itself decides the system state by generating special tokens.
4.3.1. Problem Formulation
At each timestep $t$, the system receives:
- $o_t$: Visual input (images).
- $s_t$: Robot's proprioceptive state (joint positions).
- $l_t$: User's natural language instruction.

The objective is to produce a tuple $(c_t, r_t, g_t)$, where:
- $c_t$: System behavior control (the special token).
- $r_t$: Verbal response to the user.
- $g_t$: The semantic goal/instruction for the Action Expert (e.g., "pick up the cup").

The VLM learns a policy $\pi_{\text{VLM}}$ to map inputs to this tuple:

$$(c_t, r_t, g_t) = \pi_{\text{VLM}}(o_t, s_t, l_t)$$
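As a concrete illustration of this formulation, the following sketch (hypothetical types, not the authors' code) represents the output tuple $(c_t, r_t, g_t)$ as a small data structure; the fine-tuned VLM is what actually realizes the mapping.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PolicyOutput:
    control_token: str           # c_t: system behavior control, e.g. "[ACT]" or "[HALT]"
    verbal_response: str         # r_t: what the robot says to the user
    action_goal: Optional[str]   # g_t: instruction for the Action Expert, or None

def vlm_policy(visual_obs, robot_state, user_instruction) -> PolicyOutput:
    """Placeholder for the learned mapping (o_t, s_t, l_t) -> (c_t, r_t, g_t);
    in VITA-E this mapping is produced by the fine-tuned VLM."""
    raise NotImplementedError
```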
4.3.2. Special Tokens (The Control Language)
The VLM is fine-tuned to output specific tokens that trigger system-level events.
The following table (Table 1 from the original paper) lists these tokens and their functions:
| Token | Description | Example Model Output |
|---|---|---|
| [RES] | Signals a voice-only response. Generated as the first token for conversational replies. | [RES] I see an apple on the table. |
| [ACT] | Signals that the response includes a physical action. Generated as the first token to enter action mode. | [ACT] Okay, I will put the toy in the box. [INST] Pick up toy and place in box. |
| [INST] | Delimits the spoken part of an action response from the internal action instruction that follows. | (See example above) |
| [HALT] | Commands an immediate stop of the current action. Generated as the first token for emergency stops. | [HALT] Stopping immediately. |
| [END] | Signals that a multi-step action sequence has been successfully completed. | [END] The action is finished. |
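To illustrate the model-as-controller coupling, the sketch below shows one plausible way a raw generation beginning with these tokens could be parsed and routed to system-level behaviors. The `system` interface (speak, start_action, emergency_stop, mark_action_complete) is assumed for illustration and is not part of any released code.

```python
def route_output(text: str, system) -> None:
    """Dispatch a model generation based on its leading special token."""
    token, _, rest = text.strip().partition(" ")           # leading special token
    if token == "[HALT]":
        system.emergency_stop()                            # immediate hardware-level stop
    elif token == "[ACT]":
        speech, _, instruction = rest.partition("[INST]")  # split spoken part / action goal
        system.speak(speech.strip())
        system.start_action(instruction.strip())           # hand off to the Action Expert
    elif token == "[RES]":
        system.speak(rest)                                 # voice-only reply
    elif token == "[END]":
        system.mark_action_complete()
```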
How Training Data is Created: To teach the VLM this behavior, the authors process video trajectories:
- Conversational Data: Prepend [RES] to Q&A pairs.
- Action Data: Prepend [ACT], add a spoken confirmation ("Okay..."), then [INST], then the technical instruction.
- Interruption Data: Synthetically inject "Stop!" into an action trajectory and set the target output to [HALT].
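A hypothetical sketch of how such supervised targets could be assembled from raw annotations, following the three recipes above (the exact data pipeline is not specified at this level of detail):

```python
def make_conversational_target(answer: str) -> str:
    """Voice-only reply: prepend [RES]."""
    return f"[RES] {answer}"

def make_action_target(confirmation: str, instruction: str) -> str:
    """Action reply: spoken confirmation, then the internal instruction after [INST]."""
    return f"[ACT] {confirmation} [INST] {instruction}"

def make_interruption_target() -> str:
    """Target for a synthetically injected 'Stop!' during an action trajectory."""
    return "[HALT] Stopping immediately."
```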
4.3.3. Action Expert for Motor Control
The Action Expert is a Diffusion Transformer, denoted $\pi_{\text{DiT}}$. It translates the high-level intent into motor movements.
The Process:
- The VLM processes the image $o_t$ and the robot command $g_t$ (the text after [INST]).
- It extracts the hidden states $h_t$ from this forward pass: $h_t = \text{VLM}(o_t, g_t)$.
- The Action Expert takes these hidden states and the robot's current physical state $s_t$ to generate the action chunk $A_t$ (a sequence of future joint positions): $A_t = \pi_{\text{DiT}}(h_t, s_t)$.
The Action Expert is pre-trained on large-scale data and fine-tuned (specifically the projection head) on the target robot's data to adapt to its specific hardware.
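The two-stage inference path can be summarized in a short sketch. The `vlm.encode` and `action_expert` calls are assumed interfaces standing in for the frozen VLM and the Diffusion Transformer head:

```python
import torch

@torch.no_grad()
def generate_action_chunk(vlm, action_expert, image, goal_text, robot_state):
    """Two-stage inference: VLM conditioning features -> diffusion action chunk."""
    hidden_states = vlm.encode(image, goal_text)               # h_t from the frozen VLM
    action_chunk = action_expert(hidden_states, robot_state)   # A_t: future joint targets
    return action_chunk                                        # e.g. a (horizon, 26) tensor
```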
4.4. Dynamic Interaction via Dual-Model Architecture
This is the mechanism that enables "Concurrent Seeing, Hearing, Speaking, and Acting."
The Setup:
- Model I (Active): Currently executing a task or speaking.
- Model II (Standby): Observing inputs.
Synchronization: The models use semaphores (signals) to coordinate. The Standby model has the authority to interrupt the Active model.
The following figure (Figure 4 from the original paper) illustrates the four primary interaction modes enabled by this architecture:

Detailed Logic of Interaction Modes:
- Concurrency (Fig 4a):
- Scenario: Robot is moving (Active Model I is busy). User asks "What color is the cup?"
- Process: Standby Model II hears the voice request. It checks Model I's state. Seeing Model I is busy with an action (which is "protected" from trivial interruptions), Model II generates the voice response itself.
- Result: Robot moves and speaks simultaneously.
- Voice Interruption (Fig 4b):
- Scenario: Robot is speaking a long sentence (Active Model I). User says "Wait, hold on."
- Process: Standby Model II processes the new input. It triggers a Preempt signal.
- Result: Model I stops speaking immediately. Model II processes the new request.
- Action Switching (Fig 4c):
- Scenario: Robot is picking up an apple (Task A). User says "No, pick up the pear."
- Process: Standby Model II interprets the new command. It issues a Preempt signal to Model I, halting Task A. Model II then becomes the new Active Model and starts Task B (pick pear).
- Safety: Before starting Task B, the robot executes a "retraction" mechanism (inverse movements) to return to a safe neutral pose.
- Emergency Stop (Fig 4d):
- Scenario: Robot is moving. User screams "Stop!"
- Process: Standby Model II processes this with highest priority. It generates the [HALT] token. This triggers an immediate hardware-level stop command to the robot client.
- Result: Instant cessation of all motion.
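The coordination logic can be sketched with in-process threads and a single preempt flag. This is a toy approximation under stated assumptions (the actual system uses a server-client framework with semaphores and two full VLA instances), but it captures the control flow: the Standby model listens continuously, answers simple queries itself, and raises the preempt signal on [HALT].

```python
import threading

preempt = threading.Event()            # the Standby model's authority to interrupt

def active_loop(action_steps, execute):
    """Active model: run the current action chunk step by step,
    yielding to a preempt signal between steps."""
    for step in action_steps:
        if preempt.is_set():           # interrupted by the Standby model
            break
        execute(step)                  # send one chunk of joint targets to the robot

def standby_loop(vlm_generate, listen, speak):
    """Standby model: keep listening while the Active model is busy."""
    while True:
        user_text = listen()           # blocking speech-recognition result
        output = vlm_generate(user_text)
        if output.startswith("[HALT]"):
            preempt.set()              # emergency stop / task switch
        elif output.startswith("[RES]"):
            speak(output)              # concurrency: answer while the robot keeps acting
```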
5. Experimental Setup
5.1. Datasets
The experiments used both simulation and real-world data:
- Libero Benchmark (Simulation):
- Description: A standard suite for evaluating robot learning, consisting of tasks like organizing objects or opening drawers.
- Subsets: Libero-90 (pre-training) and Libero-10 (fine-tuning on specific task suites: Spatial, Object, Goal, Long).
- Real-World Data:
- Platform: Fourier GR2 Humanoid Robot (RGB camera on head).
- Tasks:
- Pick up can: 300 demonstrations.
- Pick and place toy: 300 demonstrations.
- Data Collection: Teleoperation at 20Hz, recording 26 degrees of freedom (joints).
- Synthetic Vision-Language Data:
- Used to fine-tune the VLM for the "Model-as-Controller" behavior.
- Sources: ActionNet, Libero, and self-collected data, augmented with the special tokens ([ACT], [HALT], etc.).
5.2. Evaluation Metrics
The primary metric is Success Rate (SR).
- Conceptual Definition: The percentage of attempts where the robot successfully completes the assigned task or interaction goal.
- Mathematical Formula:
  $$\text{SR} = \frac{N_{\text{success}}}{N_{\text{total}}} \times 100\%$$
- Symbol Explanation:
  - $N_{\text{success}}$: The number of trials where the task (e.g., picking up the can, or stopping immediately upon command) was fully completed according to the criteria.
  - $N_{\text{total}}$: The total number of trials conducted (typically 30 per task in this paper).
For Concurrency, they also measured Latency (time to response).
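A trivial helper matching the Success Rate definition above; for example, 28 successes over 30 trials gives 93.3%, consistent with the reported task-switching rate.

```python
def success_rate(successes: int, trials: int) -> float:
    """SR = successes / trials, expressed as a percentage."""
    return 100.0 * successes / trials

print(f"{success_rate(28, 30):.1f}%")   # prints "93.3%"
```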
5.3. Baselines
VITA-E was compared against state-of-the-art models:
- GR00T: A generalist humanoid model (similar dual architecture but without the interactive dual-instance framework).
- π0 (Pi-Zero): A flow-matching based VLA policy.
- Diffusion Policy: A pure action-generation model (no VLM reasoning).
- SmolVLA: A lightweight VLA model.
- VITA-1.5: The base VLM model used within VITA-E (used in ablation studies to show the benefit of fine-tuning).
6. Results & Analysis
6.1. Fundamental Manipulation Tasks
These experiments test if the robot can simply move and manipulate objects correctly, ignoring the fancy interaction features for a moment.
6.1.1. Simulation (Libero)
The following chart (Figure 5 from the original paper) compares VITA-E with GR00T:
The chart compares the success rates of VITA-E and GR00T on the Libero benchmark. Although VITA-E's success rate falls short of the structurally similar baseline, the goal of this study is to demonstrate that the model can complete embodied tasks and to provide quantitative metrics.
- Analysis: VITA-E performs decently but generally worse than GR00T.
- Reason: The authors explain that GR00T is trained "end-to-end" on massive datasets, updating the vision encoder. VITA-E keeps the VLM frozen and only trains the action projector to save resources. The goal here is not to beat GR00T at manipulation, but to be "good enough" to test interaction.
6.1.2. Real Robot
The following chart (Figure 6 from the original paper) shows success rates on real tasks:
The chart compares the success rates of VITA-E and the baseline models on two fundamental manipulation tasks: "Pick up can" on the left and "Pick and place toy" on the right. The data come from 30 evaluation trials per task.
- Results:
- Pick up can: VITA-E achieves ~90% success, comparable to π0 and GR00T.
- Pick and place toy: VITA-E achieves ~90% success.
- Conclusion: VITA-E is a competent manipulator in the real world, providing a solid foundation for adding interaction layers.
6.2. Interactive Tasks (Core Results)
This is the most critical section, validating the paper's main contributions.
Qualitative Result (Concurrency): The robot successfully answered questions (e.g., about object color) while picking up objects. The average latency for voice response was 2.26 seconds, which is acceptable for natural interaction.
Quantitative Results (Interruption & Safety): The following are the results from Table 2 of the original paper:
| Interactive Task | Speech Interruption | Task Switching | Emergency Stop |
|---|---|---|---|
| Success Rate (SR) | 100% | 93.3% | 100% |
- Speech Interruption (100%): The dual-model system never failed to stop speaking when interrupted.
- Emergency Stop (100%): The robot instantly halted motion every time "Stop!" was commanded. This proves the robustness of the [HALT] token generation by the Standby model.
- Task Switching (93.3%): High success rate. The few failures were due to the VLM misinterpreting the intent of the new command (thinking it was just conversation), not a failure of the switching mechanism itself.
6.3. Ablation Studies
The authors tested whether fine-tuning the VLM with special tokens was actually necessary. They compared the fine-tuned VITA-E VLM against the base VITA-1.5 model.
The following are the results from Table 3 of the original paper:
| Model | Cannot Execute | Exec. Inst. 1 | Exec. Inst. 2 | Emergency Stop | Task Completed |
|---|---|---|---|---|---|
| VITA-1.5 (Base) | 75% | 10% | 5% | 0% | 15% |
| VITA-E VLM (Ours) | 90% | 95% | 95% | 100% | 60% |
- Analysis:
- Emergency Stop: The base model scored 0%. It doesn't know it's a robot and doesn't know how to "stop" via a token. The fine-tuned model scored 100%.
- Execution: The base model often refused to act ("I cannot interact with the physical world"). The fine-tuned model correctly generated instructions 95% of the time.
- Takeaway: The "Model-as-Controller" fine-tuning is essential. It transforms a chatbot into a robot controller.
7. Conclusion & Reflections
7.1. Conclusion Summary
VITA-E successfully addresses the rigidity of traditional VLA models. By employing a dual-model architecture (mimicking brain hemispheres) and a special token control system, it enables a robot to:
- Multitask: See, hear, speak, and act simultaneously.
- Interact Naturally: Handle interruptions and emergency stops in near real-time.
- Perform Reliably: Achieve 100% success in safety-critical stop commands on a physical humanoid platform.
7.2. Limitations & Future Work
- Computational Cost: Running two full VLA instances (Active + Standby) requires significant GPU resources. This is a trade-off for the advanced interaction capabilities.
- VLM Errors: Task switching sometimes fails if the VLM misinterprets the user's intent. The authors suggest training on more diverse embodied scenarios to fix this.
- Retraction Mechanism: Currently, when switching tasks, the robot simply "undoes" its movement to a neutral pose. Future work aims to make the transition between different movements smoother and more efficient.
- Long-Horizon Tasks: The current system handles atomic tasks well; future research will explore chaining these into complex, multi-step behaviors.
7.3. Personal Insights & Critique
- Innovation: The "Dual-Model" approach is a clever engineering solution to the latency problem. Instead of trying to make one model faster, they simply added a second "listener." This is computationally expensive but highly effective for user experience.
- Practicality: The use of special tokens (
[HALT],[ACT]) is a pragmatic way to bridge the gap between LLMs and control systems. It effectively "grounds" the LLM's reasoning into actionable system states. - Critique: The paper relies heavily on the "Standby Model" correctly interpreting when to interrupt. If the Standby Model is slow or hallucinates, the safety mechanism could fail (though experiments showed 100% success, the sample size of 30 is relatively small for safety certification). Additionally, the hardware requirements (likely multi-GPU) might limit deployment on mobile robots with limited onboard power.
- Relevance: As humanoid robots move from labs to homes, the ability to say "Stop!" and have the robot freeze instantly is not just a feature—it's a mandatory safety requirement. VITA-E provides a concrete blueprint for implementing this.