
Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

Published: 05/21/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study introduces the AGNOSTOS benchmark for evaluating cross-task zero-shot generalization in Vision-Language-Action models and proposes the X-ICM method, which enhances action sequence prediction for unseen tasks, showing significant performance improvements.

Abstract

The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

1.2. Authors

  • Jiaming Zhou, Ke Ye, Teli Ma, Zian Wang, Ronghe Qiu, Junwei Liang: The Hong Kong University of Science and Technology (Guangzhou)
  • Jiayi Liu: Affiliation not explicitly marked in the author list; likely associated with the main group.
  • Kun-Yu Lin: The University of Hong Kong
  • Zhilin Zhao: Sun Yat-sen University
  • Junwei Liang: Also affiliated with The Hong Kong University of Science and Technology.

1.3. Journal/Conference

Published at (UTC): 2025-05-21. Status: The paper is currently available as a preprint on arXiv. While arXiv is a repository for preprints and not a peer-reviewed journal or conference proceeding itself, it is the standard venue for disseminating the latest research in AI and robotics. The date suggests it is a very recent contribution to the field.

1.4. Publication Year

2025

1.5. Abstract

This paper addresses the critical challenge of cross-task generalization in robotic manipulation using Vision-Language-Action (VLA) models. The authors identify that while current models handle visual variations within known tasks well, they struggle with entirely new tasks (unseen combinations of objects and motions).

  • Contribution 1: AGNOSTOS Benchmark. A new simulation benchmark based on RLBench containing 23 unseen tasks divided into two difficulty levels to rigorously test zero-shot generalization.
  • Finding: Existing state-of-the-art VLA models fail significantly on these unseen tasks.
  • Contribution 2: X-ICM Method. A novel approach called Cross-Task In-Context Manipulation. It uses Large Language Models (LLMs) to predict actions for unseen tasks by learning from "in-context" demonstrations of seen tasks.
  • Contribution 3: Dynamics-Guided Selection. A strategy to select the most relevant demonstrations for a new task by analyzing the dynamics (how the environment changes) rather than just visual similarity.
  • Result: X-ICM achieves a 6.0% improvement over $\pi_0$ and 7.9% over VoxPoser on the AGNOSTOS benchmark.

2. Executive Summary

2.1. Background & Motivation

  • The Core Problem: The ultimate goal of robotics is to build general-purpose robots that can operate in "open-world" settings. This means a robot should be able to perform a task it has never seen before, simply by understanding the instruction and the environment. This is known as zero-shot cross-task generalization.
  • Current Limitations: Existing Vision-Language-Action (VLA) models (models that take images and text as input and output robot actions) are trained on massive datasets. However, evaluations typically focus on within-task generalization—e.g., picking up a red cup instead of a blue cup (visual variation) but still performing the "pick up cup" task. The authors argue that the community has not sufficiently explored the harder problem: generalizing to a completely new task, like "opening a microwave," when the robot has only learned to "open a drawer."
  • The Gap: There is a lack of rigorous benchmarks that specifically target this "unseen task" scenario. Most evaluations are either non-reproducible real-world tests or simulated benchmarks that focus on visual robustness rather than task novelty.

2.2. Main Contributions & Findings

  • AGNOSTOS Benchmark: The authors introduce a reproducible benchmark in the RLBench simulator. It explicitly separates training tasks from testing tasks.
    • Training: 18 common tasks.
    • Testing: 23 entirely unseen tasks, categorized into Level-1 (partial semantic overlap, e.g., similar objects) and Level-2 (completely novel scenarios).
  • Evaluation of Existing Models: They benchmarked three families of models: Foundation VLAs (like OpenVLA, $\pi_0$), Human-video pretrained models (like R3M), and In-domain models (like PerAct).
    • Finding: None of them generalize effectively. Most fail completely on the Level-2 tasks.
  • X-ICM (Cross-Task In-Context Manipulation): To solve this, they propose a method that leverages the reasoning power of Large Language Models (LLMs). Instead of training a policy to output actions directly from pixels for every possible task, they feed the LLM text descriptions of "seen" task demonstrations. The LLM then uses In-Context Learning (ICL) to infer the action sequence for the new "unseen" task.
  • Dynamics-Guided Sample Selection: A key innovation is how they choose which seen examples to show the LLM. They use a diffusion model to learn the "dynamics" (cause-and-effect) of tasks and select examples that share similar physical dynamics with the new task, even if they look different visually.
  • Performance: The proposed X-ICM method significantly outperforms existing baselines, achieving success on tasks where other models score near zero.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Vision-Language-Action (VLA) Models: These are AI models designed for robotics. They accept Visual inputs (camera images from the robot's perspective) and Language inputs (instructions like "pick up the apple"). They output Actions (commands to the robot's motors, such as moving the arm or closing the gripper). Think of them as the "brain" connecting eyes and ears to hands.

  • Zero-Shot Generalization: This refers to a model's ability to perform a task it has never been explicitly trained on.

    • Within-task: Handling a new object color or position for a known task.
    • Cross-task: Performing a completely different activity (e.g., training on "stacking blocks" and testing on "pouring water"). This paper focuses on the latter, which is much harder.
  • In-Context Learning (ICL): A capability observed in Large Language Models (LLMs) like GPT-4. If you provide the model with a few examples (the "context") in the prompt—e.g., "Input: A, Output: B; Input: C, Output: D"—and then give it a new input "Input: E", the model can often predict the correct "Output: F" without updating its weights.

    • In this paper, the "Input" is the state of the robot/world, and the "Output" is the robot's action.
  • Diffusion Models (for Dynamics): Diffusion models are generative models (famous for generating images like Stable Diffusion). In this paper, they are used for Dynamics Modeling. "Dynamics" refers to the physics of how the world changes: if I push a cup, it moves. A dynamics model predicts the future state of the world given the current state and an action/instruction.

3.2. Previous Works

The authors position their work against three categories of existing approaches:

  1. Foundation VLA Models: Models trained on massive real-world robot datasets (e.g., Open X-Embodiment). Examples include OpenVLA [7], $\pi_0$ [1], and VoxPoser [2].
    • Limitation: They often struggle with the specific domain gap when applied to simulation benchmarks or fail to generalize to semantically distinct tasks despite their large training data.
  2. Human-Video Pre-trained Models: Models like R3M [37] utilize vast amounts of human videos (e.g., YouTube cooking videos) to learn visual representations, hoping this knowledge transfers to robots.
    • Limitation: There is a "domain gap" between watching a human hand and controlling a robot arm.
  3. In-Domain Models: Models like PerAct [34] and RVT [35] are trained specifically on the simulator data (RLBench).
    • Limitation: They are excellent at the tasks they are trained on but are typically specialists, not generalists. They often fail when the test task differs significantly from the training set.

3.3. Differentiation Analysis

  • Benchmark Gap: Previous benchmarks (like Colosseum [31] or GemBench [32]) focused heavily on visual perturbations (lighting, texture) within the same tasks. AGNOSTOS is distinct because it focuses on cross-task novelty, specifically categorizing tasks by semantic distance (Level 1 vs Level 2).
  • Methodological Innovation: While recent works have applied In-Context Learning to robotics (e.g., Instant Policy [74]), they mostly do so for within-task imitation (showing a demo of "stacking" to do "stacking"). X-ICM is the first to rigorously apply and evaluate this for cross-task scenarios (showing a demo of "opening a drawer" to help with "opening a grill"), supported by a novel dynamics-based retrieval mechanism.

4. Methodology

4.1. Principles

The core principle of X-ICM (Cross-Task In-Context Manipulation) is to bridge the gap between a known "seen" task and an unknown "unseen" task using the reasoning capabilities of an LLM. The intuition is that different tasks often share underlying dynamic structures. For example, "opening a drawer" and "opening a grill" involve similar pulling motions and state changes, even if the objects look different. If we can identify a seen task that is dynamically similar to the unseen task, we can show the LLM how to solve the seen task, and the LLM can abstract that logic to solve the unseen one.

The following figure (Figure 2 from the original paper) provides a high-level overview of the X-ICM framework. It illustrates the two main stages: identifying relevant examples (top) and using them to prompt the LLM for action prediction (bottom).

Figure 2: X-ICM method overview. X-ICM employs a dynamics-guided sample selection module to retrieve effective demonstrations from seen tasks for each tested unseen task, combining feature similarity with object extraction. These demonstrations are then used by the cross-task in-context prediction module, which prompts the LLM to predict the action sequence for the unseen task.

4.2. Core Methodology In-depth (Layer by Layer)

The method consists of two main modules:

  1. Dynamics-Guided Sample Selection: Finding the best "teacher" examples.
  2. Cross-Task In-Context Prediction: converting visual/action data into text prompts for the LLM.

4.2.1. Module 1: Dynamics-Guided Sample Selection

Step 1: Dynamics Modeling with Diffusion. To judge if two tasks are similar, looking at static images isn't enough. We need to know "what happens" in the task. The authors train a Dynamics Diffusion Model, denoted as $\mathcal{G}$.

  • Input: Initial visual observation $v_{i,1}^s$ (image at time $t=1$) and language description $L_i^s$ (e.g., "open the drawer").

  • Target: The final visual observation $v_{i,T}^s$ (image at time $t=T$, showing the completed task).

  • Goal: The model learns to predict the future (completed) state from the initial state.

    The model $\mathcal{G}$ is optimized using the standard diffusion training objective. The formula is:

$$\min_{\mathcal{G}} \; \mathbb{E}_{i,\, z,\, \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon - \epsilon_{\mathcal{G}}\big( v_{i,T,z}^{s},\, z,\, v_{i,1}^{s},\, L_{i}^{s} \big) \right\|_{2}^{2} \right]$$

Symbol Explanation:

  • $\mathbb{E}$: Expectation (average over the dataset).
  • $i$: Index of the training demonstration.
  • $z$: The diffusion timestep (indicating how much noise is added).
  • $\epsilon$: The actual random noise added to the image (sampled from a normal distribution $\mathcal{N}(0, I)$).
  • $v_{i,T,z}^s$: The "noisy" version of the final image $v_{i,T}^s$ at timestep $z$.
  • $\epsilon_{\mathcal{G}}$: The neural network (noise predictor) within model $\mathcal{G}$. It tries to guess what noise was added.
  • $v_{i,1}^s, L_i^s$: Conditions provided to the network (initial image and text instruction).
  • $\| \cdot \|_2^2$: The squared L2 norm (Euclidean distance), measuring the error between the actual noise and the predicted noise.
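To make this objective concrete, here is a minimal PyTorch-style sketch of one training step, assuming a generic conditional noise-prediction network; the names (`diffusion_loss`, `eps_net`, `alphas_cumprod`) are illustrative placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_net, v_final, v_init, lang_emb, alphas_cumprod):
    """One step of the standard denoising objective used to train G.

    v_final:  batch of final observations v_{i,T}   (B, C, H, W)
    v_init:   batch of initial observations v_{i,1} (B, C, H, W)
    lang_emb: batch of language embeddings of L_i   (B, D)
    """
    b = v_final.shape[0]
    z = torch.randint(0, len(alphas_cumprod), (b,), device=v_final.device)  # diffusion timesteps
    eps = torch.randn_like(v_final)                                         # true injected noise
    a_bar = alphas_cumprod[z].view(b, 1, 1, 1)
    # Forward diffusion: corrupt v_{i,T} into the noisy v_{i,T,z}
    v_noisy = a_bar.sqrt() * v_final + (1.0 - a_bar).sqrt() * eps
    # The network predicts the injected noise, conditioned on the initial frame and language
    eps_pred = eps_net(v_noisy, z, v_init, lang_emb)
    return F.mse_loss(eps_pred, eps)  # || eps - eps_G(...) ||_2^2
```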

Step 2: Feature Extraction. Once trained, the model $\mathcal{G}$ "understands" the dynamics. We use it to extract a rich feature vector $f_i^s$ for every seen demonstration. This vector combines the visual dynamics and the language semantics.

$$f_i^s = \big[ f_i^{s,\mathrm{vis}},\, f_i^{s,\mathrm{lang}} \big] = \mathcal{G}\big( v_{i,1}^{s}, L_i^{s} \big) \in \mathbb{R}^{2 \times 1024}$$

Symbol Explanation:

  • $f_i^s$: The combined feature vector for the $i$-th seen demonstration.
  • $f_i^{s,\mathrm{vis}}$: The visual dynamics feature extracted from the diffusion model's internal representation (size 1024).
  • $f_i^{s,\mathrm{lang}}$: The textual feature of the language description (size 1024).
  • The output is a concatenation of these two, resulting in a vector of size $2 \times 1024$.

Step 3: Similarity Calculation and Retrieval. When confronted with a new unseen task (with initial view $v^u$ and text $L^u$), we compute its feature vector $f^u$ using the same model $\mathcal{G}$. Then, we compare this unseen task against all stored seen demonstrations to find the "Top-K" most similar ones. The metric used is cosine similarity.

$$R_i = \frac{f^u \cdot f_i^s}{\| f^u \| \, \| f_i^s \|}, \quad i \in \{1, \ldots, N\}, \qquad \mathcal{T}_{idx} = \underset{i \in \{1, \ldots, N\}}{\operatorname{arg\,topk}^{K}} (R_i)$$

Symbol Explanation:

  • $R_i$: The similarity score between the unseen task and the $i$-th seen demonstration.
  • $f^u \cdot f_i^s$: The dot product of the two feature vectors.
  • $\|f\|$: The magnitude (norm) of a vector. Dividing by the magnitudes normalizes the vectors, so we only compare their directions (angles).
  • $\mathcal{T}_{idx}$: The set of indices of the $K$ demonstrations with the highest scores. These are the "winners" that will be used in the prompt.
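The retrieval step itself is simple once the features are available. Below is a minimal sketch, assuming the concatenated visual-dynamics and language features from $\mathcal{G}$ have already been flattened into vectors; the function name `select_demos` and the array shapes are assumptions.

```python
import numpy as np

def select_demos(f_unseen: np.ndarray, f_seen: np.ndarray, k: int = 18):
    """Dynamics-guided retrieval: rank seen demos by cosine similarity.

    f_unseen: (2*1024,) flattened [visual dynamics, language] feature of the
              unseen task, extracted from the trained diffusion model G.
    f_seen:   (N, 2*1024) features of the N stored seen demonstrations.
    Returns the indices of the top-K most similar demos (highest first).
    """
    f_u = f_unseen / np.linalg.norm(f_unseen)
    f_s = f_seen / np.linalg.norm(f_seen, axis=1, keepdims=True)
    scores = f_s @ f_u                      # cosine similarities R_i
    top_idx = np.argsort(-scores)[:k]       # arg-topk over the N demos
    return top_idx, scores[top_idx]
```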

4.2.2. Module 2: Cross-Task In-Context Prediction

Now that we have the relevant seen demonstrations, we need to format them for the LLM. Since standard LLMs process text, we must "textualize" the robot's visual and action data.

Step 1: Textualizing Observations. The system extracts object information from the images.

  • Object Detection: Uses GroundingDINO [89] to find 2D bounding boxes of objects.
  • 3D Coordinate Extraction: Uses the camera's depth information to convert 2D boxes into 3D coordinates (x, y, z).
  • Normalization: The workspace is divided into a $100 \times 100 \times 100$ grid. Coordinates are mapped to this grid.
  • Format: Objects are represented as strings like object_name: [x, y, z].
    • Example: block: [52, 50, 19]
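A sketch of how such a detection could be turned into the coordinate string is given below. It assumes a pinhole camera with known intrinsics and known workspace bounds, and back-projects the box center using the depth map; the helper names (`textualize_object`, `ws_min`, `ws_max`) are hypothetical, and the actual pipeline may additionally transform points into the world frame using camera extrinsics.

```python
import numpy as np

def textualize_object(name, box_xyxy, depth, K, ws_min, ws_max, grid=100):
    """Turn a 2D detection into the 'name: [x, y, z]' string used in prompts.

    box_xyxy: (x1, y1, x2, y2) pixel box from the detector (e.g., GroundingDINO).
    depth:    per-pixel depth map aligned with the RGB image.
    K:        3x3 pinhole camera intrinsics (assumed available from the simulator).
    ws_min/ws_max: workspace bounds (length-3 arrays) used for grid normalization.
    """
    u = int((box_xyxy[0] + box_xyxy[2]) / 2)        # box center pixel (column)
    v = int((box_xyxy[1] + box_xyxy[3]) / 2)        # box center pixel (row)
    z = float(depth[v, u])                          # depth at the center
    x = (u - K[0, 2]) * z / K[0, 0]                 # back-project to 3D
    y = (v - K[1, 2]) * z / K[1, 1]
    p = np.array([x, y, z])
    cell = np.clip((p - ws_min) / (ws_max - ws_min), 0, 1) * (grid - 1)
    gx, gy, gz = cell.round().astype(int)
    return f"{name}: [{gx}, {gy}, {gz}]"            # e.g., "block: [52, 50, 19]"
```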

Step 2: Textualizing Actions. Robots perform continuous motion, but LLMs prefer discrete tokens.

  • Key-Action Selection: Instead of every frame, the method selects "key" moments where the robot's gripper state changes (open/close) or velocities are near zero (stops/turns).
  • Action Vector: A 7-dimensional vector is used: [x, y, z, roll, pitch, yaw, gripper].
    • Position (x,y,z) is normalized to the grid.
    • Rotation (roll, pitch, yaw) is discretized into 72 bins (5 degrees each).
  • Format: [53, 57, 17, 0, 36, 53, 0]
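A possible discretization of one key action into this 7-dimensional token is sketched below, reusing the 100-cell grid and the 5-degree rotation bins described above; the helper name and argument layout are assumptions.

```python
import numpy as np

def textualize_action(pos, rpy, gripper_open, ws_min, ws_max,
                      grid=100, rot_bins=72):
    """Discretize a 7-DoF key action into the token format described above.

    pos:          (x, y, z) end-effector position in workspace coordinates.
    rpy:          (roll, pitch, yaw) in radians.
    gripper_open: 1 if the gripper is open, 0 if closed.
    """
    cell = np.clip((np.asarray(pos) - ws_min) / (ws_max - ws_min), 0, 1)
    px, py, pz = (cell * (grid - 1)).round().astype(int)
    # 72 bins of 5 degrees each for roll / pitch / yaw
    deg = np.degrees(rpy) % 360.0
    r, p, yw = (deg / (360.0 / rot_bins)).astype(int) % rot_bins
    return f"[{px}, {py}, {pz}, {r}, {p}, {yw}, {int(gripper_open)}]"
```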

Step 3: Prompt Construction. The prompt fed to the LLM is structured as a sequence of examples followed by the query. For each selected seen demonstration $k$, the format is:

$$\big[ L_k^{seen}, \{ O_k^1, \cdots, O_k^m, \cdots \} \big] \big[ A_k^1, \cdots, A_k^t, \cdots \big]$$

Symbol Explanation:

  • $L_k^{seen}$: Language instruction for the seen task.
  • $\{O_k^m\}$: The set of textualized object locations (context).
  • $[A_k^t]$: The sequence of textualized key-actions (response).

Final Prompt Structure:

  1. System Prompt: Instructions to the LLM about its role.
  2. Top-K Seen Demos: Concatenated in descending order of similarity.
    • Task: [Instruction 1] Objects: [Obj 1] -> Action: [Seq 1]
    • ...
    • Task: [Instruction K] Objects: [Obj K] -> Action: [Seq K]
  3. Unseen Task Query:
    • Task: [Unseen Instruction] Objects: [Unseen Objects] -> Action:

      The LLM then generates the completion (the action sequence) for the unseen task.

The following figure (Figure 10 from the original paper, labeled Figure A6 in Appendix) illustrates this prompt construction process:

Figure 10 (Figure A6 in the appendix): Schematic of prompt construction for the Franka Panda robot, showing how demonstrations from seen tasks enable action prediction for an unseen task. The prompt includes the system prompt, the seen-task demonstrations, and the unseen-task instruction; action sequences are represented in the 7-dimensional format.
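A simple way to assemble such a prompt is sketched below; the exact system-prompt wording and separators used in the paper may differ, so treat this as an illustrative reconstruction rather than the authors' exact format.

```python
def build_prompt(system_prompt, demos, unseen_instruction, unseen_objects):
    """Assemble the in-context prompt for the LLM.

    demos: list of (instruction, objects_str, action_seq_str) tuples for the
           Top-K retrieved seen demonstrations, ordered by descending similarity.
    """
    parts = [system_prompt]
    for instr, objs, actions in demos:
        parts.append(f"Task: {instr} Objects: {objs} -> Action: {actions}")
    # The query ends at "Action:" so the LLM completes the action sequence.
    parts.append(f"Task: {unseen_instruction} Objects: {unseen_objects} -> Action:")
    return "\n".join(parts)
```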

5. Experimental Setup

5.1. Datasets

  • Source: The benchmark AGNOSTOS is built on the RLBench simulation environment.

  • Training Set (Seen Tasks):

    • 18 standard tasks widely used in prior work (e.g., PerAct, RVT).

    • Data: 200 demonstrations per task (Total 3,600).

    • Purpose: Used to train the baseline models and the dynamics diffusion model $\mathcal{G}$, and serves as the "retrieval pool" for X-ICM.

    • The following figure (Figure 5 from the original paper, labeled Figure A1 in Appendix) shows examples of these seen tasks:

      Figure 5 (Figure A1 in the appendix): Examples of the seen manipulation tasks, including placing items, stacking blocks, and closing a jar. These tasks form the training set used when evaluating zero-shot cross-task generalization on the AGNOSTOS benchmark.

  • Testing Set (Unseen Tasks):

    • 23 held-out tasks that are not in the training set.

    • Level-1 (13 tasks): Share partial semantics with seen tasks (e.g., "Put toilet roll on stand" is similar to the seen task "Place wine at rack location").

    • Level-2 (10 tasks): Completely novel scenarios with no overlap in objects or motions (e.g., "Unplug charger", "Water plants").

    • The following figure (Figure 1 from the original paper) visualizes the categorization of these testing tasks:

      Figure 1: The AGNOSTOS cross-task zero-shot generalization benchmark. The unseen testing tasks are split into two levels: Level-1 tasks share similar objects or motions with seen tasks, while Level-2 tasks share no similar objects or motions. The bottom of the figure lists the different training regimes, such as human-video pre-training and cross-embodiment robot-data pre-training.

5.2. Evaluation Metrics

  • Success Rate:
    • Conceptual Definition: The percentage of attempts where the robot successfully completes the assigned task according to the simulator's strict success conditions.
    • Calculation: For each unseen task, the authors perform 3 test runs with different random seeds, each consisting of 25 rollouts (attempts). $$\text{Success Rate} = \frac{\text{Number of Successful Rollouts}}{\text{Total Number of Rollouts}} \times 100\%$$
    • Reporting: The paper reports the mean and standard deviation across these runs.
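For clarity, the reported mean and standard deviation can be reproduced with a few lines; the per-seed success counts in the example below are made up.

```python
import numpy as np

def success_rate(successes_per_run, rollouts_per_run=25):
    """Mean and std of the per-run success rate (e.g., 3 seeds x 25 rollouts)."""
    rates = 100.0 * np.asarray(successes_per_run) / rollouts_per_run
    return rates.mean(), rates.std()

# Example with made-up counts: 5, 7, and 6 successful rollouts over three seeds
mean, std = success_rate([5, 7, 6])   # -> (24.0, ~3.3)
```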

5.3. Baselines

The authors compare against a comprehensive list of VLA models:

  1. Foundation VLA Models: Large-scale models often fine-tuned on robotic data.
    • OpenVLA [7], RDT [8], $\pi_0$ [1]: State-of-the-art generalist policies.
    • VoxPoser [2]: Uses LLMs for value-map generation (a similar spirit of using LLMs, but a different mechanism).
    • SAM2Act [41], 3D-LOTUS++ [32].
  2. Human-Video Pre-trained Models:
    • R3M [37], D4R [38]: Learn visual encoders from human videos (Ego4D).
  3. In-Domain Models:
    • PerAct [34], RVT [35], RVT2 [40]: Architectures specifically optimized for 3D manipulation in RLBench.
    • Instant Policy [74]: A retrieval-based policy, but typically uses within-task retrieval.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate a massive gap between current VLA capabilities and true cross-task generalization. Most baselines fail catastrophically on the unseen tasks, especially Level-2 tasks.

  • Failure of Baselines: Models like PerAct, RVT, and even OpenVLA often achieve 0.0% success on many unseen tasks. This highlights that "generalist" models trained on specific sets of tasks (even large ones) do not automatically generalize to novel task semantics.
  • X-ICM Dominance:
    • X-ICM (7B) outperforms the previous state-of-the-art ($\pi_0$) by 6.9% on Level-1 tasks.
    • X-ICM (72B) achieves an average success rate of 30.1%, significantly higher than all competitors.
    • Crucially, X-ICM (72B) is the only model that achieves non-zero success on all 23 unseen tasks, whereas every other model fails completely on at least 8 tasks.

Data Presentation (Table 2 Transcription): The following are the results from Table 2 of the original paper. Note the stark contrast between the baselines (often single-digit averages) and X-ICM.

Column layout: Category, Method; the 13 Level-1 tasks (Toilet, Knife, Fridge, Microwave, Laptop, Phone, Seat, Lamp Off, Lamp On, Book, Umbrella, Grill, Bin); the 10 Level-2 tasks (Lid, Plate, Ball, Scoop, Rope, Oven, Buzz, Plants, Charger, USB); then Level-1 avg (std), Level-2 avg (std), and All avg (std).
In-domain PerAct 0.0 5.3 37.3 64.0 2.7 0.0 72.0 0.0 1.3 0.0 1.3 8.0 54.7 58.7 2.7 0.0 0.0 0.0 0.0 1.3 4.0 6.7 2.7 19.0 (1.4) 7.6 (1.1) 14.0 (0.9)
RVT 0.0 2.7 50.7 26.7 50.7 2.7 40.0 0.0 1.3 0.0 1.3 0.0 6.7 89.3 2.7 0.0 0.0 0.0 0.0 8.0 4.0 5.3 4.0 14.0 (1.4) 11.3 (1.6) 12.8 (0.6)
Sigma-Agent 0.0 9.3 56.0 9.3 30.7 1.3 65.3 1.3 0.0 0.0 0.0 1.3 4.0 88.0 0.0 0.0 0.0 0.0 0.0 4.0 8.0 5.3 1.3 13.7 (1.6) 10.7 (1.7) 12.4 (0.4)
RVT2 0.0 1.3 0.0 17.3 42.7 1.3 62.7 2.7 1.3 0.0 1.3 5.3 34.7 22.7 40.0 0.0 0.0 0.0 0.0 0.0 1.3 1.3 1.3 13.1 (0.4) 6.7 (1.3) 10.3 (0.6)
InstantPolicy 0.0 1.3 13.3 4.0 4.0 18.7 24.0 0.0 0.0 0.0 0.0 0.0 0.0 26.7 1.3 0.0 0.0 0.0 0.0 0.0 1.3 0.0 0.0 4.3 (1.2) 2.9 (1.4) 3.7 (3.0)
Human-video D4R 0.0 8.0 32.0 30.7 24.0 0.0 65.3 20.0 4.0 0.0 0.0 0.0 0.0 98.7 0.0 0.0 0.0 0.0 0.0 1.3 1.3 1.3 4.0 14.1 (0.3) 10.7 (0.2) 12.6 (0.2)
R3M 0.0 0.0 37.3 22.7 25.3 1.3 62.7 6.7 4.0 0.0 0.0 0.0 0.0 48.0 0.0 0.0 0.0 0.0 0.0 8.0 2.7 2.7 1.3 12.3 (1.4) 6.3 (0.9) 9.7 (0.6)
D4R-Align 0.0 2.7 45.3 74.7 24.0 0.0 41.3 0.0 0.0 1.3 0.0 0.0 0.0 96.0 0.0 0.0 0.0 0.0 0.0 0.0 2.7 1.3 0.0 14.6 (0.8) 10.0 (1.6) 12.6 (0.3)
R3M-Align 0.0 4.0 49.3 25.3 21.3 0.0 49.3 0.0 5.3 0.0 0.0 1.3 1.3 78.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.7 0.0 12.1 (0.3) 8.1 (0.9) 10.4 (0.4)
Foundation OpenVLA 0.0 5.3 38.7 40.0 57.3 0.0 53.3 12.0 1.3 1.3 0.0 10.7 0.0 86.7 1.3 0.0 0.0 0.0 0.0 0.0 1.3 2.7 1.3 16.9 (1.0) 9.3 (0.9) 13.6 (0.8)
RDT 0.0 0.0 46.7 13.3 14.7 0.0 50.7 0.0 0.0 1.3 0.0 8.0 0.0 52.0 0.0 0.0 0.0 0.0 0.0 0.0 10.7 0.0 0.0 10.4 (0.7) 6.3 (1.6) 8.6 (0.3)
$\pi_0$ 0.0 5.3 85.3 24.0 40.0 1.3 64.0 18.7 8.0 1.3 0.0 33.3 1.3 80.0 0.0 0.0 0.0 0.0 0.0 13.3 17.3 6.7 2.7 21.7 (0.6) 12.0 (1.6) 17.5 (0.4)
LLARVA 0.0 0.0 12.0 0.0 6.7 0.0 40.0 0.0 0.0 0.0 0.0 0.0 0.0 5.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.5 (0.5) 0.5 (0.9) 2.8 (0.3)
3D-LOTUS 0.0 6.7 N/A N/A N/A 0.0 6.7 0.0 0.0 0.0 0.0 0.0 13.3 N/A N/A 0.0 0.0 0.0 N/A 0.0 0.0 0.0 0.0 1.8 (0.0) 0.0 (0.0) 1.0 (0.0)
3D-LOTUS++ 0.0 5.3 N/A N/A N/A 9.3 68.0 10.7 0.0 0.0 0.0 29.3 5.3 N/A N/A 0.0 0.0 0.0 N/A 2.7 2.7 0.0 0.0 9.8 (0.6) 0.5 (0.9) 5.8 (0.6)
SAM2Act 0.0 0.0 36.0 40.0 6.7 6.7 62.7 6.7 0.0 1.3 1.3 13.3 0.0 82.7 2.7 0.0 0.0 0.0 0.0 1.3 13.3 5.3 1.3 13.4 (0.6) 10.7 (0.9) 12.2 (0.3)
VoxPoser 0.0 0.0 0.0 0.0 5.3 8.0 28.0 88.7 25.3 0.0 0.0 9.3 0.0 90.7 1.3 53.3 0.0 10.7 22.7 36.0 40.0 89.3 2.7 12.6 (0.9) 34.7 (1.8) 22.2 (1.0)
Ours X-ICM (7B) 1.3 26.7 22.7 45.3 33.3 57.3 48.0 58.7 50.7 1.3 0.0 5.3 18.7 90.7 5.3 2.7 1.3 2.7 5.3 17.3 16.0 22.7 5.3 28.6 (1.9) 16.9 (1.3) 23.5 (1.6)
X-ICM (72B) 6.7 69.3 12.7 58.7 34.0 68.0 51.3 86.7 74.7 2.0 1.3 8.0 18.7 90.7 13.3 5.3 6.7 4.0 8.0 32.0 16.0 12.0 14.7 **37.6 (1.4)** **20.3 (1.7)** **30.1 (1.0)**

6.2. Ablation Studies

6.2.1. Impact of Dynamics-Guided Selection

The authors tested X-ICM (72B) without the dynamics-guided selection module (replacing it with random selection, "w/o sel").

  • Result: The success rate dropped from 30.1% to 25.2%.

  • Conclusion: Identifying relevant demonstrations based on dynamics is crucial. Showing the LLM irrelevant tasks (just randomly picked ones) hurts its ability to reason about the new task.

    The following are the results from Table 3 of the original paper:

    Models Level-1 Level-2 All
    X-ICM (72B) w/o sel 30.7 (4.7) 18.0 (2.2) 25.2 (3.2)
    X-ICM (72B) 37.6 (1.4) 20.3 (1.7) 30.1 (1.0)

6.2.2. Number of In-Context Examples

The authors varied the number of demonstrations ($K$) provided to the LLM.

  • Result: Performance increases rapidly as $K$ goes from 1 to 12.

  • Plateau: After 12-18 examples, performance levels off. This suggests a "saturation point" where adding more examples doesn't provide new information or potentially introduces noise. They chose $K=18$ for the final model.

    The following figure (Figure 3 from the original paper) visualizes this trend:

    Figure 3: The effect of varying the number of in-context demonstrations. The bar chart reports Level-1 and Level-2 success rates for 1, 2, 6, 12, 18, and 24 examples; the best Level-1 result is about 38% and the best Level-2 result about 27%.

6.2.3. LLM Backbone

The choice of LLM matters. They compared different 7B/8B models.

  • Qwen2.5-7B-Instruct performed best (23.5%).

  • Deepseek-R1-Distill-Qwen-7B and Llama3.0-8B-Instruct performed significantly worse (9.5% and 15.1%, respectively).

  • Conclusion: The reasoning and instruction-following capability of the underlying LLM is a bottleneck.

    The following are the results from Table 4 of the original paper:

    LLMs Level-1 Level-2 All
    Deepseek-R1-Distill-Qwen-7B 10.7 (1.1) 7.9 (0.5) 9.5 (0.5)
    Llama3.0-8B-Instruct 17.4 (0.7) 11.7 (1.6) 15.1 (0.3)
    Ministral-8B-Instruct-2410 22.9 (0.7) 14.8 (0.3) 19.5 (0.5)
    InternLM3-8B-Instruct 27.9 (0.7) 13.3 (0.4) 21.8 (0.5)
    Qwen2.5-7B-Instruct 28.6 (1.9) 16.9 (1.3) 23.5 (1.6)

6.3. Real-World Experiments

The authors validated X-ICM on a real robot (xArm7) with 5 tasks (e.g., "push button", "stack cups").

  • Setup: 20 demos collected for each task. Tested in a "leave-one-out" manner (using demos from the other 4 tasks to solve the 5th).

  • Result: The method worked in the real world, showing decent success rates (e.g., 70% for "put block into bin"). The following figure (Figure 4 from the original paper) shows these real-world results:

    Figure 4: Results of five real-world tasks, conducted in a zero-shot cross-task manner. The chart shows success rates for putting a block into a bin, pushing a button, putting a bottle into a box, stacking blocks, and stacking cups: 70%, 10%, 30%, 50%, and 30%, respectively.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper exposes a significant weakness in current robotic learning: while we have models that can handle visual variety, we lack models that can handle task variety (new actions/objects) without retraining.

  • AGNOSTOS serves as a necessary "stress test" for future generalist robots.
  • X-ICM proves that LLMs contain valuable priors for manipulation logic. By smartly retrieving dynamically similar examples (using the diffusion-based selector), we can unlock this knowledge to control robots in unseen scenarios.

7.2. Limitations & Future Work

  • Visual Information Loss: The method "textualizes" the scene (converting images to lists of object coordinates). This throws away rich visual details (texture, shape, physics) that might be crucial for some tasks (e.g., handling a soft sponge vs. a hard block).
    • Future Work: Integrating Multimodal LLMs (MLLMs) that can look at the images directly alongside the text prompts.
  • Long-Horizon Tasks: The paper briefly mentions a failure case with a long task ("clean the table" decomposed into subtasks). Errors accumulate: if the first step fails, the whole task fails. Zero-shot settings make this even harder.
    • Future Work: Better long-horizon planning and error recovery strategies.
  • Extrapolation Limits: LLMs are not magic; they struggle to extrapolate to completely alien concepts that don't appear in their pre-training data or the context.

7.3. Personal Insights & Critique

  • The "Context" is Key: This paper reinforces the trend that "Better Retrieval = Better Performance." The Dynamics-Guided Selection module is arguably as important as the LLM itself. It shifts the problem from "learning a policy" to "finding the right analogy."
  • Symbolic Bottleneck: The reliance on GroundingDINO and converting everything to text coordinates is a fragile point. If the object detector fails or the object is weirdly shaped (hard to define by a center point), the whole system collapses. A purely end-to-end vision-based approach (like OpenVLA) should theoretically be better at this, but this paper shows they aren't there yet for cross-task generalization.
  • Applicability: The method is highly applicable to service robots (e.g., home tidying bots) where the variety of tasks is infinite, but the physics (pick, place, push) are relatively consistent. The use of standard LLMs means the robot's "brain" can be upgraded instantly as better LLMs (like GPT-5) are released.
