SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation
TL;DR Summary
The paper presents the SpatialActor model to enhance robustness in robotic manipulation by decoupling semantic and geometric information. It employs a semantic-guided geometric module and a spatial transformer. The model demonstrates superior performance across various tasks under varying noisy conditions, showing strong robustness.
Abstract
Robotic manipulation requires precise spatial understanding to interact with objects in the real world. Point-based methods suffer from sparse sampling, leading to the loss of fine-grained semantics. Image-based methods typically feed RGB and depth into 2D backbones pre-trained on 3D auxiliary tasks, but their entangled semantics and geometry are sensitive to inherent depth noise in real-world that disrupts semantic understanding. Moreover, these methods focus on high-level geometry while overlooking low-level spatial cues essential for precise interaction. We propose SpatialActor, a disentangled framework for robust robotic manipulation that explicitly decouples semantics and geometry. The Semantic-guided Geometric Module adaptively fuses two complementary geometry from noisy depth and semantic-guided expert priors. Also, a Spatial Transformer leverages low-level spatial cues for accurate 2D-3D mapping and enables interaction among spatial features. We evaluate SpatialActor on multiple simulation and real-world scenarios across 50+ tasks. It achieves state-of-the-art performance with 87.4% on RLBench and improves by 13.9% to 19.4% under varying noisy conditions, showing strong robustness. Moreover, it significantly enhances few-shot generalization to new tasks and maintains robustness under various spatial perturbations. Project Page: https://shihao1895.github.io/SpatialActor
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation
The title clearly states the paper's focus: proposing a new model named SpatialActor for robotic manipulation. It highlights the core technical approach, which is creating "disentangled spatial representations," and the primary goal, which is to achieve "robustness."
1.2. Authors
Hao Shi¹, Bin Xie², Yingfei Liu², Yang Yue¹, Tiancai Wang², Haoqiang Fan², Xiangyu Zhang³, Gao Huang⁴
The authors are affiliated with several prominent institutions:
- ¹Tsinghua University (Department of Automation, BNRist): A top university in China, well-known for its engineering and computer science programs.
- ²Dexmal, ⁴StepFun: Companies that likely focus on robotics and AI applications.
- ³MEGVII Technology: A well-known Chinese AI company specializing in computer vision (Face++).
The collaboration between a leading academic institution and several industrial research labs suggests that the work is grounded in both theoretical rigor and a strong drive for practical, real-world application.
1.3. Journal/Conference
The paper is available on arXiv, a preprint server. The "Published at (UTC)" date of "2025-11-12T18:59:08.000Z" corresponds to the arXiv posting time; no peer-reviewed venue is listed, suggesting the paper is under review. Given the topic and quality, likely venues would be major robotics conferences such as CoRL (Conference on Robot Learning) or ICRA (IEEE International Conference on Robotics and Automation), or top AI/ML conferences such as NeurIPS or CVPR.
1.4. Publication Year
The arXiv identifier (2511.09555) and the posting date indicate a 2025 submission, and the citations include papers from as late as 2025, confirming this is very recent work.
1.5. Abstract
The abstract outlines the key problems in robotic manipulation. It criticizes existing point-based methods for losing semantic information and image-based methods for their entangled representations, which are sensitive to real-world depth sensor noise. The paper proposes SpatialActor, a framework that addresses these issues by disentangling semantics and geometry. Its core components are:
- A Semantic-guided Geometric Module (SGM) that fuses geometric information from noisy depth maps with more robust geometric priors derived from RGB images via an expert model.
- A Spatial Transformer (SPT) that uses low-level spatial cues for accurate 2D-to-3D mapping and feature interaction.
The abstract highlights impressive results: state-of-the-art performance on the RLBench benchmark (87.4% success), significant robustness improvements (13.9% to 19.4%) under noisy conditions, and strong few-shot generalization.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2511.09555
- PDF Link: https://arxiv.org/pdf/2511.09555v1.pdf

The paper is currently available as a preprint on arXiv, meaning it has not yet completed the peer-review process for an official publication venue.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper addresses is the lack of robustness in robotic manipulation systems, especially when dealing with imperfect sensor data from the real world. For a robot to interact precisely with its environment (e.g., pick up an object, insert a peg), it needs a deep understanding of both what objects are (semantics) and where they are in 3D space (geometry).
Existing approaches have significant limitations:
- Point-based methods: These methods use 3D point clouds directly. While they explicitly represent geometry, the process of creating a point cloud from sensor data can be sparse, leading to a loss of fine-grained object details (semantics). They are also expensive to annotate, limiting the scale of pre-training.
- Image-based methods: These methods take RGB images and depth maps (RGB-D) as input. They often fuse these two sources of information early on into a single, entangled representation. The problem is that real-world depth sensors are notoriously noisy (due to reflective surfaces, poor lighting, etc.). This noise corrupts the geometric information, and because geometry and semantics are entangled, the noise also disrupts the model's semantic understanding, leading to a sharp drop in performance. As the paper notes, even minor noise causes a significant performance drop in a state-of-the-art model like RVT-2.

The paper's innovative entry point is to disentangle semantics and geometry. Instead of mixing them in a shared feature space, SpatialActor processes them separately. This separation is designed to prevent noise from the depth channel from corrupting the clean semantic information extracted from the RGB channel. Furthermore, it proposes a novel way to construct a robust geometric representation by combining the best of both worlds: the fine-grained but noisy details from the raw depth map and a coarse but robust geometric "prior" generated from the clean RGB image using a powerful, pre-trained depth estimation model.
2.2. Main Contributions / Findings
The paper makes several key contributions:
- A Disentangled Framework (SpatialActor): It proposes a novel architecture that explicitly decouples semantic and geometric representations to improve robustness against sensor noise. This is a fundamental departure from prior works that jointly model them.
- Semantic-guided Geometric Module (SGM): This module intelligently fuses two complementary sources of geometric information:
  - Fine-grained but noisy geometry from the raw depth sensor.
  - Coarse but robust geometry estimated from the high-quality RGB image using a pre-trained "expert" depth model. A gating mechanism adaptively combines them, preserving details while filtering noise.
- Spatial Transformer (SPT): This component enhances spatial reasoning by incorporating low-level spatial cues directly into the transformer architecture. It uses positional encodings derived from 3D coordinates to help the model understand the precise spatial relationships between different parts of the scene, which is critical for fine-grained manipulation.
- State-of-the-Art Performance and Robustness: Through extensive experiments, the paper demonstrates that SpatialActor not only achieves top performance on standard benchmarks (RLBench) but is also significantly more robust to depth noise and spatial perturbations, and generalizes better to new tasks with limited data (few-shot learning). This validates the effectiveness of the disentangled design.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, one must be familiar with the following concepts:
- Robotic Manipulation: This field of robotics focuses on enabling robots to physically interact with and manipulate objects in their environment. This includes tasks like grasping, pushing, placing, and assembling objects.
- RGB-D Data: This refers to data captured by sensors that provide both a standard color image (RGB) and a per-pixel depth map (D), which indicates the distance of each pixel from the sensor. RGB-D cameras like the Intel RealSense are common in robotics.
- Point Cloud: A set of data points in a 3D coordinate system. Point clouds are often generated from RGB-D data or LiDAR sensors and are a direct way to represent the 3D geometry of a scene.
- Vision-Language Models (VLMs): These are models, like CLIP (Contrastive Language-Image Pre-training), that are pre-trained on vast amounts of image-text pairs from the internet. They learn a shared embedding space where a text description (e.g., "a red apple") and a corresponding image are mapped to nearby points. This allows the model to understand visual concepts described in natural language, which is useful for instruction-following robots.
- Transformers: A neural network architecture originally designed for natural language processing that relies on a mechanism called self-attention. It allows the model to weigh the importance of different parts of the input sequence when processing a specific part. In computer vision (Vision Transformer or ViT), an image is broken into patches, which are treated as a sequence of tokens. The transformer can then learn relationships between different parts of the image.
- Self-Attention: The core mechanism of a transformer. For a given token, self-attention calculates a score against every other token in the sequence. These scores are then used to create a weighted sum of all tokens, producing a new representation for the original token that is context-aware. The formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$ (Query), $K$ (Key), and $V$ (Value) are linear projections of the input tokens, and $d_k$ is the dimension of the key vectors. A minimal code sketch of this computation follows this list.
- Depth Estimation Models: These are deep learning models trained to predict a depth map from a single RGB image. Modern models like Depth Anything are trained on massive and diverse datasets, making them very robust at inferring geometric structure from 2D images, even in challenging conditions.
- Rotary Positional Encoding (RoPE): A technique for encoding the positional information of tokens in a transformer. Unlike absolute positional embeddings that are added to the tokens, RoPE rotates the token embeddings based on their position. This method has been shown to be very effective at capturing relative positional relationships.
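As a concrete illustration of the self-attention formula above (not taken from the paper), here is a minimal NumPy sketch of scaled dot-product self-attention with toy dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """tokens: (N, D) sequence; Wq/Wk/Wv: (D, d_k) learned projection matrices."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (N, N) pairwise similarity scores
    return softmax(scores, axis=-1) @ V       # context-aware token representations

rng = np.random.default_rng(0)
N, D, dk = 6, 16, 8                           # 6 tokens, toy feature sizes
tok = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, dk)) for _ in range(3))
out = self_attention(tok, Wq, Wk, Wv)         # shape (6, 8)
```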
3.2. Previous Works
The paper categorizes prior work into several groups, which it aims to improve upon:
- 2D-based Approaches: Models like R3M and Diffusion Policy learn manipulation policies directly from 2D images. While they benefit from powerful pre-trained 2D vision models, they lack an explicit understanding of 3D space, which makes them struggle with tasks requiring precise geometric reasoning or handling occlusions.
- Point Cloud-based Methods: Models like PolarNet and AnyGrasp use 3D point clouds as their primary input. They have an explicit geometric representation, but suffer from sparsity (missing details between points) and the high cost of acquiring and annotating large-scale 3D data.
- Voxel-based Representations: Models like PerAct discretize the 3D space into a grid of voxels. This provides a structured representation for reasoning, but it is computationally expensive, especially at high resolutions.
- Multi-view RGB-D Approaches: This is the most direct competitor category. Models like RVT (Robotic View Transformer) and its successor RVT-2 take RGB-D data from multiple camera views and feed them into a transformer architecture. They typically use a shared feature space where semantic and geometric information are entangled. While powerful, their performance is fragile and degrades significantly when the depth data is noisy, which is the central problem SpatialActor is designed to solve.
3.3. Technological Evolution
The field of robotic manipulation has evolved from relying on simple proprioceptive (joint-level) states to leveraging rich visual information.
- Early Vision: Simple computer vision techniques were used to detect objects.
- Deep Learning Rise: CNNs became dominant for processing 2D images to guide robots.
- 3D Integration: The availability of cheap RGB-D sensors led to methods using point clouds and voxels to explicitly model the 3D world.
- Transformer Era: Inspired by successes in NLP and vision, transformers (RVT, PerAct) were adapted for robotics. They can process information from multiple camera views and fuse it with language instructions, enabling more complex, multi-stage tasks.
- Foundation Models: Recent work leverages large-scale pre-trained models (VLMs like CLIP, depth estimators like Depth Anything) as powerful priors for semantics and geometry.

SpatialActor sits at this latest stage. It doesn't just use foundation models but proposes a more principled way to combine them by disentangling their outputs to build a representation that is both accurate and robust.
3.4. Differentiation Analysis
The core innovation of SpatialActor compared to its main competitors (like RVT-2) is the disentanglement of semantics and geometry.
- RVT-2 (Entangled): Feeds RGB and depth data into a shared encoder. The features for semantics and geometry are mixed together. Pro: Simple and can learn joint patterns. Con: Noise in the depth channel can easily corrupt the entire representation, harming semantic understanding.
- SpatialActor (Disentangled): Processes RGB and depth in separate streams.
  - The semantic stream uses the clean RGB image to get high-quality semantic features via a VLM.
  - The geometric stream is more complex: it doesn't just rely on the noisy raw depth. It creates a robust geometric representation by fusing the raw depth (for detail) with a clean geometric estimate generated from the RGB image (for robustness).

This disentangled design acts as a firewall, preventing depth noise from contaminating the semantic features. The SGM module then carefully reconstructs a high-quality geometric representation, making the final model far more resilient to the challenges of real-world sensor data.
4. Methodology
4.1. Principles
The core principle of SpatialActor is "divide and conquer". Instead of treating the multi-modal input (RGB, depth, language, robot state) as a monolithic block of information, the framework separates it into distinct channels—semantics and geometry—to process them according to their strengths and weaknesses. The intuition is that RGB images provide high-quality semantic information but ambiguous geometry, while depth maps provide direct geometry but are often noisy. By handling them separately and fusing them intelligently, the system can be more robust and precise than methods that entangle them from the start.
4.2. Core Methodology In-depth (Layer by Layer)
The SpatialActor framework can be broken down into the following stages, as illustrated in Figure 2 of the paper.
The following figure (Figure 2 from the original paper) shows the overall framework of SpatialActor.
This figure is a schematic showing the modules of the SpatialActor framework. The left part contains a geometric encoder that processes noisy depth information and a semantic encoder that processes RGB images, which extract and combine complementary information through multi-scale gated fusion (SGM). The middle part shows the Spatial Transformer module (SPT), which enables view-level and scene-level interaction. The right part is the execution module, showing the robot arm's actions. The framework aims to enhance the robot's manipulation capability and stability in complex environments.
4.2.1. Inputs
The system takes a set of inputs for each decision step: $ X = \{ I^v, D^v \}_{v=1}^{V}, P, L $
- $I^v$: The RGB image from camera view $v$.
- $D^v$: The corresponding depth map from view $v$.
- $V$: The total number of camera views.
- $P$: The robot's proprioceptive state (e.g., joint angles, gripper position).
- $L$: The natural language instruction for the task (e.g., "put the red block on the blue block").
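To make this input bundle concrete, the sketch below groups the quantities into a single Python container; the class name, field names, and shapes are illustrative assumptions, not the authors' code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ManipulationObservation:
    """One decision step of multi-view RGB-D input (illustrative shapes)."""
    rgb: np.ndarray          # (V, H, W, 3) uint8 images I^v from V camera views
    depth: np.ndarray        # (V, H, W) float32 depth maps D^v, in meters
    proprio: np.ndarray      # (P,) robot state: joint angles, gripper openness, etc.
    instruction: str         # natural language task command L
    intrinsics: np.ndarray   # (V, 3, 3) camera intrinsic matrices K^v
    extrinsics: np.ndarray   # (V, 4, 4) camera-to-base transforms E^v
```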
4.2.2. Disentangled Feature Extraction
The RGB and depth inputs are processed separately to extract semantic and geometric features.
- Semantic Features: The RGB image and language instruction are passed through a pre-trained Vision-Language Model (VLM) like CLIP. This produces language-aligned semantic features for the image along with text features for the instruction.
- Geometric Features (Raw): The raw, potentially noisy depth map is passed through a standard convolutional encoder (e.g., a ResNet) to produce fine-grained but noisy geometric features $F_{\mathrm{geo}}^v$.
4.2.3. Semantic-guided Geometric Module (SGM)
This is the first key innovation. The goal is to create a robust geometric representation that is less sensitive to noise. The SGM does this by fusing the noisy features from the raw depth map with a clean "expert prior" generated from the RGB image.
The following figure (Figure 3 from the original paper) details the SGM and SPT modules.
This figure is a schematic showing the structure of the Semantic-guided Geometric Module (SGM) and the Spatial Transformer (SPT). The SGM processes noisy depth information through multi-scale gated fusion, while the SPT uses a spatial positional encoding module to enable view-level and scene-level interaction.
- Expert Geometric Prior Generation: The high-quality RGB image is fed into a frozen, large-scale pre-trained depth estimation model (like Depth Anything v2). This model has learned to infer geometric structure from semantics on a massive dataset, so it produces a robust but potentially coarse geometric representation:
  $ \hat{F}_{\mathrm{geo}}^v = \mathcal{E}_{\mathrm{expert}}(I^v) \in \mathbb{R}^{H \times W \times C} $
  - $\mathcal{E}_{\mathrm{expert}}$: The frozen depth estimation expert model.
  - $\hat{F}_{\mathrm{geo}}^v$: The robust, coarse geometric features.
- Raw Geometric Feature Extraction: Simultaneously, the raw depth map is processed by a trainable depth encoder to get fine-grained features:
  $ F_{\mathrm{geo}}^v = \mathcal{E}_{\mathrm{raw}}(D^v) \in \mathbb{R}^{H \times W \times C} $
  - $\mathcal{E}_{\mathrm{raw}}$: The trainable depth encoder (e.g., ResNet-50).
  - $F_{\mathrm{geo}}^v$: The fine-grained, noisy geometric features.
- Adaptive Gated Fusion: The SGM then adaptively fuses these two complementary geometric representations. A gating mechanism decides pixel-wise whether to trust the fine-grained details from the raw depth or the robust prior from the expert. First, a gate is computed by concatenating both feature maps and passing them through a small Multi-Layer Perceptron (MLP):
  $ G^v = \sigma\left(\mathrm{MLP}\left(\mathrm{Concat}(\hat{F}_{\mathrm{geo}}^v, F_{\mathrm{geo}}^v)\right)\right) $
  - $\mathrm{Concat}(\cdot)$: Concatenates the two feature tensors along the channel dimension.
  - $\mathrm{MLP}(\cdot)$: A multi-layer perceptron that learns to produce the gate values.
  - $\sigma$: The sigmoid activation function, which outputs values between 0 and 1 for the gate $G^v$.

  The final fused geometric feature map is a weighted combination controlled by the gate $G^v$:
  $ F_{\mathrm{fuse-geo}}^v = G^v \odot F_{\mathrm{geo}}^v + (1 - G^v) \odot \hat{F}_{\mathrm{geo}}^v $
  - $\odot$: Element-wise multiplication.

  When a value in the gate is close to 1, the fusion favors the fine-grained feature $F_{\mathrm{geo}}^v$ from the raw depth. When it is close to 0, it favors the robust expert prior $\hat{F}_{\mathrm{geo}}^v$. This allows the model to dynamically suppress noise.
Finally, the fused geometric features are concatenated with the semantic features to form the combined spatial features $H^v$ for that view.
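The gated fusion can be sketched in a few lines of PyTorch. This is a minimal illustration assuming (B, C, H, W) feature maps and a 1x1-convolution MLP; the module name and layer sizes are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class GatedGeometricFusion(nn.Module):
    """Fuse noisy raw-depth features with robust expert-prior features via a learned gate."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions act as the per-pixel MLP over concatenated channels
        self.gate_mlp = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, f_raw: torch.Tensor, f_expert: torch.Tensor) -> torch.Tensor:
        # f_raw:    fine-grained but noisy features from the trainable depth encoder
        # f_expert: coarse but robust features from the frozen depth expert
        gate = torch.sigmoid(self.gate_mlp(torch.cat([f_expert, f_raw], dim=1)))
        # G * F_geo + (1 - G) * F_hat_geo, applied element-wise
        return gate * f_raw + (1.0 - gate) * f_expert

fuse = GatedGeometricFusion(channels=64)
f_raw, f_expert = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
f_fused = fuse(f_raw, f_expert)  # (2, 64, 32, 32)
```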
4.2.4. Spatial Transformer (SPT)
This is the second key innovation. Once we have the per-view features $H^v$, the SPT's job is to reason about their spatial relationships and fuse information across different views to form a coherent scene understanding.

- Proprioception Fusion: The robot's own state $P$ is projected by an MLP and added to the visual features to make them aware of the robot's current configuration: $ \widetilde{H}^v = H^v + \mathrm{MLP}(P) $
- Low-level Spatial Cue Injection (2D-to-3D Projection): To enable precise spatial reasoning, the model needs to know the 3D location corresponding to each feature token. For a pixel (x', y') in the image of view $v$, its 3D coordinate (x, y, z) in the robot's base frame is calculated using the camera's intrinsic and extrinsic parameters:
  $ [x, y, z, 1]^{\top} = E^v \left( d \cdot (K^v)^{-1} [x', y', 1]^{\top} \parallel 1 \right) $
  - $K^v$: The camera intrinsic matrix for view $v$, which maps 3D camera coordinates to 2D image coordinates.
  - $E^v$: The camera extrinsic matrix for view $v$, which transforms coordinates from the camera frame to the robot's base frame.
  - $d = D^v(x', y')$: The depth value at pixel (x', y').
  - $\parallel$: Vector concatenation.
- Rotary Positional Encoding (RoPE): The calculated 3D coordinates (x, y, z) are used to generate positional encodings via RoPE (a minimal code sketch follows this list).
  - First, a set of frequencies is defined: $ \omega_k = \lambda^{-2k/d}, \quad k = 0, 1, \ldots, \frac{d}{2} - 1, \quad d = D/3 $ where $\lambda$ is the frequency base and $D$ is the feature dimension. This creates a spectrum of frequencies from low to high.
  - Sinusoidal embeddings are then computed for each coordinate axis $u \in \{x, y, z\}$: $ \cos_{\mathrm{pos}} = [\cos(\omega_k u)]_{u \in \{x, y, z\},\, k = 0, \ldots, d/2 - 1} $ and $ \sin_{\mathrm{pos}} = [\sin(\omega_k u)]_{u \in \{x, y, z\},\, k = 0, \ldots, d/2 - 1} $
  - These embeddings are applied to the feature vectors using rotation, which modifies the features based on their 3D position. This allows the subsequent attention layers to be inherently aware of relative spatial positions: $ T^v = \tilde{H}^v \odot \cos_{\mathrm{pos}} + \mathrm{rot}(\tilde{H}^v) \odot \sin_{\mathrm{pos}} $ where $\mathrm{rot}(\cdot)$ rotates pairs of feature dimensions, effectively mixing them based on the sine component of the positional encoding.
- Hierarchical Attention: The SPT uses a two-level attention mechanism:
  - View-level Interaction: Self-attention is applied to the tokens within each view separately. This allows the model to refine features by consolidating context from within a single viewpoint.
  - Scene-level Interaction: The refined tokens from all views are concatenated together with the language features. Another round of self-attention is applied to this combined set of tokens. This crucial step fuses information across all views and modalities, creating a unified scene representation that understands the relationships between objects seen from different angles and connects them to the language command.
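Referenced from the RoPE item above: a minimal PyTorch sketch of rotary positional encoding driven by 3D coordinates. The base frequency, channel interleaving, and pair-rotation convention rot(·) are assumptions for illustration and may differ from the paper's implementation:

```python
import torch

def rotate_pairs(x: torch.Tensor) -> torch.Tensor:
    """Swap adjacent feature pairs with a sign flip: (a, b) -> (-b, a)."""
    a, b = x[..., 0::2], x[..., 1::2]
    return torch.stack((-b, a), dim=-1).flatten(-2)

def rope_3d(features: torch.Tensor, coords: torch.Tensor,
            base: float = 100.0) -> torch.Tensor:
    """Apply rotary positional encoding driven by 3D coordinates.

    features: (N, D) tokens with D divisible by 6 (three axes, pairs of channels).
    coords:   (N, 3) metric (x, y, z) positions of each token.
    """
    N, D = features.shape
    d = D // 3                                      # channels allocated per axis
    k = torch.arange(d // 2, dtype=torch.float32)
    omega = base ** (-2.0 * k / d)                  # frequency spectrum per axis
    angles = coords.unsqueeze(-1) * omega           # (N, 3, d/2) angles per axis
    angles = torch.repeat_interleave(angles, 2, dim=-1).reshape(N, D)
    return features * angles.cos() + rotate_pairs(features) * angles.sin()

feats = torch.randn(5, 48)     # 5 tokens, 48-dim features (48 = 3 * 16)
xyz = torch.rand(5, 3)         # their 3D positions in meters
encoded = rope_3d(feats, xyz)  # same shape, now position-aware
```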
4.2.5. Action Prediction
The final refined tokens are used to predict the robot's action.
- A lightweight decoder produces a per-view 2D heatmap. The target 3D translation (x, y, z) is determined by finding the pixel with the highest heatmap value (argmax) and lifting its 2D coordinates to 3D using the camera model.
- An MLP takes the feature vectors around this peak location and regresses the end-effector rotation and the binary gripper state (open/close).
- The final action combines the predicted 3D translation, end-effector rotation, and gripper state.
- The model is trained with a composite loss: cross-entropy loss for the 2D heatmaps, cross-entropy loss for discretized rotation angles, and a binary classification loss for the gripper state.
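A hedged PyTorch sketch of the composite training objective described above. Tensor shapes, the per-axis rotation binning, and the equal loss weights are assumptions for illustration; the paper may weight or parameterize these terms differently:

```python
import torch
import torch.nn.functional as F

def manipulation_loss(heatmap_logits, heatmap_targets,
                      rot_logits, rot_targets,
                      grip_logits, grip_targets):
    """Composite loss: heatmap CE + discretized-rotation CE + gripper BCE."""
    # heatmap_logits: (B, V, H*W) per-view translation heatmaps; targets: (B, V) pixel indices
    l_trans = F.cross_entropy(heatmap_logits.flatten(0, 1), heatmap_targets.flatten())
    # rot_logits: (B, 3, num_bins) rotation-angle bins per axis; targets: (B, 3) bin indices
    l_rot = F.cross_entropy(rot_logits.flatten(0, 1), rot_targets.flatten())
    # grip_logits: (B,) open/close logits; targets: (B,) in {0, 1}
    l_grip = F.binary_cross_entropy_with_logits(grip_logits, grip_targets.float())
    return l_trans + l_rot + l_grip  # equal weighting assumed here

# Toy shapes: batch of 2, 4 views, 7x7 heatmaps, 72 rotation bins per axis
B, V, HW, bins = 2, 4, 49, 72
loss = manipulation_loss(
    torch.randn(B, V, HW), torch.randint(HW, (B, V)),
    torch.randn(B, 3, bins), torch.randint(bins, (B, 3)),
    torch.randn(B), torch.randint(2, (B,)),
)
```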
5. Experimental Setup
5.1. Datasets
- RLBench: A large-scale, challenging benchmark for robotic manipulation simulated in CoppeliaSim. It features a Franka Emika Panda arm. The paper uses 18 different tasks with a total of 249 variations (e.g., different object colors, positions). For each task, 100 expert demonstrations are used for training and 25 unseen episodes for testing. Observations include RGB-D images from four cameras (front, left/right shoulder, wrist).
  - Data Sample Example: An episode would consist of a sequence of observations (four RGB-D images, robot state) and the corresponding expert action (target end-effector pose and gripper state) needed to complete a task like "push the red button, then the green button".
- ColosseumBench: A benchmark designed specifically to evaluate the generalization and robustness of manipulation policies under environmental changes. The paper uses 20 tasks and evaluates performance under perturbations like changing the size of the manipulated object, the size of the container object, and the camera pose.
These datasets were chosen because they are standard in the field and provide a comprehensive test of multi-task learning, precision, robustness, and generalization.
5.2. Evaluation Metrics
- Success Rate: This is the primary metric for evaluating task completion.
  - Conceptual Definition: It measures the percentage of trials in which the robot successfully completes the given task according to the task-specific criteria. A higher success rate indicates a more effective and reliable policy.
  - Mathematical Formula: $ \text{Success Rate}\ (\%) = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100 $
  - Symbol Explanation:
    - Number of Successful Trials: The count of test episodes where the policy achieved the task goal.
    - Total Number of Trials: The total number of test episodes evaluated.
- Average Rank: This metric evaluates the overall performance of a model across multiple tasks by ranking it against other models on each task and then averaging the ranks.
  - Conceptual Definition: It provides a summary of how a model performs relative to its competitors across a suite of tasks. A lower average rank is better, indicating that the model consistently ranks high (e.g., 1st, 2nd) on many tasks.
  - Mathematical Formula: $ \text{Average Rank} = \frac{1}{N_{\text{tasks}}} \sum_{i=1}^{N_{\text{tasks}}} \text{Rank}_i $
  - Symbol Explanation:
    - $N_{\text{tasks}}$: The total number of tasks in the benchmark.
    - $\text{Rank}_i$: The rank of the model on task $i$ (e.g., 1 for the best, 2 for the second best, etc.).
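For clarity, a small NumPy sketch of how average rank can be computed from a matrix of per-task success rates (toy numbers; ties are not specially handled here):

```python
import numpy as np

def average_ranks(success: np.ndarray) -> np.ndarray:
    """success: (num_models, num_tasks) success rates; returns per-model average rank (1 = best)."""
    # Rank models within each task: higher success -> lower (better) rank
    ranks = (-success).argsort(axis=0).argsort(axis=0) + 1
    return ranks.mean(axis=1)

# Toy example with 3 models on 4 tasks
scores = np.array([[90, 60, 70, 80],
                   [85, 65, 90, 75],
                   [50, 40, 60, 55]])
print(average_ranks(scores))  # [1.5, 1.5, 3.0]
```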
5.3. Baselines
The paper compares SpatialActor against a wide range of state-of-the-art models, representing different approaches to robotic manipulation:
- PerAct: A voxel-based method using a Perceiver-style transformer. It is a strong baseline that discretizes 3D space.
- RVT and RVT-2: The most direct competitors. These are image-based methods using transformers to process multi-view RGB-D data. They represent the state-of-the-art in entangled representation learning, making them the perfect benchmark to demonstrate the benefits of SpatialActor's disentangled approach.
- Act3D, 3D Diffuser Actor: Other recent 3D representation learning methods.
- C2F-ARM-BC, HiveFormer, PolarNet: Earlier methods representing different architectural choices (coarse-to-fine, point cloud-based, etc.).
- R3M, MVP, VoxPoser: Baselines used in the ColosseumBench evaluation.

This comprehensive set of baselines ensures that the performance gains of SpatialActor are measured against the best and most representative methods currently available.
6. Results & Analysis
6.1. Core Results Analysis
The main results on the RLBench benchmark are presented in Table 1.
The following are the results from Table 1 of the original paper:
| Models | Avg. Success ↑ | Avg. Rank ↓ | Close Jar | Drag Stick | Insert Peg | Meat off Grill | Open Drawer | Place Cups | Place Wine | Push Buttons |
|---|---|---|---|---|---|---|---|---|---|---|
| C2F-ARM-BC (James et al. 2022) | 20.1 | 9.5 | 24.0 | 24.0 | 4.0 | 20.0 | 20.0 | 0.0 | 8.0 | 72.0 |
| HiveFormer (Guhur et al. 2023) | 45.3 | 7.8 | 52.0 | 76.0 | 0.0 | 100.0 | 52.0 | 0.0 | 80.0 | 84.0 |
| PolarNet (Chen et al. 2023) | 46.4 | 7.3 | 36.0 | 92.0 | 4.0 | 100.0 | 84.0 | 0.0 | 40.0 | 96.0 |
| PerAct (Shridhar, Manuelli, and Fox 2023) | 49.4 | 7.1 | 55.2±4.7 | 89.6±4.1 | 5.6±4.1 | 70.4±2.0 | 88.0±5.7 | 2.4±3.2 | 44.8±7.8 | 92.8±3.0 |
| RVT (Goyal et al. 2023) | 62.9 | 5.3 | 52.0±2.5 | 99.2±1.6 | 11.2±3.0 | 88.0±2.5 | 71.2±6.9 | 4.0±2.5 | 91.0±5.2 | 100±0.0 |
| Act3D (Gervet et al. 2023) | 65.0 | 5.3 | 92.0 | 92.0 | 27.0 | 94.0 | 93.0 | 3.0 | 80.0 | 99.0 |
| SAM-E (Zhang et al. 2024) | 70.6 | 2.9 | 82.4±3.6 | 100.0±0.0 | 18.4±4.6 | 95.2±3.3 | 95.2±5.2 | 0.0±0.0 | 94.4±4.6 | 100±0.0 |
| 3D Diffuser Actor (Ke et al. 2024) | 81.3 | 2.8 | 96.0±2.5 | 100.0±0.0 | 65.6±4.1 | 96.8±1.6 | 89.6±4.1 | 24.0±7.6 | 93.6±4.8 | 98.4±2.0 |
| RVT-2 (Goyal et al. 2024) | 81.4 | 2.8 | 100.0±0.0 | 99.0±1.7 | 40.0±0.0 | 99.0±1.7 | 74.0±11.8 | 38.0±4.5 | 95.0±3.3 | 100±0.0 |
| SpatialActor (Ours) | 87.4±0.8 | 2.3 | 94.0±4.2 | 100.0±0.0 | 93.3±4.8 | 98.7±2.1 | 82.0±3.3 | 56.7±8.5 | 94.7±4.8 | 100.0±0.0 |

| Models | Put in Cupboard | Put in Drawer | Put in Safe | Screw Bulb | Slide Block | Sort Shape | Stack Blocks | Stack Cups | Sweep to Dustpan | Turn Tap |
|---|---|---|---|---|---|---|---|---|---|---|
| C2F-ARM-BC (James et al. 2022) | 0.0 | 4.0 | 12.0 | 8.0 | 16.0 | 8.0 | 0.0 | 0.0 | 0.0 | 68.0 |
| HiveFormer (Guhur et al. 2023) | 32.0 | 68.0 | 76.0 | 8.0 | 64.0 | 8.0 | 8.0 | 0.0 | 28.0 | 80.0 |
| PolarNet (Chen et al. 2023) | 12.0 | 32.0 | 84.0 | 44.0 | 56.0 | 12.0 | 4.0 | 8.0 | 52.0 | 80.0 |
| PerAct (Shridhar, Manuelli, and Fox 2023) | 28.0±4.4 | 51.2±4.7 | 84.0±3.6 | 17.6±2.0 | 74.0±13.0 | 16.8±4.7 | 26.4±3.2 | 2.4±2.0 | 52.0±0.0 | 88.0±4.4 |
| RVT (Goyal et al. 2023) | 49.6±3.2 | 88.0±5.7 | 91.2±3.0 | 48.0±5.7 | 81.6±5.4 | 36.0±2.5 | 28.8±3.9 | 26.4±8.2 | 72.0±0.0 | 93.6±4.1 |
| Act3D (Gervet et al. 2023) | 51.0 | 90.0 | 95.0 | 47.0 | 93.0 | 8.0 | 12.0 | 9.0 | 92.0 | 94.0 |
| SAM-E (Zhang et al. 2024) | 64.0±2.8 | 92.0±5.7 | 95.2±3.3 | 78.4±3.6 | 95.2±1.8 | 34.4±6.1 | 26.4±4.6 | 0.0±0.0 | 100.0±0.0 | 100.0±0.0 |
| 3D Diffuser Actor (Ke et al. 2024) | 85.6±4.1 | 96.0±3.6 | 97.6±2.0 | 82.4±2.0 | 97.6±3.2 | 44.0±4.4 | 68.3±3.3 | 47.2±8.5 | 84.0±4.4 | 99.2±1.6 |
| RVT-2 (Goyal et al. 2024) | 66.0±4.5 | 96.0±0.0 | 96.0±2.8 | 88.0±4.9 | 92.0±2.8 | 35.0±7.1 | 80.0±2.8 | 69.0±5.9 | 100±0.0 | 99.0±1.7 |
| SpatialActor (Ours) | 72.0±3.6 | 98.7±3.3 | 96.7±3.9 | 88.7±3.9 | 91.3±6.9 | 73.3±6.5 | 56.0±7.6 | 81.3±4.1 | 100.0±0.0 | 95.3±3.0 |
- Overall Performance: SpatialActor achieves an average success rate of 87.4%, surpassing the previous state-of-the-art RVT-2 (81.4%) and 3D Diffuser Actor (81.3%) by a significant margin of ~6.0%. It also has the best (lowest) average rank of 2.3.
- High-Precision Tasks: The most striking results are on tasks that demand high spatial precision.
  - On Insert Peg, SpatialActor achieves an incredible 93.3% success rate, whereas RVT-2 only manages 40.0%. This is a +53.3% improvement, strongly suggesting that the robust and precise spatial representation from SGM and SPT is highly effective for tasks requiring fine alignment.
  - On Sort Shape, SpatialActor achieves 73.3% compared to RVT-2's 35.0%, another massive gain (+38.3%). This task requires both shape recognition (semantics) and precise placement (geometry), and SpatialActor excels.
6.2. Data Presentation (Tables)
Robustness Under Noisy Conditions (Table 2)
This experiment is central to the paper's thesis. By injecting Gaussian noise into the point clouds derived from depth maps, the authors simulate real-world sensor imperfections.
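As an illustration of this kind of robustness test, the sketch below perturbs a depth map with zero-mean Gaussian noise. Note that the paper applies noise to the point clouds derived from depth, and its exact light/middle/heavy settings are not reproduced here; the sigma values below are placeholder assumptions:

```python
import numpy as np

def perturb_depth(depth: np.ndarray, sigma: float, rng=None) -> np.ndarray:
    """Add zero-mean Gaussian noise to a depth map to mimic sensor imperfections.

    sigma is the noise standard deviation in meters (placeholder values below).
    """
    rng = rng or np.random.default_rng()
    noisy = depth + rng.normal(0.0, sigma, size=depth.shape)
    return np.clip(noisy, a_min=0.0, a_max=None)  # depth cannot be negative

clean = np.full((120, 160), 0.7)           # toy flat depth map at 0.7 m
light = perturb_depth(clean, sigma=0.005)  # assumed "light" noise level
heavy = perturb_depth(clean, sigma=0.03)   # assumed "heavy" noise level
```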
The following are the results from Table 2 of the original paper:
| Models | Noise type | Avg. Success ↑ | Close Jar | Drag Stick | Insert Peg | Meat off Grill | Open Drawer | Place Cups | Place Wine | Push Buttons |
|---|---|---|---|---|---|---|---|---|---|---|
| RVT-2 | Light | 72.5±0.5 | 92.0±4.0 | 100.0±0.0 | 6.7±4.6 | 100.0±0.0 | 82.7±10.1 | 25.3±6.1 | 96.0±4.0 | 74.7±8.3 |
| SpatialActor (Ours) | Light | 86.4±0.4 | 97.3±2.3 | 98.7±2.3 | 94.7±6.1 | 96.0±0.0 | 73.3±10.1 | 54.7±8.3 | 92.0±4.0 | 98.7±2.3 |
| RVT-2 | Middle | 68.4±0.9 | 85.3±2.3 | 100.0±0.0 | 2.7±2.3 | 94.7±2.3 | 82.7±11.5 | 20.0±0.0 | 89.3±4.6 | 73.3±4.6 |
| SpatialActor (Ours) | Middle | 85.3±0.9 | 100.0±0.0 | 98.7±2.3 | 81.3±6.1 | 96.0±4.0 | 78.7±8.3 | 45.3±10.1 | 89.3±4.6 | 97.3±4.6 |
| RVT-2 | Heavy | 57.0±0.9 | 49.3±6.1 | 94.7±4.6 | 0.0±0.0 | 97.3±2.3 | 86.7±2.3 | 8.0±4.0 | 86.7±2.3 | 64.0±4.0 |
| SpatialActor (Ours) | Heavy | 76.4±0.5 | 82.7±2.3 | 98.7±2.3 | 61.3±6.1 | 100.0±0.0 | 80.0±4.0 | 21.3±4.6 | 92.0±0.0 | 92.0±4.6 |

| Models | Noise type | Put in Cupboard | Put in Drawer | Put in Safe | Screw Bulb | Slide Block | Sort Shape | Stack Blocks | Stack Cups | Sweep to Dustpan | Turn Tap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RVT-2 | Light | 57.3±2.3 | 100.0±0.0 | 92.0±4.0 | 81.3±6.1 | 62.7±23.1 | 46.7±6.1 | 53.3±2.3 | 45.3±2.3 | 96.0±6.9 | 93.3±4.6 |
| SpatialActor (Ours) | Light | 81.3±2.3 | 98.7±2.3 | 98.7±2.3 | 88.0±4.0 | 72.0±4.0 | 76.0±6.9 | 62.7±2.3 | 82.7±2.3 | 97.3±4.6 | 93.3±2.3 |
| RVT-2 | Middle | 50.7±12.2 | 98.7±2.3 | 98.7±2.3 | 76.0±4.0 | 57.3±2.3 | 38.7±10.1 | 45.3±12.2 | 25.3±6.1 | 96.0±4.0 | 96.0±6.9 |
| SpatialActor (Ours) | Middle | 74.7±6.1 | 100.0±0.0 | 94.7±2.3 | 88.0±4.0 | 81.3±15.1 | 76.0±4.0 | 58.7±6.1 | 77.3±8.3 | 100.0±0.0 | 97.3±2.3 |
| RVT-2 | Heavy | 20.0±6.9 | 97.3±2.3 | 93.3±2.3 | 58.7±2.3 | 57.3±8.3 | 13.3±6.1 | 13.3±6.1 | 1.3±2.3 | 92.0±0.0 | 92.0±4.0 |
| SpatialActor (Ours) | Heavy | 64.0±4.0 | 100.0±0.0 | 100.0±0.0 | 78.7±8.3 | 58.7±2.3 | 52.0±4.0 | 42.7±6.1 | 70.7±6.1 | 82.7±2.3 | 97.3±4.6 |
- RVT-2's performance degrades rapidly with noise: 81.4% (no noise) -> 72.5% (light) -> 68.4% (middle) -> 57.0% (heavy).
- SpatialActor demonstrates remarkable resilience: 87.4% (no noise) -> 86.4% (light) -> 85.3% (middle) -> 76.4% (heavy).
- The performance gap widens as noise increases. SpatialActor outperforms RVT-2 by +13.9% (light), +16.9% (middle), and +19.4% (heavy).
- On the Insert Peg task under heavy noise, RVT-2's performance drops to 0%, while SpatialActor maintains a respectable 61.3%. This is strong evidence that the disentangled representation and the SGM module are extremely effective at mitigating the effects of depth noise.
Few-Shot Generalization (Table 3)
This test evaluates how well a model can adapt to 19 new tasks after being trained on a different set of tasks, using only 10 demonstration videos for each new task.
The following are the results from Table 3 of the original paper:
| Models | Avg. Success ↑ | Close Laptop | Put Rubbish in Bin | Beat Buzz | Close Microwave | Put Shoes in Box | Get Ice | Change Clock | Close Box | Reach Target |
|---|---|---|---|---|---|---|---|---|---|---|
| RVT-2 | 46.9±1.5 | 76.0±6.1 | 10.3±5.1 | 47.4±8.5 | 61.7±9.8 | 7.4±4.3 | 93.7±3.9 | 72.6±2.8 | 49.1±8.6 | 12.0±5.7 |
| SpatialActor (Ours) | 79.2±2.7 | 90.0±7.5 | 100±0.0 | 92.0±2.5 | 95.3±11.4 | 25.3±13.8 | 96.0±2.5 | 83.3±7.3 | 95.3±4.7 | 86.0±2.2 |

| Models | Close Door | Remove Cups | Close Drawer | Spatula Scoop | Close Fridge | Put Knife on Board | Screw Nail | Close Grill | Plate in Rack | Meat on Grill |
|---|---|---|---|---|---|---|---|---|---|---|
| RVT-2 | 4.0±3.3 | 33.7±13.8 | 96.0±0.0 | 70.9±6.8 | 81.7±8.6 | 14.3±7.3 | 38.9±15.1 | 66.3±8.9 | 24.6±7.1 | 30.0±8.5 |
| SpatialActor (Ours) | 36.0±14.1 | 66.0±8.3 | 96.0±3.9 | 84.7±8.2 | 95.3±5.3 | 66.0±2.2 | 62.7±6.0 | 96.0±0.0 | 48.0±8.0 | 90.0±2.8 |
SpatialActor achieves an average success rate of 79.2%, while RVT-2 only gets 46.9%. This is a massive +32.3% improvement. This suggests that the robust and well-structured representations learned by SpatialActor provide a much better foundation for transferring knowledge to new, unseen tasks, requiring less data to adapt.
6.3. Ablation Studies / Parameter Analysis
The ablation study in Table 5 is crucial for validating the contribution of each component of SpatialActor. The baseline is RVT-2, which is an entangled model.
The following are the results from Table 5 of the original paper:
| Decouple | SGM | SPT | Avg. success on 18 tasks (no noise) ↑ | Avg. success on 18 tasks (heavy noise) ↑ |
|---|---|---|---|---|
|  |  |  | 81.4 | 57.0 |
| ✓ |  |  | 85.1 | 68.7 |
| ✓ | ✓ |  | 86.4 | 73.9 |
| ✓ | ✓ | ✓ | 87.4 | 76.4 |
- Baseline (RVT-2): 81.4% success (no noise), 57.0% (heavy noise).
- + Decouple: Simply disentangling the semantic and geometric streams (without SGM or SPT) already provides a huge boost. Performance rises to 85.1% (+3.7%) with no noise and 68.7% (+11.7%) with heavy noise. This confirms that preventing noise from the depth channel from corrupting semantic features is highly effective.
- + SGM: Adding the Semantic-guided Geometric Module further improves performance to 86.4% (+1.3%) and 73.9% (+5.2%). The improvement is much larger under heavy noise, showing that the SGM's ability to fuse expert priors with raw depth is key to its noise robustness.
- + SPT (Full SpatialActor): Finally, adding the Spatial Transformer brings performance to its peak at 87.4% (+1.0%) and 76.4% (+2.5%). The SPT contributes by enabling more precise spatial reasoning through its explicit modeling of low-level 3D positional cues.

This step-by-step analysis clearly demonstrates that each proposed component contributes meaningfully to the model's final performance and robustness.
Real-World Evaluation
The real-world experiments (Table 6) confirm that the benefits observed in simulation translate to a physical robot. SpatialActor achieves an overall success rate of 63% across 8 tasks (15 variations), compared to 43% for RVT-2. This +20% absolute improvement in the real world is a strong validation of the method's practical utility. The qualitative results in Figures 5, 6, and 7 further show SpatialActor's superior stability and precision in grasping and placement compared to RVT-2.
The following figure (Figure 5 from the original paper) shows the model's generalization performance.
This figure is a schematic showing the evaluation of SpatialActor's robustness under different conditions, including the default setting and variations in the manipulated object, the receiving object, lighting, and background. The bar chart on the right shows the performance score under each condition, with a maximum of 80.
The following figure (Figure 6 from the original paper) shows a qualitative comparison on a real-world task.
This figure is a schematic comparing RVT-2 and the proposed method on an object-grasping task. The top row shows an RVT-2 rollout that lacks precision and is labeled "imprecise". The bottom row shows our model on the same task, demonstrating better grasping success and precision.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully identifies a critical weakness in existing image-based robotic manipulation methods: their vulnerability to real-world sensor noise due to entangled semantic-geometric representations. The proposed SpatialActor framework provides an elegant and effective solution by disentangling these two information streams.
The core contributions—the Semantic-guided Geometric Module (SGM) for building a robust geometric representation and the Spatial Transformer (SPT) for precise spatial reasoning—are shown to be highly effective. SpatialActor not only sets a new state-of-the-art on the RLBench benchmark but, more importantly, demonstrates vastly superior robustness to noise, better generalization to new tasks, and successful deployment on a real robot. The work makes a strong case that disentangled spatial representations are a crucial step towards building more reliable and general-purpose robotic systems.
7.2. Limitations & Future Work
The authors candidly discuss failure cases in the supplementary material (Figure 8), which point to several limitations and directions for future work:
- Instruction Understanding Errors: The model sometimes misunderstands complex or ambiguous language instructions (e.g., opening the wrong drawer). This suggests that the current VLM, while powerful, could be improved. Integrating more advanced Large Language Models (LLMs) could enhance high-level reasoning and planning.
- Long-Horizon Task Failures: The model can fail in tasks requiring a long sequence of steps (e.g., stalling after placing a few cups). This points to a need for better long-term memory or planning capabilities, such as an episodic memory module.
- Semantic Ambiguity: The model can get confused when multiple similar-looking objects are present (e.g., stacking the wrong cup). This is a semantic grounding problem that could be addressed with more sophisticated attention mechanisms.
- Susceptibility to Distractors: In the real world, background clutter can sometimes distract the policy. This highlights the need for more robust attention filtering.
7.3. Personal Insights & Critique
This paper is an excellent piece of research with clear motivation, a well-designed methodology, and comprehensive experiments that strongly support its claims.
Strengths:
- Problem-Driven Innovation: The work is motivated by a real, practical problem (sensor noise) rather than just chasing incremental benchmark scores. The solution (disentanglement) is a direct and principled response to this problem.
- Clever Use of Foundation Models: The way SpatialActor uses the pre-trained depth expert is very clever. Instead of just using it as a black box, it fuses its robust but coarse output with the fine-grained but noisy raw data, getting the best of both worlds. This is a great example of how to creatively leverage existing powerful models.
- Rigorous Evaluation: The experimental design is top-notch. The combination of standard benchmarks, targeted noise/perturbation tests, few-shot generalization experiments, and real-world validation provides a convincing and multi-faceted case for the method's superiority.
Potential Areas for Improvement/Critique:
- Computational Cost: The model uses multiple large networks (a VLM, a depth expert, plus its own encoders and transformer). The computational overhead of this framework might be significant, potentially limiting its application in resource-constrained or real-time critical scenarios. The paper does not discuss inference speed or computational costs.
- Complexity of the System: While effective, the system has many moving parts. This could make it harder to debug, tune, and deploy compared to simpler end-to-end models.
- Scalability to More Complex Scenes: The experiments are conducted in tabletop scenarios. While diverse, it remains to be seen how the approach scales to more cluttered, dynamic, and unstructured environments like a home kitchen or a warehouse.

Overall, SpatialActor presents a compelling architectural paradigm for robotic manipulation. The principle of disentanglement for robustness is a powerful idea that could inspire many future works in robotics and other fields where multi-modal fusion is critical. The paper is a significant step forward in building robotic systems that can operate reliably outside the pristine conditions of a lab.