GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields
TL;DR Summary
GNFactor proposes a behavior cloning agent using Generalizable Neural Feature Fields, enhancing robots' multi-task manipulation in complex environments by jointly optimizing reconstruction and decision-making modules. It significantly improves 3D structure understanding and semantic comprehension.
Abstract
It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments. To achieve this goal, the robot needs to have a comprehensive understanding of the 3D structure and semantics of the scene. In this work, we present GNFactor, a visual behavior cloning agent for multi-task robotic manipulation with Generalizable Neural feature Fields. GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model (e.g., Stable Diffusion) to distill rich semantic information into the deep 3D voxel. We evaluate GNFactor on 3 real robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor. Our project website is https://yanjieze.com/GNFactor/ .
In-depth Reading
1. Bibliographic Information
1.1. Title
GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields
1.2. Authors
Yanjie Ze (Shanghai Jiao Tong University), Ge Yan (UC San Diego), Yueh-Hua Wu (UC San Diego), Annabella Macaluso (UC San Diego), Yuying Ge (University of Hong Kong), Jianglong Ye (UC San Diego), Nicklas Hansen (UC San Diego), Li Erran Li (AWS AI, Amazon), and Xiaolong Wang (UC San Diego).
1.3. Journal/Conference
This paper was published at the 7th Conference on Robot Learning (CoRL 2023). CoRL is a premier, highly selective international conference focusing on the intersection of robotics and machine learning.
1.4. Publication Year
2023 (First version on arXiv: August 31, 2023).
1.5. Abstract
Developing robots that can handle diverse manipulation tasks in unstructured, real-world environments is a major challenge. To succeed, robots need a deep understanding of 3D geometry and semantics. The authors present GNFactor, a behavior cloning agent that uses a shared 3D voxel representation to bridge 3D reconstruction and decision-making. By incorporating features from vision-language foundation models (like Stable Diffusion) into a Generalizable Neural Feature Field (GNF), the model gains rich semantic knowledge. Evaluated on 3 real robot tasks and 10 RLBench simulation tasks, GNFactor significantly outperforms existing state-of-the-art methods in both seen and unseen scenarios.
1.6. Original Source Link
- ArXiv Link: https://arxiv.org/abs/2308.16891
- Project Website: https://yanjieze.com/GNFactor/
- Publication Status: Published (CoRL 2023).
2. Executive Summary
2.1. Background & Motivation
In the field of robotics, enabling a robot to perform multiple tasks in different environments using only a few human demonstrations is a "holy grail." Most current methods struggle with generalization—the ability to perform a task when the object's color, size, or position changes, or when the environment is new.
The core problem identified by the authors is that standard robot learning agents lack a comprehensive 3D and semantic understanding of their surroundings.
- 2D methods often fail to understand depth, occlusion, and spatial relationships.
- Existing 3D methods (like basic point clouds or voxels) capture shape but lack "meaning" (semantics). For example, a robot might see a "curved shape" but not know it is a "handle" that needs to be "pulled."

The authors' entry point is to combine the geometric strengths of Neural Radiance Fields (NeRF) with the semantic strengths of Vision-Language Models (VLM).
2.2. Main Contributions / Findings
- GNFactor Model: A novel architecture that jointly trains a reconstruction module (for 3D/semantic understanding) and a policy module (for action prediction).
- Generalizable Neural Feature Fields (GNF): Unlike standard NeRFs, which require hours of training for a single scene, GNF is "generalizable," meaning it can infer the 3D structure of new, unseen scenes in a single forward pass.
- Foundation Model Distillation: The model distills features from Stable Diffusion into a 3D voxel grid, giving the robot an "innate" understanding of what objects are.
- Superior Performance: The model achieved a 1.55x improvement in multi-task simulation and successfully handled complex real-world kitchen tasks with as few as 5 demonstrations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Neural Radiance Fields (NeRF)
Neural Radiance Fields (NeRF) is a technology used to generate new views of a 3D scene from a few 2D images. It represents a scene as a continuous function, usually a neural network, that takes a 3D coordinate (x, y, z) and a viewing direction as input and outputs the color and density at that point. By "rendering" rays through this field, we can create photorealistic images.
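To make the idea concrete, here is a minimal, self-contained PyTorch sketch of a NeRF-style field: a small MLP that maps a 3D point and viewing direction to a color and a density. It is purely illustrative (layer sizes, activations, and the omission of positional encoding are simplifications), not the network used in the paper.

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Minimal NeRF-style MLP: maps a 3D point and a viewing direction
    to an RGB color and a volume density (sigma)."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)           # density
        self.color_head = nn.Sequential(                 # color depends on view direction
            nn.Linear(hidden + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor):
        h = self.trunk(xyz)
        sigma = torch.relu(self.sigma_head(h))           # density is non-negative
        rgb = self.color_head(torch.cat([h, view_dir], dim=-1))
        return rgb, sigma

# Query the field at 1024 random sample points with random view directions.
field = TinyRadianceField()
pts, dirs = torch.rand(1024, 3), torch.rand(1024, 3)
rgb, sigma = field(pts, dirs)
```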
3.1.2. Behavior Cloning (BC)
Behavior Cloning is a form of imitation learning where a robot tries to mimic a human expert. The robot is given a dataset of "observations" (images) and "actions" (how the expert moved the arm). The goal is to learn a policy $\pi$ that predicts the correct action $a$ given the observation $o$.
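A minimal behavior cloning loop might look as follows, assuming a toy discrete-action policy and randomly generated stand-ins for the expert dataset; the paper's actual policy and action space are far richer.

```python
import torch
import torch.nn as nn

# Hypothetical policy: maps a flattened observation to one of 10 discrete actions.
policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Toy stand-ins for expert observations and the actions the expert took.
demo_obs = torch.randn(64, 512)
demo_act = torch.randint(0, 10, (64,))

for step in range(100):
    logits = policy(demo_obs)           # predict actions from observations
    loss = loss_fn(logits, demo_act)    # penalize deviation from the expert's actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```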
3.1.3. Voxels
A voxel (volume element) is the 3D equivalent of a 2D pixel. Imagine a 3D grid of small cubes; each cube (voxel) stores information like color, density, or abstract features. In this paper, a voxel grid is used as the "brain's" map of the workspace.
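The snippet below sketches the basic voxelization idea: scattering a point cloud into a fixed-resolution occupancy grid. It is a simplification, since GNFactor stores deep feature vectors per voxel rather than a binary occupancy flag, and the workspace bounds and resolution here are made up.

```python
import numpy as np

def voxelize(points: np.ndarray, bounds: tuple, resolution: int = 100) -> np.ndarray:
    """Scatter a point cloud of shape (N, 3) into an occupancy grid of shape
    (resolution, resolution, resolution). `bounds` is (min_xyz, max_xyz)."""
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    # Normalize points into [0, 1) and convert to integer voxel indices.
    idx = ((points - lo) / (hi - lo) * resolution).astype(int)
    idx = np.clip(idx, 0, resolution - 1)
    grid = np.zeros((resolution, resolution, resolution), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0   # mark occupied cells
    return grid

cloud = np.random.uniform(-0.5, 0.5, size=(5000, 3))
occ = voxelize(cloud, bounds=((-0.5, -0.5, -0.5), (0.5, 0.5, 0.5)))
print(int(occ.sum()), "occupied voxels")
```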
3.1.4. Vision-Language Models (VLM) and Distillation
Models like CLIP or Stable Diffusion are trained on billions of images and text descriptions. They have a "semantic" understanding of the world. Distillation is the process of taking the knowledge (features) from these massive models and "teaching" a smaller or different model (like a robot's 3D grid) to recognize those same patterns.
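The following sketch illustrates the distillation pattern in the abstract: a frozen "teacher" (standing in for a foundation-model feature extractor) produces target features, and a small student is trained to match them with an MSE loss. The teacher here is a placeholder convolution, not an actual Stable Diffusion or CLIP model.

```python
import torch
import torch.nn as nn

# Placeholder "teacher": stands in for a frozen foundation-model feature
# extractor that outputs per-pixel semantic features.
teacher = nn.Conv2d(3, 64, kernel_size=1).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Student: a small network we want to imbue with the teacher's semantics.
student = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 64, 3, padding=1))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

images = torch.randn(8, 3, 64, 64)
with torch.no_grad():
    target_feats = teacher(images)        # "semantic" targets, never trained
pred_feats = student(images)
loss = nn.functional.mse_loss(pred_feats, target_feats)   # distillation loss
loss.backward()
optimizer.step()
```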
3.2. Previous Works
The paper builds on Perceiver-Actor (PerAct), which uses a Perceiver Transformer to process voxel data.
- Formula - Attention Mechanism: The Attention mechanism used in Transformers (like the Perceiver) is defined as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$ (Query), $K$ (Key), and $V$ (Value) are vectors and $d_k$ is the key dimension. This allows the robot to "attend" to specific parts of the 3D scene (like a handle) while ignoring irrelevant parts (like the floor); see the sketch below.
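Below is a short implementation of exactly this scaled dot-product attention, with toy query/key/value tensors; it is a generic sketch rather than the Perceiver's full cross-attention stack.

```python
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)             # attention distribution over keys
    return weights @ v                              # weighted sum of values

# Toy example: 4 query tokens attend over 16 "scene" tokens of width 32.
q, k, v = torch.randn(4, 32), torch.randn(16, 32), torch.randn(16, 32)
out = attention(q, k, v)    # shape (4, 32)
```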
3.3. Technological Evolution
- Stage 1: Learning from 2D images (struggles with 3D geometry).
- Stage 2: Learning from 3D point clouds/voxels (better geometry, lacks semantics).
- Stage 3 (Current): Semantic 3D representations. GNFactor is a leader here, using NeRF-style rendering to force the 3D representation to be both accurate and semantic.
4. Methodology
4.1. Principles
The core idea is Joint Optimization. Usually, you train a visual model first and then a robot policy. GNFactor trains them together. The vision part tries to reconstruct the scene, and the policy part tries to predict the expert's action. Because they share the same 3D "Voxel Encoder," the encoder is forced to learn features that are useful for both seeing the world and acting in it.
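The sketch below illustrates this joint-optimization pattern under toy assumptions: a shared 3D encoder feeds both a stand-in reconstruction head and a stand-in policy head, and both losses backpropagate into the same encoder. All shapes, modules, and the loss weight are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Shared 3D encoder feeding two heads. Because both losses flow into the same
# encoder, its features must serve both "seeing" and "acting".
encoder = nn.Sequential(nn.Conv3d(4, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv3d(32, 64, 3, padding=1))
recon_head = nn.Conv3d(64, 3, 1)                                       # stand-in for the GNF renderer
policy_head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 8 ** 3, 16))  # stand-in Q-values

params = (list(encoder.parameters()) + list(recon_head.parameters())
          + list(policy_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)

voxels = torch.randn(2, 4, 8, 8, 8)          # toy voxelized observations
target_rgb = torch.randn(2, 3, 8, 8, 8)      # toy reconstruction targets
target_action = torch.randint(0, 16, (2,))   # toy expert action labels

shared = encoder(voxels)
recon_loss = nn.functional.mse_loss(recon_head(shared), target_rgb)
action_loss = nn.functional.cross_entropy(policy_head(shared), target_action)
loss = action_loss + 0.01 * recon_loss       # joint objective (weight is illustrative)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```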
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. The Voxel Encoder
The process starts with an RGB-D image (Red, Green, Blue + Depth).
- The image is projected into a 3D workspace to create a voxel grid.
- A 3D UNet (a type of neural network designed for volumetric data) processes this grid.
- Output: A deep volumetric representation in which every "cube" (voxel) of the 3D grid now carries a 128-dimensional feature vector describing what is inside it (a minimal encoder sketch follows below).
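The sketch below uses a plain stack of 3D convolutions rather than the paper's 3D UNet (no skip connections), and the grid size and channel counts are illustrative only.

```python
import torch
import torch.nn as nn

class VoxelEncoder(nn.Module):
    """Minimal 3D convolutional encoder: takes a voxelized RGB-D observation
    (channels = RGB + occupancy here) and produces a deep feature per voxel."""
    def __init__(self, in_ch: int = 4, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (B, in_ch, D, H, W) -> features: (B, feat_dim, D, H, W)
        return self.net(voxels)

encoder = VoxelEncoder()
grid = torch.randn(1, 4, 32, 32, 32)   # small grid for the example
deep_volume = encoder(grid)            # shape (1, 128, 32, 32, 32)
```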
4.2.2. The GNF Module (3D Reconstruction)
The GNF module ensures the voxel representation is geometrically and semantically accurate. It does this by attempting to "reconstruct" the scene from different angles using volumetric rendering.
For a pixel's camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, where $\mathbf{o}$ is the camera origin, $\mathbf{d}$ is the ray direction, and $t$ is the distance along the ray, the estimated color and semantic embedding are calculated via volumetric rendering:

$ \hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt \qquad \hat{F}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{f}(\mathbf{r}(t))\, dt $

Explanation of Symbols:
- $\sigma$: The density at a 3D point (how "solid" the point is).
- $\mathbf{c}$: The predicted color at that point.
- $\mathbf{f}$: The predicted semantic feature (distilled from Stable Diffusion).
- $T(t)$: The "transmittance," defined as $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right)$. It represents the probability that the ray travels to distance $t$ without hitting anything.

The model is trained by minimizing the Reconstruction Loss: $ \mathcal{L}_{\mathrm{recon}} = \left\lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\rVert^2 + \lambda_{\mathrm{feat}} \left\lVert \hat{F}(\mathbf{r}) - F(\mathbf{r}) \right\rVert^2 $ where $C(\mathbf{r})$ and $F(\mathbf{r})$ are the "ground truth" color and Stable Diffusion features for that camera ray.
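The sketch below shows the discretized form of this rendering equation for a single ray: per-sample opacities, cumulative transmittance, and a weighted sum that produces the rendered color and semantic feature. Sample counts and feature dimensions are arbitrary toy values.

```python
import torch

def render_ray(sigma: torch.Tensor, rgb: torch.Tensor, feat: torch.Tensor,
               deltas: torch.Tensor):
    """Discretized volume rendering along one ray.
    sigma: (S,) densities, rgb: (S, 3) colors, feat: (S, D) semantic features,
    deltas: (S,) distances between consecutive samples."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # per-sample opacity
    # Transmittance T_i: probability the ray reaches sample i without being blocked.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = trans * alpha                                  # contribution of each sample
    color = (weights[:, None] * rgb).sum(dim=0)              # rendered pixel color
    feature = (weights[:, None] * feat).sum(dim=0)           # rendered semantic feature
    return color, feature

S, D = 64, 16
sigma, rgb, feat = torch.rand(S), torch.rand(S, 3), torch.rand(S, D)
deltas = torch.full((S,), 0.02)
c_hat, f_hat = render_ray(sigma, rgb, feat, deltas)
```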
4.2.3. The Policy Module (Action Prediction)
Simultaneously, the same deep voxel representation is passed to a Perceiver Transformer.
- The voxel grid is downsampled to a coarser resolution and flattened into a sequence of tokens.
- The robot's current state (proprioception) and the language instruction (e.g., "open the drawer") are added to this sequence.
- The Transformer processes this sequence to output a 3D grid of Q-values covering:
  - Translation: Which voxel should the gripper move to?
  - Rotation: How should the gripper be turned?
  - Gripper: Should it be open or closed?

The Action Loss is a cross-entropy loss (commonly used in classification), summed over these action components: $ \mathcal{L}_{\mathrm{action}} = -\sum_{i \in \{\mathrm{trans},\, \mathrm{rot},\, \mathrm{grip}\}} Y_i^{\top} \log \hat{Q}_i $ where $\hat{Q}_i$ is the predicted (softmax-normalized) distribution over discretized actions for component $i$ and $Y_i$ is the one-hot ground-truth action from the expert demonstration.
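A compact sketch of this loss is given below, with hypothetical head sizes and ground-truth indices; it simply sums one cross-entropy term per action component, which is the structure described above rather than the paper's exact head layout.

```python
import torch
import torch.nn.functional as F

# Toy Q-value logits for one observation; sizes are illustrative, not the paper's.
q_trans = torch.randn(1, 20 * 20 * 20)   # one logit per (downsampled) voxel
q_rot = torch.randn(1, 72)               # discretized rotation bins (single axis for brevity)
q_grip = torch.randn(1, 2)               # gripper open / closed

# Ground-truth class indices extracted from the expert demonstration (hypothetical).
y_trans, y_rot, y_grip = torch.tensor([4321]), torch.tensor([57]), torch.tensor([1])

# The action loss sums one cross-entropy term per action component.
action_loss = (F.cross_entropy(q_trans, y_trans)
               + F.cross_entropy(q_rot, y_rot)
               + F.cross_entropy(q_grip, y_grip))
```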
4.2.4. Total Loss
The final objective function that trains the entire system end-to-end is: $ \mathcal{L} = \mathcal{L}_{\mathrm{action}} + \lambda_{\mathrm{recon}} \mathcal{L}_{\mathrm{recon}} $ where $\lambda_{\mathrm{recon}}$ is a weight that balances the importance of "seeing correctly" versus "acting correctly."
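As a tiny illustration of combining the two terms, the weight value below is illustrative only and is not taken from the paper:

```python
import torch

# Stand-ins for the two losses computed as in the sketches above.
action_loss = torch.tensor(1.3, requires_grad=True)   # policy (action) loss
recon_loss = torch.tensor(0.7, requires_grad=True)    # GNF reconstruction loss

lambda_recon = 0.01   # illustrative balancing weight
total_loss = action_loss + lambda_recon * recon_loss
total_loss.backward()  # in the real system, gradients reach the shared voxel encoder
```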
The following figure (Figure 3 from the original paper) shows the system architecture:
This schematic illustrates the GNFactor workflow. GNFactor takes an RGB-D image as input and converts it into a deep 3D volumetric feature via a voxel encoder. This volume is shared by two modules: a volumetric renderer and an action-prediction Perceiver. In addition, the task description is processed by a language encoder, further supporting prediction and decision-making conditioned on the robot state. The overall pipeline is designed to optimize multi-task robotic manipulation performance.
5. Experimental Setup
5.1. Datasets
- RLBench (Simulation): 10 complex tasks (e.g., close jar, meat off grill, stack blocks). There are 166 total variations (different colors, sizes, placements).
- Real Robot: An xArm7 robot in two different toy kitchens. Tasks include open microwave door, turn faucet, and relocate teapot.
- Data Samples:
  - Simulation: 20 demonstrations per task.
  - Real World: Only 5 demonstrations per task (extremely low data).
5.2. Evaluation Metrics
5.2.1. Success Rate
- Conceptual Definition: The percentage of episodes where the robot successfully completes the task instruction (e.g., the drawer is pulled open).
- Mathematical Formula: $ \text{Success Rate} = \frac{1}{N} \sum_{i=1}^{N} S_i $
- Symbol Explanation:
  - $N$: The total number of test trials (episodes).
  - $S_i$: A binary indicator where $S_i = 1$ if the $i$-th trial was successful, and $S_i = 0$ otherwise (see the short example below).
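A minimal computation of this metric, with made-up episode outcomes:

```python
import numpy as np

# Outcomes of N evaluation episodes: 1 = success, 0 = failure (toy values).
outcomes = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
success_rate = outcomes.mean() * 100
print(f"Success rate: {success_rate:.1f}%")   # 60.0%
```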
5.3. Baselines
- PerAct (Perceiver Actor): The current state-of-the-art that uses voxels and transformers but lacks the generalizable neural field and foundation model distillation.
- PerAct (4 Cameras): A version of the baseline given 4 camera views, to test whether more data alone helps.
6. Results & Analysis
6.1. Core Results Analysis
- Simulation Mastery: GNFactor outperformed PerAct significantly. For example, in the sweep to dustpan task, PerAct had a 0% success rate, while GNFactor reached 28%.
- Generalization: When tested on unseen tasks (e.g., larger blocks or new positions), GNFactor maintained much higher success rates (28.3% vs. 18.0% for PerAct).
- Real World: In the "teapot" task, which requires precise 3D positioning, GNFactor achieved 40% success with only 5 demos, while PerAct failed completely (0%).
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
Tasks 1-5:

| Method | close jar | open drawer | sweep to dustpan | meat off grill | turn tap |
|---|---|---|---|---|---|
| PerAct | 18.7±8.2 | 54.7±18.6 | 0.0±0.0 | 40.0±17.0 | 38.7±6.8 |
| PerAct (4 Cameras) | 21.3±7.5 | 44.0±11.3 | 0.0±0.0 | 65.3±13.2 | 46.7±3.8 |
| GNFactor | 25.3±6.8 | 76.0±5.7 | 28.0±15.0 | 57.3±18.9 | 50.7±8.2 |

Tasks 6-10 and average:

| Method | slide block | put in drawer | drag stick | push buttons | stack blocks | Average |
|---|---|---|---|---|---|---|
| PerAct | 18.7±13.6 | 2.7±3.3 | 5.3±5.0 | 18.7±12.4 | 6.7±1.9 | 20.4 |
| PerAct (4 Cameras) | 16.0±14.2 | 6.7±6.8 | 12.0±3.3 | 9.3±1.9 | 5.3±1.9 | 22.7 |
| GNFactor | 20.0±15.0 | 0.0±0.0 | 37.3±13.2 | 18.7±10.0 | 4.0±3.3 | 31.7 |
The following are the results from Table 3 (Real Robot) of the original paper:
| Method | door (1) | faucet (1) | teapot (1) | door (1,d) | faucet (1,d) | teapot (1,d) |
|---|---|---|---|---|---|---|
| PerAct | 30 | 80 | 0 | 10 | 50 | 0 |
| GNFactor | 40 | 80 | 40 | 30 | 50 | 30 |

| Method | door (2) | faucet (2) | teapot (2) | door (2,d) | faucet (2,d) | teapot (2,d) | Average |
|---|---|---|---|---|---|---|---|
| PerAct | 10 | 50 | 0 | 10 | 30 | 0 | 22.5 |
| GNFactor | 50 | 70 | 40 | 20 | 40 | 30 | 43.3 |
6.3. Ablation Studies
The authors found that:
- w/o GNF Objective: Success dropped from 36.8% to 24.2%. This shows that forcing the model to "reconstruct" the scene is vital for learning how to act.
- Diffusion vs. Others: Using Stable Diffusion features worked better than using CLIP or DINO features, likely because diffusion features are highly detailed and pixel-aligned.
7. Conclusion & Reflections
7.1. Conclusion Summary
GNFactor demonstrates that robotics and vision are not separate problems. By forcing a robot's 3D representation to be capable of high-quality novel view synthesis (reconstruction) and semantic labeling (distillation), the robot becomes much more capable of understanding and manipulating its environment. The model is particularly strong in "few-shot" scenarios, where the robot only sees a task a handful of times.
7.2. Limitations & Future Work
- Camera Setup: The training requires multiple cameras (around 3 in the real world, 19 in simulation) to provide multi-view supervision. While only one camera is needed at "test time," setting up many cameras for "training" is still a hurdle.
- Static Scenes: The current GNF assumes the scene is static during the reconstruction phase. Handling dynamic, moving objects in the reconstruction module is a potential future direction.
7.3. Personal Insights & Critique
This paper is a brilliant example of Representation Learning. Instead of just trying to map "Pixels to Motor Torques," the authors focus on creating a "World Model" in the robot's head.
- Inspiration: The idea that a robot can learn what a "teapot" is by looking at it from multiple angles and comparing it to a foundation model's knowledge is powerful.
- Critique: While the 1.55x improvement is impressive, absolute success rates (around 31-43%) are still relatively low for industrial applications. This suggests that while GNFactor is a step in the right direction, imitation learning still needs more efficient ways to handle the "long tail" of possible environmental changes.