
GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

Published: 09/01/2023

TL;DR Summary

GNFactor proposes a behavior cloning agent using Generalizable Neural Feature Fields, enhancing robots' multi-task manipulation in complex environments by jointly optimizing a reconstruction module and a decision-making module. It significantly improves 3D structure understanding and semantic comprehension, outperforming state-of-the-art baselines on both seen and unseen tasks.

Abstract

It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments. To achieve this goal, the robot needs to have a comprehensive understanding of the 3D structure and semantics of the scene. In this work, we present GNFactor, a visual behavior cloning agent for multi-task robotic manipulation with Generalizable Neural feature Fields. GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model (e.g., Stable Diffusion) to distill rich semantic information into the deep 3D voxel. We evaluate GNFactor on 3 real robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor. Our project website is https://yanjieze.com/GNFactor/ .

In-depth Reading

1. Bibliographic Information

1.1. Title

GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

1.2. Authors

Yanjie Ze (Shanghai Jiao Tong University), Ge Yan (UC San Diego), Yueh-Hua Wu (UC San Diego), Annabella Macaluso (UC San Diego), Yuying Ge (University of Hong Kong), Jianglong Ye (UC San Diego), Nicklas Hansen (UC San Diego), Li Erran Li (AWS AI, Amazon), and Xiaolong Wang (UC San Diego).

1.3. Journal/Conference

This paper was published at the 7th Conference on Robot Learning (CoRL 2023). CoRL is a premier, highly selective international conference focusing on the intersection of robotics and machine learning.

1.4. Publication Year

2023 (First version on arXiv: August 31, 2023).

1.5. Abstract

Developing robots that can handle diverse manipulation tasks in unstructured, real-world environments is a major challenge. To succeed, robots need a deep understanding of 3D geometry and semantics. The authors present GNFactor, a behavior cloning agent that uses a shared 3D voxel representation to bridge 3D reconstruction and decision-making. By incorporating features from vision-language foundation models (like Stable Diffusion) into a Generalizable Neural Feature Field (GNF), the model gains rich semantic knowledge. Evaluated on 3 real robot tasks and 10 RLBench simulation tasks, GNFactor significantly outperforms existing state-of-the-art methods in both seen and unseen scenarios.

2. Executive Summary

2.1. Background & Motivation

In the field of robotics, enabling a robot to perform multiple tasks in different environments using only a few human demonstrations is a "holy grail." Most current methods struggle with generalization—the ability to perform a task when the object's color, size, or position changes, or when the environment is new.

The core problem identified by the authors is that standard robot learning agents lack a comprehensive 3D and semantic understanding of their surroundings.

  • 2D methods often fail to understand depth, occlusion, and spatial relationships.

  • Existing 3D methods (like basic point clouds or voxels) capture shape but lack "meaning" (semantics). For example, a robot might see a "curved shape" but not know it is a "handle" that needs to be "pulled."

    The authors' entry point is to combine the geometric strengths of Neural Radiance Fields (NeRF) with the semantic strengths of Vision-Language Models (VLM).

2.2. Main Contributions / Findings

  1. GNFactor Model: A novel architecture that jointly trains a reconstruction module (for 3D/semantic understanding) and a policy module (for action prediction).

  2. Generalizable Neural Feature Fields (GNF): Unlike standard NeRFs, which require hours of training for a single scene, GNF is "generalizable," meaning it can infer the 3D structure of new, unseen scenes in a single forward pass.

  3. Foundation Model Distillation: The model distills features from Stable Diffusion into a 3D voxel grid, giving the robot an "innate" understanding of what objects are.

  4. Superior Performance: The model achieved a 1.55x improvement in multi-task simulation and successfully handled complex real-world kitchen tasks with as few as 5 demonstrations.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Neural Radiance Fields (NeRF)

Neural Radiance Fields (NeRF) is a technique for generating new views of a 3D scene from a few 2D images. It represents a scene as a continuous function, usually a neural network, that takes a 3D coordinate $(x, y, z)$ and a viewing direction $(\theta, \phi)$ as input and outputs the color and density at that point. By "rendering" rays through this field, we can create photorealistic images.
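
To make the idea concrete, here is a minimal, illustrative PyTorch sketch of a NeRF-style field: a small MLP that maps a 3D point and a viewing direction to a density and a color. This is a toy example of the general concept, not GNFactor's actual network.

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Toy NeRF-style field: (x, y, z, view direction) -> (density, RGB color)."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)  # sigma(x), view-independent
        self.color_head = nn.Sequential(          # c(x, d), view-dependent
            nn.Linear(hidden + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor):
        h = self.trunk(xyz)                                   # (N, hidden)
        sigma = torch.relu(self.density_head(h))              # density >= 0
        rgb = torch.sigmoid(self.color_head(torch.cat([h, view_dir], dim=-1)))  # colors in [0, 1]
        return sigma, rgb
```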

3.1.2. Behavior Cloning (BC)

Behavior Cloning is a form of imitation learning where a robot tries to mimic a human expert. The robot is given a dataset of "observations" (images) and "actions" (how the expert moved the arm). The goal is to learn a policy $\pi(a|o)$ that predicts the correct action $a$ given the observation $o$.
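
The following is a minimal sketch of one behavior-cloning update with discretized actions (the action space GNFactor and PerAct use); the `policy` network and the tensors are generic placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

def bc_training_step(policy: nn.Module, optimizer: torch.optim.Optimizer,
                     observations: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One supervised step: push pi(a | o) toward the expert's (discretized) action."""
    logits = policy(observations)                              # predicted action logits
    loss = nn.functional.cross_entropy(logits, expert_actions) # match expert action indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```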

3.1.3. Voxels

A voxel (volume element) is the 3D equivalent of a 2D pixel. Imagine a 3D grid of small cubes; each cube (voxel) stores information like color, density, or abstract features. In this paper, a $100^3$ voxel grid is used as the "brain's" map of the workspace.

3.1.4. Vision-Language Models (VLM) and Distillation

Models like CLIP or Stable Diffusion are trained on billions of images and text descriptions. They have a "semantic" understanding of the world. Distillation is the process of taking the knowledge (features) from these massive models and "teaching" a smaller or different model (like a robot's 3D grid) to recognize those same patterns.

3.2. Previous Works

The paper builds on Perceiver-Actor (PerAct), which uses a Perceiver Transformer to process voxel data.

  • Formula - Attention Mechanism: The attention mechanism used in Transformers (like the Perceiver) is defined as $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $, where $Q$ (Query), $K$ (Key), and $V$ (Value) are vectors and $d_k$ is the key dimension. This allows the robot to "attend" to specific parts of the 3D scene (like a handle) while ignoring irrelevant parts (like the floor); a small sketch follows below.
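
A plain PyTorch sketch of the scaled dot-product attention formula above (shapes are illustrative):

```python
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Q, K, V: (batch, seq_len, d_k). Returns the attended values."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k)
    weights = torch.softmax(scores, dim=-1)         # attention weights sum to 1 over keys
    return weights @ V
```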

3.3. Technological Evolution

  1. Stage 1: Learning from 2D images (struggles with 3D geometry).

  2. Stage 2: Learning from 3D point clouds/voxels (better geometry, lacks semantics).

  3. Stage 3 (Current): Semantic 3D representations. GNFactor is a leader here, using NeRF-style rendering to force the 3D representation to be accurate and semantic.


4. Methodology

4.1. Principles

The core idea is Joint Optimization. Usually, you train a visual model first and then a robot policy. GNFactor trains them together. The vision part tries to reconstruct the scene, and the policy part tries to predict the expert's action. Because they share the same 3D "Voxel Encoder," the encoder is forced to learn features that are useful for both seeing the world and acting in it.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. The Voxel Encoder

The process starts with an RGB-D image (Red, Green, Blue + Depth).

  1. The image is projected into the 3D workspace to create a $100 \times 100 \times 100$ voxel grid.
  2. A 3D UNet (a type of neural network designed for volumetric data) processes this grid.
  3. Output: a deep volumetric representation $v \in \mathbb{R}^{100^3 \times 128}$, i.e., every "cube" in the 3D grid now has a 128-dimensional vector describing what is inside it. A simplified sketch of this pipeline follows below.
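
Below is a hedged sketch of the voxelization-plus-encoder step. The workspace bounds, feature dimensions, and the small Conv3d stack standing in for the 3D UNet are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def voxelize(points_xyz: torch.Tensor, features: torch.Tensor,
             grid_size: int = 100, workspace_min: float = -0.5,
             workspace_max: float = 0.5) -> torch.Tensor:
    """points_xyz: (N, 3) points from the RGB-D back-projection; features: (N, C).
    Returns a (C, G, G, G) feature grid."""
    C = features.shape[1]
    grid = torch.zeros(C, grid_size, grid_size, grid_size)
    # Map each point to an integer voxel index inside the (assumed) workspace bounds.
    idx = ((points_xyz - workspace_min) / (workspace_max - workspace_min) * grid_size).long()
    idx = idx.clamp(0, grid_size - 1)
    grid[:, idx[:, 0], idx[:, 1], idx[:, 2]] = features.T  # last point in a voxel wins (simplification)
    return grid

# Stand-in for the 3D UNet: maps per-voxel RGB features to a 128-dim deep representation.
encoder = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
)
# usage: deep_volume = encoder(voxelize(points, colors).unsqueeze(0))  # (1, 128, 100, 100, 100)
```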

4.2.2. The GNF Module (3D Reconstruction)

The GNF module ensures the voxel representation vv is geometrically and semantically accurate. It does this by attempting to "reconstruct" the scene from different angles using volumetric rendering.

For a pixel's camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, where $\mathbf{o}$ is the camera origin, $\mathbf{d}$ is the ray direction, and $t$ is the distance along the ray, the estimated color $\hat{\mathbf{C}}$ and semantic embedding $\hat{\mathbf{F}}$ are calculated as:

$$ \hat{\mathbf{C}}(\mathbf{r}, v) = \int_{t_n}^{t_f} T(t) \, \sigma(\mathbf{r}(t), v_{\mathbf{x}(t)}) \, \mathbf{c}(\mathbf{r}(t), \mathbf{d}, v_{\mathbf{x}(t)}) \, \mathrm{d}t $$

$$ \hat{\mathbf{F}}(\mathbf{r}, v) = \int_{t_n}^{t_f} T(t) \, \sigma(\mathbf{r}(t), v_{\mathbf{x}(t)}) \, \mathbf{f}(\mathbf{r}(t), \mathbf{d}, v_{\mathbf{x}(t)}) \, \mathrm{d}t $$

Explanation of Symbols:

  • $\sigma$: the density at a 3D point (how "solid" the point is).

  • $\mathbf{c}$: the predicted color at that point.

  • $\mathbf{f}$: the predicted semantic feature (matched against Stable Diffusion features).

  • $T(t)$: the "transmittance," defined as $T(t) = \exp\left( - \int_{t_n}^{t} \sigma(s) \, \mathrm{d}s \right)$. It represents the probability that the ray travels to distance $t$ without hitting anything.

    The model is trained by minimizing the reconstruction loss: $$ \mathcal{L}_{\mathrm{recon}} = \sum_{\mathbf{r} \in \mathcal{R}} \|\mathbf{C}(\mathbf{r}) - \hat{\mathbf{C}}(\mathbf{r})\|_2^2 + \lambda_{\mathrm{feat}} \|\mathbf{F}(\mathbf{r}) - \hat{\mathbf{F}}(\mathbf{r})\|_2^2 $$ where $\mathbf{C}(\mathbf{r})$ and $\mathbf{F}(\mathbf{r})$ are the ground-truth color and Stable Diffusion features for that camera ray. A discretized sketch of this rendering and loss appears below.
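
In practice the integrals are approximated by sampling points along each ray. The sketch below shows the standard discrete quadrature (alpha compositing) together with a per-ray reconstruction loss; the sampled densities, colors, and features are assumed to come from a GNF decoder, and the $\lambda_{\mathrm{feat}}$ value here is illustrative, not the paper's setting.

```python
import torch

def render_ray(sigma: torch.Tensor, color: torch.Tensor, feat: torch.Tensor,
               deltas: torch.Tensor):
    """sigma: (S,) densities, color: (S, 3), feat: (S, D), deltas: (S,) sample spacings."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                  # opacity of each ray segment
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): probability the ray reaches sample i.
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = T * alpha                                       # contribution of each sample
    C_hat = (weights[:, None] * color).sum(dim=0)             # rendered color
    F_hat = (weights[:, None] * feat).sum(dim=0)              # rendered semantic feature
    return C_hat, F_hat

def recon_loss(C_hat, C_gt, F_hat, F_gt, lambda_feat: float = 0.01):
    """Per-ray version of L_recon; the full loss sums this over the sampled ray batch."""
    return ((C_gt - C_hat) ** 2).sum() + lambda_feat * ((F_gt - F_hat) ** 2).sum()
```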

4.2.3. The Policy Module (Action Prediction)

Simultaneously, the same voxel representation $v$ is passed to a Perceiver Transformer.

  1. The $100^3$ voxel grid is downsampled to $20^3$ and flattened into a sequence.
  2. The robot's current state (proprioception) and the language instruction (e.g., "open the drawer") are appended to this sequence.
  3. The Transformer processes this sequence and outputs $Q$-values for each action component:
    • Translation $Q_{\mathrm{trans}}$: which voxel should the gripper move to?

    • Rotation $Q_{\mathrm{rot}}$: how should the gripper be oriented?

    • Gripper $Q_{\mathrm{open}}$: should the gripper be open or closed?

    • Collision $Q_{\mathrm{collide}}$: should the motion planner avoid contact with the scene?

      The action loss is a cross-entropy loss (commonly used in classification): $$ \mathcal{L}_{\mathrm{action}} = -\mathbb{E}_{Y_{\mathrm{trans}}}[\log \mathcal{V}_{\mathrm{trans}}] - \mathbb{E}_{Y_{\mathrm{rot}}}[\log \mathcal{V}_{\mathrm{rot}}] - \mathbb{E}_{Y_{\mathrm{open}}}[\log \mathcal{V}_{\mathrm{open}}] - \mathbb{E}_{Y_{\mathrm{collide}}}[\log \mathcal{V}_{\mathrm{collide}}] $$ where $\mathcal{V}_i = \mathrm{softmax}(Q_i)$ and $Y_i$ is the ground-truth action from the expert demonstration. A minimal sketch of this loss follows below.
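
A minimal sketch of the action loss, treating each action component as classification over its discretized bins. Tensor shapes and bin counts are illustrative, and rotation is simplified to a single classification head rather than per-axis bins.

```python
import torch
import torch.nn.functional as F

def action_loss(q_trans, q_rot, q_open, q_collide,
                y_trans, y_rot, y_open, y_collide) -> torch.Tensor:
    """q_*: unnormalized logits from the Perceiver; y_*: ground-truth indices from the demo."""
    return (F.cross_entropy(q_trans.flatten(1), y_trans)   # which of the 20^3 voxels to move to
            + F.cross_entropy(q_rot, y_rot)                 # discretized rotation bin
            + F.cross_entropy(q_open, y_open)               # gripper open vs. closed
            + F.cross_entropy(q_collide, y_collide))        # allow collision or not
```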

4.2.4. Total Loss

The final objective function that trains the entire system end-to-end is $$ \mathcal{L}_{\mathrm{GNFactor}} = \mathcal{L}_{\mathrm{action}} + \lambda_{\mathrm{recon}} \mathcal{L}_{\mathrm{recon}} $$ where $\lambda_{\mathrm{recon}}$ is a weight that balances the importance of "seeing correctly" versus "acting correctly." A schematic joint-update sketch follows below.
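
The sketch below illustrates one joint update under assumed interfaces: both branches consume the same shared voxel features, so a single backward pass shapes the representation for reconstruction and control at once. The two loss callables and the $\lambda_{\mathrm{recon}}$ value are placeholders, not the paper's implementation.

```python
import torch

def joint_step(shared_volume: torch.Tensor,
               render_and_recon_loss, predict_and_action_loss,
               optimizer: torch.optim.Optimizer, lambda_recon: float = 0.01) -> float:
    """One end-to-end update; `optimizer` is assumed to hold encoder, renderer, and policy params."""
    L_recon = render_and_recon_loss(shared_volume)    # "see the world correctly"
    L_action = predict_and_action_loss(shared_volume) # "act like the expert"
    loss = L_action + lambda_recon * L_recon          # lambda_recon value is illustrative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```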

The following figure (Figure 3 from the original paper) shows the system architecture:

Figure 3: Overview of GNFactor. GNFactor takes an RGB-D image as input and encodes it with a voxel encoder into a deep 3D volumetric feature. This volume is shared by two modules: volumetric rendering (Renderer) and robot action prediction (Perceiver). The two modules are trained jointly, optimizing the shared features both to reconstruct vision-language embeddings (Diffusion features) and other views (RGB) and to estimate accurate $Q$-values ($Q_{\mathrm{trans}}$, $Q_{\mathrm{rot}}$, $Q_{\mathrm{collide}}$, $Q_{\mathrm{open}}$) for multi-task robotic manipulation. The task description is encoded with CLIP [51] to obtain the task embedding $T$.


5. Experimental Setup

5.1. Datasets

  1. RLBench (Simulation): 10 complex tasks (e.g., close jar, meat off grill, stack blocks). There are 166 total variations (different colors, sizes, placements).
  2. Real Robot: An xArm7 robot in two different toy kitchens. Tasks include open microwave door, turn faucet, and relocate teapot.
  3. Data Samples:
    • Simulation: 20 demonstrations per task.
    • Real World: Only 5 demonstrations per task (extremely low data).

5.2. Evaluation Metrics

5.2.1. Success Rate

  1. Conceptual Definition: The percentage of episodes where the robot successfully completes the task instruction (e.g., the drawer is pulled open).
  2. Mathematical Formula: $ \text{Success Rate} = \frac{1}{N} \sum_{i=1}^{N} S_i $
  3. Symbol Explanation:
    • $N$: the total number of test trials (episodes).
    • $S_i$: a binary indicator, with $S_i = 1$ if the $i$-th trial was successful and $S_i = 0$ otherwise (a small computation example follows below).
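
For example, computing the metric from a list of binary episode outcomes:

```python
def success_rate(outcomes: list[int]) -> float:
    """outcomes: 1 for a successful episode, 0 otherwise."""
    return sum(outcomes) / len(outcomes)

print(success_rate([1, 0, 1, 1, 0]))  # 0.6
```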

5.3. Baselines

  • PerAct (Perceiver Actor): The current state-of-the-art that uses voxels and transformers but lacks the generalizable neural field and foundation model distillation.

  • PerAct (4 Cameras): A version of the baseline given 4 camera views to see if more data alone helps.


6. Results & Analysis

6.1. Core Results Analysis

  • Simulation Mastery: GNFactor outperformed PerAct significantly. For example, in the sweep to dustpan task, PerAct had a 0% success rate, while GNFactor reached 28%.
  • Generalization: When tested on unseen tasks (e.g., larger blocks or new positions), GNFactor maintained much higher success rates (28.3% vs 18.0% for PerAct).
  • Real World: In the "teapot" task, which requires precise 3D positioning, GNFactor achieved 40% success with only 5 demos, while PerAct failed completely (0%).

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Success rates in % (mean ± std); the 10 tasks are split across two sub-tables, and the Average covers all 10 tasks.

| Method | close jar | open drawer | sweep to dustpan | meat off grill | turn tap |
|---|---|---|---|---|---|
| PerAct | 18.7±8.2 | 54.7±18.6 | 0.0±0.0 | 40.0±17.0 | 38.7±6.8 |
| PerAct (4 Cameras) | 21.3±7.5 | 44.0±11.3 | 0.0±0.0 | 65.3±13.2 | 46.7±3.8 |
| GNFactor | 25.3±6.8 | 76.0±5.7 | 28.0±15.0 | 57.3±18.9 | 50.7±8.2 |

| Method | slide block | put in drawer | drag stick | push buttons | stack blocks | Average |
|---|---|---|---|---|---|---|
| PerAct | 18.7±13.6 | 2.7±3.3 | 5.3±5.0 | 18.7±12.4 | 6.7±1.9 | 20.4 |
| PerAct (4 Cameras) | 16.0±14.2 | 6.7±6.8 | 12.0±3.3 | 9.3±1.9 | 5.3±1.9 | 22.7 |
| GNFactor | 20.0±15.0 | 0.0±0.0 | 37.3±13.2 | 18.7±10.0 | 4.0±3.3 | 31.7 |

The following are the results from Table 3 (Real Robot) of the original paper:

Success rates in %; the Average covers all 12 task settings.

| Method | door (1) | faucet (1) | teapot (1) | door (1,d) | faucet (1,d) | teapot (1,d) |
|---|---|---|---|---|---|---|
| PerAct | 30 | 80 | 0 | 10 | 50 | 0 |
| GNFactor | 40 | 80 | 40 | 30 | 50 | 30 |

| Method | door (2) | faucet (2) | teapot (2) | door (2,d) | faucet (2,d) | teapot (2,d) | Average |
|---|---|---|---|---|---|---|---|
| PerAct | 10 | 50 | 0 | 10 | 30 | 0 | 22.5 |
| GNFactor | 50 | 70 | 40 | 20 | 40 | 30 | 43.3 |

6.3. Ablation Studies

The authors found that:

  1. w/o GNF Objective: Success dropped from 36.8% to 24.2%. This proves that forcing the model to "reconstruct" the scene is vital for learning how to act.

  2. Diffusion vs. Others: Using Stable Diffusion features worked better than using CLIP or DINO features, likely because diffusion features are highly detailed and pixel-aligned.


7. Conclusion & Reflections

7.1. Conclusion Summary

GNFactor demonstrates that robotics and vision are not separate problems. By forcing a robot's 3D representation to be capable of high-quality novel view synthesis (reconstruction) and semantic labeling (distillation), the robot becomes much more capable of understanding and manipulating its environment. The model is particularly strong in "few-shot" scenarios, where the robot only sees a task a handful of times.

7.2. Limitations & Future Work

  • Camera Setup: The training requires multiple cameras (around 3 in the real world, 19 in simulation) to provide multi-view supervision. While only one camera is needed at "test time," setting up many cameras for "training" is still a hurdle.
  • Static Scenes: The current GNF assumes the scene is static during the reconstruction phase. Handling dynamic, moving objects in the reconstruction module is a potential future direction.

7.3. Personal Insights & Critique

This paper is a brilliant example of Representation Learning. Instead of just trying to map "Pixels to Motor Torques," the authors focus on creating a "World Model" in the robot's head.

  • Inspiration: The idea that a robot can learn what a "teapot" is by looking at it from multiple angles and comparing it to a foundation model's knowledge is powerful.
  • Critique: While the 1.55x improvement is great, absolute success rates (around 31-43%) are still relatively low for industrial applications. This suggests that while GNFactor is a step in the right direction, imitation learning still needs more efficient ways to handle the "long tail" of possible environmental changes.
