
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Published: November 29, 2024

TL;DR Summary

CogACT is a novel Vision-Language-Action model that enhances robotic manipulation by decoupling cognition and action. Its componentized architecture utilizes a powerful Vision-Language Model for comprehension and a Diffusion Action Module for precise control, significantly outperforming similar-sized OpenVLA and the much larger RT-2-X in both simulated and real-world evaluations.

Abstract

The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language Models (VLMs) have demonstrated promising generalizability, their task performance is still unsatisfactory, as indicated by the low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and real world shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA, which has a similar model size (7B) to ours, by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rate in simulation. Code and models can be found on our project page (https://cogact.github.io/).


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

1.2. Authors

  • Qixiu Li (Tsinghua University)
  • Yaobo Liang, Lin Luo, Xi Chen, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Jiaolong Yang, Baining Guo (Microsoft Research Asia)
  • Zeyu Wang, Yuanchun Shi (Tsinghua University)
  • Mozheng Liao (University of Science and Technology of China - USTC)
  • Xiaofan Wang (Institute of Microelectronics, CAS)

1.3. Journal/Conference

Source: arXiv preprint. Published: November 29, 2024. Status: Preprint (under review or early release). The work is a collaboration between a top-tier academic institution (Tsinghua University) and a leading industrial research lab (Microsoft Research Asia), suggesting high research quality.

1.4. Publication Year

2024

1.5. Abstract

This paper introduces CogACT, a new architecture for Vision-Language-Action (VLA) models in robotics. The authors identify that existing VLA models (which adapt Large Language Models for robot control) suffer from poor precision because they treat physical actions as simple discrete text tokens. CogACT proposes a componentized architecture: it uses a powerful Vision-Language Model (VLM) for "cognition" (understanding the scene and task) and a specialized Diffusion Transformer (DiT) for "action" (generating precise physical movements). The model significantly outperforms state-of-the-art baselines like OpenVLA and RT-2-X in both simulation (by over 35%) and real-world experiments (by over 55%), showing strong generalization to new objects and robots.

2. Executive Summary

2.1. Background & Motivation

  • The Problem: In the field of robotic learning, Vision-Language-Action (VLA) models are designed to let robots understand visual scenes and follow language instructions (e.g., "Pick up the apple"). Existing state-of-the-art models (like RT-2 or OpenVLA) typically take a pre-trained Vision-Language Model (VLM) and force it to output robot actions. They do this by quantizing actions—turning continuous movements (like moving an arm 1.5cm) into discrete "words" or tokens (e.g., "Token_55").
  • The Limitation: This "discretization" approach has two major flaws:
    1. Loss of Precision: Robot actions are continuous and precise. Chopping them into bins loses fine-grained control.
    2. Modality Mismatch: Language is discrete (words), but motion is continuous and multimodal (there are many smooth ways to grab a cup). Treating actions like text ignores their physical nature.
  • The Motivation: The authors propose that we should separate "Thinking" (Cognition) from "Doing" (Action). A VLM is great at thinking, but a specialized model is needed to handle the continuous, high-frequency nature of physical actions.

2.2. Main Contributions & Findings

  1. Componentized Architecture: The paper proposes CogACT, which decouples the model into a Cognition Module (derived from a VLM) and a specialized Action Module (based on Diffusion Transformers).
  2. Diffusion Action Module: Instead of predicting one token at a time, the action module uses Diffusion to generate smooth sequences of continuous actions, conditioned on the "thoughts" of the VLM.
  3. Scaling Laws: The authors discover that scaling up the size of the specialized Action Module (from 13M to 308M parameters) yields significant performance gains, following a log-linear trend.
  4. Adaptive Action Ensemble (AAE): A new algorithm is introduced to intelligently blend past and current action predictions, making robot movement smoother and more robust.
  5. Superior Performance: CogACT dominates benchmarks. It beats OpenVLA (a similar-sized 7B model) by 35-55% in success rates and even outperforms the massive 55B parameter RT-2-X model.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand CogACT, you need to grasp three core concepts:

  • Vision-Language-Action (VLA) Models: These are AI models that take Images (Vision) and Text (Language) as input and output Robot Controls (Action). Think of it as a robot brain that sees, reads, and moves.
  • Action Quantization (The "Old" Way): Most language models (like GPT-4) output text tokens. To make them control robots, previous researchers divided physical space into "bins" (e.g., bin 1 = 0-1cm, bin 2 = 1-2cm). The model predicts "bin 2" like it predicts a word. This is called quantization; a small numerical sketch of this binning follows this list.
  • Diffusion Models (The "New" Way used here): Diffusion models (famous for generating images like in Midjourney) work by starting with random noise and gradually refining it into a clear output. In this paper, Diffusion Policies start with a sequence of random numbers and "denoise" them into a smooth, precise trajectory of robot arm movements. This preserves the continuous nature of motion.
  • Diffusion Transformer (DiT): A specific architecture that uses Transformer blocks (the engine behind ChatGPT) to perform the diffusion process. It is excellent at handling sequences of data.
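
To make the binning idea concrete, here is a minimal sketch (not the authors' code) of a 256-bin uniform quantizer of the kind used by tokenization-based VLAs such as OpenVLA. The normalized action range of [-1, 1] is an illustrative assumption; it only shows how round-tripping an action through discrete tokens loses precision.

```python
import numpy as np

# Hypothetical 256-bin uniform quantizer over a normalized action range [-1, 1]
# (exact ranges and bin counts vary by model; OpenVLA reportedly uses 256 bins).
NUM_BINS = 256
EDGES = np.linspace(-1.0, 1.0, NUM_BINS + 1)      # bin boundaries
CENTERS = (EDGES[:-1] + EDGES[1:]) / 2            # value decoded for each token

def actions_to_tokens(actions: np.ndarray) -> np.ndarray:
    """Map continuous action values in [-1, 1] to discrete token ids (0..255)."""
    return np.clip(np.digitize(actions, EDGES) - 1, 0, NUM_BINS - 1)

def tokens_to_actions(tokens: np.ndarray) -> np.ndarray:
    """Decode token ids back to (approximate) continuous values."""
    return CENTERS[tokens]

a = np.array([0.01234, -0.4567, 0.9999])          # a continuous action snippet
a_roundtrip = tokens_to_actions(actions_to_tokens(a))
print(a - a_roundtrip)   # nonzero residuals: precision lost by the binning step
```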

3.2. Previous Works

  • RT-1 & RT-2 (Robotics Transformer): Pioneering works by Google. RT-1 uses a Transformer to predict discrete action tokens. RT-2 scales this up by using a massive VLM (PaLI-X) co-trained on internet data and robot data. Critique: They rely on discrete tokenization, losing precision.
  • OpenVLA: An open-source VLA based on the Llama 2 language model and Prismatic vision encoder. It also uses action quantization (256 bins). CogACT uses OpenVLA's pre-trained weights as a starting point but radically changes the architecture.
  • Octo: A generalist robot policy that uses a Transformer backbone with a small diffusion head (only ~3M parameters).
    • Differentiation: CogACT argues that a small head is insufficient. CogACT uses a massive, dedicated Diffusion Transformer (up to 300M+ parameters) as the action module, integrated with a large VLM.

3.3. Differentiation Analysis

Feature Traditional VLAs (RT-2, OpenVLA) CogACT (This Paper)
Architecture Monolithic (One model does all) Componentized (VLM for thought, DiT for action)
Action Output Discrete Tokens (Quantized) Continuous Signals (Diffusion)
Action Modeling Autoregressive (Next token prediction) Generative (Denoising whole sequences)

4. Methodology

4.1. Principles

The core design philosophy of CogACT is Decoupling.

  • Cognition: Requires broad semantic knowledge (what is a "cup"? what does "stack" mean?). This is handled by a large pre-trained VLM.
  • Action: Requires high-frequency, continuous, spatial precision. This is handled by a specialized Diffusion Transformer. The two are connected by a "Cognition Token" that passes semantic intent from the brain (VLM) to the cerebellum (Action Module).

The following figure (Figure 2 from the original paper) illustrates this architecture. On the left, the Vision and Language modules process the input to create a "Cognition Feature." On the right, the Action Module (Diffusion Transformer) uses this feature to denoise random noise into a precise action sequence.

Figure 2. Overview of the CogACT model components: the vision module and language module produce a cognition feature from the visual and language inputs, and the transformer-based action module conditions on it to generate the action output.

4.2. Core Methodology In-depth

Step 1: Vision and Language Processing (The "Cognition" Module)

The model first processes the raw sensory inputs.

  1. Vision Module: The robot sees an image $o_t$. This image is fed into two powerful vision encoders, DINOv2 and SigLIP, which break the image into visual tokens (features), denoted $\nu$.
  2. Language Module: The instruction (e.g., "Pick the apple") is tokenized into text tokens $\mathcal{T}$.
  3. Cognitive Fusion: The model (based on Llama-2) takes the text tokens $\mathcal{T}$, the visual tokens $\nu$, and, crucially, adds a special learnable Cognition Token [c].
  4. Cognition Output: The Llama-2 model processes these inputs via causal attention. The final output corresponding to the special token [c] is extracted as the Cognition Feature $\mathbf{f}_t^c$.
    • Significance: $\mathbf{f}_t^c$ compresses all the "thinking" and planning information needed to perform the task into a single vector (a schematic sketch of this flow follows below).
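
The flow of Step 1 can be summarized in a short schematic. This is a minimal sketch under assumptions: `dinov2`, `siglip`, `projector`, `llama`, and `cognition_embed` are hypothetical placeholders standing in for the released modules, and the call signatures are illustrative rather than the actual CogACT API.

```python
import torch

def cognition_forward(image, text_ids, dinov2, siglip, projector, llama, cognition_embed):
    """Schematic of the cognition module (placeholder module names, not the real API).

    cognition_embed is assumed to be a learnable (1, 1, D) parameter for the [c] token.
    Returns the cognition feature f_t^c: the final hidden state at the [c] position.
    """
    # 1) Encode the observation with both vision backbones and fuse their features.
    vis_tokens = torch.cat([dinov2(image), siglip(image)], dim=-1)   # (B, N_v, D_v)
    vis_tokens = projector(vis_tokens)                               # project to LLM width

    # 2) Embed the language instruction.
    txt_tokens = llama.embed_tokens(text_ids)                        # (B, N_t, D)

    # 3) Append one learnable cognition token [c] to the sequence.
    c = cognition_embed.expand(vis_tokens.size(0), 1, -1)            # (B, 1, D)
    seq = torch.cat([vis_tokens, txt_tokens, c], dim=1)

    # 4) Run the LLM with causal attention; the last position corresponds to [c].
    hidden = llama(inputs_embeds=seq).last_hidden_state
    return hidden[:, -1]                                             # cognition feature f_t^c
```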

Step 2: Diffusion Action Generation (The "Action" Module)

The Action Module is a Diffusion Transformer (DiT). Its job is to generate a sequence of actions $(a_t, a_{t+1}, \ldots, a_{t+N})$.

  • Input:

    1. The Cognition Feature $\mathbf{f}_t^c$ (from Step 1).
    2. A sequence of "noisy" actions (random noise that needs to be cleaned).
    3. A timestep indicator $i$ (telling the model how much noise is currently present).
  • The Diffusion Process: The model predicts the noise present in the action sequence. By subtracting this predicted noise, it recovers the clean actions. The overall action prediction is defined as $$\pi : (l, o_t) \to (a_t, a_{t+1}, \ldots, a_{t+N})$$ where $l$ is the language instruction, $o_t$ is the observation, and the output is the current action plus $N$ future actions. (A denoising sketch follows at the end of this step.)

  • Action Definition: Each action $a_t$ is a 7-dimensional vector representing the robot arm's movement: $$a_t = [\Delta x, \Delta y, \Delta z, \Delta\phi, \Delta\theta, \Delta\psi, g]$$

    • $\Delta x, \Delta y, \Delta z$: Translation (movement in 3D space).
    • $\Delta\phi, \Delta\theta, \Delta\psi$: Rotation (changes in orientation).
    • $g \in \{0, 1\}$: Gripper state (open or closed).
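
Here is a minimal sketch of how such an action module could denoise random noise into an action chunk at inference time. It assumes a generic DDPM-style sampler with an illustrative noise schedule; `dit` is a hypothetical stand-in for the Diffusion Transformer, and the paper's exact sampler, chunk length, and step count may differ.

```python
import torch

@torch.no_grad()
def sample_action_chunk(dit, f_c, chunk_len=16, action_dim=7, num_steps=10):
    """DDPM-style sketch: denoise random noise into a (chunk_len, 7-D) action sequence.

    dit(noisy_actions, timestep, cond) is assumed to predict the added noise.
    """
    betas = torch.linspace(1e-4, 0.02, num_steps)      # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    a = torch.randn(1, chunk_len, action_dim)           # start from pure noise
    for i in reversed(range(num_steps)):
        eps_hat = dit(a, torch.tensor([i]), f_c)         # predicted noise at step i
        # Remove the predicted noise component (DDPM posterior mean).
        a = (a - betas[i] / torch.sqrt(1 - alpha_bar[i]) * eps_hat) / torch.sqrt(alphas[i])
        if i > 0:
            a = a + torch.sqrt(betas[i]) * torch.randn_like(a)   # re-inject noise except at the last step
    return a   # columns: [dx, dy, dz, droll, dpitch, dyaw, gripper]
```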

Step 3: Training Objective

The entire model (Vision, Language, and Action modules) is trained end-to-end. The goal is to minimize the difference between the noise predicted by the model and the actual noise added to the ground-truth actions.

The loss function is the Mean Squared Error (MSE): $$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1),\, i}\, \big\|\hat{\epsilon}^i - \epsilon\big\|_2$$

  • Symbol Explanation:
    • $\mathcal{L}_{\mathrm{MSE}}$: The loss value (lower is better).
    • $\mathbb{E}$: Expectation (average over many samples).
    • $\epsilon$: The ground-truth noise (sampled from a normal distribution $\mathcal{N}(0,1)$).
    • $\hat{\epsilon}^i$: The predicted noise output by the action module at diffusion step $i$.
    • $\|\cdot\|_2$: The L2 norm (Euclidean distance), measuring the magnitude of the error.
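
A minimal sketch of this training objective, assuming a standard DDPM forward (noising) process; `dit`, the `alpha_bar` schedule, and the tensor shapes are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(dit, f_c, actions, alpha_bar):
    """MSE between true and predicted noise for a ground-truth action chunk.

    actions: (B, N+1, 7) ground-truth action sequence; f_c: cognition feature;
    alpha_bar: (num_steps,) cumulative noise schedule.
    """
    B = actions.size(0)
    i = torch.randint(0, alpha_bar.numel(), (B,))                 # random diffusion step per sample
    eps = torch.randn_like(actions)                               # ground-truth noise ~ N(0, I)
    ab = alpha_bar[i].view(B, 1, 1)
    noisy = torch.sqrt(ab) * actions + torch.sqrt(1 - ab) * eps   # forward (noising) process
    eps_hat = dit(noisy, i, f_c)                                  # predicted noise
    return F.mse_loss(eps_hat, eps)                               # L_MSE
```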

Step 4: Adaptive Action Ensemble (AAE)

During inference (when the robot is actually running), the model predicts a sequence of actions for the future (e.g., the next 15 steps). At the next timestep, it predicts another sequence. This creates overlapping predictions for the same moment in time.

The authors propose Adaptive Action Ensemble to fuse these overlapping predictions smoothly.

The final action $\hat{a}_t$ to be executed is a weighted sum of the current prediction and historical predictions: $$\hat{a}_t = \sum_{k=0}^{K} w_k^{\mathrm{ada}} \cdot \big(a_t \mid o_{t-k}\big)$$

  • Symbol Explanation:
    • $K$: The size of the ensemble window (how far back we look).
    • $a_t \mid o_{t-k}$: The action for time $t$ predicted $k$ steps ago (based on observation $o_{t-k}$).
    • $w_k^{\mathrm{ada}}$: The adaptive weight given to that historical prediction.

How is the weight $w_k^{\mathrm{ada}}$ calculated? The weight depends on how similar the historical prediction is to the current prediction. If a past prediction agrees with the current view, it gets a high weight; if it disagrees (maybe the robot slipped), it gets a low weight.

$$w_k^{\mathrm{ada}} = \exp\!\big(\alpha \cdot \langle a_t \mid o_t,\; a_t \mid o_{t-k} \rangle\big)$$

  • Symbol Explanation:
    • $\exp$: Exponential function (makes weights positive and exaggerates differences).
    • $\alpha$: A hyperparameter (set to 0.1) controlling the sharpness of the weighting.
    • $\langle \cdot, \cdot \rangle$: Cosine similarity (a normalized dot product) measuring how aligned the two action vectors are.
    • $a_t \mid o_t$: The action predicted now (most recent).
    • $a_t \mid o_{t-k}$: The action predicted $k$ steps in the past.

The following figure (Figure 3 from the original paper) visualizes this process. It shows how predictions made at different times (historical vs. current) are combined using weights to form the final "Integrated Output."

Figure 3. Illustration of our action ensemble strategy with $K = 2$ (using the last two historical action predictions) as an example: the action-chunk outputs predicted from observations at $t=0$, $t=1$, and $t=2$ are aggregated so that historical predictions and the current prediction jointly determine the final action to be executed.
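
To make the ensemble rule concrete, here is a minimal sketch of Adaptive Action Ensemble under stated assumptions: the weights are normalized to sum to one and cosine similarity is used for $\langle\cdot,\cdot\rangle$, which may differ in detail from the authors' implementation.

```python
import numpy as np

def adaptive_action_ensemble(current, history, alpha=0.1):
    """Blend the current prediction for time t with past predictions of the same step.

    current: (7,) action for time t predicted from o_t.
    history: list of (7,) actions for time t predicted from o_{t-1}, ..., o_{t-K}.
    """
    preds = [current] + list(history)
    weights = []
    for a_k in preds:
        # Cosine similarity between the current prediction and each (possibly older) one.
        cos = np.dot(current, a_k) / (np.linalg.norm(current) * np.linalg.norm(a_k) + 1e-8)
        weights.append(np.exp(alpha * cos))
    weights = np.array(weights) / np.sum(weights)       # normalize so weights sum to 1
    return np.sum(weights[:, None] * np.array(preds), axis=0)
```

In practice, the controller would keep a rolling buffer of the last $K$ predicted chunks so that the past predictions for the current time $t$ can be looked up and blended with the fresh one at every control step.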

5. Experimental Setup

5.1. Datasets

  • Training Data: Open X-Embodiment (OXE) dataset.
    • Scale: Over 1 million real-world robotic trajectories from 22 different robot types.
    • Subset Used: Similar to Octo and OpenVLA, employing about 22.5 million frames.
    • Nature: Contains images, language instructions, and robot arm actions.
  • Fine-tuning Data: Small sets of collected demonstrations for specific robots (Realman and Franka) to test real-world adaptation (approx. 391 demos for Realman, 400 for Franka).

5.2. Evaluation Environments

The paper uses both high-fidelity simulation and real-world robots.

  1. SIMPLER (Simulation): A benchmark that simulates Google Robot and WidowX Robot. It is designed to be highly correlated with real-world performance.

    • Settings: "Visual Matching" (standard) and "Variant Aggregation" (harder, with unseen backgrounds/distractors).
  2. Realman Robot (Real World): A 7-DoF arm with a gripper. Tasks include picking fruit, stacking cups, and placing blocks.

  3. Franka Robot (Real World): Another standard research robot. Tasks include opening ovens and picking up bowls.

    The following figure (Figure 4 from the original paper) shows the real-world setups. The left shows the Realman robot with fruits and plates; the right shows the Franka robot interacting with an oven and bowls.

    Figure 4. Real-world evaluation environments of the Realman robot (left) and Franka robot (right). The panels show tasks such as placing an avocado on a blue plate, stacking a blue cup into a yellow cup, and opening an oven door, each labeled with its task instruction.

5.3. Evaluation Metrics

  • Success Rate:
    • Definition: The percentage of task attempts that are successfully completed.
    • Formula: $$\text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\%$$
    • Context: A trial is "successful" if the robot completes the goal (e.g., the apple is inside the drawer) within a time limit without crashing.

5.4. Baselines

  • RT-1 / RT-1-X / RT-2-X: Google's transformer-based policies. RT-2-X is a massive 55B parameter model.
  • Octo: A diffusion-based policy but with a very small diffusion head (3M params) and no strong VLM backbone.
  • OpenVLA: The primary competitor. A 7B parameter VLA model that uses action quantization (discrete tokens). CogACT is directly compared against this to prove the superiority of the "Diffusion Action Module" over "Tokenization."

6. Results & Analysis

6.1. Core Results Analysis (Simulation)

The results in the SIMPLER environment are striking. CogACT outperforms all baselines, including models much larger than itself (RT-2-X).

The following are the results from Table 1 of the original paper. This table compares performance on the Google Robot in simulation. Note the massive gap between CogACT (Ours) and OpenVLA.

Google Robot tasks (success rate, %):

| Setting | Method | Pick Coke Can | Move Near | Open/Close Drawer | Open Top Drawer and Place Apple | Average |
|---|---|---|---|---|---|---|
| SIMPLER (Visual Matching) | RT-1 [7] | 85.7 | 44.2 | 73.0 | 6.5 | 52.4 |
| SIMPLER (Visual Matching) | RT-1-X [48] | 56.7 | 31.7 | 59.7 | 21.3 | 42.4 |
| SIMPLER (Visual Matching) | RT-2-X [48] | 78.7 | 77.9 | 25.0 | 3.7 | 46.3 |
| SIMPLER (Visual Matching) | Octo-Base [62] | 17.0 | 4.2 | 22.7 | 0.0 | 11.0 |
| SIMPLER (Visual Matching) | OpenVLA [30] | 18.0 | 56.3 | 63.0 | 0.0 | 34.3 |
| SIMPLER (Visual Matching) | Ours | 91.3 | 85.0 | 71.8 | 50.9 | 74.8 |
| SIMPLER (Variant Aggregation) | RT-1 [7] | 89.8 | 50.0 | 32.3 | 2.6 | 43.7 |
| SIMPLER (Variant Aggregation) | RT-1-X [48] | 49.0 | 32.3 | 29.4 | 10.1 | 30.2 |
| SIMPLER (Variant Aggregation) | RT-2-X [48] | 82.3 | 79.2 | 35.3 | 20.6 | 54.4 |
| SIMPLER (Variant Aggregation) | Octo-Base [62] | 0.6 | 3.1 | 1.1 | 0.0 | 1.2 |
| SIMPLER (Variant Aggregation) | OpenVLA [30] | 60.8 | 67.7 | 28.8 | 0.0 | 39.3 |
| SIMPLER (Variant Aggregation) | Ours | 89.6 | 80.8 | 28.3 | 46.6 | 61.3 |

Analysis:

  • Vs. OpenVLA: In the standard setting (Visual Matching), CogACT achieves 74.8% vs OpenVLA's 34.3%. This is a massive 40% absolute improvement.
  • Vs. RT-2-X: CogACT (7B params) beats RT-2-X (55B params) which scored 46.3%. This proves that a better architecture (Componentized DiT) beats brute-force scale.
  • Complex Tasks: In the hardest task ("Open Top Drawer and Place Apple"), most models failed completely (0-3.7%), while CogACT achieved 50.9%.

6.2. Real-World Results

The real-world experiments confirm the simulation findings. The following figure (Figure 1 from the paper) summarizes the success rates across different platforms. Note the blue bar (Ours) consistently towering over others.

Figure 1. (a) Success rate (%) comparison of the model against RT-1 [7], RT-1-X [48], RT-2-X [48], Octo [62], and OpenVLA [30] across tasks such as stacking multiple cups and picking and placing unseen objects. (b) Effect of action module size on success rate. (c) Demonstrations of real-world tasks, showing the model's adaptability across different environments.

The following are the detailed results from Table 3 (Realman Robot evaluation):

| Method | Pick: Banana | Pick: Lemon | Pick: Avocado | Pick: Avg. | Stack: Cup | Stack: Bowl | Stack: Avg. | Place: Pick | Place: Stack | Place: Avg. | Task (All): Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Octo-Base [62] | 25.0 | 0.0 | 0.0 | 8.3 | 0.0 | 0.0 | 0.0 | 12.5 | 0.0 | 6.3 | 4.9 |
| OpenVLA [30] | 12.5 | 12.5 | 0.0 | 8.3 | 25.0 | 6.3 | 15.6 | 25.0 | 4.2 | 12.5 | 12.1 |
| Ours | 75.0 | 50.0 | 87.5 | 70.8 | 95.8 | 68.8 | 82.3 | 87.5 | 33.3 | 60.4 | 71.2 |

Analysis:

  • CogACT reaches an average success rate of 71.2%, while OpenVLA only manages 12.1%. This suggests that OpenVLA's tokenization approach struggles significantly with the precision required for real-world physics (like stacking cups), whereas CogACT's diffusion approach handles it well.

6.3. Ablation Studies

The authors investigated specific components of their design.

1. Action Model Architecture: Does size matter for the Action Module? The following results from Table 7 show the impact of model size:

| Action Model | Params | Average Success Rate |
|---|---|---|
| MLP (3-Layer) | 3M | 50.6 |
| MLP (7-Layer) | 89M | 52.5 |
| DiT-Small | 13M | 58.5 |
| DiT-Base | 89M | 62.5 |
| DiT-Large | 308M | 64.8 |
  • Finding: Yes, size matters. Performance scales log-linearly with model size. A dedicated large DiT (308M) is much better than a simple MLP head (used in Octo).
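
The log-linear trend can be checked directly from the DiT rows of Table 7; a quick least-squares fit (illustrative only, using natural-log parameter counts):

```python
import numpy as np

# Average success rates of the DiT variants from Table 7.
params = np.array([13e6, 89e6, 308e6])    # DiT-Small, DiT-Base, DiT-Large
success = np.array([58.5, 62.5, 64.8])

slope, intercept = np.polyfit(np.log(params), success, 1)
print(f"{slope:.2f} success-rate points per e-fold of parameters")   # roughly 2.0
```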

2. Adaptive Action Ensemble: Is the new ensemble method actually better than old methods (like Action Chunking)? The following results from Table 9 compare strategies:

| Strategy | Google Robot (VM) | Google Robot (VA) | WidowX Robot | Average |
|---|---|---|---|---|
| Action Chunking | 67.4 | 52.5 | 32.1 | 50.7 |
| Temporal Ensemble | 75.0 | 59.9 | 41.9 | 58.9 |
| Adaptive Ensemble | 74.8 | 61.3 | 51.3 | 62.5 |

(VM = Visual Matching, VA = Variant Aggregation.)
  • Finding: Adaptive Ensemble (62.5%) outperforms Action Chunking (50.7%) and standard Temporal Ensemble (58.9%). The adaptive weighting (using cosine similarity) helps the robot avoid jittery or conflicting movements.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully demonstrates that CogACT is a superior architecture for robotic manipulation. By abandoning the "VLM-does-everything" approach and instead adopting a Componentized strategy—where a large VLM handles cognition and a dedicated Diffusion Transformer handles action—the model achieves state-of-the-art results. It solves the precision and modality issues of previous quantization-based methods, delivering smooth, accurate, and generalizable robot control.

7.2. Limitations & Future Work

  • Inference Latency: Diffusion models are iterative (multiple denoising steps). While the authors use a small number of steps (8), it is inherently slower than a single-step feedforward prediction. This could be a bottleneck for high-speed dynamic tasks (e.g., catching a flying ball).
  • Complexity: The architecture is more complex than a monolithic model (like RT-2), requiring the management of separate modules and a specialized ensemble algorithm during inference.
  • Future Work: The authors suggest further scaling the action module (beyond 300M parameters) and exploring even more diverse robot embodiments to test the limits of generalization.

7.3. Personal Insights & Critique

  • The "Brain-Cerebellum" Analogy: This paper validates a biological intuition. Our brains have a cortex for high-level planning (Cognition/VLM) and a cerebellum for fine motor control (Action/DiT). Trying to force the "cortex" (LLM) to output muscle signals directly (Tokenization) was likely a suboptimal path. CogACT corrects this.
  • Scaling Laws for Action: The finding that the Action Module itself obeys scaling laws is crucial. It suggests that we haven't hit the ceiling yet; simply making the "cerebellum" bigger will likely continue to improve robot dexterity.
  • Generalization: The jump in performance on unseen objects (generalization) is the most promising aspect for real-world deployment, where robots constantly face new environments.
