AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
TL;DR Summary
AHA is an open-source vision-language model designed for detecting and reasoning about failures in robotic manipulation using natural language. By framing failure detection as free-form reasoning, it provides adaptable explanations across various tasks, demonstrating strong effectiveness and generalization from simulation to real-world robots and unseen tasks.
Abstract
Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they still struggle with failure recognition, limiting their real-world applicability. We introduce AHA, an open-source VLM designed to detect and reason about failures in robotic manipulation using natural language. By framing failure detection as a free-form reasoning task, AHA identifies failures and provides detailed, adaptable explanations across different robots, tasks, and environments. We fine-tuned AHA using FailGen, a scalable framework that generates the first large-scale dataset of robotic failure trajectories, the AHA dataset. FailGen achieves this by procedurally perturbing successful demonstrations from simulation. Despite being trained solely on the AHA dataset, AHA generalizes effectively to real-world failure datasets, robotic systems, and unseen tasks. It surpasses the second-best model (GPT-4o in-context learning) by 10.3% and exceeds the average performance of six compared models including five state-of-the-art VLMs by 35.3% across multiple metrics and datasets. We integrate AHA into three manipulation frameworks that utilize LLMs/VLMs for reinforcement learning, task and motion planning, and zero-shot trajectory generation. AHA's failure feedback enhances these policies' performances by refining dense reward functions, optimizing task planning, and improving sub-task verification, boosting task success rates by an average of 21.4% across all three tasks compared to GPT-4 models.
In-depth Reading
1. Bibliographic Information
1.1. Title
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
The title clearly states the paper's core contribution: a Vision-Language Model (VLM) named AHA specifically designed for a crucial but under-explored task in robotics—detecting and, more importantly, reasoning about why a manipulation task has failed.
1.2. Authors
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, and Yijie Guo.
The authors are affiliated with several prominent institutions in AI and robotics research, including NVIDIA, the University of Washington, Universidad Católica San Pablo, MIT, Nanyang Technological University, and the Allen Institute for Artificial Intelligence. This collaboration brings together expertise from leading corporate research labs and academic institutions, suggesting a strong foundation in both practical application and theoretical rigor. Several authors, like Dieter Fox and Ranjay Krishna, are well-known figures in the fields of robotics and computer vision.
1.3. Journal/Conference
The paper was submitted to arXiv, a preprint server for academic papers. This means it has not yet undergone a formal peer-review process for publication in a specific conference or journal. However, arXiv is a standard platform for disseminating cutting-edge research quickly within the AI and robotics communities. The high quality of the work and the authors' affiliations suggest it is likely intended for a top-tier robotics or AI conference such as CoRL (Conference on Robot Learning), RSS (Robotics: Science and Systems), or ICRA (International Conference on Robotics and Automation).
1.4. Publication Year
The paper was published on arXiv on October 1, 2024.
1.5. Abstract
The abstract introduces the problem that while modern Vision-Language Models (VLMs) and Large Language Models (LLMs) have advanced robotic capabilities, they lack a crucial skill: recognizing and understanding their own failures. To address this, the paper presents AHA, an open-source VLM designed to detect and explain robotic manipulation failures in natural language. The authors frame failure detection not as a simple binary classification but as a free-form reasoning task.
To train AHA, they developed FailGen, a framework for procedurally generating a large-scale dataset of robotic failures (AHA dataset) from successful demonstrations in simulation. Despite being trained only on this synthetic data, AHA shows strong generalization to real-world scenarios and unseen tasks. It significantly outperforms other state-of-the-art models, including GPT-4o. The authors demonstrate AHA's practical utility by integrating it into three different robotic frameworks (for reinforcement learning, task planning, and trajectory generation), where its feedback improved task success rates by an average of 21.4% compared to GPT-4 based systems.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2410.00371
- PDF Link: https://arxiv.org/pdf/2410.00371v1.pdf
- Publication Status: This is a preprint, meaning it has not yet been peer-reviewed and accepted at a formal academic venue.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper addresses is the brittleness of modern AI-powered robots in the face of failure. While foundation models like VLMs and LLMs have enabled robots to understand complex instructions and reason about the world, they often operate in an "open-loop" manner regarding their own performance. They execute tasks but struggle to recognize when something has gone wrong, such as dropping an object or misaligning a part. This inability to self-diagnose failures is a major barrier to deploying autonomous robots in unstructured, real-world environments where unexpected events are common.
Prior research often treated failure detection as a simple binary classification problem (success vs. failure). This approach is limited because it doesn't provide insight into why a failure occurred. Without understanding the root cause, a robot cannot intelligently recover or learn from its mistakes. Humans, in contrast, excel at learning through trial and error, a process that relies on identifying and understanding failures.
The paper's innovative entry point is to reframe failure detection as a free-form reasoning task. Instead of just outputting "failure," the proposed model, AHA, generates a detailed natural language explanation of the error. This rich feedback can be used by higher-level planning or learning systems to correct the robot's behavior, making the overall system more robust and adaptable.
2.2. Main Contributions / Findings
The paper presents three primary contributions:
- FailGen and the AHA Dataset: The authors introduce FailGen, a scalable and automated pipeline for generating robotic failure data. FailGen takes successful task demonstrations from simulators and procedurally perturbs them to create a wide variety of realistic failures. Using this pipeline, they created the AHA dataset, the first large-scale (49K+ examples) dataset specifically for training models to reason about robotic manipulation failures.
- AHA, a High-Performing Failure Reasoning VLM: They developed AHA, a VLM instruction-tuned on the AHA dataset. The paper demonstrates that AHA significantly outperforms existing state-of-the-art proprietary and open-source VLMs (including GPT-4o) in detecting and explaining failures. Crucially, AHA generalizes well beyond its training data, showing strong performance on unseen tasks, different simulation environments, and real-world robot data.
- Demonstrated Utility in Downstream Robotic Applications: The paper shows that AHA is not just an academic benchmark success but a practically useful tool. By integrating AHA into three distinct LLM/VLM-based robotic frameworks, they prove that its high-quality failure feedback leads to tangible performance improvements. Specifically, AHA helped:
  - Refine reward functions for reinforcement learning.
  - Optimize task plans for task and motion planning systems.
  - Improve sub-task verification for zero-shot trajectory generation.

  Across these applications, using AHA boosted task success rates by an average of 21.4% over systems using GPT-4 models for the same feedback role.
The key finding is that by creating a specialized dataset of failures and training a VLM to perform free-form reasoning, it's possible to build a powerful "failure expert" model that can close the learning loop for a wide range of robotic systems, making them more intelligent and robust.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a novice reader should be familiar with the following concepts:
- Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data (e.g., the entire internet). They excel at understanding and generating human-like text. Examples include OpenAI's GPT series and Google's Gemini. In robotics, they are often used for high-level reasoning and task planning.
- Vision-Language Models (VLMs): VLMs are an extension of LLMs that can process both images and text as input. They connect visual information with linguistic concepts, allowing them to perform tasks like describing an image, answering questions about it, or, in this paper's case, observing a robot's actions and explaining what went wrong. LLaVA and GPT-4o are examples of VLMs.
- Instruction Tuning (or Fine-Tuning): This is a training process where a pre-trained foundation model (like an LLM or VLM) is further trained on a smaller, high-quality dataset of specific examples (input-output pairs). This specializes the model for a particular task. In this paper, a general-purpose VLM is instruction-tuned on the AHA dataset to become an expert in robotic failure reasoning.
- Robotic Manipulation: This subfield of robotics focuses on the movement and handling of objects by a robot, typically using a gripper or hand. Tasks include picking, placing, stacking, opening drawers, etc.
- Trajectory and Keyframes: A robot's motion is often represented as a trajectory, which is a sequence of states (e.g., joint angles, gripper position) over time. A keyframe is a specific, important pose (position and orientation) within that trajectory, often representing the start or end of a primitive action (like "move above the object" or "grasp the object").
- Embodiment: This term refers to the physical body of the robot, including its shape, sensors, and actuators (e.g., its arm, gripper). A model that generalizes "across embodiments" can perform well on different types of robots (e.g., a UR5 arm vs. a Franka Emika arm).
- Task and Motion Planning (TAMP): A classic robotics problem where a system must find a sequence of high-level actions (task plan) and the corresponding continuous motions (motion plan) to achieve a goal while respecting physical constraints (e.g., avoiding collisions).
3.2. Previous Works
The paper situates itself within three main areas of research:
- Failure Detection in Robotic Manipulation: Historically, failure detection was studied in Human-Robot Interaction (HRI) and TAMP. More recently, with the rise of foundation models, many approaches use off-the-shelf VLMs as "success detectors." These models typically provide a binary output (success/fail) or a score. A key work cited is REFLECT, which also created a taxonomy of failures, but AHA goes further by framing the problem as free-form language reasoning rather than just detection. The authors argue that existing methods are limited because they don't explain why a failure occurred.
- Data Generation in Robotics: Creating large datasets for training robot policies is a major challenge. Researchers have developed systems to automate this process in simulation. For example, MimicGen automates the generation of successful task demonstrations, and RoboPoint uses simulation to generate data for fine-tuning VLMs on spatial reasoning tasks (e.g., predicting where to grasp an object). The AHA paper builds on this trend of using simulation for data generation but focuses on a novel data type: procedurally generated failure trajectories with corresponding textual explanations.
- Foundation Models for Robotic Manipulation: LLMs and VLMs are increasingly being integrated into robot control systems. One approach is "prompting," where a model is given a visual input and a text prompt to generate low-level actions in real-time. Another approach, which this paper follows, is instruction-tuning a VLM for a specific robotics-related skill. Examples include RoboPoint for spatial affordance and Octopi for physical reasoning from tactile data. AHA fits into this second category, creating a specialized VLM for the domain of failure reasoning.
3.3. Technological Evolution
The evolution of technology leading to this paper can be seen as follows:
- Early Robotics: Focused on hard-coded control logic for specific tasks in structured environments (e.g., factory assembly lines). Failure handling was rigid and based on simple sensor triggers.
- Learning-based Robotics: The rise of machine learning led to policies trained via imitation learning or reinforcement learning. These systems could handle more variability but still struggled with generalization and explaining their failures.
- The Foundation Model Era: The recent explosion of LLMs and VLMs provided a powerful new tool for robotics. Researchers began using them as high-level "brains" for task planning (Eureka), zero-shot control (VoxPoser), and success detection (VIP).
- The Current Gap: While these models excel at execution, they lack introspection. They are good at answering "what should I do next?" but poor at answering "did what I just did work, and if not, why?"
- AHA's Position: AHA represents the next logical step: creating specialized foundation models to fill this "introspection gap." It moves beyond simple success/failure classification to provide rich, actionable feedback, enabling a closed-loop learning system where the robot can understand and correct its own mistakes.
3.4. Differentiation Analysis
The core innovations of AHA compared to prior work are:
- Free-Form Reasoning vs. Binary Classification: This is the most significant conceptual shift. While most previous works treated failure detection as a classification task, AHA treats it as a generative reasoning task. This provides much richer, more human-understandable, and ultimately more useful feedback for downstream systems.
- Procedural Failure Generation (FailGen): While others have generated successful demonstrations, this paper pioneers a systematic, scalable method for generating a diverse dataset of failures. This is crucial because failure data is naturally rare and hard to collect in the real world. FailGen provides a principled way to create this data on a massive scale.
- Focus on Generalization: The authors rigorously evaluate AHA's ability to generalize to new tasks, new simulators, and real-world robots with different embodiments. This demonstrates that the knowledge learned from the synthetic AHA dataset is not just pattern matching but a more fundamental understanding of manipulation failures.
- Practical System Integration: The paper closes the loop by showing that AHA's improved reasoning directly translates to better performance in three different, state-of-the-art robotics applications. This moves the contribution from a theoretical improvement on a benchmark to a demonstrated practical benefit.
4. Methodology
4.1. Principles
The core idea behind AHA is that a robot can learn from its mistakes more effectively if it can understand why it failed. The methodology is built on two key principles:
- Learning from Abundant, Diverse Failures: To train a model to understand failure, it needs to see many examples of it. Since real-world failure data is scarce, the authors propose generating it synthetically but systematically in a simulator. By defining a taxonomy of common failure modes, they can ensure the generated data is diverse and covers a wide range of realistic error scenarios.
- Failure as a Language-Grounded Reasoning Problem: Instead of just predicting a "fail" label, the model is trained to generate a natural language sentence explaining the failure. This leverages the powerful reasoning and generative capabilities of modern VLMs. The language output is not only interpretable by humans but can also be parsed and used as feedback by other AI systems (like an LLM-based planner).

The overall pipeline, as shown in Figure 2, involves first using the FailGen framework to create the AHA dataset, and then using this dataset to instruction-tune a VLM, resulting in the AHA model.
Figure 2: Overview of AHA Pipeline. (Top) The data generation for AHA is accomplished by taking a normal task trajectory in simulation and procedurally perturbing all keyframes using our taxonomy of failure modes. Through FailGen, we systematically alter keyframes to synthesize failure demonstrations conditioned on the original tasks. Simultaneously, we generate corresponding query and answer prompts for each task and failure mode, which are used for instruction-tuning. (Bottom) The instruction-tuning pipeline follows the same fine-tuning procedure as LLaVA-v1.5 [24], where we fine-tune only the LLM base model (in this case, LLaMA-2-13B) and the projection linear layers, while freezing the image encoder and tokenizer.
4.2. Core Methodology In-depth
4.2.1. The AHA Dataset Generation via FailGen
The foundation of the AHA model is the dataset it's trained on. The authors developed FailGen, a data generation pipeline, to create this dataset.
1. Failure Mode Taxonomy: The first step was to identify and categorize common ways a robot can fail a manipulation task. Based on analyzing existing robotics datasets and policy rollouts, they defined a taxonomy of seven failure modes:
- Incomplete Grasp (No_Grasp): The robot's gripper moves to the correct position to grasp an object but fails to close around it.
- Inadequate Grip Retention (Slip): The robot successfully grasps an object but then loses its grip while moving it.
- Misaligned Keyframe (Translation): The gripper moves to a keyframe with an incorrect position (offset in X, Y, or Z), causing it to miss a target or collide.
- Incorrect Rotation (Rotation): The gripper reaches the correct position but has the wrong orientation (offset in roll, pitch, or yaw).
- Missing Rotation (No_Rotation): The gripper reaches the correct position but omits a necessary rotation.
- Wrong Action Sequence (Wrong_action): The robot performs the sub-tasks in the wrong order (e.g., trying to place an object in a drawer before opening the drawer).
- Wrong Target Object (Wrong_object): The robot interacts with the wrong object according to the language instruction (e.g., picking up a green cup when asked to pick the red one).
2. Procedural Failure Generation:
FailGen is implemented as an environment wrapper for robotics simulators like RLBench. It works by taking a successful demonstration, which is defined by a sequence of keyframes, and systematically perturbing it to induce one of the seven failure modes.
- For each task in the simulator, FailGen iterates through every keyframe.
- At each keyframe, it attempts to apply each of the seven failure modes. For example, to induce a Translation failure, it adds a random offset to the keyframe's target position; to induce a Wrong_action failure, it reorders the sequence of keyframes.
- After perturbation, the simulation is run. The simulator's internal success checker verifies that the task has indeed failed.
- If a failure is successfully induced, the resulting trajectory (a sequence of images) and a corresponding text label are saved. The text label consists of a query and an answer. The query describes the sub-task the robot was attempting, and the answer states that it failed ("No") and provides a templated explanation based on the induced failure mode (e.g., "The robot gripper rotated with an incorrect roll angle").
- This process is automated and scaled across 79 different tasks in RLBench, resulting in the AHA dataset with over 49,000 image-query-answer triplets.
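The perturb-rollout-verify cycle described above is easy to sketch in pseudocode. The snippet below is a minimal illustration, not the actual FailGen implementation: `load_keyframes`, `apply_perturbation`, `rollout`, `task_succeeded`, and `build_query` are hypothetical helpers standing in for simulator-specific wrapper code, and it reuses the `FailureMode`/`EXPLANATIONS` structures sketched earlier.

```python
import random

def generate_failure_samples(task, failure_modes, offset_range=(-0.05, 0.05)):
    """FailGen-style loop (sketch): perturb each keyframe of a successful demo
    with each failure mode and keep only the rollouts that actually fail."""
    samples = []
    keyframes = load_keyframes(task)                     # hypothetical helper
    for k in range(len(keyframes)):
        for mode in failure_modes:
            # e.g. a TRANSLATION perturbation adds a random positional offset,
            # while WRONG_ACTION reorders the keyframe sequence instead.
            perturbed = apply_perturbation(keyframes, k, mode,
                                           offset=random.uniform(*offset_range))
            frames = rollout(task, perturbed)            # run the simulator
            if not task_succeeded(task):                 # simulator's success checker
                samples.append({
                    "images": frames,
                    "query": build_query(task, k),            # describes the sub-task
                    "answer": "No, " + EXPLANATIONS[mode],    # templated explanation
                })
    return samples
```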
4.2.2. Failure Reasoning Formulation
The problem is formulated as a visual question-answering task. The model is given an image representing the robot's trajectory and a text prompt, and it must output a text response.
- Input Prompt: The text prompt asks the model to determine whether a given sub-task succeeded ("yes" or "no") and, if it failed, to explain why. For example: "For the given sub-tasks, first determine if it has succeed by choosing from ["yes", "no"] and then explain the reason why the current sub-tasks has failed."
- Input Image: A single image is constructed to provide both temporal and multi-view context.
  - Temporal Context: Keyframe images from the start of the task up to the current sub-task are stitched together horizontally, from left to right. This shows the progression of the robot's actions.
  - Multi-view Context: To avoid ambiguity from a single camera angle (e.g., occlusions), this horizontal strip of temporal keyframes is created for each of the camera viewpoints available in the simulator. These strips are then stacked vertically.
  - The final input is a grid-like image in which rows correspond to the different viewpoints and columns correspond to the temporal sequence of sub-task keyframes. This structured input provides a comprehensive visual summary of the robot's attempt.
- Output: The model is expected to generate a response that first provides a binary answer ("No") and then a free-form natural language explanation of the failure (e.g., "The robot gripper rotated with an incorrect roll angle").
4.2.3. Instruction Fine-Tuning
The AHA model itself is created by instruction-tuning a pre-trained VLM.
- Base Model: The authors use LLaVA-v1.5-13B as their base model. LLaVA is a powerful open-source VLM that combines a vision encoder (CLIP) with an LLM (LLaMA-2).
- Architecture: The standard LLaVA architecture is used:
  - An Image Encoder processes the input image into a sequence of visual tokens.
  - A Linear Projector (a simple 2-layer neural network) maps these visual tokens into the same embedding space as the language model's text tokens.
  - The visual tokens are concatenated with the text tokens from the input prompt.
  - This combined sequence of multimodal tokens is fed into the LLM, which then autoregressively generates the text response.
- Training Process: During fine-tuning, the image encoder and tokenizer are frozen (their weights are not updated). Only the weights of the linear projector and the LLM are updated. This is an efficient way to adapt the model to the new task without losing the powerful general knowledge from its original pre-training.
- Co-finetuning Data Mix: To prevent the model from "forgetting" its general visual understanding capabilities (a problem known as catastrophic forgetting), the AHA dataset is mixed with general-purpose datasets. As shown in Table 1, the training data includes:
  - The AHA dataset (49k pairs): for the specific skill of robotic failure reasoning.
  - VQA dataset (665k pairs): for general visual question answering.
  - LVIS dataset (100k instances): for object detection and localization reasoning.

  This data mix ensures that AHA becomes an expert in failure detection while remaining a capable general-purpose VLM.
The following table, adapted from Table 1 in the paper, summarizes the instruction-tuning data.
| Source | The AHA dataset (Train) | LVIS [45] | VQA [24] |
|---|---|---|---|
| Size | 49K | 100k | 665k |
| Query | For the given sub-tasks, first determine if it has succeed by choosing from ["yes", "no"] and then explain the reason why the current sub-tasks has failed. | | |
| Answer | No, The robot gripper rotated with an incorrect roll angle | | |
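The partial-freezing recipe described above can be sketched in a few lines of PyTorch-style code. The attribute names (`vision_tower`, `mm_projector`, `language_model`) and the learning rate are assumptions loosely following LLaVA-style codebases, not identifiers from the AHA release.

```python
import torch

def configure_trainable_params(model):
    """Freeze the vision encoder; train only the projector and the LLM,
    mirroring the LLaVA-v1.5 fine-tuning recipe described above."""
    for p in model.vision_tower.parameters():    # CLIP image encoder: frozen
        p.requires_grad = False
    for p in model.mm_projector.parameters():    # projection layers: trained
        p.requires_grad = True
    for p in model.language_model.parameters():  # LLaMA-2 backbone: trained
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=2e-5)  # illustrative learning rate
```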
5. Experimental Setup
5.1. Datasets
To evaluate AHA's generalization capabilities, the authors used three distinct benchmark datasets that were not part of the training data.
- AHA dataset (Test):
  - Source: Generated using FailGen in the RLBench simulator, just like the training data.
  - Scale & Characteristics: Contains 11,000 image-question pairs from 10 RLBench tasks that were held out and not seen during training.
  - Purpose: To evaluate AHA's ability to generalize to novel tasks and behaviors within the same simulation environment and robot embodiment it was trained on.
- ManiSkill-Fail:
  - Source: Generated using the FailGen wrapper, but applied to the ManiSkill simulator, a different simulation environment from RLBench.
  - Scale & Characteristics: A smaller dataset of 130 image-question pairs across four tasks. It features different tasks, object assets, and camera viewpoints than the training data.
  - Purpose: To test AHA's generalization to a different simulation domain. This shows whether the learned concepts of failure are tied to the specific physics and visuals of RLBench or are more abstract.
- RoboFail:
  - Source: An adapted version of an existing real-world failure dataset.
  - Scale & Characteristics: Features a real UR5 robot performing seven tasks.
  - Purpose: To evaluate generalization from simulation to the real world (sim-to-real) and to a different robot embodiment (the robot in RLBench is different from the UR5). This is the most challenging test of generalization.
5.2. Evaluation Metrics
The authors use four metrics to provide a comprehensive evaluation of both the binary success detection and the quality of the free-form language explanations.
- Binary Success (%)
  - Conceptual Definition: This is a simple accuracy metric. It measures how often the model correctly predicts the binary success condition ("yes" or "no") for a given sub-task.
  - Mathematical Formula: $ \text{Binary Success} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: the count of instances where the model's first word ("yes" or "no") matches the ground-truth label.
    - Total Number of Predictions: the total number of examples in the evaluation dataset.
- ROUGE-L
  - Conceptual Definition: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic text summarization and machine translation. ROUGE-L specifically measures the longest common subsequence (LCS) between the model-generated text and the ground-truth reference text. A longer shared subsequence indicates higher similarity.
  - Mathematical Formula: The F-score for ROUGE-L is calculated as: $ R_{lcs} = \frac{LCS(X, Y)}{m} $, $ P_{lcs} = \frac{LCS(X, Y)}{n} $, $ F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}} $
  - Symbol Explanation:
    - $X$: the ground-truth reference sentence.
    - $Y$: the model-generated sentence.
    - $LCS(X, Y)$: the length of the longest common subsequence between $X$ and $Y$.
    - $m$: the length of the reference sentence $X$.
    - $n$: the length of the generated sentence $Y$.
    - $R_{lcs}$: recall based on the LCS.
    - $P_{lcs}$: precision based on the LCS.
    - $\beta$: a parameter that balances precision and recall. When $\beta$ is large, recall is emphasized; for ROUGE-L, $\beta$ is typically set to a very large number, so the F-score primarily reflects recall.
- Cosine Similarity
  - Conceptual Definition: This metric measures the semantic similarity between the generated explanation and the ground-truth explanation by comparing their vector representations (embeddings). Two sentences with similar meanings will have embeddings that point in a similar direction in the vector space, resulting in a cosine similarity close to 1.
  - Mathematical Formula: $ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} $
  - Symbol Explanation:
    - $A$, $B$: the vector embeddings of the ground-truth and generated sentences, respectively, obtained by passing the sentences through a pre-trained sentence-embedding model.
    - $A \cdot B$: the dot product of the two vectors.
    - $\|A\|$ and $\|B\|$: the Euclidean norms (magnitudes) of the vectors.
- LLM Fuzzy Match
  - Conceptual Definition: Since metrics like ROUGE-L can be rigid and fail to capture semantic similarity (e.g., "the gripper failed to close" vs. "the gripper did not shut"), the authors use another LLM as an impartial judge. An external LLM (Claude-3-Sonnet) is prompted to compare the generated text with the ground truth and score their semantic similarity, in a "teacher-student" evaluation format.
  - Mathematical Formula: There is no standard mathematical formula. The process involves designing a prompt that asks the "teacher" LLM to rate the "student" VLM's output against the ground truth, typically on a numerical scale or with a categorical judgment that is then converted to a score.
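For readers who want to reproduce the automatic metrics, the sketch below shows how the binary success check, an LCS-based ROUGE-L recall, and embedding cosine similarity could be computed for a single prediction. The embedding vectors are random stand-ins for whatever sentence-embedding model the authors used, and the LLM fuzzy match (which requires prompting a judge model) is not shown.

```python
import numpy as np

def binary_success(prediction: str, label: str) -> bool:
    """Compare the leading 'yes'/'no' of the prediction to the ground truth."""
    return prediction.strip().lower().startswith(label.strip().lower().split(",")[0])

def lcs_length(x, y):
    """Dynamic-programming longest common subsequence over word lists."""
    dp = np.zeros((len(x) + 1, len(y) + 1), dtype=int)
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            dp[i, j] = dp[i - 1, j - 1] + 1 if x[i - 1] == y[j - 1] else max(dp[i - 1, j], dp[i, j - 1])
    return int(dp[len(x), len(y)])

def rouge_l_recall(reference: str, generated: str) -> float:
    ref, gen = reference.split(), generated.split()
    return lcs_length(ref, gen) / max(len(ref), 1)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example with the templated answer string used throughout the AHA dataset:
ref = "No, The robot gripper rotated with an incorrect roll angle"
gen = "No, the gripper used an incorrect roll angle"
print(binary_success(gen, "No"), round(rouge_l_recall(ref, gen), 3))
print(round(cosine_similarity(np.random.rand(384), np.random.rand(384)), 3))
```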
5.3. Baselines
AHA is benchmarked against a strong set of six state-of-the-art VLMs, including both open-source and proprietary models.
- Open-Source Models:
  - LLaVA-v1.5-13B: The base model that AHA is built upon. This comparison shows the direct impact of instruction-tuning on the AHA dataset.
  - LLaVA-NeXT-34B: A more recent and more powerful version of LLaVA.
  - Qwen-VL-Plus: A strong VLM from Alibaba, reported as Qwen-VL-Plus [53] in Table 2.
- Proprietary Models:
  - Gemini-1.5 Flash: A powerful multimodal model from Google.
  - GPT-4o: OpenAI's flagship multimodal model, representing the state of the art. For GPT-4o, the paper uses in-context learning (ICL): a few examples of the task are provided in the prompt to help the model understand what is being asked, which gives GPT-4o a significant advantage.
- AHA Variants:
  - AHA-7B and AHA-13B: The authors test two sizes of their model, based on the 7-billion and 13-billion parameter versions of LLaMA-2, which allows for analysis of model scaling.
6. Results & Analysis
6.1. Core Results Analysis
The main quantitative results are presented in Table 2, which compares AHA against the baseline models across the three evaluation datasets and four metrics. AHA demonstrates superior performance across the board.
The following is the complete data from Table 2 of the original paper:
| Models | Evaluation Datasets | ROUGE-L ↑ | Cosine Similarity ↑ | Binary Success (%) ↑ | LLM Fuzzy Match ↑ |
|---|---|---|---|---|---|
| LLaVA-v1.5-13B [24] | AHA dataset (Test set) | 0.061 | 0.208 | 0.080 | 0.648 |
| | ManiSkill-Fail | 0.000 | 0.208 | 0.022 | 0.270 |
| | RoboFail [48] | 0.000 | 0.203 | 0.000 | 0.404 |
| LLaVA-NeXT-34B [52] | AHA dataset (Test set) | 0.013 | 0.231 | 0.017 | 0.626 |
| | ManiSkill-Fail | 0.001 | 0.195 | 0.007 | 0.277 |
| | RoboFail [48] | 0.018 | 0.188 | 0.017 | 0.351 |
| Qwen-VL-Plus [53] | AHA dataset (Test set) | 0.000 | 0.161 | 0.000 | 0.426 |
| | ManiSkill-Fail | 0.037 | 0.301 | 0.116 | 0.034 |
| | RoboFail [48] | 0.000 | 0.159 | 0.000 | 0.050 |
| Gemini-1.5 Flash [12] | AHA dataset (Test set) | 0.120 | 0.231 | 0.371 | 0.566 |
| | ManiSkill-Fail | 0.003 | 0.121 | 0.014 | 0.032 |
| | RoboFail [48] | 0.000 | 0.042 | 0.000 | 0.393 |
| GPT-4o (In-context learning) | AHA dataset (Test set) | 0.251 | 0.308 | 0.500 | 0.784 |
| | ManiSkill-Fail | 0.142 | 0.335 | 0.688 | 0.453 |
| | RoboFail [48] | 0.114 | 0.318 | 0.554 | 0.438 |
| AHA-7B | AHA dataset (Test set) | 0.226 | 0.380 | 0.611 | 0.776 |
| | ManiSkill-Fail | 0.341 | 0.429 | 0.971 | 0.630 |
| | RoboFail [48] | 0.236 | 0.429 | 0.571 | 0.418 |
| AHA-13B (Ours) | AHA dataset (Test set) | 0.446 | 0.583 | 0.702 | 0.768 |
| | ManiSkill-Fail | 0.600 | 0.681 | 1.000 | 0.633 |
| | RoboFail [48] | 0.280 | 0.471 | 0.643 | 0.465 |
Key Observations:
- AHA-13B Dominates: AHA-13B is the top-performing model across almost all metrics and datasets. It surpasses the second-best model, GPT-4o (with in-context learning), by a significant margin. For example, on the ManiSkill-Fail dataset, AHA-13B achieves a perfect 1.000 binary success rate, compared to 0.688 for GPT-4o. Its language reasoning scores (ROUGE-L, Cosine Similarity) are also dramatically higher.
- Generalization is Strong: AHA's excellent performance on ManiSkill-Fail (unseen simulator) and RoboFail (real-world, unseen embodiment) demonstrates that it has learned a generalizable understanding of manipulation failures, not just memorized patterns from the RLBench training data.
- Fine-tuning is Crucial: The baseline LLaVA-v1.5-13B performs very poorly on this specialized task. This highlights that even powerful, general-purpose VLMs are not inherently capable of nuanced robotic failure reasoning without specific training. The massive performance jump from LLaVA-v1.5 to AHA-13B proves the effectiveness of the FailGen data and the instruction-tuning approach.
- Model Scaling Helps: AHA-13B generally outperforms AHA-7B, indicating that a larger model capacity is beneficial for this task.
6.2. Data Presentation (Tables)
6.2.1. VQA Benchmark Performance
To ensure that specializing AHA on failure detection did not degrade its general abilities (catastrophic forgetting), the authors evaluated it on standard VQA benchmarks.
The following are the results from Table 3 of the original paper:
| Model | MMBench [54] | ScienceQA [55] | TextVQA [56] | POPE [57] | VizWiz [58] |
|---|---|---|---|---|---|
| LLaVA-13B (LLaMA-2) [24] | 67.70 | 73.21 | 67.40 | 88.00 | 53.01 |
| AHA-13B (LLaMA-2) | 65.20 | 71.94 | 65.20 | 85.74 | 53.45 |
Analysis: AHA-13B performs on par with the original LLaVA-13B model on these general benchmarks. The performance drop is minimal (around 1.5% on average), which is an excellent result. It shows that the co-finetuning strategy with VQA and LVIS data was successful in preserving the model's general world knowledge while adding a new, specialized skill.
6.2.2. Downstream Application Performance
The most compelling evidence of AHA's utility is its impact on downstream robotics tasks. Figure 3 (Right) and Figure 4 showcase these applications.
Figure 3: (Left) Scaling law with the AHA dataset: the effect on model performance of varying amounts of domain-specific fine-tuning data. (Right) Downstream robotic application performance: AHA-13B outperforms GPT-4o in reasoning about failures within these robotic applications, leading to improved performance on the downstream tasks.
Figure 4: Downstream Robotic Application. We demonstrated that AHA can be integrated into existing LLM/VLM-assisted robotic applications to provide failure reasoning and feedback, helping to accelerate and improve task success rates in these systems.
The results from these integrations are summarized below:
- Reinforcement Learning (Eureka): By using AHA's failure explanations to guide the automated refinement of dense reward functions, the system achieved success on all five ManiSkill tasks. This approach outperformed the GPT-4o baseline by 22.3% in task success rate, indicating that AHA's feedback is more accurate and useful for reward design than GPT-4o's.
- Task and Motion Planning (PRoC3S): AHA was used to analyze failed plans in simulation and provide feedback to an LLM planner, and also to verify the final "valid" plan. This integration improved the success rate by 36.7% compared to using GPT-4o in the same role, suggesting AHA is better at identifying subtle plan failures that traditional TAMP checkers might miss.
- Zero-Shot Data Generation (Manipulate-Anything): AHA replaced GPT-4V as the sub-task verification module, which decides whether a primitive action was successful before the system generates the next action. Using AHA improved the overall task success rate by an average of 5%. While a smaller improvement, it demonstrates that even in a complex, open-ended system, a more accurate verification module leads to better performance.

Across all three applications, AHA's specialized knowledge consistently leads to better outcomes than using a more general but powerful model like GPT-4o/4V.
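To make this integration pattern concrete, the sketch below shows how a planner loop might consume AHA's free-form feedback as a sub-task verifier. It is our own illustration with hypothetical functions (`plan_subtasks`, `execute`, `aha_query`, `plan_next_action`); the three frameworks above each wire the feedback in differently.

```python
def run_task_with_verification(instruction, max_retries=3):
    """Closed-loop executor (sketch): after each sub-task, ask AHA whether it
    succeeded; on failure, feed the explanation back to the planner and retry."""
    history = []
    for subtask in plan_subtasks(instruction):           # hypothetical LLM planner
        for _ in range(max_retries):
            frames = execute(subtask)                     # run the primitive action
            reply = aha_query(frames, subtask)            # e.g. "No, the gripper ..."
            if reply.lower().startswith("yes"):
                break                                     # sub-task verified, move on
            history.append({"subtask": subtask, "failure": reply})
            subtask = plan_next_action(instruction, history)  # re-plan with feedback
        else:
            return False                                  # gave up on this sub-task
    return True
```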
6.3. Ablation Studies / Parameter Analysis
The paper includes a data scaling study, shown in Figure 3 (Left), which acts as a form of ablation on the size of the training data.
Analysis of Data Scaling:
The graph shows the performance of the model on the ManiSkill-Fail dataset as the amount of fine-tuning data from the AHA dataset is increased from 3k to 60k samples.
- Across all four evaluation metrics, there is a clear and consistent positive trend: more data leads to better performance.
- The performance doesn't appear to have saturated, suggesting that generating even more failure data with FailGen could lead to further improvements.
- The authors report an average quadratic-fit gradient of 0.0022, quantifying this strong positive correlation. This result validates the effectiveness and scalability of the FailGen data generation pipeline.
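As a side note, the quadratic-fit gradient reported for this scaling curve is the kind of number one gets from a standard polynomial fit over the (dataset size, metric) points; the values below are placeholders for illustration, not data from the paper.

```python
import numpy as np

# Placeholder (size, score) points; the paper sweeps roughly 3K-60K samples.
sizes = np.array([3_000, 10_000, 20_000, 40_000, 60_000], dtype=float)
scores = np.array([0.30, 0.45, 0.55, 0.62, 0.68])

# Fit score ~= a*size^2 + b*size + c and inspect the slope of the fitted curve.
a, b, c = np.polyfit(sizes, scores, deg=2)
slopes = 2 * a * sizes + b        # derivative of the quadratic at each sampled size
print(round(float(slopes.mean()), 6))
```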
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces AHA, a novel open-source Vision-Language Model that excels at detecting and reasoning about failures in robotic manipulation. The authors make three key contributions: (1) they develop FailGen, a scalable framework for procedurally generating the first large-scale dataset of robotic failures; (2) they train AHA on this dataset and show it significantly outperforms state-of-the-art models, including GPT-4o, with strong generalization to real-world scenarios; (3) they demonstrate AHA's practical value by integrating it into three diverse robotics frameworks, where its natural language feedback substantially improves task success rates.
The main finding is that by framing failure detection as a free-form reasoning task and training a VLM on a specialized, synthetically generated dataset, it is possible to create a powerful tool that helps close the loop in robotic learning and execution, making AI-powered robots more robust and intelligent.
7.2. Limitations & Future Work
The authors acknowledge two main limitations:
- Limited Scope of Failure Reasoning: AHA's generated explanations are naturally biased toward the seven failure modes present in its training data. It may struggle to identify and describe novel or "open-ended" failures that fall outside its defined taxonomy.
- Data Generation Dependency: The current FailGen pipeline relies on a predefined taxonomy and procedural perturbations. A potential future direction is to generate more diverse and open-ended failure examples by, for instance, distilling policies from large pre-trained models and sampling their failure modes, which might be more varied and unexpected.
7.3. Personal Insights & Critique
This paper presents a very well-executed and impactful piece of research.
Strengths and Inspirations:
- Problem Framing: The shift from binary classification to free-form reasoning is a simple but powerful idea. It aligns much better with how intelligent systems should handle errors and opens the door for more sophisticated recovery and learning mechanisms.
- The Power of Synthetic Data: This work is a strong testament to the power of synthetic data, especially for capturing "long-tail" events like failures. The FailGen pipeline is a clever and practical solution to a major data bottleneck in robotics. Its adaptability to different simulators (RLBench, ManiSkill) is a significant engineering contribution.
- Rigorous Evaluation: The evaluation is comprehensive. Testing on unseen tasks, unseen simulators, and real-world robots provides strong evidence for the model's generalization capabilities. The inclusion of downstream tasks moves the work beyond a simple benchmark paper to one with demonstrated utility.
Potential Issues and Areas for Improvement:
- Reliance on Simulators: While the sim-to-real results are promising, the core of the data generation is still in simulation. There might be subtle, real-world failure modes (e.g., related to sensor noise, unexpected friction, or material deformability) that are not captured by the current taxonomy or simulators.
- Actionability of Feedback: The paper demonstrates that AHA's feedback improves existing systems, but it doesn't explicitly explore how an LLM or planner consumes this free-form text. Is the text parsed for keywords, or is it fed directly into an LLM's context? More analysis of how this natural language feedback is translated into corrective action would be valuable.
- Scalability of the Taxonomy: The seven failure modes are comprehensive for pick-and-place style tasks. However, as robots tackle more complex tasks (e.g., assembly, tool use, deformable object manipulation), this taxonomy might need to be significantly expanded, and the manual process of defining these modes could become a bottleneck. An interesting future direction would be to learn the failure taxonomy itself from data.
Overall, AHA is a significant step forward in making robots more aware and adaptive. It provides a practical and effective solution to the critical problem of failure reasoning, and the open-sourcing of the model and dataset will likely spur further research in this important area.