ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills
TL;DR Summary
The ASAP framework addresses the dynamics mismatch for humanoid robots with a two-stage approach: motion tracking policies are pre-trained in simulation, and real-world data is then used to learn a delta action model that aligns the simulator so the policies can be fine-tuned to achieve agile whole-body skills.
Abstract
Humanoid robots hold the potential for unparalleled versatility in performing human-like, whole-body skills. However, achieving agile and coordinated whole-body motions remains a significant challenge due to the dynamics mismatch between simulation and the real world. Existing approaches, such as system identification (SysID) and domain randomization (DR) methods, often rely on labor-intensive parameter tuning or result in overly conservative policies that sacrifice agility. In this paper, we present ASAP (Aligning Simulation and Real-World Physics), a two-stage framework designed to tackle the dynamics mismatch and enable agile humanoid whole-body skills. In the first stage, we pre-train motion tracking policies in simulation using retargeted human motion data. In the second stage, we deploy the policies in the real world and collect real-world data to train a delta (residual) action model that compensates for the dynamics mismatch. Then, ASAP fine-tunes pre-trained policies with the delta action model integrated into the simulator to align effectively with real-world dynamics. We evaluate ASAP across three transfer scenarios: IsaacGym to IsaacSim, IsaacGym to Genesis, and IsaacGym to the real-world Unitree G1 humanoid robot. Our approach significantly improves agility and whole-body coordination across various dynamic motions, reducing tracking error compared to SysID, DR, and delta dynamics learning baselines. ASAP enables highly agile motions that were previously difficult to achieve, demonstrating the potential of delta action learning in bridging simulation and real-world dynamics. These results suggest a promising sim-to-real direction for developing more expressive and agile humanoids.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills
1.2. Authors
The paper is authored by a large team, with Tairan He, Jiawei Gao, and Wenli Xiao noted as equal contributors. The affiliations include Carnegie Mellon University (CMU) and NVIDIA, indicating a collaboration between a leading academic institution and a major technology company known for its work in simulation, AI, and robotics.
1.3. Journal/Conference
The paper is a preprint hosted on arXiv (v3); the version analyzed here was posted on 2025-02-03 (UTC). As a preprint, it has not yet undergone formal peer review and publication in a specific journal or conference proceedings, but its content suggests it is intended for a top-tier robotics or AI conference/journal.
1.4. Publication Year
2025
1.5. Abstract
Humanoid robots show great promise for performing complex, human-like whole-body skills. However, achieving agile and coordinated movements is hindered by the dynamics mismatch between simulated environments (where policies are typically trained) and the real world. Current sim-to-real transfer methods, such as System Identification (SysID) and Domain Randomization (DR), either demand extensive manual tuning or produce overly conservative policies that lack agility.
This paper introduces ASAP (Aligning Simulation and Real-World Physics), a two-stage framework designed to overcome this dynamics mismatch and enable agile humanoid whole-body skills. In the first stage, motion tracking policies are pre-trained in simulation (IsaacGym) using retargeted human motion data. In the second stage, these policies are deployed in the real world to collect data, which is then used to train a delta (residual) action model. This model learns to compensate for the observed dynamics mismatch. Subsequently, ASAP fine-tunes the pre-trained policies by integrating this delta action model into the simulator, effectively aligning the simulated dynamics with real-world physics.
The ASAP framework is evaluated across three transfer scenarios: IsaacGym to IsaacSim, IsaacGym to Genesis (both sim-to-sim), and IsaacGym to the real-world Unitree G1 humanoid robot. The results demonstrate that ASAP significantly enhances agility and whole-body coordination for various dynamic motions, achieving reduced tracking errors compared to SysID, DR, and delta dynamics learning baselines. ASAP successfully enables highly agile motions previously challenging to achieve, highlighting the potential of delta action learning in bridging the sim-to-real gap. These findings point towards a promising direction for developing more expressive and agile humanoid robots.
1.6. Original Source Link
https://arxiv.org/abs/2502.01143v3 The paper is currently a preprint on arXiv.
1.7. PDF Link
https://arxiv.org/pdf/2502.01143v3.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the significant challenge of enabling agile and coordinated whole-body skills in humanoid robots, primarily due to the dynamics mismatch between physics simulators and the real world. Humanoid robots hold immense potential for versatile, human-like tasks, but this sim-to-real gap prevents policies trained in simulation from performing effectively on physical hardware.
Current solutions to bridge this gap suffer from notable limitations:
- System Identification (SysID) methods: These approaches attempt to estimate and calibrate physical parameters of the robot (e.g., motor characteristics, link masses) or the environment. However, they often require a pre-defined parameter space, may not capture the entire sim-to-real gap if real-world dynamics fall outside the modeled distribution, and frequently rely on ground truth torque measurements, which are often unavailable on commercial robot platforms, limiting their practical applicability.
- Domain Randomization (DR) methods: These techniques train policies in a simulator where physical parameters are randomized within a certain range, forcing the policy to be robust to variations. While effective for some tasks, DR can lead to overly conservative policies that prioritize stability over agility, thus hindering the execution of highly dynamic and expressive skills.
- Learned dynamics methods: While successful in simpler, low-dimensional systems like drones or ground vehicles, their effectiveness for the complex, high-dimensional dynamics of humanoid robots remains largely unexplored.

The paper's entry point is to leverage the idea of learning a dynamics model using real-world data, but specifically by focusing on learning a residual correction to actions, rather than modeling the entire complex dynamics explicitly. This delta action learning is an innovative approach to directly compensate for the dynamics mismatch, aiming to achieve both robustness and agility.
2.2. Main Contributions / Findings
The paper presents ASAP, a two-stage framework, and its primary contributions are:
- A Novel Framework for Sim-to-Real Transfer: ASAP introduces a two-stage framework that effectively bridges the sim-to-real gap. It leverages a delta action model trained using reinforcement learning (RL) with real-world data to directly compensate for dynamics mismatch. This delta action model enables policies trained in simulation to adapt seamlessly to real-world physics, allowing for the execution of agile whole-body humanoid skills.
- Achievement of Previously Difficult Humanoid Motions: The framework successfully deploys RL-based whole-body control policies on real humanoid robots (Unitree G1), achieving highly dynamic and agile motions (e.g., agile jumps, kicks) that were previously challenging or impossible to perform with existing sim-to-real transfer techniques.
- Extensive Validation and Superior Performance: Through comprehensive experiments in both sim-to-sim (IsaacGym to IsaacSim, IsaacGym to Genesis) and sim-to-real (IsaacGym to Unitree G1) scenarios, ASAP demonstrates its efficacy. It significantly reduces motion tracking errors (up to 52.7% in sim-to-real tasks) and consistently outperforms SysID, DR, and delta dynamics learning baselines, showcasing improved agility and coordination.
- Open-Sourced Multi-Simulator Codebase: To foster further research and facilitate development in this area, the authors have developed and open-sourced a multi-simulator training and evaluation codebase. This resource aims to accelerate future advancements in sim-to-real transfer for humanoid robots.

The key conclusion is that delta action learning is a promising direction for creating more expressive and agile humanoids by effectively bridging the gap between simulation and real-world dynamics.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the ASAP framework, a grasp of several fundamental concepts in robotics, simulation, and machine learning is essential:
- Humanoid Robots: These are robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. Their potential lies in their ability to interact with human-designed environments and perform human-like tasks, making them versatile for a wide range of applications.
- Whole-Body Skills: This refers to complex movements that involve the coordinated action of multiple degrees of freedom (DoFs) across the entire robot's body, including the torso, arms, and legs. Examples include jumping, kicking, dancing, and agile locomotion, requiring intricate balance and coordination.
- Simulation-to-Real (Sim-to-Real) Transfer: This is a crucial area in robotics where policies (control strategies) are trained in a simulated environment and then deployed on physical robots. The goal is to leverage the benefits of simulation (speed, safety, data generation) while ensuring the learned skills translate effectively to the real world.
- Dynamics Mismatch (Sim-to-Real Gap): The primary challenge in sim-to-real transfer. It refers to the discrepancies between the physics models used in simulators and the actual physical properties and behaviors of real-world robots and environments. This mismatch can arise from:
  - Inaccurate System Parameters: Imperfect knowledge of robot mass, inertia, friction coefficients, and motor characteristics (e.g., torque limits, delays).
  - Unmodeled Dynamics: Complex phenomena not captured by simplified physics engines, such as uncalibrated sensors, mechanical compliance, backlash, or unknown environmental factors.
  - Latency and Noise: Differences in control loop timings, sensor noise, and communication delays between simulation and reality.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent performs actions, receives states (observations) and rewards, and learns a policy (a mapping from states to actions) that maximizes cumulative reward over time.
  - Policy (π): The agent's strategy for choosing actions given states.
  - State (s_t): The current observation of the environment.
  - Action (a_t): An output from the agent that influences the environment.
  - Reward (r_t): A scalar feedback signal from the environment indicating the desirability of an action.
- Proximal Policy Optimization (PPO): A popular on-policy RL algorithm that aims to find a policy that performs well. PPO is known for its stability and sample efficiency, making it suitable for complex control tasks. It works by making small updates to the policy to avoid large, destructive changes, typically by clipping the policy ratio during optimization.
- PD Controller (Proportional-Derivative Controller): A widely used feedback control loop mechanism in robotics. It calculates an output torque or force based on the difference between a desired target position (or velocity) and the current measured position (or velocity) of a joint.
  - Proportional (P) term: Responds to the current error, aiming to reduce it.
  - Derivative (D) term: Responds to the rate of change of the error, helping to damp oscillations and improve stability.
  - In the context of this paper, actions are target joint positions which are then fed into a PD controller to generate torques to actuate the robot (see the sketch after this list).
- Motion Capture (MoCap) Systems: Technology used to precisely record the movement of objects or people. In robotics, MoCap systems (e.g., Vicon, OptiTrack) are often used to get ground truth positional and orientational data of a robot in the real world, which is critical for evaluating tracking performance or generating reference trajectories.
- SMPL (Skinned Multi-Person Linear Model): A widely used statistical model for representing human body shape and pose in 3D. It provides a parametric representation that can generate a wide range of realistic human body shapes and motions from a compact set of parameters (shape parameters β and pose parameters θ).
- Physics Simulators: Software environments that model physical interactions and dynamics. They allow researchers to train and test robot control policies safely and efficiently. The paper uses:
  - IsaacGym: A high-performance, GPU-accelerated physics simulation environment for robot learning, developed by NVIDIA. It is known for its ability to run thousands of parallel simulations.
  - IsaacSim: Another NVIDIA simulator built on the Omniverse platform, offering high-fidelity rendering and simulation capabilities. It is often used for tasks requiring more visual realism or complex scene interactions than IsaacGym.
  - Genesis: A universal and generative physics engine, also used as a testing environment in this paper, representing another sim-to-sim transfer scenario.
- Unitree G1 Humanoid Robot: A specific humanoid robot platform used for real-world evaluations in this paper, known for its agile capabilities.
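Since the paper's action space consists of PD targets rather than torques, it may help to see the control law concretely. The following is a minimal sketch of a joint-space PD computation in Python; the gain values and the 23-DoF dimensions are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def pd_torques(q_target, q, qd, kp, kd):
    """Joint-space PD law: tau = kp * (q_target - q) - kd * qd.

    q_target : desired joint positions output by the policy (the paper's action a_t)
    q, qd    : measured joint positions and velocities
    kp, kd   : proportional and derivative gains (per-joint arrays)
    """
    return kp * (q_target - q) - kd * qd

# Hypothetical 23-DoF example with made-up gains, just to show the shapes involved.
n_dof = 23
kp = np.full(n_dof, 100.0)   # assumed gains, not the paper's values
kd = np.full(n_dof, 2.0)
q = np.zeros(n_dof)
qd = np.zeros(n_dof)
q_target = 0.1 * np.ones(n_dof)
tau = pd_torques(q_target, q, qd, kp, kd)
```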
3.2. Previous Works
The paper discusses several categories of prior approaches to addressing the dynamics mismatch:

- System Identification (SysID) Methods:
  - Concept: SysID methods aim to build mathematical models of a system's dynamics from observed input-output data. In robotics, this means estimating physical parameters (e.g., mass, inertia, friction, motor constants) of the robot and its environment to make the simulator more accurate.
  - Examples: [102, 19] are cited as examples. Historically, SysID has been used for calibrating robot models [39, 5], such as estimating inertial parameters of links [2] or actuator dynamics [85, 29].
  - Limitations (as highlighted by the paper):
    - They require a pre-defined parameter space [49], meaning they can only adjust parameters that are explicitly modeled. If the sim-to-real gap stems from unmodeled dynamics, SysID will fail to capture it.
    - They often rely on ground truth torque measurements [29], which are frequently unavailable on many commercial hardware platforms, limiting their practical use.
    - They can struggle with long-horizon scenarios due to cumulative error buildup, as observed in the paper's experiments (Table III).
- Domain Randomization (DR) Methods:
  - Concept: Instead of precisely identifying parameters, DR approaches train RL policies in a simulator where physical properties (mass, friction, motor strength, latency, visual textures) are randomly varied within a specified range during training. The goal is to force the policy to become robust to these variations, thus generalizing better to the real world, whose parameters are assumed to fall within the randomized distribution [87, 68].
  - Examples: [85, 79, 59] are cited for DR in robotics.
  - Limitations (as highlighted by the paper):
    - DR can lead to overly conservative policies [25]. By optimizing for performance across a wide range of randomized parameters, the policy may avoid aggressive actions that could fail under certain parameter extremes, sacrificing agility for robustness. This makes it difficult to achieve highly agile skills.
- Learned Dynamics Methods (Dynamics Modeling):
  - Concept: These methods learn a predictive model of the environment's dynamics directly from real-world data. The learned model can then be used either to improve the simulator or to directly inform model-predictive control (MPC) or policy optimization.
  - Examples: Demonstrated success in low-dimensional systems like drones [81] and ground vehicles [97].
  - Limitations (as highlighted by the paper): Their effectiveness for the high-dimensional and complex dynamics of humanoid robots remains largely unexplored and challenging, due to the sheer complexity of learning a full, accurate dynamics model for a humanoid.
- Residual Learning for Robotics:
  - Concept: This is a broader category where a residual component (a small correction) is learned to augment or refine an existing base model or controller. This residual can correct inaccuracies in a dynamics model or modify the actions of a base policy.
  - Examples:
    - Residual policy models: Refine actions of an initial controller [84, 34, 8, 1, 12, 20, 3, 33, 42].
    - Correcting dynamics models: [66, 35, 38, 82, 23]. RGAT [35] uses a residual action model with a learned forward dynamics model to refine the simulator.
  - Connection to ASAP: ASAP builds on this idea by using RL-based residual actions to align the dynamics mismatch specifically between simulation and real-world physics.
3.3. Technological Evolution
The field of sim-to-real transfer for robot control has evolved as follows:

- Direct SysID of parameters: Early attempts focused on manually or automatically calibrating known physical parameters to make simulators more accurate. This was limited by the completeness of the physical model.
- Robustness through Domain Randomization: Recognizing the difficulty of perfect SysID, DR emerged as a way to make policies robust to parameter uncertainties. While effective, it often led to conservative behaviors.
- Learning Dynamics Models: More recently, data-driven approaches sought to learn dynamics models directly from real-world data. This showed promise but scaled poorly to complex systems like humanoids.
- Residual Learning: This paradigm, often combined with RL, aims to learn only the "difference" or "correction" needed, rather than the entire system from scratch. This makes it more efficient and robust than learning full dynamics models, and more adaptive than SysID or DR alone.

ASAP fits into this evolution by advancing the residual learning paradigm. Instead of learning a full residual dynamics model (which predicts state differences), ASAP learns a residual action model (delta action model) that directly modifies the actions applied to the simulator. This allows the simulator to better reflect real-world outcomes and subsequently enables policies to be fine-tuned in this "aligned" simulator. This approach offers a powerful way to bridge the sim-to-real gap for complex, agile humanoid motions, pushing beyond the limitations of previous methods.
3.4. Differentiation Analysis
Compared to the main methods in related work, ASAP presents several core differences and innovations:

- Differentiation from SysID:
  - ASAP: Learns a delta action model (π^Δ) as a residual correction term for actions. This model implicitly compensates for all dynamics mismatch (both modeled parameter inaccuracies and unmodeled dynamics) by adjusting the actions to match real-world outcomes. It does not explicitly estimate physical parameters.
  - SysID: Explicitly estimates and tunes specific physical parameters (e.g., CoM shift, mass, PD gains). It is limited by the chosen parameter space and cannot account for unmodeled dynamics.
  - Innovation: ASAP offers a more holistic and flexible approach to dynamics compensation that does not require prior knowledge of which parameters are mismatched or their precise values.
- Differentiation from Domain Randomization (DR):
  - ASAP: Collects specific real-world data and uses it to learn a targeted delta action model that aligns the simulator with the real world. Policies are then fine-tuned in this aligned simulator.
  - DR: Randomizes parameters during initial policy training, making the policy robust to a distribution of dynamics. It does not explicitly learn from real-world discrepancies to correct the simulator.
  - Innovation: ASAP provides a more adaptive and less conservative sim-to-real transfer. DR policies can become overly conservative in order to perform well across the randomized distribution, sacrificing agility. ASAP aligns the simulator to the specific real-world dynamics, allowing fine-tuned policies to be both robust and agile.
- Differentiation from Delta Dynamics Learning (a baseline in this paper):
  - ASAP: Learns a delta action model (π^Δ) that outputs corrective actions (Δa_t) which are added to the policy's actions (a_t) before they are applied to the simulator; the simulator then computes the next state from the corrected action.
  - Delta Dynamics Learning (baseline): Learns a residual dynamics model (f^Δ) that predicts the difference in state transitions. This predicted difference is then added to the simulator's output state (as detailed in Appendix C).
  - Innovation: ASAP's delta action model directly influences the input to the simulator's physics step, effectively making the simulator behave more like the real world from the perspective of action execution. The DeltaDynamics baseline, by contrast, corrects the simulator's output state, which can be more prone to cumulative errors and less intuitive for policy fine-tuning, since the actions themselves are not adjusted in the dynamics computation. The paper's results show ASAP significantly outperforming DeltaDynamics in both open-loop and closed-loop performance, suggesting that correcting the input action space is more effective for policy learning than correcting the output state space.
- Integration of RL for Residual Learning: ASAP explicitly uses reinforcement learning (PPO) to train its delta action model. This RL formulation allows the model to learn a complex, non-linear mapping for corrections based on rewards that minimize state discrepancies, which can be more powerful than simpler regression-based approaches for learning residuals.

In essence, ASAP combines the strengths of data-driven learning with the concept of residual modeling in the action space, providing a refined approach that is less restrictive than SysID, more adaptive than DR, and empirically more effective than learning residual dynamics for agile humanoid control.
4. Methodology
The ASAP framework is a two-stage process designed to align simulation and real-world physics for learning agile humanoid whole-body skills.
4.1. Principles
The core idea behind ASAP is to address the sim-to-real gap not by exhaustively modeling real-world physics or randomizing all possible parameters, but by learning a specific residual correction to the actions generated by a policy. This residual correction, called a delta action, is learned from real-world data to make the simulator behave more like the real world. Once this delta action model is integrated into the simulator, the main motion tracking policy can be fine-tuned in this "aligned" simulator, allowing it to adapt to real-world dynamics while retaining agility.
The theoretical basis draws from reinforcement learning for policy optimization and residual learning for dynamics compensation. The intuition is that it's easier to learn the difference between simulated and real-world dynamics (the delta) than to model the entire real-world dynamics from scratch. By learning this delta in the action space, ASAP effectively perturbs the simulator's inputs to match real-world outcomes, thereby creating a more accurate training environment for fine-tuning policies.
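In symbols (using the notation introduced later in Section 4.2.2, where π^Δ denotes the delta action model and f^sim, f^real the simulated and real transition functions), this principle can be summarized as choosing π^Δ so that

$$ f^{\mathrm{sim}}\bigl(s_t,\; a_t + \pi^{\Delta}(s_t, a_t)\bigr) \;\approx\; f^{\mathrm{real}}(s_t, a_t), $$

after which the tracking policy is fine-tuned against the left-hand side (the "ASAP dynamics") instead of the raw simulator. This is a paraphrase of the framework's principle rather than an equation quoted verbatim from the paper.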
4.2. Core Methodology In-depth (Layer by Layer)
The ASAP framework consists of two main stages: Pre-training and Post-training.
4.2.1. Stage 1: Pre-training Agile Humanoid Skills in Simulation
The first stage focuses on training base motion tracking policies entirely within a physics simulator (IsaacGym). This involves generating high-quality reference motions and then training an RL policy to imitate them.
4.2.1.1. Data Generation: Retargeting Human Video Data
To create expressive and agile imitation goals for motion-tracking policies, ASAP leverages human motion videos.

- a) Transforming Human Video to SMPL Motions: The process begins by recording videos of humans performing diverse and agile motions. These videos are then processed using TRAM [93] (Trajectory and Motion of 3D Humans from In-the-Wild Videos). TRAM reconstructs 3D human motions from videos, estimating the global trajectory of the human in the SMPL parameter format [52]. The SMPL model is a statistical 3D body model that represents human pose and shape. Its parameters include:
  - Global root translation: The 3D position of the root joint (pelvis).
  - Global root orientation: The orientation of the root joint.
  - Body poses (part of the pose parameters θ): Rotations for each joint in the SMPL model.
  - Shape parameters (β): Coefficients that define the individual's body shape (e.g., height, weight).
  The reconstructed motions form the raw human motion dataset.
- b) Simulation-based Data Cleaning: Since 3D motion reconstruction from video can introduce noise and physically infeasible movements, a "sim-to-data" cleaning procedure is applied. This involves using MaskedMimic [86], a physics-based motion tracker, to imitate the SMPL motions from TRAM within the IsaacGym simulator. Motions that can be successfully tracked and validated in the simulator are considered physically feasible and are saved as the cleaned dataset. This step ensures that the reference motions are robust and suitable for robot control.
- c) Retargeting SMPL Motions to Robot Motions: The cleaned SMPL motions are then retargeted to the specific humanoid robot (Unitree G1). This is done using a shape-and-motion two-stage retargeting process [25].
  - Shape Optimization: The SMPL shape parameter β is optimized to approximate the target humanoid robot's shape. This involves selecting 12 body links with correspondences between humans and humanoids and performing gradient descent on β to minimize joint distances in a rest pose.
  - Motion Optimization: Using the optimized shape along with the original translation and pose from TRAM, further gradient descent is applied to minimize the distances of the body links. This ensures that the retargeted motions are kinematically and dynamically consistent with the robot's structure. The final output is the retargeted robot motion dataset.

The entire data generation and retargeting process is summarized in Figure 3.

Figure 3 from the original paper: The image is a diagram illustrating the motion conversion process from human actions to G1 robot movements. The left side shows human motion extracted from video, followed by the generation of SMPL motion through TRAM and reinforcement learning, ultimately transforming into G1 robot actions, highlighting the mapping between motion capture and real-world robotic applications.
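To make the shape-optimization step of the retargeting pipeline concrete, here is a minimal gradient-descent sketch in PyTorch. The helpers `smpl_link_positions` and `robot_link_positions` are hypothetical stand-ins for an SMPL layer and the robot's rest-pose forward kinematics; the optimizer choice, learning rate, and step count are arbitrary, not the paper's settings.

```python
import torch

def fit_smpl_shape(smpl_link_positions, robot_link_positions, n_betas=10,
                   steps=500, lr=1e-2):
    """Fit SMPL shape parameters beta so that the rest-pose positions of the
    corresponding body links (12 in the paper) approach the humanoid's links.

    smpl_link_positions(beta) -> (n_links, 3) tensor, differentiable w.r.t. beta
    robot_link_positions()    -> (n_links, 3) tensor, fixed target
    """
    beta = torch.zeros(n_betas, requires_grad=True)
    target = robot_link_positions()
    optimizer = torch.optim.Adam([beta], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        pred = smpl_link_positions(beta)
        # Mean squared distance between corresponding link positions.
        loss = ((pred - target) ** 2).sum(dim=-1).mean()
        loss.backward()
        optimizer.step()
    return beta.detach()
```

The motion-optimization stage follows the same pattern, but optimizes per-frame translations and poses with the fitted β held fixed.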
4.2.1.2. Phase-based Motion Tracking Policy Training
The motion-tracking problem is formulated as a goal-conditioned reinforcement learning (RL) task. A policy is trained in IsaacGym to track the retargeted robot motion trajectories from the cleaned, retargeted dataset.

- State Definition: The state for the RL agent includes the robot's proprioception and a time phase variable φ_t.
  - The time phase variable φ_t ∈ [0, 1] serves as the goal state for single-motion tracking [67], indicating the progress through a motion (φ_t = 0 is the start, φ_t = 1 is the end).
  - The proprioception is defined as a history of robot internal states, consisting of:
    - A history of joint positions (for the 23 DoFs) from t-4 to t.
    - A history of joint velocities from t-4 to t.
    - A history of root angular velocity.
    - A history of root projected gravity (the gravity vector expressed in the root frame).
    - A history of last actions (target joint positions) from t-5 to t-1.
- Reward Function: The reward r_t is a weighted sum of penalty, regularization, and task terms (detailed in Table I under d) below), which is optimized by the policy.
- Action Space: The action a_t represents the target joint positions for the robot's 23 degrees of freedom. These target positions are sent to a PD controller which then actuates the robot's joints.
- Policy Optimization: The Proximal Policy Optimization (PPO) algorithm [80] is used to optimize the policy, aiming to maximize the expected cumulative discounted reward, where γ is the discount factor.

Several design choices are crucial for stable policy training (the termination curriculum and RSI are also sketched in code below):

- a) Asymmetric Actor-Critic Training:
  - Concept: In RL, an actor-critic architecture involves two neural networks: an actor that determines actions and a critic that estimates the value of states. Asymmetric training means the critic has access to more information (privileged information) during training than the actor, which must only use information available during real-world deployment.
  - Implementation: The critic network is given privileged information, such as the global positions of the reference motion and the root linear velocity. In contrast, the actor network relies solely on proprioceptive inputs and the time-phase variable φ_t.
  - Benefit: This design facilitates policy training in simulation by providing the critic with a richer understanding of the task, while ensuring the actor remains deployable in the real world without requiring complex external sensing (like odometry for global position, which is a common challenge for humanoid robots [25, 24]).
- b) Termination Curriculum of Tracking Tolerance:
  - Concept: A curriculum learning strategy where the difficulty of the task (specifically, the tolerance for deviation from the reference motion) is gradually increased during training.
  - Implementation: Initially, a generous termination threshold is set; if the robot deviates from the reference motion by more than this threshold, the episode terminates. As training progresses, the threshold is progressively tightened.
  - Benefit: This allows the policy to first learn basic stability and balancing skills before being challenged with stricter motion tracking requirements, preventing early failures and enabling the learning of high-dynamic behaviors like jumping.
- c) Reference State Initialization (RSI):
  - Concept: A method for initializing RL episodes to improve training efficiency and stability, especially for complex, sequential tasks.
  - Implementation: Instead of always starting an episode at the beginning of the reference motion (φ = 0), RSI [67] randomly samples time-phase variables between 0 and 1. The robot's state (including root position and orientation, root linear and angular velocities, joint positions and velocities) is then initialized based on the corresponding reference motion at that sampled phase.
  - Benefit: This allows the policy to learn different segments or phases of a motion in parallel, rather than being constrained to a strictly sequential learning process. For example, in a backflip, it allows the policy to practice landing (a later phase) before mastering the takeoff (an earlier phase).
- d) Reward Terms: The reward function is a sum of three categories of terms: penalty, regularization, and task rewards. The detailed reward terms and their weights are provided in Table I.

| Category | Term | Weight | Term | Weight |
| --- | --- | --- | --- | --- |
| Penalty | DoF position limits | −10.0 | DoF velocity limits | −5.0 |
| | Torque limits | −5.0 | Termination | −200.0 |
| Regularization | Torques | −1 × 10⁻⁶ | Action rate | −0.5 |
| | Feet orientation | −2.0 | Feet heading | −0.1 |
| | Slippage | −1.0 | | |
| Task Reward | Body position | 1.0 | VR 3-point | 1.6 |
| | Body position (feet) | 2.1 | Body rotation | 0.5 |
| | Body angular velocity | 0.5 | Body velocity | 0.5 |
| | DoF position | 0.75 | DoF velocity | 0.5 |

The above are the results from Table I of the original paper.

- e) Domain Randomizations: To further enhance the robustness and generalization of the pre-trained policy, basic domain randomization techniques are applied during training in IsaacGym. These techniques involve randomly varying certain simulation parameters (e.g., friction, PD gains, control delay, external perturbations) to ensure the policy is less sensitive to exact parameter values. The specific domain randomization parameters are listed in Table VI in the Appendix.
The overall context of this pre-training stage is illustrated in Figure 2(a) within the full ASAP framework diagram.
Figure 2 from the original paper: The image is a diagram illustrating the two stages of the ASAP framework: motion tracking pre-training and real trajectory collection, as well as the model training and fine-tuning process. It begins with pose estimation and imitation goal generation from a human video dataset, followed by delta action model training in a simulator and real-world deployment to compensate for dynamics mismatch.
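The termination curriculum and Reference State Initialization described above can be summarized in a few lines. This is a schematic Python sketch; the start/end tolerances, the annealing horizon, and the `ref_motion.sample` interface are illustrative assumptions, since the paper's exact values and data structures are not reproduced here.

```python
import numpy as np

# Illustrative values only; the paper's actual start/end tolerances are not reproduced here.
TOL_START, TOL_END = 1.0, 0.3      # meters, assumed
CURRICULUM_UPDATES = 10_000        # policy updates over which to anneal, assumed

def termination_tolerance(update_idx):
    """Linearly tighten the allowed tracking error as training progresses."""
    frac = min(update_idx / CURRICULUM_UPDATES, 1.0)
    return TOL_START + frac * (TOL_END - TOL_START)

def reference_state_init(ref_motion, rng=np.random):
    """Reference State Initialization: start an episode at a random phase of the
    reference motion instead of always at phase 0.

    `ref_motion` is a hypothetical object exposing `sample(phase)`, returning the
    root pose/velocity and joint positions/velocities at that phase.
    """
    phase = rng.uniform(0.0, 1.0)
    return phase, ref_motion.sample(phase)
```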
4.2.2. Stage 2: Post-training (Training Delta Action Model and Fine-tuning Motion Tracking Policy)
The second stage of ASAP addresses the sim-to-real gap directly by leveraging real-world data collected from the pre-trained policy. This data is used to learn a delta action model that compensates for dynamics mismatch, which is then used to fine-tune the policy.
4.2.2.1. Data Collection

- The pre-trained policy (from Stage 1) is deployed on the real-world Unitree G1 humanoid robot to perform whole-body motion tracking tasks.
- During these real-world rollouts, trajectories of states and actions are recorded.
- At each timestep t, state information is captured using a motion capture device and onboard sensors. The recorded state includes:
  - Base position (from MoCap).
  - Base linear velocity (from MoCap).
  - Base orientation (quaternion, from MoCap).
  - Base angular velocity (from MoCap).
  - Joint positions (from the robot's proprioceptive sensors).
  - Joint velocities (from the robot's proprioceptive sensors).
- The recorded actions are the target joint positions commanded by the pre-trained policy in the real world.
- This data collection process is conceptually shown in Figure 2(a), and the deployment scenario is also visually represented in Figure 9. A minimal data structure for such a recorded trajectory is sketched after Figure 9.
Figure 9 from the original paper: The image is a scene showcasing the Unitree G1 robot performing a forward jump motion, challenging itself to leap over 1 meter. It captures the robot in various poses during the movement, demonstrating its capabilities in both simulated and real-world environments.
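A minimal container for the recorded real-world rollouts might look like the following; the field names and array shapes are illustrative choices that mirror the state components listed above (MoCap base pose/velocity plus onboard joint readings), not the authors' actual logging format.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class RealStep:
    """One timestep of a real-world rollout (field names are illustrative)."""
    base_pos: np.ndarray        # (3,)  MoCap base position
    base_lin_vel: np.ndarray    # (3,)  MoCap base linear velocity
    base_quat: np.ndarray       # (4,)  MoCap base orientation (quaternion)
    base_ang_vel: np.ndarray    # (3,)  MoCap base angular velocity
    joint_pos: np.ndarray       # (23,) onboard joint encoder readings
    joint_vel: np.ndarray       # (23,) onboard joint velocities
    action: np.ndarray          # (23,) PD target joint positions sent by the policy

@dataclass
class RealTrajectory:
    steps: List[RealStep] = field(default_factory=list)

    def append(self, step: RealStep) -> None:
        self.steps.append(step)
```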
4.2.2.2. Training Delta Action Model
The key to ASAP is learning a delta action model (π^Δ) to compensate for the sim-to-real physics gap. This model is trained by observing the discrepancies that arise when real-world actions are replayed in the simulator.

- Concept: When real-world actions are applied in the simulator (f^sim), the resulting simulated trajectory deviates from the real-world recorded trajectory. This deviation is a signal of the dynamics mismatch. The delta action model learns to output corrective actions (Δa_t) that, when added to the real-world actions, make the simulated state match the real-world state.
- Model Definition: The delta action model is a policy that outputs a corrective action conditioned on the current state and the nominal action, Δa_t = π^Δ(s_t, a_t), where:
  - s_t: the current state of the robot.
  - a_t: the action proposed by the main policy (or the real-world recorded action during training).
  - Δa_t: the corrective (residual) action output by the delta action model.
- Modified Simulator Dynamics for Training: The RL environment for training incorporates this delta action model by modifying the simulator's dynamics so that the next state is computed from the corrected action, s_{t+1} = f^sim(s_t, a_t^real + π^Δ(s_t, a_t^real)), where:
  - f^sim: the simulator's dynamics function.
  - a_t^real: the reference action recorded from real-world rollouts.
  - π^Δ(s_t, a_t^real): the corrective action learned by the delta action model.
  This formulation indicates that the delta action model modifies the actions before they are processed by the simulator's physics engine.
- RL Training Steps (using PPO):
  - Initialization: At the beginning of each RL step, the robot in the simulator is initialized at the recorded real-world state.
  - Reward Computation: A reward signal is computed to minimize the discrepancy between the resulting simulated state and the recorded real-world state. An additional action magnitude regularization term (on the norm of Δa_t) is included to encourage minimal corrections; a simplified version is sketched after Figure 4. The reward terms for delta action learning are summarized in Table II.

| Category | Term | Weight | Term | Weight |
| --- | --- | --- | --- | --- |
| Penalty | DoF position limits | −10.0 | DoF velocity limits | −5.0 |
| | Torque limits | −0.1 | Termination | −200.0 |
| Regularization | Action rate | −0.01 | Action norm | −0.2 |
| Task Reward | Body position | 1.0 | VR 3-point | 1.0 |
| | Body position (feet) | 1.0 | Body rotation | 0.5 |
| | Body angular velocity | 0.5 | Body velocity | 0.5 |
| | DoF position | 0.5 | DoF velocity | 0.5 |

  The above are the results from Table II of the original paper.

  - Policy Optimization: PPO is used to train the delta action policy, which learns to output the corrective actions that effectively match the simulation to the real world.
- Benefit: This learning process allows the simulator to accurately reproduce real-world failures. For instance, if the real-world robot cannot jump because its motors are weaker than the simulator assumes, the delta action model will learn to reduce the intensity of lower-body actions, effectively simulating these motor limitations. This aligned simulator then enables more effective policy fine-tuning.

This process is depicted in Figure 2(b) and contrasted with other methods in Figure 4.
Figure 4 from the original paper: The image is an illustration that shows the different stages of the ASAP framework. On the left are the Vanilla and SysID methods, in the middle is Delta Dynamics, and on the right is Delta Action (ASAP). Each stage represents how information flows through the simulator using different strategies and state updates.
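The reward for delta-action learning combines state-matching and action-magnitude terms (Table II). A heavily simplified Python sketch of that idea is shown below; the single tracking term, the choice of norms, and the weights are assumptions that collapse the paper's full per-body reward into one expression.

```python
import numpy as np

def delta_action_reward(s_sim_next, s_real_next, delta_a,
                        w_track=1.0, w_reg=0.2):
    """Per-step reward for training the delta action model (simplified).

    s_sim_next  : state reached in the simulator after applying a_real + delta_a
    s_real_next : recorded real-world next state for the same step
    delta_a     : correction output by the delta action model

    The real reward in Table II is a weighted sum over several body/DoF tracking
    terms plus penalties; this sketch collapses it into one tracking term and one
    action-norm regularizer with made-up weights.
    """
    tracking = -w_track * np.linalg.norm(s_sim_next - s_real_next)
    regularization = -w_reg * np.linalg.norm(delta_a)
    return tracking + regularization
```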
4.2.2.3. Fine-tuning Motion Tracking Policy under New Dynamics
Once the delta action model has been successfully trained, it is integrated into the simulator to create an aligned simulation environment.

- Reconstructed Simulator Dynamics: The simulation environment is effectively reconstructed with ASAP dynamics f^ASAP such that s_{t+1} = f^ASAP(s_t, a_t) = f^sim(s_t, a_t + π^Δ(s_t, a_t)). Here, the delta action model acts as a continuous correction layer that modifies the policy's actions before they are applied to the underlying simulator; this composition amounts to a thin wrapper around the physics step.
- Policy Fine-tuning: The parameters of the delta action model are frozen. The pre-trained policy (from Stage 1) is then fine-tuned within this augmented simulation environment. The same reward function (Table I) used during pre-training is applied during this fine-tuning phase.
- Benefit: By training in this delta action-augmented simulator, the policy effectively adapts to the real-world physics without needing direct real-world interaction during fine-tuning. It learns to generate actions that, once modified by π^Δ within the simulator, lead to the desired real-world-like outcomes.

This fine-tuning process is depicted in Figure 2(c).
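The aligned "ASAP dynamics" used for fine-tuning can be viewed as a wrapper that applies the frozen delta action model before every physics step. The sketch below assumes generic `sim.step` and `delta_model` callables; it illustrates the composition, not the authors' implementation.

```python
class ASAPDynamics:
    """Simulator wrapper implementing the aligned dynamics: the frozen delta
    action model corrects the policy's action before the physics step.

    `sim` is any object with `step(state, action) -> next_state`, and
    `delta_model(state, action) -> delta_action` is the frozen correction;
    both are placeholders for whatever simulator / network is actually used.
    """

    def __init__(self, sim, delta_model):
        self.sim = sim
        self.delta_model = delta_model   # parameters frozen during fine-tuning

    def step(self, state, action):
        corrected = action + self.delta_model(state, action)
        return self.sim.step(state, corrected)

# Fine-tuning runs PPO on the pre-trained policy inside this wrapper; at
# deployment the raw robot is used and the policy acts alone.
```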
4.2.2.4. Policy Deployment

- Finally, the fine-tuned policy is deployed directly in the real world.
- Crucially, during real-world deployment the delta action model is not used. The fine-tuned policy itself has learned to generate actions that are inherently robust and adapted to the real-world dynamics, thanks to the fine-tuning process in the aligned simulator.
- The fine-tuned policy demonstrates enhanced real-world motion tracking performance compared to the pre-trained policy, showcasing ASAP's effectiveness in bridging the sim-to-real gap.

This final deployment stage is shown in Figure 2(d).
5. Experimental Setup
The ASAP framework is evaluated across various transfer scenarios to assess its ability to compensate for dynamics mismatch and improve policy performance.
5.1. Datasets

- Simulation Data:
  - The primary dataset used for motion tracking in simulation is the retargeted motion dataset, derived from human motion videos and retargeted to the humanoid robot.
  - These motions are categorized into three difficulty levels: easy, medium, and hard, based on their complexity and the agility required. Examples of these motions are partially visualized in Figure 6, including Side Jump, Single Foot Balance, Squat, Step Backward, Step Forward, and Walk.

Figure 6 from the original paper: The image is a comparative table showing the performance of the ASAP method against other methods in different testing environments (IsaacSim and Genesis). The table includes data on success rates, trajectory errors, and other metrics, as well as illustrations of different actions (such as jumping, balancing, squatting, etc.). It also presents results across various difficulty levels (easy, medium, hard), demonstrating the significant advantages of ASAP.

  - This dataset serves as the source of imitation goals for training and fine-tuning motion tracking policies in IsaacGym, IsaacSim, and Genesis.
- Real-World Data (for the Unitree G1):
  - Real-world data collection is explicitly performed to train the delta action model. Due to practical constraints such as motor overheating and hardware failures during dynamic motion execution (two Unitree G1 robots reportedly broke), training the full 23-DoF delta action model was deemed infeasible given the limited data.
  - Instead, a more sample-efficient approach was adopted, focusing on learning a 4-DoF ankle delta action model. This decision was justified by:
    - the impracticality of collecting enough motion clips to train a 23-DoF model, and
    - the Unitree G1 robot's mechanical linkage design in the ankle, which introduces a significant and difficult-to-model sim-to-real gap [37].
  - For the 4-DoF ankle delta action model, 100 motion clips were collected, which proved sufficient.
  - The real-world experiments prioritize motion safety and representativeness, selecting five motion-tracking tasks: (i) kick, (ii) jump forward, (iii) step forward and back, (iv) single foot balance, and (v) single foot jump.
  - Each task's tracking policy was executed 30 times.
  - Additionally, 10 minutes of locomotion data were collected to train a robust locomotion policy for transitioning between different motion-tracking tasks in the real world (since real robots cannot be easily reset like in simulators).
5.2. Evaluation Metrics
The paper uses several quantitative metrics to evaluate the performance of motion tracking policies, particularly focusing on tracking error and success rate. (A short code sketch of the two MPJPE metrics appears at the end of this subsection.)

- Success Rate:
  - Conceptual Definition: This metric indicates the proportion of attempts where the robot successfully imitates the reference motion without exceeding a certain tracking error threshold. It measures the robustness and overall ability of the policy to complete the task.
  - Calculation: An imitation is deemed unsuccessful if, at any point during the motion, the average body distance between the robot and the reference motion exceeds the threshold.
  - Formula (implied rather than explicitly provided in the paper):
    $$ \text{Success Rate} = \frac{\text{Number of successful episodes}}{\text{Total number of episodes}} \times 100\% $$
    where an episode is successful if the average body distance stays within the threshold at every timestep.
- Global Body Position Tracking Error ($E_{\mathrm{g\text{-}mpjpe}}$, mm):
  - Conceptual Definition: This metric quantifies the average Euclidean distance between corresponding body parts (or joints) of the robot's actual (or simulated) global positions and the reference motion's global positions. It measures how well the robot's entire body tracks the overall spatial trajectory of the reference. The g- prefix indicates "global".
  - Mathematical Formula:
    $$ E_{\mathrm{g\text{-}mpjpe}} = \frac{1}{N \cdot J} \sum_{i=1}^{N} \sum_{j=1}^{J} \left\| \mathbf{p}_{i,j}^{\text{robot}} - \mathbf{p}_{i,j}^{\text{ref}} \right\|_2 $$
  - Symbol Explanation:
    - $N$: the total number of frames or timesteps in the motion sequence.
    - $J$: the total number of tracked joints or body parts on the robot.
    - $\mathbf{p}_{i,j}^{\text{robot}}$: the 3D global position of joint $j$ on the robot at frame $i$.
    - $\mathbf{p}_{i,j}^{\text{ref}}$: the 3D global position of joint $j$ in the reference motion at frame $i$.
    - $\|\cdot\|_2$: the Euclidean (L2) norm.
- Root-relative Mean Per-Joint Position Error ($E_{\mathrm{mpjpe}}$, mm):
  - Conceptual Definition: This metric also quantifies the average Euclidean distance between corresponding joints, but after aligning both the robot's pose and the reference motion's pose to a common root joint (e.g., the pelvis). This normalization removes errors due to overall global translation and rotation, focusing specifically on the accuracy of the robot's internal body configuration and relative joint positions.
  - Mathematical Formula:
    $$ E_{\mathrm{mpjpe}} = \frac{1}{N \cdot J} \sum_{i=1}^{N} \sum_{j=1}^{J} \left\| \left(\mathbf{p}_{i,j}^{\text{robot}} - \mathbf{p}_{i,\text{root}}^{\text{robot}}\right) - \left(\mathbf{p}_{i,j}^{\text{ref}} - \mathbf{p}_{i,\text{root}}^{\text{ref}}\right) \right\|_2 $$
  - Symbol Explanation:
    - $\mathbf{p}_{i,\text{root}}^{\text{robot}}$, $\mathbf{p}_{i,\text{root}}^{\text{ref}}$: the 3D global positions of the root joint on the robot and in the reference motion at frame $i$; the differences in parentheses are the root-relative positions of joint $j$.
- Acceleration Error ($E_{\mathrm{acc}}$):
  - Conceptual Definition: Measures the average difference in acceleration between the robot's actual (or simulated) motion and the reference motion. High acceleration errors can indicate jerky, unnatural, or poorly controlled movements, which is especially important for agile motions.
  - Formula (not explicitly provided; generally the mean error of acceleration vectors derived from position data):
    $$ E_{\mathrm{acc}} = \frac{1}{N \cdot J} \sum_{i=1}^{N} \sum_{j=1}^{J} \left\| \mathbf{a}_{i,j}^{\text{robot}} - \mathbf{a}_{i,j}^{\text{ref}} \right\|_2 $$
    where $\mathbf{a}_{i,j}^{\text{robot}}$ and $\mathbf{a}_{i,j}^{\text{ref}}$ are the acceleration vectors of joint $j$ at frame $i$ for the robot and the reference motion, respectively.
- Root Velocity Error ($E_{\mathrm{vel}}$, mm/frame):
  - Conceptual Definition: Measures the average difference in linear velocity of the robot's root (base) between its actual (or simulated) motion and the reference motion. This is critical for assessing how well the robot matches the desired speed and direction of overall movement.
  - Formula (not explicitly provided; generally the mean error of root velocity vectors):
    $$ E_{\mathrm{vel}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{v}_{i,\text{root}}^{\text{robot}} - \mathbf{v}_{i,\text{root}}^{\text{ref}} \right\|_2 $$
    where $\mathbf{v}_{i,\text{root}}^{\text{robot}}$ and $\mathbf{v}_{i,\text{root}}^{\text{ref}}$ are the root linear velocity vectors at frame $i$.

The mean values of these metrics are computed across all motion sequences used in the evaluation.
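For reference, the two MPJPE-style metrics can be computed directly from arrays of global joint positions. The sketch below assumes inputs of shape (N, J, 3) and a root joint index of 0; both are illustrative conventions rather than details from the paper.

```python
import numpy as np

def g_mpjpe(p_robot, p_ref):
    """Global MPJPE: mean Euclidean distance over frames and joints.

    p_robot, p_ref : arrays of shape (N, J, 3) with global joint positions.
    Returns the error in the same length unit as the inputs.
    """
    return np.linalg.norm(p_robot - p_ref, axis=-1).mean()

def mpjpe(p_robot, p_ref, root_idx=0):
    """Root-relative MPJPE: subtract each frame's root position before comparing."""
    rel_robot = p_robot - p_robot[:, root_idx:root_idx + 1, :]
    rel_ref = p_ref - p_ref[:, root_idx:root_idx + 1, :]
    return np.linalg.norm(rel_robot - rel_ref, axis=-1).mean()

# Example with random data just to show the expected shapes (N frames, J joints).
N, J = 100, 23
err = g_mpjpe(np.random.randn(N, J, 3), np.random.randn(N, J, 3))
```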
5.3. Baselines
The ASAP framework is compared against several baselines to demonstrate its effectiveness in bridging the dynamics gap and improving policy performance.

- Oracle:
  - Description: This baseline represents an ideal scenario where the RL policy is both trained and evaluated entirely within the same simulator (IsaacGym).
  - Representativeness: It serves as an upper bound for performance in simulation, indicating the best possible tracking errors achievable if there were no sim-to-real gap or sim-to-sim mismatch.
- Vanilla:
  - Description: The RL policy is trained in IsaacGym (the source simulator) and then directly evaluated in the target simulation environment (IsaacSim, Genesis) or the real world (Unitree G1) without any sim-to-real adaptation or fine-tuning.
  - Representativeness: This baseline quantifies the raw dynamics mismatch and the performance degradation that occurs when a policy is transferred without explicit adaptation strategies. It highlights the problem ASAP aims to solve.
- SysID (System Identification):
  - Description: This method attempts to align simulator dynamics with real-world dynamics by identifying and tuning specific physical parameters.
  - Implementation: The authors specifically identify the base center of mass (CoM) shift, the base link mass offset ratio, and the low-level PD gain ratios for each of the 23 DoFs. These parameters are searched within discrete ranges by replaying real-world recorded trajectories in simulation and finding the best alignment. After identifying the best SysID parameters, the pretrained policy is fine-tuned in IsaacGym with these adjusted parameters.
  - Representativeness: This is a common and traditional approach to sim-to-real transfer, serving as a strong baseline for explicit dynamics calibration.
  - The SysID parameters and their ranges are provided in Table VII in the Appendix:

| Parameter | Range | Parameter | Range |
| --- | --- | --- | --- |
| c_x | [−0.02, 0.0, 0.02] | c_y | [−0.02, 0.0, 0.02] |
| c_z | [−0.02, 0.0, 0.02] | k_m | [0.95, 1.0, 1.05] |
| k_p | [0.95, 1.0, 1.05] | k_d | [0.95, 1.0, 1.05] |

  The above are the results from Table VII of the original paper. Note: the extracted table garbled two of the parameter labels; assuming standard SysID parameters, the second entry of the first row is taken to be c_y (base CoM shift in y) and the last entry k_d (D gain ratio), alongside the CoM shifts c_x and c_z, the mass offset ratio k_m, and the P gain ratio k_p.
- DeltaDynamics:
  - Description: This baseline learns a residual dynamics model to capture the discrepancy between simulated and real-world physics.
  - Implementation (from Appendix C): Using the collected real-world trajectories, the recorded actions are replayed in simulation to obtain simulated trajectories. A neural residual dynamics model f_θ^Δ is trained to predict the difference in state transitions. The loss function is a mean squared error (MSE) in an autoregressive setting, where the corrected simulator is rolled forward for K steps:
    $$ \mathcal{L} = \Bigl\| s_{t+K}^{\mathrm{real}} - \underbrace{f^{\mathrm{sim}}\bigl(\ldots, f^{\mathrm{sim}}}_{K}\bigl(s_t, a_t\bigr) + f_{\theta}^{\Delta}(s_t, a_t), \ldots, a_{t+K}\bigr) \Bigr\| $$
    After training, this residual dynamics model is frozen and integrated into the simulator. The pretrained policy is then fine-tuned in this augmented simulation environment. (A sketch of this K-step loss appears at the end of this subsection.)
  - Representativeness: This baseline represents a state-of-the-art approach to dynamics modeling and sim-to-real transfer by learning explicit state-level corrections. It directly contrasts with ASAP's delta action model approach.

These baselines provide a comprehensive comparison, allowing the authors to evaluate ASAP against methods that rely on different principles for sim-to-real adaptation.
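To clarify how the DeltaDynamics baseline is trained, here is a schematic PyTorch version of the K-step autoregressive loss. The `f_sim` and `f_delta` callables, the tensor layout, and the exact placement of the correction are simplifying assumptions that follow Appendix C only loosely.

```python
import torch
import torch.nn as nn

def k_step_residual_loss(f_sim, f_delta: nn.Module, states, actions, K=4):
    """Autoregressive training loss for the DeltaDynamics baseline (sketch).

    f_sim(s, a)    : simulator step, returns the next state
    f_delta(s, a)  : residual network predicting the state-transition correction
    states, actions: real-world tensors of shape (T, state_dim) / (T, act_dim)

    Rolls the corrected simulator forward K steps from each s_t and penalizes
    the distance to the recorded real state s_{t+K}.
    """
    T = states.shape[0]
    loss = 0.0
    count = 0
    for t in range(T - K):
        s = states[t]
        for k in range(K):
            a = actions[t + k]
            s = f_sim(s, a) + f_delta(s, a)   # corrected rollout
        loss = loss + ((s - states[t + K]) ** 2).mean()
        count += 1
    return loss / max(count, 1)
```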
6. Results & Analysis
The paper presents extensive experimental results to validate ASAP's performance across sim-to-sim and sim-to-real transfer scenarios, addressing three key questions (Q1-Q3) and further conducting extensive studies (Q4-Q6).
6.1. Core Results Analysis
6.1.1. Q1: Comparison of Dynamics Matching Capability (Open-Loop Performance)
This section addresses Q1: Can ASAP outperform other baseline methods to compensate for the dynamics mismatch? by evaluating open-loop performance. Open-loop evaluation measures how accurately a method can reproduce testing-environment trajectories in the training environment (IsaacGym) after learning the dynamics mismatch. This is done by replaying actions and comparing the resulting states using metrics like MPJPE.
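Conceptually, the open-loop evaluation replays recorded actions through the (possibly corrected) training simulator and measures how far the rollout drifts from the recorded trajectory. The sketch below assumes hypothetical `train_sim.step`, trajectory fields, and an `mpjpe_fn` such as the one sketched in Section 5.2; it illustrates the procedure, not the authors' evaluation code.

```python
import numpy as np

def open_loop_replay_error(train_sim, real_traj, mpjpe_fn, horizon):
    """Open-loop check of dynamics alignment (sketch).

    Starting from each recorded state, replay the recorded actions in the
    training simulator for `horizon` steps and compute a tracking error against
    the recorded trajectory. `train_sim.step(state, action)` and the fields of
    `real_traj` (state, action, joint_world_positions) are placeholders.
    """
    errors = []
    T = len(real_traj.steps)
    for t0 in range(0, T - horizon):
        s = real_traj.steps[t0].state
        rollout, reference = [], []
        for k in range(horizon):
            s = train_sim.step(s, real_traj.steps[t0 + k].action)
            rollout.append(s.joint_world_positions)                       # (J, 3), assumed
            reference.append(real_traj.steps[t0 + k + 1].joint_world_positions)
        errors.append(mpjpe_fn(np.stack(rollout), np.stack(reference)))
    return float(np.mean(errors))
```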
The following are the results from Table III of the original paper:
| Simulator & Length | IsaacSim | Genesis | |||||||
| E_g-mpjpe | E_mpjpe | E_acc | E_vel | E_g-mpjpe | E_mpjpe | E_acc | E_vel | |
| 0.25s | OpenLoop | 19.5 | 15.1 | 6.44 | 5.80 | 19.8 | 15.3 | 6.53 | 5.88 |
| SysID | 19.4 | 15.0 | 6.43 | 5.74 | 19.3 | 15.0 | 6.42 | 5.73 | |
| DeltaDynamics | 24.4 | 13.6 | 9.43 | 7.85 | 20.0 | 12.4 | 8.42 | 6.89 | |
| ASAP | 19.9 | 15.6 | 6.48 | 5.86 | 19.0 | 14.9 | 6.19 | 5.59 | |
| 0.5s | OpenLoop | 33.3 | 23.2 | 6.80 | 6.84 | 33.1 | 23.0 | 6.78 | 6.82 |
| SysID | 32.1 | 22.2 | 6.57 | 6.56 | 32.2 | 22.3 | 6.57 | 6.57 | |
| DeltaDynamics | 36.5 | 16.4 | 8.89 | 7.98 | 27.8 | 14.0 | 7.63 | 6.74 | |
| ASAP | 26.8 | 19.2 | 5.09 | 5.36 | 25.9 | 18.4 | 4.93 | 5.19 | |
| 1.0s | OpenLoop | 80.8 | 43.5 | 10.6 | 11.1 | 82.5 | 44.5 | 10.8 | 11.4 |
| SysID | 77.6 | 41.5 | 10.0 | 10.7 | 76.5 | 41.6 | 10.00 | 10.5 | |
| DeltaDynamics | 68.1 | 21.5 | 9.61 | 9.14 | 50.2 | 17.2 | 8.19 | 7.62 | |
| ASAP | 37.9 | 22.9 | 4.38 | 5.26 | 36.9 | 22.6 | 4.23 | 5.10 | |
Analysis of Table III:

- ASAP's Superiority: ASAP consistently outperforms the OpenLoop baseline across all replayed motion lengths (0.25s, 0.5s, 1.0s) and in both IsaacSim and Genesis, achieving significantly lower E_g-mpjpe (global body position tracking error) and E_mpjpe (root-relative mean per-joint position error) values. This indicates that ASAP's delta action model is highly effective at aligning the simulator's dynamics with the target environment, leading to improved open-loop trajectory reproduction. For example, at the 1.0s length in IsaacSim, ASAP achieves an E_g-mpjpe of 37.9, a substantial reduction from OpenLoop's 80.8.
- SysID's Limitations: While SysID shows some improvement over OpenLoop for short horizons (e.g., 0.25s), it struggles in long-horizon scenarios. Its performance degrades as motion length increases, suggesting that SysID may not fully capture the complex dynamics mismatch or that cumulative errors build up over longer trajectories. Its E_g-mpjpe values remain high for longer durations.
- DeltaDynamics' Behavior: DeltaDynamics improves upon SysID and OpenLoop for long horizons (e.g., at 1.0s, its E_g-mpjpe of 68.1 in IsaacSim is better than SysID's 77.6). However, the paper notes that DeltaDynamics suffers from overfitting, leading to cascading errors magnified over time, which may explain its less consistent performance compared to ASAP.
- Agility Metrics (E_acc, E_vel): ASAP also shows superior performance in acceleration error (E_acc) and root velocity error (E_vel), especially for longer durations. For 1.0s motions, ASAP significantly reduces these errors compared to all baselines in both IsaacSim and Genesis, indicating better agreement in dynamic behavior and motion quality.

These results emphasize the efficacy of ASAP's delta action model in reducing the physics gap and improving open-loop replay performance, showcasing its superior generalization capability.
The following figure (Figure 5 from the original paper) illustrates the per-step MPJPE error for different methods, visually supporting the quantitative results.
Figure 5 from the original paper: The image is an illustration showing the MPJPE (millimeters) error over time steps for different methods (Vanilla, SysID, Delta Dynamics, and ASAP). It depicts the action execution results for each method, with a graph of the error curves below.
Analysis of Figure 5: The graph clearly shows that ASAP maintains the lowest MPJPE over time steps compared to Vanilla, SysID, and DeltaDynamics. The error accumulation for other methods is evident, with SysID and DeltaDynamics showing better initial performance than Vanilla but still diverging from ASAP over time. This visual evidence reinforces ASAP's ability to reduce dynamics mismatch effectively over longer horizons.
6.1.2. Q2: Comparison of Policy Fine-Tuning Performance (Closed-Loop Performance)
This section addresses Q2: Can ASAP finetune policy to outperform SysID and Delta Dynamics methods? by evaluating closed-loop performance. This involves fine-tuning RL policies in the respective modified training environments and then deploying them in the testing environments (IsaacSim, Genesis) to quantify motion-tracking errors.
The following are the results from Table IV of the original paper:
| Method & Type | IsaacSim | Genesis | |||||||||
| Success Rate (%) | E_g-mpjpe | E_mpjpe | E_acc | E_vel | Success Rate (%) | E_g-mpjpe | E_mpjpe | E_acc | E_vel | |
| Easy | Vanilla | 100 | 243 | 101 | 7.62 | 9.11 | 100 | 252 | 108 | 7.93 | 9.51 |
| SysID | 100 | 157 | 65.8 | 6.25 | 7.52 | 100 | 168 | 78.0 | 6.32 | 7.81 | |
| DeltaDynamics | 100 | 142 | 57.3 | 8.06 | 9.14 | 100 | 137 | 76.2 | 7.26 | 8.44 | |
| ASAP | 100 | 106 | 44.3 | 4.23 | 5.17 | 100 | 125 | 73.5 | 4.62 | 5.43 | |
| Medium | Vanilla | 100 | 305 | 126 | 9.82 | 11.4 | 100 | 316 | 131 | 9.94 | 11.6 |
| SysID | 100 | 183 | 88.6 | 7.83 | 8.99 | 100 | 206 | 91.2 | 7.91 | 9.21 | |
| DeltaDynamics | 98 | 169 | 68.9 | 9.03 | 10.3 | 97 | 158 | 81.3 | 8.56 | 9.65 | |
| ASAP | 100 | 126 | 52.7 | 4.75 | 5.52 | 100 | 148 | 80.2 | 5.02 | 5.91 | |
| Hard | Vanilla | 98 | 389 | 160 | 11.2 | 13.1 | 95 | 401 | 166 | 11.5 | 13.5 |
| SysID | 98 | 212 | 109 | 9.01 | 10.5 | 96 | 235 | 112 | 9.31 | 10.7 | |
| DeltaDynamics | 94 | 196 | 79.4 | 10.1 | 11.5 | 90 | 177 | 86.7 | 9.23 | 10.5 | |
| ASAP | 100 | 134 | 55.3 | 5.23 | 6.01 | 100 | 129 | 77.0 | 5.46 | 6.28 | |
Analysis of Table IV:
- ASAP's Consistent Outperformance: ASAP consistently achieves the lowest tracking errors across all difficulty levels (Easy, Medium, Hard) in both IsaacSim and Genesis. For instance, in IsaacSim at the Hard level, ASAP achieves an E_g-mpjpe of 134 and E_mpjpe of 55.3, significantly outperforming SysID (212, 109) and DeltaDynamics (196, 79.4). Similar trends are observed in Genesis.
- Reduced Agility Errors: ASAP also shows markedly lower acceleration error (E_acc) and root velocity error (E_vel) across all scenarios. This indicates that ASAP fine-tunes policies to produce smoother, more natural, and dynamically accurate motions, which is crucial for agile skills.
- High Success Rate: ASAP consistently maintains a 100% success rate across all sim-to-sim transfer evaluations, unlike DeltaDynamics, which drops slightly in harder environments (e.g., 90% in Genesis Hard). This demonstrates ASAP's robustness in closed-loop control.
- Baselines' Limitations: Vanilla policies show the highest errors, as expected, highlighting the dynamics mismatch. SysID and DeltaDynamics improve upon Vanilla but still fall short of ASAP's performance, particularly in E_mpjpe, E_acc, and E_vel, suggesting that their dynamics compensation is less effective, or prone to overfitting (for DeltaDynamics), when used for closed-loop policy fine-tuning.

These results highlight ASAP's robustness and adaptability in addressing the sim-to-real gap for closed-loop control, preventing overfitting and ensuring reliable deployment.
The following figure (Figure 7 from the original paper) provides per-step visualizations of motion tracking error, comparing ASAP with RL policies deployed without fine-tuning.
Figure 7 from the original paper: The image is a schematic diagram that illustrates the changes in motion tracking error before and after the Delta Action fine-tuning across two scenarios (from IsaacGym to IsaacSim and from IsaacGym to Genesis). The top section shows the motion states, while the bottom section displays the error curves over time, with red and blue lines representing the comparison of MPJPE (mm) error before and after fine-tuning.
Analysis of Figure 7: The visualizations confirm that ASAP (Delta Action Fine-tuning) consistently maintains lower MPJPE over time compared to the Vanilla (No Fine-tuning) policies. The Vanilla policies accumulate errors, leading to degraded tracking, while ASAP adapts to the new dynamics and sustains stable performance. This visual evidence further validates ASAP's effectiveness in closed-loop performance.
6.1.3. Q3: Real-World Evaluations (Sim-to-Real Transfer)
This section addresses Q3: Does ASAP work for sim-to-real transfer? by deploying ASAP on the real-world Unitree G1 robot.
- Real-World Data Strategy: As mentioned in the Experimental Setup, due to hardware constraints and motor overheating, a sample-efficient approach was adopted: learning a 4-DoF ankle delta action model instead of the full 23-DoF model. This targeted approach addresses a known sim-to-real gap in the Unitree G1's ankle mechanical linkage (see the sketch at the end of this subsection for how such a restricted delta can be applied).
- Policy Transition: A robust locomotion policy was trained to handle transitions between different motion-tracking tasks in the real world, allowing the robot to maintain balance without manual resets.

The following are the results from Table V of the original paper:
| Method | Real-World-Kick E_g-mpjpe | Real-World-Kick E_mpjpe | Real-World-Kick E_acc | Real-World-Kick E_vel | Real-World-LeBron (OOD) E_g-mpjpe | Real-World-LeBron (OOD) E_mpjpe | Real-World-LeBron (OOD) E_acc | Real-World-LeBron (OOD) E_vel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 61.2 | 40.1 | 2.46 | 2.70 | 159 | 47.5 | 2.84 | 5.94 |
| ASAP | 50.2 | 43.5 | 2.96 | 2.91 | 112 | 55.3 | 3.43 | 6.43 |
Analysis of Table V (Real-World Results):
- ASAP's Real-World Performance: ASAP consistently outperforms the Vanilla baseline on both the in-distribution (Real-World-Kick) and out-of-distribution (OOD) (Real-World-LeBron "Silencer") humanoid motion tracking tasks.
- Reduced Global Tracking Error: For the Kick motion, ASAP reduces E_g-mpjpe from 61.2 to 50.2, a significant improvement (approx. 18% reduction). For the more challenging LeBron (OOD) motion, ASAP achieves an E_g-mpjpe of 112 compared to Vanilla's 159 (approx. 30% reduction), demonstrating its ability to generalize to unseen dynamic motions.
- Mixed Root-Relative MPJPE: Interestingly, E_mpjpe is slightly higher for ASAP in both cases (43.5 vs. 40.1 for Kick, and 55.3 vs. 47.5 for LeBron). The paper notes that ASAP fine-tuning makes the robot behave more smoothly and reduces jerky lower-body motions (Figure 8), which may shift the distribution of root-relative joint positions while still improving overall global tracking, a favorable trade-off for agility.
- Agility Metrics (E_acc, E_vel): E_acc and E_vel are slightly higher for ASAP in the real-world results compared to Vanilla. This may indicate that, while ASAP achieves better global motion tracking, the fine-tuned policies introduce somewhat higher instantaneous accelerations and velocities to execute the agile motions precisely, or it may reflect a different trade-off in real-world motor control. Despite this, the visual and qualitative assessment (Figure 8) points to smoother, more coordinated motion.

These findings highlight ASAP's effectiveness in improving sim-to-real transfer for agile humanoid motion tracking, even with a limited-DoF delta action model.
The following figure (Figure 8 from the original paper) visually compares actions before and after Delta Action fine-tuning.
Figure 8 from the original paper: The image is a diagram showing the comparison of actions before and after Delta Action fine-tuning. The top part illustrates the motion states before fine-tuning, while the bottom part shows the states after fine-tuning. The left side represents in-distribution fine-tuning, and the right side represents out-of-distribution fine-tuning, showcasing the improved agility and coordination.
Analysis of Figure 8: The visual comparison supports the claim that ASAP fine-tuning leads to smoother and more coordinated lower-body motions. The Vanilla policy (top row) appears to exhibit more jerky or less precise movements compared to the ASAP fine-tuned policy (bottom row), especially visible in the in-distribution kick motion and the out-of-distribution LeBron motion.
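As a rough illustration of the 4-DoF ankle delta action strategy described above, the sketch below applies a learned correction only to a chosen subset of joints while leaving the rest of the 23-DoF action untouched. The joint indices and the `delta_model` callable are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical indices for the ankle DoFs; the real Unitree G1 joint
# ordering is not reproduced here.
ANKLE_IDX = np.array([4, 5, 10, 11])

def apply_partial_delta(action, state, delta_model):
    """Add a learned delta action on a subset of joints only.

    action: (23,) nominal policy action for all DoFs.
    delta_model(state): assumed to return a (4,) correction for the
    ankle joints, mirroring the 4-DoF ankle delta action model.
    """
    corrected = action.copy()
    corrected[ANKLE_IDX] += delta_model(state)  # other joints unchanged
    return corrected
```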
6.2. Extensive Studies and Analyses
6.2.1. Q4: Key Factors in Training Delta Action Models
This section addresses Q4: How to best train the delta action model of ASAP? through a systematic study on dataset size, training horizon, and action norm weight.
The following figure (Figure 10 from the original paper) presents the analysis of these factors.
Figure 10 from the original paper: The image is a chart that illustrates the impact of dataset size, training horizon, and action norm weight on the performance of the delta action model. Panel (a) shows the Mean Per-Joint Position Error (MPJPE) across different dataset sizes, (b) presents the influence of training horizon on performance, and (c) describes the relationship between variations in action norm weight and closed-loop MPJPE.
- a) Dataset Size (Figure 10a):
  - Finding: Increasing the dataset size generally improves the delta action model's generalization, reducing MPJPE on out-of-distribution (unseen) trajectories during open-loop evaluation. However, the improvement in closed-loop performance (when the fine-tuned policy is deployed) saturates: only a marginal 0.65% decrease in MPJPE was observed when scaling from 4,300 to 43,000 samples.
  - Implication: While more data helps the delta action model generalize to new trajectories, there is a point of diminishing returns for policy fine-tuning. A moderately sized dataset (e.g., 4,300 samples) can be sufficient for good closed-loop performance.
- b) Training Horizon (Figure 10b):
  - Finding: Longer training horizons (the length of the trajectory segments used to train the delta action model) generally improve open-loop performance, with a 1.5s horizon achieving the lowest errors for evaluations at 0.25s, 0.5s, and 1.0s. However, this trend does not carry over to closed-loop performance, where the best results are observed at a training horizon of 1.0s.
  - Implication: An excessively long training horizon for the delta action model does not necessarily benefit the final fine-tuned policy. There is an optimal trade-off: a horizon that captures sufficient temporal dependencies (around 1.0s) is most effective for policy fine-tuning.
- c) Action Norm Weight (Figure 10c):
  - Finding: The action norm reward (part of the regularization term in Table II for delta action learning) is crucial for balancing dynamics alignment with minimal corrections. Both open-loop and closed-loop errors decrease as the action norm weight increases, reaching their lowest values at a weight of 0.1; increasing the weight further causes open-loop errors to rise again.
  - Implication: Careful tuning of the action norm weight is vital. Too low, and the delta action model may make unnecessarily large corrections; too high, and the minimal-action-norm reward dominates the learning objective, preventing effective dynamics compensation. A weight of 0.1 appears to balance these objectives; a minimal sketch of this trade-off is given below, after this list.
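As a minimal sketch of the trade-off studied in Figure 10c, the snippet below combines a dynamics-alignment term with an action-norm penalty weighted by `w_norm`. The exact reward terms and scales in the paper differ; this only illustrates the role of the weight, with 0.1 as the value the study found to work best.

```python
import numpy as np

def delta_action_reward(next_state_sim, next_state_real, delta_action,
                        w_norm=0.1):
    """Illustrative reward for training the delta action model.

    The first term rewards the augmented simulator for reproducing the
    recorded real-world next state; the second penalizes large delta
    actions so the learned corrections stay minimal.
    """
    alignment = -np.linalg.norm(next_state_sim - next_state_real)
    norm_penalty = -np.sum(np.square(delta_action))
    return alignment + w_norm * norm_penalty
```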
6.2.2. Q5: Different Usage of Delta Action Model
This section addresses Q5: How to best use the delta action model of ASAP? by comparing different strategies for fine-tuning the nominal policy $\pi$ using the learned delta action model $\pi^\Delta$. The goal is to obtain a fine-tuned policy $\pi^*$ for real-world deployment.
The underlying relationship, derived from one-step dynamics matching, is:

$$f^{\text{sim}}\big(s, \pi^*(s) + \pi^\Delta(s, \pi^*(s))\big) = f^{\text{sim}}\big(s, \pi(s)\big)$$

which simplifies to:

$$\pi^*(s) + \pi^\Delta(s, \pi^*(s)) = \pi(s) \qquad \text{(Equation 2 in the paper's appendix)}$$
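For intuition, the left-hand side of the equation above, stepping the simulator with the delta-corrected action, can be sketched as below; `sim_step` stands in for the physics engine step and `delta_model` for the learned $\pi^\Delta$, both placeholders rather than the paper's code.

```python
def asap_sim_step(sim_step, delta_model, state, action):
    """One step of the delta-action-augmented ('aligned') simulator.

    The nominal action is corrected by the learned delta action model
    before being executed, so the simulated transition approximates the
    real-world transition for the same nominal action.
    """
    corrected_action = action + delta_model(state, action)
    return sim_step(state, corrected_action)
```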
The paper considers three approaches to solve for $\pi^*$:
- Fixed-Point Iteration (RL-Free):
  - Concept: This method iteratively refines the policy. It starts with an initial guess $\pi^{(0)} = \pi$ (the nominal policy) and then updates it using the learned delta action model via the fixed-point iteration $\pi^{(k+1)}(s) = \pi(s) - \pi^\Delta\big(s, \pi^{(k)}(s)\big)$. The iteration continues for $K$ steps, with $\pi^{(K)}$ expected to converge to the solution of the one-step matching equation (see the sketch at the end of this subsection).
  - Limitation: As an RL-free method, it is myopic (it only considers one-step matching) and can suffer from out-of-distribution (OOD) issues if the iterated actions drift into regions where $\pi^\Delta$ was not trained.
- Gradient-Based Optimization (RL-Free):
  - Concept: This approach formulates the problem as an optimization task that minimizes a loss quantifying the discrepancy from the desired relationship, e.g. $\mathcal{L}(a) = \big\| a + \pi^\Delta(s, a) - \pi(s) \big\|^2$. Gradient descent is then used to find the action $a^* = \pi^*(s)$ that minimizes this loss (also sketched at the end of this subsection).
  - Limitation: Like fixed-point iteration, this method is myopic (it rests on a one-step matching assumption) and can struggle with OOD data, especially over multi-step trajectories. It also requires differentiating through the delta action model, which adds complexity.
- RL Fine-Tuning (ASAP's approach):
  - Concept: Instead of attempting to solve the one-step matching equation directly, ASAP uses Reinforcement Learning (PPO) to fine-tune the nominal policy within the simulator augmented with the delta action model $\pi^\Delta$. The RL agent learns directly through interaction in this aligned simulator.
  - Benefit: This approach effectively performs a gradient-free, multi-step matching procedure. By optimizing cumulative rewards over entire trajectories, RL fine-tuning accounts for long-term consequences and adapts more robustly to potential OOD issues encountered during rollouts, which the myopic RL-free methods cannot.

The following figure (Figure 11 from the original paper) compares the MPJPE over timesteps for these fine-tuning methods.
Figure 11 from the original paper: The image is a chart that shows the comparison of MPJPE (mm) over time for different fine-tuning methods. The ASAP method (red) performs best in terms of error, while Gradient Search (yellow) and Fixed-Point Iteration (green) perform worse than the baseline (blue, Before DeltaA).

Analysis of Figure 11: The graph clearly shows that RL Fine-Tuning (ASAP's method) achieves the lowest MPJPE during deployment, consistently outperforming the Fixed-Point Iteration and Gradient Search methods, as well as the Before DeltaA baseline (i.e., the Vanilla policy without any delta action compensation). Both RL-free approaches (Fixed-Point Iteration and Gradient Search) perform worse than the Before DeltaA baseline, indicating their failure to effectively adapt the policy. This validates ASAP's choice of RL fine-tuning as the most effective strategy for utilizing the delta action model, overcoming the myopic and OOD limitations of RL-free methods for multi-step dynamics adaptation.
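The two RL-free solvers referenced above can be sketched per state as follows (PyTorch). Here `pi_s` is assumed to be the nominal action $\pi(s)$ as a tensor and `delta_model` a differentiable module approximating $\pi^\Delta$; the iteration count and learning rate are arbitrary choices, not values from the paper.

```python
import torch

@torch.no_grad()
def fixed_point_action(pi_s, delta_model, state, n_iters=10):
    """Fixed-point iteration on a = pi(s) - pi_delta(s, a)."""
    a = pi_s.clone()
    for _ in range(n_iters):
        a = pi_s - delta_model(state, a)
    return a

def gradient_search_action(pi_s, delta_model, state, steps=100, lr=1e-2):
    """Gradient descent on || a + pi_delta(s, a) - pi(s) ||^2."""
    pi_s = pi_s.detach()
    a = pi_s.clone().requires_grad_(True)
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((a + delta_model(state, a) - pi_s) ** 2).sum()
        loss.backward()
        opt.step()
    return a.detach()
```

ASAP itself avoids both solvers and instead fine-tunes the policy with PPO inside the aligned simulator, which matches dynamics over whole trajectories rather than single steps.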
6.2.3. Q6: Does ASAP Fine-Tuning Outperform Random Action Noise Fine-Tuning?
This section addresses Q6: Why and how does ASAP work? by comparing ASAP fine-tuning with injecting random action noise during fine-tuning, and by visualizing the learned delta action model's output.
- Comparison with Random Action Noise:
  - Random torque noise [7] is a common domain randomization technique. To test whether ASAP's delta action is more than just a robustness enhancer, policies are fine-tuned in IsaacGym with random action noise $a' = a + \alpha\,\epsilon$, where $\epsilon \sim \mathcal{U}(-1, 1)$ (uniform random noise) and $\alpha$ is the noise magnitude. These policies are then deployed in Genesis (a minimal sketch contrasting this with ASAP's learned correction appears at the end of this subsection).
  - Finding (Figure 12): Policies fine-tuned with random action noise at moderate noise levels do show improved global tracking error (MPJPE) compared to no fine-tuning. However, their performance (best MPJPE of 173) does not match the precision achieved by ASAP (MPJPE of 126). Beyond that range, performance degrades.
  - Implication: This suggests that while random noise can offer some robustness, it is less effective than ASAP's targeted delta action model for precise dynamics alignment.

The following figure (Figure 12 from the original paper) shows MPJPE versus noise level.

Figure 12 from the original paper: The image is a chart showing closed-loop MPJPE (mm) at different action noise levels. The MPJPE of the unfinetuned policy is 336.1, while the MPJPE after fine-tuning with ASAP is 126.9, indicating a significant improvement. The data points reflect how performance changes across noise levels.
- Visualization of Delta Action Model Output:
  - Finding (Figure 13): The average output magnitude of $\pi^\Delta$, learned from IsaacSim data, reveals non-uniform discrepancies across joints. For the G1 humanoid robot, lower-body motors (especially the ankle and knee joints) exhibit a significantly larger dynamics gap than upper-body joints. Furthermore, asymmetries between left and right body motors are evident.
  - Implication: This structured discrepancy cannot be effectively addressed by merely adding uniform random action noise. ASAP's delta action model learns specific, targeted corrections for each joint, which is why it achieves superior tracking precision compared to naive randomization strategies. It learns what and where to correct, rather than just forcing general robustness.

The following figure (Figure 13 from the original paper) visualizes the delta action model's output magnitude.

Figure 13 from the original paper: The image is a diagram illustrating the output magnitude of $\pi^\Delta$ from IsaacGym to IsaacSim. The size of the red dots indicates the magnitude for each joint, with numerical labels placed accordingly. The results suggest that the lower-body motors exhibit a larger discrepancy than the upper-body joints, particularly at the ankle pitch joints of the G1 humanoid robot.

Conclusion for Q6: The delta action model in ASAP not only enhances policy robustness but also enables effective adaptation to real-world dynamics by learning and applying structured, non-uniform corrections that directly address specific dynamics mismatches. This targeted approach significantly outperforms naive randomization strategies.
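To contrast the two fine-tuning signals discussed in this subsection, the sketch below places a uniform-noise perturbation next to a learned, state-conditioned delta correction. The noise formula mirrors the one assumed above, and `delta_model` is again a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_action(action, alpha=0.1):
    """Domain-randomization-style perturbation: per-joint uniform noise
    of magnitude alpha, independent of state and identically distributed
    across all joints."""
    return action + alpha * rng.uniform(-1.0, 1.0, size=action.shape)

def delta_corrected_action(action, state, delta_model):
    """ASAP-style correction: learned from data and conditioned on state
    and action, so it can be larger at the ankles/knees and asymmetric
    between left and right, unlike uniform noise."""
    return action + delta_model(state, action)
```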
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces ASAP (Aligning Simulation and Real-World Physics), a novel two-stage framework that effectively bridges the sim-to-real gap for learning agile humanoid whole-body skills. The framework first pre-trains motion tracking policies in simulation using retargeted human motion data. In the second stage, it collects real-world data from the pre-trained policies to train a delta (residual) action model that specifically learns to compensate for dynamics mismatch. This delta action model is then integrated into the simulator, enabling fine-tuning of the pre-trained policies in an aligned simulation environment.
Extensive evaluations across sim-to-sim (IsaacGym to IsaacSim, IsaacGym to Genesis) and sim-to-real (IsaacGym to Unitree G1) scenarios demonstrate ASAP's superior performance. It achieves significant reductions in motion tracking errors (up to 52.7% in sim-to-real tasks), improves agility and whole-body coordination, and enables highly dynamic motions previously difficult to achieve. The framework consistently outperforms SysID, Domain Randomization, and delta dynamics learning baselines. ASAP's success highlights the potential of delta action learning as a powerful paradigm for developing more expressive and agile humanoid robots in real-world applications. The authors also open-sourced a multi-simulator codebase to support future research.
7.2. Limitations & Future Work
The authors acknowledge several real-world limitations of ASAP:
- Hardware Constraints: Agile whole-body motions subject robots to significant stress, leading to motor overheating and hardware failures (e.g., two Unitree G1 robots were damaged). This limits the scale and diversity of real-world motion data that can be collected safely, which is crucial for training the delta action model.
- Dependence on Motion Capture Systems: The current pipeline requires a MoCap setup to record ground-truth real-world trajectories. This introduces practical deployment barriers in unstructured environments where MoCap systems are unavailable, restricting ASAP's widespread applicability.
- Data-Hungry Delta Action Training: While the authors reduced the delta action model to the 4-DoF ankle joints for sample efficiency in real-world experiments, training a full 23-DoF model remains impractical due to the substantial data demand (observed even in simulation for 23-DoF training).

Based on these limitations, the authors suggest future research directions:

- Developing damage-aware policies to mitigate hardware risks during aggressive motion execution.
- Exploring MoCap-free alignment techniques to eliminate the reliance on expensive and infrastructure-dependent motion capture systems.
- Investigating adaptation techniques for delta action models to achieve sample-efficient, few-shot alignment, reducing the amount of real-world data required.
7.3. Personal Insights & Critique
ASAP presents a highly compelling solution to the long-standing sim-to-real gap for agile humanoid control. My personal insights and critique are as follows:
- Innovation in Delta Action Learning: The core innovation of learning a delta (residual) action model is particularly elegant. Instead of trying to identify complex physical parameters (like SysID) or generalize over broad uncertainties (like DR), ASAP learns the specific corrections needed in the action space to reconcile simulated and real-world dynamics. This is a powerful form of implicit dynamics compensation that can account for both known parameter mismatches and unmodeled complexities. The comparison against DeltaDynamics (which learns residual state dynamics) clearly shows the advantage of operating in the action space for policy fine-tuning.
- Robust Two-Stage Framework: The two-stage approach (pre-train in an idealized simulation, then fine-tune in an aligned simulator) is very practical. It leverages the benefits of fast, safe simulation for initial skill acquisition and then efficiently addresses sim-to-real transfer with targeted real-world data collection. The asymmetric actor-critic and curriculum learning strategies used in pre-training further enhance robustness.
- Strong Empirical Validation: The extensive experiments, covering sim-to-sim and sim-to-real scenarios with clear quantitative metrics and visual comparisons, provide strong evidence for ASAP's effectiveness. The ablation studies on dataset size, training horizon, and action norm weight are thorough and offer valuable guidance for future implementations. The direct comparison with random action noise further solidifies the argument that ASAP learns structured, meaningful corrections.
- Applicability to Other Domains: The delta action learning principle is highly transferable. It could be applied to other robotic platforms (e.g., quadrupeds, manipulators) or other complex physical systems where sim-to-real transfer is a challenge. The idea of learning a lightweight residual model to align discrepancies, rather than a full system model, is broadly applicable in model-based control and robot learning.
- Critique on Data Dependency and MoCap: Despite the impressive results, the reliance on MoCap systems and the data-hungry nature of training the delta action model for full-DoF control remain significant practical hurdles. While the authors successfully trained a 4-DoF ankle model with 100 clips, scaling to 23 DoF for a humanoid would require substantially more data, which is challenging and expensive to collect safely in the real world. This limits ASAP's immediate deployment in MoCap-free or resource-constrained environments. Future work on sample efficiency (e.g., few-shot learning for the delta action model) and MoCap-free state estimation would be crucial for broader adoption.
- Computational Cost: The paper does not deeply examine the computational cost (training time, inference time) of training the delta action model and fine-tuning the policies, especially for a full 23-DoF robot. While IsaacGym is fast, the overall RL training loops can still be substantial.
- Interpretability of Delta Actions: While effective, the delta action model is a neural network and thus largely a black box. Understanding why specific delta actions are applied could provide deeper insights into the underlying physics mismatch beyond identifying which joints have larger discrepancies; this could, in turn, inform better physical modeling or robot design.

In summary, ASAP represents a significant step forward in enabling agile humanoid control by providing a robust and empirically validated framework for sim-to-real transfer. Its core methodology of delta action learning is both elegant and powerful, paving the way for more capable and versatile humanoid robots.