ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills
TL;DR Summary
The ASAP framework addresses the dynamics mismatch for humanoid robots with a two-stage approach: motion tracking policies are pre-trained in simulation, and real-world data is then used to learn a delta action model that aligns the simulator so the policies can be fine-tuned to achieve agile whole-body skills.
Abstract
Humanoid robots hold the potential for unparalleled versatility in performing human-like, whole-body skills. However, achieving agile and coordinated whole-body motions remains a significant challenge due to the dynamics mismatch between simulation and the real world. Existing approaches, such as system identification (SysID) and domain randomization (DR) methods, often rely on labor-intensive parameter tuning or result in overly conservative policies that sacrifice agility. In this paper, we present ASAP (Aligning Simulation and Real-World Physics), a two-stage framework designed to tackle the dynamics mismatch and enable agile humanoid whole-body skills. In the first stage, we pre-train motion tracking policies in simulation using retargeted human motion data. In the second stage, we deploy the policies in the real world and collect real-world data to train a delta (residual) action model that compensates for the dynamics mismatch. Then, ASAP fine-tunes pre-trained policies with the delta action model integrated into the simulator to align effectively with real-world dynamics. We evaluate ASAP across three transfer scenarios: IsaacGym to IsaacSim, IsaacGym to Genesis, and IsaacGym to the real-world Unitree G1 humanoid robot. Our approach significantly improves agility and whole-body coordination across various dynamic motions, reducing tracking error compared to SysID, DR, and delta dynamics learning baselines. ASAP enables highly agile motions that were previously difficult to achieve, demonstrating the potential of delta action learning in bridging simulation and real-world dynamics. These results suggest a promising sim-to-real direction for developing more expressive and agile humanoids.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills
1.2. Authors
The paper is authored by a large team, with Tairan He, Jiawei Gao, and Wenli Xiao noted as equal contributors. The affiliations include Carnegie Mellon University (CMU) and NVIDIA, indicating a collaboration between a leading academic institution and a major technology company known for its work in simulation, AI, and robotics.
1.3. Journal/Conference
The paper is a preprint hosted on arXiv (v3); the version analyzed here was posted on 2025-02-03 (UTC). As a preprint, it has not yet undergone formal peer review and publication in a specific journal or conference proceedings, but its content suggests it is intended for a top-tier robotics or AI conference/journal.
1.4. Publication Year
2025
1.5. Abstract
Humanoid robots show great promise for performing complex, human-like whole-body skills. However, achieving agile and coordinated movements is hindered by the dynamics mismatch between simulated environments (where policies are typically trained) and the real world. Current sim-to-real transfer methods, such as System Identification (SysID) and Domain Randomization (DR), either demand extensive manual tuning or produce overly conservative policies that lack agility.
This paper introduces ASAP (Aligning Simulation and Real-World Physics), a two-stage framework designed to overcome this dynamics mismatch and enable agile humanoid whole-body skills. In the first stage, motion tracking policies are pre-trained in simulation (IsaacGym) using retargeted human motion data. In the second stage, these policies are deployed in the real world to collect data, which is then used to train a delta (residual) action model. This model learns to compensate for the observed dynamics mismatch. Subsequently, ASAP fine-tunes the pre-trained policies by integrating this delta action model into the simulator, effectively aligning the simulated dynamics with real-world physics.
The ASAP framework is evaluated across three transfer scenarios: IsaacGym to IsaacSim, IsaacGym to Genesis (both sim-to-sim), and IsaacGym to the real-world Unitree G1 humanoid robot. The results demonstrate that ASAP significantly enhances agility and whole-body coordination for various dynamic motions, achieving reduced tracking errors compared to SysID, DR, and delta dynamics learning baselines. ASAP successfully enables highly agile motions previously challenging to achieve, highlighting the potential of delta action learning in bridging the sim-to-real gap. These findings point towards a promising direction for developing more expressive and agile humanoid robots.
1.6. Original Source Link
https://arxiv.org/abs/2502.01143v3 The paper is currently a preprint on arXiv.
1.7. PDF Link
https://arxiv.org/pdf/2502.01143v3.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the significant challenge of enabling agile and coordinated whole-body skills in humanoid robots, primarily due to the dynamics mismatch between physics simulators and the real world. Humanoid robots hold immense potential for versatile, human-like tasks, but this sim-to-real gap prevents policies trained in simulation from performing effectively on physical hardware.
Current solutions to bridge this gap suffer from notable limitations:
- System Identification (SysID) methods: These approaches attempt to estimate and calibrate physical parameters of the robot (e.g., motor characteristics, link masses) or the environment. However, they often require a pre-defined parameter space, may not capture the entire sim-to-real gap if real-world dynamics fall outside the modeled distribution, and frequently rely on ground truth torque measurements, which are often unavailable on commercial robot platforms, limiting their practical applicability.
- Domain Randomization (DR) methods: These techniques train policies in a simulator where physical parameters are randomized within a certain range, forcing the policy to be robust to variations. While effective for some tasks, DR can lead to overly conservative policies that prioritize stability over agility, thus hindering the execution of highly dynamic and expressive skills.
- Learned dynamics methods: While successful in simpler, low-dimensional systems like drones or ground vehicles, their effectiveness for the complex, high-dimensional dynamics of humanoid robots remains largely unexplored.

The paper's entry point is to leverage the idea of learning a dynamics model using real-world data, but specifically by focusing on learning a residual correction to actions, rather than modeling the entire complex dynamics explicitly. This delta action learning is an innovative approach to directly compensate for the dynamics mismatch, aiming to achieve both robustness and agility.
2.2. Main Contributions / Findings
The paper presents ASAP, a two-stage framework, and its primary contributions are:
- A Novel Framework for Sim-to-Real Transfer: ASAP introduces a two-stage framework that effectively bridges the sim-to-real gap. It leverages a delta action model trained using reinforcement learning (RL) with real-world data to directly compensate for dynamics mismatch. This delta action model enables policies trained in simulation to adapt seamlessly to real-world physics, allowing for the execution of agile whole-body humanoid skills.
- Achievement of Previously Difficult Humanoid Motions: The framework successfully deploys RL-based whole-body control policies on real humanoid robots (Unitree G1), achieving highly dynamic and agile motions (e.g., agile jumps, kicks) that were previously challenging or impossible to perform with existing sim-to-real transfer techniques.
- Extensive Validation and Superior Performance: Through comprehensive experiments in both sim-to-sim (IsaacGym to IsaacSim, IsaacGym to Genesis) and sim-to-real (IsaacGym to Unitree G1) scenarios, ASAP demonstrates its efficacy. It significantly reduces motion tracking errors (up to 52.7% in sim-to-real tasks) and consistently outperforms SysID, DR, and delta dynamics learning baselines, showcasing improved agility and coordination.
- Open-Sourced Multi-Simulator Codebase: To foster further research and facilitate development in this area, the authors have developed and open-sourced a multi-simulator training and evaluation codebase. This resource aims to accelerate future advancements in sim-to-real transfer for humanoid robots.

The key conclusion is that delta action learning is a promising direction for creating more expressive and agile humanoids by effectively bridging the gap between simulation and real-world dynamics.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the ASAP framework, a grasp of several fundamental concepts in robotics, simulation, and machine learning is essential:
- Humanoid Robots: These are robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. Their potential lies in their ability to interact with human-designed environments and perform human-like tasks, making them versatile for a wide range of applications.
- Whole-Body Skills: This refers to complex movements that involve the coordinated action of multiple degrees of freedom (DoFs) across the entire robot's body, including the torso, arms, and legs. Examples include jumping, kicking, dancing, and agile locomotion, requiring intricate balance and coordination.
- Simulation-to-Real (Sim-to-Real) Transfer: This is a crucial area in robotics where policies (control strategies) are trained in a simulated environment and then deployed on physical robots. The goal is to leverage the benefits of simulation (speed, safety, data generation) while ensuring the learned skills translate effectively to the real world.
- Dynamics Mismatch (Sim-to-Real Gap): The primary challenge in sim-to-real transfer. It refers to the discrepancies between the physics models used in simulators and the actual physical properties and behaviors of real-world robots and environments. This mismatch can arise from:
  - Inaccurate System Parameters: Imperfect knowledge of robot mass, inertia, friction coefficients, and motor characteristics (e.g., torque limits, delays).
  - Unmodeled Dynamics: Complex phenomena not captured by simplified physics engines, such as uncalibrated sensors, mechanical compliance, backlash, or unknown environmental factors.
  - Latency and Noise: Differences in control loop timings, sensor noise, and communication delays between simulation and reality.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent performs actions, receives states (observations) and rewards, and learns a policy (a mapping from states to actions) that maximizes cumulative reward over time.
  - Policy (π): The agent's strategy for choosing actions given states.
  - State (s_t): The current observation of the environment.
  - Action (a_t): An output from the agent that influences the environment.
  - Reward (r_t): A scalar feedback signal from the environment indicating the desirability of an action.
- Proximal Policy Optimization (PPO): A popular on-policy RL algorithm that aims to find a policy that performs well. PPO is known for its stability and sample efficiency, making it suitable for complex control tasks. It works by making small updates to the policy to avoid large, destructive changes, typically by clipping the policy ratio during optimization.
- PD Controller (Proportional-Derivative Controller): A widely used feedback control loop mechanism in robotics. It calculates an output torque or force based on the difference between a desired target position (or velocity) and the current measured position (or velocity) of a joint.
  - Proportional (P) term: Responds to the current error, aiming to reduce it.
  - Derivative (D) term: Responds to the rate of change of the error, helping to damp oscillations and improve stability.
  - In the context of this paper, actions are target joint positions which are then fed into a PD controller to generate torques to actuate the robot (see the sketch after this list).
- Motion Capture (MoCap) Systems: Technology used to precisely record the movement of objects or people. In robotics, MoCap systems (e.g., Vicon, OptiTrack) are often used to get ground truth positional and orientational data of a robot in the real world, which is critical for evaluating tracking performance or generating reference trajectories.
- SMPL (Skinned Multi-Person Linear Model): A widely used statistical model for representing human body shape and pose in 3D. It provides a parametric representation that can generate a wide range of realistic human body shapes and motions from a compact set of parameters (shape parameters β and pose parameters θ).
- Physics Simulators: Software environments that model physical interactions and dynamics. They allow researchers to train and test robot control policies safely and efficiently. The paper uses:
  - IsaacGym: A high-performance, GPU-accelerated physics simulation environment for robot learning, developed by NVIDIA. It is known for its ability to run thousands of parallel simulations.
  - IsaacSim: Another NVIDIA simulator built on the Omniverse platform, offering high-fidelity rendering and simulation capabilities. It is often used for tasks requiring more visual realism or complex scene interactions than IsaacGym.
  - Genesis: A universal and generative physics engine, also used as a testing environment in this paper, representing another sim-to-sim transfer scenario.
- Unitree G1 Humanoid Robot: A specific humanoid robot platform used for real-world evaluations in this paper, known for its agile capabilities.
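Since the paper's action space consists of PD targets rather than torques, it may help to see the control law concretely. The following is a minimal sketch of a joint-space PD computation in Python; the gain values and the 23-DoF dimensions are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def pd_torques(q_target, q, qd, kp, kd):
    """Joint-space PD law: tau = kp * (q_target - q) - kd * qd.

    q_target : desired joint positions output by the policy (the paper's action a_t)
    q, qd    : measured joint positions and velocities
    kp, kd   : proportional and derivative gains (per-joint arrays)
    """
    return kp * (q_target - q) - kd * qd

# Hypothetical 23-DoF example with made-up gains, just to show the shapes involved.
n_dof = 23
kp = np.full(n_dof, 100.0)   # assumed gains, not the paper's values
kd = np.full(n_dof, 2.0)
q = np.zeros(n_dof)
qd = np.zeros(n_dof)
q_target = 0.1 * np.ones(n_dof)
tau = pd_torques(q_target, q, qd, kp, kd)
```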
3.2. Previous Works
The paper discusses several categories of prior approaches to addressing the dynamics mismatch:

- System Identification (SysID) Methods:
  - Concept: SysID methods aim to build mathematical models of a system's dynamics from observed input-output data. In robotics, this means estimating physical parameters (e.g., mass, inertia, friction, motor constants) of the robot and its environment to make the simulator more accurate.
  - Examples: [102, 19] are cited as examples. Historically, SysID has been used for calibrating robot models [39, 5], such as estimating inertial parameters of links [2] or actuator dynamics [85, 29].
  - Limitations (as highlighted by the paper):
    - They require a pre-defined parameter space [49], meaning they can only adjust parameters that are explicitly modeled. If the sim-to-real gap stems from unmodeled dynamics, SysID will fail to capture it.
    - They often rely on ground truth torque measurements [29], which are frequently unavailable on many commercial hardware platforms, limiting their practical use.
    - They can struggle with long-horizon scenarios due to cumulative error buildup, as observed in the paper's experiments (Table III).
- Domain Randomization (DR) Methods:
  - Concept: Instead of precisely identifying parameters, DR approaches train RL policies in a simulator where physical properties (mass, friction, motor strength, latency, visual textures) are randomly varied within a specified range during training. The goal is to force the policy to become robust to these variations, thus generalizing better to the real world, whose parameters are assumed to fall within the randomized distribution [87, 68].
  - Examples: [85, 79, 59] are cited for DR in robotics.
  - Limitations (as highlighted by the paper):
    - DR can lead to overly conservative policies [25]. By optimizing for performance across a wide range of randomized parameters, the policy may avoid aggressive actions that could fail under certain parameter extremes, sacrificing agility for robustness. This makes it difficult to achieve highly agile skills.
- Learned Dynamics Methods (Dynamics Modeling):
  - Concept: These methods learn a predictive model of the environment's dynamics directly from real-world data. The learned model can then be used either to improve the simulator or to directly inform model-predictive control (MPC) or policy optimization.
  - Examples: Demonstrated success in low-dimensional systems like drones [81] and ground vehicles [97].
  - Limitations (as highlighted by the paper): Their effectiveness for the high-dimensional and complex dynamics of humanoid robots remains largely unexplored and challenging, due to the sheer complexity of learning a full, accurate dynamics model for a humanoid.
- Residual Learning for Robotics:
  - Concept: This is a broader category where a residual component (a small correction) is learned to augment or refine an existing base model or controller. This residual can correct inaccuracies in a dynamics model or modify the actions of a base policy.
  - Examples:
    - Residual policy models: Refine actions of an initial controller [84, 34, 8, 1, 12, 20, 3, 33, 42].
    - Correcting dynamics models: [66, 35, 38, 82, 23]. RGAT [35] uses a residual action model with a learned forward dynamics model to refine the simulator.
  - Connection to ASAP: ASAP builds on this idea by using RL-based residual actions to align the dynamics mismatch specifically between simulation and real-world physics.
3.3. Technological Evolution
The field of sim-to-real transfer for robot control has evolved as follows:

- Direct SysID of parameters: Early attempts focused on manually or automatically calibrating known physical parameters to make simulators more accurate. This was limited by the completeness of the physical model.
- Robustness through Domain Randomization: Recognizing the difficulty of perfect SysID, DR emerged as a way to make policies robust to parameter uncertainties. While effective, it often led to conservative behaviors.
- Learning Dynamics Models: More recently, data-driven approaches sought to learn dynamics models directly from real-world data. This showed promise but scaled poorly to complex systems like humanoids.
- Residual Learning: This paradigm, often combined with RL, aims to learn only the "difference" or "correction" needed, rather than the entire system from scratch. This makes it more efficient and robust than learning full dynamics models, and more adaptive than SysID or DR alone.

ASAP fits into this evolution by advancing the residual learning paradigm. Instead of learning a full residual dynamics model (which predicts state differences), ASAP learns a residual action model (delta action model) that directly modifies the actions applied to the simulator. This allows the simulator to better reflect real-world outcomes and subsequently enables policies to be fine-tuned in this "aligned" simulator. This approach offers a powerful way to bridge the sim-to-real gap for complex, agile humanoid motions, pushing beyond the limitations of previous methods.
3.4. Differentiation Analysis
Compared to the main methods in related work, ASAP presents several core differences and innovations:

- Differentiation from SysID:
  - ASAP: Learns a delta action model (π^Δ) as a residual correction term for actions. This model implicitly compensates for all dynamics mismatch (both modeled parameter inaccuracies and unmodeled dynamics) by adjusting the actions to match real-world outcomes. It does not explicitly estimate physical parameters.
  - SysID: Explicitly estimates and tunes specific physical parameters (e.g., CoM shift, mass, PD gains). It is limited by the chosen parameter space and cannot account for unmodeled dynamics.
  - Innovation: ASAP offers a more holistic and flexible approach to dynamics compensation that does not require prior knowledge of which parameters are mismatched or their precise values.
- Differentiation from Domain Randomization (DR):
  - ASAP: Collects specific real-world data and uses it to learn a targeted delta action model that aligns the simulator with the real world. Policies are then fine-tuned in this aligned simulator.
  - DR: Randomizes parameters during initial policy training, making the policy robust to a distribution of dynamics. It does not explicitly learn from real-world discrepancies to correct the simulator.
  - Innovation: ASAP provides a more adaptive and less conservative sim-to-real transfer. DR policies can become overly conservative in order to perform well across the randomized distribution, sacrificing agility. ASAP aligns the simulator to the specific real-world dynamics, allowing fine-tuned policies to be both robust and agile.
- Differentiation from Delta Dynamics Learning (a baseline in this paper):
  - ASAP: Learns a delta action model (π^Δ) that outputs corrective actions (Δa_t) which are added to the policy's actions (a_t) before they are applied to the simulator; the simulator then computes the next state from the corrected action.
  - Delta Dynamics Learning (baseline): Learns a residual dynamics model (f^Δ) that predicts the difference in state transitions. This predicted difference is then added to the simulator's output state (as detailed in Appendix C).
  - Innovation: ASAP's delta action model directly influences the input to the simulator's physics step, effectively making the simulator behave more like the real world from the perspective of action execution. The DeltaDynamics baseline, by contrast, corrects the simulator's output state, which can be more prone to cumulative errors and less intuitive for policy fine-tuning, since the actions themselves are not adjusted in the dynamics computation. The paper's results show ASAP significantly outperforming DeltaDynamics in both open-loop and closed-loop performance, suggesting that correcting the input action space is more effective for policy learning than correcting the output state space.
- Integration of RL for Residual Learning: ASAP explicitly uses reinforcement learning (PPO) to train its delta action model. This RL formulation allows the model to learn a complex, non-linear mapping for corrections based on rewards that minimize state discrepancies, which can be more powerful than simpler regression-based approaches for learning residuals.

In essence, ASAP combines the strengths of data-driven learning with the concept of residual modeling in the action space, providing a refined approach that is less restrictive than SysID, more adaptive than DR, and empirically more effective than learning residual dynamics for agile humanoid control.
4. Methodology
The ASAP framework is a two-stage process designed to align simulation and real-world physics for learning agile humanoid whole-body skills.
4.1. Principles
The core idea behind ASAP is to address the sim-to-real gap not by exhaustively modeling real-world physics or randomizing all possible parameters, but by learning a specific residual correction to the actions generated by a policy. This residual correction, called a delta action, is learned from real-world data to make the simulator behave more like the real world. Once this delta action model is integrated into the simulator, the main motion tracking policy can be fine-tuned in this "aligned" simulator, allowing it to adapt to real-world dynamics while retaining agility.
The theoretical basis draws from reinforcement learning for policy optimization and residual learning for dynamics compensation. The intuition is that it's easier to learn the difference between simulated and real-world dynamics (the delta) than to model the entire real-world dynamics from scratch. By learning this delta in the action space, ASAP effectively perturbs the simulator's inputs to match real-world outcomes, thereby creating a more accurate training environment for fine-tuning policies.
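In symbols (using the notation introduced later in Section 4.2.2, where π^Δ denotes the delta action model and f^sim, f^real the simulated and real transition functions), this principle can be summarized as choosing π^Δ so that

$$ f^{\mathrm{sim}}\bigl(s_t,\; a_t + \pi^{\Delta}(s_t, a_t)\bigr) \;\approx\; f^{\mathrm{real}}(s_t, a_t), $$

after which the tracking policy is fine-tuned against the left-hand side (the "ASAP dynamics") instead of the raw simulator. This is a paraphrase of the framework's principle rather than an equation quoted verbatim from the paper.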
4.2. Core Methodology In-depth (Layer by Layer)
The ASAP framework consists of two main stages: Pre-training and Post-training.
4.2.1. Stage 1: Pre-training Agile Humanoid Skills in Simulation
The first stage focuses on training base motion tracking policies entirely within a physics simulator (IsaacGym). This involves generating high-quality reference motions and then training an RL policy to imitate them.
4.2.1.1. Data Generation: Retargeting Human Video Data
To create expressive and agile imitation goals for motion-tracking policies, ASAP leverages human motion videos.

- a) Transforming Human Video to SMPL Motions: The process begins by recording videos of humans performing diverse and agile motions. These videos are then processed using TRAM [93] (Trajectory and Motion of 3D Humans from In-the-Wild Videos). TRAM reconstructs 3D human motions from videos, estimating the global trajectory of the human in the SMPL parameter format [52]. The SMPL model is a statistical 3D body model that represents human pose and shape. Its parameters include:
  - Global root translation: The 3D position of the root joint (pelvis).
  - Global root orientation: The orientation of the root joint.
  - Body poses (part of the pose parameters θ): Rotations for each joint in the SMPL model.
  - Shape parameters (β): Coefficients that define the individual's body shape (e.g., height, weight).
  The reconstructed motions form the raw human motion dataset.
- b) Simulation-based Data Cleaning: Since 3D motion reconstruction from video can introduce noise and physically infeasible movements, a "sim-to-data" cleaning procedure is applied. This involves using MaskedMimic [86], a physics-based motion tracker, to imitate the SMPL motions from TRAM within the IsaacGym simulator. Motions that can be successfully tracked and validated in the simulator are considered physically feasible and are saved as the cleaned dataset. This step ensures that the reference motions are robust and suitable for robot control.
- c) Retargeting SMPL Motions to Robot Motions: The cleaned SMPL motions are then retargeted to the specific humanoid robot (Unitree G1). This is done using a shape-and-motion two-stage retargeting process [25].
  - Shape Optimization: The SMPL shape parameter β is optimized to approximate the target humanoid robot's shape. This involves selecting 12 body links with correspondences between humans and humanoids and performing gradient descent on β to minimize joint distances in a rest pose.
  - Motion Optimization: Using the optimized shape along with the original translation and pose from TRAM, further gradient descent is applied to minimize the distances of the body links. This ensures that the retargeted motions are kinematically and dynamically consistent with the robot's structure. The final output is the retargeted robot motion dataset.

The entire data generation and retargeting process is summarized in Figure 3.

Figure 3 from the original paper: The image is a diagram illustrating the motion conversion process from human actions to G1 robot movements. The left side shows human motion extracted from video, followed by the generation of SMPL motion through TRAM and reinforcement learning, ultimately transforming into G1 robot actions, highlighting the mapping between motion capture and real-world robotic applications.
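To make the shape-optimization step of the retargeting pipeline concrete, here is a minimal gradient-descent sketch in PyTorch. The helpers `smpl_link_positions` and `robot_link_positions` are hypothetical stand-ins for an SMPL layer and the robot's rest-pose forward kinematics; the optimizer choice, learning rate, and step count are arbitrary, not the paper's settings.

```python
import torch

def fit_smpl_shape(smpl_link_positions, robot_link_positions, n_betas=10,
                   steps=500, lr=1e-2):
    """Fit SMPL shape parameters beta so that the rest-pose positions of the
    corresponding body links (12 in the paper) approach the humanoid's links.

    smpl_link_positions(beta) -> (n_links, 3) tensor, differentiable w.r.t. beta
    robot_link_positions()    -> (n_links, 3) tensor, fixed target
    """
    beta = torch.zeros(n_betas, requires_grad=True)
    target = robot_link_positions()
    optimizer = torch.optim.Adam([beta], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        pred = smpl_link_positions(beta)
        # Mean squared distance between corresponding link positions.
        loss = ((pred - target) ** 2).sum(dim=-1).mean()
        loss.backward()
        optimizer.step()
    return beta.detach()
```

The motion-optimization stage follows the same pattern, but optimizes per-frame translations and poses with the fitted β held fixed.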
4.2.1.2. Phase-based Motion Tracking Policy Training
The motion-tracking problem is formulated as a goal-conditioned reinforcement learning (RL) task. A policy is trained in IsaacGym to track the retargeted robot motion trajectories from the cleaned, retargeted dataset.

- State Definition: The state for the RL agent includes the robot's proprioception and a time phase variable φ_t.
  - The time phase variable φ_t ∈ [0, 1] serves as the goal state for single-motion tracking [67], indicating the progress through a motion (φ_t = 0 is the start, φ_t = 1 is the end).
  - The proprioception is defined as a history of robot internal states, consisting of:
    - A history of joint positions (for the 23 DoFs) from t-4 to t.
    - A history of joint velocities from t-4 to t.
    - A history of root angular velocity.
    - A history of root projected gravity (the gravity vector expressed in the root frame).
    - A history of last actions (target joint positions) from t-5 to t-1.
- Reward Function: The reward r_t is a weighted sum of penalty, regularization, and task terms (detailed in Table I under d) below), which is optimized by the policy.
- Action Space: The action a_t represents the target joint positions for the robot's 23 degrees of freedom. These target positions are sent to a PD controller which then actuates the robot's joints.
- Policy Optimization: The Proximal Policy Optimization (PPO) algorithm [80] is used to optimize the policy, aiming to maximize the expected cumulative discounted reward, where γ is the discount factor.

Several design choices are crucial for stable policy training (the termination curriculum and RSI are also sketched in code below):

- a) Asymmetric Actor-Critic Training:
  - Concept: In RL, an actor-critic architecture involves two neural networks: an actor that determines actions and a critic that estimates the value of states. Asymmetric training means the critic has access to more information (privileged information) during training than the actor, which must only use information available during real-world deployment.
  - Implementation: The critic network is given privileged information, such as the global positions of the reference motion and the root linear velocity. In contrast, the actor network relies solely on proprioceptive inputs and the time-phase variable φ_t.
  - Benefit: This design facilitates policy training in simulation by providing the critic with a richer understanding of the task, while ensuring the actor remains deployable in the real world without requiring complex external sensing (like odometry for global position, which is a common challenge for humanoid robots [25, 24]).
- b) Termination Curriculum of Tracking Tolerance:
  - Concept: A curriculum learning strategy where the difficulty of the task (specifically, the tolerance for deviation from the reference motion) is gradually increased during training.
  - Implementation: Initially, a generous termination threshold is set; if the robot deviates from the reference motion by more than this threshold, the episode terminates. As training progresses, the threshold is progressively tightened.
  - Benefit: This allows the policy to first learn basic stability and balancing skills before being challenged with stricter motion tracking requirements, preventing early failures and enabling the learning of high-dynamic behaviors like jumping.
- c) Reference State Initialization (RSI):
  - Concept: A method for initializing RL episodes to improve training efficiency and stability, especially for complex, sequential tasks.
  - Implementation: Instead of always starting an episode at the beginning of the reference motion (φ = 0), RSI [67] randomly samples time-phase variables between 0 and 1. The robot's state (including root position and orientation, root linear and angular velocities, joint positions and velocities) is then initialized based on the corresponding reference motion at that sampled phase.
  - Benefit: This allows the policy to learn different segments or phases of a motion in parallel, rather than being constrained to a strictly sequential learning process. For example, in a backflip, it allows the policy to practice landing (a later phase) before mastering the takeoff (an earlier phase).
- d) Reward Terms: The reward function is a sum of three categories of terms: penalty, regularization, and task rewards. The detailed reward terms and their weights are provided in Table I.

| Category | Term | Weight | Term | Weight |
| --- | --- | --- | --- | --- |
| Penalty | DoF position limits | −10.0 | DoF velocity limits | −5.0 |
| | Torque limits | −5.0 | Termination | −200.0 |
| Regularization | Torques | −1 × 10⁻⁶ | Action rate | −0.5 |
| | Feet orientation | −2.0 | Feet heading | −0.1 |
| | Slippage | −1.0 | | |
| Task Reward | Body position | 1.0 | VR 3-point | 1.6 |
| | Body position (feet) | 2.1 | Body rotation | 0.5 |
| | Body angular velocity | 0.5 | Body velocity | 0.5 |
| | DoF position | 0.75 | DoF velocity | 0.5 |

The above are the results from Table I of the original paper.

- e) Domain Randomizations: To further enhance the robustness and generalization of the pre-trained policy, basic domain randomization techniques are applied during training in IsaacGym. These techniques involve randomly varying certain simulation parameters (e.g., friction, PD gains, control delay, external perturbations) to ensure the policy is less sensitive to exact parameter values. The specific domain randomization parameters are listed in Table VI in the Appendix.
The overall context of this pre-training stage is illustrated in Figure 2(a) within the full ASAP framework diagram.
Figure 2 from the original paper: The image is a diagram illustrating the two stages of the ASAP framework: motion tracking pre-training and real trajectory collection, as well as the model training and fine-tuning process. It begins with pose estimation and imitation goal generation from a human video dataset, followed by delta action model training in a simulator and real-world deployment to compensate for dynamics mismatch.
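The termination curriculum and Reference State Initialization described above can be summarized in a few lines. This is a schematic Python sketch; the start/end tolerances, the annealing horizon, and the `ref_motion.sample` interface are illustrative assumptions, since the paper's exact values and data structures are not reproduced here.

```python
import numpy as np

# Illustrative values only; the paper's actual start/end tolerances are not reproduced here.
TOL_START, TOL_END = 1.0, 0.3      # meters, assumed
CURRICULUM_UPDATES = 10_000        # policy updates over which to anneal, assumed

def termination_tolerance(update_idx):
    """Linearly tighten the allowed tracking error as training progresses."""
    frac = min(update_idx / CURRICULUM_UPDATES, 1.0)
    return TOL_START + frac * (TOL_END - TOL_START)

def reference_state_init(ref_motion, rng=np.random):
    """Reference State Initialization: start an episode at a random phase of the
    reference motion instead of always at phase 0.

    `ref_motion` is a hypothetical object exposing `sample(phase)`, returning the
    root pose/velocity and joint positions/velocities at that phase.
    """
    phase = rng.uniform(0.0, 1.0)
    return phase, ref_motion.sample(phase)
```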
4.2.2. Stage 2: Post-training (Training Delta Action Model and Fine-tuning Motion Tracking Policy)
The second stage of ASAP addresses the sim-to-real gap directly by leveraging real-world data collected from the pre-trained policy. This data is used to learn a delta action model that compensates for dynamics mismatch, which is then used to fine-tune the policy.
4.2.2.1. Data Collection

- The pre-trained policy (from Stage 1) is deployed on the real-world Unitree G1 humanoid robot to perform whole-body motion tracking tasks.
- During these real-world rollouts, trajectories of states and actions are recorded.
- At each timestep t, state information is captured using a motion capture device and onboard sensors. The recorded state includes:
  - Base position (from MoCap).
  - Base linear velocity (from MoCap).
  - Base orientation (quaternion, from MoCap).
  - Base angular velocity (from MoCap).
  - Joint positions (from the robot's proprioceptive sensors).
  - Joint velocities (from the robot's proprioceptive sensors).
- The recorded actions are the target joint positions commanded by the pre-trained policy in the real world.
- This data collection process is conceptually shown in Figure 2(a), and the deployment scenario is also visually represented in Figure 9. A minimal data structure for such a recorded trajectory is sketched after Figure 9.
Figure 9 from the original paper: The image is a scene showcasing the Unitree G1 robot performing a forward jump motion, challenging itself to leap over 1 meter. It captures the robot in various poses during the movement, demonstrating its capabilities in both simulated and real-world environments.
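A minimal container for the recorded real-world rollouts might look like the following; the field names and array shapes are illustrative choices that mirror the state components listed above (MoCap base pose/velocity plus onboard joint readings), not the authors' actual logging format.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class RealStep:
    """One timestep of a real-world rollout (field names are illustrative)."""
    base_pos: np.ndarray        # (3,)  MoCap base position
    base_lin_vel: np.ndarray    # (3,)  MoCap base linear velocity
    base_quat: np.ndarray       # (4,)  MoCap base orientation (quaternion)
    base_ang_vel: np.ndarray    # (3,)  MoCap base angular velocity
    joint_pos: np.ndarray       # (23,) onboard joint encoder readings
    joint_vel: np.ndarray       # (23,) onboard joint velocities
    action: np.ndarray          # (23,) PD target joint positions sent by the policy

@dataclass
class RealTrajectory:
    steps: List[RealStep] = field(default_factory=list)

    def append(self, step: RealStep) -> None:
        self.steps.append(step)
```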
4.2.2.2. Training Delta Action Model
The key to ASAP is learning a delta action model (π^Δ) to compensate for the sim-to-real physics gap. This model is trained by observing the discrepancies that arise when real-world actions are replayed in the simulator.

- Concept: When real-world actions are applied in the simulator (f^sim), the resulting simulated trajectory deviates from the real-world recorded trajectory. This deviation is a signal of the dynamics mismatch. The delta action model learns to output corrective actions (Δa_t) that, when added to the real-world actions, make the simulated state match the real-world state.
- Model Definition: The delta action model is a policy that outputs a corrective action conditioned on the current state and the nominal action, Δa_t = π^Δ(s_t, a_t), where:
  - s_t: the current state of the robot.
  - a_t: the action proposed by the main policy (or the real-world recorded action during training).
  - Δa_t: the corrective (residual) action output by the delta action model.
- Modified Simulator Dynamics for Training: The RL environment for training incorporates this delta action model by modifying the simulator's dynamics so that the next state is computed from the corrected action, s_{t+1} = f^sim(s_t, a_t^real + π^Δ(s_t, a_t^real)), where:
  - f^sim: the simulator's dynamics function.
  - a_t^real: the reference action recorded from real-world rollouts.
  - π^Δ(s_t, a_t^real): the corrective action learned by the delta action model.
  This formulation indicates that the delta action model modifies the actions before they are processed by the simulator's physics engine.
- RL Training Steps (using PPO):
  - Initialization: At the beginning of each RL step, the robot in the simulator is initialized at the recorded real-world state.
  - Reward Computation: A reward signal is computed to minimize the discrepancy between the resulting simulated state and the recorded real-world state. An additional action magnitude regularization term (on the norm of Δa_t) is included to encourage minimal corrections; a simplified version is sketched after Figure 4. The reward terms for delta action learning are summarized in Table II.

| Category | Term | Weight | Term | Weight |
| --- | --- | --- | --- | --- |
| Penalty | DoF position limits | −10.0 | DoF velocity limits | −5.0 |
| | Torque limits | −0.1 | Termination | −200.0 |
| Regularization | Action rate | −0.01 | Action norm | −0.2 |
| Task Reward | Body position | 1.0 | VR 3-point | 1.0 |
| | Body position (feet) | 1.0 | Body rotation | 0.5 |
| | Body angular velocity | 0.5 | Body velocity | 0.5 |
| | DoF position | 0.5 | DoF velocity | 0.5 |

  The above are the results from Table II of the original paper.

  - Policy Optimization: PPO is used to train the delta action policy, which learns to output the corrective actions that effectively match the simulation to the real world.
- Benefit: This learning process allows the simulator to accurately reproduce real-world failures. For instance, if the real-world robot cannot jump because its motors are weaker than the simulator assumes, the delta action model will learn to reduce the intensity of lower-body actions, effectively simulating these motor limitations. This aligned simulator then enables more effective policy fine-tuning.

This process is depicted in Figure 2(b) and contrasted with other methods in Figure 4.
Figure 4 from the original paper: The image is an illustration that shows the different stages of the ASAP framework. On the left are the Vanilla and SysID methods, in the middle is Delta Dynamics, and on the right is Delta Action (ASAP). Each stage represents how information flows through the simulator using different strategies and state updates.
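The reward for delta-action learning combines state-matching and action-magnitude terms (Table II). A heavily simplified Python sketch of that idea is shown below; the single tracking term, the choice of norms, and the weights are assumptions that collapse the paper's full per-body reward into one expression.

```python
import numpy as np

def delta_action_reward(s_sim_next, s_real_next, delta_a,
                        w_track=1.0, w_reg=0.2):
    """Per-step reward for training the delta action model (simplified).

    s_sim_next  : state reached in the simulator after applying a_real + delta_a
    s_real_next : recorded real-world next state for the same step
    delta_a     : correction output by the delta action model

    The real reward in Table II is a weighted sum over several body/DoF tracking
    terms plus penalties; this sketch collapses it into one tracking term and one
    action-norm regularizer with made-up weights.
    """
    tracking = -w_track * np.linalg.norm(s_sim_next - s_real_next)
    regularization = -w_reg * np.linalg.norm(delta_a)
    return tracking + regularization
```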
4.2.2.3. Fine-tuning Motion Tracking Policy under New Dynamics
Once the delta action model has been successfully trained, it is integrated into the simulator to create an aligned simulation environment.

- Reconstructed Simulator Dynamics: The simulation environment is effectively reconstructed with ASAP dynamics f^ASAP such that s_{t+1} = f^ASAP(s_t, a_t) = f^sim(s_t, a_t + π^Δ(s_t, a_t)). Here, the delta action model acts as a continuous correction layer that modifies the policy's actions before they are applied to the underlying simulator; this composition amounts to a thin wrapper around the physics step.
- Policy Fine-tuning: The parameters of the delta action model are frozen. The pre-trained policy (from Stage 1) is then fine-tuned within this augmented simulation environment. The same reward function (Table I) used during pre-training is applied during this fine-tuning phase.
- Benefit: By training in this delta action-augmented simulator, the policy effectively adapts to the real-world physics without needing direct real-world interaction during fine-tuning. It learns to generate actions that, once modified by π^Δ within the simulator, lead to the desired real-world-like outcomes.

This fine-tuning process is depicted in Figure 2(c).
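The aligned "ASAP dynamics" used for fine-tuning can be viewed as a wrapper that applies the frozen delta action model before every physics step. The sketch below assumes generic `sim.step` and `delta_model` callables; it illustrates the composition, not the authors' implementation.

```python
class ASAPDynamics:
    """Simulator wrapper implementing the aligned dynamics: the frozen delta
    action model corrects the policy's action before the physics step.

    `sim` is any object with `step(state, action) -> next_state`, and
    `delta_model(state, action) -> delta_action` is the frozen correction;
    both are placeholders for whatever simulator / network is actually used.
    """

    def __init__(self, sim, delta_model):
        self.sim = sim
        self.delta_model = delta_model   # parameters frozen during fine-tuning

    def step(self, state, action):
        corrected = action + self.delta_model(state, action)
        return self.sim.step(state, corrected)

# Fine-tuning runs PPO on the pre-trained policy inside this wrapper; at
# deployment the raw robot is used and the policy acts alone.
```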
4.2.2.4. Policy Deployment

- Finally, the fine-tuned policy is deployed directly in the real world.
- Crucially, during real-world deployment the delta action model is not used. The fine-tuned policy itself has learned to generate actions that are inherently robust and adapted to the real-world dynamics, thanks to the fine-tuning process in the aligned simulator.
- The fine-tuned policy demonstrates enhanced real-world motion tracking performance compared to the pre-trained policy, showcasing ASAP's effectiveness in bridging the sim-to-real gap.

This final deployment stage is shown in Figure 2(d).
5. Experimental Setup
The ASAP framework is evaluated across various transfer scenarios to assess its ability to compensate for dynamics mismatch and improve policy performance.
5.1. Datasets

- Simulation Data:
  - The primary dataset used for motion tracking in simulation is the retargeted motion dataset, derived from human motion videos and retargeted to the humanoid robot.
  - These motions are categorized into three difficulty levels: easy, medium, and hard, based on their complexity and the agility required. Examples of these motions are partially visualized in Figure 6, including Side Jump, Single Foot Balance, Squat, Step Backward, Step Forward, and Walk.

Figure 6 from the original paper: The image is a comparative table showing the performance of the ASAP method against other methods in different testing environments (IsaacSim and Genesis). The table includes data on success rates, trajectory errors, and other metrics, as well as illustrations of different actions (such as jumping, balancing, squatting, etc.). It also presents results across various difficulty levels (easy, medium, hard), demonstrating the significant advantages of ASAP.

  - This dataset serves as the source of imitation goals for training and fine-tuning motion tracking policies in IsaacGym, IsaacSim, and Genesis.
- Real-World Data (for the Unitree G1):
  - Real-world data collection is explicitly performed to train the delta action model. Due to practical constraints such as motor overheating and hardware failures during dynamic motion execution (two Unitree G1 robots reportedly broke), training the full 23-DoF delta action model was deemed infeasible given the limited data.
  - Instead, a more sample-efficient approach was adopted, focusing on learning a 4-DoF ankle delta action model. This decision was justified by:
    - the impracticality of collecting enough motion clips to train a 23-DoF model, and
    - the Unitree G1 robot's mechanical linkage design in the ankle, which introduces a significant and difficult-to-model sim-to-real gap [37].
  - For the 4-DoF ankle delta action model, 100 motion clips were collected, which proved sufficient.
  - The real-world experiments prioritize motion safety and representativeness, selecting five motion-tracking tasks: (i) kick, (ii) jump forward, (iii) step forward and back, (iv) single foot balance, and (v) single foot jump.
  - Each task's tracking policy was executed 30 times.
  - Additionally, 10 minutes of locomotion data were collected to train a robust locomotion policy for transitioning between different motion-tracking tasks in the real world (since real robots cannot be easily reset like in simulators).
5.2. Evaluation Metrics
The paper uses several quantitative metrics to evaluate the performance of motion tracking policies, particularly focusing on tracking error and success rate. (A short code sketch of the two MPJPE metrics appears at the end of this subsection.)

- Success Rate:
  - Conceptual Definition: This metric indicates the proportion of attempts where the robot successfully imitates the reference motion without exceeding a certain tracking error threshold. It measures the robustness and overall ability of the policy to complete the task.
  - Calculation: An imitation is deemed unsuccessful if, at any point during the motion, the average body distance between the robot and the reference motion exceeds the threshold.
  - Formula (implied rather than explicitly provided in the paper):
    $$ \text{Success Rate} = \frac{\text{Number of successful episodes}}{\text{Total number of episodes}} \times 100\% $$
    where an episode is successful if the average body distance stays within the threshold at every timestep.
- Global Body Position Tracking Error ($E_{\mathrm{g\text{-}mpjpe}}$, mm):
  - Conceptual Definition: This metric quantifies the average Euclidean distance between corresponding body parts (or joints) of the robot's actual (or simulated) global positions and the reference motion's global positions. It measures how well the robot's entire body tracks the overall spatial trajectory of the reference. The g- prefix indicates "global".
  - Mathematical Formula:
    $$ E_{\mathrm{g\text{-}mpjpe}} = \frac{1}{N \cdot J} \sum_{i=1}^{N} \sum_{j=1}^{J} \left\| \mathbf{p}_{i,j}^{\text{robot}} - \mathbf{p}_{i,j}^{\text{ref}} \right\|_2 $$
  - Symbol Explanation:
    - $N$: the total number of frames or timesteps in the motion sequence.
    - $J$: the total number of tracked joints or body parts on the robot.
    - $\mathbf{p}_{i,j}^{\text{robot}}$: the 3D global position of joint $j$ on the robot at frame $i$.
    - $\mathbf{p}_{i,j}^{\text{ref}}$: the 3D global position of joint $j$ in the reference motion at frame $i$.
    - $\|\cdot\|_2$: the Euclidean (L2) norm.
- Root-relative Mean Per-Joint Position Error ($E_{\mathrm{mpjpe}}$, mm):
  - Conceptual Definition: This metric also quantifies the average Euclidean distance between corresponding joints, but after aligning both the robot's pose and the reference motion's pose to a common root joint (e.g., the pelvis). This normalization removes errors due to overall global translation and rotation, focusing specifically on the accuracy of the robot's internal body configuration and relative joint positions.
  - Mathematical Formula:
    $$ E_{\mathrm{mpjpe}} = \frac{1}{N \cdot J} \sum_{i=1}^{N} \sum_{j=1}^{J} \left\| \left(\mathbf{p}_{i,j}^{\text{robot}} - \mathbf{p}_{i,\text{root}}^{\text{robot}}\right) - \left(\mathbf{p}_{i,j}^{\text{ref}} - \mathbf{p}_{i,\text{root}}^{\text{ref}}\right) \right\|_2 $$
  - Symbol Explanation:
    - $\mathbf{p}_{i,\text{root}}^{\text{robot}}$, $\mathbf{p}_{i,\text{root}}^{\text{ref}}$: the 3D global positions of the root joint on the robot and in the reference motion at frame $i$; the differences in parentheses are the root-relative positions of joint $j$.
- Acceleration Error ($E_{\mathrm{acc}}$):
  - Conceptual Definition: Measures the average difference in acceleration between the robot's actual (or simulated) motion and the reference motion. High acceleration errors can indicate jerky, unnatural, or poorly controlled movements, which is especially important for agile motions.
  - Formula (not explicitly provided; generally the mean error of acceleration vectors derived from position data):
    $$ E_{\mathrm{acc}} = \frac{1}{N \cdot J} \sum_{i=1}^{N} \sum_{j=1}^{J} \left\| \mathbf{a}_{i,j}^{\text{robot}} - \mathbf{a}_{i,j}^{\text{ref}} \right\|_2 $$
    where $\mathbf{a}_{i,j}^{\text{robot}}$ and $\mathbf{a}_{i,j}^{\text{ref}}$ are the acceleration vectors of joint $j$ at frame $i$ for the robot and the reference motion, respectively.
- Root Velocity Error ($E_{\mathrm{vel}}$, mm/frame):
  - Conceptual Definition: Measures the average difference in linear velocity of the robot's root (base) between its actual (or simulated) motion and the reference motion. This is critical for assessing how well the robot matches the desired speed and direction of overall movement.
  - Formula (not explicitly provided; generally the mean error of root velocity vectors):
    $$ E_{\mathrm{vel}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{v}_{i,\text{root}}^{\text{robot}} - \mathbf{v}_{i,\text{root}}^{\text{ref}} \right\|_2 $$
    where $\mathbf{v}_{i,\text{root}}^{\text{robot}}$ and $\mathbf{v}_{i,\text{root}}^{\text{ref}}$ are the root linear velocity vectors at frame $i$.

The mean values of these metrics are computed across all motion sequences used in the evaluation.
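For reference, the two MPJPE-style metrics can be computed directly from arrays of global joint positions. The sketch below assumes inputs of shape (N, J, 3) and a root joint index of 0; both are illustrative conventions rather than details from the paper.

```python
import numpy as np

def g_mpjpe(p_robot, p_ref):
    """Global MPJPE: mean Euclidean distance over frames and joints.

    p_robot, p_ref : arrays of shape (N, J, 3) with global joint positions.
    Returns the error in the same length unit as the inputs.
    """
    return np.linalg.norm(p_robot - p_ref, axis=-1).mean()

def mpjpe(p_robot, p_ref, root_idx=0):
    """Root-relative MPJPE: subtract each frame's root position before comparing."""
    rel_robot = p_robot - p_robot[:, root_idx:root_idx + 1, :]
    rel_ref = p_ref - p_ref[:, root_idx:root_idx + 1, :]
    return np.linalg.norm(rel_robot - rel_ref, axis=-1).mean()

# Example with random data just to show the expected shapes (N frames, J joints).
N, J = 100, 23
err = g_mpjpe(np.random.randn(N, J, 3), np.random.randn(N, J, 3))
```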
5.3. Baselines
The ASAP framework is compared against several baselines to demonstrate its effectiveness in bridging the dynamics gap and improving policy performance.

- Oracle:
  - Description: This baseline represents an ideal scenario where the RL policy is both trained and evaluated entirely within the same simulator (IsaacGym).
  - Representativeness: It serves as an upper bound for performance in simulation, indicating the best possible tracking errors achievable if there were no sim-to-real gap or sim-to-sim mismatch.
- Vanilla:
  - Description: The RL policy is trained in IsaacGym (the source simulator) and then directly evaluated in the target simulation environment (IsaacSim, Genesis) or the real world (Unitree G1) without any sim-to-real adaptation or fine-tuning.
  - Representativeness: This baseline quantifies the raw dynamics mismatch and the performance degradation that occurs when a policy is transferred without explicit adaptation strategies. It highlights the problem ASAP aims to solve.
- SysID (System Identification):
  - Description: This method attempts to align simulator dynamics with real-world dynamics by identifying and tuning specific physical parameters.
  - Implementation: The authors specifically identify the base center of mass (CoM) shift, the base link mass offset ratio, and the low-level PD gain ratios for each of the 23 DoFs. These parameters are searched within discrete ranges by replaying real-world recorded trajectories in simulation and finding the best alignment. After identifying the best SysID parameters, the pretrained policy is fine-tuned in IsaacGym with these adjusted parameters.
  - Representativeness: This is a common and traditional approach to sim-to-real transfer, serving as a strong baseline for explicit dynamics calibration.
  - The SysID parameters and their ranges are provided in Table VII in the Appendix:

| Parameter | Range | Parameter | Range |
| --- | --- | --- | --- |
| c_x | [−0.02, 0.0, 0.02] | c_y | [−0.02, 0.0, 0.02] |
| c_z | [−0.02, 0.0, 0.02] | k_m | [0.95, 1.0, 1.05] |
| k_p | [0.95, 1.0, 1.05] | k_d | [0.95, 1.0, 1.05] |

  The above are the results from Table VII of the original paper. Note: the extracted table garbled two of the parameter labels; assuming standard SysID parameters, the second entry of the first row is taken to be c_y (base CoM shift in y) and the last entry k_d (D gain ratio), alongside the CoM shifts c_x and c_z, the mass offset ratio k_m, and the P gain ratio k_p.
- DeltaDynamics:
  - Description: This baseline learns a residual dynamics model to capture the discrepancy between simulated and real-world physics.
  - Implementation (from Appendix C): Using the collected real-world trajectories, the recorded actions are replayed in simulation to obtain simulated trajectories. A neural residual dynamics model f_θ^Δ is trained to predict the difference in state transitions. The loss function is a mean squared error (MSE) in an autoregressive setting, where the corrected simulator is rolled forward for K steps:
    $$ \mathcal{L} = \Bigl\| s_{t+K}^{\mathrm{real}} - \underbrace{f^{\mathrm{sim}}\bigl(\ldots, f^{\mathrm{sim}}}_{K}\bigl(s_t, a_t\bigr) + f_{\theta}^{\Delta}(s_t, a_t), \ldots, a_{t+K}\bigr) \Bigr\| $$
    After training, this residual dynamics model is frozen and integrated into the simulator. The pretrained policy is then fine-tuned in this augmented simulation environment. (A sketch of this K-step loss appears at the end of this subsection.)
  - Representativeness: This baseline represents a state-of-the-art approach to dynamics modeling and sim-to-real transfer by learning explicit state-level corrections. It directly contrasts with ASAP's delta action model approach.

These baselines provide a comprehensive comparison, allowing the authors to evaluate ASAP against methods that rely on different principles for sim-to-real adaptation.
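To clarify how the DeltaDynamics baseline is trained, here is a schematic PyTorch version of the K-step autoregressive loss. The `f_sim` and `f_delta` callables, the tensor layout, and the exact placement of the correction are simplifying assumptions that follow Appendix C only loosely.

```python
import torch
import torch.nn as nn

def k_step_residual_loss(f_sim, f_delta: nn.Module, states, actions, K=4):
    """Autoregressive training loss for the DeltaDynamics baseline (sketch).

    f_sim(s, a)    : simulator step, returns the next state
    f_delta(s, a)  : residual network predicting the state-transition correction
    states, actions: real-world tensors of shape (T, state_dim) / (T, act_dim)

    Rolls the corrected simulator forward K steps from each s_t and penalizes
    the distance to the recorded real state s_{t+K}.
    """
    T = states.shape[0]
    loss = 0.0
    count = 0
    for t in range(T - K):
        s = states[t]
        for k in range(K):
            a = actions[t + k]
            s = f_sim(s, a) + f_delta(s, a)   # corrected rollout
        loss = loss + ((s - states[t + K]) ** 2).mean()
        count += 1
    return loss / max(count, 1)
```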
6. Results & Analysis
The paper presents extensive experimental results to validate ASAP's performance across sim-to-sim and sim-to-real transfer scenarios, addressing three key questions (Q1-Q3) and further conducting extensive studies (Q4-Q6).
6.1. Core Results Analysis
6.1.1. Q1: Comparison of Dynamics Matching Capability (Open-Loop Performance)
This section addresses Q1: Can ASAP outperform other baseline methods to compensate for the dynamics mismatch? by evaluating open-loop performance. Open-loop evaluation measures how accurately a method can reproduce testing-environment trajectories in the training environment (IsaacGym) after learning the dynamics mismatch. This is done by replaying actions and comparing the resulting states using metrics like MPJPE.
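Conceptually, the open-loop evaluation replays recorded actions through the (possibly corrected) training simulator and measures how far the rollout drifts from the recorded trajectory. The sketch below assumes hypothetical `train_sim.step`, trajectory fields, and an `mpjpe_fn` such as the one sketched in Section 5.2; it illustrates the procedure, not the authors' evaluation code.

```python
import numpy as np

def open_loop_replay_error(train_sim, real_traj, mpjpe_fn, horizon):
    """Open-loop check of dynamics alignment (sketch).

    Starting from each recorded state, replay the recorded actions in the
    training simulator for `horizon` steps and compute a tracking error against
    the recorded trajectory. `train_sim.step(state, action)` and the fields of
    `real_traj` (state, action, joint_world_positions) are placeholders.
    """
    errors = []
    T = len(real_traj.steps)
    for t0 in range(0, T - horizon):
        s = real_traj.steps[t0].state
        rollout, reference = [], []
        for k in range(horizon):
            s = train_sim.step(s, real_traj.steps[t0 + k].action)
            rollout.append(s.joint_world_positions)                       # (J, 3), assumed
            reference.append(real_traj.steps[t0 + k + 1].joint_world_positions)
        errors.append(mpjpe_fn(np.stack(rollout), np.stack(reference)))
    return float(np.mean(errors))
```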
The following are the results from Table III of the original paper:
| Simulator & Length | IsaacSim | Genesis | |||||||
| E_g-mpjpe | E_mpjpe | E_acc | E_vel | E_g-mpjpe | E_mpjpe | E_acc | E_vel | |
| 0.25s | OpenLoop | 19.5 | 15.1 | 6.44 | 5.80 | 19.8 | 15.3 | 6.53 | 5.88 |
| SysID | 19.4 | 15.0 | 6.43 | 5.74 | 19.3 | 15.0 | 6.42 | 5.73 | |
| DeltaDynamics | 24.4 | 13.6 | 9.43 | 7.85 | 20.0 | 12.4 | 8.42 | 6.89 | |
| ASAP | 19.9 | 15.6 | 6.48 | 5.86 | 19.0 | 14.9 | 6.19 | 5.59 | |
| 0.5s | OpenLoop | 33.3 | 23.2 | 6.80 | 6.84 | 33.1 | 23.0 | 6.78 | 6.82 |
| SysID | 32.1 | 22.2 | 6.57 | 6.56 | 32.2 | 22.3 | 6.57 | 6.57 | |
| DeltaDynamics | 36.5 | 16.4 | 8.89 | 7.98 | 27.8 | 14.0 | 7.63 | 6.74 | |
| ASAP | 26.8 | 19.2 | 5.09 | 5.36 | 25.9 | 18.4 | 4.93 | 5.19 | |
| 1.0s | OpenLoop | 80.8 | 43.5 | 10.6 | 11.1 | 82.5 | 44.5 | 10.8 | 11.4 |
| SysID | 77.6 | 41.5 | 10.0 | 10.7 | 76.5 | 41.6 | 10.00 | 10.5 | |
| DeltaDynamics | 68.1 | 21.5 | 9.61 | 9.14 | 50.2 | 17.2 | 8.19 | 7.62 | |
| ASAP | 37.9 | 22.9 | 4.38 | 5.26 | 36.9 | 22.6 | 4.23 | 5.10 | |
Analysis of Table III:

- ASAP's Superiority: ASAP consistently outperforms the OpenLoop baseline across all replayed motion lengths (0.25s, 0.5s, 1.0s) and in both IsaacSim and Genesis, achieving significantly lower E_g-mpjpe (global body position tracking error) and E_mpjpe (root-relative mean per-joint position error) values. This indicates that ASAP's delta action model is highly effective at aligning the simulator's dynamics with the target environment, leading to improved open-loop trajectory reproduction. For example, at the 1.0s length in IsaacSim, ASAP achieves an E_g-mpjpe of 37.9, a substantial reduction from OpenLoop's 80.8.
- SysID's Limitations: While SysID shows some improvement over OpenLoop for short horizons (e.g., 0.25s), it struggles in long-horizon scenarios. Its performance degrades as motion length increases, suggesting that SysID may not fully capture the complex dynamics mismatch or that cumulative errors build up over longer trajectories. Its E_g-mpjpe values remain high for longer durations.
- DeltaDynamics' Behavior: DeltaDynamics improves upon SysID and OpenLoop for long horizons (e.g., at 1.0s, its E_g-mpjpe of 68.1 in IsaacSim is better than SysID's 77.6). However, the paper notes that DeltaDynamics suffers from overfitting, leading to cascading errors magnified over time, which may explain its less consistent performance compared to ASAP.
- Agility Metrics (E_acc, E_vel): ASAP also shows superior performance in acceleration error (E_acc) and root velocity error (E_vel), especially for longer durations. For 1.0s motions, ASAP significantly reduces these errors compared to all baselines in both IsaacSim and Genesis, indicating better agreement in dynamic behavior and motion quality.

These results emphasize the efficacy of ASAP's delta action model in reducing the physics gap and improving open-loop replay performance, showcasing its superior generalization capability.
The following figure (Figure 5 from the original paper) illustrates the per-step MPJPE error for different methods, visually supporting the quantitative results.
Figure 5 from the original paper: The image is an illustration showing the MPJPE (millimeters) error over time steps for different methods (Vanilla, SysID, Delta Dynamics, and ASAP). It depicts the action execution results for each method, with a graph of the error curves below.
Analysis of Figure 5: The graph clearly shows that ASAP maintains the lowest MPJPE over time steps compared to Vanilla, SysID, and DeltaDynamics. The error accumulation for other methods is evident, with SysID and DeltaDynamics showing better initial performance than Vanilla but still diverging from ASAP over time. This visual evidence reinforces ASAP's ability to reduce dynamics mismatch effectively over longer horizons.
6.1.2. Q2: Comparison of Policy Fine-Tuning Performance (Closed-Loop Performance)
This section addresses Q2: Can ASAP finetune policy to outperform SysID and Delta Dynamics methods? by evaluating closed-loop performance. This involves fine-tuning RL policies in the respective modified training environments and then deploying them in the testing environments (IsaacSim, Genesis) to quantify motion-tracking errors.
The following are the results from Table IV of the original paper:
| Method & Type | IsaacSim | Genesis | |||||||||
| Success Rate (%) | E_g-mpjpe | E_mpjpe | E_acc | E_vel | Success Rate (%) | E_g-mpjpe | E_mpjpe | E_acc | E_vel | |
| Easy | Vanilla | 100 | 243 | 101 | 7.62 | 9.11 | 100 | 252 | 108 | 7.93 | 9.51 |
| SysID | 100 | 157 | 65.8 | 6.25 | 7.52 | 100 | 168 | 78.0 | 6.32 | 7.81 | |
| DeltaDynamics | 100 | 142 | 57.3 | 8.06 | 9.14 | 100 | 137 | 76.2 | 7.26 | 8.44 | |
| ASAP | 100 | 106 | 44.3 | 4.23 | 5.17 | 100 | 125 | 73.5 | 4.62 | 5.43 | |
| Medium | Vanilla | 100 | 305 | 126 | 9.82 | 11.4 | 100 | 316 | 131 | 9.94 | 11.6 |
| SysID | 100 | 183 | 88.6 | 7.83 | 8.99 | 100 | 206 | 91.2 | 7.91 | 9.21 | |
| DeltaDynamics | 98 | 169 | 68.9 | 9.03 | 10.3 | 97 | 158 | 81.3 | 8.56 | 9.65 | |
| ASAP | 100 | 126 | 52.7 | 4.75 | 5.52 | 100 | 148 | 80.2 | 5.02 | 5.91 | |
| Hard | Vanilla | 98 | 389 | 160 | 11.2 | 13.1 | 95 | 401 | 166 | 11.5 | 13.5 |
| SysID | 98 | 212 | 109 | 9.01 | 10.5 | 96 | 235 | 112 | 9.31 | 10.7 | |
| DeltaDynamics | 94 | 196 | 79.4 | 10.1 | 11.5 | 90 | 177 | 86.7 | 9.23 | 10.5 | |
| ASAP | 100 | 134 | 55.3 | 5.23 | 6.01 | 100 | 129 | 77.0 | 5.46 | 6.28 | |
Analysis of Table IV:
- ASAP's Consistent Outperformance: ASAP consistently achieves the lowest tracking errors across all difficulty levels (Easy, Medium, Hard) in both IsaacSim and Genesis. For instance, in IsaacSim at the Hard level, ASAP achieves an E_g-mpjpe of 134 and E_mpjpe of 55.3, significantly outperforming SysID (212, 109) and DeltaDynamics (196, 79.4). Similar trends are observed in Genesis.
- Reduced Agility Errors: ASAP also shows markedly lower acceleration error (E_acc) and root velocity error (E_vel) across all scenarios. This indicates that ASAP fine-tunes policies to produce smoother, more natural, and dynamically accurate motions, which is crucial for agile skills.
- High Success Rate: ASAP consistently maintains a 100% success rate across all sim-to-sim transfer evaluations, unlike DeltaDynamics, which drops slightly in harder environments (e.g., 90% in Genesis Hard). This demonstrates ASAP's robustness in closed-loop control.
- Baselines' Limitations: Vanilla policies show the highest errors, as expected, highlighting the dynamics mismatch. SysID and DeltaDynamics improve upon Vanilla but still fall short of ASAP's performance, particularly in E_mpjpe, E_acc, and E_vel, suggesting that their dynamics compensation is less effective, or prone to overfitting (for DeltaDynamics), when used for closed-loop policy fine-tuning.

These results highlight ASAP's robustness and adaptability in addressing the sim-to-real gap for closed-loop control, preventing overfitting and ensuring reliable deployment.
The following figure (Figure 7 from the original paper) provides per-step visualizations of motion tracking error, comparing ASAP with RL policies deployed without fine-tuning.
Figure 7 from the original paper: The image is a schematic diagram that illustrates the changes in motion tracking error before and after the Delta Action fine-tuning across two scenarios (from IsaacGym to IsaacSim and from IsaacGym to Genesis). The top section shows the motion states, while the bottom section displays the error curves over time, with red and blue lines representing the comparison of MPJPE (mm) error before and after fine-tuning.
Analysis of Figure 7: The visualizations confirm that ASAP (Delta Action Fine-tuning) consistently maintains lower MPJPE over time compared to the Vanilla (No Fine-tuning) policies. The Vanilla policies accumulate errors, leading to degraded tracking, while ASAP adapts to the new dynamics and sustains stable performance. This visual evidence further validates ASAP's effectiveness in closed-loop performance.
6.1.3. Q3: Real-World Evaluations (Sim-to-Real Transfer)
This section addresses Q3: Does ASAP work for sim-to-real transfer? by deploying ASAP on the real-world Unitree G1 robot.
- Real-World Data Strategy: As mentioned in the Experimental Setup, due to hardware constraints and motor overheating, a sample-efficient approach was adopted: learning a 4-DoF ankle delta action model instead of the full 23-DoF model. This targeted approach addresses a known sim-to-real gap in the Unitree G1's ankle mechanical linkage (see the sketch at the end of this subsection for how such a restricted delta can be applied).
- Policy Transition: A robust locomotion policy was trained to handle transitions between different motion-tracking tasks in the real world, allowing the robot to maintain balance without manual resets.

The following are the results from Table V of the original paper:
| Method | Real-World-Kick E_g-mpjpe | Real-World-Kick E_mpjpe | Real-World-Kick E_acc | Real-World-Kick E_vel | Real-World-LeBron (OOD) E_g-mpjpe | Real-World-LeBron (OOD) E_mpjpe | Real-World-LeBron (OOD) E_acc | Real-World-LeBron (OOD) E_vel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 61.2 | 40.1 | 2.46 | 2.70 | 159 | 47.5 | 2.84 | 5.94 |
| ASAP | 50.2 | 43.5 | 2.96 | 2.91 | 112 | 55.3 | 3.43 | 6.43 |
Analysis of Table V (Real-World Results):
- ASAP's Real-World Performance: ASAP consistently outperforms the Vanilla baseline on both the in-distribution (Real-World-Kick) and out-of-distribution (OOD) (Real-World-LeBron "Silencer") humanoid motion tracking tasks.
- Reduced Global Tracking Error: For the Kick motion, ASAP reduces E_g-mpjpe from 61.2 to 50.2, a significant improvement (approx. 18% reduction). For the more challenging LeBron (OOD) motion, ASAP achieves an E_g-mpjpe of 112 compared to Vanilla's 159 (approx. 30% reduction), demonstrating its ability to generalize to unseen dynamic motions.
- Mixed Root-Relative MPJPE: Interestingly, E_mpjpe is slightly higher for ASAP in both cases (43.5 vs. 40.1 for Kick, and 55.3 vs. 47.5 for LeBron). The paper notes that ASAP fine-tuning makes the robot behave more smoothly and reduces jerky lower-body motions (Figure 8), which may shift the distribution of root-relative joint positions while still improving overall global tracking, a favorable trade-off for agility.
- Agility Metrics (E_acc, E_vel): E_acc and E_vel are slightly higher for ASAP in the real-world results compared to Vanilla. This may indicate that, while ASAP achieves better global motion tracking, the fine-tuned policies introduce somewhat higher instantaneous accelerations and velocities to execute the agile motions precisely, or it may reflect a different trade-off in real-world motor control. Despite this, the visual and qualitative assessment (Figure 8) points to smoother, more coordinated motion.

These findings highlight ASAP's effectiveness in improving sim-to-real transfer for agile humanoid motion tracking, even with a limited-DoF delta action model.
The following figure (Figure 8 from the original paper) visually compares actions before and after Delta Action fine-tuning.
Figure 8 from the original paper: The image is a diagram showing the comparison of actions before and after Delta Action fine-tuning. The top part illustrates the motion states before fine-tuning, while the bottom part shows the states after fine-tuning. The left side represents in-distribution fine-tuning, and the right side represents out-of-distribution fine-tuning, showcasing the improved agility and coordination.
Analysis of Figure 8: The visual comparison supports the claim that ASAP fine-tuning leads to smoother and more coordinated lower-body motions. The Vanilla policy (top row) appears to exhibit more jerky or less precise movements compared to the ASAP fine-tuned policy (bottom row), especially visible in the in-distribution kick motion and the out-of-distribution LeBron motion.
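As a rough illustration of the 4-DoF ankle delta action strategy described above, the sketch below applies a learned correction only to a chosen subset of joints while leaving the rest of the 23-DoF action untouched. The joint indices and the `delta_model` callable are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical indices for the ankle DoFs; the real Unitree G1 joint
# ordering is not reproduced here.
ANKLE_IDX = np.array([4, 5, 10, 11])

def apply_partial_delta(action, state, delta_model):
    """Add a learned delta action on a subset of joints only.

    action: (23,) nominal policy action for all DoFs.
    delta_model(state): assumed to return a (4,) correction for the
    ankle joints, mirroring the 4-DoF ankle delta action model.
    """
    corrected = action.copy()
    corrected[ANKLE_IDX] += delta_model(state)  # other joints unchanged
    return corrected
```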
6.2. Extensive Studies and Analyses
6.2.1. Q4: Key Factors in Training Delta Action Models
This section addresses Q4: How to best train the delta action model of ASAP? through a systematic study on dataset size, training horizon, and action norm weight.
The following figure (Figure 10 from the original paper) presents the analysis of these factors.
Figure 10 from the original paper: The image is a chart that illustrates the impact of dataset size, training horizon, and action norm weight on the performance of the delta action model. Panel (a) shows the Mean Per-Joint Position Error (MPJPE) across different dataset sizes, (b) presents the influence of training horizon on performance, and (c) describes the relationship between variations in action norm weight and closed-loop MPJPE.
- a) Dataset Size (Figure 10a):
  - Finding: Increasing the dataset size generally improves the delta action model's generalization, reducing MPJPE on out-of-distribution (unseen) trajectories during open-loop evaluation. However, the improvement in closed-loop performance (when the fine-tuned policy is deployed) saturates: only a marginal 0.65% decrease in MPJPE was observed when scaling from 4,300 to 43,000 samples.
  - Implication: While more data helps the delta action model generalize to new trajectories, there is a point of diminishing returns for policy fine-tuning. A moderately sized dataset (e.g., 4,300 samples) can be sufficient for good closed-loop performance.
- b) Training Horizon (Figure 10b):
  - Finding: Longer training horizons (the length of the trajectory segments used to train the delta action model) generally improve open-loop performance, with a 1.5s horizon achieving the lowest errors for evaluations at 0.25s, 0.5s, and 1.0s. However, this trend does not carry over to closed-loop performance, where the best results are observed at a training horizon of 1.0s.
  - Implication: An excessively long training horizon for the delta action model does not necessarily benefit the final fine-tuned policy. There is an optimal trade-off: a horizon that captures sufficient temporal dependencies (around 1.0s) is most effective for policy fine-tuning.
- c) Action Norm Weight (Figure 10c):
  - Finding: The action norm reward (part of the regularization term in Table II for delta action learning) is crucial for balancing dynamics alignment with minimal corrections. Both open-loop and closed-loop errors decrease as the action norm weight increases, reaching their lowest values at a weight of 0.1; increasing the weight further causes open-loop errors to rise again.
  - Implication: Careful tuning of the action norm weight is vital. Too low, and the delta action model may make unnecessarily large corrections; too high, and the minimal-action-norm reward dominates the learning objective, preventing effective dynamics compensation. A weight of 0.1 appears to balance these objectives; a minimal sketch of this trade-off is given below, after this list.
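As a minimal sketch of the trade-off studied in Figure 10c, the snippet below combines a dynamics-alignment term with an action-norm penalty weighted by `w_norm`. The exact reward terms and scales in the paper differ; this only illustrates the role of the weight, with 0.1 as the value the study found to work best.

```python
import numpy as np

def delta_action_reward(next_state_sim, next_state_real, delta_action,
                        w_norm=0.1):
    """Illustrative reward for training the delta action model.

    The first term rewards the augmented simulator for reproducing the
    recorded real-world next state; the second penalizes large delta
    actions so the learned corrections stay minimal.
    """
    alignment = -np.linalg.norm(next_state_sim - next_state_real)
    norm_penalty = -np.sum(np.square(delta_action))
    return alignment + w_norm * norm_penalty
```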
6.2.2. Q5: Different Usage of Delta Action Model
This section addresses Q5: How to best use the delta action model of ASAP? by comparing different strategies for fine-tuning the nominal policy $\pi$ using the learned delta action model $\pi^\Delta$. The goal is to obtain a fine-tuned policy $\pi^*$ for real-world deployment.
The underlying relationship, derived from one-step dynamics matching, is:

$$f^{\text{sim}}\big(s, \pi^*(s) + \pi^\Delta(s, \pi^*(s))\big) = f^{\text{sim}}\big(s, \pi(s)\big)$$

which simplifies to:

$$\pi^*(s) + \pi^\Delta(s, \pi^*(s)) = \pi(s) \qquad \text{(Equation 2 in the paper's appendix)}$$
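For intuition, the left-hand side of the equation above, stepping the simulator with the delta-corrected action, can be sketched as below; `sim_step` stands in for the physics engine step and `delta_model` for the learned $\pi^\Delta$, both placeholders rather than the paper's code.

```python
def asap_sim_step(sim_step, delta_model, state, action):
    """One step of the delta-action-augmented ('aligned') simulator.

    The nominal action is corrected by the learned delta action model
    before being executed, so the simulated transition approximates the
    real-world transition for the same nominal action.
    """
    corrected_action = action + delta_model(state, action)
    return sim_step(state, corrected_action)
```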
The paper considers three approaches to solve for $\pi^*$:
- Fixed-Point Iteration (RL-Free):
  - Concept: This method iteratively refines the policy. It starts with an initial guess $\pi^{(0)} = \pi$ (the nominal policy) and then updates it using the learned delta action model via the fixed-point iteration $\pi^{(k+1)}(s) = \pi(s) - \pi^\Delta\big(s, \pi^{(k)}(s)\big)$. The iteration continues for $K$ steps, with $\pi^{(K)}$ expected to converge to the solution of the one-step matching equation (see the sketch at the end of this subsection).
  - Limitation: As an RL-free method, it is myopic (it only considers one-step matching) and can suffer from out-of-distribution (OOD) issues if the iterated actions drift into regions where $\pi^\Delta$ was not trained.
- Gradient-Based Optimization (RL-Free):
  - Concept: This approach formulates the problem as an optimization task that minimizes a loss quantifying the discrepancy from the desired relationship, e.g. $\mathcal{L}(a) = \big\| a + \pi^\Delta(s, a) - \pi(s) \big\|^2$. Gradient descent is then used to find the action $a^* = \pi^*(s)$ that minimizes this loss (also sketched at the end of this subsection).
  - Limitation: Like fixed-point iteration, this method is myopic (it rests on a one-step matching assumption) and can struggle with OOD data, especially over multi-step trajectories. It also requires differentiating through the delta action model, which adds complexity.
- RL Fine-Tuning (ASAP's approach):
  - Concept: Instead of attempting to solve the one-step matching equation directly, ASAP uses Reinforcement Learning (PPO) to fine-tune the nominal policy within the simulator augmented with the delta action model $\pi^\Delta$. The RL agent learns directly through interaction in this aligned simulator.
  - Benefit: This approach effectively performs a gradient-free, multi-step matching procedure. By optimizing cumulative rewards over entire trajectories, RL fine-tuning accounts for long-term consequences and adapts more robustly to potential OOD issues encountered during rollouts, which the myopic RL-free methods cannot.

The following figure (Figure 11 from the original paper) compares the MPJPE over timesteps for these fine-tuning methods.
Figure 11 from the original paper: The image is a chart that shows the comparison of MPJPE (mm) over time for different fine-tuning methods. The ASAP method (red) performs best in terms of error, while Gradient Search (yellow) and Fixed-Point Iteration (green) perform worse than the baseline (blue, Before DeltaA).

Analysis of Figure 11: The graph clearly shows that RL Fine-Tuning (ASAP's method) achieves the lowest MPJPE during deployment, consistently outperforming the Fixed-Point Iteration and Gradient Search methods, as well as the Before DeltaA baseline (i.e., the Vanilla policy without any delta action compensation). Both RL-free approaches (Fixed-Point Iteration and Gradient Search) perform worse than the Before DeltaA baseline, indicating their failure to effectively adapt the policy. This validates ASAP's choice of RL fine-tuning as the most effective strategy for utilizing the delta action model, overcoming the myopic and OOD limitations of RL-free methods for multi-step dynamics adaptation.
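The two RL-free solvers referenced above can be sketched per state as follows (PyTorch). Here `pi_s` is assumed to be the nominal action $\pi(s)$ as a tensor and `delta_model` a differentiable module approximating $\pi^\Delta$; the iteration count and learning rate are arbitrary choices, not values from the paper.

```python
import torch

@torch.no_grad()
def fixed_point_action(pi_s, delta_model, state, n_iters=10):
    """Fixed-point iteration on a = pi(s) - pi_delta(s, a)."""
    a = pi_s.clone()
    for _ in range(n_iters):
        a = pi_s - delta_model(state, a)
    return a

def gradient_search_action(pi_s, delta_model, state, steps=100, lr=1e-2):
    """Gradient descent on || a + pi_delta(s, a) - pi(s) ||^2."""
    pi_s = pi_s.detach()
    a = pi_s.clone().requires_grad_(True)
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((a + delta_model(state, a) - pi_s) ** 2).sum()
        loss.backward()
        opt.step()
    return a.detach()
```

ASAP itself avoids both solvers and instead fine-tunes the policy with PPO inside the aligned simulator, which matches dynamics over whole trajectories rather than single steps.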
6.2.3. Q6: Does ASAP Fine-Tuning Outperform Random Action Noise Fine-Tuning?
This section addresses Q6: Why and how does ASAP work? by comparing ASAP fine-tuning with injecting random action noise during fine-tuning, and by visualizing the learned delta action model's output.
- Comparison with Random Action Noise:
  - Random torque noise [7] is a common domain randomization technique. To test whether ASAP's delta action is more than just a robustness enhancer, policies are fine-tuned in IsaacGym with random action noise $a' = a + \alpha\,\epsilon$, where $\epsilon \sim \mathcal{U}(-1, 1)$ (uniform random noise) and $\alpha$ is the noise magnitude. These policies are then deployed in Genesis (a minimal sketch contrasting this with ASAP's learned correction appears at the end of this subsection).
  - Finding (Figure 12): Policies fine-tuned with random action noise at moderate noise levels do show improved global tracking error (MPJPE) compared to no fine-tuning. However, their performance (best MPJPE of 173) does not match the precision achieved by ASAP (MPJPE of 126). Beyond that range, performance degrades.
  - Implication: This suggests that while random noise can offer some robustness, it is less effective than ASAP's targeted delta action model for precise dynamics alignment.

The following figure (Figure 12 from the original paper) shows MPJPE versus noise level.

Figure 12 from the original paper: The image is a chart showing closed-loop MPJPE (mm) at different action noise levels. The MPJPE of the unfinetuned policy is 336.1, while the MPJPE after fine-tuning with ASAP is 126.9, indicating a significant improvement. The data points reflect how performance changes across noise levels.
- Visualization of Delta Action Model Output:
  - Finding (Figure 13): The average output magnitude of $\pi^\Delta$, learned from IsaacSim data, reveals non-uniform discrepancies across joints. For the G1 humanoid robot, lower-body motors (especially the ankle and knee joints) exhibit a significantly larger dynamics gap than upper-body joints. Furthermore, asymmetries between left and right body motors are evident.
  - Implication: This structured discrepancy cannot be effectively addressed by merely adding uniform random action noise. ASAP's delta action model learns specific, targeted corrections for each joint, which is why it achieves superior tracking precision compared to naive randomization strategies. It learns what and where to correct, rather than just forcing general robustness.

The following figure (Figure 13 from the original paper) visualizes the delta action model's output magnitude.

Figure 13 from the original paper: The image is a diagram illustrating the output magnitude of $\pi^\Delta$ from IsaacGym to IsaacSim. The size of the red dots indicates the magnitude for each joint, with numerical labels placed accordingly. The results suggest that the lower-body motors exhibit a larger discrepancy than the upper-body joints, particularly at the ankle pitch joints of the G1 humanoid robot.

Conclusion for Q6: The delta action model in ASAP not only enhances policy robustness but also enables effective adaptation to real-world dynamics by learning and applying structured, non-uniform corrections that directly address specific dynamics mismatches. This targeted approach significantly outperforms naive randomization strategies.
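To contrast the two fine-tuning signals discussed in this subsection, the sketch below places a uniform-noise perturbation next to a learned, state-conditioned delta correction. The noise formula mirrors the one assumed above, and `delta_model` is again a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_action(action, alpha=0.1):
    """Domain-randomization-style perturbation: per-joint uniform noise
    of magnitude alpha, independent of state and identically distributed
    across all joints."""
    return action + alpha * rng.uniform(-1.0, 1.0, size=action.shape)

def delta_corrected_action(action, state, delta_model):
    """ASAP-style correction: learned from data and conditioned on state
    and action, so it can be larger at the ankles/knees and asymmetric
    between left and right, unlike uniform noise."""
    return action + delta_model(state, action)
```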
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces ASAP (Aligning Simulation and Real-World Physics), a novel two-stage framework that effectively bridges the sim-to-real gap for learning agile humanoid whole-body skills. The framework first pre-trains motion tracking policies in simulation using retargeted human motion data. In the second stage, it collects real-world data from the pre-trained policies to train a delta (residual) action model that specifically learns to compensate for dynamics mismatch. This delta action model is then integrated into the simulator, enabling fine-tuning of the pre-trained policies in an aligned simulation environment.
Extensive evaluations across sim-to-sim (IsaacGym to IsaacSim, IsaacGym to Genesis) and sim-to-real (IsaacGym to Unitree G1) scenarios demonstrate ASAP's superior performance. It achieves significant reductions in motion tracking errors (up to 52.7% in sim-to-real tasks), improves agility and whole-body coordination, and enables highly dynamic motions previously difficult to achieve. The framework consistently outperforms SysID, Domain Randomization, and delta dynamics learning baselines. ASAP's success highlights the potential of delta action learning as a powerful paradigm for developing more expressive and agile humanoid robots in real-world applications. The authors also open-sourced a multi-simulator codebase to support future research.
7.2. Limitations & Future Work
The authors acknowledge several real-world limitations of ASAP:
- Hardware Constraints: Agile whole-body motions subject robots to significant stress, leading to motor overheating and hardware failures (e.g., two Unitree G1 robots were damaged). This limits the scale and diversity of real-world motion data that can be collected safely, which is crucial for training the delta action model.
- Dependence on Motion Capture Systems: The current pipeline requires a MoCap setup to record ground-truth real-world trajectories. This introduces practical deployment barriers in unstructured environments where MoCap systems are unavailable, restricting ASAP's widespread applicability.
- Data-Hungry Delta Action Training: While the authors reduced the delta action model to the 4-DoF ankle joints for sample efficiency in real-world experiments, training a full 23-DoF model remains impractical due to the substantial data demand (observed even in simulation for 23-DoF training).

Based on these limitations, the authors suggest future research directions:

- Developing damage-aware policies to mitigate hardware risks during aggressive motion execution.
- Exploring MoCap-free alignment techniques to eliminate the reliance on expensive and infrastructure-dependent motion capture systems.
- Investigating adaptation techniques for delta action models to achieve sample-efficient, few-shot alignment, reducing the amount of real-world data required.
7.3. Personal Insights & Critique
ASAP presents a highly compelling solution to the long-standing sim-to-real gap for agile humanoid control. My personal insights and critique are as follows:
- Innovation in Delta Action Learning: The core innovation of learning a delta (residual) action model is particularly elegant. Instead of trying to identify complex physical parameters (like SysID) or generalize over broad uncertainties (like DR), ASAP learns the specific corrections needed in the action space to reconcile simulated and real-world dynamics. This is a powerful form of implicit dynamics compensation that can account for both known parameter mismatches and unmodeled complexities. The comparison against DeltaDynamics (which learns residual state dynamics) clearly shows the advantage of operating in the action space for policy fine-tuning.
- Robust Two-Stage Framework: The two-stage approach (pre-train in an idealized simulation, then fine-tune in an aligned simulator) is very practical. It leverages the benefits of fast, safe simulation for initial skill acquisition and then efficiently addresses sim-to-real transfer with targeted real-world data collection. The asymmetric actor-critic and curriculum learning strategies used in pre-training further enhance robustness.
- Strong Empirical Validation: The extensive experiments, covering sim-to-sim and sim-to-real scenarios with clear quantitative metrics and visual comparisons, provide strong evidence for ASAP's effectiveness. The ablation studies on dataset size, training horizon, and action norm weight are thorough and offer valuable guidance for future implementations. The direct comparison with random action noise further solidifies the argument that ASAP learns structured, meaningful corrections.
- Applicability to Other Domains: The delta action learning principle is highly transferable. It could be applied to other robotic platforms (e.g., quadrupeds, manipulators) or other complex physical systems where sim-to-real transfer is a challenge. The idea of learning a lightweight residual model to align discrepancies, rather than a full system model, is broadly applicable in model-based control and robot learning.
- Critique on Data Dependency and MoCap: Despite the impressive results, the reliance on MoCap systems and the data-hungry nature of training the delta action model for full-DoF control remain significant practical hurdles. While the authors successfully trained a 4-DoF ankle model with 100 clips, scaling to 23 DoF for a humanoid would require substantially more data, which is challenging and expensive to collect safely in the real world. This limits ASAP's immediate deployment in MoCap-free or resource-constrained environments. Future work on sample efficiency (e.g., few-shot learning for the delta action model) and MoCap-free state estimation would be crucial for broader adoption.
- Computational Cost: The paper does not deeply examine the computational cost (training time, inference time) of training the delta action model and fine-tuning the policies, especially for a full 23-DoF robot. While IsaacGym is fast, the overall RL training loops can still be substantial.
- Interpretability of Delta Actions: While effective, the delta action model is a neural network and thus largely a black box. Understanding why specific delta actions are applied could provide deeper insights into the underlying physics mismatch beyond identifying which joints have larger discrepancies; this could, in turn, inform better physical modeling or robot design.

In summary, ASAP represents a significant step forward in enabling agile humanoid control by providing a robust and empirically validated framework for sim-to-real transfer. Its core methodology of delta action learning is both elegant and powerful, paving the way for more capable and versatile humanoid robots.