MoE-Loco: Mixture of Experts for Multitask Locomotion
TL;DR Summary
The paper presents MoE-Loco, a Mixture of Experts framework for legged robots, enabling a single policy to navigate diverse terrains while mitigating gradient conflicts in multitask reinforcement learning, enhancing training efficiency and performance.
Abstract
We present MoE-Loco, a Mixture of Experts (MoE) framework for multitask locomotion for legged robots. Our method enables a single policy to handle diverse terrains, including bars, pits, stairs, slopes, and baffles, while supporting quadrupedal and bipedal gaits. Using MoE, we mitigate the gradient conflicts that typically arise in multitask reinforcement learning, improving both training efficiency and performance. Our experiments demonstrate that different experts naturally specialize in distinct locomotion behaviors, which can be leveraged for task migration and skill composition. We further validate our approach in both simulation and real-world deployment, showcasing its robustness and adaptability.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "MoE-Loco: Mixture of Experts for Multitask Locomotion". It focuses on developing a single policy for legged robots to perform diverse locomotion tasks across various terrains and gaits using a Mixture of Experts framework.
1.2. Authors
The authors are:
- Runhan Huang*
- Shaoting Zhu*
- Yilun Du
- Hang Zhao+

Affiliations: Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University; Massachusetts Institute of Technology (MIT)
(* indicates equal contribution, + indicates corresponding author)
1.3. Journal/Conference
The paper is published as a preprint on arXiv. While not yet peer-reviewed and published in a specific journal or conference, arXiv is a reputable platform for sharing research, particularly in machine learning, robotics, and artificial intelligence, allowing early dissemination and feedback. The authors' affiliations with prominent institutions like Tsinghua University and MIT suggest a high standard of research.
1.4. Publication Year
The paper was published on 2025-03-11.
1.5. Abstract
This paper introduces MoE-Loco, a Mixture of Experts (MoE) framework designed for multitask locomotion in legged robots. The core objective is to enable a single robotic policy to proficiently navigate diverse terrains, such as bars, pits, stairs, slopes, and baffles, while also supporting both quadrupedal and bipedal gaits. By employing the MoE architecture, the method effectively mitigates gradient conflicts—a common issue in multitask reinforcement learning (RL)—thereby enhancing both training efficiency and overall performance. The authors demonstrate through experiments that different experts within the MoE framework naturally specialize in distinct locomotion behaviors. This specialization facilitates task migration and skill composition, allowing for adaptable and reusable locomotion strategies. The MoE-Loco approach is validated through extensive experiments in both simulation and real-world deployments, showcasing its robustness and adaptability across complex environments.
1.6. Original Source Link
Official Source Link: https://arxiv.org/abs/2503.08564 PDF Link: https://arxiv.org/pdf/2503.08564v2.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the significant challenge of training a single, unified locomotion policy for legged robots that can generalize across multiple diverse tasks, terrains, and locomotion modes (gaits).
This problem is important because real-world robot applications demand versatility. Robots are frequently required to traverse varied environments (e.g., uneven ground, stairs, obstacles) and perform different types of movement (e.g., walking, climbing, balancing). While single-task reinforcement learning (RL) has achieved remarkable success in specific locomotion behaviors, combining these diverse skills into a single, cohesive policy remains elusive.
Specific challenges and gaps in prior research include:
- Gradient Conflicts in Multitask RL (MTRL): When a single neural network attempts to learn multiple tasks simultaneously, the gradients from different task objectives can pull the network's weights in conflicting directions. This gradient conflict leads to unstable training, reduced efficiency, and often inferior performance compared to specialized single-task policies.
- Complexity of Diverse Gaits and Terrains: Training a policy to handle both fundamentally different gaits (e.g., quadrupedal vs. bipedal) and a wide array of challenging terrains (e.g., bars, pits, stairs) within a single model exacerbates gradient conflicts and can lead to model divergence or poor generalization.
- Lack of Interpretability and Reusability: Most RL policies are "black boxes," making it difficult to understand how different skills are learned or to easily compose existing skills for novel tasks.

The paper's entry point and innovative idea lie in integrating the Mixture of Experts (MoE) framework into multitask reinforcement learning for legged locomotion. MoE provides a modular network structure that naturally addresses gradient conflicts by allowing different "experts" (sub-networks) to specialize in distinct behaviors, with a "gating network" dynamically selecting which experts to activate for a given task.
2.2. Main Contributions / Findings
The primary contributions of MoE-Loco are:
- Development of a Unified Multitask Locomotion Policy: The paper successfully trains and deploys a single neural network policy that enables a quadruped robot to traverse a wide range of challenging terrains (bars, pits, stairs, slopes, baffles) and switch between fundamentally different locomotion modes (quadrupedal and bipedal gaits). This is a significant step towards more versatile and autonomous robots.
- Integration of MoE for Gradient Conflict Mitigation: The authors integrate the Mixture of Experts (MoE) architecture into both the actor and critic networks of their RL policy. This integration is shown to alleviate gradient conflicts, a major hurdle in multitask RL, leading to improved training efficiency and superior overall model performance compared to standard MLP-based policies.
- Demonstration of Expert Specialization and Composability: Through qualitative and quantitative analysis, the paper shows that individual experts within the MoE model naturally specialize in distinct locomotion behaviors (e.g., balancing, crawling, obstacle crossing). This expert specialization yields a more interpretable policy and opens up new possibilities for task migration (adapting to new tasks) and skill composition (combining experts with adjustable weights to synthesize new, complex skills, like a "dribbling gait," without additional training).
- Validation in Simulation and the Real World: The proposed MoE-Loco framework is validated in both high-fidelity simulation (IsaacGym) and through zero-shot deployment on a real Unitree Go2 quadruped robot. The real-world experiments, including outdoor environments, demonstrate the policy's robustness, adaptability, and generalization capabilities.

The key conclusions are that MoE is a highly effective architecture for multitask locomotion learning in RL. It not only resolves the practical issue of gradient conflicts but also introduces interpretability and composability to learned robot skills, addressing the challenge of creating versatile and adaptable robots.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Reinforcement Learning (RL)
Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent observes the state of the environment, takes an action, and in response, the environment transitions to a new state and provides a reward (or penalty). The agent's goal is to learn a policy—a mapping from states to actions—that maximizes the cumulative reward over time. RL is particularly well-suited for control problems where manual programming of behaviors is difficult, such as robot locomotion.
Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is typically defined by a tuple $(S, A, T, R, \gamma)$:
- $S$: A set of possible states of the environment.
- $A$: A set of possible actions the agent can take.
- $T(s' \mid s, a)$: A transition function giving the probability of transitioning to state $s'$ from state $s$ after taking action $a$.
- $R(s, a, s')$: A reward function that gives the immediate reward received after transitioning from state $s$ to state $s'$ via action $a$.
- $\gamma$: A discount factor ($0 \le \gamma \le 1$) that determines the present value of future rewards. A higher $\gamma$ makes future rewards more significant.

The Markov property implies that the future state depends only on the current state and action, not on the sequence of events that preceded it.
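To make the objective concrete, the quantity an MDP agent maximizes is the expected discounted return; in standard notation (not specific to this paper):
$$ \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t, s_{t+1}) \right] $$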
Mixture of Experts (MoE)
A Mixture of Experts (MoE) is a neural network architecture designed for tasks that benefit from specialized sub-networks. It consists of multiple "expert" networks and a "gating network":
- Experts ($f_1, \dots, f_N$): These are individual neural networks, each specialized in handling a specific subset of the input space or a particular type of task.
- Gating Network ($G$): This network takes the input and learns to dynamically determine the weights or probabilities for each expert. For any given input, the gating network decides which expert (or combination of experts) is most appropriate to process that input.

The output of an MoE model is typically a weighted sum of the outputs of the experts, where the weights are provided by the gating network. This modularity allows different parts of the network to specialize, mitigating gradient conflicts in multitask settings and potentially improving efficiency and performance.
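As a minimal illustration of this idea (a generic sketch, not the paper's implementation), the PyTorch snippet below builds a dense MoE head in which a softmax gating MLP weights the outputs of N small expert MLPs; all dimensions are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class DenseMoE(nn.Module):
    """Minimal dense Mixture of Experts: output = sum_i g_i(x) * f_i(x)."""

    def __init__(self, in_dim: int, out_dim: int, num_experts: int = 6, hidden: int = 128):
        super().__init__()
        # One small MLP per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(), nn.Linear(hidden, out_dim))
            for _ in range(num_experts)
        ])
        # Gating network producing one score per expert.
        self.gate = nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(), nn.Linear(hidden, num_experts))

    def forward(self, x: torch.Tensor):
        weights = torch.softmax(self.gate(x), dim=-1)                   # (batch, num_experts)
        expert_out = torch.stack([f(x) for f in self.experts], dim=1)   # (batch, num_experts, out_dim)
        y = (weights.unsqueeze(-1) * expert_out).sum(dim=1)             # weighted sum over experts
        return y, weights

# Example: a 256-dim hidden state mapped to 12 joint targets, as on a quadruped.
moe = DenseMoE(in_dim=256, out_dim=12)
action, gating = moe(torch.randn(4, 256))
```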
Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) is a popular reinforcement learning algorithm that falls under the category of policy gradient methods. PPO aims to find an optimal policy by iteratively updating it. A key feature of PPO is its use of a clipping mechanism or adaptive KL penalty to constrain the policy updates, preventing them from deviating too far from the previous policy. This ensures more stable and efficient training compared to earlier policy gradient methods like REINFORCE. PPO often achieves good performance with a relatively simple implementation, making it a widely used algorithm in various RL applications, including robot locomotion.
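For reference, the clipped surrogate objective that PPO maximizes is commonly written as follows (standard formulation, not a formula reproduced from this paper):
$$ L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} $$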
Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture designed to handle sequential data and overcome the vanishing/exploding gradient problems common in traditional RNNs. LSTMs are particularly effective at learning long-term dependencies in data. They achieve this through a specialized internal structure called a "cell," which includes several "gates" (input gate, forget gate, output gate). These gates regulate the flow of information into and out of the cell, allowing the LSTM to selectively remember or forget past information, making it suitable for tasks requiring memory of historical observations, such as robot control where past sensor readings and actions are crucial.
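A tiny PyTorch sketch of the kind of recurrent processing described here; the observation dimension and history length are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Encode a history of 50 proprioceptive observations (45 features each) into a hidden state.
lstm = nn.LSTM(input_size=45, hidden_size=256, batch_first=True)
obs_history = torch.randn(8, 50, 45)        # (batch, time, features)
outputs, (h_n, c_n) = lstm(obs_history)     # h_n: final hidden state consumed by downstream heads
```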
Sim-to-Real Gap
The sim-to-real gap refers to the discrepancy between the performance of a robot policy trained in a simulated environment and its performance when deployed on a real physical robot. Simulations are approximations of reality, and mismatches can arise from various factors, including inaccurate physics models, sensor noise differences, communication latencies, and unmodeled real-world complexities. Bridging this gap is crucial for practical robot learning and often involves techniques like domain randomization and privileged information.
Domain Randomization
Domain Randomization is a technique used in robot learning to mitigate the sim-to-real gap. Instead of striving for a perfectly realistic simulation, domain randomization deliberately randomizes various physical parameters and environmental properties (e.g., mass, friction, sensor noise, latency) within the simulator. The idea is that if a policy is trained in a sufficiently diverse set of randomized simulations, it will become robust enough to generalize to the uncertainties and variations encountered in the real world. This makes the policy less sensitive to discrepancies between the simulated and real environments.
3.2. Previous Works
Reinforcement Learning for Robot Locomotion
Early works (e.g., [17], [18]) demonstrated RL's ability to learn legged locomotion in simulation, which was later extended to real robots (e.g., [3], [19]). Research has shown success in traversing complex terrains (e.g., [8], [20], [21]), achieving high-speed running (e.g., [22], [23]), and mastering extreme tasks like bipedal walking (e.g., [24], [25]), opening doors (e.g., [1], [4]), or high-speed parkour (e.g., [7], [9], [10], [27]). However, the paper notes that most of these works focus on specific skills and a limited number of terrains, rarely considering comprehensive multitask learning.
Multitask Learning (MTL)
Multitask Learning (MTL) aims to train a single network to perform well across multiple tasks by leveraging shared knowledge [28], [29], [30]. While MTL can offer benefits, it often suffers from negative transfer and gradient conflicts during training [33], [34]. For example, Gradient Surgery for Multi-Task Learning [33] modifies task gradients to avoid destructive interference, and Conflict-Averse Gradient Descent [34] likewise adjusts gradients to reduce conflict.
In Multitask Reinforcement Learning (MTRL), algorithms like Model-Agnostic Meta-Learning (MAML) [36] and Fast reinforcement learning via slow reinforcement learning [37] were developed. MAML learns an initialization for a model's parameters that allows it to quickly adapt to new tasks with only a few gradient steps.
In robotics, MTRL has been widely applied to manipulation tasks (e.g., [16], [35], [39], [40]). For locomotion, ManyQuadrupeds [41] focused on learning a unified policy for different types of quadruped robots. MELA [42] used pretrained expert models to construct locomotion policies but primarily focused on basic skill acquisition and required substantial reward engineering for pretraining. MTAC [43] attempted hierarchical RL for different terrains but was limited to one gait and three simple terrains, without real-robot deployment.
Mixture of Experts (MoE)
The Mixture of Experts (MoE) concept was originally introduced in [13], [44]. It has recently gained significant attention and application in various fields:
- Natural Language Processing (NLP): MoE models like Mixtral of Experts [47] and Deepseek-V3 [48] have shown great success in scaling large language models.
- Computer Vision (CV): MoE has been applied in areas like recommendation systems [49], [50].
- Multi-modal Learning: MoE is used to mitigate data conflicts in instruction fine-tuning for large multi-modal models (e.g., Llava-MoLE [52]).
- Reinforcement Learning and Robotics: DeepMind [14] explored MoE for scaling RL. MELA [39] (also mentioned under MTRL) proposed a Multi-Expert Learning Architecture to generate adaptive skills, but primarily for simple actions. Acquiring diverse skills using curriculum reinforcement learning with mixture of experts [16] also used MoE for diverse skill acquisition.
3.3. Technological Evolution
The field of robot locomotion has evolved from hand-crafted controllers to RL-based policies capable of handling increasingly complex single tasks and terrains. The early successes of RL in simulation (e.g., IsaacGym [17]) paved the way for robust real-world deployments. Concurrently, multitask learning emerged to address the need for versatile agents, but faced challenges like gradient conflicts. The Mixture of Experts architecture, initially proposed in the 1990s, has recently seen a resurgence, especially with the rise of large-scale neural networks, offering a promising solution to multitask learning challenges by allowing specialized components.
This paper's work, MoE-Loco, fits within this timeline by directly addressing the limitations of prior RL-based locomotion and multitask learning approaches. It leverages the modularity and conflict-mitigation capabilities of MoE to achieve a unified policy for highly diverse locomotion tasks and gaits, pushing the boundaries of what a single RL policy can achieve on legged robots. Specifically, it builds upon teacher-student training frameworks and methods for robust locomotion (e.g., [20]) by incorporating MoE to handle the complexity of simultaneous diverse tasks.
3.4. Differentiation Analysis
Compared to the main methods in related work, MoE-Loco presents several core differences and innovations:
- Comprehensive Multitask Scope: Unlike many prior works that focus on specific skills or a limited number of terrains and gaits, MoE-Loco aims for a single policy capable of handling both diverse terrains (bars, pits, stairs, slopes, baffles) and fundamentally different locomotion modes (quadrupedal and bipedal gaits). This is a significantly broader and more challenging multitask learning problem than typically addressed.
- Explicit Gradient Conflict Mitigation via MoE: The core innovation is the explicit integration of the Mixture of Experts architecture into both the actor and critic networks. While MTRL works often acknowledge gradient conflicts (e.g., [33], [34]), MoE-Loco directly uses MoE as a structural solution that routes gradients to specialized experts, thereby inherently reducing conflicts. This contrasts with approaches that use hierarchical RL (like MTAC [43]) or specialized teacher-student frameworks without this explicit modularity.
- Enhanced Training Efficiency and Performance: By mitigating gradient conflicts, MoE-Loco demonstrates improved training efficiency and superior performance across complex multitask benchmarks compared to baselines. The paper's "Ours w/o MoE" baseline, which uses a simple MLP with a similar parameter count, isolates the specific advantage conferred by the MoE structure itself.
- Interpretability and Skill Composability: The MoE framework naturally leads to expert specialization, where different experts learn distinct, human-interpretable behaviors (e.g., balancing, crossing obstacles). This property is leveraged for skill composition (e.g., creating a "dribbling gait" by adjusting gating weights) and task migration (e.g., adapting to a three-footed gait by training a new expert while freezing the others). Standard MLP-based policies, being "black boxes," lack this level of interpretability and explicit skill manipulation.
- Robust Real-World Generalization: The work emphasizes zero-shot deployment on a real Unitree Go2 robot across mixed and single-task terrains, including outdoor environments. This strong real-world validation, coupled with domain randomization and a privileged learning framework, demonstrates the practical robustness and generalization of the MoE-based policy, surpassing baselines like RMA [3], which may struggle with diverse terrains despite its adaptation module.
4. Methodology
4.1. Principles
The core idea of MoE-Loco is to address the limitations of multitask reinforcement learning (MTRL) for robot locomotion, primarily the gradient conflicts that arise when a single neural network tries to learn diverse and often conflicting skills (e.g., quadrupedal bar crossing vs. bipedal stair descending). The theoretical basis relies on the Mixture of Experts (MoE) architecture, which proposes that instead of a single monolithic network, a system composed of multiple specialized "experts" can learn more effectively.
The intuition is as follows: imagine a robot needing to cross a bar, climb stairs, and walk bipedally. Each task requires different control strategies and emphasizes different aspects of the robot's dynamics. If a single neural network tries to learn all of these, the gradients from learning to cross bars might interfere with the gradients from learning to climb stairs, slowing down training or leading to suboptimal performance. By using MoE, the framework can:
- Specialize: Assign different sub-networks (experts) to specialize in distinct behaviors or task subsets. For example, one expert might become proficient at balancing, another at lifting legs for obstacle clearance, and another at bipedal stability.
- Route Dynamically: A gating network acts as a traffic controller, dynamically deciding which expert (or combination of experts) is most relevant for the robot's current state and command. This allows the appropriate experts to be activated for each task, effectively isolating the learning processes and mitigating gradient conflicts.
- Compose: The specialized experts can then be combined or reweighted, offering a more interpretable way to understand and even compose new skills from existing ones.

This modular approach improves training efficiency and performance by allowing each expert to focus on a narrower range of tasks, preventing detrimental interference between heterogeneous task gradients.
4.2. Core Methodology In-depth (Layer by Layer)
The MoE-Loco framework for multitask locomotion learning follows a two-stage training process, leveraging Proximal Policy Optimization (PPO) as the underlying reinforcement learning algorithm. The overall pipeline is depicted in Figure 3.
The following figure (Figure 3 from the original paper) shows an overview of the MoE-Loco pipeline. With the designed MoE architecture, the policy achieves robust multitask locomotion ability on various challenging terrains with multiple gaits.
This figure is a schematic of the information flow and structure of the MoE-Loco framework. It shows the explicit and implicit privileged states feeding the estimator, perception, and LSTM processing. Two Mixture of Experts (MoE) modules are highlighted, the actor MoE and the critic MoE, whose outputs are combined as weighted sums to produce the final action. The whole system relies on a gating network to integrate the different information streams for efficient multitask locomotion learning.
4.2.1. Task Definition
The paper focuses on 9 challenging locomotion tasks for a quadruped robot, encompassing both quadrupedal and bipedal gaits:
- Quadrupedal Gait Tasks: Bar crossing, pit crossing, baffle crawling, stair climbing, and slope walking.
- Bipedal Gait Tasks: Standing up, plane walking, slope walking, and stair descending.

The robot receives velocity commands from a joystick together with a one-hot gait vector that selects quadrupedal or bipedal locomotion. The terrains used in simulation are shown in Figure 2.
The following figure (Figure 2 from the original paper) shows a snapshot of the terrain settings. From left to right: bar, pit, baffle, slope, and stairs.
This figure illustrates the legged robot's locomotion behaviors on different terrains, including walking over edges, crossing obstacles, and climbing stairs. These behaviors reflect the robot's multitask locomotion ability and its capacity to adapt to diverse terrains.
Multitask locomotion is defined as a family of Markov Decision Processes (MDPs), one per task $\tau$: $(S_\tau, A, T_\tau, R_\tau, \gamma)$.
- State Space ($S_{\tau}$): Different locomotion terrains correspond to distinct subsets of the state space.
- Reward Function ($R_{\tau}$): Varies across different gaits, reflecting task-specific objectives.
- Transition Dynamics ($T_{\tau}$): Termination conditions depend on the gait type.

The robot learns a policy $\pi$ that selects actions based on both the terrain and gait to maximize the cumulative reward across tasks:
$$ J(\pi) = \sum_{\tau} \mathbb{E}_{(s_t, a_t) \sim \pi} \left[ \sum_{t} \gamma^t R_\tau(s_t, a_t, s_{t+1}) \right] $$
Here, $J(\pi)$ is the objective representing the expected cumulative discounted reward of policy $\pi$, $\mathbb{E}$ denotes the expectation over states and actions, the outer sum runs over tasks $\tau$, $\gamma$ is the discount factor for future rewards, and $R_\tau(s_t, a_t, s_{t+1})$ is the reward received for task $\tau$ at time step $t$ for taking action $a_t$ in state $s_t$ and transitioning to state $s_{t+1}$. The goal is to find a single policy that generalizes across the various tasks, relying solely on proprioception (internal body sensing) for blind locomotion.
4.2.2. Observations (State Space)
The observation space is composed of four types of information at time :
- Proprioception (): Internal sensor data from the robot itself. This includes:
- Projected gravity (from IMU)
- Base angular velocity (from IMU)
- Joint positions
- Joint velocities
- The last action taken
- Explicit Privileged State (): Information that is available in simulation but not directly on the real robot, providing "privileged" insight. This includes:
- Base linear velocity (used instead of noisy IMU data)
- Ground friction
- Implicit Privileged State (): Additional privileged information, typically relating to contacts, which is encoded into a low-dimensional latent representation to reduce the
sim-to-real gap. This includes:- Contact force of different robot links
- Command (): The desired control input for the robot:
- Velocity command (linear velocities in x, y, and angular velocity around yaw axis).
- A one-hot vector for gait selection (quadrupedal or bipedal).
4.2.3. Action Space
The action space consists of the desired joint positions for all 12 joints of the robot. These are typically target positions for PD controllers that then apply torques to achieve these positions.
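As a rough sketch of how such position targets are typically turned into motor commands (a generic joint-space PD law; the gain values below are illustrative placeholders, not the paper's actual parameters):

```python
import numpy as np

def pd_torques(q_des, q, qd, kp=30.0, kd=0.7):
    """Generic PD law: tau = Kp * (q_des - q) - Kd * qd.

    q_des : desired joint positions output by the policy (12,)
    q, qd : measured joint positions and velocities (12,)
    kp, kd: illustrative gains, not the values used in the paper.
    """
    return kp * (np.asarray(q_des) - np.asarray(q)) - kd * np.asarray(qd)

# Example: hold a nominal standing pose.
tau = pd_torques(q_des=np.zeros(12), q=0.05 * np.ones(12), qd=np.zeros(12))
```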
4.2.4. Reward Design
The robot receives different rewards based on the commanded gait.
-
Quadrupedal Locomotion: The total reward is a weighted sum of task and regularization terms, including rewards for tracking desired linear and angular velocities, penalties for termination, and regularization terms for joint positions, velocities, accelerations, angular-velocity stability, feet in air, hip positions, base height, balance, and joint limits.
-
Bipedal Locomotion: The total reward includes similar tracking and regularization terms, plus specific rewards for standing up and maintaining bipedal balance.
The detailed reward functions are extensive and provided in Table V of the Appendix. The following are the reward functions from Table V of the original paper:
Table V (appendix of the original paper) lists the per-gait reward terms and their weights: tracking rewards for linear and angular velocity, termination and alive terms, and regularization penalties covering joint position, velocity, and acceleration, angular-velocity stability, feet in air, hip positions, base height, balance, orientation, joint and torque limits, leg energy, collisions, and action rate, for both the quadrupedal and bipedal gaits.
4.2.5. Termination Conditions
Termination conditions are gait-dependent:
- Quadrupedal Gait: The episode terminates if the robot's roll or pitch angle exceeds a fixed threshold (indicating it has fallen or is severely unstable).
- Bipedal Gait: The episode terminates if, after the first second, any link other than the rear feet and calves contacts the ground (indicating it has lost bipedal balance).
4.2.6. Training Process (Two-Stage Framework)
The training process follows a two-stage framework, inspired by prior work [20], and uses PPO [53].
Stage 1: Oracle Policy Training
In the first stage, an "Oracle" policy is trained with access to all available observation states: proprioception (), explicit privileged state (), implicit privileged state (), and command ().
- Implicit State Encoding: The
implicit privileged state(contact forces) is first passed through anencoder network(Implicit EncoderMLP, as per Table VI) to convert it into a low-dimensional latent representation, denoted as . This is crucial for mitigating thesim-to-real gapas raw contact forces are hard to obtain on real robots. - Dual-State Representation: This latent representation is then concatenated with the
explicit privileged stateandproprioceptionto form a comprehensive dual-state representation . - Historical Information Integration: A
Long Short-Term Memory (LSTM)module (Actor/Critic RNN) processes the dual-state representation combined with the command to integrate historical information into a hidden state . This allows the policy to account for past events and maintain memory, which is vital for dynamic locomotion tasks. - Mixture of Experts (MoE) Application: The hidden state is then fed into the
Mixture of Experts (MoE)architecture, which is incorporated into both theactor(policy network) andcritic(value network). This is wheregradient conflictsare addressed.- Gating Network: A
gating network(anMLPas per Table VI) takes the hidden state as input and outputs scores for each expert. These scores are then normalized using asoftmaxfunction to produce thegating weightsfor each expert : Here, is the gating weight for expert , is the output of the gating network, andsoftmaxensures these weights sum to 1. - Expert Combination: Each expert (an
Expert HeadMLP) also takes the hidden state as input and computes its proposed action. The final action is a weighted sum of the actions suggested by all experts, where the weights are provided by thegating network: Here, is the final action, is the number of experts, and is the output of expert . The paper uses experts. - Shared Gating Network: The
actor MoEandcritic MoEshare the samegating network. This ensures consistency between how actions are generated and how their values are evaluated.
- Gating Network: A
- Estimator Pretraining: In parallel, an
estimator moduleis pretrained during this stage. The estimator takesproprioceptionandcommandas input and aims to reconstruct theprivileged information. This is done using anL2 loss: Here, is the estimated privileged information from the estimator, and is the actual privileged information. denotes the rollout buffer. This estimator will be crucial in Stage 2. - Overall Optimization Objective: The total loss for
PPOin Stage 1 combines the standardPPOlosses with the reconstruction loss: Where is thePPO surrogate lossfor policy updates, and is thevalue function lossfor updating the critic.
Algorithm 1 provides the pseudocode for Training Stage 1:
Algorithm 1 Training Stage 1
1: for total iterations do
2: Initialize rollout buffer D
3: for num steps do
4: z_t ← ImplicitEncoder(implicit privileged state)
5: x_t ← concat(proprioception, explicit privileged state, z_t)
6: ŝ_t ← Estimator(proprioception, command c_t)
7: h_t ← LSTM(x_t, c_t)
8: ĝ ← softmax(GatingNetwork(h_t))
9: a_t ← Σ_{i=1}^{N} ĝ_i · f_i(h_t)
10: Execute a_t, observe reward r_t and next state
11: Reset the environment if terminated
12: Record the value estimate and action log-probability for PPO
13: Store the transition in D
14: end for
15: Compute PPO losses L_surrogate, L_value
16: Compute reconstruction loss L_recon = ‖ŝ_t − s_t‖²
17: L ← L_surrogate + L_value + L_recon
18: Compute gradients of L
19: Update policy and value network
20: end for
21: return Oracle Policy
- Lines 4-6: These lines process the raw observations to create the full state representation and the estimator's prediction . is the encoded implicit privileged state.
- Lines 7-9: These lines describe the forward pass through the
LSTMandMoEmodules to get the action . - Lines 10-14: Standard
RLrollout collection. - Lines 15-18: Calculation of the combined
PPOandreconstruction loss. - Line 19: Optimization step for the policy and value networks.
Stage 2: Final Policy Adaptation
In the second stage, the policy learns to rely exclusively on proprioception () and command () as observations, mimicking the real-world scenario where privileged information is unavailable.
-
Initialization: The weights of the
estimator,low-level LSTM, andMoE modulesare initialized by copying them from theOracle Policytrained in Stage 1. -
Probability Annealing Selection (PAS): To adapt the policy to the potentially inaccurate estimates from the
estimatorwithout significant performance degradation,Probability Annealing Selection (PAS)[54] is employed.- During training, at each step , the system decides whether to use the actual
privileged information(from simulation) or theestimator's prediction. This selection is based on a probability , where is an annealing factor (e.g., slightly less than 1). - As training progresses ( increases), decreases, meaning the policy gradually relies more on the
estimator's output() and less on the ground-truthprivileged information(). This smooth transition helps the policy adapt robustly. - The state used for
LSTMinput is chosen by: .
- During training, at each step , the system decides whether to use the actual
-
Policy Update: The rest of the training proceeds similarly to Stage 1, using the combined loss . The
estimatorcontinues to be refined through , ensuring its predictions are as accurate as possible.Algorithm 2 provides the pseudocode for Training Stage 2:
Algorithm 2 Training Stage 2
1: Copy parameters from the Oracle Policy
2: for total iterations do
3: Initialize rollout buffer D
4: for num steps do
5: ŝ_t ← Estimator(proprioception, command c_t)
6: s_t ← true privileged state from the simulator
7: p ← annealing probability (decays over training)
8: x_t ← s_t with probability p, otherwise ŝ_t
9: h_t ← LSTM(x_t, c_t)
10: ĝ ← softmax(GatingNetwork(h_t))
11: a_t ← Σ_{i=1}^{N} ĝ_i · f_i(h_t)
12: Execute a_t, observe reward r_t and next state
13: Reset the environment if terminated
14: Record the value estimate and action log-probability for PPO
15: Store the transition in D
16: end for
17: Compute PPO losses L_surrogate, L_value
18: Compute reconstruction loss L_recon = ‖ŝ_t − s_t‖²
19: L ← L_surrogate + L_value + L_recon
20: Compute gradients of L
21: Update policy and value network
22: end for
23: return Final Policy
- Line 1: Initializes the networks from Stage 1.
- Lines 5-6: Similar to Stage 1, preparing the estimator's output and the true privileged state (though and would be internal simulation values, not real-world inputs).
- Lines 7-8: Implements the
Probability Annealing Selection (PAS)mechanism. is the probability of using the true privileged state, which anneals over time. - Lines 9-11: Forward pass through
LSTMandMoEusing the selected state . - The rest is similar to Stage 1, but with the policy now adapting to rely more on the estimator.
4.2.7. Skill Decomposition and Composition
The MoE framework inherently facilitates skill decomposition and composition.
- Decomposition: The
gating network implicitly decomposes complex tasks by assigning different weights to different experts. Through training, each expert naturally specializes in distinct aspects of movement (e.g., balancing, crawling, obstacle crossing). This specialization is observed through quantitative analysis of expert coordination, as shown in the experiments.
- Composition:
MoE-Loco allows for skill composition by manually or dynamically adjusting the gating weights of pretrained experts to synthesize new skills or gaits. The modified gating weights are calculated as $\tilde{g}_i = w[i] \cdot \hat{g}_i$, where $w[i]$ is a manually defined or dynamically adjusted weight for expert $i$. In other words, after the softmax output of the original gating network, an additional factor $w[i]$ scales the contribution of each expert. This enables controlled skill blending and adaptation to novel locomotion patterns without retraining the entire policy.
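A minimal sketch of this manual reweighting (expert indices and mask values are made up for illustration, not taken from the paper):

```python
import torch

def compose_skill(gating_weights: torch.Tensor, manual_scale: torch.Tensor) -> torch.Tensor:
    """Rescale softmax gating weights with a hand-designed per-expert factor w[i]."""
    return gating_weights * manual_scale

# Hypothetical 6-expert example: keep two experts, double one of them, mask the rest.
gates = torch.softmax(torch.randn(6), dim=0)
w = torch.tensor([0.0, 0.0, 1.0, 2.0, 0.0, 0.0])
modified = compose_skill(gates, w)
# The final action would then be: sum_i modified[i] * expert_i(hidden_state)
```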
4.3. Network Architecture Details
The architecture of different modules used in the experiment is detailed in Table VI.
The following are the network architecture details from Table VI of the original paper:
| Network | Type | Dims |
| Actor RNN | LSTM | [256] |
| Critic RNN | LSTM | [256] |
| Estimator Module | LSTM | [256] |
| Estimator Latent Encoder | MLP | [256, 128] |
| Implicit Encoder | MLP | [32, 16] |
| Expert Head | MLP | [256, 128, 128] |
| Standard Head | MLP | [640, 640, 128] |
| Gating Network | MLP | [128] |
- Actor RNN / Critic RNN / Estimator Module: All use
LSTMs with a hidden dimension of 256. TheseLSTMsare crucial for processing sequential observations and integrating historical information. - Estimator Latent Encoder: An
MLPthat maps input features (likely fromproprioceptionandcommand) to a latent space of 128 dimensions via an intermediate 256-dimension layer. - Implicit Encoder: A small
MLPthat encodes theimplicit privileged state(e.g., 32 contact forces) into a 16-dimensional latent representation. This small dimension helps withsim-to-realtransfer. - Expert Head: Each
expertin theMoEis anMLPwith two hidden layers (256 -> 128 -> 128). The final output layer would match the action space dimension (12 joints). - Standard Head: This refers to the
MLPused in theOurs w/o MoEbaseline. It's a largerMLP(640 -> 640 -> 128, then to action space) designed to have a similar total parameter count to theMoEpolicy for fair comparison. - Gating Network: An
MLPwith a single hidden layer (outputting to 128 dimensions), responsible for generating thegating weightsfor the experts. The output dimension would typically match the number of experts (6 in this case) beforesoftmax.
4.4. Domain Randomization
To ensure the policy can safely transfer to real-world environments and overcome the sim-to-real gap, dynamic domain randomization is applied during training. This involves randomizing various physical parameters within the simulation.
The following are the domain randomization parameters from Table VII of the original paper:
| Parameters | Range | Unit |
| Base mass | [1, 3] | kg |
| Mass position of X axis | [-0.2, 0.2] | m |
| Mass position of Y axis | [-0.1, 0.1] | m |
| Mass position of Z axis | [-0.05, 0.05] | m |
| Friction | [0, 2] | - |
| Initial joint positions | [0.5, 1.5] × nominal value | rad |
| Motor strength | [0.9, 1.1] × nominal value | - |
| Proprioception latency | [0.005, 0.045] | s |
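An illustrative sampler for the randomization ranges in Table VII is sketched below; the dictionary keys are invented names, and how the simulator consumes these values is implementation-specific.

```python
import numpy as np

def sample_domain_randomization(rng: np.random.Generator):
    """Draw one set of randomized physical parameters per episode (ranges from Table VII)."""
    return {
        "added_base_mass_kg": rng.uniform(1.0, 3.0),
        "com_offset_m": rng.uniform([-0.2, -0.1, -0.05], [0.2, 0.1, 0.05]),  # x, y, z
        "friction": rng.uniform(0.0, 2.0),
        "init_joint_pos_scale": rng.uniform(0.5, 1.5),   # × nominal joint positions
        "motor_strength_scale": rng.uniform(0.9, 1.1),   # × nominal torque
        "proprio_latency_s": rng.uniform(0.005, 0.045),
    }

params = sample_domain_randomization(np.random.default_rng(0))
```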
Additionally, Gaussian noise is added to the input observations to simulate real-world sensor noise:
The following are the Gaussian noise parameters from Table VIII of the original paper:
| Observation | Gaussian Noise Amplitude | Unit |
| Linear velocity | 0.05 | m/s |
| Angular velocity | 0.2 | rad/s |
| Gravity | 0.05 | m/s2 |
| Joint position | 0.01 | rad |
| Joint velocity | 1.5 | rad/s |
4.5. Training Details
-
Simulation Environment: Training is conducted in
IsaacGym [17], a high-performance GPU-based physics simulator, with 4096 robots simulated concurrently on an NVIDIA RTX 3090 GPU.
Ground Noise: To ensure robust performance on uneven terrains and prevent leg dragging, fractal noise [7] is applied to the ground, with a maximum noise scale .
-
Training Schedule:
- 40,000 iterations for initial plane walking in both bipedal and quadrupedal gaits.
- 80,000 iterations on challenging terrain tasks.
- 10,000 iterations for
Probability Annealing Selection (PAS)to adapt to pureproprioceptioninput (Stage 2).
-
Control Frequency: in both simulation and real world.
-
Low-level Control:
PD controlis used for joint execution with parameters and . -
Number of Experts: The number of experts is set to 6.
-
Reinforcement Learning Algorithm:
PPOis used.The hyperparameters for
PPOare shown in Table IX.
The following are the PPO hyperparameters from Table IX of the original paper:
| Hyperparameter | Value |
| clip min std | 0.05 |
| clip param | 0.2 |
| gamma | 0.99 |
| lam | 0.95 |
| desired kl | 0.01 |
| entropy coef | 0.01 |
| learning rate | 0.001 |
| max grad norm | 1 |
| num mini batch | 4 |
| num steps per env | 24 |
- clip min std: Minimum standard deviation for the action distribution in PPO's actor network.
- clip param: The clipping parameter used in PPO's surrogate loss function to limit policy updates.
- gamma: The discount factor for future rewards.
- lam: The lambda parameter for Generalized Advantage Estimation (GAE), used in PPO for estimating advantages.
- desired kl: The target Kullback-Leibler (KL) divergence used in adaptive-KL PPO to control the policy update step size.
- entropy coef: Coefficient for the entropy regularization term in the PPO loss, encouraging exploration.
- learning rate: The learning rate for the optimizer (e.g., Adam).
- max grad norm: Gradient clipping threshold, used to prevent exploding gradients.
- num mini batch: Number of mini-batches used for PPO updates.
- num steps per env: Number of environment steps collected per environment before performing a PPO update.
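Collected as a configuration object, the Table IX values might look like the sketch below; the field names are illustrative and not taken from the authors' code.

```python
# Illustrative PPO configuration mirroring Table IX (key names are assumptions).
PPO_CONFIG = {
    "clip_min_std": 0.05,
    "clip_param": 0.2,
    "gamma": 0.99,
    "lam": 0.95,
    "desired_kl": 0.01,
    "entropy_coef": 0.01,
    "learning_rate": 1e-3,
    "max_grad_norm": 1.0,
    "num_mini_batches": 4,
    "num_steps_per_env": 24,
}
```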
5. Experimental Setup
5.1. Datasets
The experiments are conducted in a simulated environment using IsaacGym [17], a high-performance GPU-based physics simulator. There isn't a traditional "dataset" in the sense of a fixed collection of samples, but rather an interactive environment where data is generated through robot-environment interaction.
The simulation environment uses a custom benchmark consisting of a 5m x 100m runway with various obstacles evenly distributed along the path. This setup acts as the experimental "dataset" for evaluating diverse locomotion skills.
The following are the benchmark tasks for simulation experiments from Table II of the original paper:
| Obstacle Type | Specification | Gait Mode |
| Bars | 5 bars, height: 0.05m - 0.2m | Quadrupedal |
| Pits | 5 pits, width: 0.05m - 0.2m | Quadrupedal |
| Baffles | 5 baffles, height: 0.3m - 0.22m | Quadrupedal |
| Up Stairs | 3 sets, step height: 5cm - 15cm | Quadrupedal |
| Down Stairs | 3 sets, step height: 5cm - 15cm | Quadrupedal |
| Up Slopes | 3 sets, incline: 10° - 35° | Quadrupedal |
| Down Slopes | 3 sets, incline: 10° - 35° | Quadrupedal |
| Plane | 10m flat surface | Bipedal |
| Up Slopes | 3 sets, incline: 10° - 35° | Bipedal |
| Down Slopes | 3 sets, incline: 10° - 35° | Bipedal |
| Down Stairs | 3 sets, step height: 5cm - 15cm | Bipedal |
For each challenging task, separate tracks of 30 meters in length were also used.
These environments and obstacles were chosen because they represent a diverse set of real-world challenges for legged robots, including different geometries (bars, pits, stairs), continuous height changes (slopes), and varying support surfaces (baffles). They are effective for validating the method's performance across a wide spectrum of locomotion skills and robustness under varied conditions. The use of both quadrupedal and bipedal gaits further increases the diversity of the "dataset."
5.2. Evaluation Metrics
The experiments use three primary metrics to evaluate performance:
-
Success Rate ():
- Conceptual Definition: This metric quantifies the proportion of trials in which the robot successfully completes a given task. It measures the reliability and overall capability of the policy to achieve its objective. A trial is considered successful if the robot reaches within 1 meter of the target point within a maximum allowed time (400 seconds).
- Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} $
- Symbol Explanation:
Number of Successful Trials: The count of trials where the robot reached within 1m of the target point within 400 seconds.Total Number of Trials: The total number of attempts made for a given task.
-
Average Pass Time ():
- Conceptual Definition: This metric measures the average time taken by the robot to complete a task. It indicates the efficiency and speed of the locomotion policy. For failed trials (where the robot does not reach the target or meets termination conditions), the pass time is recorded as the maximum allowed time (400 seconds) to penalize failure.
- Mathematical Formula: $ \text{Average Pass Time} = \frac{\sum_{j=1}^{\text{Total Trials}} \text{Time}_j}{\text{Total Number of Trials}} $
- Symbol Explanation:
- : The time taken to complete trial . If trial fails, seconds.
Total Number of Trials: The total number of attempts made for a given task.
-
Average Travel Distance ():
- Conceptual Definition: This metric measures the average lateral distance traveled by the robots at the end of the evaluation. While the primary goal is often forward locomotion, this metric can indicate stability and control, ensuring the robot doesn't deviate excessively laterally while moving.
- Mathematical Formula: The paper refers to "average lateral travel distance," which might imply distance from the starting line or desired path. Without a specific formula in the paper, a general interpretation for average travel distance is: $ \text{Average Travel Distance} = \frac{\sum_{j=1}^{\text{Total Trials}} \text{Distance Traveled}_j}{\text{Total Number of Trials}} $
- Symbol Explanation:
- : The distance the robot traveled in trial . For failed trials, this would be the distance covered before failure.
Total Number of Trials: The total number of attempts made for a given task.
5.3. Baselines
The paper compares MoE-Loco against two main baselines:
-
Ours w/o MoE [20]: This baseline uses the same overall framework (including the two-stage training,
privileged learning, and PPO) as MoE-Loco, but replaces the Mixture of Experts (MoE) module with a simple Multi-Layer Perceptron (MLP) backbone. Critically, the total number of parameters in this MLP policy is controlled to be approximately the same as in the MoE policy. This baseline is representative because it isolates the contribution of the MoE architecture, demonstrating whether the performance gains are due to MoE itself or to other aspects of the training framework. It directly helps validate the hypothesis that MoE mitigates gradient conflicts.
RMA (Rapid Motor Adaptation) [3]: This is a well-known
RL approach for legged robots that focuses on rapid adaptation to novel terrains. RMA employs a teacher-student training framework with a 1D-CNN (convolutional neural network) as an asynchronous adaptation module and does not use an MoE module. RMA is a strong baseline for robust locomotion because it explicitly addresses generalization to unseen environmental conditions through its adaptation mechanism. Comparing against RMA highlights MoE-Loco's advantages in handling diverse known tasks and gaits, as opposed to only adapting to unforeseen variations within a single task. The paper notes that RMA's original implementation, with only an MLP backbone and CNN encoder, may struggle with multiple challenging terrains.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Multitask Performance
Simulation Experiment
The simulation experiments evaluate the MoE-Loco policy against Ours w/o MoE and RMA across a mixed-task benchmark and individual challenging tasks. denotes quadrupedal gait, and denotes bipedal gait.
The following are the results from Table I of the original paper:
| Method | Success Rate ↑ | ||||||||
| Mix | Bar (q) | Baffle (q) | Stair (q) | Pit (q) | Slope (q) | Walk (b) | Slope (b) | Stair (b) | |
| Ours | 0.879 | 0.886 | 0.924 | 0.684 | 0.902 | 0.956 | 0.932 | 0.961 | 0.964 |
| Ours w/o MoE | 0.571 | 0.848 | 0.264 | 0.568 | 0.698 | 0.988 | 0.826 | 0.504 | 0.453 |
| RMA | 0.000 | 0.871 | 0.058 | 0.017 | 0.017 | 0.437 | 0.000 | 0.000 | 0.000 |
| Average Pass Time (s) ↓ | |||||||||
| Mix | Bar (q) | Baffle (q) | Stair (q) | Pit (q) | Slope (q) | Walk (b) | Slope (b) | Stair (b) | |
| Ours | 230.98 | 102.42 | 87.84 | 179.14 | 91.86 | 76.75 | 92.37 | 86.14 | 86.44 |
| Ours w/o MoE | 315.47 | 125.46 | 318.68 | 214.52 | 161.38 | 65.28 | 156.76 | 236.67 | 253.62 |
| RMA | 400.00 | 107.84 | 385.25 | 395.34 | 394.49 | 272.06 | 400.00 | 400.00 | 400.00 |
| Average Travel Distance (m) ↑ | |||||||||
| Mix | Bar (q) | Baffle (q) | Stair (q) | Pit (q) | Slope (q) | Walk (b) | Slope (b) | Stair (b) | |
| Ours | 89.41 | 28.05 | 28.02 | 20.42 | 27.82 | 27.62 | 27.20 | 27.99 | 28.04 |
| Ours w/o MoE | 57.12 | 27.59 | 17.41 | 22.66 | 25.59 | 28.49 | 22.73 | 26.21 | 14.23 |
| RMA | 13.40 | 27.39 | 11.31 | 3.92 | 12.48 | 21.33 | 2.00 | 2.00 | 2.00 |
Analysis:
- Overall Mixed-Task Performance (Mix column): Ours (MoE-Loco) significantly outperforms both baselines across all three metrics. It achieves a success rate of 0.879, much higher than Ours w/o MoE (0.571) and RMA (0.000). Similarly, its average pass time is lower (230.98 s vs 315.47 s and 400.00 s), and its average travel distance is higher (89.41 m vs 57.12 m and 13.40 m). This strong performance in the mixed-task benchmark highlights MoE-Loco's ability to generalize and robustly handle diverse tasks simultaneously.
- Single-Task Performance: Ours generally excels in single-task evaluations. For instance, in Baffle (q), Ours has a success rate of 0.924 compared to 0.264 for Ours w/o MoE and 0.058 for RMA. This indicates that MoE is critical for challenging tasks that likely induce significant gradient conflicts in a standard MLP. The only exception is Slope (q) (quadrupedal slope walking), where Ours w/o MoE achieves a slightly higher success rate (0.988 vs 0.956) and lower average pass time (65.28 s vs 76.75 s). The authors attribute this to the relative simplicity of quadrupedal slope walking, suggesting that for less complex tasks the specialization of MoE offers little advantage over a well-tuned MLP.
- Impact of MoE: The substantial performance gap between Ours and Ours w/o MoE (especially in Mix, Baffle (q), Walk (b), Slope (b), and Stair (b)) strongly validates the effectiveness of the MoE architecture. Ours w/o MoE struggles significantly in many challenging scenarios, confirming the authors' hypothesis that gradient conflicts severely degrade performance in multitask learning without MoE.
- RMA Performance: RMA performs very poorly, achieving a 0.000 success rate in the mixed task and most bipedal tasks, and very low success rates in several quadrupedal tasks such as Baffle (q) and Stair (q). This is attributed to its original implementation using an MLP backbone and CNN encoder, which is not optimized for the diverse and challenging multitask terrains in this benchmark.
Real World Experiments
The MoE-Loco policy is deployed zero-shot (without any further training) on a real Unitree Go2 quadruped robot. The robot is tested on a mixed terrain course comprising a 20cm bar, 22cm baffle, 15cm stairs, 20cm pits, and 30-degree slopes in quadrupedal gait, followed by a switch to bipedal gait for standing up, walking up/down a 30-degree slope, and descending stairs. Each test is conducted for 20 trials.
The following figure (Figure 4 from the original paper) shows the real-world success rate over multiple terrains and gaits.
This figure is a bar chart of real-world success rates across multiple terrains and gaits. Bars of different colors represent the different methods, ours (blue), ours without MoE (orange), and RMA (green), highlighting how their performance differs on terrains such as bars, baffles, and stairs.
The following figure (Figure 5 from the original paper) shows real-world experiments over multiple terrains and gaits.

Analysis:
- Figure 4 clearly shows that MoE-Loco achieves superior real-world success rates across all tested terrains and gaits. Its performance is notably higher in the challenging Mix Terrain scenario, demonstrating robust generalization to complex, sequential tasks in the real world.
- The Ours w/o MoE baseline performs poorly in Mix Terrain and struggles with Baffle (q), Stair (q), Slope (b), and Stair (b), echoing the simulation results and further emphasizing the necessity of MoE for real-world multitask locomotion.
- RMA again shows very limited real-world capability on these diverse tasks, failing entirely in Mix Terrain and Baffle (q) and exhibiting low success rates on the other complex tasks.
- Figure 5 visually supports these findings, showcasing the Unitree Go2 robot successfully traversing various challenging real-world terrains with the MoE-Loco policy, including diverse outdoor environments. This demonstrates the policy's robustness and adaptability beyond controlled lab settings.
6.1.2. Gradient Conflict Alleviation
To directly investigate whether MoE reduces gradient conflicts, experiments were conducted by resuming training from a checkpoint and running for 500 epochs with 4096 robots. Gradient conflict was measured using two metrics:
-
Cosine Similarity (): The normalized dot product of the gradients of all parameters for different tasks. A smaller (more negative) cosine similarity indicates larger
gradient conflicts(gradients pointing in opposite directions). A larger (more positive) cosine similarity indicates greater alignment.- Formula for Cosine Similarity between two vectors and : $ \text{Cosine Similarity} (A, B) = \frac{A \cdot B}{|A| |B|} $ Where is the dot product, and and are the magnitudes (L2 norms) of vectors and , respectively. The range is .
-
Negative Gradient Ratio (): The ratio of negative entries in the element-wise product of two task gradients. A larger ratio indicates more frequent instances where gradients from different tasks try to update the same parameter in opposing directions, thus signifying larger
gradient conflict.- No specific formula is provided in the paper, but it implies: $ \text{Negative Gradient Ratio} (G_1, G_2) = \frac{\text{Count of indices } j \text{ where } (G_1)_j \cdot (G_2)_j < 0}{\text{Total number of parameters}} $ Where and are the gradient vectors for two different tasks.
The following are the cosine similarity of gradients of different tasks from Table III of the original paper:
| MoE/Standard | Gradient Cosine Similarity ↑ | | | | |
| | Bar (q) | Baffle (q) | Stair (q) | Slope up (b) | Slope down (b) |
| Bar (q) | - | 0.519/0.474 | 0.606/0.592 | 0.278/-0.132 | 0.091/-0.128 |
| Baffle (q) | - | - | 0.369/0.384 | 0.062/-0.091 | 0.061/-0.101 |
| Stair (q) | - | - | - | 0.046/-0.023 | 0.052/0.015 |
| Slope up (b) | - | - | - | - | 0.806/0.709 |
| Slope down (b) | - | - | - | - | - |
The following are the negative entry ratio of MoE and Standard policy on different tasks from Table IV of the original paper:
| MoE/Standard | Gradient Negative Entries (%) ↓ | | | | |
| | Bar (q) | Baffle (q) | Stair (q) | Slope up (b) | Slope down (b) |
| Bar (q) | - | 35.72/37.33 | 32.67/32.62 | 45.50/55.12 | 49.83/50.80 |
| Baffle (q) | - | - | 39.90/38.52 | 49.86/55.91 | 49.91/51.68 |
| Stair (q) | - | - | - | 49.52/50.15 | 50.04/50.34 |
| Slope up (b) | - | - | - | - | 23.17/30.91 |
| Slope down (b) | - | - | - | - | - |
Analysis:
- Cosine Similarity:
  - For task pairs that mix quadrupedal and bipedal gaits (e.g., Bar (q) vs. Slope Up (b) or Slope Down (b)), the MoE policy shows significantly higher cosine similarity (e.g., 0.278 for MoE vs. -0.132 for Standard between Bar (q) and Slope Up (b)). A negative cosine similarity indicates gradients pointing in largely opposite directions, causing strong conflicts; MoE shifts these into positive alignment or at least reduces the opposition.
  - Even between some quadrupedal tasks requiring distinct skills (e.g., Bar (q) vs. Baffle (q)), MoE slightly improves cosine similarity (0.519 vs 0.474), suggesting a reduction in conflict.
- Negative Gradient Ratio:
  - Similar to cosine similarity, the MoE policy consistently achieves a lower negative gradient ratio, especially between bipedal and quadrupedal tasks. For instance, between Bar (q) and Slope Up (b), MoE has 45.50% negative entries compared to 55.12% for the Standard policy, indicating that fewer parameters are pulled in conflicting directions by these diverse tasks.
  - The reduction is also visible, albeit sometimes smaller, between certain quadrupedal tasks (e.g., Baffle (q) vs. Slope Up (b): 49.86% for MoE vs. 55.91% for Standard).
- Conclusion: The results from both metrics demonstrate that the MoE architecture significantly reduces gradient conflict not only between fundamentally different gaits (bipedal vs. quadrupedal) but also between quadrupedal tasks requiring distinct locomotion skills. This substantiates one of the paper's core claims about MoE's benefits in multitask reinforcement learning.
6.1.3. Training Performance
Training performance is evaluated based on mean reward (ability to exploit the environment) and mean episode length (how well the robot learns to stand and walk). The training curves are presented during the plane pretraining stage, where the robot learns both bipedal and quadrupedal plane walking.
The following figure (Figure 6 from the original paper) shows the training curve of our multitask policy in the pretraining stage.

Analysis:
- Figure 6 shows that the MoE policy (pink line) consistently outperforms the Standard policy (green line, Ours w/o MoE) in both mean episode length and mean reward.
- Mean Episode Length: The MoE policy reaches a higher mean episode length much faster and maintains it, indicating that it learns to stand and walk more effectively and for longer durations. This suggests faster and more stable learning of fundamental locomotion skills.
- Mean Reward: The MoE policy also converges to a higher mean reward, demonstrating a better ability to exploit the environment and achieve task objectives.
- Conclusion: These results confirm that MoE improves training efficiency and overall performance even in the initial pretraining phase, likely because it handles the subtle multitask nature of learning two distinct gaits on a plane.
6.1.4. Expert Specialization Analysis
To understand how experts compose skills, both qualitative and quantitative analyses are performed.
The following figure (Figure 7 from the original paper) shows the average weight distribution of different experts in multitask gait training.

Analysis of Figure 7:
Figure 7 plots the mean weight of different experts (Expert 0 to Expert 5) across various tasks. The distinct distributions of gating weights for each task are evident:
- For Walking (bipedal), Expert 0 and Expert 1 are heavily weighted, suggesting they specialize in bipedal locomotion.
- For Crossing Bars (quadrupedal), Expert 2 and Expert 3 show higher activation, indicating their specialization in obstacle negotiation.
- For Rear Leg Crossing Bars (quadrupedal, a more specific variant of bar crossing), Expert 4 and Expert 5 are more prominent.

This diverse activation pattern across tasks clearly demonstrates the expertise and differentiation that emerge naturally among the MoE experts: each expert learns to contribute optimally to specific locomotion behaviors (a minimal sketch of how such per-task mean gating weights can be computed follows this list).
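To ground this discussion, here is a minimal sketch of a soft (dense) Mixture-of-Experts policy head of the kind discussed above, together with the Figure-7-style statistic of averaging gating weights over one task's observations. The layer sizes, observation and action dimensions, and ELU activations are illustrative assumptions, not the paper's exact architecture; a soft gate where every expert contributes is assumed because the skill-composition experiments manipulate individual gating weights directly.

```python
import torch
import torch.nn as nn

class MoEPolicy(nn.Module):
    """Minimal soft Mixture-of-Experts head: every expert is evaluated and the
    gating network blends their outputs with softmax weights."""

    def __init__(self, obs_dim: int, act_dim: int, n_experts: int = 6, hidden: int = 256):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, act_dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, n_experts))

    def forward(self, obs: torch.Tensor):
        weights = torch.softmax(self.gate(obs), dim=-1)               # (B, n_experts)
        outputs = torch.stack([e(obs) for e in self.experts], dim=1)  # (B, n_experts, act_dim)
        action = (weights.unsqueeze(-1) * outputs).sum(dim=1)         # gating-weighted blend
        return action, weights

# Figure-7-style statistic: mean gating weight per expert over one task's batch.
policy = MoEPolicy(obs_dim=48, act_dim=12)
task_obs = torch.randn(1024, 48)   # stand-in observations from a single task
_, w = policy(task_obs)
print("mean expert weights for this task:", [round(x, 3) for x in w.mean(dim=0).tolist()])
```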
The following figure (Figure 8 from the original paper) shows the t-SNE result of gating network output on different terrains and gaits.

Analysis of Figure 8 (t-SNE):
The t-SNE (t-Distributed Stochastic Neighbor Embedding) plot visualizes the high-dimensional outputs of the gating network (expert weights) in a 2D space.
- Gait Clustering: The plot shows a clear separation between bipedal (Bip) and quadrupedal (Quad) tasks, forming distinct clusters. This indicates that the gating network effectively distinguishes between these fundamentally different locomotion modes and activates different sets of experts for each.
- Task Clustering within Gaits:
  - Within the quadrupedal cluster, tasks like Slope (q) and Pit (q) (which often share similar underlying control strategies with basic plane walking) cluster closely together.
  - In contrast, Bar (q), Baffle (q), and Stair (q) (which require more distinct and complex gaits such as high leg lifts or crawling) lie further apart, forming their own sub-clusters.
- Conclusion: The t-SNE results visually confirm the expert specialization. The gating network learns to produce distinct activation patterns (expert weights) for different tasks, effectively routing diverse behaviors to specialized experts and reflecting the underlying kinematic and dynamic requirements of each task (a short sketch of such a projection is given below).
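For completeness, here is a minimal sketch of how such a projection can be produced from logged gating outputs using scikit-learn's TSNE; the random arrays stand in for the expert-weight vectors and task labels that would be collected during policy rollouts.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: one 6-dim gating-weight vector per timestep, logged while
# rolling the policy out on several tasks, plus an integer task id per sample.
rng = np.random.default_rng(0)
gating_outputs = rng.random((500, 6))        # replace with logged expert weights
task_labels = rng.integers(0, 7, size=500)   # e.g. Bar (q), Slope Up (b), ...

# Project the 6-dim gating vectors to 2-D; if the gating network has learned
# task-specific routing, points sharing a task id should form clusters.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(gating_outputs)
print(embedding.shape)  # (500, 2); scatter-plot it colored by task_labels to mimic Figure 8
```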
6.1.5. Skill Composition
Leveraging the identified expert specialization, the authors demonstrate the ability to compose new skills.
The following figure (Figure 9 from the original paper) shows a manually designed new dribbling gait by selecting two experts.

Analysis of Figure 9:
- The paper identifies experts specializing in balancing (which helps lift the robot's body but limits agility) and crossing (lifting a front leg for obstacle traversal).
- By manually selecting these two experts, doubling the gating weight of the crossing expert, and masking out all other experts, a novel dribbling gait is synthesized zero-shot (without further training).
- Figure 9 illustrates the robot executing this new gait, periodically using one front leg to "kick" a ball while maintaining balance and walking effectively.
- Significance: This demonstrates the interpretability and composability of MoE-based policies. The specialized experts are not just abstract network components but represent identifiable, reusable locomotion skills, allowing flexible construction of new behaviors, a significant advantage over black-box neural networks (see the sketch after this list).
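The manual gating manipulation can be sketched as follows, assuming the soft-gated policy from the earlier sketch and hypothetical indices for the balancing and crossing experts (the real indices would be read off plots like Figure 7); this illustrates the idea rather than reproducing the authors' implementation.

```python
import torch

def compose_skill(weights: torch.Tensor, keep: dict) -> torch.Tensor:
    """Mask out all experts except those listed in `keep`, scaling the kept ones.

    `keep` maps expert index -> multiplier, e.g. {0: 1.0, 3: 2.0} leaves the
    (hypothetical) balancing expert unchanged and doubles the crossing expert.
    """
    mask = torch.zeros_like(weights)
    for idx, scale in keep.items():
        mask[..., idx] = scale
    return weights * mask

# Stand-in gating outputs for a small batch; in practice these come from the gate.
w = torch.softmax(torch.randn(4, 6), dim=-1)
w_dribble = compose_skill(w, keep={0: 1.0, 3: 2.0})   # hypothetical expert indices
# Feeding w_dribble (instead of the gate's own output) into the expert blend
# yields the manually composed, dribbling-style behavior zero-shot.
```

Whether the modified weights are renormalized before blending is a design choice the description leaves open; the example keeps them unnormalized, matching the account of simply doubling one weight and masking the rest.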
6.1.6. Additional Experiment (Adaptation Learning)
An adaptation learning experiment further showcases the recomposability of pretrained experts for new tasks.
The following figure (Figure 10 from the original paper) shows MoE-Loco can quickly adapt to a three-footed gait by training a new expert.

Analysis of Figure 10:
- Task: The robot is tasked with learning a three-footed gait, a novel behavior that requires keeping one leg consistently lifted while maintaining stability and locomotion.
- Method: A newly initialized expert is introduced into the existing MoE framework. The parameters of the original experts are frozen, and only the new expert and the gating network are updated.
- Result: Figure 10 shows the robot successfully walking on both flat ground and slopes using only three feet.
- Significance: This highlights MoE-Loco's efficient adaptability. The newly added expert only needs to learn the specific skill of lifting one leg; it can then leverage the existing walking and slope-traversal capabilities already encoded in the frozen original experts, guided by the updated gating network. This demonstrates a powerful form of transfer learning and modular adaptation, where only minimal parts of the network need to be trained for novel tasks (a minimal sketch of this freeze-and-extend procedure follows this list).
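As an illustration of the freeze-and-extend idea, the helper below reuses the MoEPolicy sketch from the Figure 7 discussion: it freezes the pretrained experts, appends a freshly initialized expert, and rebuilds the gating head with one extra output. Re-initializing the gate rather than expanding the pretrained one is a simplification made here for brevity, not necessarily what the authors do.

```python
import torch.nn as nn

def add_and_freeze_expert(policy, obs_dim: int, act_dim: int, hidden: int = 256):
    """Freeze the pretrained experts, append a fresh expert, and rebuild the
    gating head with one extra output. Only the new expert and the (re-created)
    gating network remain trainable."""
    for expert in policy.experts:                 # keep pretrained skills intact
        for p in expert.parameters():
            p.requires_grad = False
    new_expert = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                               nn.Linear(hidden, act_dim))
    policy.experts.append(new_expert)             # expert that will learn to lift one leg
    n_experts = len(policy.experts)
    policy.gate = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                                nn.Linear(hidden, n_experts))
    return policy

# Usage with the earlier MoEPolicy sketch:
# policy = add_and_freeze_expert(MoEPolicy(obs_dim=48, act_dim=12), obs_dim=48, act_dim=12)
# then continue RL training; only the new expert and the gate receive gradients.
```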
6.2. Data Presentation (Tables)
All tables provided in the original paper (Table I, II, III, IV, V, VI, VII, VIII, IX) have been transcribed completely and accurately in the relevant sections above, using HTML for tables with merged cells (Tables I, III, IV, V) and Markdown for simple grid structures (Tables II, VI, VII, VIII, IX).
6.3. Ablation Studies / Parameter Analysis
The paper primarily conducts an ablation study by comparing Ours (with MoE) against Ours w/o MoE (standard MLP backbone with similar parameters). This is a direct ablation study for the MoE component. The results (Table I, Figure 4, Figure 6, Table III, Table IV) consistently show that MoE is crucial for performance, training efficiency, and gradient conflict alleviation, especially in complex multitask scenarios.
The effect of key hyperparameters, such as the number of experts (N_exp) or the PPO parameters, is not explicitly explored in separate ablation studies in the main text, but the chosen values are stated (e.g., N_exp = 6). The Probability Annealing Selection (PAS) mechanism used in Stage 2 training is a form of parameter analysis (via its annealing factor), which helps the policy adapt to proprioception gracefully. The robustness of this mechanism is implicitly validated by the successful real-world deployment.
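Since the PAS details are not reproduced here, the snippet below is only a loose sketch of what an annealed selection probability could look like: with probability p the policy is fed the privileged-information latent, otherwise the proprioception-based estimate, and p decays by a fixed annealing factor each iteration. The schedule, decay value, and variable names are assumptions for illustration, not the paper's formulation.

```python
import random

def pas_schedule(p_init: float = 1.0, anneal: float = 0.999, iters: int = 5000):
    """Yield an annealed selection probability, one value per training iteration."""
    p = p_init
    for _ in range(iters):
        yield p
        p *= anneal   # decay toward relying on proprioception only

# Hypothetical usage: at each iteration, sample which latent the policy receives.
for it, p in enumerate(pas_schedule(iters=3)):
    use_privileged = random.random() < p   # True -> privileged latent, False -> estimated latent
    print(it, round(p, 3), use_privileged)
```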
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces MoE-Loco, a novel Mixture of Experts (MoE) framework, for multitask locomotion in legged robots. This framework enables a single policy to command a quadruped robot to navigate a broad spectrum of challenging terrains (bars, pits, stairs, slopes, baffles) and switch seamlessly between quadrupedal and bipedal gaits. A key contribution is the demonstration that MoE effectively mitigates gradient conflicts inherent in multitask reinforcement learning, leading to superior training efficiency and overall performance both in simulation and zero-shot real-world deployment. Furthermore, the work highlights the emergent specialization of experts within the MoE architecture, offering interpretability and enabling novel capabilities like skill composition and efficient adaptation to new tasks (e.g., three-footed gait) by only updating the gating network.
7.2. Limitations & Future Work
The authors point out the following limitation and suggest future work:
- Current Limitation: The current MoE-Loco approach is demonstrated primarily with blind locomotion, using only proprioception and privileged information in simulation.
- Future Work: The authors propose extending the approach to incorporate sensory perception such as camera and LiDAR, which would enhance adaptability in even more complex and dynamic tasks where direct environmental observation is crucial.
7.3. Personal Insights & Critique
MoE-Loco presents a compelling solution to a critical challenge in robotics: creating truly versatile and general-purpose locomotion policies. The integration of Mixture of Experts is a theoretically sound and empirically validated approach for multitask learning, and its application to legged locomotion is particularly impactful given the diversity of movements required.
Inspirations and Transferability:
- Modular Learning: The MoE framework's ability to decompose and compose skills is highly inspiring. This modularity could be a blueprint for more interpretable and adaptable RL agents in other complex robotic tasks, such as manipulation (where different experts could handle grasping, pushing, or fine motor control) or human-robot interaction (where experts could specialize in different social cues or interaction types).
- Efficient Adaptation: The three-footed gait experiment, where the pretrained experts are frozen and only the newly added components are updated, showcases a powerful transfer-learning paradigm. This could significantly reduce training time and data requirements when deploying robots in novel situations or teaching them new, related skills, making RL more practical for real-world deployment.
- Beyond Locomotion: The gradient-conflict alleviation mechanism generalizes to any multitask RL problem where task objectives might conflict. It could be applied to multi-agent systems, multi-modal control, or any scenario demanding diverse outputs from a single RL agent.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Expert Number Selection: The paper selects N_exp = 6. While this number worked well, the methodology does not extensively discuss how it was chosen or how sensitive performance is to it. A detailed ablation on the optimal number of experts for different task complexities would strengthen the argument: too few experts might leave lingering gradient conflicts, while too many could introduce computational overhead and redundancy.
- Dynamic Gating for Skill Composition: While manual skill composition is demonstrated, developing a network that dynamically adjusts the gating weights w[i] to synthesize new skills (instead of relying on manual selection) could be a powerful next step, moving beyond human intuition toward automated skill discovery.
- Complexity of Reward Design: The reward functions (Appendix V-A) are quite detailed and extensive, especially for bipedal locomotion. While necessary for high-performance RL, such intricate reward engineering can be a bottleneck. Investigating whether MoE might simplify reward design or enable learning from simpler, sparser rewards would be valuable.
- Computational Cost of MoE: While MoE mitigates gradient conflicts, it introduces additional computational cost, especially if all experts are dense networks activated for every input (though many MoE variants use sparse activation). The paper focuses on parameter-count equivalence to the MLP baseline, but FLOPs (floating-point operations) per inference step could still be higher. A deeper analysis of the trade-off between performance gain and computational overhead would be insightful, especially for resource-constrained onboard robot computing.
- Generalization to Unseen Tasks: The paper demonstrates skill composition for related tasks (dribbling, three-footed gait). It would be interesting to see how well the learned experts can be leveraged for entirely novel locomotion tasks outside the training distribution, pushing the boundaries of true generalization and reusability.