
MoE-Loco: Mixture of Experts for Multitask Locomotion

Published: 2025-03-11
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents MoE-Loco, a Mixture of Experts framework for legged robots, enabling a single policy to navigate diverse terrains while mitigating gradient conflicts in multitask reinforcement learning, enhancing training efficiency and performance.

Abstract

We present MoE-Loco, a Mixture of Experts (MoE) framework for multitask locomotion for legged robots. Our method enables a single policy to handle diverse terrains, including bars, pits, stairs, slopes, and baffles, while supporting quadrupedal and bipedal gaits. Using MoE, we mitigate the gradient conflicts that typically arise in multitask reinforcement learning, improving both training efficiency and performance. Our experiments demonstrate that different experts naturally specialize in distinct locomotion behaviors, which can be leveraged for task migration and skill composition. We further validate our approach in both simulation and real-world deployment, showcasing its robustness and adaptability.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is "MoE-Loco: Mixture of Experts for Multitask Locomotion". It focuses on developing a single policy for legged robots to perform diverse locomotion tasks across various terrains and gaits using a Mixture of Experts framework.

1.2. Authors

The authors are:

  • Runhan Huang*¹,²

  • Shaoting Zhu*¹,²

  • Yilun Du³

  • Hang Zhao+¹,²

    Affiliations: ¹ Tsinghua University, Beijing, China; ² Institute for AI Industry Research (AIR), Tsinghua University; ³ Massachusetts Institute of Technology (MIT)

(* indicates equal contribution, + indicates corresponding author)

1.3. Journal/Conference

The paper is published as a preprint on arXiv. While not yet peer-reviewed and published in a specific journal or conference, arXiv is a reputable platform for sharing research, particularly in machine learning, robotics, and artificial intelligence, allowing early dissemination and feedback. The authors' affiliations with prominent institutions like Tsinghua University and MIT suggest a high standard of research.

1.4. Publication Year

The paper was published on 2025-03-11.

1.5. Abstract

This paper introduces MoE-Loco, a Mixture of Experts (MoE) framework designed for multitask locomotion in legged robots. The core objective is to enable a single robotic policy to proficiently navigate diverse terrains, such as bars, pits, stairs, slopes, and baffles, while also supporting both quadrupedal and bipedal gaits. By employing the MoE architecture, the method effectively mitigates gradient conflicts—a common issue in multitask reinforcement learning (RL)—thereby enhancing both training efficiency and overall performance. The authors demonstrate through experiments that different experts within the MoE framework naturally specialize in distinct locomotion behaviors. This specialization facilitates task migration and skill composition, allowing for adaptable and reusable locomotion strategies. The MoE-Loco approach is validated through extensive experiments in both simulation and real-world deployments, showcasing its robustness and adaptability across complex environments.

Official Source Link: https://arxiv.org/abs/2503.08564 PDF Link: https://arxiv.org/pdf/2503.08564v2.pdf Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the significant challenge of training a single, unified locomotion policy for legged robots that can generalize across multiple diverse tasks, terrains, and locomotion modes (gaits).

This problem is important because real-world robot applications demand versatility. Robots are frequently required to traverse varied environments (e.g., uneven ground, stairs, obstacles) and perform different types of movement (e.g., walking, climbing, balancing). While single-task reinforcement learning (RL) has achieved remarkable success in specific locomotion behaviors, combining these diverse skills into a single, cohesive policy remains elusive.

Specific challenges and gaps in prior research include:

  • Gradient Conflicts in Multitask RL (MTRL): When a single neural network attempts to learn multiple tasks simultaneously, the gradients from different task objectives can pull the network's weights in conflicting directions. This gradient conflict leads to unstable training, reduced efficiency, and often inferior performance compared to specialized single-task policies.

  • Complexity of Diverse Gaits and Terrains: Training a policy to handle both fundamentally different gaits (e.g., quadrupedal vs. bipedal) and a wide array of challenging terrains (e.g., bars, pits, stairs) within a single model exacerbates gradient conflicts and can lead to model divergence or poor generalization.

  • Lack of Interpretability and Reusability: Most RL policies are "black boxes," making it difficult to understand how different skills are learned or to easily compose existing skills for novel tasks.

    The paper's entry point and innovative idea lie in integrating the Mixture of Experts (MoE) framework into multitask reinforcement learning for legged locomotion. MoE provides a modular network structure that naturally addresses gradient conflicts by allowing different "experts" (sub-networks) to specialize in distinct behaviors, with a "gating network" dynamically selecting which experts to activate for a given task.

2.2. Main Contributions / Findings

The primary contributions of MoE-Loco are:

  • Development of a Unified Multitask Locomotion Policy: The paper successfully trains and deploys a single neural network policy that enables a quadruped robot to traverse a wide range of challenging terrains (bars, pits, stairs, slopes, baffles) and switch between fundamentally different locomotion modes (quadrupedal and bipedal gaits). This is a significant step towards more versatile and autonomous robots.

  • Integration of MoE for Gradient Conflict Mitigation: The authors effectively integrate the Mixture of Experts (MoE) architecture into both the actor and critic networks of their RL policy. This integration is shown to alleviate gradient conflicts—a major hurdle in multitask RL—leading to improved training efficiency and superior overall model performance compared to standard MLP-based policies.

  • Demonstration of Expert Specialization and Composability: Through qualitative and quantitative analysis, the paper uncovers that individual experts within the MoE model naturally specialize in distinct locomotion behaviors (e.g., balancing, crawling, obstacle crossing). This expert specialization provides a more interpretable policy and opens up new possibilities for task migration (adapting to new tasks) and skill composition (combining experts with adjustable weights to synthesize new, complex skills, like a "dribbling gait," without additional training).

  • Validation in Simulation and Real-World: The proposed MoE-Loco framework is rigorously validated in both high-fidelity simulation environments (IsaacGym) and through zero-shot deployment on a real Unitree Go2 quadruped robot. The real-world experiments, including outdoor environments, demonstrate the policy's robustness, adaptability, and generalization capabilities.

    The key conclusions and findings are that MoE is a highly effective architecture for multitask locomotion learning in RL. It not only resolves the practical issue of gradient conflicts but also introduces interpretability and composability to learned robot skills, addressing the challenge of creating versatile and adaptable robots.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Reinforcement Learning (RL)

Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent observes the state of the environment, takes an action, and in response, the environment transitions to a new state and provides a reward (or penalty). The agent's goal is to learn a policy—a mapping from states to actions—that maximizes the cumulative reward over time. RL is particularly well-suited for control problems where manual programming of behaviors is difficult, such as robot locomotion.

Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is typically defined by a tuple $\langle S, A, T, R, \gamma \rangle$:

  • $S$: A set of possible states of the environment.

  • $A$: A set of possible actions the agent can take.

  • $T$: A transition function $T(s' \mid s, a)$ giving the probability of transitioning to state $s'$ from state $s$ after taking action $a$.

  • $R$: A reward function $R(s, a, s')$ that gives the immediate reward received after transitioning from state $s$ to state $s'$ via action $a$.

  • $\gamma$: A discount factor ($0 \le \gamma < 1$) that determines the present value of future rewards. A higher $\gamma$ makes future rewards more significant.

    The Markov property implies that the future state depends only on the current state and action, not on the sequence of events that preceded it.

Mixture of Experts (MoE)

A Mixture of Experts (MoE) is a neural network architecture designed for tasks that benefit from specialized sub-networks. It consists of multiple "expert" networks and a "gating network":

  • Experts ($f_i$): These are individual neural networks, each specialized in handling a specific subset of the input space or a particular type of task.

  • Gating Network ($g$): This network takes the input and learns to dynamically determine the weights or probabilities for each expert. For any given input, the gating network decides which expert (or combination of experts) is most appropriate to process that input.

    The output of an MoE model is typically a weighted sum of the outputs of the experts, where the weights are provided by the gating network. This modularity allows different parts of the network to specialize, mitigating gradient conflicts in multitask settings and potentially improving efficiency and performance.
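To make this concrete, below is a minimal sketch of a soft (dense-gated) Mixture of Experts layer in PyTorch. It is an illustration of the general mechanism described above, not the paper's implementation; layer sizes and class/variable names are assumptions.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal soft Mixture of Experts: output = sum_i g_i(x) * f_i(x)."""

    def __init__(self, in_dim: int, out_dim: int, num_experts: int = 6, hidden: int = 128):
        super().__init__()
        # Each expert is a small MLP; sizes here are illustrative.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(), nn.Linear(hidden, out_dim))
            for _ in range(num_experts)
        ])
        # The gating network scores every expert from the same input.
        self.gate = nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(), nn.Linear(hidden, num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)                 # (B, N)
        outputs = torch.stack([f(x) for f in self.experts], dim=1)    # (B, N, out_dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)           # (B, out_dim)

# Example: a 256-d hidden state mapped to a 12-d action with 6 experts.
moe = SoftMoE(in_dim=256, out_dim=12, num_experts=6)
print(moe(torch.randn(4, 256)).shape)  # torch.Size([4, 12])
```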

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a popular reinforcement learning algorithm that falls under the category of policy gradient methods. PPO aims to find an optimal policy by iteratively updating it. A key feature of PPO is its use of a clipping mechanism or adaptive KL penalty to constrain the policy updates, preventing them from deviating too far from the previous policy. This ensures more stable and efficient training compared to earlier policy gradient methods like REINFORCE. PPO often achieves good performance with a relatively simple implementation, making it a widely used algorithm in various RL applications, including robot locomotion.
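As an illustration of the clipping mechanism, here is a hedged sketch of PPO's clipped surrogate loss in PyTorch; the function and argument names are assumptions rather than the paper's code.

```python
import torch

def ppo_clipped_surrogate(log_prob_new: torch.Tensor,
                          log_prob_old: torch.Tensor,
                          advantages: torch.Tensor,
                          clip_param: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate objective, returned as a loss (negated for minimization)."""
    ratio = torch.exp(log_prob_new - log_prob_old)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    # Taking the minimum keeps the update conservative when the ratio leaves the clip range.
    return -torch.min(unclipped, clipped).mean()
```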

Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture designed to handle sequential data and overcome the vanishing/exploding gradient problems common in traditional RNNs. LSTMs are particularly effective at learning long-term dependencies in data. They achieve this through a specialized internal structure called a "cell," which includes several "gates" (input gate, forget gate, output gate). These gates regulate the flow of information into and out of the cell, allowing the LSTM to selectively remember or forget past information, making it suitable for tasks requiring memory of historical observations, such as robot control where past sensor readings and actions are crucial.

Sim-to-Real Gap

The sim-to-real gap refers to the discrepancy between the performance of a robot policy trained in a simulated environment and its performance when deployed on a real physical robot. Simulations are approximations of reality, and mismatches can arise from various factors, including inaccurate physics models, sensor noise differences, communication latencies, and unmodeled real-world complexities. Bridging this gap is crucial for practical robot learning and often involves techniques like domain randomization and privileged information.

Domain Randomization

Domain Randomization is a technique used in robot learning to mitigate the sim-to-real gap. Instead of striving for a perfectly realistic simulation, domain randomization deliberately randomizes various physical parameters and environmental properties (e.g., mass, friction, sensor noise, latency) within the simulator. The idea is that if a policy is trained in a sufficiently diverse set of randomized simulations, it will become robust enough to generalize to the uncertainties and variations encountered in the real world. This makes the policy less sensitive to discrepancies between the simulated and real environments.

3.2. Previous Works

Reinforcement Learning for Robot Locomotion

Early works (e.g., [17], [18]) demonstrated RL's ability to learn legged locomotion in simulation, which was later extended to real robots (e.g., [3], [19]). Research has shown success in traversing complex terrains (e.g., [8], [20], [21]), achieving high-speed running (e.g., [22], [23]), and mastering extreme tasks like bipedal walking (e.g., [24], [25]), opening doors (e.g., [1], [4]), or high-speed parkour (e.g., [7], [9], [10], [27]). However, the paper notes that most of these works focus on specific skills and a limited number of terrains, rarely considering comprehensive multitask learning.

Multitask Learning (MTL)

Multitask Learning (MTL) aims to train a single network to perform across multiple tasks, leveraging shared knowledge [28], [29], [30]. While MTL can offer benefits, it often suffers from negative transfer caused by gradient conflicts during training [33], [34]. For example, Gradient Surgery for Multi-Task Learning [33] proposed methods to modify gradients to avoid destructive interference. Conflict-Averse Gradient Descent [34] also addresses this by adjusting gradients. In Multitask Reinforcement Learning (MTRL), algorithms like Model-Agnostic Meta-Learning (MAML) [36] and Fast reinforcement learning via slow reinforcement learning [37] were developed. MAML learns an initialization for a model's parameters that allows it to quickly adapt to new tasks with only a few gradient steps. In robotics, MTRL has been widely applied to manipulation tasks (e.g., [16], [35], [39], [40]). For locomotion, ManyQuadrupeds [41] focused on learning a unified policy for different types of quadruped robots. MELA [42] used pretrained expert models to construct locomotion policies but primarily focused on basic skill acquisition and required substantial reward engineering for pretraining. MTAC [43] attempted hierarchical RL for different terrains but was limited to one gait and three simple terrains, without real-robot deployment.

Mixture of Experts (MoE)

The Mixture of Experts (MoE) concept was originally introduced in [13], [44]. It has recently gained significant attention and application in various fields:

  • Natural Language Processing (NLP): MoE models like Mixtral of Experts [47] and Deepseek-V3 [48] have shown great success in scaling large language models.
  • Computer Vision (CV): MoE has been applied in areas like recommendation systems [49], [50].
  • Multi-modal Learning: MoE is used to mitigate data conflicts in instruction fine-tuning for large multi-modal models (e.g., Llava-MoLE [52]).
  • Reinforcement Learning and Robotics: DeepMind [14] explored MoE for scaling RL. MELA [39] (which is also mentioned under MTRL) proposed a Multi-Expert Learning Architecture to generate adaptive skills, but primarily for simple actions. Acquiring diverse skills using curriculum reinforcement learning with mixture of experts [16] also used MoE for diverse skill acquisition.

3.3. Technological Evolution

The field of robot locomotion has evolved from hand-crafted controllers to RL-based policies capable of handling increasingly complex single tasks and terrains. The early successes of RL in simulation (e.g., IsaacGym [17]) paved the way for robust real-world deployments. Concurrently, multitask learning emerged to address the need for versatile agents, but faced challenges like gradient conflicts. The Mixture of Experts architecture, initially proposed in the 1990s, has recently seen a resurgence, especially with the rise of large-scale neural networks, offering a promising solution to multitask learning challenges by allowing specialized components.

This paper's work, MoE-Loco, fits within this timeline by directly addressing the limitations of prior RL-based locomotion and multitask learning approaches. It leverages the modularity and conflict-mitigation capabilities of MoE to achieve a unified policy for highly diverse locomotion tasks and gaits, pushing the boundaries of what a single RL policy can achieve on legged robots. Specifically, it builds upon teacher-student training frameworks and methods for robust locomotion (e.g., [20]) by incorporating MoE to handle the complexity of simultaneous diverse tasks.

3.4. Differentiation Analysis

Compared to the main methods in related work, MoE-Loco presents several core differences and innovations:

  1. Comprehensive Multitask Scope: Unlike many prior works that focus on specific skills or a limited number of terrains and gaits, MoE-Loco aims for a single policy capable of handling both diverse terrains (bars, pits, stairs, slopes, baffles) and fundamentally different locomotion modes (quadrupedal and bipedal gaits). This is a significantly broader and more challenging multitask learning problem than typically addressed.
  2. Explicit Gradient Conflict Mitigation via MoE: The core innovation is the explicit integration of the Mixture of Experts architecture into both the actor and critic networks. While MTRL works often acknowledge gradient conflicts (e.g., [33], [34]), MoE-Loco directly uses MoE as a structural solution to route gradients to specialized experts, thereby inherently reducing conflicts. This contrasts with approaches that might use hierarchical RL (like MTAC [43]) or specialized teacher-student frameworks without this explicit modularity.
  3. Enhanced Training Efficiency and Performance: By mitigating gradient conflicts, MoE-Loco demonstrates improved training efficiency and superior performance across complex multitask benchmarks compared to baselines. The paper's Ours w/o MoE baseline, which uses a simple MLP with similar parameter count, highlights the specific advantage conferred by the MoE structure itself.
  4. Interpretability and Skill Composability: The MoE framework naturally leads to expert specialization, where different experts learn distinct, human-interpretable behaviors (e.g., balancing, crossing obstacles). This property is leveraged for skill composition (e.g., creating a "dribbling gait" by adjusting gating weights) and task migration (e.g., adapting to a three-footed gait by training a new expert while freezing others). Standard MLP-based policies, being "black boxes," lack this level of interpretability and explicit skill manipulation.
  5. Robust Real-World Generalization: The work emphasizes zero-shot deployment on a real Unitree Go2 robot across mixed and single-task terrains, including outdoor environments. This strong real-world validation, coupled with domain randomization and a privileged learning framework, demonstrates the practical robustness and generalization of the MoE-based policy, surpassing baselines like RMA [3] which may struggle with diverse terrains despite its adaptation module.

4. Methodology

4.1. Principles

The core idea of MoE-Loco is to address the limitations of multitask reinforcement learning (MTRL) for robot locomotion, primarily the gradient conflicts that arise when a single neural network tries to learn diverse and often conflicting skills (e.g., quadrupedal bar crossing vs. bipedal stair descending). The theoretical basis relies on the Mixture of Experts (MoE) architecture, which proposes that instead of a single monolithic network, a system composed of multiple specialized "experts" can learn more effectively.

The intuition is as follows: Imagine a robot needing to cross a bar, climb stairs, and walk bipedally. Each task requires different control strategies and emphasizes different aspects of the robot's dynamics. If a single neural network tries to learn all these, the gradients from learning to cross bars might interfere with the gradients from learning to climb stairs, slowing down training or leading to suboptimal performance. By using MoE, the framework can:

  1. Specialize: Assign different sub-networks (experts) to specialize in distinct behaviors or task subsets. For example, one expert might become proficient at balancing, another at lifting legs for obstacle clearance, and another for bipedal stability.

  2. Route Dynamically: A gating network acts as a traffic controller, dynamically deciding which expert (or combination of experts) is most relevant for the robot's current state and command. This allows the appropriate expert to be activated for each task, effectively isolating the learning processes and mitigating gradient conflicts.

  3. Compose: The specialized experts can then be combined or adjusted, offering a more interpretable way to understand and even compose new skills from existing ones.

    This modular approach improves training efficiency and performance by allowing each expert to focus on a narrower range of tasks, preventing detrimental interference between heterogeneous task gradients.

4.2. Core Methodology In-depth (Layer by Layer)

The MoE-Loco framework for multitask locomotion learning follows a two-stage training process, leveraging Proximal Policy Optimization (PPO) as the underlying reinforcement learning algorithm. The overall pipeline is depicted in Figure 3.

The following figure (Figure 3 from the original paper) shows an overview of the MoE-Loco pipeline. With the designed MoE architecture, the policy achieves robust multitask locomotion ability on various challenging terrains with multiple gaits.

Figure 3 (schematic): the information flow and structure of the MoE-Loco framework, including the explicit and implicit privileged states, the estimator, perception inputs, and LSTM processing. Two Mixture of Experts modules, an actor MoE and a critic MoE, produce a weighted sum of expert outputs to yield the final action, with a shared gating network integrating the different information streams for efficient multitask locomotion learning.

4.2.1. Task Definition

The paper focuses on 9 challenging locomotion tasks for a quadruped robot, encompassing both quadrupedal and bipedal gaits. These tasks are:

  • Quadrupedal Gait Tasks: Bar crossing, pit crossing, baffle crawling, stair climbing, and slope walking.

  • Bipedal Gait Tasks: Standing up, plane walking, slope walking, and stair descending.

    The robot receives velocity commands from a joystick and a one-hot vector $g$ to indicate the desired gait: $g=0$ for quadrupedal and $g=1$ for bipedal. The terrains used in simulation are shown in Figure 2.

The following figure (Figure 2 from the original paper) shows a snapshot of the terrain settings. From left to right: bar, pit, baffle, slope, and stairs.

Fig. 2: A snapshot of the terrain settings. From left to right: bar, pit, baffle, slope, and stairs. (The figure illustrates the robot's locomotion behaviors across these terrains, such as crossing bars and obstacles and climbing stairs, reflecting its multitask locomotion ability.)

Multitask locomotion is defined as a Markov Decision Process (MDP): $\langle S_{\tau}, A_{\tau}, T_{\tau}, R_{\tau}, \gamma_{\tau} \rangle$.

  • State Space (S_{\tau}): Different locomotion terrains correspond to distinct subsets of the state space.

  • Reward Function (R_{\tau}): Varies across different gaits, reflecting task-specific objectives.

  • Transition Dynamics (T_{\tau}): Termination conditions depend on the gait type.

    The robot learns a policy $\pi(a \mid s)$ that selects actions based on both the terrain and gait to maximize the cumulative reward across tasks: $$J(\pi) = \mathbb{E}\left[\sum_{\tau}\sum_{t=0}^{\infty}\gamma^{t} R_{\tau}(s_t, a_t, s_{t+1})\right]$$ Here, $J(\pi)$ is the objective function representing the expected cumulative discounted reward of policy $\pi$, $\mathbb{E}$ denotes the expectation over states and actions, $\tau$ indexes the different tasks, $\gamma$ is the discount factor for future rewards, and $R_{\tau}(s_t, a_t, s_{t+1})$ is the reward received for task $\tau$ at time step $t$ for taking action $a_t$ in state $s_t$ and transitioning to state $s_{t+1}$. The goal is to find a single policy that generalizes across various tasks, relying solely on proprioception (internal body sensing) for blind locomotion.
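For illustration, here is a minimal Python sketch of the inner discounted sum in $J(\pi)$, computed for a single finite-length episode (the infinite-horizon sum is truncated at episode termination); the function name is an assumption.

```python
def discounted_return(rewards, gamma: float = 0.99) -> float:
    """Compute sum_t gamma^t * r_t for one episode (the inner sum of J(pi) above)."""
    weight, total = 1.0, 0.0
    for r in rewards:
        total += weight * r
        weight *= gamma
    return total

# The multitask objective averages this quantity over episodes drawn from all tasks.
print(discounted_return([1.0, 1.0, 0.5]))
```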

4.2.2. Observations (State Space)

The observation space is composed of four types of information at time tt:

  1. Proprioception ($\pmb{p}_t$): Internal sensor data from the robot itself. This includes:
    • Projected gravity (from IMU)
    • Base angular velocity (from IMU)
    • Joint positions
    • Joint velocities
    • The last action taken
  2. Explicit Privileged State ($\pmb{e}_t$): Information that is available in simulation but not directly on the real robot, providing "privileged" insight. This includes:
    • Base linear velocity (used instead of noisy IMU data)
    • Ground friction
  3. Implicit Privileged State ($\pmb{i}_t$): Additional privileged information, typically relating to contacts, which is encoded into a low-dimensional latent representation to reduce the sim-to-real gap. This includes:
    • Contact force of different robot links
  4. Command ($\pmb{c}_t$): The desired control input for the robot:
    • Velocity command $V = (\nu_x, \nu_y, \nu_{\mathrm{yaw}})$ (linear velocities along x and y, and angular velocity about the yaw axis).
    • One-hot vector $g$ for gait selection: $g=0$ for quadrupedal, $g=1$ for bipedal (a small assembly sketch follows this list).
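As a rough illustration of how these four observation groups might be assembled, consider the sketch below; the ordering, flat-vector layout, and function name are assumptions, since the paper only lists the components.

```python
import numpy as np

def assemble_observations(proj_gravity, base_ang_vel, joint_pos, joint_vel, last_action,
                          base_lin_vel, friction, contact_forces, vel_cmd, gait):
    """Group raw signals into the four observation blocks described above."""
    p_t = np.concatenate([proj_gravity, base_ang_vel, joint_pos, joint_vel, last_action])
    e_t = np.concatenate([base_lin_vel, [friction]])        # explicit privileged state
    i_t = np.asarray(contact_forces, dtype=np.float32)      # encoded separately (implicit)
    c_t = np.concatenate([vel_cmd, [gait]])                 # (v_x, v_y, v_yaw, g)
    return (p_t.astype(np.float32), e_t.astype(np.float32), i_t, c_t.astype(np.float32))
```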

4.2.3. Action Space

The action space $\pmb{a}_t \in \mathbb{R}^{12}$ consists of the desired joint positions for all 12 joints of the robot. These are typically target positions for PD controllers that then apply torques to achieve these positions.
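Since the action is a set of target joint positions tracked by PD controllers, a minimal sketch of the joint-level conversion to torques looks like the following, using the $K_p$/$K_d$ gains reported in the training details; the function itself is illustrative.

```python
import numpy as np

def pd_torque(q_target, q, q_dot, kp: float = 40.0, kd: float = 0.5) -> np.ndarray:
    """Convert policy-commanded target joint positions into joint torques via PD control."""
    return kp * (np.asarray(q_target) - np.asarray(q)) - kd * np.asarray(q_dot)

# Example for 12 joints.
tau = pd_torque(q_target=np.zeros(12), q=np.full(12, 0.1), q_dot=np.zeros(12))
print(tau.shape)  # (12,)
```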

4.2.4. Reward Design

The robot receives different rewards based on the gait command $g$.

  • Quadrupedal Locomotion ($g=0$): Total reward $r^{\mathrm{quad}} = r_{\mathrm{track}}^{\mathrm{quad}} + r_{\mathrm{reg}}^{\mathrm{quad}}$. This includes rewards for tracking desired linear and angular velocities, penalties for termination, and regularization terms for joint positions, velocities, accelerations, angular velocity stability, feet in air, hip positions, base height, balance, and joint limits.

  • Bipedal Locomotion ($g=1$): Total reward $r^{\mathrm{bip}} = r_{\mathrm{track}}^{\mathrm{bip}} + r_{\mathrm{stand}}^{\mathrm{bip}} + r_{\mathrm{reg}}^{\mathrm{bip}}$. This includes similar tracking and regularization terms, but also a specific reward for standing up and maintaining bipedal balance.

    The detailed reward functions are extensive and provided in Table V of the Appendix. They fall into four groups:

    • Quadrupedal tracking: linear-velocity tracking, angular-velocity tracking, a termination penalty, and an alive bonus.

    • Quadrupedal regularization: penalties on joint positions, velocities, and accelerations; angular-velocity stability; feet in air; front and rear hip positions; base height; balance; joint limits; and torque limits.

    • Bipedal tracking: orientation, base height, linear- and angular-velocity tracking (gated on the robot being sufficiently upright), a termination penalty, an alive bonus, and rear-feet air time.

    • Bipedal regularization: front and rear hip positions, rear-foot balance, front joint position, velocity, and acceleration penalties, leg energy, torque limits, joint limits, collisions, action rate, and joint velocity and acceleration penalties.

    The exact formulas and weights of each term are given in Table V of the original paper.

4.2.5. Termination Conditions

Termination conditions are gait-dependent:

  • Quadrupedal Gait ($g=0$): The robot terminates if its roll angle $\theta_{\mathrm{roll}} > 1.0$ rad or its pitch angle $\theta_{\mathrm{pitch}} > 1.6$ rad (indicating it has fallen or is severely unstable).
  • Bipedal Gait ($g=1$): The robot terminates if any link other than its rear feet and calves contacts the ground after 1 second (indicating it has lost bipedal balance).

4.2.6. Training Process (Two-Stage Framework)

The training process follows a two-stage framework, inspired by prior work [20], and uses PPO [53].

Stage 1: Oracle Policy Training

In the first stage, an "Oracle" policy is trained with access to all available observation states: proprioception ($\pmb{p}_t$), explicit privileged state ($\pmb{e}_t$), implicit privileged state ($\pmb{i}_t$), and command ($\pmb{c}_t$).

  1. Implicit State Encoding: The implicit privileged state $\pmb{i}_t$ (contact forces) is first passed through an encoder network (Implicit Encoder MLP, as per Table VI) to convert it into a low-dimensional latent representation, denoted as $\mathrm{Enc}(\pmb{i}_t)$. This is crucial for mitigating the sim-to-real gap, as raw contact forces are hard to obtain on real robots.
  2. Dual-State Representation: This latent representation $\mathrm{Enc}(\pmb{i}_t)$ is then concatenated with the explicit privileged state $\pmb{e}_t$ and proprioception $\pmb{p}_t$ to form a comprehensive dual-state representation $\pmb{l}_t = [\mathrm{Enc}(\pmb{i}_t), \pmb{e}_t, \pmb{p}_t]$.
  3. Historical Information Integration: A Long Short-Term Memory (LSTM) module (Actor/Critic RNN) processes the dual-state representation $\pmb{l}_t$ combined with the command $\pmb{c}_t$ to integrate historical information into a hidden state $\pmb{h}_t$. This allows the policy to account for past events and maintain memory, which is vital for dynamic locomotion tasks.
  4. Mixture of Experts (MoE) Application: The hidden state $\pmb{h}_t$ is then fed into the Mixture of Experts (MoE) architecture, which is incorporated into both the actor (policy network) and critic (value network). This is where gradient conflicts are addressed.
    • Gating Network: A gating network $g$ (an MLP as per Table VI) takes the hidden state $\pmb{h}_t$ as input and outputs scores for each expert. These scores are then normalized using a softmax function to produce the gating weight $\hat{g}_i$ for each expert $i$: $\hat{g}_i = \mathrm{softmax}(g(\pmb{h}_t))[i]$. Here, $\hat{g}_i$ is the gating weight for expert $i$, $g(\pmb{h}_t)$ is the output of the gating network, and the softmax ensures these weights sum to 1.
    • Expert Combination: Each expert $f_i$ (an Expert Head MLP) also takes the hidden state $\pmb{h}_t$ as input and computes its proposed action. The final action $\pmb{a}_t$ is a weighted sum of the actions suggested by all $N$ experts, where the weights are provided by the gating network: $\pmb{a}_t = \sum_{i=1}^{N} \hat{g}_i \cdot f_i(\pmb{h}_t)$. Here, $\pmb{a}_t$ is the final action, $N$ is the number of experts, and $f_i(\pmb{h}_t)$ is the output of expert $i$. The paper uses $N=6$ experts.
    • Shared Gating Network: The actor MoE and critic MoE share the same gating network. This ensures consistency between how actions are generated and how their values are evaluated.
  5. Estimator Pretraining: In parallel, an estimator module is pretrained during this stage. The estimator takes proprioception $\pmb{p}_t$ and command $\pmb{c}_t$ as input and aims to reconstruct the privileged information $[\mathrm{Enc}(\pmb{i}_t), \pmb{e}_t]$. This is done using an L2 loss $L_{\mathrm{recon}}$: $L_{\mathrm{recon}} = \sum_{\hat{\mathbf{l}}_i, \mathbf{l}_i \in \mathcal{D}} \|\hat{\mathbf{l}}_i - \mathbf{l}_i\|^2$. Here, $\hat{\mathbf{l}}_i$ is the estimated privileged information from the estimator, and $\mathbf{l}_i$ is the actual privileged information. $\mathcal{D}$ denotes the rollout buffer. This estimator will be crucial in Stage 2.
  6. Overall Optimization Objective: The total loss for PPO in Stage 1 combines the standard PPO losses with the reconstruction loss: $L = L_{\mathrm{surro}} + L_{\mathrm{value}} + L_{\mathrm{recon}}$, where $L_{\mathrm{surro}}$ is the PPO surrogate loss for policy updates and $L_{\mathrm{value}}$ is the value function loss for updating the critic (a minimal sketch of this combined loss follows below).
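As referenced in item 6, here is a minimal sketch of combining the PPO losses with the estimator's L2 reconstruction loss; all tensor and function names are assumptions, and the rollout-buffer sum is collapsed to a single batch for brevity.

```python
import torch
import torch.nn.functional as F

def stage1_loss(l_surro: torch.Tensor, l_value: torch.Tensor,
                est_pred: torch.Tensor, priv_target: torch.Tensor) -> torch.Tensor:
    """Total Stage 1 objective L = L_surro + L_value + L_recon, where L_recon is the
    squared error between the estimator output and the privileged state."""
    l_recon = F.mse_loss(est_pred, priv_target, reduction="sum")
    return l_surro + l_value + l_recon
```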

Algorithm 1 provides the pseudocode for Training Stage 1:

Algorithm 1 Training Stage 1
 1: for total iteration do
 2:   Initialize rollout buffer $\mathcal{D} \leftarrow \emptyset$
 3:   for num steps do
 4:     $\mathbf{z}_t \gets \mathrm{Enc}(\mathbf{i}_t)$
 5:     $\hat{\mathbf{l}}_t \gets [\mathrm{Estimator}(\mathbf{p}_t, \mathbf{c}_t), \mathbf{p}_t]$
 6:     $\mathbf{l}_t \gets [\mathbf{z}_t, \mathbf{e}_t, \mathbf{p}_t]$
 7:     $\mathbf{h}_t \gets \mathrm{LSTM}([\mathbf{l}_t, \mathbf{c}_t])$
 8:     $\hat{\mathbf{g}} \gets \mathrm{softmax}(g(\mathbf{h}_t))$
 9:     $\mathbf{a}_t \gets \sum_{i=1}^{N} \hat{\mathbf{g}}_i \cdot f_i(\mathbf{h}_t)$
10:     Execute $\mathbf{a}_t$, observe reward $r_t$ and next state
11:     Reset if terminated
12:     $t \gets t+1$
13:     Store rollout in $\mathcal{D}$
14:   end for
15:   Compute PPO losses: $L_{\mathrm{surro}}$, $L_{\mathrm{value}}$
16:   Compute reconstruction loss:
17:   $L_{\mathrm{recon}} = \sum_{\hat{\mathbf{l}}_i, \mathbf{l}_i \in \mathcal{D}} \|\hat{\mathbf{l}}_i - \mathbf{l}_i\|^2$
18:   $L = L_{\mathrm{surro}} + L_{\mathrm{value}} + L_{\mathrm{recon}}$
19:   Update policy and value network
20: end for
21: return Oracle Policy

  • Lines 4-6: These lines process the raw observations to create the full state representation $\mathbf{l}_t$ and the estimator's prediction $\hat{\mathbf{l}}_t$. $\mathbf{z}_t$ is the encoded implicit privileged state.
  • Lines 7-9: These lines describe the forward pass through the LSTM and MoE modules to get the action $\mathbf{a}_t$.
  • Lines 10-14: Standard RL rollout collection.
  • Lines 15-18: Calculation of the combined PPO and reconstruction loss.
  • Line 19: Optimization step for the policy and value networks.

Stage 2: Final Policy Adaptation

In the second stage, the policy learns to rely exclusively on proprioception ($\pmb{p}_t$) and command ($\pmb{c}_t$) as observations, mimicking the real-world scenario where privileged information is unavailable.

  1. Initialization: The weights of the estimator, low-level LSTM, and MoE modules are initialized by copying them from the Oracle Policy trained in Stage 1.

  2. Probability Annealing Selection (PAS): To adapt the policy to the potentially inaccurate estimates from the estimator without significant performance degradation, Probability Annealing Selection (PAS) [54] is employed (see the sketch after this list).

    • During training, at each step $t$, the system decides whether to use the actual privileged information $\pmb{l}_t$ (from simulation) or the estimator's prediction $\hat{\pmb{l}}_t$. This selection is based on a probability $P_t = \alpha^t$, where $\alpha$ is an annealing factor (e.g., slightly less than 1).
    • As training progresses ($t$ increases), $P_t$ decreases, meaning the policy gradually relies more on the estimator's output ($\hat{\pmb{l}}_t$) and less on the ground-truth privileged information ($\pmb{l}_t$). This smooth transition helps the policy adapt robustly.
    • The state $\bar{\pmb{l}}_t$ used as LSTM input is chosen by $\bar{\pmb{l}}_t \gets \mathrm{ProbabilitySelection}(P_t, \hat{\pmb{l}}_t, \pmb{l}_t)$.
  3. Policy Update: The rest of the training proceeds similarly to Stage 1, using the combined loss $L = L_{\mathrm{surro}} + L_{\mathrm{value}} + L_{\mathrm{recon}}$. The estimator continues to be refined through $L_{\mathrm{recon}}$, ensuring its predictions are as accurate as possible.
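A minimal sketch of the PAS selection step referenced above; the annealing factor value used here is an assumption, since the paper does not report it.

```python
import random

def probability_annealing_selection(step: int, l_true, l_est, alpha: float = 0.999):
    """With probability P_t = alpha**step use the ground-truth privileged state,
    otherwise fall back to the estimator's prediction."""
    p_t = alpha ** step
    return l_true if random.random() < p_t else l_est
```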

    Algorithm 2 provides the pseudocode for Training Stage 2:

Algorithm 2 Training Stage 2
 1: Copy parameters from Oracle Policy
 2: for total iteration do
 3:   Initialize rollout buffer $\mathcal{D} \leftarrow \emptyset$
 4:   for num steps do
 5:     $\hat{\mathbf{l}}_t \gets [\mathrm{Estimator}(\mathbf{p}_t, \mathbf{c}_t), \mathbf{p}_t]$
 6:     $\mathbf{l}_t \gets [\mathbf{z}_t, \mathbf{e}_t, \mathbf{p}_t]$
 7:     $P_t \gets \alpha^t$
 8:     $\bar{\mathbf{l}}_t \gets \mathrm{ProbabilitySelection}(P_t, \hat{\mathbf{l}}_t, \mathbf{l}_t)$
 9:     $\mathbf{h}_t \gets \mathrm{LSTM}([\bar{\mathbf{l}}_t, \mathbf{c}_t])$
10:     $\hat{\mathbf{g}} \gets \mathrm{softmax}(g(\mathbf{h}_t))$
11:     $\mathbf{a}_t \gets \sum_{i=1}^{N} \hat{\mathbf{g}}_i \cdot f_i(\mathbf{h}_t)$
12:     Execute $\mathbf{a}_t$, observe reward $r_t$ and next state
13:     Reset if terminated
14:     $t \gets t+1$
15:     Store rollout in $\mathcal{D}$
16:   end for
17:   Compute PPO losses: $L_{\mathrm{surro}}$, $L_{\mathrm{value}}$
18:   Compute reconstruction loss:
19:   $L_{\mathrm{recon}} = \sum_{\hat{\mathbf{l}}_i, \mathbf{l}_i \in \mathcal{D}} \|\hat{\mathbf{l}}_i - \mathbf{l}_i\|^2$
20:   $L = L_{\mathrm{surro}} + L_{\mathrm{value}} + L_{\mathrm{recon}}$
21:   Update policy and value network
22: end for
23: return Final Policy

  • Line 1: Initializes the networks from Stage 1.
  • Lines 5-6: Similar to Stage 1, preparing the estimator's output and the true privileged state (though $\mathbf{z}_t$ and $\mathbf{e}_t$ are internal simulation values, not real-world inputs).
  • Lines 7-8: Implements the Probability Annealing Selection (PAS) mechanism. $P_t$ is the probability of using the true privileged state, which anneals over time.
  • Lines 9-11: Forward pass through the LSTM and MoE using the selected state $\bar{\mathbf{l}}_t$.
  • The rest is similar to Stage 1, but with the policy now adapting to rely more on the estimator.

4.2.7. Skill Decomposition and Composition

The MoE framework inherently facilitates skill decomposition and composition.

  • Decomposition: The gating network implicitly decomposes complex tasks by assigning different weights to different experts. Through training, each expert naturally specializes in distinct aspects of movement (e.g., balancing, crawling, obstacle crossing). This specialization is observed through quantitative analysis of expert coordination, as shown in the experiments.
  • Composition: MoE-Loco allows for skill composition by manually or dynamically adjusting the gating weights of pretrained experts to synthesize new skills or gaits. The modified gating weights $\hat{g}_i$ are calculated as $\hat{g}_i = w[i] \cdot \mathrm{softmax}(g(\pmb{h}_t))[i]$, where $w[i]$ is a manually defined or dynamically adjusted weight for expert $i$. In other words, after the softmax output of the original gating network $g(\pmb{h}_t)$, an additional factor $w[i]$ scales the contribution of each expert. This enables controlled skill blending and adaptation to novel locomotion patterns without retraining the entire policy (see the sketch below).
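As a concrete illustration of this reweighting, the sketch below rescales the gating output before the weighted sum of expert outputs; `gate`, `experts`, and `w` are placeholders for the trained modules and a user-chosen weight vector, not objects defined in the paper.

```python
import torch

def composed_action(h_t: torch.Tensor, gate, experts, w: torch.Tensor) -> torch.Tensor:
    """Skill composition: hat{g}_i = w[i] * softmax(g(h_t))[i], then weighted-sum the experts."""
    g_hat = torch.softmax(gate(h_t), dim=-1) * w                 # element-wise reweighting, (B, N)
    outputs = torch.stack([f(h_t) for f in experts], dim=1)      # (B, N, action_dim)
    return (g_hat.unsqueeze(-1) * outputs).sum(dim=1)            # (B, action_dim)
```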

4.3. Network Architecture Details

The architecture of different modules used in the experiment is detailed in Table VI.

The following are the network architecture details from Table VI of the original paper:

Network Type Dims
Actor RNN LSTM [256]
Critic RNN LSTM [256]
Estimator Module LSTM [256]
Estimator Latent Encoder MLP [256, 128]
Implicit Encoder MLP [32, 16]
Expert Head MLP [256, 128, 128]
Standard Head MLP [640, 640, 128]
Gating Network MLP [128]
  • Actor RNN / Critic RNN / Estimator Module: All use LSTMs with a hidden dimension of 256. These LSTMs are crucial for processing sequential observations and integrating historical information.
  • Estimator Latent Encoder: An MLP that maps input features (likely from proprioception and command) to a latent space of 128 dimensions via an intermediate 256-dimension layer.
  • Implicit Encoder: A small MLP that encodes the implicit privileged state (e.g., 32 contact forces) into a 16-dimensional latent representation. This small dimension helps with sim-to-real transfer.
  • Expert Head: Each expert in the MoE is an MLP with two hidden layers (256 -> 128 -> 128). The final output layer would match the action space dimension (12 joints).
  • Standard Head: This refers to the MLP used in the Ours w/o MoE baseline. It's a larger MLP (640 -> 640 -> 128, then to action space) designed to have a similar total parameter count to the MoE policy for fair comparison.
  • Gating Network: An MLP with a single hidden layer (outputting to 128 dimensions), responsible for generating the gating weights for the experts. The output dimension would typically match the number of experts (6 in this case) before softmax.

4.4. Domain Randomization

To ensure the policy can safely transfer to real-world environments and overcome the sim-to-real gap, dynamic domain randomization is applied during training. This involves randomizing various physical parameters within the simulation.

The following are the domain randomization parameters from Table VII of the original paper:

Parameters Range Unit
Base mass [1, 3] kg
Mass position of X axis [-0.2, 0.2] m
Mass position of Y axis [-0.1, 0.1] m
Mass position of Z axis [-0.05, 0.05] m
Friction [0, 2] -
Initial joint positions [0.5, 1.5] × nominal value rad
Motor strength [0.9, 1.1] × nominal value -
Proprioception latency [0.005, 0.045] s

Additionally, Gaussian noise is added to the input observations to simulate real-world sensor noise:

The following are the Gaussian noise parameters from Table VIII of the original paper:

Observation Gaussian Noise Amplitude Unit
Linear velocity 0.05 m/s
Angular velocity 0.2 rad/s
Gravity 0.05 m/s²
Joint position 0.01 rad
Joint velocity 1.5 rad/s
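The sketch below illustrates how these ranges and noise amplitudes could be applied per episode and per step; uniform sampling and the dictionary layout are assumptions consistent with standard practice, and only a subset of the parameters in Table VII is shown.

```python
import numpy as np

# Ranges taken from Tables VII/VIII above; the sampling scheme itself is an assumption.
RANDOMIZATION_RANGES = {
    "base_mass_kg": (1.0, 3.0),
    "com_x_m": (-0.2, 0.2),
    "com_y_m": (-0.1, 0.1),
    "com_z_m": (-0.05, 0.05),
    "friction": (0.0, 2.0),
    "motor_strength_scale": (0.9, 1.1),
    "proprio_latency_s": (0.005, 0.045),
}

def sample_domain(rng: np.random.Generator) -> dict:
    """Draw one set of randomized physical parameters, uniform within each range."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANDOMIZATION_RANGES.items()}

def noisy_observation(obs: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise with an amplitude from Table VIII."""
    return obs + rng.normal(0.0, sigma, size=obs.shape)

rng = np.random.default_rng(0)
print(sample_domain(rng))
```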

4.5. Training Details

  • Simulation Environment: Training is conducted in IsaacGym [17], a high-performance GPU-based physics simulator, utilizing 4096 robots concurrently on an NVIDIA RTX 3090 GPU.

  • Ground Noise: To ensure robust performance on uneven terrains and prevent leg dragging, fractal noise [7] is applied to the ground, with a maximum noise scale $z_{\mathrm{max}} = 0.1$.

  • Training Schedule:

    • 40,000 iterations for initial plane walking in both bipedal and quadrupedal gaits.
    • 80,000 iterations on challenging terrain tasks.
    • 10,000 iterations for Probability Annealing Selection (PAS) to adapt to pure proprioception input (Stage 2).
  • Control Frequency: $50\,\mathrm{Hz}$ in both simulation and the real world.

  • Low-level Control: PD control is used for joint execution with parameters $K_p = 40.0$ and $K_d = 0.5$.

  • Number of Experts: The number of experts $N_{\mathrm{exp}}$ is set to 6.

  • Reinforcement Learning Algorithm: PPO is used.

    The hyperparameters for PPO are shown in Table IX.

The following are the PPO hyperparameters from Table IX of the original paper:

Hyperparameter Value
clip min std 0.05
clip param 0.2
gamma 0.99
lam 0.95
desired kl 0.01
entropy coef 0.01
learning rate 0.001
max grad norm 1
num mini batch 4
num steps per env 24
  • clip min std: Minimum standard deviation for the action distribution in PPO's actor network.
  • clip param: The clipping parameter used in PPO's surrogate loss function to limit policy updates.
  • gamma: The discount factor for future rewards.
  • lam: The lambda parameter for Generalized Advantage Estimation (GAE), used in PPO for estimating advantages.
  • desired kl: The target Kullback-Leibler (KL) divergence used in adaptive KL penalty PPO to control the policy update step size.
  • entropy coef: Coefficient for the entropy regularization term in the PPO loss, encouraging exploration.
  • learning rate: The learning rate for the optimizer (e.g., Adam).
  • max grad norm: Gradient clipping threshold, used to prevent exploding gradients.
  • num mini batch: Number of mini-batches used for PPO updates.
  • num steps per env: Number of environment steps collected per environment before performing a PPO update.

5. Experimental Setup

5.1. Datasets

The experiments are conducted in a simulated environment using IsaacGym [17], a high-performance GPU-based physics simulator. There isn't a traditional "dataset" in the sense of a fixed collection of samples, but rather an interactive environment where data is generated through robot-environment interaction.

The simulation environment uses a custom benchmark consisting of a 5m x 100m runway with various obstacles evenly distributed along the path. This setup acts as the experimental "dataset" for evaluating diverse locomotion skills.

The following are the benchmark tasks for simulation experiments from Table II of the original paper:

Obstacle Type Specification Gait Mode
Bars 5 bars, height: 0.05m - 0.2m Quadrupedal
Pits 5 pits, width: 0.05m - 0.2m Quadrupedal
Baffles 5 baffles, height: 0.3m - 0.22m Quadrupedal
Up Stairs 3 sets, step height: 5cm - 15cm Quadrupedal
Down Stairs 3 sets, step height: 5cm - 15cm Quadrupedal
Up Slopes 3 sets, incline: 10° - 35° Quadrupedal
Down Slopes 3 sets, incline: 10° - 35° Quadrupedal
Plane 10m flat surface Bipedal
Up Slopes 3 sets, incline: 10° - 35° Bipedal
Down Slopes 3 sets, incline: 10° - 35° Bipedal
Down Stairs 3 sets, step height: 5cm - 15cm Bipedal

For each challenging task, separate tracks of 30 meters in length were also used. These environments and obstacles were chosen because they represent a diverse set of real-world challenges for legged robots, including different geometries (bars, pits, stairs), continuous height changes (slopes), and varying support surfaces (baffles). They are effective for validating the method's performance across a wide spectrum of locomotion skills and robustness under varied conditions. The use of both quadrupedal and bipedal gaits further increases the diversity of the "dataset."

5.2. Evaluation Metrics

The experiments use three primary metrics to evaluate performance:

  1. Success Rate (↑):

    • Conceptual Definition: This metric quantifies the proportion of trials in which the robot successfully completes a given task. It measures the reliability and overall capability of the policy to achieve its objective. A trial is considered successful if the robot reaches within 1 meter of the target point within a maximum allowed time (400 seconds).
    • Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} $
    • Symbol Explanation:
      • Number of Successful Trials: The count of trials where the robot reached within 1m of the target point within 400 seconds.
      • Total Number of Trials: The total number of attempts made for a given task.
  2. Average Pass Time (↓):

    • Conceptual Definition: This metric measures the average time taken by the robot to complete a task. It indicates the efficiency and speed of the locomotion policy. For failed trials (where the robot does not reach the target or meets termination conditions), the pass time is recorded as the maximum allowed time (400 seconds) to penalize failure.
    • Mathematical Formula: $ \text{Average Pass Time} = \frac{\sum_{j=1}^{\text{Total Trials}} \text{Time}_j}{\text{Total Number of Trials}} $
    • Symbol Explanation:
      • $\text{Time}_j$: The time taken to complete trial $j$. If trial $j$ fails, $\text{Time}_j = 400$ seconds.
      • Total Number of Trials: The total number of attempts made for a given task.
  3. Average Travel Distance (↑):

    • Conceptual Definition: This metric measures the average lateral distance traveled by the robots at the end of the evaluation. While the primary goal is often forward locomotion, this metric can indicate stability and control, ensuring the robot doesn't deviate excessively laterally while moving.
    • Mathematical Formula: The paper refers to "average lateral travel distance," which might imply distance from the starting line or desired path. Without a specific formula in the paper, a general interpretation for average travel distance is: $ \text{Average Travel Distance} = \frac{\sum_{j=1}^{\text{Total Trials}} \text{Distance Traveled}_j}{\text{Total Number of Trials}} $
    • Symbol Explanation:
      • $\text{Distance Traveled}_j$: The distance the robot traveled in trial $j$. For failed trials, this would be the distance covered before failure.
      • Total Number of Trials: The total number of attempts made for a given task.
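A small sketch computing the three metrics from a list of trial records; the field names and data structure are assumptions, while the 400 s failure convention follows the definitions above.

```python
MAX_TIME = 400.0  # seconds; failed trials are assigned the maximum allowed time

def evaluate(trials):
    """Each trial is a dict with keys 'success' (bool), 'time' (s), and 'distance' (m)."""
    n = len(trials)
    success_rate = sum(t["success"] for t in trials) / n
    avg_pass_time = sum(t["time"] if t["success"] else MAX_TIME for t in trials) / n
    avg_distance = sum(t["distance"] for t in trials) / n
    return success_rate, avg_pass_time, avg_distance

print(evaluate([{"success": True, "time": 92.4, "distance": 28.0},
                {"success": False, "time": 400.0, "distance": 11.3}]))
```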

5.3. Baselines

The paper compares MoE-Loco against two main baselines:

  1. Ours w/o MoE [20]: This baseline uses the same overall framework (including the two-stage training, privileged learning, and PPO) as MoE-Loco but replaces the Mixture of Experts (MoE) module with a simple Multi-Layer Perceptron (MLP) backbone. Critically, the total number of parameters in this MLP policy is controlled to be approximately the same as in the MoE policy. This baseline is representative because it isolates the contribution of the MoE architecture, demonstrating whether the performance gains are due to MoE itself or other aspects of the training framework. It directly helps in validating the hypothesis that MoE mitigates gradient conflicts.

  2. RMA (Rapid Motor Adaptation) [3]: This is a well-known RL approach for legged robots that focuses on rapid adaptation to novel terrains. RMA employs a teacher-student training framework with a 1D-CNN (Convolutional Neural Network) as an asynchronous adaptation module. It does not use an MoE module. RMA is a strong baseline for robust locomotion because it explicitly addresses generalization to unseen environmental conditions through its adaptation mechanism. Comparing against RMA helps to show MoE-Loco's advantages in handling diverse known tasks and gaits, as opposed to just adapting to unforeseen variations in a single task. The paper notes that RMA's original implementation, with only an MLP backbone and CNN encoder, might struggle with multiple challenging terrains.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Multitask Performance

Simulation Experiment

The simulation experiments evaluate the MoE-Loco policy against Ours w/o MoE and RMA across a mixed-task benchmark and individual challenging tasks. In the tables, (q) denotes quadrupedal gait and (b) denotes bipedal gait.

The following are the results from Table I of the original paper:

Method Success Rate ↑
Mix Bar (q) Baffle (q) Stair (q) Pit (q) Slope (q) Walk (b) Slope (b) Stair (b)
Ours 0.879 0.886 0.924 0.684 0.902 0.956 0.932 0.961 0.964
Ours w/o MoE 0.571 0.848 0.264 0.568 0.698 0.988 0.826 0.504 0.453
RMA 0.000 0.871 0.058 0.017 0.017 0.437 0.000 0.000 0.000
Average Pass Time (s) ↓
Mix Bar (q) Baffle (q) Stair (q) Pit (q) Slope (q) Walk (b) Slope (b) Stair (b)
Ours 230.98 102.42 87.84 179.14 91.86 76.75 92.37 86.14 86.44
Ours w/o MoE 315.47 125.46 318.68 214.52 161.38 65.28 156.76 236.67 253.62
RMA 400.00 107.84 385.25 395.34 394.49 272.06 400.00 400.00 400.00
Average Travel Distance (m) ↑
Mix Bar (q) Baffle (q) Stair (q) Pit (q) Slope (q) Walk (b) Slope (b) Stair (b)
Ours 89.41 28.05 28.02 20.42 27.82 27.62 27.20 27.99 28.04
Ours w/o MoE 57.12 27.59 17.41 22.66 25.59 28.49 22.73 26.21 14.23
RMA 13.40 27.39 11.31 3.92 12.48 21.33 2.00 2.00 2.00

Analysis:

  • Overall Mixed-Task Performance (Mix column): Ours (MoE-Loco) significantly outperforms both baselines across all three metrics. It achieves a Success Rate of 0.879, much higher than Ours w/o MoE (0.571) and RMA (0.000). Similarly, its Average Pass Time is lower (230.98s vs 315.47s and 400.00s), and Average Travel Distance is higher (89.41m vs 57.12m and 13.40m). This strong performance in the mixed-task benchmark highlights MoE-Loco's ability to generalize and robustly handle diverse tasks simultaneously.
  • Single-Task Performance:
    • Ours generally excels in single-task evaluations. For instance, in Baffle (q), Ours has a Success Rate of 0.924 compared to 0.264 for Ours w/o MoE and 0.058 for RMA. This indicates that MoE is critical for challenging tasks that likely induce significant gradient conflicts in a standard MLP.
    • The only exception is Slope (q) (quadrupedal slope walking), where Ours w/o MoE achieves a slightly higher Success Rate (0.988 vs 0.956) and lower Average Pass Time (65.28s vs 76.75s). The authors attribute this to the relative simplicity of quadrupedal slope walking, suggesting that for less complex tasks, the overhead or specialization of MoE might not offer a substantial advantage over a well-tuned MLP.
  • Impact of MoE: The substantial performance gap between Ours and Ours w/o MoE (especially in Mix, Baffle (q), Walk (b), Slope (b), Stair (b)) strongly validates the effectiveness of the MoE architecture. The Ours w/o MoE struggles significantly in many challenging scenarios, confirming the authors' hypothesis that gradient conflicts severely degrade performance in multitask learning without MoE.
  • RMA Performance: RMA performs very poorly, achieving a 0.000 success rate in mixed tasks and most bipedal tasks, and very low success rates in several quadrupedal tasks like Baffle (q) and Stair (q). This is attributed to its original implementation using an MLP backbone and CNN encoder, which is not optimized for the diverse and challenging multitask terrains presented in this benchmark.

Real World Experiments

The MoE-Loco policy is deployed zero-shot (without any further training) on a real Unitree Go2 quadruped robot. The robot is tested on a mixed terrain course comprising a 20cm bar, 22cm baffle, 15cm stairs, 20cm pits, and 30-degree slopes in quadrupedal gait, followed by a switch to bipedal gait for standing up, walking up/down a 30-degree slope, and descending stairs. Each test is conducted for 20 trials.

The following figure (Figure 4 from the original paper) shows the real-world success rate over multiple terrains and gaits.

Fig. 4: Real world success rate over multiple terrains and gaits. (Bar chart comparing the real-world success rates of the three methods, shown in different colors: Ours in blue, Ours w/o MoE in orange, and RMA in green, across terrains such as bars, baffles, and stairs.)

The following figure (Figure 5 from the original paper) shows real-world experiments over multiple terrains and gaits.

Fig. 5: Real-world experiments over multiple terrains and gaits: 1. Bar (Quad), 2. Pit (Quad), 3. Baffle (Quad), 4. Stair (Quad), 5. Slope (Quad), 6. Stand up (Bip), 7. Walk (Bip), 8. Slope (Bip), 9. Stair (Bip).

Analysis:

  • Figure 4 clearly shows that MoE-Loco achieves superior real-world success rates across all tested terrains and gaits. Its performance is notably higher in the challenging Mix Terrain scenario, demonstrating its robust generalization capabilities to complex, sequential tasks in the real world.
  • The Ours w/o MoE baseline performs poorly in Mix Terrain and struggles with Baffle (q), Stair (q), Slope (b), and Stair (b), echoing the simulation results and further emphasizing the necessity of MoE for real-world multitask locomotion.
  • RMA again shows very limited real-world capability for these diverse tasks, failing entirely in Mix Terrain and Baffle (q), and exhibiting low success rates in other complex tasks.
  • Figure 5 visually supports these findings, showcasing the Unitree Go2 robot successfully traversing various challenging real-world terrains using the MoE-Loco policy, including diverse outdoor environments. This demonstrates the policy's robustness and adaptability beyond controlled lab settings.

6.1.2. Gradient Conflict Alleviation

To directly investigate whether MoE reduces gradient conflicts, experiments were conducted by resuming training from a checkpoint and running for 500 epochs with 4096 robots. Gradient conflict was measured using two metrics:

  1. Cosine Similarity (↑): The normalized dot product of the gradients of all parameters for different tasks. A smaller (more negative) cosine similarity indicates larger gradient conflicts (gradients pointing in opposite directions); a larger (more positive) cosine similarity indicates greater alignment.

    • Formula for the cosine similarity between two gradient vectors $A$ and $B$: $\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$, where $A \cdot B$ is the dot product and $\|A\|$, $\|B\|$ are the L2 norms of $A$ and $B$, respectively. The range is $[-1, 1]$.
  2. Negative Gradient Ratio (↓): The ratio of negative entries in the element-wise product of two task gradients. A larger ratio indicates that more parameters are being pushed in opposing directions by the two tasks, i.e., larger gradient conflict.

    • No explicit formula is given in the paper, but the description implies: $\text{Negative Gradient Ratio}(G_1, G_2) = \frac{\left|\{\, j : (G_1)_j \, (G_2)_j < 0 \,\}\right|}{\text{Total number of parameters}}$, where $G_1$ and $G_2$ are the gradient vectors for two different tasks. A minimal code sketch of both metrics is given after this list.

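To make the two metrics concrete, here is a minimal PyTorch sketch of how they could be computed from per-task gradients. The function names and the way gradients are flattened are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def flat_task_gradient(policy, task_loss):
    """Flatten the gradient of one task's loss w.r.t. all policy parameters."""
    params = list(policy.parameters())
    grads = torch.autograd.grad(task_loss, params, retain_graph=True, allow_unused=True)
    flat = [g.reshape(-1) if g is not None else torch.zeros_like(p).reshape(-1)
            for g, p in zip(grads, params)]
    return torch.cat(flat)

def gradient_cosine_similarity(g1, g2):
    """Table III metric: higher means the two tasks' gradients are better aligned."""
    return torch.nn.functional.cosine_similarity(g1, g2, dim=0).item()

def negative_entry_ratio(g1, g2):
    """Table IV metric: fraction of parameters pushed in opposite
    directions by the two tasks (lower is better)."""
    return (g1 * g2 < 0).float().mean().item()
```

In this setup, each task's loss would be backpropagated separately through the same policy and the resulting flattened gradients compared pairwise, yielding entries analogous to Tables III and IV below.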
The following are the pairwise cosine similarities of task gradients from Table III of the original paper. Each cell reports MoE/Standard; higher is better.

| Gradient Cosine Similarity ↑ (MoE/Standard) | Baffle (q) | Stair (q) | Slope Up (b) | Slope Down (b) |
| --- | --- | --- | --- | --- |
| Bar (q) | 0.519/0.474 | 0.606/0.592 | 0.278/-0.132 | 0.091/-0.128 |
| Baffle (q) | - | 0.369/0.384 | 0.062/-0.091 | 0.061/-0.101 |
| Stair (q) | - | - | 0.046/-0.023 | 0.052/0.015 |
| Slope Up (b) | - | - | - | 0.806/0.709 |

The following are the pairwise negative entry ratios of the MoE and Standard policies on different tasks from Table IV of the original paper. Each cell reports MoE/Standard; lower is better.

| Gradient Negative Entries (%) ↓ (MoE/Standard) | Baffle (q) | Stair (q) | Slope Up (b) | Slope Down (b) |
| --- | --- | --- | --- | --- |
| Bar (q) | 35.72/37.33 | 32.67/32.62 | 45.50/55.12 | 49.83/50.80 |
| Baffle (q) | - | 39.90/38.52 | 49.86/55.91 | 49.91/51.68 |
| Stair (q) | - | - | 49.52/50.15 | 50.04/50.34 |
| Slope Up (b) | - | - | - | 23.17/30.91 |

Analysis:

  • Cosine Similarity:
    • For tasks involving a mix of quadrupedal and bipedal gaits (e.g., Bar (q) vs. Slope Up (b) or Slope down (b)), the MoE policy shows significantly higher cosine similarity (e.g., 0.278 for MoE vs. -0.132 for Standard between Bar (q) and Slope Up (b)). A negative cosine similarity indicates gradients are pointing in largely opposite directions, causing strong conflicts. MoE effectively shifts these into positive alignment or at least reduces the opposition.
    • Even between some quadrupedal tasks requiring distinct skills (e.g., Bar (q) vs. Baffle (q)), MoE slightly improves cosine similarity (0.519 vs 0.474), suggesting a reduction in conflict.
  • Negative Gradient Ratio:
    • Similar to cosine similarity, the MoE policy consistently achieves a lower negative gradient ratio, especially between bipedal and quadrupedal tasks. For instance, between Bar (q) and Slope Up (b), MoE has 45.50% negative entries compared to 55.12% for the Standard policy. This indicates fewer parameters are being pulled in conflicting directions by these diverse tasks.
    • The reduction is also visible, albeit sometimes smaller, between certain quadrupedal tasks (e.g., Baffle (q) vs. Slope Up (b): 49.86% for MoE vs. 55.91% for Standard).
  • Conclusion: The results from both metrics unequivocally demonstrate that the MoE architecture significantly reduces gradient conflict not only between fundamentally different gaits (bipedal vs. quadrupedal) but also between quadrupedal tasks requiring distinct locomotion skills. This substantiates one of the core claims of the paper regarding MoE's benefits in multitask reinforcement learning.

6.1.3. Training Performance

Training performance is evaluated based on mean reward (ability to exploit the environment) and mean episode length (how well the robot learns to stand and walk). The training curves are presented during the plane pretraining stage, where the robot learns both bipedal and quadrupedal plane walking.

The following figure (Figure 6 from the original paper) shows the training curve of our multitask policy in the pretraining stage.

Fig. 6: Training curve of our multitask policy in the pretraining stage.

Analysis:

  • Figure 6 shows the MoE policy (pink line) consistently outperforms the Standard policy (green line, Ours w/o MoE) in both Mean Episode Length and Mean Reward.
  • Mean Episode Length: The MoE policy achieves a higher mean episode length much faster and maintains it, indicating that it learns to stand and walk more effectively and for longer durations. This suggests faster and more stable learning of fundamental locomotion skills.
  • Mean Reward: The MoE policy also converges to a higher mean reward, demonstrating a better ability to exploit the environment and achieve task objectives.
  • Conclusion: These results confirm that MoE improves training efficiency and overall performance even in the initial pretraining phase, likely due to its ability to handle the subtle multitask nature of learning two distinct gaits on a plane.

6.1.4. Expert Specialization Analysis

To understand how experts compose skills, both qualitative and quantitative analyses are performed.

The following figure (Figure 7 from the original paper) shows the average weight distribution of different experts in multitask gait training.

Fig. 7: Average weight distribution of different experts in multitask gait training, covering tasks such as walking, obstacle traversal, and crossing bars. Each subplot relates expert indices to their mean weights, reflecting each expert's degree of specialization in a specific task.

Analysis of Figure 7: The figure plots the mean weight assigned to each expert (Expert 0 to Expert 5) across various tasks, and the gating-weight distributions clearly differ between tasks (a minimal sketch of how such per-task mean weights can be collected follows this list):

  • For Walking (bipedal), Expert 0 and Expert 1 are heavily weighted, suggesting they are specialized in bipedal locomotion.
  • For Crossing Bars (quadrupedal), Expert 2 and Expert 3 show higher activation, indicating their specialization in obstacle negotiation.
  • For Rear Leg Crossing Bars (quadrupedal, a more specific variant of bar crossing), Expert 4 and Expert 5 are more prominent. This diverse activation pattern across tasks clearly demonstrates the expertise and differentiation that emerge naturally among the MoE experts. Each expert learns to contribute optimally to specific locomotion behaviors.

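For reference, here is a minimal sketch of a soft-gated MoE actor and of how the per-task mean expert weights shown in Figure 7 could be collected. The six experts match the paper's $N_{\mathrm{exp}} = 6$, but the layer sizes, ELU activations, and softmax gating are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SoftMoEPolicy(nn.Module):
    """Illustrative soft-gated MoE actor: a gating network mixes the outputs of N experts."""
    def __init__(self, obs_dim, act_dim, num_experts=6, hidden=256):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, act_dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                                  nn.Linear(hidden, num_experts))

    def forward(self, obs):
        w = torch.softmax(self.gate(obs), dim=-1)                   # (B, N) expert weights
        acts = torch.stack([e(obs) for e in self.experts], dim=-1)  # (B, act_dim, N)
        action = (acts * w.unsqueeze(1)).sum(dim=-1)                # weighted mixture of expert actions
        return action, w

# Per-task mean expert weights, as plotted in Figure 7: for each task, roll out the
# policy, collect w over the batch, and average, e.g. mean_w = w.mean(dim=0).
```
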
The following figure (Figure 8 from the original paper) shows the t-SNE result of gating network output on different terrains and gaits.

Fig. 8: t-SNE result of gating network output on different terrains and gaits.

Analysis of Figure 8 (t-SNE): The t-SNE (t-Distributed Stochastic Neighbor Embedding) plot visualizes the high-dimensional outputs of the gating network (expert weights) in a 2D space.

  • Gait Clustering: The plot shows clear separation between bipedal (Bip) and quadrupedal (Quad) tasks, forming distinct clusters. This indicates that the gating network effectively distinguishes between these fundamentally different locomotion modes and activates different sets of experts for each.
  • Task Clustering within Gaits:
    • Within the quadrupedal cluster, tasks like Slope (q) and Pit (q) (which often share similar underlying control strategies with basic plane walking) cluster closely together.
    • In contrast, Bar (q), Baffle (q), and Stair (q) (which require more distinct and complex gaits like high leg lifts or crawling) are further apart, forming their own sub-clusters.
  • Conclusion: The t-SNE results visually confirm the expert specialization. The gating network learns to produce distinct activation patterns (expert weights) for different tasks, effectively routing diverse behaviors to specialized experts and reflecting the underlying kinematic and dynamic requirements of each task. A minimal sketch of this visualization step follows this list.

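A minimal sketch of how such a t-SNE plot could be produced from logged gating outputs, assuming scikit-learn and matplotlib; the perplexity, random seed, and plotting details are assumptions, since the paper does not report its t-SNE settings.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_gating_tsne(gating_weights, task_labels, perplexity=30, seed=0):
    """Project logged gating outputs (num_samples, num_experts) to 2D and color by task."""
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=seed).fit_transform(np.asarray(gating_weights))
    labels = np.asarray(task_labels)
    for task in np.unique(labels):
        mask = labels == task
        plt.scatter(emb[mask, 0], emb[mask, 1], s=5, label=task)
    plt.legend()
    plt.title("t-SNE of gating network outputs")
    plt.show()
```
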
6.1.5. Skill Composition

Leveraging the identified expert specialization, the authors demonstrate the ability to compose new skills.

The following figure (Figure 9 from the original paper) shows a manually designed new dribbling gait by selecting two experts.

Fig. 9: Manually designed new dribbling gait by selecting two experts.

Analysis of Figure 9:

  • The paper identifies experts specializing in balancing (which helps lift the robot's body but limits agility) and crossing (for lifting a front leg for obstacle traversal).
  • By manually selecting these two specific experts, doubling the gating weight of the crossing expert, and masking out all other experts, a novel dribbling gait is synthesized zero-shot (without further training); a minimal sketch of this gating-weight override follows this list.
  • Figure 9 illustrates the robot executing this new gait, periodically using one front leg to "kick" a ball while maintaining balance and walking effectively.
  • Significance: This demonstrates the interpretability and composability of MoE-based policies. The specialized experts are not just abstract network components but represent identifiable, reusable locomotion skills. This allows for flexible construction of new behaviors, a significant advantage over black-box neural networks.

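A minimal sketch of that manual override, assuming access to the gating weights at inference time; the expert indices are placeholders, and whether the modified weights are renormalized is not specified in the paper.

```python
import torch

def compose_dribbling_weights(w, balance_idx, crossing_idx):
    """Zero-shot skill composition: keep the balancing expert, double the crossing
    expert's gating weight, and mask out every other expert (indices are placeholders)."""
    w_new = torch.zeros_like(w)
    w_new[..., balance_idx] = w[..., balance_idx]
    w_new[..., crossing_idx] = 2.0 * w[..., crossing_idx]
    return w_new
```

The overridden weights would then replace the gating network's output when mixing expert actions, so no retraining is required.
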
6.1.6. Additional Experiment (Adaptation Learning)

An adaptation learning experiment further showcases the recomposability of pretrained experts for new tasks.

The following figure (Figure 10 from the original paper) shows MoE-Loco can quickly adapt to a three-footed gait by training a new expert.

Fig. 10: MoE-Loco can quickly adapt to a three-footed gait by training a new expert. 1) ground plane, 2) slope up, and slope down.

Analysis of Figure 10:

  • Task: The robot is tasked with learning a three-footed gait. This is a novel behavior requiring one leg to be consistently lifted while maintaining stability and locomotion.
  • Method: A newly initialized expert is introduced into the existing MoE framework. The parameters of the original experts are frozen, so only the new expert and the gating network are updated (see the sketch after this list).
  • Result: Figure 10 shows the robot successfully walking on both flat ground and slopes using only three feet.
  • Significance: This highlights MoE-Loco's efficient adaptability. The newly added expert only needs to learn the specific skill of lifting one leg. It can then leverage the existing walking and slope traversal capabilities already learned and encoded in the frozen original experts, guided by the updated gating network. This demonstrates a powerful form of transfer learning and modular adaptation, where only minimal parts of the network need to be trained for novel tasks.

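A minimal sketch of this adaptation setup, building on the illustrative SoftMoEPolicy above; the way the gating head is widened and the optimizer is restricted are assumptions about implementation details the paper does not spell out.

```python
import torch
import torch.nn as nn

def prepare_for_adaptation(policy, obs_dim, act_dim, hidden=256, lr=1e-4):
    """Freeze the pretrained experts, append one freshly initialized expert, and leave
    only the new expert and the gating network trainable."""
    for expert in policy.experts:
        for p in expert.parameters():
            p.requires_grad_(False)                      # keep pretrained skills intact
    new_expert = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                               nn.Linear(hidden, act_dim))
    policy.experts.append(new_expert)                    # the only expert that gets trained
    # Widen the gating head by one logit for the new expert, keeping the old weights.
    old_head = policy.gate[-1]
    new_head = nn.Linear(old_head.in_features, old_head.out_features + 1)
    with torch.no_grad():
        new_head.weight[:old_head.out_features] = old_head.weight
        new_head.bias[:old_head.out_features] = old_head.bias
    policy.gate[-1] = new_head
    # Optimize only the trainable parameters (new expert + gating network).
    trainable = [p for p in policy.parameters() if p.requires_grad]
    return policy, torch.optim.Adam(trainable, lr=lr)
```
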
6.2. Data Presentation (Tables)

All tables from the original paper (Tables I–IX) are transcribed in the relevant sections above, with each rendered in HTML or Markdown according to its structure (for example, the pairwise gradient tables, Tables III and IV, appear as Markdown tables in Section 6.1.2).

6.3. Ablation Studies / Parameter Analysis

The paper primarily conducts an ablation study by comparing Ours (with MoE) against Ours w/o MoE (standard MLP backbone with similar parameters). This is a direct ablation study for the MoE component. The results (Table I, Figure 4, Figure 6, Table III, Table IV) consistently show that MoE is crucial for performance, training efficiency, and gradient conflict alleviation, especially in complex multitask scenarios.

The effect of key hyperparameters, such as the number of experts ($N_{\mathrm{exp}} = 6$) or the PPO parameters, is not explicitly explored in separate ablation studies in the main text, but the chosen values are stated. The Probability Annealing Selection (PAS) mechanism for Stage 2 training involves one such parameter, the annealing factor $\alpha^t$, which helps the policy adapt gracefully to proprioception. The robustness of this mechanism is implicitly validated by the successful real-world deployment.
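This section only names PAS and its annealing factor $\alpha^t$; as a heavily hedged illustration, one common reading is that each step's latent is drawn from the privileged encoder with an annealed probability and from the proprioceptive-history encoder otherwise. The function below is a hypothetical sketch of that reading, not the paper's implementation.

```python
import random

def probability_annealing_selection(priv_latent, prop_latent, alpha, epoch):
    """Hypothetical PAS sketch: use the privileged latent with probability alpha**epoch
    (alpha in (0, 1)), so the policy is gradually weaned onto proprioception history."""
    p_priv = alpha ** epoch
    return priv_latent if random.random() < p_priv else prop_latent
```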

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces MoE-Loco, a novel Mixture of Experts (MoE) framework, for multitask locomotion in legged robots. This framework enables a single policy to command a quadruped robot to navigate a broad spectrum of challenging terrains (bars, pits, stairs, slopes, baffles) and switch seamlessly between quadrupedal and bipedal gaits. A key contribution is the demonstration that MoE effectively mitigates gradient conflicts inherent in multitask reinforcement learning, leading to superior training efficiency and overall performance both in simulation and zero-shot real-world deployment. Furthermore, the work highlights the emergent specialization of experts within the MoE architecture, offering interpretability and enabling novel capabilities like skill composition and efficient adaptation to new tasks (e.g., three-footed gait) by only updating the gating network.

7.2. Limitations & Future Work

The authors point out the following limitation and suggest future work:

  • Current Limitation: The current MoE-Loco approach is demonstrated primarily with blind locomotion (using only proprioception and privileged information in simulation).
  • Future Work: The authors propose extending this approach to incorporate sensory perception such as camera and Lidar. This would enhance adaptability in even more complex and dynamic tasks where direct environmental observation is crucial.

7.3. Personal Insights & Critique

MoE-Loco presents a compelling solution to a critical challenge in robotics: creating truly versatile and general-purpose locomotion policies. The integration of Mixture of Experts is a theoretically sound and empirically validated approach for multitask learning, and its application to legged locomotion is particularly impactful given the diversity of movements required.

Inspirations and Transferability:

  • Modular Learning: The MoE framework's ability to decompose and compose skills is highly inspiring. This modularity could be a blueprint for more interpretable and adaptable RL agents in other complex robotic tasks, such as manipulation (where different experts could handle grasping, pushing, or fine motor control) or human-robot interaction (where experts could specialize in different social cues or interaction types).
  • Efficient Adaptation: The experiment on three-footed gait adaptation, where only the gating network is updated while experts are frozen, showcases a powerful transfer learning paradigm. This could significantly reduce training time and data requirements for deploying robots in novel situations or learning new, related skills, making RL more practical for real-world deployment.
  • Beyond Locomotion: The gradient conflict alleviation mechanism is generalizable to any multitask RL problem where task objectives might conflict. This could be applied to multi-agent systems, multi-modal control, or any scenario demanding diverse outputs from a single RL agent.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Expert Number Selection: The paper mentions selecting N_exp as 6. While this number worked well, the methodology does not extensively discuss how this number was chosen or its sensitivity. A detailed ablation study on the optimal number of experts for different task complexities would strengthen the argument. Too few experts might lead to lingering gradient conflicts, while too many could introduce computational overhead and redundancy.
  • Dynamic Gating for Skill Composition: While manual skill composition is demonstrated, developing a neural network that dynamically adjusts the gating weights w[i] for synthesizing new skills (instead of manual selection) could be a powerful next step. This would move beyond human intuition and allow for automated skill discovery.
  • Complexity of Reward Design: The reward functions (Appendix V-A) are quite detailed and extensive, especially for bipedal locomotion. While necessary for high-performance RL, such intricate reward engineering can be a bottleneck. Investigating how MoE might simplify reward design or enable learning from simpler, more sparse rewards could be valuable.
  • Computational Cost of MoE: While MoE mitigates gradient conflicts, it does introduce additional computational costs, especially if all experts are dense networks and activated for every input (though typical MoE uses sparse activation). The paper focuses on parameter count equivalence to MLP but FLOPs (Floating Point Operations) per inference step could still be higher. A deeper analysis of the trade-off between performance gain and computational overhead would be insightful, especially for resource-constrained onboard robot computing.
  • Generalization to Unseen Tasks: The paper demonstrates skill composition for related tasks (dribbling, three-footed gait). It would be interesting to see how well the learned experts can be leveraged for entirely novel locomotion tasks that were not explicitly part of the training distribution, pushing the boundaries of true generalization and reusability.
