MoMa: Skinned motion retargeting using masked pose modeling
TL;DR Summary
MoMa introduces a novel skinned motion retargeting method that integrates skeleton-aware and shape-aware capabilities, effectively transferring animations across characters with different structures using a transformer-based auto-encoder and a face-based optimizer.
Abstract
Motion retargeting requires to carefully analyze the differences in both skeletal structure and body shape between source and target characters. Existing skeleton-aware and shape-aware approaches can deal with such differences, but they struggle when the source and target characters exhibit significant dissimilarities in both skeleton (like joint count and bone length) and shape (like geometry and mesh properties). In this work we introduce MoMa, a novel approach for skinned motion retargeting which is both skeleton and shape-aware. Our skeleton-aware module learns to retarget animations by recovering the differences between source and target using a custom transformer-based auto-encoder coupled with a spatio-temporal masking strategy. The auto-encoder can transfer the motion between input and target skeletons by reconstructing the masked skeletal differences using shared joints as a reference point. Surpassing the limitations of previous approaches, we can also perform retargeting between skeletons with a varying number of leaf joints. Our shape-aware module incorporates a novel face-based optimizer that adapts skeleton positions to limit collisions between body parts. In contrast to conventional vertex-based methods, our face-based optimizer excels in resolving surface collisions within a body shape, resulting in more accurate retargeted motions. The proposed architecture outperforms the state-of-the-art results on the Mixamo dataset, both quantitatively and qualitatively. Our code is available at: [Github link upon acceptance, see supplementary materials].
1. Bibliographic Information
1.1. Title
MoMa: Skinned motion retargeting using masked pose modeling
1.2. Authors
Giulia Martinelli *, Nicola Garau, Niccoló Bisagno, Nicola Conci
Affiliations: University of Trento, Via Sommarive 14, Trento, 38123, Italy; CNIT (Consorzio Nazionale Interuniversitario per le Telecomunicazioni), Via Sommarive 14, Trento, 38123, Italy
1.3. Journal/Conference
The paper does not explicitly state the full name of the journal or conference in the provided text. However, the keywords mention "CVU" which is likely a typo for "CVIU" (Computer Vision and Image Understanding), a reputable journal in the field of computer vision. Given the structure of the paper and the "ARTICLE INFO" section, it is presented as a journal publication. Computer Vision and Image Understanding (CVIU) is a well-regarded academic journal for research in computer vision, image analysis, and related fields.
1.4. Publication Year
2024
1.5. Abstract
Motion retargeting is a complex task that involves adapting animations between characters with different skeletal structures and body shapes. Existing methods often struggle when these dissimilarities are significant, for example, varying joint counts or mesh properties. This paper introduces MoMa, a novel approach for skinned motion retargeting that is both skeleton-aware and shape-aware. The skeleton-aware module uses a custom transformer-based auto-encoder with a spatio-temporal masking strategy to learn and recover differences between source and target skeletons, enabling retargeting even with varying numbers of leaf joints (non-homeomorphic skeletons). The shape-aware module employs a novel face-based optimizer to prevent collisions between body parts, which is more accurate than conventional vertex-based methods in resolving surface interpenetrations. MoMa achieves state-of-the-art quantitative and qualitative results on the Mixamo dataset, offering a robust solution for diverse character animation challenges.
1.6. Original Source Link
/files/papers/69607d92d6fd1ceb59987821/paper.pdf (Note: This is a relative path. The full link would depend on the base URL of the hosting platform). Publication status: The paper is published, with a publication date of 2024-09-14T00:00:00.000Z.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the effective and automated transfer of motion (motion retargeting) between two characters that exhibit significant differences in both their skeletal structure and body shape. This is a critical task in computer graphics and animation, with applications ranging from film and game production to human-computer interaction.
In the current field, motion retargeting faces several challenges:
- Skeletal Dissimilarities: Characters can have different joint counts (e.g., a human vs. a quadruped), varying bone lengths, and distinct skeletal topologies. Existing skeleton-aware methods are often limited to isomorphic skeletons (same joints and hierarchy) or homeomorphic skeletons (same number of end-effectors, possibly different numbers of intermediate joints), and fail on non-homeomorphic ones (different number of leaf joints).
- Shape Dissimilarities: Characters possess unique geometries and mesh properties (e.g., body proportions, clothing, accessories). When motion is transferred, the character's mesh might interpenetrate itself (self-collision) or other parts of the environment, leading to unrealistic or "broken" animations. Shape-aware methods exist, but they typically rely on vertex-based collision detection, which can lead to erratic movements or distortions.
- Manual Effort: Traditionally, motion retargeting, especially when dealing with complex dissimilarities and collision resolution, is a labor-intensive and time-consuming process performed manually by 3D artists.
- Lack of Explicit Mappings: The goal is to transfer motion without explicit mapping between skeletons or paired motion data for source and target characters, making the problem unsupervised.

The paper's key idea is to address these limitations with a single approach that is simultaneously skeleton-aware and shape-aware. It introduces a novel masked pose modeling technique to handle diverse skeletal topologies, including non-homeomorphic ones, and a face-based optimizer for more robust collision avoidance across body shapes.
2.2. Main Contributions / Findings
The paper's primary contributions and key findings are:
- Novel Skeleton and Shape-Aware Pipeline: Introduction of MoMa, a new end-to-end pipeline for skinned motion retargeting that explicitly accounts for both skeletal structure and body shape differences.
- First Approach for Non-Homeomorphic Skeletons: MoMa is the first motion retargeting method that can handle non-homeomorphic skeletons (characters with varying numbers of leaf joints) without requiring paired motion data or ad-hoc skeleton mappings. This significantly expands the applicability of automated retargeting.
- Novel Pose Masking Auto-Encoder: Development of a transformer-based auto-encoder that uses a spatio-temporal masking strategy to reconstruct masked portions of skeletal data. This allows the network to learn generalized representations for retargeting between diverse skeletal topologies by predicting missing joints based on shared reference points.
- Novel Face-Based Optimizer: Implementation of a face-based optimizer for resolving mesh interpenetrations (collisions). Unlike conventional vertex-based methods, this approach operates on triangular faces, which leads to more consistent and precise collision resolution, minimizing unwanted mesh surface deformations and producing more accurate retargeted motions.
- State-of-the-Art Performance: MoMa achieves state-of-the-art quantitative and qualitative results on the Mixamo dataset, outperforming existing methods in terms of both skeletal accuracy (MSE) and collision avoidance (FIE), as demonstrated by the combined metric (SCE).
- Framework for Real-World Motion Transfer: The proposed method also provides a framework for transferring motion from real-world videos to synthetic characters, showcasing its robustness and generalization ability on diverse datasets like CMU, LAFAN1, and SFV.

These findings collectively address critical gaps in motion retargeting research, enabling more robust, accurate, and automated animation transfer across a wider range of character types.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a beginner should be familiar with the following fundamental concepts:
- Motion Retargeting: This is the process of transferring an animation sequence from a source character (e.g., a human actor) to a target character (e.g., an avatar in a game or a different animal model). The goal is to adapt the motion such that it looks natural on the target character, despite differences in proportions, skeletal structure, or body shape, while preserving the original motion's dynamics.
- Character Representation:
  - Skeleton: The underlying hierarchical structure of a character, composed of joints (nodes, typically representing bone endpoints or articulation points) and bones (edges connecting joints). Each joint has a parent-child relationship, forming a kinematic chain. End-effectors or leaf joints are the outermost joints in a chain (e.g., fingertips, toes, tail tip).
  - Mesh: The visible surface geometry of a character, typically made of vertices (points in 3D space), edges (lines connecting vertices), and faces (planar surfaces, often triangles, formed by edges and vertices). This defines the character's body shape.
  - Skinning: The process of binding a mesh to a skeleton. When the skeleton moves, the mesh deforms accordingly. Linear Blend Skinning (LBS) is a common technique where each vertex of the mesh is influenced by a weighted sum of transformations from nearby bones.
- Skeletal Topologies:
  - Isomorphic Skeletons: Skeletons that have the exact same number of joints and the same hierarchical structure. Retargeting between these is often simpler, but still requires adapting to different bone lengths.
  - Homeomorphic Skeletons: Skeletons that share the same number of end-effectors (leaf joints) and have the same overall topology (e.g., a human with different numbers of intermediate spinal joints, but still two arms, two legs, one head). They might have different numbers of intermediate joints but the same "branching" structure at the extremities.
  - Non-Homeomorphic Skeletons: Skeletons that do not share the same number of end-effectors or have fundamentally different topologies (e.g., a human character and a character with a tail, or a spider). This is the most challenging case for motion retargeting.
- Transformers: A neural network architecture that has revolutionized natural language processing and computer vision.
  - Self-Attention Mechanism: A core component of transformers that allows the model to weigh the importance of different parts of the input sequence (e.g., different joints in a motion sequence) when processing each part. It calculates relevance scores between all input elements.
  - Encoder-Decoder Structure: Transformers often consist of an encoder that maps an input sequence to a latent representation, and a decoder that maps this latent representation to an output sequence.
  - Tokens: Discrete units of input data that transformers process (e.g., words in text, patches in images, or in this paper, individual joints or frames of motion).
- Auto-encoders: A type of neural network used for unsupervised learning of efficient data codings (representations). An auto-encoder tries to learn a function that is approximately equal to the identity function, $f(x) \approx x$. It has two main parts:
  - Encoder: Compresses the input data into a lower-dimensional latent space representation.
  - Decoder: Reconstructs the input data from the latent space representation. The goal is for the latent representation to capture the most important features of the input data.
- Mean Squared Error (MSE): A common loss function used in regression tasks and for evaluating the difference between predicted and true values. It calculates the average of the squared differences between corresponding elements of the prediction and the target. Lower MSE indicates better accuracy.
- Quaternions: A mathematical way to represent rotations in 3D space. They are preferred over Euler angles (which can suffer from gimbal lock) and rotation matrices (which require more parameters and orthonormalization constraints) for their compactness, efficiency, and numerical stability in animation. A unit quaternion (x, y, z, w) encodes a rotation about an axis by a given angle.
- Forward Kinematics (FK): A method in computer animation to calculate the positions of all joints in a kinematic chain (skeleton) given the initial position of the root joint and the relative rotations and lengths of all bones down the hierarchy. It determines the global position of each joint based on the transformations from its parent joints.
- Collision Detection / Interpenetration: The process of determining if two or more geometric objects (e.g., parts of a character's mesh) are overlapping or occupying the same space. Interpenetration refers to this unwanted overlap, which leads to unrealistic visuals in animation.
- Quasi-Newton Methods (e.g., L-BFGS): A class of numerical optimization algorithms that approximate the Hessian matrix (matrix of second-order partial derivatives) to find the minimum of a function. Limited-memory BFGS (L-BFGS) is a popular quasi-Newton method that is efficient for high-dimensional problems because it avoids explicitly computing and storing the full Hessian matrix; it keeps only a short history of past position and gradient updates. It is often used when exact second derivatives are too costly to compute.
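Three of the concepts above (quaternions, forward kinematics, and MSE) are combined later in the training loss, so a small worked example may help. The following is a minimal NumPy sketch, not the paper's implementation: the function names, the (x, y, z, w) component order, the parent-index encoding of the hierarchy, and the toy three-joint chain are all assumptions made for illustration.

```python
import numpy as np

def quat_to_matrix(q):
    """Rotation matrix for a quaternion given as (x, y, z, w)."""
    x, y, z, w = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])

def forward_kinematics(parents, offsets, quats):
    """Global joint positions for one frame, with the root at the origin.

    parents[i] is the parent index of joint i (-1 for the root), and parents
    must be listed before their children. offsets[i] is the bone vector from
    the parent to joint i; quats[i] is the local rotation of joint i.
    """
    n = len(parents)
    rot = np.zeros((n, 3, 3))
    pos = np.zeros((n, 3))
    for i, p in enumerate(parents):
        local = quat_to_matrix(quats[i])
        if p < 0:
            rot[i] = local
        else:
            rot[i] = rot[p] @ local                # accumulate rotations down the chain
            pos[i] = pos[p] + rot[p] @ offsets[i]  # place joint relative to its parent
    return pos

def position_mse(pos_a, pos_b):
    """Sum of squared coordinate differences divided by the number of joints."""
    return float(np.sum((pos_a - pos_b) ** 2) / pos_a.shape[0])

# Toy chain: root -> joint 1 -> joint 2, each bone one unit long along x.
parents = [-1, 0, 1]
offsets = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
identity = np.tile([0.0, 0.0, 0.0, 1.0], (3, 1))
bent = identity.copy()
bent[0] = [0.0, 0.0, np.sin(np.pi / 4), np.cos(np.pi / 4)]  # root turned 90 degrees about z

rest_pos = forward_kinematics(parents, offsets, identity)
bent_pos = forward_kinematics(parents, offsets, bent)
print(bent_pos)
print(position_mse(rest_pos, bent_pos))
```

Rotating only the root moves every descendant joint, which is why an error in one rotation propagates down the kinematic chain and why comparing positions rather than raw quaternions can be a more faithful measure of pose error.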
3.2. Previous Works
The paper frames its contributions by contrasting them with existing approaches in motion retargeting, broadly categorized into skeleton-aware and shape-aware methods, and drawing inspiration from masked modeling.
- Masked Modeling for Representation Learning:
  - Masked Language Modeling (MLM): Pioneered by BERT (Devlin et al., 2018; Liu et al., 2019), this technique involves masking out random words in a sentence and training a model to predict them. This self-supervised pre-training allows models to learn rich contextual representations of language.
  - Masked Image Modeling (MIM): Inspired by MLM, techniques like Masked Autoencoders (MAE) (He et al., 2022) and SimMIM (Xie et al., 2022) apply masking to image patches. Models are trained to reconstruct missing image patches from visible ones, enabling powerful self-supervised learning for visual representation.
  - Spatio-temporal Masked Modeling: Extensions to video, such as VideoMAE (Tong et al., 2022; Feichtenhofer et al., 2022), mask out both spatial patches and temporal frames, demonstrating effectiveness in learning video representations.
  - Relevance to MoMa: MoMa adopts a similar strategy by randomly masking a subset of skeleton joints both in space (which joints) and time (which frames) to train its auto-encoder. The key difference highlighted by MoMa is that, for poses, modeling the relationships between individual joints (each carrying only a few numerical values) is more critical and more challenging than modeling relationships between information-rich image patches.
- Skeleton-aware Motion Retargeting: These methods focus primarily on adapting the skeletal motion.
  - Basic Copy Rotation: As described by Aberman et al. (2020), this is a simple baseline where rotations from the source skeleton are directly copied to the target, assuming a common T-pose (a standardized neutral pose). It does not account for scale or bone length differences.
  - Methods for Isomorphic Skeletons: Many neural approaches, such as Neural Kinematic Networks (NKN) (Villegas et al., 2018) and PMNet (Lim et al., 2019), are designed for retargeting between isomorphic skeletons (same number of joints). They often learn to disentangle pose and movement. Zhang et al. (2023) and Villegas et al. (2021) also fall into this category.
  - Methods for Homeomorphic Skeletons: Skeleton-Aware Networks (SAN) by Aberman et al. (2020) explicitly introduced mechanisms to handle homeomorphic skeletons (same number of end-effectors but different numbers of intermediate joints). SAME (Lee et al., 2023) also tackles isomorphic and homeomorphic skeletons in a skeleton-agnostic manner.
  - Limitations of Previous Works: Crucially, most prior neural methods struggle to generalize to non-homeomorphic skeletons. While non-neural methods (Yamane et al., 2010; Seol et al., 2013) have explored non-homeomorphic retargeting, they often rely on paired motions or explicit skeleton mappings, which MoMa aims to avoid.
  - Example (Attention Mechanism in Transformers): Since transformers are foundational to MoMa's skeleton-aware module, it helps to recall the attention mechanism. The core idea is to compute a weighted sum of Value vectors, where the weights are determined by the similarity between Query and Key vectors. Attention is defined as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
    - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing linear projections of the input embeddings. For self-attention, $Q$, $K$, and $V$ are derived from the same input.
    - $QK^T$ calculates the dot product similarity between queries and keys.
    - $d_k$ is the dimension of the key vectors, used for scaling to prevent the dot products from becoming too large and pushing the softmax into regions with extremely small gradients.
    - $\mathrm{softmax}$ normalizes the scores to create probability distributions for weights.
    - The output is a weighted sum of the Value vectors, allowing the model to focus on relevant input parts.
- Shape-aware Motion Retargeting: These methods explicitly incorporate the character's mesh into the optimization process to avoid unrealistic interpenetrations.
  - Early Methods: Initial methods like NKN (Villegas et al., 2018) and PMNet (Lim et al., 2019) were not shape-aware and used simple Linear Blend Skinning (LBS) without collision resolution, leading to unrealistic skinned characters.
  - Vertex-based Collision Detection: More recent methods, such as those by Villegas et al. (2021) and Zhang et al. (2023), are shape-aware. They detect contacts between different parts of the mesh using individual vertices. Villegas et al. (2021) use an encoder-decoder for optimization, while Zhang et al. (2023) use an attractive/repulsive field mechanism.
  - Limitations of Previous Works: While these methods generalize to diverse body shapes, their vertex-based approach can lead to localized, erratic deformations as individual vertices respond to collisions. Zhang et al. (2023) note that their method only resolves body-limb collisions, not limb-limb or body-body ones.
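The scaled dot-product attention formula given earlier in this section can be made concrete with a few lines of NumPy. This is a minimal single-head sketch for illustration only; it omits the learned projections and multi-head structure, and the function names are hypothetical rather than taken from MoMa's code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention.

    Q: (num_queries, d_k), K: (num_keys, d_k), V: (num_keys, d_v).
    Returns the attended values (num_queries, d_v) and the weight matrix.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(7, 8))
V = rng.normal(size=(7, 3))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

In a pose setting, each row of `Q`, `K`, and `V` would correspond to one joint token, so the weight matrix expresses how much each joint attends to every other joint.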
3.3. Technological Evolution
The evolution of motion retargeting has progressed from simple rule-based or manual adjustments to increasingly sophisticated neural network approaches:
- Manual/Rule-based Retargeting: Early methods involved artists manually adjusting poses or simple copy-rotation techniques. These were labor-intensive and lacked flexibility for diverse characters.
- Skeletal-based Retargeting for Isomorphic Skeletons: Initial neural approaches (e.g., NKN, PMNet) focused on learning the mapping between isomorphic skeletons, primarily transferring rotations and poses. These methods often overlooked complex topological differences and body shape.
- Skeletal-based Retargeting for Homeomorphic Skeletons: Advances like SAN started addressing homeomorphic skeletons, allowing for variations in intermediate joint counts while maintaining similar end-effector structures. This was a step towards more flexible retargeting.
- Shape-aware Retargeting: Recognizing the importance of visual realism, researchers began incorporating mesh information to prevent interpenetrations. These methods (e.g., Villegas et al. 2021, Zhang et al. 2023) introduced collision detection, but typically relied on vertex-based optimizations.
- MoMa's Position: MoMa combines and enhances both skeleton-aware and shape-aware aspects:
  - It extends skeleton-aware capabilities to the challenging domain of non-homeomorphic skeletons, a gap largely unaddressed by neural methods without paired data. It does this by leveraging masked pose modeling inspired by self-supervised learning.
  - It refines shape-aware techniques by introducing a more robust face-based optimizer, which provides a more consistent and precise way to handle mesh collisions compared to previous vertex-based methods.

This paper positions itself as a comprehensive solution that pushes the boundaries of automated motion retargeting to handle a wider spectrum of character variations more effectively and realistically.
3.4. Differentiation Analysis
Compared to the main methods in related work, MoMa introduces several core differences and innovations:
- Handling Non-Homeomorphic Skeletons:
  - Previous Limitation: Most state-of-the-art neural methods (e.g., NKN, PMNet, R2ET) are designed for isomorphic skeletons. Even advanced methods like SAN and SAME, which handle homeomorphic skeletons, cannot manage characters with fundamentally different numbers of leaf joints (non-homeomorphic). Non-neural methods for non-homeomorphic skeletons often require paired motion or explicit mappings.
  - MoMa's Innovation: MoMa is the first neural approach that can perform motion retargeting between non-homeomorphic skeletons without requiring paired motion data or manual mappings. This is achieved through its pose masking auto-encoder, which learns a generalized representation by recovering masked joints, making it adaptable to varying skeletal structures.
- Face-based vs. Vertex-based Collision Optimization:
  - Previous Limitation: Existing shape-aware methods (e.g., Villegas et al. 2021, R2ET) typically use vertex-based approaches for collision detection and resolution, optimizing individual vertex positions to avoid interpenetration. This can lead to localized, sometimes erratic, mesh deformations or a loss of surface consistency as vertices respond independently. Moreover, some vertex-based methods only address specific types of collisions (e.g., body-limb but not limb-limb).
  - MoMa's Innovation: MoMa introduces a novel face-based optimizer. By detecting and resolving collisions at the level of triangular faces (the fundamental building blocks of a mesh surface) rather than individual vertices, it ensures a more coherent and consistent deformation of the mesh surface. This approach minimizes unwanted mesh distortions, leads to more accurate and physically plausible retargeted motions, and can solve all types of collisions (body-body, limb-limb, etc.).
- Simplicity and Efficiency of Skeleton-aware Module:
  - Previous Complexity: Many prior methods, especially those dealing with complex skeletal differences, rely on sophisticated techniques like cycle consistency losses or adversarial losses (e.g., in R2ET and Villegas et al. 2021) to ensure robust retargeting.
  - MoMa's Innovation: MoMa's pose masking auto-encoder provides a simpler baseline. By leveraging transformers and masked modeling, it learns to reconstruct missing joint information and generalize across diverse skeletons without complex cycle consistency or adversarial training schemes, while still achieving superior results.

In essence, MoMa differentiates itself by offering a more generalized skeletal retargeting capability (non-homeomorphic) combined with a more robust and visually consistent shape-aware collision resolution mechanism, all within a conceptually simpler training framework.
4. Methodology
4.1. Principles
The core idea behind MoMa is to tackle skinned motion retargeting by simultaneously addressing both skeletal structure and body shape differences through two primary principles:
- Skeleton-aware Masked Pose Modeling: The method leverages a transformer-based auto-encoder with a spatio-temporal masking strategy. This principle is inspired by masked language modeling and masked image modeling, where the model learns to reconstruct missing parts of the input data. In MoMa's context, it learns to predict the missing (or differing) joints in a target skeleton by observing the motion of shared joints. This allows for robust retargeting across diverse skeletal topologies, including non-homeomorphic ones, by learning to fill in the "gaps" or adapt to "extra" joints.
- Shape-aware Face-based Collision Optimization: Recognizing that skeletal retargeting alone can lead to mesh interpenetrations, MoMa introduces a novel face-based optimizer. This principle aims to refine the retargeted skeletal positions by minimizing collisions between the triangular faces of the character's mesh. Unlike vertex-based methods, optimizing at the face level provides a more consistent and physically plausible deformation, preventing localized distortions and ensuring a visually realistic animated character.

These two principles work in conjunction: the skeleton-aware module provides an initial, topologically adapted motion, and the shape-aware module then refines this motion to be physically valid and collision-free on the target character's specific geometry.
4.2. Core Methodology In-depth
MoMa's methodology involves a three-step process: Skeleton motion retargeting, Skinning and collision detection, and Mesh optimization to solve collisions. Before diving into these steps, let's understand how characters and animations are represented.
4.2.1. Character and Animation Representation
A character $k$ is defined by its skeleton and its mesh.
- Skeleton:
  - Static Representation ($S_k$): This component describes the inherent, unchanging structure of the skeleton. It contains information about the offsets (bone lengths) between joints and implicitly defines the skeleton's topology (e.g., parent-child relationships, graph structure). For a character, $S_k$ is constant throughout an animation. It is represented as an $N_k \times 3$ array, where $N_k$ is the number of joints and the 3 channels are xyz coordinates (presumably bone vectors or initial joint positions relative to their parents).
  - Motion Representation ($Q_k$): This component captures the dynamic aspect of the animation. At any specific frame, $Q_k$ consists of the relative rotations of each of the $N_k$ joints. These rotations are expressed as 4D quaternions to ensure numerical stability and avoid issues like gimbal lock. The motion representation therefore stores one 4-component quaternion per joint per frame, over $N_k$ joints and a window of $W$ frames.
- Mesh Representation: This component defines the body shape of the character. It consists of a set of vertices, edges, and faces. For an animation, the mesh vertices change position through skinning to reflect the skeleton's movement. The mesh is expressed as a set of vertices, each with xyz coordinates, so its size is the number of vertices times 3.

In summary, an animation for a character $k$ is denoted as $A(S_k, Q_k)$. The goal of retargeting is to transform this into $A(S_t, Q_t)$ for a target character $t$.
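As a rough illustration of the representation just described, the container below groups the static offsets, the per-frame quaternions, and the mesh vertices and checks that their shapes agree. This is a sketch under assumptions: the class name, field names, and the (frames, joints, 4) axis order are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class SkinnedAnimation:
    """Hypothetical container for one character's animation data."""
    offsets: np.ndarray    # static part S_k, shape (N, 3)
    rotations: np.ndarray  # motion part Q_k, shape (W, N, 4), assumed axis order
    vertices: np.ndarray   # rest-pose mesh vertices, shape (V, 3)

    def __post_init__(self):
        n_joints = self.offsets.shape[0]
        if self.offsets.shape != (n_joints, 3):
            raise ValueError("offsets must have shape (N, 3)")
        if self.rotations.ndim != 3 or self.rotations.shape[1:] != (n_joints, 4):
            raise ValueError("rotations must have shape (W, N, 4)")
        if self.vertices.ndim != 2 or self.vertices.shape[1] != 3:
            raise ValueError("vertices must have shape (V, 3)")

    @property
    def num_joints(self):
        return self.offsets.shape[0]

    @property
    def num_frames(self):
        return self.rotations.shape[0]

# A 22-joint character animated over a 16-frame window with identity rotations.
identity_quats = np.tile([0.0, 0.0, 0.0, 1.0], (16, 22, 1))
anim = SkinnedAnimation(
    offsets=np.zeros((22, 3)),
    rotations=identity_quats,
    vertices=np.zeros((500, 3)),
)
print(anim.num_joints, anim.num_frames)
```

Keeping the static and dynamic parts in separate arrays mirrors the split between $S_k$ and $Q_k$: only the rotations change from frame to frame, while the offsets describe the character itself.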
4.2.2. Skeleton-aware Pose Masking Auto-encoder
This module (illustrated in Figure 3) is responsible for transferring the motion from the source skeleton $S_k$ to the target skeleton $S_t$, handling differences in joint count and bone length.

Schematic of the workflow of the skeleton-aware pose masking auto-encoder.
Fig. 3. Skeleton-aware pose masking auto-encoder. Starting from an input animation $A(S_k, Q_k)$, an encoder embeds each input joint into a set of tokens $E_k$. Next, we randomly mask (black squares) a subset of $E_k$ and concatenate the remaining missing joints to include all the possible topologies. A transformer then models the relationships between the embedded joints, producing a latent representation in which all the masked joints have been predicted. Finally, the decoder extracts the super-skeleton motion from the latent space using the learnt static token ($S_k$ at training time and $S_t$ at test time), from which we can derive the reconstructed input motion and the retargeted motion $A(S_t, Q_t)$. We train our auto-encoder to predict the masked joints by enforcing an MSE loss between the input $A(S_k, Q_k)$ and the reconstructed motion.
Objective: Given a skeletal motion $A(S_k, Q_k)$ for an input character $k$, the auto-encoder retargets it to obtain the same motion $A(S_t, Q_t)$ for a target character $t$.
1. Encoder:
- The encoder processes the dynamic part of the input animation, $Q_k$.
- Each joint of $Q_k$ for every frame is tokenized. Tokenization here involves converting the quaternion representation of each joint's rotation into a fixed-size vector embedding.
- This is done using a single linear layer followed by a Leaky ReLU activation function.
- The output of this tokenization for the entire animation is an embedded sequence $E_k$ holding one embedding vector per joint per frame, i.e. $W \times N_k$ tokens, where $W$ is the window length and $N_k$ is the number of joints in the input skeleton.
2. Pose Masking Strategy:
- Inspired by masked image modeling, a subset of the embedded joints in $E_k$ is randomly masked. This means their values are replaced with a special mask token vector of the same size as a joint embedding. The paper mentions replacing them with a vector that "can contain different values depending on the chosen masking strategy," implying flexibility in how the mask token is defined (e.g., zero vector, learnable vector, random noise).
- To handle diverse skeletal topologies, especially non-homeomorphic ones, the method concatenates a set of empty tokens to the masked embedded joints, padding the sequence up to the maximum number of joints found across all characters in the dataset. This operation expands the input representation to a "super-skeleton" size, ensuring that any input animation can be mapped into a latent representation large enough to accommodate all possible skeleton topologies.
- The resulting latent representation therefore has one token per super-skeleton joint per frame. Each of the newly added empty tokens is also initialized as a masked vector.
3. Spatial-Temporal Positional Embedding:
- Positional embeddings are crucial for transformers to understand the order and relationships between elements in a sequence, since self-attention by itself has no built-in notion of token order.
- MoMa uses two separable learnable positional embeddings:
  - Spatial Embedding: With one learnable vector per joint of the super-skeleton, it models the spatial relationships between the joints, regardless of their specific character. This embedding is repeated for each of the $W$ frames. Its goal is to help the network learn the hierarchical graph representation of joints (as shown in Figure 2).
  - Temporal Embedding: With one learnable vector per frame, it models the temporal relationships between the $W$ frames. This embedding is repeated for each of the joints.
- The total spatial-temporal embedding is given by the sum of the two, yielding one positional vector per joint per frame. This combined embedding is added to the masked embedded animation to provide positional context to the transformer. Separating spatial and temporal embeddings prevents the overall embedding size from becoming too large in 3D.
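The separable embedding described above amounts to a broadcast sum of a per-joint table and a per-frame table. The NumPy sketch below illustrates this and compares the number of stored parameters with a joint (frame, joint) table; the sizes and names are assumptions, and random arrays stand in for learned parameters.

```python
import numpy as np

def spatio_temporal_embedding(spatial, temporal):
    """Combine separable positional embeddings by broadcasting.

    spatial: (N_max, D), one vector per super-skeleton joint.
    temporal: (W, D), one vector per frame.
    Returns (W, N_max, D): entry [w, n] equals temporal[w] + spatial[n].
    """
    return temporal[:, None, :] + spatial[None, :, :]

rng = np.random.default_rng(0)
W, N_MAX, D = 16, 28, 32                    # assumed sizes
spatial = rng.normal(size=(N_MAX, D))       # stand-in for learned parameters
temporal = rng.normal(size=(W, D))

pos = spatio_temporal_embedding(spatial, temporal)
separable_params = spatial.size + temporal.size  # (N_max + W) * D
joint_params = W * N_MAX * D                     # full table, for comparison
print(pos.shape, separable_params, joint_params)
```

With these example sizes the separable form stores 1,408 values instead of 14,336, which is the saving the text attributes to keeping the spatial and temporal embeddings separate.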
4. Encoding Transformer:
- The input to the encoding transformer is the masked embedded animation with the spatial-temporal embedding added.
- The transformer itself is a custom Vision Transformer (ViT) architecture (Dosovitskiy et al., 2020), adapted for spatio-temporal joint information. It features multi-head attention mechanisms, and the paper notes "different activation functions and number of attention heads" are used.
- The transformer processes this spatio-temporal joint information to output a latent space representation that contains the embedded reconstructed motion. A static token is also implicitly used during encoding; it models the static part of the skeleton and acts as a selector for the joints corresponding to a given character.
5. Decoder:
- The decoder (prediction head) is simpler, consisting of a single linear layer.
- Its output is an animation that can be conceptualized as the animation of a "super-skeleton" encompassing all possible skeletal topologies in the dataset.
6. Training:
- Extraction for Loss: At training time, the network needs to reconstruct the original input motion. From the super-skeleton motion, the specific motion for the input character is extracted. This is facilitated by the learned static token, which acts as a selector for the relevant joints of that character.
- Loss Function: The auto-encoder is trained to reconstruct all input joints (both masked and unmasked) using the Mean Squared Error (MSE) loss. The loss is applied on the 3D spatial positions of the joints, computed using a Forward Kinematic (FK) layer. This is important because applying the loss directly on quaternions can lead to accumulated errors down the kinematic chain.

The training loss is defined as: $ \mathcal{L}_{MSE}(A(Q_k), A'(Q_k)) = \frac{\sum_{w=1}^{W} \sum_{n=1}^{N_k} \left( FK(S_k, j_n) - FK(S_k, j'_n) \right)^2}{W \times N_k} $ Where:
- $A(Q_k)$: The ground truth input motion representation for character $k$.
- $A'(Q_k)$: The motion representation reconstructed by the auto-encoder for character $k$.
- $w$: Index of the frame, ranging from 1 to $W$ (window length).
- $n$: Index of the joint, ranging from 1 to $N_k$ (number of joints in character $k$'s skeleton).
- $FK(S_k, j_n)$: The Forward Kinematics function that calculates the 3D spatial position of the $n$-th joint ($j_n$) in the skeleton $S_k$. This takes the static skeleton structure ($S_k$) and the joint's quaternion rotation ($j_n$) to output its global 3D position.
- $FK(S_k, j'_n)$: The Forward Kinematics function for the reconstructed $n$-th joint ($j'_n$).
- The subtraction calculates the difference in 3D position for each joint. This difference is squared and summed across all joints and all frames.
- The sum is then normalized by $W \times N_k$ to get the average Mean Squared Error per joint per frame.
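To make the FK-based loss concrete, here is a small pure-Python sketch under stated assumptions: quaternions are `(w, x, y, z)` unit tuples, the skeleton is given as a `parents` list (with `-1` for the root and parents listed before children) plus per-joint bone `offsets`, and all helper names are illustrative. It is not the authors' code.

```python
def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q via q * (0, v) * conj(q)."""
    w, x, y, z = q
    rotated = quat_mul(quat_mul(q, (0.0, v[0], v[1], v[2])), (w, -x, -y, -z))
    return rotated[1:]


def forward_kinematics(parents, offsets, rotations):
    """Global 3D joint positions from local rotations and static bone offsets."""
    global_rot, global_pos = [], []
    for n, parent in enumerate(parents):
        if parent < 0:
            global_rot.append(rotations[n])
            global_pos.append(tuple(offsets[n]))
        else:
            offset = quat_rotate(global_rot[parent], offsets[n])
            global_pos.append(tuple(p + o for p, o in zip(global_pos[parent], offset)))
            global_rot.append(quat_mul(global_rot[parent], rotations[n]))
    return global_pos


def fk_mse(parents, offsets, motion_gt, motion_pred):
    """Squared 3D position error summed over frames and joints, divided by W * N_k."""
    total = 0.0
    for rot_gt, rot_pred in zip(motion_gt, motion_pred):
        pos_gt = forward_kinematics(parents, offsets, rot_gt)
        pos_pred = forward_kinematics(parents, offsets, rot_pred)
        total += sum((a - b) ** 2
                     for g, p in zip(pos_gt, pos_pred) for a, b in zip(g, p))
    return total / (len(motion_gt) * len(parents))


# Toy two-joint chain: the prediction rotates the root by 90 degrees about z.
half = 0.5 ** 0.5
identity = (1.0, 0.0, 0.0, 0.0)
parents, offsets = [-1, 0], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(fk_mse(parents, offsets, [[identity, identity]], [[(half, 0.0, 0.0, half), identity]]))
```

Passing a different skeleton's `offsets` to `forward_kinematics` while keeping the same rotations corresponds to how the same FK layer is reused with the target skeleton at test time.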
7. Test Time:
- Given the reconstructed super-skeleton motion from the decoder, the motion for any specific target skeleton is obtained.
- This is done by applying the same Forward Kinematic layer using the target skeleton's static representation. This means the learned motion (in terms of joint quaternions, $j'_n$) is applied to the target's specific bone lengths and hierarchy to generate its 3D joint positions.
4.2.3. Shape-aware Face-based Optimizer
This module (illustrated in Figure 4) refines the retargeted skeletal motion to avoid mesh interpenetrations and ensure a visually realistic result. It operates on a per-frame basis, iteratively adjusting joint positions.

This illustration shows the workflow of the shape-aware face-based optimizer. Part (a) shows the retargeted skeletal motion and the corresponding T-pose mesh, which are skinned to produce the full animation (b). At each iteration the mesh may exhibit collisions (c), which are detected and weighted. The optimizer adjusts the skeleton position to minimize the loss until a collision-free retargeted motion (d) is obtained.
Fig. 4. Shape-aware face-based optimizer. (a) Given the retargeted skeletal motion $A(S_t, Q_t)$ and the corresponding mesh in T-pose, we apply the skinning to obtain the full animation $A((S_t, Q_t), M_k)$. (b) During each iteration the mesh can display collisions, which (c) are detected and weighted by the collision penalizer. The face-based optimizer minimizes the loss $\mathcal{L}_O(S_t, Q_t)$ by adapting the skeleton position $(S_t, Q_t)$ until obtaining (d) a retargeted motion without collisions.
1. Skinning:
- After the skeleton-aware module provides the retargeted skeletal motion for the target character, this motion needs to be applied to the target character's mesh.
- For each frame, a Linear Blend Skinning (LBS) process is applied. LBS takes the vertices of the target mesh in its T-pose (a canonical neutral pose) and the transformations (rotations and translations) derived from the skeleton to compute the new positions of the mesh vertices. This yields the animated mesh.
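Linear Blend Skinning itself is standard: each posed vertex is the weight-blended result of applying every bone's rigid transform to the rest-pose vertex. The sketch below assumes each bone transform is a 3x4 matrix `[R | t]` that already maps rest-pose coordinates to posed coordinates, and that each vertex's weights sum to one; the function names are illustrative.

```python
def apply_transform(matrix, vertex):
    """Apply a 3x4 rigid transform [R | t] (row-major) to a 3-vector."""
    return tuple(sum(matrix[r][c] * vertex[c] for c in range(3)) + matrix[r][3]
                 for r in range(3))


def linear_blend_skinning(rest_vertices, weights, bone_transforms):
    """Blend per-bone transformed positions of each rest-pose vertex.

    weights[i][j] is the influence of bone j on vertex i.
    """
    skinned = []
    for vertex, weight_row in zip(rest_vertices, weights):
        blended = [0.0, 0.0, 0.0]
        for weight, matrix in zip(weight_row, bone_transforms):
            if weight == 0.0:
                continue
            moved = apply_transform(matrix, vertex)
            for axis in range(3):
                blended[axis] += weight * moved[axis]
        skinned.append(tuple(blended))
    return skinned


# Toy usage: one bone stays put, the other translates by 2 along y.
identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
shift_y = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 2.0], [0.0, 0.0, 1.0, 0.0]]
print(linear_blend_skinning([(1.0, 0.0, 0.0)], [(0.5, 0.5)], [identity, shift_y]))
```

A vertex influenced equally by both bones ends up halfway between the two transformed positions.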
2. Collision Penalizer:
- The core of this module is to detect and penalize collisions (interpenetrations) between different parts of the character's mesh.
- Face-based Approach: Unlike vertex-based methods, MoMa's approach is face-based. It identifies colliding triangles using a Bounding Volume Hierarchy (BVH) (Karras, 2012; Pavlakos et al., 2019), which is an efficient data structure for accelerating collision detection queries.
- Volumetric Distance Field: For each triangular face on the external surface of the mesh, a conic 3D volumetric distance field is built, along with its normal vector. This field conceptually defines a "forbidden zone" around each face.
- Bidirectional Interpenetration: When two triangles collide, the interpenetration is considered bi-directional. This means the vertices of one triangle are treated as intruders into the distance field of the other (the receiver triangle), and vice versa.
- Collision Term: The collision term to be minimized quantifies this interpenetration. Because vertices are grouped in triangular faces, a vertex cannot respond individually to a collision, which, as the paper argues, leads to improved mesh consistency.
The collision term is defined as: $ \xi(S_t, Q_t) = \sum_{(A_i, A_r)} \bigg\{ \sum_{v_i \in A_i} \left\| -\varPsi_{A_r}(v_i)\, \hat{v}_i \right\|^2 + \sum_{v_r \in A_r} \left\| -\varPsi_{A_i}(v_r)\, \hat{v}_r \right\|^2 \bigg\} $ Where:
- $(A_i, A_r)$: Represents a pair of colliding triangles or regions on the mesh (e.g., from different body parts like arm and torso).
- $v_i$: A vertex belonging to the "intruder" triangle or region $A_i$.
- $v_r$: A vertex belonging to the "receiver" triangle or region $A_r$.
- $\varPsi_{A_r}(v_i)$: The volumetric distance field associated with receiver triangle $A_r$, evaluated at the position of intruder vertex $v_i$. This value is negative if $v_i$ is inside the forbidden zone of $A_r$, indicating interpenetration.
- $\hat{v}_i$: The normal vector of the intruder vertex $v_i$ (or the normal of the face it belongs to).
- $\left\| -\varPsi_{A_r}(v_i)\, \hat{v}_i \right\|^2$: This term represents the squared penetration depth of $v_i$ into $A_r$, weighted by its normal. The negative sign makes it a positive penalty for interpenetration. The sum is performed over all vertices in $A_i$ that penetrate $A_r$, and vice versa for $A_r$ penetrating $A_i$.
- The overall sum accumulates penalties from all colliding pairs of regions.
The distinction between vertex-based and face-based methods is highlighted in Figure 5:
This figure illustrates collision detection between two meshes (Mesh 1 and Mesh 2). Panel (a) shows the vertex-based approach, which detects collisions using a global distance field; panel (b) shows the face-based approach, which uses the surfaces of the colliding triangles and provides a more efficient one-to-one metric.
Fig. 5. (a) Vertex-based methods use a global distance field between each vertex and the others, thus providing a computationally expensive one-to-many metric. (b) Face-based methods identify collisions with colliding triangle surfaces, thus a more efficient one-to-one metric.
- Self-Contact Filtering: To improve the optimization process, certain self-contacts (e.g., an eye colliding with the skull) that are inherent to the character's T-pose or occur between kinematically neighboring parts (e.g., neck and torso, upper arm and lower arm) are detected and excluded from the collision term. This prevents the optimizer from trying to resolve non-physical or pre-existing contacts. This step involves setting all input meshes to T-pose and identifying these inherent contacts.
3. Optimization:
- A Limited-memory BFGS (L-BFGS) optimizer (Nocedal and Wright, 1999) is chosen. L-BFGS is a quasi-Newton method that is efficient for high-dimensional optimization: it approximates the Hessian matrix without computing it explicitly, making it suitable for adapting joint positions.
- The optimizer minimizes a loss function for each frame of the animation. This loss balances two objectives: resolving collisions and preserving the original motion dynamics.
The optimization loss is given by: $ \mathcal{L}_O(S_t, Q_t) = \lambda\, \xi(S_t, Q_t) + (1 - \lambda)\, \mathcal{L}_{MSE}((S_t, Q_t), (S_t, Q_{t-1})) $ Where:
- $\mathcal{L}_O(S_t, Q_t)$: The total optimization loss for the target skeleton in its current state $(S_t, Q_t)$.
- $\lambda$: A balancing weight (hyper-parameter) between 0 and 1. It determines the trade-off between collision avoidance and motion preservation.
- $\xi(S_t, Q_t)$: The collision term discussed above. Minimizing this term drives the optimizer to resolve mesh interpenetrations.
- $\mathcal{L}_{MSE}((S_t, Q_t), (S_t, Q_{t-1}))$: An MSE loss term that encourages temporal consistency. It measures the difference between the current frame's joint positions $(S_t, Q_t)$ and the previous frame's optimized joint positions $(S_t, Q_{t-1})$. This term helps prevent erratic movements from frame to frame and keeps the motion close to the original dynamics by referencing the prior frame's solution.
- The optimizer adapts the joint positions in $Q_t$ (the motion representation) to minimize this combined loss, ultimately producing a retargeted motion that is both collision-free and preserves the original motion's dynamics.
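The weighting in the per-frame objective can be illustrated with a toy evaluation. This sketch only combines an already-computed collision value with a temporal MSE; the function name, the per-joint normalization of the MSE, and the numbers are assumptions, and no L-BFGS step is performed here.

```python
def optimization_loss(collision_term, joints_now, joints_prev, lam):
    """lam * collision + (1 - lam) * MSE between current and previous joint positions."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    squared = sum((a - b) ** 2
                  for p, q in zip(joints_now, joints_prev) for a, b in zip(p, q))
    mse = squared / len(joints_now)  # assumed: averaged per joint
    return lam * collision_term + (1.0 - lam) * mse


# Toy usage: one joint moved by 1 unit since the previous frame, collision value 2.0.
now = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
prev = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
for lam in (0.0, 0.7, 1.0):
    print(lam, optimization_loss(2.0, now, prev, lam))
```

At `lam = 0` only the temporal term counts, at `lam = 1` only the collision term, and intermediate values interpolate linearly between the two.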
5. Experimental Setup
5.1. Datasets
The authors evaluate MoMa on a variety of datasets to demonstrate its capabilities across different skeletal topologies and motion types.
- Mixamo Dataset (Adobe, 2020): This is the primary dataset used for quantitative evaluation, especially for comparing retargeting between isomorphic, homeomorphic, and non-homeomorphic characters.
- Characteristics: It consists of a diverse set of 3D characters and pre-recorded motion clips. Retargeting in Mixamo is typically performed by copying rotations.
- Setup:
- Isomorphic Skeletons: 7 characters for training, 11 for testing.
- Homeomorphic Skeletons: A subset of characters from Mixamo (excluding Liam, Pearl, Jasper which are no longer available). SAN is used as a baseline.
- Non-Homeomorphic Skeletons: Two CC-Licensed characters from Sketchfab (animated with Mixamo) combined with characters from the isomorphic experiments. These characters would typically have extra end-effectors like tails or different limb configurations.
- Motion Data: An average of 1250 motions per training character, 100 for validation, and 60 for each test character. The test set includes unseen characters and unseen motions.
- Why chosen: It's a standard benchmark in motion retargeting, allowing comparison with prior work. However, the authors note inconsistencies in GT (ground truth) data, sometimes showing interpenetrations.
-
Ubisoft La Forge Animation Dataset (LAFAN1) (Harvey et al., 2020):
- Characteristics: Comprises human motion data, featuring 5 subjects and 77 different motion sequences.
- Why chosen: Used for evaluating the network's ability to reconstruct masked segments of motion, showcasing robustness on human motion data.
-
Quadruped Dataset (Zhang et al., 2018):
- Characteristics: Includes 52 unique sequences of dog motions, covering various activities like idle, walk, run, sit, stand, and jumps.
- Why chosen: Used to test generalization to non-humanoid characters, specifically quadrupedal motion.
-
Carnegie Mellon University (CMU) Motion Capture Database (Anon):
- Characteristics: A large dataset capturing real human motion with 144 actors performing diverse movements, collected using an Optical Motion Capture System. This introduces more complex and varied movement patterns compared to the Mixamo dataset's copy-rotation based motions.
- Why chosen: Used to assess the network's ability to reconstruct segments of real motion, simulating scenarios with occluded markers or noisy data. Quantitative evaluation (MSE, FIE) with Mixamo isn't feasible due to unpaired motions, so it focuses on reconstruction.
- Skills from Video (SFV) Data (Peng et al., 2018):
- Characteristics: Contains real-world complex human motions, potentially derived from videos. The paper suggests these motions can be processed with the common SMPL mesh (a parameterized body model) for use in simulation.
- Why chosen: Demonstrates the robustness and generalization ability of MoMa to transfer motions from real-world complex human videos to synthetic characters.
5.2. Evaluation Metrics
The paper uses three main metrics to evaluate the performance of MoMa, combining both skeletal accuracy and mesh realism.
- Mean Squared Error (MSE):
  - Conceptual Definition: Mean Squared Error measures the average of the squares of the errors between the retargeted joint positions and the ground truth joint positions. A lower MSE indicates that the retargeted motion's skeleton is closer to the intended or ground truth motion. In the context of motion retargeting, it quantifies how accurately the spatial configuration of the target skeleton matches the source's intended pose. The MSE is calculated after aligning the root joint of the retargeted motion with the ground truth and normalizing by character height, which accounts for differences in overall scale and global position.
  - Mathematical Formula: The paper refers to the MSE calculated similarly to the loss function for the skeleton-aware module, but for evaluation, it's a comparison between retargeted 3D joint positions and ground truth 3D joint positions. $ MSE = \frac{1}{W \times N} \sum_{w=1}^{W} \sum_{n=1}^{N} \left\| P_{w,n} - \hat{P}_{w,n} \right\|^2 $
  - Symbol Explanation:
    - $W$: The total number of frames in the animation.
    - $N$: The total number of joints in the skeleton.
    - $P_{w,n}$: The 3D spatial position (e.g., xyz coordinates) of the $n$-th joint at frame $w$ in the ground truth animation.
    - $\hat{P}_{w,n}$: The 3D spatial position of the $n$-th joint at frame $w$ in the retargeted animation.
    - $\left\| P_{w,n} - \hat{P}_{w,n} \right\|^2$: The squared Euclidean distance between the two 3D joint positions.
- Face Interpenetration Error (FIE):
  - Conceptual Definition: Face Interpenetration Error quantifies the extent of unwanted collisions or interpenetrations within the character's mesh. It is expressed as the percentage of triangular faces that are in a state of collision in an animation. A lower FIE indicates a more physically plausible and realistic animation, free from visual artifacts caused by mesh self-intersection.
  - Mathematical Formula: $ FIE = \frac{1}{W} \sum_{w=1}^{W} \frac{\#\Delta(S_t, Q_t)}{\#F(M_t)} \% $
  - Symbol Explanation:
    - $W$: The total number of frames in the animation.
    - $\#\Delta(S_t, Q_t)$: The number of colliding faces detected in the target mesh at a specific frame $w$, given the skeleton state $(S_t, Q_t)$.
    - $\#F(M_t)$: The total number of faces in the target mesh $M_t$.
    - $\%$: Indicates that the ratio is expressed as a percentage.
- Skeleton and Collisions Error (SCE):
  - Conceptual Definition: Skeleton and Collisions Error is a combined metric that aims to balance both the accuracy of the skeletal motion transfer (measured by MSE) and the physical realism of the skinned mesh (measured by FIE). It acknowledges that achieving a low MSE might sometimes come at the cost of increased mesh collisions, and vice versa. By multiplying these two metrics, it provides a single score that rewards methods achieving both accurate skeletal motion and collision-free mesh deformation. A lower SCE value indicates a better overall retargeting quality. The authors note that while a weighted sum or other relations could be chosen, multiplication captures the dependency and is a valid choice for this combined evaluation.
  - Mathematical Formula: $ SCE = MSE \times FIE $
  - Symbol Explanation:
    - MSE: The Mean Squared Error as defined above.
    - FIE: The Face Interpenetration Error as defined above.
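The three metrics can be sketched in a few lines. This is an illustrative reading of the definitions above, not the authors' evaluation code: joint 0 is assumed to be the root, `height` is the character height used for normalization, and the per-frame colliding-face counts are assumed to come from some external collision detector.

```python
def retargeting_mse(gt, pred, height):
    """Root-aligned, height-normalized mean squared joint error.

    gt, pred: [W][N] lists of (x, y, z) positions; joint 0 is taken as the root.
    """
    total, count = 0.0, 0
    for frame_gt, frame_pred in zip(gt, pred):
        root_gt, root_pred = frame_gt[0], frame_pred[0]
        for p, q in zip(frame_gt, frame_pred):
            for axis in range(3):
                diff = ((p[axis] - root_gt[axis]) - (q[axis] - root_pred[axis])) / height
                total += diff * diff
            count += 1
    return total / count


def fie_percent(colliding_faces_per_frame, total_faces):
    """Average percentage of mesh faces in collision over the animation."""
    ratios = [c / total_faces for c in colliding_faces_per_frame]
    return 100.0 * sum(ratios) / len(ratios)


def sce(mse, fie):
    """Combined score: the product of skeletal error and interpenetration error."""
    return mse * fie


# Toy usage with made-up counts: 10 and 30 colliding faces out of 1000.
print(fie_percent([10, 30], 1000))
print(round(sce(0.049, 1.01), 3))
```

Because SCE is a product, halving either factor halves the score, so a method cannot score well by optimizing only one of the two.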
5.3. Baselines
MoMa's performance is compared against several state-of-the-art and baseline methods in motion retargeting:
- Copy-rotation: A simple baseline method where joint rotations from the source skeleton are directly copied to the target skeleton, typically after aligning them in a common T-pose. It doesn't account for bone length differences or skeletal topology variations beyond isomorphism.
- NKN (Neural Kinematic Networks) (Villegas et al., 2018): An early neural method for unsupervised motion retargeting, primarily designed for isomorphic skeletons. It uses an encoder-decoder network to learn kinematic representations.
- PMNet (Lim et al., 2019): Another neural approach that aims to learn disentangled pose and movement representations for unsupervised motion retargeting, also mainly for isomorphic skeletons.
- SAN (Skeleton-Aware Networks) (Aberman et al., 2020): A method specifically designed to handle retargeting between homeomorphic skeletons. It explicitly models skeletal structure differences. MoMa compares against SAN for homeomorphic scenarios. The paper also mentions "SAN opt", indicating SAN with MoMa's shape-aware optimizer applied on top for fairness.
- SAME (Skeleton-Agnostic Motion Embedding) (Lee et al., 2023): A recent method that aims to solve various animation tasks in a skeleton-agnostic manner, tackling both isomorphic and homeomorphic skeletons.
- R2ET (Skinned motion retargeting with residual perception of motion semantics & geometry) (Zhang et al., 2023): A recent skeleton-aware and shape-aware method that employs an attractive/repulsive field mechanism for collision resolution. It's designed for isomorphic skeletons.

These baselines represent different levels of sophistication and capabilities, from basic copy-pasting to advanced neural networks handling specific types of skeletal variations and rudimentary shape awareness, allowing for a comprehensive evaluation of MoMa's advancements.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate MoMa's superior performance across various scenarios, including isomorphic, homeomorphic, and non-homeomorphic skeletons, and its ability to reconstruct real motion capture data.
Isomorphic Skeletons
The results in Table 1, comparing MoMa against several baselines for isomorphic retargeting, highlight its effectiveness.
The following are the results from Table 1 of the original paper:
| Methods (isomorphic) | MSE | FIE%↓ | SCE↓ |
| --- | --- | --- | --- |
| GT (Mixamo) | | 4.10 | |
| Copy | 0.045 | 3.23 | 0.145 |
| NKN* | 0.575 | | |
| PMnet* | 0.281 | | |
| SAN | 0.141 | 1.53 | 0.216 |
| SAME | 0.176 | 1.21 | 0.213 |
| Ours (no opt) | 0.043 | 3.13 | 0.134 |
| R2ET | 0.042 | 3.96 | 0.166 |
| SAN opt (shape-aware) | 0.163 | 1.02 | 0.166 |
| Ours | 0.049 | 1.01 | 0.049 |
- MSE: R2ET (0.042) and Ours (no opt) (0.043) achieve the lowest MSE values, indicating excellent skeletal accuracy and slightly improving on the Copy baseline (0.045). This suggests that MoMa's skeleton-aware module is highly effective at preserving the original motion's skeletal dynamics. NKN (0.575), PMNet (0.281), SAME (0.176) and SAN (0.141) have significantly higher MSE, indicating they are less accurate on this specific setup.
- FIE%↓ (Face Interpenetration Error): This metric is crucial for shape-aware evaluation. Ours (the full MoMa approach with the shape-aware optimizer) achieves the lowest FIE of 1.01%, marginally below SAN opt (1.02%) and well below R2ET (3.96%). This supports the efficacy of MoMa's face-based optimizer in resolving mesh collisions. It's notable that Ours (no opt) already has a lower FIE (3.13%) than R2ET, indicating that MoMa's skeleton-aware module inherently produces motions with fewer collisions. The GT (Mixamo) FIE of 4.10% is important; it reveals that even the ground truth in the Mixamo dataset can contain interpenetrations, making the task challenging and highlighting the need for robust collision resolution.
- SCE↓ (Skeleton and Collisions Error): As a combined metric, SCE provides an overall measure of quality. Ours achieves the lowest SCE of 0.049, demonstrating its superior balance between skeletal accuracy and collision avoidance. This is a significant improvement over R2ET (0.166) and SAN opt (0.166). The results for "SAN opt" (FIE reduced from 1.53% to 1.02% and SCE from 0.216 to 0.166 relative to SAN) indicate that MoMa's shape-aware module is effective and can be applied independently to other skeleton-aware methods.
Homeomorphic Skeletons
For homeomorphic skeletons, MoMa is compared against SAN, which is specifically designed for such variations.
The following are the results from Table 2 of the original paper:
| Methods (homeomorphic) | MSE | FIE%↓ | SCE↓ |
| --- | --- | --- | --- |
| SAN | 0.108 | 1.25 | 0.135 |
| SAME | 0.122 | 1.12 | 0.137 |
| Ours (no opt) | 0.025 | 3.6 | 0.090 |
| SAN opt | 0.117 | 0.91 | 0.106 |
| Ours | 0.031 | 0.9 | 0.028 |
- Ours (no opt) achieves a remarkably low MSE of 0.025, significantly outperforming SAN (0.108) and SAME (0.122). This highlights the strength of MoMa's skeleton-aware module in handling homeomorphic differences.
- Ours (full approach) reaches an FIE of 0.9%, the lowest among all methods and marginally below SAN opt (0.91%), further supporting the face-based optimizer.
- The SCE for Ours is 0.028, which is substantially lower than SAN (0.135) and SAME (0.137), demonstrating clear state-of-the-art performance in homeomorphic retargeting.
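As a quick sanity check on the reported numbers, the SCE column can be recomputed from the MSE and FIE columns of Tables 1 and 2. The values below are copied from the tables; the small tolerance is an assumption that accounts for the inputs themselves being rounded.

```python
# (method, MSE, FIE %, reported SCE), transcribed from Tables 1 and 2.
TABLE_1 = [
    ("Copy", 0.045, 3.23, 0.145),
    ("SAN", 0.141, 1.53, 0.216),
    ("SAME", 0.176, 1.21, 0.213),
    ("Ours (no opt)", 0.043, 3.13, 0.134),
    ("R2ET", 0.042, 3.96, 0.166),
    ("SAN opt", 0.163, 1.02, 0.166),
    ("Ours", 0.049, 1.01, 0.049),
]
TABLE_2 = [
    ("SAN", 0.108, 1.25, 0.135),
    ("SAME", 0.122, 1.12, 0.137),
    ("Ours (no opt)", 0.025, 3.6, 0.090),
    ("SAN opt", 0.117, 0.91, 0.106),
    ("Ours", 0.031, 0.9, 0.028),
]


def max_sce_deviation(rows):
    """Largest absolute gap between MSE * FIE and the reported SCE."""
    return max(abs(mse * fie - reported) for _, mse, fie, reported in rows)


for name, rows in (("Table 1", TABLE_1), ("Table 2", TABLE_2)):
    print(name, "max |MSE*FIE - SCE| =", round(max_sce_deviation(rows), 5))
```

All rows agree with the product definition to within about 0.001, which is consistent with rounding of the published MSE and FIE values.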
Motion Reconstruction
The evaluation on LAFAN1 and CMU datasets (Table 4) focuses on the network's ability to reconstruct masked segments of real motion, simulating noisy or incomplete motion capture data.
The following are the results from Table 4 of the original paper:
| Dataset | Masked joints | MSE |
| --- | --- | --- |
| LAFAN1 | 5 | 0.042 |
| LAFAN1 | 8 | 0.032 |
| LAFAN1 | 10 | 0.050 |
| LAFAN1 | 15 | 0.070 |
| CMU | 5 | 0.047 |
| CMU | 8 | 0.035 |
| CMU | 10 | 0.081 |
| CMU | 15 | 0.15 |
The results show that MoMa effectively reconstructs real motion capture data, with MSE values comparable to those seen on Mixamo. For both datasets, an intermediate number of masked joints (around 8) yields the lowest MSE, suggesting an optimal masking level for learning robust representations. As the number of masked joints increases to 15, the MSE predictably rises, as reconstructing more missing information becomes harder. This validates the pose masking auto-encoder's robustness and generalization to real-world, noisy motion data.
Qualitative Results
The paper provides qualitative results (Figures 6, 7, 8, 10) that visually reinforce the quantitative findings:
- Isomorphic Skeletons (Fig. 6): MoMa effectively solves collisions (e.g., hands, arm-head interpenetration) while preserving motion dynamics, unlike baselines that either fail to resolve collisions or excessively deform the pose.
- Homeomorphic Skeletons (Fig. 7): MoMa avoids interpenetrations between body parts, which SAN (a skeleton-aware-only method) fails to do, especially with large variations in skeleton and shape.
- Non-Homeomorphic Skeletons (Fig. 8): The paper visually demonstrates retargeting from a standard skeleton to characters with extra end-effectors (like a tail) and different limb configurations, showcasing MoMa's unique capability.
- Real-world Character (Fig. 10): MoMa successfully retargets complex real-world motions from the SFV dataset to synthetic characters, highlighting its robustness and generalization to unseen skeletons.
Computational Cost
The following are the results from Table 5 of the original paper:
| Time of execution, SGD (s) | Time of execution, L-BFGS (s) | Avg. number of collisions per frame | Length of the animation (frames) |
| --- | --- | --- | --- |
| 144.2 | 73.1 | 522 | 56 |
| 345.3 | 180.39 | 450 | 144 |
Table 5 shows the computational cost of the optimization step, comparing SGD and L-BFGS. L-BFGS consistently outperforms SGD in execution time (e.g., 73.1s vs 144.2s for 56 frames), justifying its choice as the optimizer for the shape-aware module. The number of collisions and animation length influence the total time, but the mesh characteristics (joints, faces) have minimal impact on the per-frame computation load, as it's primarily determined by the collisions needing resolution.
6.2. Data Presentation (Tables)
The tables presented in the "Core Results Analysis" section are exact transcriptions of the original paper's tables.
6.3. Ablation Studies / Parameter Analysis
The authors conduct ablation studies to understand the contribution of different components of MoMa, particularly for the skeleton-aware module and the impact of the $\lambda$ parameter in the shape-aware module.
Skeleton-aware Module Ablation Studies
The following are the results from Table 3 of the original paper:
| M | MSE | Masking strategy | MSE | Encoding | MSE |
| --- | --- | --- | --- | --- | --- |
| 5 | 0.095 | zero | 0.043 | No encoding | 0.524 |
| 10 | 0.043 | random | 0.095 | E | 1.025 |
| 15 | 0.094 | perturb | 1.025 | ε | 1.450 |
| | | | | ε ε | 0.043 |
This table shows the MSE values for different configurations of the skeleton-aware module using the isomorphic skeletons setup.
- Effect of Number of Masked Joints (M):
  - M = 5 (0.095 MSE): If too few joints are masked, the network doesn't have enough reconstruction tasks to learn robust representations, resulting in higher MSE.
  - M = 10 (0.043 MSE): This configuration yields the best MSE. It suggests an optimal balance where the network is challenged enough to learn meaningful spatio-temporal relationships without being overwhelmed.
  - M = 15 (0.094 MSE): If too many joints are masked, the reconstruction task becomes excessively difficult, hindering the network's ability to learn the underlying motion structure, leading to higher MSE.
  - Conclusion: The number of masked joints (M) is a crucial hyper-parameter, and an optimal value (e.g., M = 10) is essential for effective learning.
- Masking Strategy:
  - zero (0.043 MSE): Replacing masked tokens with a vector of zeros (or a specific fixed value) provides a consistent starting point for the reconstruction task, leading to the best performance.
  - random (0.095 MSE): Replacing with random values introduces noise, making the reconstruction harder and increasing MSE.
  - perturb (1.025 MSE): This strategy (likely perturbing original values rather than fully masking) performs poorly, indicating that a clear masking signal is beneficial.
  - Conclusion: Using a zero (or consistent) masking strategy is most effective for the pose masking auto-encoder.
- Encoding (Positional Embeddings):
  - No encoding (0.524 MSE): Without any positional embeddings, the network performs poorly, demonstrating the need for understanding spatio-temporal relationships.
  - E (1.025 MSE): This likely refers to only using one type of embedding (e.g., spatial). Performance is even worse than with no encoding, because the other dimension of relationships is not captured.
  - ε (1.450 MSE): This likely refers to only using the other type of embedding (e.g., temporal). Performance is similarly poor.
  - ε ε (0.043 MSE): This represents the combination of both spatial and temporal embeddings. This configuration yields the best MSE, matching the optimal result.
  - Conclusion: Both spatial and temporal positional embeddings are essential, and their combination is critical for the network to accurately model the complex spatio-temporal relationships between joints in an animation.
Shape-aware Module Ablation Studies ($\lambda$ parameter)

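This chart shows the ablation results for different $\lambda$ values in the shape-aware module. Points closer to the origin indicate better results. MoMa outperforms both the shape-aware (R2ET) and skeleton-aware (SAN) methods at the different $\lambda$ values shown, with the best result at $\lambda = 0.7$.
Fig. 9. Ablation studies for the value of $\lambda$ in the shape-aware module. The best results are indicated by points closer to the origin; we can see that our MoMa can outperform both shape-aware (R2ET) and skeleton-aware (SAN) methods depending on the value of $\lambda$. We obtain our best result with $\lambda = 0.7$.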
Figure 9 illustrates the trade-off between MSE (skeletal accuracy) and FIE (collision avoidance) as the balancing weight $\lambda$ in the shape-aware module's loss function is varied. The goal is to be as close to the origin (0 MSE, 0 FIE) as possible.
- Interpretation:
  - A higher $\lambda$ (closer to 1) puts more emphasis on minimizing collisions ($\xi$), potentially at the cost of deviating more from the original skeletal motion (higher MSE).
  - A lower $\lambda$ (closer to 0) prioritizes preserving the original skeletal motion ($\mathcal{L}_{MSE}$), which might lead to more collisions (higher FIE).
- Results: The chart shows MoMa's performance across different $\lambda$ values.
  - The point $\lambda = 0.7$ is highlighted as yielding the "best result" for MoMa, indicating a balance where both MSE and FIE are acceptably low.
  - Per the figure caption, MoMa can outperform both R2ET (a shape-aware baseline) and SAN (a skeleton-aware baseline) depending on the value of $\lambda$: for suitable settings, its points lie closer to the origin than the single data points of the baselines.
- Conclusion: The $\lambda$ parameter allows animators or users to control the desired balance between skeletal accuracy and collision avoidance.
Computational Cost of the Optimization Step
The following are the results from Table 5 of the original paper:
| Time of execution, SGD (s) | Time of execution, L-BFGS (s) | Avg. number of collisions per frame | Length of the animation (frames) |
| --- | --- | --- | --- |
| 144.2 | 73.1 | 522 | 56 |
| 345.3 | 180.39 | 450 | 144 |
This table compares the execution time of two different optimizers, Stochastic Gradient Descent (SGD) and Limited-memory BFGS (L-BFGS), for the shape-aware module.
- Comparison: For both animation lengths (56 and 144 frames), L-BFGS is roughly twice as fast as SGD. For the 56-frame animation, L-BFGS takes 73.1 seconds compared to SGD's 144.2 seconds. For the 144-frame animation, L-BFGS takes 180.39 seconds compared to SGD's 345.3 seconds.
- Factor Influencing Cost: The paper notes that the mesh's joint and face numbers have minimal impact on performance; instead, the average number of collisions per frame is the primary determinant of computation load.
- Conclusion: This justifies the choice of L-BFGS as the optimizer for the shape-aware module due to its superior efficiency, which is particularly important for processing longer animations or many collision events.
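The speed-up and the per-frame cost implied by Table 5 follow from simple division. The numbers below are transcribed from the table; the variable and function names are illustrative.

```python
# (SGD seconds, L-BFGS seconds, avg. collisions per frame, frames), from Table 5.
RUNS = [
    (144.2, 73.1, 522, 56),
    (345.3, 180.39, 450, 144),
]


def speedup(sgd_seconds, lbfgs_seconds):
    """How many times faster L-BFGS is than SGD on the same animation."""
    return sgd_seconds / lbfgs_seconds


def seconds_per_frame(total_seconds, frames):
    """Average optimization time spent on one frame."""
    return total_seconds / frames


for sgd, lbfgs, collisions, frames in RUNS:
    print(frames, "frames:",
          "speed-up", round(speedup(sgd, lbfgs), 2),
          "| L-BFGS s/frame", round(seconds_per_frame(lbfgs, frames), 2))
```

Both runs show a speed-up a little under 2x, and the L-BFGS cost stays at roughly 1.3 seconds per frame for the two animations.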
6.4. Data Presentation (Images)

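This figure compares different motion retargeting methods: the source, Copy, SAN, R2ET, our approach without optimization, and our approach with optimization. Red circles mark how each method adapts body parts and where collisions occur.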
Fig. 6. State-of-the-art results for isomorphic skeletons, showing MoMa's qualitative superiority.

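This figure compares the source character, the SAN result, and our method. The circled regions highlight the differences in the character's hand pose across the three, showing that our method retargets the motion more naturally and consistently.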
Fig. 7. Qualitative results for the retargeting between homeomorphic skeletons.

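This figure shows animation retargeting results between non-homeomorphic skeletons. The top row shows the source character and two retargeted target characters; the bottom row shows close-ups of different details, including collision handling for clothing and body parts, emphasizing the fine detail of the retargeting.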
Fig. 8. Qualitative results for the retargeting between non-homeomorphic skeletons.

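This illustration shows motion retargeting with the MoMa method. It depicts the source and target characters performing a flip, highlighting the adaptation to skeleton and shape. The method uses a custom transformer auto-encoder and a face-based optimizer to achieve accurate motion transfer and collision handling.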
Fig. 10. Real-world motion retargeting examples from Skills from Video (SFV) data, such as a backflip.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces MoMa, a novel and comprehensive approach for skinned motion retargeting that excels in both skeleton-aware and shape-aware aspects. MoMa's key innovation lies in its transformer-based auto-encoder with a spatio-temporal masking strategy, which uniquely enables fully automatic motion retargeting between isomorphic, homeomorphic, and crucially, non-homeomorphic character topologies—a capability previously lacking in neural methods without paired data. Furthermore, its shape-aware module employs a novel face-based optimizer that resolves mesh interpenetrations more consistently and precisely than conventional vertex-based methods, leading to more accurate and visually convincing retargeted motions. The experimental results, both quantitative and qualitative, demonstrate MoMa's state-of-the-art performance on the Mixamo dataset and its robustness in transferring complex motions from real videos to synthetic characters.
7.2. Limitations & Future Work
The authors highlight a key limitation and implicitly suggest future work based on their findings:
- Generalization to Non-Homeomorphic Skeletons: While MoMa is the first neural method to handle non-homeomorphic skeletons without paired motion, its ability to generalize to all possible non-homeomorphic skeletons is limited. This generalization currently depends on whether unseen skeletons can be closely represented by existing skeletons in the training set (e.g., similar body proportions and number of joints) or by a combination of multiple training skeletons.
- Future Direction - Diverse Training Data: To improve generalization to a wider range of non-homeomorphic skeletons, a more diverse and comprehensive training set would be required. This implies that if the training set is sufficiently diverse, MoMa could generalize to nearly all isomorphic and homeomorphic skeletons, but the scope for non-homeomorphic characters still hinges on data coverage.
- Future Direction - Real-World Motion Transfer: The paper demonstrates the potential of transferring motions from real-world human videos (processed via an SMPL mesh) to synthetic characters. This opens up promising avenues for future research in bridging the gap between real-world motion capture and virtual animation, potentially making character animation more accessible and versatile.
7.3. Personal Insights & Critique
MoMa presents a significant step forward in motion retargeting, especially in its ability to tackle non-homeomorphic skeletons without explicit mappings. The integration of masked pose modeling into this domain is particularly insightful, borrowing a powerful self-supervised learning paradigm from NLP and computer vision and adapting it effectively for skeletal data. This indicates the increasing cross-pollination of ideas across deep learning subfields.
The choice of a face-based optimizer over vertex-based methods for collision resolution is a crucial design decision that yields tangible improvements in visual quality. It addresses a common frustration in character animation where realistic motion is undermined by unsightly mesh interpenetrations. The ablation studies effectively demonstrate the contribution of each module and the importance of hyper-parameters like the number of masked joints and positional embeddings. A tunable parameter in the shape-aware module provides a practical control knob for balancing accuracy and realism, which is highly valuable for animators.
One potential area for critique or further investigation is the SCE metric, which combines MSE and FIE by multiplication. While the authors justify its use by stating that it captures the dependency between the two terms and that the choice of weighting is a preference, multiplication can obscure nuances if one component is significantly larger or smaller than the other. For instance, a very small MSE could still result in a low SCE even with a relatively high FIE, or vice versa. A deeper sensitivity analysis of this metric, and perhaps of alternative composite metrics (e.g., a geometric mean or a more dynamically weighted sum), could be explored.
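The sensitivity concern can be made concrete with a small sketch. It assumes, purely for illustration, that the composite score is the plain product of the two error terms; the exact SCE definition is not reproduced in this section, and the function names, the weight `w`, and all numeric values below are hypothetical rather than taken from the paper.

```python
import math

def product_score(mse, fie):
    # Assumed composite: plain product of the two error terms (lower is better).
    return mse * fie

def geometric_mean_score(mse, fie):
    # Square root of the product; a monotone transform of product_score.
    return math.sqrt(mse * fie)

def weighted_sum_score(mse, fie, w=0.5):
    # Hypothetical alternative: convex combination with weight w on MSE.
    return w * mse + (1.0 - w) * fie

def best(candidates, score):
    # Return the name of the candidate with the lowest score.
    return min(candidates, key=lambda name: score(*candidates[name]))

# Hypothetical (MSE, FIE) pairs, not results from the paper.
candidates = {
    "A: tiny MSE, high FIE": (0.01, 8.0),
    "B: balanced": (0.3, 0.3),
}

for label, score in [
    ("product", product_score),
    ("geometric mean", geometric_mean_score),
    ("weighted sum", weighted_sum_score),
]:
    print(f"{label:>15}: best = {best(candidates, score)}")
```

Under these assumed numbers the product prefers candidate A despite its much larger FIE, and the geometric mean necessarily agrees because it is a monotone function of the product. Only the weighted sum changes the ranking, and it in turn would require the two terms to be normalized to comparable scales before the weight is meaningful.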
Furthermore, while the paper claims to be the "first" for non-homeomorphic skeletons without paired motion, it's a strong claim in a rapidly evolving field. It would be beneficial for future work to rigorously define the boundary conditions of non-homeomorphic generalization and explore active learning or domain adaptation techniques to improve it with less reliance on the sheer diversity of the training set.
The robust performance on CMU and LAFAN1 datasets with masked joints suggests a broader applicability for cleaning noisy motion capture data, beyond just retargeting. This could be an important side benefit or future application area. Overall, MoMa offers a robust and conceptually elegant solution that pushes the boundaries of automated motion retargeting.