Paper status: completed

MoMa: Skinned motion retargeting using masked pose modeling

Published: 09/14/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

MoMa introduces a novel skinned motion retargeting method that integrates skeleton-aware and shape-aware capabilities, effectively transferring animations across characters with different structures using a transformer-based auto-encoder and a face-based optimizer.

Abstract

Motion retargeting requires a careful analysis of the differences in both skeletal structure and body shape between source and target characters. Existing skeleton-aware and shape-aware approaches can deal with such differences, but they struggle when the source and target characters exhibit significant dissimilarities in both skeleton (like joint count and bone length) and shape (like geometry and mesh properties). In this work we introduce MoMa, a novel approach for skinned motion retargeting which is both skeleton- and shape-aware. Our skeleton-aware module learns to retarget animations by recovering the differences between source and target using a custom transformer-based auto-encoder coupled with a spatio-temporal masking strategy. The auto-encoder can transfer the motion between input and target skeletons by reconstructing the masked skeletal differences using shared joints as a reference point. Surpassing the limitations of previous approaches, we can also perform retargeting between skeletons with a varying number of leaf joints. Our shape-aware module incorporates a novel face-based optimizer that adapts skeleton positions to limit collisions between body parts. In contrast to conventional vertex-based methods, our face-based optimizer excels in resolving surface collisions within a body shape, resulting in more accurate retargeted motions. The proposed architecture outperforms the state of the art on the Mixamo dataset, both quantitatively and qualitatively. Our code is available at: [Github link upon acceptance, see supplementary materials].


In-depth Reading


1. Bibliographic Information

1.1. Title

MoMa: Skinned motion retargeting using masked pose modeling

1.2. Authors

Giulia Martinelli*, Nicola Garau, Niccolò Bisagno, Nicola Conci

Affiliations: University of Trento, Via Sommarive 14, Trento, 38123, Italy; CNIT (Consorzio Nazionale Interuniversitario per le Telecomunicazioni), Via Sommarive 14, Trento, 38123, Italy

1.3. Journal/Conference

The paper does not explicitly state the name of the journal or conference in the provided text. However, the keywords mention "CVU", which is likely a typo for "CVIU" (Computer Vision and Image Understanding). Given the structure of the paper and its "ARTICLE INFO" section, it is presented as a journal publication. Computer Vision and Image Understanding is a well-regarded academic journal for research in computer vision, image analysis, and related fields.

1.4. Publication Year

2024

1.5. Abstract

Motion retargeting is a complex task that involves adapting animations between characters with different skeletal structures and body shapes. Existing methods often struggle when these dissimilarities are significant, for example, varying joint counts or mesh properties. This paper introduces MoMa, a novel approach for skinned motion retargeting that is both skeleton-aware and shape-aware. The skeleton-aware module uses a custom transformer-based auto-encoder with a spatio-temporal masking strategy to learn and recover differences between source and target skeletons, enabling retargeting even with varying numbers of leaf joints (non-homeomorphic skeletons). The shape-aware module employs a novel face-based optimizer to prevent collisions between body parts, which is more accurate than conventional vertex-based methods in resolving surface interpenetrations. MoMa achieves state-of-the-art quantitative and qualitative results on the Mixamo dataset, offering a robust solution for diverse character animation challenges.

PDF: /files/papers/69607d92d6fd1ceb59987821/paper.pdf (a relative path; the full link depends on the base URL of the hosting platform). Publication status: published on 2024-09-14.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the effective and automated transfer of motion (motion retargeting) between two characters that exhibit significant differences in both their skeletal structure and body shape. This is a critical task in computer graphics and animation, with applications ranging from film and game production to human-computer interaction.

In the current field, motion retargeting faces several challenges:

  • Skeletal Dissimilarities: Characters can have different joint counts (e.g., a human vs. a quadruped), varying bone lengths, and distinct skeletal topologies. Existing skeleton-aware methods are often limited to isomorphic (same number of joints) or homeomorphic (same number of end-effectors but different topology) skeletons, failing for non-homeomorphic ones (different number of leaf joints).

  • Shape Dissimilarities: Characters possess unique geometries and mesh properties (e.g., body proportions, clothing, accessories). When motion is transferred, the character's mesh might interpenetrate itself (self-collision) or other parts of the environment, leading to unrealistic or "broken" animations. Shape-aware methods exist, but they typically rely on vertex-based collision detection, which can lead to erratic movements or distortions.

  • Manual Effort: Traditionally, motion retargeting, especially when dealing with complex dissimilarities and collision resolution, is a labor-intensive and time-consuming process performed manually by 3D artists.

  • Lack of Explicit Mappings: The goal is to transfer motion without explicit mapping between skeletons or paired motion data for source and target characters, making the problem unsupervised.

    The paper's entry point and innovative idea lie in addressing these limitations by developing a comprehensive approach that is simultaneously skeleton-aware and shape-aware. It introduces a novel masked pose modeling technique to handle diverse skeletal topologies, including non-homeomorphic ones, and a face-based optimizer for more robust collision avoidance in various body shapes.

2.2. Main Contributions / Findings

The paper's primary contributions and key findings are:

  • Novel Skeleton and Shape-Aware Pipeline: Introduction of MoMa, a new end-to-end pipeline for skinned motion retargeting that explicitly accounts for both skeletal structure and body shape differences.

  • First Approach for Non-Homeomorphic Skeletons: MoMa is the first motion retargeting method that can handle non-homeomorphic skeletons (characters with varying numbers of leaf joints) without requiring paired motion data or ad-hoc skeleton mappings. This significantly expands the applicability of automated retargeting.

  • Novel Pose Masking Auto-Encoder: Development of a transformer-based auto-encoder that uses a spatio-temporal masking strategy to reconstruct masked portions of skeletal data. This allows the network to learn generalized representations for retargeting between diverse skeletal topologies by predicting missing joints based on shared reference points.

  • Novel Face-Based Optimizer: Implementation of a face-based optimizer for resolving mesh interpenetrations (collisions). Unlike conventional vertex-based methods, this approach operates on triangular faces, which leads to more consistent and precise collision resolution, minimizing unwanted mesh surface deformations and producing more accurate retargeted motions.

  • State-of-the-Art Performance: MoMa achieves state-of-the-art quantitative and qualitative results on the Mixamo dataset, outperforming existing methods in terms of both skeletal accuracy (MSE) and collision avoidance (FIE), as demonstrated by the combined metric (SCE).

  • Framework for Real-World Motion Transfer: The proposed method also provides a framework for transferring motion from real-world videos to synthetic characters, showcasing its robustness and generalization ability on diverse datasets like CMU, LAFAN1, and SFV.

    These findings collectively address critical gaps in motion retargeting research, enabling more robust, accurate, and automated animation transfer across a wider range of character types.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a beginner should be familiar with the following fundamental concepts:

  • Motion Retargeting: This is the process of transferring an animation sequence from a source character (e.g., a human actor) to a target character (e.g., an avatar in a game or a different animal model). The goal is to adapt the motion such that it looks natural on the target character, despite differences in proportions, skeletal structure, or body shape, while preserving the original motion's dynamics.

  • Character Representation:

    • Skeleton: The underlying hierarchical structure of a character, composed of joints (nodes, typically representing bone endpoints or articulation points) and bones (edges connecting joints). Each joint has a parent-child relationship, forming a kinematic chain. End-effectors or leaf joints are the outermost joints in a chain (e.g., fingertips, toes, tail tip).
    • Mesh: The visible surface geometry of a character, typically made of vertices (points in 3D space), edges (lines connecting vertices), and faces (planar surfaces, often triangles, formed by edges and vertices). This defines the character's body shape.
    • Skinning: The process of binding a mesh to a skeleton. When the skeleton moves, the mesh deforms accordingly. Linear Blend Skinning (LBS) is a common technique where each vertex of the mesh is influenced by a weighted sum of transformations from nearby bones.
  • Skeletal Topologies:

    • Isomorphic Skeletons: Skeletons that have the exact same number of joints and the same hierarchical structure. Retargeting between these is often simpler, but still requires adapting to different bone lengths.
    • Homeomorphic Skeletons: Skeletons that share the same number of end-effectors (leaf joints) and have the same overall topology (e.g., a human with different numbers of intermediate spinal joints, but still two arms, two legs, one head). They might have different numbers of intermediate joints but the same "branching" structure at the extremities.
    • Non-Homeomorphic Skeletons: Skeletons that do not share the same number of end-effectors or have fundamentally different topologies (e.g., a human character and a character with a tail, or a spider). This is the most challenging case for motion retargeting.
  • Transformers: A neural network architecture that has revolutionized natural language processing and computer vision.

    • Self-Attention Mechanism: A core component of transformers that allows the model to weigh the importance of different parts of the input sequence (e.g., different joints in a motion sequence) when processing each part. It calculates relevance scores between all input elements.
    • Encoder-Decoder Structure: Transformers often consist of an encoder that maps an input sequence to a latent representation, and a decoder that maps this latent representation to an output sequence.
    • Tokens: Discrete units of input data that transformers process (e.g., words in text, patches in images, or in this paper, individual joints or frames of motion).
  • Auto-encoders: A type of neural network used for unsupervised learning of efficient data codings (representations). An auto-encoder tries to learn a function that is approximately equal to the identity function, $f(x) \approx x$. It has two main parts:

    • Encoder: Compresses the input data into a lower-dimensional latent space representation.
    • Decoder: Reconstructs the input data from the latent space representation. The goal is for the latent representation to capture the most important features of the input data.
  • Mean Squared Error (MSE): A common loss function used in regression tasks and for evaluating the difference between predicted and true values. It calculates the average of the squared differences between corresponding elements of the prediction and the target. Lower MSE indicates better accuracy.

  • Quaternions: A mathematical way to represent rotations in 3D space. They are preferred over Euler angles (which can suffer from gimbal lock) and rotation matrices (which require more parameters and orthonormalization constraints) for their compactness, efficiency, and numerical stability in animation. A 4D quaternion (x, y, z, w) describes a rotation around an axis by a certain angle.

  • Forward Kinematics (FK): A method in computer animation to calculate the positions of all joints in a kinematic chain (skeleton) given the initial position of the root joint and the relative rotations and lengths of all bones down the hierarchy. It determines the global position of each joint based on the transformations from its parent joints.

  • Collision Detection / Interpenetration: The process of determining if two or more geometric objects (e.g., parts of a character's mesh) are overlapping or occupying the same space. Interpenetration refers to this unwanted overlap, which leads to unrealistic visuals in animation.

  • Quasi-Newton Methods (e.g., L-BFGS): A class of numerical optimization algorithms that approximate the Hessian matrix (matrix of second-order partial derivatives) to find the minimum of a function. Limited-memory BFGS (L-BFGS) is a popular quasi-Newton method that is efficient for high-dimensional problems because it avoids explicitly computing and storing the full Hessian matrix, instead using a limited amount of memory to store past gradient evaluations. It's often used when exact second derivatives are too costly to compute.
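
To make the quaternion and forward kinematics items above concrete, here is a minimal numpy sketch (our own illustration, not code from the paper; the (x, y, z, w) component order, the parent-indexed skeleton layout, and the assumption that parents precede children are choices made for this example):

```python
import numpy as np

def quat_mul(q1, q2):
    """Hamilton product of two quaternions in (x, y, z, w) order."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return np.array([
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
    ])

def quat_rotate(q, v):
    """Rotate a 3D vector v by a unit quaternion q = (x, y, z, w)."""
    x, y, z, w = q
    u = np.array([x, y, z])
    return 2 * np.dot(u, v) * u + (w * w - np.dot(u, u)) * v + 2 * w * np.cross(u, v)

def forward_kinematics(parents, offsets, quats, root_pos):
    """Global joint positions from local per-joint rotations.

    parents[n] is the index of joint n's parent (-1 for the root) and must
    precede n; offsets[n] is the parent-to-joint bone vector in the rest pose;
    quats[n] is joint n's local rotation as a unit quaternion.
    """
    n_joints = len(parents)
    glob_rot = [None] * n_joints
    glob_pos = np.zeros((n_joints, 3))
    for n in range(n_joints):
        p = parents[n]
        if p < 0:
            glob_rot[n], glob_pos[n] = quats[n], root_pos
        else:
            # Accumulate the parent's rotation, then offset from the parent's position.
            glob_rot[n] = quat_mul(glob_rot[p], quats[n])
            glob_pos[n] = glob_pos[p] + quat_rotate(glob_rot[p], offsets[n])
    return glob_pos
```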

3.2. Previous Works

The paper frames its contributions by contrasting them with existing approaches in motion retargeting, broadly categorized into skeleton-aware and shape-aware methods, and drawing inspiration from masked modeling.

  • Masked Modeling for Representation Learning:

    • Masked Language Modeling (MLM): Pioneered by BERT (Devlin et al., 2018; Liu et al., 2019), this technique involves masking out random words in a sentence and training a model to predict them. This self-supervised pre-training allows models to learn rich contextual representations of language.
    • Masked Image Modeling (MIM): Inspired by MLM, techniques like Masked Autoencoders (MAE) (He et al., 2022) and SimMIM (Xie et al., 2022) apply masking to image patches. Models are trained to reconstruct missing image patches from visible ones, enabling powerful self-supervised learning for visual representation.
    • Spatio-temporal Masked Modeling: Extensions to video, such as VideoMAE (Tong et al., 2022; Feichtenhofer et al., 2022), mask out both spatial patches and temporal frames, demonstrating effectiveness in learning video representations.
    • Relevance to MoMa: MoMa adopts a similar strategy by randomly masking a subset of skeleton joints both in space (which joints) and time (which frames) to train its auto-encoder. The key difference MoMa highlights is that in pose modeling, relationships between individual joints (each carrying only a few numerical values) are more critical and harder to model than relationships between information-rich image patches.
  • Skeleton-aware Motion Retargeting: These methods focus primarily on adapting the skeletal motion.

    • Basic Copy Rotation: As described by Aberman et al. (2020), this is a simple baseline where rotations from the source skeleton are directly copied to the target, assuming a common T-pose (a standardized neutral pose). It does not account for scale or bone length differences.
    • Methods for Isomorphic Skeletons: Many neural approaches, such as Neural Kinematic Networks (NKN) (Villegas et al., 2018) and PMNet (Lim et al., 2019), are designed for retargeting between isomorphic skeletons (same number of joints). They often learn to disentangle pose and movement. Zhang et al. (2023) and Villegas et al. (2021) also fall into this category.
    • Methods for Homeomorphic Skeletons: Skeleton-Aware Networks (SAN) by Aberman et al. (2020) explicitly introduced mechanisms to handle homeomorphic skeletons (same number of end-effectors but different topologies). SAME (Lee et al., 2023) also tackles isomorphic and homeomorphic skeletons in a skeleton-agnostic manner.
    • Limitations of Previous Works: Crucially, most prior neural methods struggle to generalize to non-homeomorphic skeletons. While non-neural methods (Yamane et al., 2010; Seol et al., 2013) have explored non-homeomorphic retargeting, they often rely on paired motions or explicit skeleton mappings, which MoMa aims to avoid.
    • Example (Attention Mechanism in Transformers): Since transformers are a foundational concept for MoMa's skeleton-aware module, it is important to understand the attention mechanism. The core idea is to compute a weighted sum of Value vectors, where the weights are determined by the similarity between Query and Key vectors (a runnable sketch is given at the end of this subsection). The attention mechanism is defined as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • Q (Query), K (Key), V (Value) are matrices of linear projections of the input embeddings; for self-attention, Q, K, and V are derived from the same input.
      • QK^T computes the dot-product similarity between queries and keys.
      • d_k is the dimension of the key vectors, used for scaling to prevent the dot products from growing large and pushing the softmax into regions with extremely small gradients.
      • softmax normalizes the scores into weight distributions.
      • The output is a weighted sum of the Value vectors, allowing the model to focus on the most relevant parts of the input.
  • Shape-aware Motion Retargeting: These methods explicitly incorporate the character's mesh into the optimization process to avoid unrealistic interpenetrations.

    • Early Methods: Initial methods like NKN (Villegas et al., 2018) and PMNet (Lim et al., 2019) were not shape-aware and used simple Linear Blend Skinning (LBS) without collision resolution, leading to unrealistic skinned characters.
    • Vertex-based Collision Detection: More recent methods, such as those by Villegas et al. (2021) and Zhang et al. (2023), are shape-aware. They detect contacts between different parts of the mesh using individual vertices. Villegas et al. (2021) use an encoder-decoder for optimization, while Zhang et al. (2023) use an attractive/repulsive field mechanism.
    • Limitations of Previous Works: While these methods generalize to diverse body shapes, their vertex-based approach can lead to localized, erratic deformations as individual vertices respond to collisions. Zhang et al.'s method specifically notes it only solves body-limbs collisions, not limbs-limbs or body-body ones.
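
As a concrete reference for the attention formula quoted above, here is a minimal self-attention sketch in numpy (dimensions and random weights are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_queries, n_keys) similarity matrix
    return softmax(scores) @ V        # weighted sum of the value vectors

# Self-attention over a toy sequence of 5 "joint tokens" with embedding size 8.
tokens = np.random.randn(5, 8)
W_q, W_k, W_v = (np.random.randn(8, 8) for _ in range(3))  # stand-ins for learned projections
out = attention(tokens @ W_q, tokens @ W_k, tokens @ W_v)  # (5, 8)
```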

3.3. Technological Evolution

The evolution of motion retargeting has progressed from simple rule-based or manual adjustments to increasingly sophisticated neural network approaches:

  1. Manual/Rule-based Retargeting: Early methods involved artists manually adjusting poses or simple copy-rotation techniques. These were labor-intensive and lacked flexibility for diverse characters.
  2. Skeletal-based Retargeting for Isomorphic Skeletons: Initial neural approaches (e.g., NKN, PMNet) focused on learning the mapping between isomorphic skeletons, primarily transferring rotations and poses. These methods often overlooked complex topological differences and body shape.
  3. Skeletal-based Retargeting for Homeomorphic Skeletons: Advances like SAN started addressing homeomorphic skeletons, allowing for variations in intermediate joint counts while maintaining similar end-effector structures. This was a step towards more flexible retargeting.
  4. Shape-aware Retargeting: Recognizing the importance of visual realism, researchers began incorporating mesh information to prevent interpenetrations. These methods (e.g., Villegas et al. 2021, Zhang et al. 2023) introduced collision detection, but typically relied on vertex-based optimizations.
  5. MoMa's Position: MoMa represents a significant leap by combining and enhancing both skeleton-aware and shape-aware aspects:
    • It extends skeleton-aware capabilities to the challenging domain of non-homeomorphic skeletons, a gap largely unaddressed by neural methods without paired data. It does this by leveraging masked pose modeling inspired by self-supervised learning.

    • It refines shape-aware techniques by introducing a more robust face-based optimizer, which provides a more consistent and precise way to handle mesh collisions compared to previous vertex-based methods.

      This paper positions itself as a comprehensive solution that pushes the boundaries of automated motion retargeting to handle a wider spectrum of character variations more effectively and realistically.

3.4. Differentiation Analysis

Compared to the main methods in related work, MoMa introduces several core differences and innovations:

  • Handling Non-Homeomorphic Skeletons:

    • Previous Limitation: Most state-of-the-art neural methods (e.g., NKN, PMNet, R2ET) are designed for isomorphic skeletons. Even advanced methods like SAN and SAME, which handle homeomorphic skeletons, cannot manage characters with fundamentally different numbers of leaf joints (non-homeomorphic). Non-neural methods for non-homeomorphic skeletons often require paired motion or explicit mappings.
    • MoMa's Innovation: MoMa is the first neural approach that can perform motion retargeting between non-homeomorphic skeletons without requiring paired motion data or manual mappings. This is achieved through its pose masking auto-encoder, which learns a generalized representation by recovering masked joints, making it adaptable to varying skeletal structures.
  • Face-based vs. Vertex-based Collision Optimization:

    • Previous Limitation: Existing shape-aware methods (e.g., Villegas et al. 2021, R2ET) typically use vertex-based approaches for collision detection and resolution. This means they optimize individual vertex positions to avoid interpenetration. This can lead to localized, sometimes erratic, mesh deformations or a loss of surface consistency as vertices respond independently. Moreover, some vertex-based methods only address specific types of collisions (e.g., body-limb but not limb-limb).
    • MoMa's Innovation: MoMa introduces a novel face-based optimizer. By detecting and resolving collisions at the level of triangular faces (the fundamental building blocks of a mesh surface) rather than individual vertices, it ensures a more coherent and consistent deformation of the mesh surface. This approach minimizes unwanted mesh distortions, leads to more accurate and physically plausible retargeted motions, and can solve all types of collisions (body-body, limb-limb, etc.).
  • Simplicity and Efficiency of Skeleton-aware Module:

    • Previous Complexity: Many prior methods, especially those dealing with complex skeletal differences, often rely on sophisticated techniques like cycle consistency losses or adversarial losses (e.g., in R2ET and Villegas et al. 2021) to ensure robust retargeting.

    • MoMa's Innovation: MoMa's pose masking auto-encoder provides a simpler baseline. By leveraging the power of transformers and masked modeling, it effectively learns to reconstruct missing joint information and generalize across diverse skeletons without the need for complex cycle consistency or adversarial training schemes, while still achieving superior results.

      In essence, MoMa differentiates itself by offering a more generalized skeletal retargeting capability (non-homeomorphic) combined with a more robust and visually consistent shape-aware collision resolution mechanism, all within a conceptually simpler training framework.

4. Methodology

4.1. Principles

The core idea behind MoMa is to tackle skinned motion retargeting by simultaneously addressing both skeletal structure and body shape differences through two primary principles:

  1. Skeleton-aware Masked Pose Modeling: The method leverages a transformer-based auto-encoder with a spatio-temporal masking strategy. This principle is inspired by masked language modeling and masked image modeling, where the model learns to reconstruct missing parts of the input data. In MoMa's context, it learns to predict the missing (or differing) joints in a target skeleton by observing the motion of shared joints. This allows for robust retargeting across diverse skeletal topologies, including non-homeomorphic ones, by learning to fill in the "gaps" or adapt to "extra" joints.

  2. Shape-aware Face-based Collision Optimization: Recognizing that skeletal retargeting alone can lead to mesh interpenetrations, MoMa introduces a novel face-based optimizer. This principle aims to refine the retargeted skeletal positions by minimizing collisions between the triangular faces of the character's mesh. Unlike vertex-based methods, optimizing at the face level provides a more consistent and physically plausible deformation, preventing localized distortions and ensuring a visually realistic animated character.

    These two principles work in conjunction: the skeleton-aware module provides an initial, topologically adapted motion, and the shape-aware module then refines this motion to be physically valid and collision-free on the target character's specific geometry.

4.2. Core Methodology In-depth

MoMa's methodology involves a three-step process: Skeleton motion retargeting, Skinning and collision detection, and Mesh optimization to solve collisions. Before diving into these steps, let's understand how characters and animations are represented.

4.2.1. Character and Animation Representation

A character C_k is defined by its skeleton (S_k, Q_k) and its mesh M_k.

  • Skeleton:

    • Static Representation (S_k): This component describes the inherent, unchanging structure of the skeleton. It contains the offsets (bone lengths) between joints and implicitly defines the skeleton's topology (parent-child relationships, graph structure). For a given character, S_k is constant throughout an animation. It is represented as an N × d vector, where N is the number of joints and d = 3 for the xyz coordinates (presumably bone vectors or initial joint positions relative to their parents).
    • Motion Representation (A(Q_k)): This component captures the dynamic aspect of the animation. At any specific frame, Q_k consists of the relative rotations of each of the N_k joints, expressed as 4D quaternions for numerical stability and to avoid issues such as gimbal lock. The motion representation A(Q_k) has dimensions N × W × d, where N is the number of joints, W is the window length (number of frames in a sequence), and d = 4 for the quaternion components.
  • Mesh Representation (M_k): This component defines the body shape of the character. It consists of a set of vertices, edges, and faces. During an animation, the mesh vertices change position through skinning to follow the skeleton's movement. M_k is expressed as a set of vertices, each with xyz coordinates, giving a dimension of N_v × d, where N_v is the number of vertices and d = 3.

    In summary, an animation for a character C_k is denoted as A((S_k, Q_k), M_k). The goal of retargeting is to transform this into A((S_t, Q_t), M_t) for a target character C_t.
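
In code, this representation boils down to three arrays whose shapes follow the dimensions above (the concrete sizes here are placeholders, not values from the paper):

```python
import numpy as np

N, W, N_v = 25, 64, 7000     # joints, window length, vertices (illustrative sizes)

S_k = np.zeros((N, 3))       # static skeleton: per-joint offset (bone vector), xyz
Q_k = np.zeros((N, W, 4))    # motion: per-joint, per-frame rotation as a quaternion
M_k = np.zeros((N_v, 3))     # mesh: vertex positions, xyz
```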

4.2.2. Skeleton-aware Pose Masking Auto-encoder

This module (illustrated in Figure 3) is responsible for transferring the motion from the source skeleton A(S_k, Q_k) to the target skeleton A(S_t, Q_t), handling differences in joint count and bone length.

Fig. 3. Skeleton-aware pose masking auto-encoder. Starting from an input animation A(Q_k), an encoder embeds each input joint into a set of tokens E_k. Next, we randomly mask (black squares) a subset of E_k and concatenate the remaining missing joints to include all the possible topologies, resulting in E_C^M. The embedded joints in E_C^M, together with the spatio-temporal positional embeddings ε_N + ε_W and a token S_k representing the static skeleton, are fed to a transformer that models their relationships and outputs the latent representation E_C, where all the masked joints have been predicted. Finally, the decoder extracts the super-skeleton motion A(Q_C) from the latent space using the learnt static token (S_k at training time, S_t at test time), from which we can derive the reconstructed input motion A'(S_k, Q_k) and the retargeted motion A(S_t, Q_t). We train our auto-encoder to predict the masked joints by enforcing an MSE loss between the input A(S_k, Q_k) and the reconstructed A'(S_k, Q_k).

Objective: Given a skeletal motion A(S_k, Q_k) for an input character C_k, the auto-encoder retargets it to obtain the same motion A(S_t, Q_t) for a target character C_t.

1. Encoder:

  • The encoder processes the dynamic part of the input animation, A(Q_k).
  • Each joint j_n of Q_k, for every frame w, is tokenized: the quaternion representation of each joint's rotation is converted into a fixed-size vector embedding.
  • This is done using a single linear layer followed by a Leaky ReLU activation function.
  • The output of this tokenization for the entire animation is an embedded sequence E_k with dimensions W × N_k × d, where W is the window length, N_k is the number of joints in the input skeleton, and d is the dimension of the joint embedding vector.
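
A minimal PyTorch sketch of this tokenizer (the embedding size d is our choice; the paper only specifies a single linear layer followed by a Leaky ReLU):

```python
import torch
import torch.nn as nn

class JointTokenizer(nn.Module):
    """Embeds each joint's 4D quaternion into a d-dimensional token."""
    def __init__(self, d_model=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(4, d_model), nn.LeakyReLU())

    def forward(self, q):        # q: (W, N_k, 4) per-frame joint quaternions
        return self.proj(q)      # E_k: (W, N_k, d_model)

E_k = JointTokenizer()(torch.randn(64, 22, 4))  # e.g. W = 64 frames, N_k = 22 joints
```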

2. Pose Masking Strategy:

  • Inspired by masked image modeling, a subset M of the embedded joints in E_k is randomly masked: their values are replaced with a special mask token vector of dimension d. The paper notes this vector "can contain different values depending on the chosen masking strategy" (e.g., a zero vector, a learnable vector, or random noise).
  • To handle diverse skeletal topologies, especially non-homeomorphic ones, the method concatenates N_C − N_k empty tokens to the masked embedded animation joints, where N_C is the maximum number of joints across all characters in the dataset C. This expands the input representation to a "super-skeleton" size, ensuring that any input animation can be mapped into a latent representation large enough to accommodate all possible skeleton topologies.
  • The resulting latent representation, E_C^M, has dimensions W × N_C × d. Each of the N_C − N_k newly added empty tokens is also initialized as a masked vector.
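
A sketch of the masking and super-skeleton padding step (the masking ratio and the zero mask token are assumptions; as noted above, the paper leaves the mask vector's contents open to the chosen strategy):

```python
import torch

def mask_and_pad(E_k, N_C, mask_ratio=0.3, mask_token=None):
    """Randomly mask joint tokens and pad to the super-skeleton size N_C.

    E_k: (W, N_k, d) embedded joints. Returns E_C^M of shape (W, N_C, d).
    """
    W, N_k, d = E_k.shape
    if mask_token is None:
        mask_token = torch.zeros(d)           # one possible masking strategy
    masked = E_k.clone()
    # Spatio-temporal masking: independent keep/mask decision per joint, per frame.
    drop = torch.rand(W, N_k) < mask_ratio
    masked[drop] = mask_token
    # Append N_C - N_k extra masked tokens so any topology fits the latent size.
    pad = mask_token.expand(W, N_C - N_k, d).clone()
    return torch.cat([masked, pad], dim=1)

E_C_M = mask_and_pad(torch.randn(64, 22, 64), N_C=32)  # (64, 32, 64)
```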

3. Spatial-Temporal Positional Embedding:

  • Positional embeddings are crucial for transformers to understand the order and relationships between elements in a sequence, as transformers themselves are permutation-invariant.
  • MoMa uses two separable learnable positional embeddings:
    • Spatial Embedding (ε_N): With dimensions N_C × d, it models the spatial relationships between the N_C joints, regardless of the specific character. This embedding is repeated for each of the W frames. Its goal is to help the network learn the hierarchical graph representation of the joints (as shown in Figure 2).
    • Temporal Embedding (ε_W): With dimensions W × d, it models the temporal relationships between the W frames. This embedding is repeated for each of the N_C joints.
  • The total spatio-temporal embedding is the sum ε_N + ε_W, with dimensions W × N_C × d. This combined embedding is added to the masked embedded animation E_C^M to provide positional context to the transformer. Keeping the spatial and temporal embeddings separable prevents the embedding table from growing to the full 3D (frame × joint) size.
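
The separable embeddings can be implemented with two small parameter tensors that broadcast into the full (W, N_C, d) shape (a sketch; sizes are illustrative):

```python
import torch
import torch.nn as nn

W, N_C, d = 64, 32, 64
E_C_M = torch.randn(W, N_C, d)                # masked tokens from the previous step
eps_N = nn.Parameter(torch.zeros(1, N_C, d))  # spatial: one row per joint, shared over frames
eps_W = nn.Parameter(torch.zeros(W, 1, d))    # temporal: one row per frame, shared over joints
# Broadcasting repeats eps_N across the W frames and eps_W across the N_C joints,
# storing (W + N_C) * d parameters instead of a full W * N_C * d table.
transformer_input = E_C_M + eps_N + eps_W     # (W, N_C, d)
```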

4. Encoding Transformer:

  • The input to the encoding transformer is E_C^M + ε_N + ε_W.
  • The transformer itself is a custom Vision Transformer (ViT) architecture (Dosovitskiy et al., 2020), adapted for spatio-temporal joint information. It features multi-head attention mechanisms, and the paper notes "different activation functions and number of attention heads" are used.
  • The transformer processes this spatio-temporal joint information to output a latent space representation E_C, which contains the embedded reconstructed motion. A static token S_k (of size d) is also implicitly used during encoding; it models the static part of the skeleton and acts as a selector for the joints belonging to a given character.

5. Decoder:

  • The decoder (prediction head) is simpler, consisting of a single linear layer.
  • Its output is an animation A(Q_C), which can be seen as the animation of a "super-skeleton" Q_C that encompasses all possible skeletal topologies in the dataset C.

6. Training:

  • Extraction for Loss: At training time, the network must reconstruct the original input motion. From the super-skeleton motion A(Q_C), the specific motion for the input character, A'(Q_k), is extracted. This is facilitated by the learned static token S_k, which acts as a selector for the relevant joints of character k.

  • Loss Function: The auto-encoder is trained to reconstruct all input joints (both masked and unmasked) using the Mean Squared Error (MSE) loss. The loss is applied on the 3D spatial positions of the joints, computed using a Forward Kinematic (FK) layer. This is important because applying the loss directly on quaternions can lead to accumulated errors down the kinematic chain.

    The training loss is defined as: $ \mathcal{L}_{MSE}(A(Q_k), A'(Q_k)) = \frac{\sum_{w=1}^{W} \sum_{n=1}^{N_k} \left( FK(S_k, j_n) - FK(S_k, j'_n) \right)^2}{W \times N_k} $ Where:

  • A(Q_k): the ground-truth input motion representation for character k.

  • A'(Q_k): the motion representation reconstructed by the auto-encoder for character k.

  • w: the frame index, ranging from 1 to W (the window length).

  • n: the joint index, ranging from 1 to N_k (the number of joints in character k's skeleton).

  • FK(S_k, j_n): the Forward Kinematics function, which combines the static skeleton structure S_k with the quaternion rotation of the n-th joint j_n to output that joint's global 3D position.

  • FK(S_k, j'_n): the same Forward Kinematics function applied to the reconstructed joint j'_n.

  • The difference FK(S_k, j_n) − FK(S_k, j'_n) is the per-joint 3D positional error; it is squared and summed over all joints and frames.

  • The sum is normalized by W × N_k to obtain the average squared error per joint per frame.
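
A sketch of this loss, assuming a differentiable FK layer fk(S, Q) that maps static offsets plus per-joint quaternions (W, N_k, 4) to global positions (W, N_k, 3); the fk callable is a placeholder for the paper's FK layer, not its implementation:

```python
import torch

def fk_mse_loss(S_k, Q_k, Q_k_rec, fk):
    """MSE between ground-truth and reconstructed 3D joint positions."""
    P_true = fk(S_k, Q_k)       # (W, N_k, 3) ground-truth joint positions
    P_rec = fk(S_k, Q_k_rec)    # (W, N_k, 3) reconstructed joint positions
    # Squared positional error, averaged over all frames and joints.
    return ((P_true - P_rec) ** 2).sum(dim=-1).mean()
```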

7. Test Time:

  • Given the reconstructed super-skeleton motion A(Q_C) from the decoder, the motion for any specific target skeleton, A(Q_t), is obtained.
  • This is done by applying the same Forward Kinematics layer with the target skeleton's static representation S_t: A(S_t, Q_t) = FK(S_t, j'_n). In other words, the learned joint quaternions j'_n are applied to the target's specific bone lengths and hierarchy to generate its 3D joint positions.

4.2.3. Shape-aware Face-based Optimizer

This module (illustrated in Figure 4) refines the retargeted skeletal motion to avoid mesh interpenetrations and ensure a visually realistic result. It operates on a per-frame basis, iteratively adjusting joint positions.

Fig. 4. Shape-aware face-based optimizer. (a) Given the retargeted skeletal motion A(S_t, Q_t) and the corresponding mesh in T-pose, we apply the skinning to obtain the full animation A((S_t, Q_t), M_t). (b) During each iteration the mesh can display collisions, which (c) are detected and weighted by the collision penalizer. The face-based optimizer minimizes the loss L_O(S_t, Q_t) by adapting the skeleton position (S_t, Q_t) until obtaining (d) a retargeted motion without collisions.

1. Skinning:

  • After the skeleton-aware module provides the retargeted skeletal motion A(S_t, Q_t) for the target character C_t, this motion is applied to the target character's mesh M_t.
  • For each frame, Linear Blend Skinning (LBS) takes the vertices v of the target mesh in its T-pose (a canonical neutral pose) and the transformations (rotations and translations) derived from the skeleton (S_t, Q_t) to compute the new vertex positions v'. This yields the animated mesh A((S_t, Q_t), M_t).
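
A minimal LBS sketch (our illustration; per-joint transforms relative to the T-pose are assumed to be given as 4×4 matrices):

```python
import numpy as np

def linear_blend_skinning(v_tpose, weights, joint_transforms):
    """Deform T-pose vertices by a weighted blend of per-joint transforms.

    v_tpose:          (N_v, 3) mesh vertices in the T-pose
    weights:          (N_v, N) skinning weights, each row summing to 1
    joint_transforms: (N, 4, 4) global transform of each joint relative to the T-pose
    """
    v_h = np.concatenate([v_tpose, np.ones((len(v_tpose), 1))], axis=1)  # homogeneous coords
    blended = np.einsum('vn,nij->vij', weights, joint_transforms)        # per-vertex blended 4x4
    return np.einsum('vij,vj->vi', blended, v_h)[:, :3]                  # new positions v'
```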

2. Collision Penalizer:

  • The core of this module is to detect and penalize collisions (interpenetrations) between different parts of the character's mesh.

  • Face-based Approach: Unlike vertex-based methods, MoMa's approach is face-based. It identifies colliding triangles Δ((S_t, Q_t)) using a Bounding Volume Hierarchy (BVH) (Karras, 2012; Pavlakos et al., 2019), an efficient data structure for accelerating collision detection queries.

  • Volumetric Distance Field: For each triangular face Δ_Ψ on the external surface of the mesh, a conic 3D volumetric distance field Ψ is built, along with its normal vector ν̂. This field conceptually defines a "forbidden zone" around each face.

  • Bidirectional Interpenetration: When two triangles, Δ_i and Δ_r, collide, the interpenetration is considered bidirectional: the vertices v_i of Δ_i are treated as intruders into the distance field Ψ_{Δ_r} of the receiver triangle Δ_r, and vice-versa.

  • Collision Term: The collision term ξ to be minimized quantifies this interpenetration. Because vertices are grouped into triangular faces, a vertex cannot respond individually to a collision, which, as the paper argues, improves mesh consistency.

    The collision term ξ(S_t, Q_t) is defined as: $ \xi(S_t, Q_t) = \sum_{(A_i, A_r)} \bigg\{ \sum_{v_i \in A_i} \| -\Psi_{A_r}(v_i) \hat{v}_i \|^2 + \sum_{v_r \in A_r} \| -\Psi_{A_i}(v_r) \hat{v}_r \|^2 \bigg\} $ Where:

  • (A_i, A_r): a pair of colliding triangles or regions on the mesh (e.g., from different body parts such as arm and torso).

  • v_i ∈ A_i: a vertex belonging to the "intruder" triangle or region A_i.

  • v_r ∈ A_r: a vertex belonging to the "receiver" triangle or region A_r.

  • Ψ_{A_r}(v_i): the volumetric distance field associated with receiver triangle A_r, evaluated at the position of intruder vertex v_i. This value is negative if v_i lies inside the forbidden zone of A_r, indicating interpenetration.

  • v̂_i: the normal vector of the intruder vertex v_i (or the normal of the face it belongs to).

  • ‖−Ψ_{A_r}(v_i) v̂_i‖²: the squared penetration depth of v_i into A_r along its normal; the negative sign turns interpenetration into a positive penalty. The sum runs over all vertices of A_i that penetrate A_r, and symmetrically for A_r penetrating A_i.

  • The outer sum over (A_i, A_r) accumulates the penalties from all colliding pairs of regions (a simplified code sketch is given after the self-contact filtering note below).

    The distinction between vertex-based and face-based methods is highlighted in Figure 5:

Fig. 5. (a) Vertex-based methods use a global distance field between each vertex and the others, thus providing a computationally expensive one-to-many metric. (b) Face-based methods identify collisions with colliding triangle surfaces, thus a more efficient one-to-one metric.

  • Self-Contact Filtering: To improve the optimization process, certain self-contacts (e.g., an eye colliding with the skull) that are inherent to the character's T-pose or occur between kinematically neighboring parts (e.g., neck and torso, upper arm and lower arm) are detected and excluded from the collision term. This prevents the optimizer from trying to resolve non-physical or pre-existing contacts. This step involves setting all input meshes to T-pose and identifying these inherent contacts.
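
A simplified sketch of the bidirectional penalty, assuming a broad phase (e.g., the BVH mentioned above) has already produced the colliding triangle pairs and that psi is a callable evaluating each triangle's cone distance field and returning a scalar tensor (all names here are placeholders). Since the vertex normal in the penalty has unit length, the squared norm reduces to the squared field value:

```python
import torch

def collision_term(colliding_pairs, psi, tri_verts):
    """xi: sum of squared penetration depths over all colliding triangle pairs.

    colliding_pairs: iterable of (intruder, receiver) triangle indices
    psi(t, v):       distance field of triangle t at point v (negative inside)
    tri_verts:       (n_triangles, 3, 3) vertex positions of each triangle
    """
    xi = tri_verts.new_zeros(())
    for t_i, t_r in colliding_pairs:
        for v in tri_verts[t_i]:      # intruder vertices inside the receiver's field
            xi = xi + torch.clamp(-psi(t_r, v), min=0.0) ** 2
        for v in tri_verts[t_r]:      # and symmetrically, receiver into intruder
            xi = xi + torch.clamp(-psi(t_i, v), min=0.0) ** 2
    return xi
```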

3. Optimization:

  • A Limited-memory BFGS (L-BFGS) optimizer (Nocedal and Wright, 1999) is chosen. L-BFGS is a quasi-Newton method efficient for high-dimensional optimization, approximating the Hessian matrix without explicit computation, making it suitable for adapting joint positions.

  • The optimizer minimizes a loss function L_O for each frame of the animation. This loss balances two objectives: resolving collisions and preserving the original motion dynamics.

    The optimization loss L_O(S_t, Q_t) is given by: $ \mathcal{L}_O(S_t, Q_t) = \lambda \xi(S_t, Q_t) + (1 - \lambda) \mathcal{L}_{MSE}((S_t, Q_t), (S_t, Q_{t-1})) $ Where:

  • L_O(S_t, Q_t): the total optimization loss for the target skeleton in its current state (S_t, Q_t).

  • λ: a balancing weight (hyper-parameter) between 0 and 1 that sets the trade-off between collision avoidance and motion preservation.

  • ξ(S_t, Q_t): the collision term discussed above; minimizing it drives the optimizer to resolve mesh interpenetrations.

  • L_MSE((S_t, Q_t), (S_t, Q_{t-1})): an MSE term that encourages temporal consistency. It measures the difference between the current frame's joint positions (S_t, Q_t) and the previous frame's optimized joint positions (S_t, Q_{t-1}), preventing erratic frame-to-frame movements and keeping the motion close to the original dynamics.

  • The optimizer adapts the joint positions in Q_t (the motion representation) to minimize this combined loss, ultimately producing A((S_t, Q_t), M_t): a retargeted motion that is both collision-free and faithful to the original dynamics.
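
A per-frame sketch of this refinement using torch.optim.LBFGS (the λ value, the step count, and the simplification of comparing rotations rather than FK joint positions in the temporal term are our assumptions):

```python
import torch

def optimize_frame(Q_t, Q_prev, collision_fn, lam=0.8, steps=20):
    """Refine one frame's rotations by minimizing L_O = lam*xi + (1-lam)*MSE.

    Q_t:          (N, 4) current-frame joint rotations to refine
    Q_prev:       (N, 4) previous frame's already-optimized rotations
    collision_fn: maps rotations to the collision term xi (skinning + detection)
    """
    Q_t = Q_t.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([Q_t], max_iter=steps)

    def closure():
        opt.zero_grad()
        # Collision resolution balanced against temporal consistency.
        loss = lam * collision_fn(Q_t) + (1 - lam) * ((Q_t - Q_prev) ** 2).mean()
        loss.backward()
        return loss

    opt.step(closure)
    return Q_t.detach()
```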

5. Experimental Setup

5.1. Datasets

The authors evaluate MoMa on a variety of datasets to demonstrate its capabilities across different skeletal topologies and motion types.

  • Mixamo Dataset (Adobe, 2020): This is the primary dataset used for quantitative evaluation, especially for comparing retargeting between isomorphic, homeomorphic, and non-homeomorphic characters.

    • Characteristics: It consists of a diverse set of 3D characters and pre-recorded motion clips. Retargeting in Mixamo is typically performed by copying rotations.
    • Setup:
      • Isomorphic Skeletons: 7 characters for training, 11 for testing.
      • Homeomorphic Skeletons: A subset of characters from Mixamo (excluding Liam, Pearl, Jasper which are no longer available). SAN is used as a baseline.
      • Non-Homeomorphic Skeletons: Two CC-Licensed characters from Sketchfab (animated with Mixamo) combined with characters from the isomorphic experiments. These characters would typically have extra end-effectors like tails or different limb configurations.
      • Motion Data: An average of 1250 motions per training character, 100 for validation, and 60 for each test character. The test set includes unseen characters and unseen motions.
    • Why chosen: It's a standard benchmark in motion retargeting, allowing comparison with prior work. However, the authors note inconsistencies in GT (ground truth) data, sometimes showing interpenetrations.
  • Ubisoft La Forge Animation Dataset (LAFAN1) (Harvey et al., 2020):

    • Characteristics: Comprises human motion data, featuring 5 subjects and 77 different motion sequences.
    • Why chosen: Used for evaluating the network's ability to reconstruct masked segments of motion, showcasing robustness on human motion data.
  • Quadruped Dataset (Zhang et al., 2018):

    • Characteristics: Includes 52 unique sequences of dog motions, covering various activities like idle, walk, run, sit, stand, and jumps.
    • Why chosen: Used to test generalization to non-humanoid characters, specifically quadrupedal motion.
  • Carnegie Mellon University (CMU) Motion Capture Database (Anon):

    • Characteristics: A large dataset capturing real human motion with 144 actors performing diverse movements, collected using an Optical Motion Capture System. This introduces more complex and varied movement patterns compared to the Mixamo dataset's copy-rotation based motions.
    • Why chosen: Used to assess the network's ability to reconstruct segments of real motion, simulating scenarios with occluded markers or noisy data. Quantitative evaluation (MSE, FIE) with Mixamo isn't feasible due to unpaired motions, so it focuses on reconstruction.
  • Skills from Video (SFV) Data (Peng et al., 2018):

    • Characteristics: Contains real-world complex human motions, potentially derived from videos. The paper suggests these motions can be processed by common SMPL mesh (a parameterized body model) for use with simulation.
    • Why chosen: Demonstrates the robustness and generalization ability of MoMa to transfer motions from real-world complex human videos to synthetic characters.

5.2. Evaluation Metrics

The paper uses three main metrics to evaluate the performance of MoMa, combining both skeletal accuracy and mesh realism.

  1. Mean Squared Error (MSE):

    • Conceptual Definition: Mean Squared Error measures the average of the squares of the errors between the retargeted joint positions and the ground truth joint positions. A lower MSE indicates that the retargeted motion's skeleton is closer to the intended or ground truth motion. In the context of motion retargeting, it quantifies how accurately the spatial configuration of the target skeleton matches the source's intended pose. The MSE is calculated after aligning the root joint of the retargeted motion with the ground truth and normalizing by character height, which accounts for differences in overall scale and global position.
    • Mathematical Formula: The paper refers to the MSE calculated similarly to the loss function of the skeleton-aware module, but for evaluation it compares retargeted 3D joint positions against ground-truth 3D joint positions: $ MSE = \frac{1}{W \times N} \sum_{w=1}^{W} \sum_{n=1}^{N} \| P_{w,n} - \hat{P}_{w,n} \|^2 $
    • Symbol Explanation:
      • W: the total number of frames in the animation.
      • N: the total number of joints in the skeleton.
      • P_{w,n}: the 3D spatial position (xyz coordinates) of the n-th joint at frame w in the ground-truth animation.
      • \hat{P}_{w,n}: the 3D spatial position of the n-th joint at frame w in the retargeted animation.
      • ‖·‖²: the squared Euclidean distance between the two 3D joint positions.
  2. Face Interpenetration Error (FIE):

    • Conceptual Definition: Face Interpenetration Error quantifies the extent of unwanted collisions or interpenetrations within the character's mesh. It is expressed as the percentage of triangular faces that are currently in a state of collision in an animation. A lower FIE indicates a more physically plausible and realistic animation, free from visual artifacts caused by mesh self-intersection.
    • Mathematical Formula: $ FIE = \frac{1}{W} \sum_{w=1}^{W} \frac{\#\Delta(S_t, Q_t)}{\#F(M_t)} \, \% $
    • Symbol Explanation:
      • W: the total number of frames in the animation.
      • #Δ(S_t, Q_t): the number of colliding faces detected in the target mesh M_t at frame w, given the skeleton state (S_t, Q_t).
      • #F(M_t): the total number of faces in the target mesh M_t.
      • %: the ratio is expressed as a percentage.
  3. Skeleton and Collisions Error (SCE):

    • Conceptual Definition: Skeleton and Collisions Error is a comprehensive, combined metric that aims to balance both the accuracy of the skeletal motion transfer (measured by MSE) and the physical realism of the skinned mesh (measured by FIE). It acknowledges that achieving a low MSE might sometimes come at the cost of increased mesh collisions, and vice-versa. By multiplying these two metrics, it provides a single score that rewards methods achieving both accurate skeletal motion and collision-free mesh deformation. A lower SCE value indicates a better overall retargeting quality. The authors note that while a weighted sum or other relations could be chosen, multiplication captures the dependency and is a valid choice for this combined evaluation.
    • Mathematical Formula: $ SCE = MSE \times FIE $
    • Symbol Explanation:
      • MSE: The Mean Squared Error as defined above.
      • FIE: The Face Interpenetration Error as defined above.
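
The three metrics are straightforward to compute once the joint positions and per-frame collision counts are available; a short sketch (array layouts are our assumptions):

```python
import numpy as np

def mse_metric(P_gt, P_ret):
    """Average squared joint-position error; inputs are (W, N, 3) arrays of
    root-aligned, height-normalized joint positions."""
    return np.mean(np.sum((P_gt - P_ret) ** 2, axis=-1))

def fie_metric(colliding_faces_per_frame, total_faces):
    """Percentage of colliding faces, averaged over frames."""
    return 100.0 * np.mean(np.asarray(colliding_faces_per_frame) / total_faces)

def sce_metric(mse, fie):
    """Combined skeleton-and-collisions error."""
    return mse * fie
```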

5.3. Baselines

MoMa's performance is compared against several state-of-the-art and baseline methods in motion retargeting:

  • Copy-rotation: A simple baseline method where joint rotations from the source skeleton are directly copied to the target skeleton, typically after aligning them in a common T-pose. It doesn't account for bone length differences or skeletal topology variations beyond isomorphism.

  • NKN (Neural Kinematic Networks) (Villegas et al., 2018): An early neural method for unsupervised motion retargeting, primarily designed for isomorphic skeletons. It uses an encoder-decoder network to learn kinematic representations.

  • PMNet (Lim et al., 2019): Another neural approach that aims to learn disentangled pose and movement representations for unsupervised motion retargeting, also mainly for isomorphic skeletons.

  • SAN (Skeleton-Aware Networks) (Aberman et al., 2020): A method specifically designed to handle retargeting between homeomorphic skeletons. It explicitly models skeletal structure differences. MoMa compares against SAN for homeomorphic scenarios. The paper also mentions "SAN opt", indicating SAN with MoMa's shape-aware optimizer applied on top for fairness.

  • SAME (Skeleton-Agnostic Motion Embedding) (Lee et al., 2023): A recent method that aims to solve various animation tasks in a skeleton-agnostic manner, tackling both isomorphic and homeomorphic skeletons.

  • R2ET (Skinned motion retargeting with residual perception of motion semantics & geometry) (Zhang et al., 2023): A recent skeleton-aware and shape-aware method that employs an attractive/repulsive field mechanism for collision resolution. It's designed for isomorphic skeletons.

    These baselines represent different levels of sophistication and capabilities, from basic copy-pasting to advanced neural networks handling specific types of skeletal variations and rudimentary shape awareness, allowing for a comprehensive evaluation of MoMa's advancements.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate MoMa's superior performance across various scenarios, including isomorphic, homeomorphic, and non-homeomorphic skeletons, and its ability to reconstruct real motion capture data.

Isomorphic Skeletons

The results in Table 1, comparing MoMa against several baselines for isomorphic retargeting, highlight its effectiveness. The following are the results from Table 1 of the original paper:

Isomorphic:

| Methods | MSE↓ | FIE%↓ | SCE↓ |
| --- | --- | --- | --- |
| GT (Mixamo) | – | 4.10 | – |
| Copy | 0.045 | 3.23 | 0.145 |
| NKN* | 0.575 | – | – |
| PMnet* | 0.281 | – | – |
| SAN | 0.141 | 1.53 | 0.216 |
| SAME | 0.176 | 1.21 | 0.213 |
| Ours (no opt) | 0.043 | 3.13 | 0.134 |
| Shape-aware: | | | |
| R2ET | 0.042 | 3.96 | 0.166 |
| SAN opt | 0.163 | 1.02 | 0.166 |
| Ours | 0.049 | 1.01 | 0.049 |
  • MSE: R2ET (0.042) and Ours (no opt) (0.043) achieve the lowest MSE values, indicating excellent skeletal accuracy, even outperforming the Copy baseline (0.045). This suggests that MoMa's skeleton-aware module is highly effective at preserving the original motion's skeletal dynamics. NKN* and PMnet* have significantly higher MSE, indicating they are not as accurate on this specific setup.
  • FIE%↓ (Face Interpenetration Error): This metric is crucial for shape-aware evaluation. Ours (the full MoMa approach with the shape-aware optimizer) achieves the lowest FIE of 1.01%, significantly better than all other methods, including SAN opt (1.02%) and R2ET (3.96%). This strongly validates the efficacy of MoMa's face-based optimizer in resolving mesh collisions. It's notable that Ours (no opt) already has a lower FIE (3.13%) than R2ET, indicating that MoMa's skeleton-aware module inherently produces motions with fewer collisions. The GT (Mixamo) FIE of 4.10% is important; it reveals that even the ground truth in the Mixamo dataset can contain interpenetrations, making the task challenging and highlighting the need for robust collision resolution.
  • SCE↓ (Skeleton and Collisions Error): As a combined metric, SCE provides an overall measure of quality. Ours achieves the lowest SCE of 0.049, demonstrating its superior balance between skeletal accuracy and collision avoidance. This is a significant improvement over R2ET (0.166) and SAN opt (0.166). The results for "SAN opt" prove that MoMa's shape-aware module is effective and can be applied independently to other skeleton-aware methods.

Homeomorphic Skeletons

For homeomorphic skeletons, MoMa is compared against SAN, which is specifically designed for such variations. The following are the results from Table 2 of the original paper:

Homeomorphic

Methods           MSE↓     FIE%↓    SCE↓
SAN               0.108    1.25     0.135
SAME              0.122    1.12     0.137
Ours (no opt)     0.025    3.6      0.090
SAN opt           0.117    0.91     0.106
Ours              0.031    0.9      0.028
  • Ours (no opt) achieves a remarkably low MSE of 0.025, significantly outperforming SAN (0.108) and SAME (0.122). This highlights the strength of MoMa's skeleton-aware module in handling homeomorphic differences.
  • Ours (full approach) maintains an excellent FIE of 0.9%, the lowest among all methods, further validating the face-based optimizer.
  • The SCE for Ours is 0.028, which is substantially lower than SAN (0.135) and SAME (0.137), demonstrating clear state-of-the-art performance in homeomorphic retargeting.

Motion Reconstruction

The evaluation on LAFAN1 and CMU datasets (Table 4) focuses on the network's ability to reconstruct masked segments of real motion, simulating noisy or incomplete motion capture data. The following are the results from Table 4 of the original paper:

Dataset    Masked joints    MSE↓
LAFAN1     5                0.042
           8                0.032
           10               0.050
           15               0.070
CMU        5                0.047
           8                0.035
           10               0.081
           15               0.15

The results show that MoMa effectively reconstructs real motion capture data, with MSE values comparable to those seen on Mixamo. For both datasets, an intermediate number of masked joints (around 8) yields the lowest MSE, suggesting an optimal masking level for learning robust representations. As the number of masked joints increases to 15, the MSE predictably rises, as reconstructing more missing information becomes harder. This validates the pose masking auto-encoder's robustness and generalization to real-world, noisy motion data.
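To make the reconstruction setup concrete, the sketch below masks a random subset of joint tokens before the encoder and returns the indices the auto-encoder must reconstruct. This is a minimal PyTorch sketch under assumed names and shapes (the function, the token layout, and the 0.1 jitter scale are illustrative, not the authors' code); the `strategy` argument mirrors the ablation labels discussed in Section 6.3.

```python
import torch

def mask_joint_tokens(tokens: torch.Tensor, num_masked: int,
                      strategy: str = "zero"):
    """Mask a random subset of joint tokens before the transformer encoder.

    tokens: (frames, joints, dim) features for one animation clip.
    Hypothetical sketch of the paper's spatio-temporal masking; the
    strategies mirror the ablation labels but are not the authors' code.
    """
    T, J, D = tokens.shape
    masked = tokens.clone()
    idx = torch.randperm(J)[:num_masked]            # joints masked across all frames
    if strategy == "zero":                          # best in Table 3 (MSE 0.043)
        masked[:, idx, :] = 0.0
    elif strategy == "random":                      # replace with noise (MSE 0.095)
        masked[:, idx, :] = torch.randn(T, num_masked, D)
    elif strategy == "perturb":                     # jitter originals (MSE 1.025);
        # the 0.1 scale is an arbitrary illustrative choice
        masked[:, idx, :] = tokens[:, idx, :] + 0.1 * torch.randn(T, num_masked, D)
    return masked, idx

# The auto-encoder is then trained to reconstruct the original tokens at
# the masked joints, scored with MSE as in Tables 3 and 4.
tokens = torch.randn(64, 22, 128)                   # 64 frames, 22 joints, 128-d tokens
masked_tokens, masked_idx = mask_joint_tokens(tokens, num_masked=8)
```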

Qualitative Results

The paper provides qualitative results (Figures 6, 7, 8, 10) that visually reinforce the quantitative findings:

  • Isomorphic Skeletons (Fig. 6): MoMa effectively solves collisions (e.g., hands, arm-head interpenetration) while preserving motion dynamics, unlike baselines that either fail to resolve collisions or excessively deform the pose.
  • Homeomorphic Skeletons (Fig. 7): MoMa avoids interpenetrations between body parts, which SAN (a skeleton-aware-only method) fails to do, especially with large variations in skeleton and shape.
  • Non-Homeomorphic Skeletons (Fig. 8): The paper visually demonstrates retargeting from a standard skeleton to characters with extra end-effectors (like a tail) and different limb configurations, showcasing MoMa's unique capability.
  • Real-world Character (Fig. 10): MoMa successfully retargets complex real-world motions from the SFV dataset to synthetic characters, highlighting its robustness and generalization to unseen skeletons.


6.2. Data Presentation (Tables)

The tables presented in the "Core Results Analysis" section are direct transcriptions of the original paper's tables.

6.3. Ablation Studies / Parameter Analysis

The authors conduct ablation studies to understand the contribution of different components of MoMa, particularly for the skeleton-aware module and the impact of the $\lambda$ parameter in the shape-aware module.

Skeleton-aware Module Ablation Studies

The following are the results from Table 3 of the original paper:

M     MSE↓       Masking strategy    MSE↓       Encoding        MSE↓
5     0.095      zero                0.043      No encoding     0.524
10    0.043      random              0.095      ε_N only        1.025
15    0.094      perturb             1.025      ε_W only        1.450
                                                ε_N + ε_W       0.043

(Each column pair is an independent ablation reporting MSE; the encoding labels follow the interpretation given in the list below.)

This table shows the MSE values for different configurations of the skeleton-aware module using the isomorphic skeletons setup.

  1. Effect of Number of Masked Joints (M):

    • $M=5$ (0.095 MSE): If too few joints are masked, the network does not face enough of a reconstruction challenge to learn robust representations, resulting in higher MSE.
    • $M=10$ (0.043 MSE): This configuration yields the best MSE. It suggests an optimal balance where the network is challenged enough to learn meaningful spatio-temporal relationships without being overwhelmed.
    • $M=15$ (0.094 MSE): If too many joints are masked, the reconstruction task becomes excessively difficult, hindering the network's ability to learn the underlying motion structure and leading to higher MSE.
    • Conclusion: The number of masked joints ($M$) is a crucial hyper-parameter, and an optimal value (here $M=10$) is essential for effective learning.
  2. Masking Strategy:

    • zero (0.043 MSE): Replacing masked tokens with a vector of zeros (or a specific fixed value) provides a consistent starting point for the reconstruction task, leading to the best performance.
    • random (0.095 MSE): Replacing with random values introduces noise, making the reconstruction harder and increasing MSE.
    • perturb (1.025 MSE): This strategy (likely perturbing original values rather than fully masking) performs poorly, indicating that a clear masking signal is beneficial.
    • Conclusion: Using a zero (or consistent) masking strategy is most effective for the pose masking auto-encoder.
  3. Encoding (Positional Embeddings):

    • No encoding (0.524 MSE): Without any positional embeddings, the network performs poorly, demonstrating the critical need for understanding spatio-temporal relationships.
    • ε_N only (1.025 MSE): This likely refers to using only the spatial embedding $\varepsilon_N$. Performance remains poor because temporal relationships between frames go unmodeled.
    • ε_W only (1.450 MSE): This likely refers to using only the temporal embedding $\varepsilon_W$. Spatial relationships between joints go unmodeled, with similarly poor results.
    • ε_N + ε_W (0.043 MSE): The combination of both spatial and temporal embeddings ($\varepsilon_N + \varepsilon_W$), which yields the best MSE, matching the optimal result.
    • Conclusion: Both spatial and temporal positional embeddings are essential and their combination is critical for the network to accurately model the complex spatio-temporal relationships between joints in an animation.
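As an illustration of the combined encoding, here is a minimal PyTorch sketch that adds a learned per-joint (spatial, $\varepsilon_N$) and per-frame (temporal, $\varepsilon_W$) embedding to the input tokens via broadcasting; the class name, shapes, and zero initialization are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalEmbedding(nn.Module):
    """Adds learned spatial (per-joint) and temporal (per-frame) embeddings.

    Hypothetical sketch of the combined encoding favored by the ablation
    (epsilon_N + epsilon_W); names and initialization are assumptions.
    """
    def __init__(self, num_joints: int, num_frames: int, dim: int):
        super().__init__()
        self.spatial = nn.Parameter(torch.zeros(1, 1, num_joints, dim))   # epsilon_N
        self.temporal = nn.Parameter(torch.zeros(1, num_frames, 1, dim))  # epsilon_W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, dim); both embeddings broadcast over x
        return x + self.spatial + self.temporal

emb = SpatioTemporalEmbedding(num_joints=22, num_frames=64, dim=128)
out = emb(torch.randn(2, 64, 22, 128))
```

Dropping either parameter reproduces the single-embedding ablations, which is exactly the comparison Table 3 reports.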

Shape-aware Module Ablation Studies ($\lambda$ parameter)

Fig. 9. Ablation studies for the $\lambda$ value in the shape-aware module. The best results are indicated by points closer to the origin; MoMa outperforms both shape-aware (R2ET) and skeleton-aware (SAN) methods across $\lambda$ values, with the best result at $\lambda = 0.7$.

Figure 9 illustrates the trade-off between MSE (skeletal accuracy) and FIE (collision avoidance) as the balancing weight $\lambda$ in the shape-aware module's loss function is varied. The goal is to be as close to the origin (0 MSE, 0 FIE) as possible.

  • Interpretation:
    • A higher $\lambda$ (closer to 1) puts more emphasis on minimizing collisions ($\xi(S_t, Q_t)$), potentially at the cost of deviating more from the original skeletal motion (higher MSE).
    • A lower $\lambda$ (closer to 0) prioritizes preserving the original skeletal motion ($L_{MSE}((S_t, Q_t), (S_t, Q_{t-1}))$), which might lead to more collisions (higher FIE). A plausible combined objective is sketched after this list.
  • Results: The chart shows a curve for MoMa's performance across different $\lambda$ values.
    • The point $\lambda = 0.7$ is highlighted as yielding the "best result" for MoMa, indicating an optimal balance where both MSE and FIE are acceptably low, resulting in high overall quality.
    • Crucially, the curve for MoMa consistently lies closer to the origin than the single data points for R2ET (a shape-aware baseline) and SAN (a skeleton-aware baseline). This demonstrates that MoMa can achieve superior performance regardless of the chosen trade-off between skeletal accuracy and collision avoidance, outperforming state-of-the-art solutions across the spectrum.
  • Conclusion: The $\lambda$ parameter allows animators or users to control the desired balance, and MoMa's inherent capabilities allow it to achieve better outcomes across these trade-offs.
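Reading the two trade-off bullets together, a plausible form for the optimizer's objective is a convex combination of the two terms (an assumption consistent with the described behavior, not necessarily the paper's exact loss):

$$\mathcal{L}(\lambda) = \lambda \, \xi(S_t, Q_t) + (1 - \lambda) \, L_{MSE}\big((S_t, Q_t), (S_t, Q_{t-1})\big)$$

Under this reading, $\lambda \to 1$ prioritizes collision resolution and $\lambda \to 0$ prioritizes motion preservation, matching the curve traced in Fig. 9.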

Computational Cost of the Optimization Step

The following are the results from Table 5 of the original paper:

SGD time (s)    L-BFGS time (s)    Avg. collisions per frame    Animation length (frames)
144.2           73.1               522                          56
345.3           180.39             450                          144

This table compares the execution time of two different optimizers, Stochastic Gradient Descent (SGD) and Limited-memory BFGS (L-BFGS), for the shape-aware module.

  • Comparison: For both animation lengths (56 and 144 frames), L-BFGS is significantly faster than SGD. For the 56-frame animation, L-BFGS takes 73.1 seconds compared to SGD's 144.2 seconds. For the 144-frame animation, L-BFGS takes 180.39 seconds compared to SGD's 345.3 seconds.
  • Factor Influencing Cost: The paper notes that the mesh's joint and face numbers have minimal impact on performance; instead, the average number of collisions per frame is the primary determinant of computation load.
  • Conclusion: This justifies the choice of L-BFGS as the optimizer for the shape-aware module due to its superior efficiency, particularly important for processing longer animations or many collision events.
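As a concrete illustration of the swap, the sketch below runs one L-BFGS step with PyTorch's closure-based API; `pose_params` and `frame_loss` are hypothetical stand-ins for the module's optimized joint positions and its combined objective, not the authors' code.

```python
import torch

# Per-joint 3D positions to optimize for one frame (illustrative shape).
pose_params = torch.randn(22, 3, requires_grad=True)

def frame_loss() -> torch.Tensor:
    # Placeholder objective; in MoMa this would blend the collision and
    # motion-preservation terms discussed above.
    return (pose_params ** 2).sum()

optimizer = torch.optim.LBFGS([pose_params], lr=1.0, max_iter=20)

def closure():
    # L-BFGS requires a closure it can call repeatedly to re-evaluate
    # the loss and gradients during its line search.
    optimizer.zero_grad()
    loss = frame_loss()
    loss.backward()
    return loss

optimizer.step(closure)
```

Unlike SGD, L-BFGS exploits curvature information gathered across repeated closure evaluations, which is consistent with the roughly 2x speed-up reported in Table 5.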

6.4. Data Presentation (Images)

Fig. 6. State-of-the-art results for isomorphic skeletons, comparing the source character with Copy, SAN, R2ET, Ours (no opt), and Ours. Red circles mark body-part adaptation and collision issues, showing MoMa's qualitative superiority.

Fig. 7. Qualitative results for the retargeting between homeomorphic skeletons, comparing the source character, SAN, and Ours. The circled regions highlight the hand poses, where our method's retargeting is noticeably more natural and consistent.

Fig. 8. Qualitative results for the retargeting between non-homeomorphic skeletons: the source character and two retargeted target characters (top), with close-ups of clothing and body-part collision handling (bottom).

Fig. 10. Real-world motion retargeting examples from Skills from Video (SFV) data, such as a backflip, transferred from the source character to a synthetic target character.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces MoMa, a novel and comprehensive approach for skinned motion retargeting that excels in both skeleton-aware and shape-aware aspects. MoMa's key innovation lies in its transformer-based auto-encoder with a spatio-temporal masking strategy, which uniquely enables fully automatic motion retargeting between isomorphic, homeomorphic, and crucially, non-homeomorphic character topologies—a capability previously lacking in neural methods without paired data. Furthermore, its shape-aware module employs a novel face-based optimizer that resolves mesh interpenetrations more consistently and precisely than conventional vertex-based methods, leading to more accurate and visually convincing retargeted motions. The experimental results, both quantitative and qualitative, demonstrate MoMa's state-of-the-art performance on the Mixamo dataset and its robustness in transferring complex motions from real videos to synthetic characters.

7.2. Limitations & Future Work

The authors highlight a key limitation and implicitly suggest future work based on their findings:

  • Generalization to Non-Homeomorphic Skeletons: While MoMa is the first neural method to handle non-homeomorphic skeletons without paired motion, its ability to generalize to all possible non-homeomorphic skeletons is limited. This generalization currently depends on whether unseen skeletons can be closely represented by existing skeletons in the training set (e.g., similar body proportions and number of joints) or by a combination of multiple training skeletons.
  • Future Direction - Diverse Training Data: To improve generalization to a wider range of non-homeomorphic skeletons, a more diverse and comprehensive training set would be required. This implies that if the training set is sufficiently diverse, MoMa could generalize to nearly all isomorphic and homeomorphic skeletons, but the scope for non-homeomorphic characters still hinges on data coverage.
  • Future Direction - Real-World Motion Transfer: The paper demonstrates the potential of transferring motions from real-world human videos (processed by SMPL mesh) to synthetic characters. This opens up promising avenues for future research in bridging the gap between real-world motion capture and virtual animation, potentially making character animation more accessible and versatile.

7.3. Personal Insights & Critique

MoMa presents a significant step forward in motion retargeting, especially in its ability to tackle non-homeomorphic skeletons without explicit mappings. The integration of masked pose modeling into this domain is particularly insightful, borrowing a powerful self-supervised learning paradigm from NLP and computer vision and adapting it effectively for skeletal data. This indicates the increasing cross-pollination of ideas across deep learning subfields.

The choice of a face-based optimizer over vertex-based methods for collision resolution is a crucial design decision that yields tangible improvements in visual quality. It addresses a common frustration in character animation where realistic motion is undermined by unsightly mesh interpenetrations. The ablation studies effectively demonstrate the contribution of each module and the importance of hyper-parameters like the number of masked joints and positional embeddings. The $\lambda$ parameter in the shape-aware module provides a practical control knob for balancing accuracy and realism, which is highly valuable for animators.

One potential area for critique or further investigation might involve the SCE metric ($MSE \times FIE$). While the authors justify its use by stating it captures the dependency between the two errors and that the weighting is a matter of preference, multiplication can sometimes obscure nuances if one component is significantly larger or smaller than the other. For instance, a very small MSE could still result in a low SCE even with a relatively high FIE, or vice versa. A deeper analysis of the sensitivity of this metric, and perhaps alternative composite metrics (e.g., a geometric mean or a dynamically weighted sum), could be explored.
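To make the concern concrete, consider two hypothetical retargetings:

$$0.01 \times 10.0 \;=\; 0.10 \;=\; 0.10 \times 1.0$$

A motion-accurate but collision-heavy result (MSE 0.01, FIE 10%) and a low-collision but motion-inaccurate one (MSE 0.10, FIE 1%) receive identical SCE scores, despite looking very different to a viewer.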

Furthermore, while the paper claims to be the "first" for non-homeomorphic skeletons without paired motion, it's a strong claim in a rapidly evolving field. It would be beneficial for future work to rigorously define the boundary conditions of non-homeomorphic generalization and explore active learning or domain adaptation techniques to improve it with less reliance on the sheer diversity of the training set.

The robust performance on CMU and LAFAN1 datasets with masked joints suggests a broader applicability for cleaning noisy motion capture data, beyond just retargeting. This could be an important side benefit or future application area. Overall, MoMa offers a robust and conceptually elegant solution that pushes the boundaries of automated motion retargeting.
