HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Published: 06/19/2026

Analysis

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

1. Bibliographic Information

1.1. Title

The central topic of the paper is a controlled, matched-scale comparison of egocentric first-person human video and teleoperated real-robot trajectory data as pretraining sources for embodied foundation models (EFMs). It demonstrates that properly processed egocentric human video can outperform same-scale real-robot data for EFM pretraining, especially for out-of-distribution generalization.

1.2. Authors

The first authors with equal contribution are Juncheng Ma and Jianxin Bi. The corresponding author is Daquan Zhou. The author team includes researchers from top global institutions: Peking University (PKU), National University of Singapore (NUS), Massachusetts Institute of Technology (MIT), University of California Santa Barbara (UCSB), and NVIDIA. Senior authors include well-known experts in robotics and AI: Wojciech Matusik (MIT, leading robotics researcher), Tat-Seng Chua (NUS, pioneer in multimedia and foundation models), and Enze Xie (NVIDIA, expert in scalable computer vision).

1.3. Journal/Conference

As of the current date (2026-06-20), the paper is published as a preprint on arXiv, the leading open-access preprint server for computer science and artificial intelligence research. It has not yet completed peer review for formal conference or journal publication. Given its novelty and impact, it is expected to be submitted to top-tier venues in robotics and AI such as Robotics: Science and Systems (RSS), International Conference on Robotics and Automation (ICRA), Neural Information Processing Systems (NeurIPS), or International Conference on Machine Learning (ICML).

1.4. Publication Year

2026 (preprint released 2026-06-18 UTC)

1.5. Abstract

The paper's research objective is to resolve the open question of whether egocentric human video is competitive with or superior to teleoperated real-robot data as a pretraining source for EFMs, given the severe data bottleneck limiting EFM scaling. Its core methodology is a strictly controlled study in which only the pretraining data source is varied, with all other variables (model architecture, pretraining compute, post-training data, evaluation protocol) held fixed. With the same 5,000 hours of pretraining data, the model pretrained on filtered, pseudo-labeled egocentric video reaches about 20% lower action validation loss on unseen tasks than the model pretrained on real-robot data (0.0204 vs 0.0254), and 24% lower than a baseline with no embodied pretraining. In real-robot rollouts it achieves 92.5% in-distribution and 90.0% out-of-distribution success, versus 40.0% and 0.0% for the no-pretraining baseline, gains of 52.5 and 90 percentage points. The key conclusion is that a scalable paradigm for EFMs is feasible: pretrain on large, diverse, low-cost egocentric human video to learn general world representations, then adapt to target robot embodiments with a small amount of labeled real-robot data for action alignment.

2. Executive Summary

2.1. Background & Motivation

Core Problem to Solve

Embodied foundation models (EFMs) that can interact with the physical world are expected to follow the same scaling laws as large language models (LLMs), where performance improves predictably with increasing data, model size, and compute. However, EFMs face a far more severe data bottleneck than LLMs: the dominant pretraining data source, teleoperated real-robot trajectories, is extremely expensive to collect, limited in scale, and lacks diversity in environments, objects, and interactions.

Importance of the Problem

The high cost and limited scalability of robot data have prevented the field from building generalist EFMs that can operate reliably in open-world, unstructured environments across diverse tasks. Prior work has explored egocentric human video as a low-cost, high-diversity alternative for EFM pretraining, but no controlled, matched-scale head-to-head comparison with real-robot pretraining exists to validate its relative effectiveness.

Research Gap & Innovative Idea

The core gap is the lack of experimental evidence comparing the two pretraining data sources under fixed, isolated conditions, where all variables except the data source are held constant. The paper's innovative entry point is to design a strictly controlled study to isolate the effect of pretraining data source, directly testing whether the diversity advantage of egocentric human video outweighs the initial embodiment alignment advantage of real-robot data for pretraining.

2.2. Main Contributions / Findings

The paper's primary contributions and key findings are:

  1. Scaling behavior validation: Egocentric video pretraining follows a predictable log-linear scaling law, with validation loss decreasing monotonically as pretraining data volume increases from 100 to 5,000 hours and no sign of saturation at the 5,000-hour mark. This suggests that further scaling of egocentric data is likely to keep improving EFM performance.
  2. Superior performance over robot pretraining: At matched pretraining data scale (5,000 hours), egocentric pretraining outperforms real-robot pretraining on both in-distribution (ID) and out-of-distribution (OOD) action loss, with the largest gain (about 20% lower OOD action loss) on generalization to unseen tasks. In real-robot rollouts, the egocentric-pretrained model reaches 90.0% OOD success versus 0.0% for the no-pretraining baseline. Real-robot pretraining fails to scale for OOD tasks, with OOD loss staying nearly flat as data volume increases.
  3. Paradigm validation: The paper validates a new scalable paradigm for EFM development: pretrain on large-scale, low-cost, diverse egocentric human video to learn general open-world representations, then adapt the model to target robot embodiments with a small volume of aligned real-robot data. This paradigm directly addresses the EFM data bottleneck by leveraging the massive existing supply of human activity video.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

All key technical terms are explained below for beginner readers:

  • Embodied Foundation Model (EFM): A general-purpose AI model capable of perceiving the physical world, understanding natural language instructions, and generating actions to interact with the environment (e.g., controlling a robot). Two main EFM architectures are:
    • Vision-Language-Action (VLA) Model: Takes visual input (camera frames) and text task instructions as input, and directly outputs robot actions.
    • World-Action Model (WAM): Jointly predicts both future visual observations (how the scene will change after taking actions) and future actions, learning an implicit model of physical world dynamics.
  • Pretraining vs Post-training: A two-stage training paradigm widely used for foundation models:
    • Pretraining: The first stage where the model is trained on very large, diverse datasets to learn general, transferable representations of the world, rather than task-specific knowledge.
    • Post-training (Fine-tuning): The second stage where the pretrained model is trained on a smaller, task/embodiment-specific dataset to adapt its general representations to the target robot hardware, camera setup, and task distribution.
  • Egocentric Video: Video recorded from a first-person perspective (e.g., from a camera mounted on a person's glasses or headset), capturing exactly what the person sees while performing daily activities and interactions with objects.
  • Teleoperated Robot Trajectory: Data collected when a human operator remotely controls (teleoperates) a physical robot to complete tasks. The dataset includes synchronized recordings of the robot's camera feed and ground-truth robot action labels (end-effector position, rotation, gripper state) aligned exactly to the robot's embodiment.
  • In-distribution (ID) vs Out-of-distribution (OOD) Evaluation:
    • ID Evaluation: Tests the model on tasks and object categories that were seen during post-training, only with new instances or minor background variations. Measures robustness to small variations within the training distribution.
    • OOD Evaluation: Tests the model on completely new tasks, object categories, or environments never seen during post-training. Measures generalizability to novel, unstructured open-world scenarios, which is the core requirement for generalist EFMs.
  • Normalized Jerk: A metric for motion smoothness, where jerk is defined as the rate of change of acceleration over time. Lower normalized jerk values indicate smoother, more human-like motion.
  • Mixture-of-Transformers (MoT): A neural network architecture that uses separate specialized Transformer modules ("experts") for different input/output modalities (e.g., one expert for video processing, one expert for action processing), with a fusion layer to combine their outputs.
  • Log-linear Scaling Law: A predictable relationship observed across foundation model domains (LLMs, vision models, EFMs) where validation loss decreases as a linear function of the natural logarithm of dataset size, formalized as: $ \mathcal{L} = a - b\ln(D) $ where:
    • $\mathcal{L}$ = downstream validation loss
    • $a$ = intercept term (estimated loss at 1 hour of pretraining data)
    • $b$ = scaling coefficient (magnitude of loss reduction per unit increase in log dataset size)
    • $\ln(D)$ = natural logarithm of pretraining dataset size (in hours)
    • The $R^2$ value measures how well the data fits the scaling law, with values close to 1 indicating a close fit.
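As a rough illustration of how such a law can be fitted, the sketch below estimates $a$ and $b$ by ordinary least squares on $(\ln D, \mathcal{L})$ pairs and reports $R^2$. The function name `fit_log_linear` and the sample points are illustrative assumptions; they are not code or measurements from the paper.

```python
import math

def fit_log_linear(hours, losses):
    """Least-squares fit of loss = a - b * ln(hours); returns (a, b, r2)."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    slope = sxy / sxx            # slope of loss against ln(hours)
    b = -slope                   # the law is written with a minus sign
    a = mean_y - slope * mean_x  # loss at ln(D) = 0, i.e. D = 1 hour
    ss_res = sum((y - (a - b * x)) ** 2 for x, y in zip(xs, losses))
    ss_tot = sum((y - mean_y) ** 2 for y in losses)
    r2 = 1.0 - ss_res / ss_tot
    return a, b, r2

# Synthetic points generated from made-up coefficients a = 0.010, b = 0.0004.
hours = [100, 1000, 5000]
losses = [0.010 - 0.0004 * math.log(h) for h in hours]
a, b, r2 = fit_log_linear(hours, losses)
```

With only a handful of dataset sizes, a high $R^2$ is easy to obtain, so a fit like this is better read as a summary of the observed trend than as a guarantee about larger scales.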

3.2. Previous Works

The paper builds on three core strands of prior research:

  1. Scaling Robot Learning with Real-Robot Data: Prior work including Open X-Embodiment (2023, pooled 1M+ robot trajectories across 22 embodiments), RT-1/RT-2 (Google's robot transformers, 2022-2023), DROID (2025, large-scale in-the-wild robot dataset), and π0 (2024, proprietary robot VLA) demonstrated that scaling real-robot teleoperation data improves in-distribution EFM performance. However, all these efforts faced inherent limitations of high collection cost, limited diversity, and poor OOD generalization.
  2. Egocentric Data for Robot Pretraining: Prior work including R3M (2022, visual representation learning from human video), EgoMimic (2024, co-training human and robot data), EgoScale (2026, 20k hours of egocentric data for dexterous manipulation), and HumanEgo (2025, minutes of human video replace longer robot demonstrations) showed that egocentric video provides useful transferable signals for robot learning. However, none of these works conducted a matched-scale head-to-head comparison with real-robot pretraining under controlled fixed protocols.
  3. World-Action Model Architectures: Prior work including DreamZero (2026, joint video-action diffusion models), LingBot-VA (2026, causal world modeling for robot control), and Fast-WAM (2026, WAM without test-time video generation) showed that joint training of video prediction and action prediction improves EFM representations, even when video generation is not used at inference time.

3.3. Technological Evolution

The paper's work sits at a critical point in the evolution of embodied AI:

  • 2022-2023: The field established that scaling real-robot data improves EFM performance, but also recognized the severe data bottleneck and poor OOD generalization of robot-only pretraining.
  • 2024-2025: Large-scale egocentric video datasets (Ego4D, EgoScale, HumanNet) and mature hand pose retargeting pipelines made it feasible to use human video for EFM pretraining, rather than just representation learning.
  • 2025-2026: WAM architectures matured, with Fast-WAM (2026) demonstrating that joint video-action training improves performance without incurring test-time latency costs.
  • 2026: This paper is the first work to systematically validate that egocentric pretraining can outperform same-scale real-robot pretraining, providing a path to unlock EFM scaling beyond the limits of robot data collection.

3.4. Differentiation Analysis

The core differences and innovations of this paper compared to prior work are:

  1. Strictly controlled experimental design: All variables except pretraining data source (model architecture, compute budget, post-training data, evaluation protocol) are held constant, so performance differences can be definitively attributed to the pretraining data source, with no confounding factors.
  2. Matched-scale comparison: Both pretraining datasets are exactly 5,000 hours, eliminating bias from differing data volumes. This is the first such comparison, as prior work either used egocentric data as a supplement to robot data, or used vastly different dataset sizes for the two sources.
  3. Dual evaluation of both ID and OOD performance, plus real-world robot rollouts: Prior work focused mostly on validation loss or in-distribution performance, while this paper prioritizes OOD generalization, the most critical metric for generalist EFMs, and validates results on physical robot hardware.
  4. Paradigm shift: The paper contradicts the long-held default assumption in the field that real-robot data is always the best pretraining source for robot models, showing that the diversity advantage of egocentric data outweighs the initial embodiment alignment advantage of robot data, which can be corrected with minimal post-training.

4. Methodology

4.1. Principles

The core idea of the paper's method is built on the insight that pretraining and post-training have fundamentally different data requirements:

  • Pretraining prioritizes coverage: broad exposure to diverse scenes, objects, interactions, and motion patterns to learn generalizable world representations.

  • Post-training prioritizes alignment: embodiment-matched observations and actions to adapt the general pretrained representation to a specific robot's kinematics, camera setup, and task distribution.

The paper's argument is that egocentric human video beats real-robot data on the axes that matter for pretraining: larger accessible scale, lower marginal cost, and higher diversity of interactions and environments. Its main weakness, the embodiment gap between human motion and robot motion, can be addressed with a small volume of aligned real-robot data during post-training.

The theoretical foundation is the proven pretraining-post-training paradigm from LLMs and vision models: large-scale pretraining on diverse, unaligned data to learn general representations, followed by small-scale fine-tuning on aligned task-specific data.

4.2. Core Methodology In-depth

The method is broken into four sequential, fully controlled stages:

Step 1: Matched Pretraining Dataset Curation

Two pretraining datasets of exactly 5,000 hours are curated, differing only in source:

  1. Egocentric Pretraining Dataset:
    • Source: Curated from the 800,000-hour egocentric subset of HumanNet, a 1M-hour public human activity video dataset. Clips are filtered to retain only high-quality, interaction-rich content (blurry, non-interaction, and duplicate clips are removed).
    • Pseudo-action labeling pipeline: 3D hand pose is estimated from each video frame using a pre-trained hand pose estimator. The human hand motion is then retargeted to the robot's end-effector coordinate space, generating pseudo-action labels matching the robot's action space: 6D end-effector pose (3D position + 3D rotation) and binary gripper open/close state.
    • Information density: 100 hours of this dataset contains ~45,000 unique interaction trajectories, far more than the 8,000 trajectories in 100 hours of robot data, due to lower idle time and faster human motion.
  2. Real-Robot Pretraining Dataset:
    • Source: Aggregated from leading public real-robot datasets including Open X-Embodiment, AgiBot World, RoboMIND 2.0, and DROID.
    • Labels: Ground-truth end-effector pose and gripper state labels are provided natively by each robot dataset, aligned to the respective robot embodiments.
    • Information density: Limited to scripted lab tasks, with high idle time and slow robot motion, leading to far fewer unique trajectories per hour than egocentric data.
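This analysis does not give the details of the retargeting step, so the following is only a minimal sketch of the general idea, under loudly stated assumptions: two fingertip positions (as a hypothetical hand pose estimator might output) are mapped to an end-effector position and a binary gripper state. The names `retarget_frame` and `position_deltas` and the 3 cm pinch threshold are hypothetical, and rotation is omitted for brevity.

```python
import math

PINCH_CLOSED_M = 0.03  # hypothetical threshold: fingertips under 3 cm apart = closed

def retarget_frame(thumb_tip, index_tip, closed_below=PINCH_CLOSED_M):
    """Map two fingertip positions (metres, camera frame) to a pseudo-action.

    Returns (position, gripper_closed): the fingertip midpoint stands in for
    the end-effector position, and the pinch distance decides the gripper bit.
    """
    position = tuple((t + i) / 2.0 for t, i in zip(thumb_tip, index_tip))
    pinch = math.dist(thumb_tip, index_tip)
    return position, pinch < closed_below

def position_deltas(positions):
    """Per-step end-effector displacement between consecutive frames."""
    return [
        tuple(b - a for a, b in zip(p0, p1))
        for p0, p1 in zip(positions, positions[1:])
    ]
```

A real pipeline would also have to estimate wrist orientation, transform from the camera frame to the robot base frame, and smooth the noisy per-frame estimates; none of that is shown here.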

Step 2: Identical Pretraining Setup for Both Data Sources

The same autoregressive World-Action Model (WAM) with Mixture-of-Transformers (MoT) architecture is used for both pretraining runs, with identical compute budget:

  • Video Expert: Initialized from Wan 2.2, a state-of-the-art open-source video generation model pre-trained on millions of public videos, responsible for processing input video frames and learning visual-temporal representations.
  • Action Expert: Initialized via weight interpolation, responsible for processing action labels and learning motion representations.
  • Joint Pretraining Objective: The model is trained to simultaneously predict two targets, with the loss function: $ \mathcal{L}_{pretrain} = \lambda_v \mathcal{L}_{video} + \lambda_a \mathcal{L}_{action} $ where:
    • $\mathcal{L}_{video}$ = Mean Squared Error (MSE) loss between predicted future video frames and ground-truth future video frames, measuring how well the model learns world dynamics.
    • $\mathcal{L}_{action}$ = MSE loss between predicted future actions and ground-truth (or pseudo) action labels, measuring how well the model learns to map observations to actions.
    • $\lambda_v$ and $\lambda_a$ = weight coefficients, both set to 1.0 so that the two training objectives contribute equally.
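The joint objective is a weighted sum of two MSE terms. A minimal sketch on flat Python lists is shown below; the function names are illustrative, and an actual training run would compute these terms on tensors (this analysis does not say whether the video term is measured in pixel space or in a latent space).

```python
def mse(pred, target):
    """Mean squared error over two equal-length flat sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_pretrain_loss(video_pred, video_target, action_pred, action_target,
                        lambda_v=1.0, lambda_a=1.0):
    """Weighted sum of the video and action MSE terms."""
    loss_video = mse(video_pred, video_target)
    loss_action = mse(action_pred, action_target)
    return lambda_v * loss_video + lambda_a * loss_action
```

With both weights at 1.0, the term with the larger numeric scale dominates the gradient, which is one reason the weights are exposed as parameters here.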

Step 3: Identical Post-training Setup

Both pretrained models (and baseline models) undergo post-training with exactly the same data, compute budget, and hyperparameters:

  • Post-training Dataset: Curated from AgiBot World, a public real-robot dataset, with 15 manipulation tasks (e.g., place cup on coaster, sort fruits, stamp) and 100 expert demonstrations per task, totaling 1,500 trajectories. This dataset is fully disjoint from both pretraining datasets, with no overlapping tasks or objects.
  • Post-training Objective: The same joint video-action loss as pretraining is used, to adapt the general pretrained representation to the target AgiBot bimanual robot's embodiment, camera configuration, and task distribution.

Step 4: Fixed Evaluation Protocol

All models are evaluated on two disjoint offline splits with no data leakage, and the top-performing model is additionally tested on physical hardware:

  1. Seen (ID) Split: Held-out trajectories from the 15 post-training tasks, with the same task semantics but unseen object instances and background variations, measuring in-distribution robustness.
  2. Unseen (OOD) Split: 25 completely new tasks never seen during post-training, measuring out-of-distribution generalization ability, the primary evaluation target.
  3. Real-Robot Rollouts: The top-performing model is tested on a physical AgiBot bimanual robot on 3 tasks, each with ID (seen objects) and OOD (unseen objects) settings, to validate that the offline loss results transfer to real-world hardware.
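A split like this is only meaningful if no task appears on both sides. A simple guard along the following lines could enforce that; the helper name `check_no_task_leakage` is hypothetical, and matching tasks by name is an assumption (a real check might also compare object sets).

```python
def check_no_task_leakage(post_training_tasks, ood_tasks):
    """Raise if any OOD task name also appears in the post-training set."""
    overlap = set(post_training_tasks) & set(ood_tasks)
    if overlap:
        raise ValueError(f"tasks leak into the OOD split: {sorted(overlap)}")
    return True

# Example task names taken from this analysis.
seen = ["place cup on coaster", "sort fruits", "stamp"]
unseen = ["pour water", "stack blocks", "open drawer"]
check_no_task_leakage(seen, unseen)
```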

5. Experimental Setup

5.1. Datasets

All datasets used in the experiments are detailed below:

Pretraining Datasets

| Dataset Type | Scale | Source | Key Characteristics |
| --- | --- | --- | --- |
| Egocentric | 5,000 hours | HumanNet egocentric subset | Drawn from 800k hours of first-person daily activity video; diverse scenes (homes, kitchens, workshops, outdoor); thousands of object and interaction types; pseudo-action labels from hand retargeting; 45k trajectories per 100 hours |
| Real-Robot | 5,000 hours | Aggregated public robot datasets (Open X-Embodiment, AgiBot World, RoboMIND 2.0, DROID) | Multi-embodiment robot trajectories; ground-truth action labels; limited to lab/workstation settings; scripted tasks; 8k trajectories per 100 hours |
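The per-100-hour trajectory counts reported for the two datasets imply very different amounts of footage per trajectory. The short calculation below makes that explicit; the helper name is illustrative, and the result is average footage per trajectory (idle time included), not a measured trajectory duration.

```python
SECONDS_PER_HOUR = 3600

def mean_seconds_per_trajectory(hours, trajectories):
    """Average seconds of footage per trajectory, idle time included."""
    return hours * SECONDS_PER_HOUR / trajectories

ego_seconds = mean_seconds_per_trajectory(100, 45_000)   # 8.0 s per trajectory
robot_seconds = mean_seconds_per_trajectory(100, 8_000)  # 45.0 s per trajectory
density_ratio = 45_000 / 8_000                           # 5.625x trajectories per hour
```

At matched hours, the egocentric set therefore exposes the model to roughly 5.6 times as many trajectories, so "matched scale" here means matched duration rather than matched trajectory count.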

Post-training Dataset

  • Source: AgiBot World subset
  • Scale: 1,500 trajectories, 15 manipulation tasks, 100 demonstrations per task
  • Characteristics: For AgiBot bimanual robot, diverse object instances and backgrounds, fully disjoint from both pretraining datasets

Evaluation Datasets

  1. Seen (ID) Split: Held-out trajectories from the 15 post-training tasks, unseen object instances
  2. Unseen (OOD) Split: 25 held-out tasks not included in post-training (e.g., pour water, stack blocks, open drawer)
  3. Real-Robot Rollout Dataset: 3 tasks (place cup, sort fruits, stamp) with 2 settings each:
    • ID setting: Objects match those seen during post-training
    • OOD setting: Completely unseen object shapes, colors, and sizes

These datasets are chosen to eliminate data leakage, ensuring that performance differences reflect genuine generalization ability rather than memorization of overlapping tasks or objects.

5.2. Evaluation Metrics

Three core metrics are used, with full explanations below:

1. L2 Action Validation Loss

Conceptual Definition

Measures the average squared difference between the model's predicted actions and ground-truth expert actions on held-out data. Lower loss indicates more accurate action prediction.

Mathematical Formula

$ \mathcal{L}_{action} = \frac{1}{N \times T \times D} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{d=1}^{D} ( \hat{a}_{i,t,d} - a_{i,t,d} )^2 $

Symbol Explanation

  • $N$ = Total number of evaluation trajectories
  • $T$ = Number of timesteps per trajectory
  • $D$ = Dimension of the robot action space (7 for this study: 3 for XYZ position, 3 for rotation, 1 for gripper state)
  • $\hat{a}_{i,t,d}$ = Predicted action value for trajectory $i$, timestep $t$, action dimension $d$
  • $a_{i,t,d}$ = Ground-truth expert action value for trajectory $i$, timestep $t$, action dimension $d$
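The loss above is a plain average of squared errors over trajectories, timesteps, and action dimensions. A direct, unoptimized sketch on nested lists follows; the function name and the `[trajectory][timestep][dim]` layout are assumptions made for illustration.

```python
def action_l2_loss(pred, target):
    """Mean squared error over trajectories (N), timesteps (T), action dims (D).

    `pred` and `target` are nested lists indexed as [trajectory][timestep][dim];
    every trajectory is assumed to have the same T and D.
    """
    total = 0.0
    count = 0
    for traj_pred, traj_true in zip(pred, target):
        for step_pred, step_true in zip(traj_pred, traj_true):
            for a_hat, a in zip(step_pred, step_true):
                total += (a_hat - a) ** 2
                count += 1
    return total / count
```

Because position, rotation, and gripper dimensions share one average, the units and normalization of each dimension affect how much it contributes to the reported loss.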

2. In-Distribution (ID) Real-Robot Success Rate

Conceptual Definition

Percentage of task attempts the robot completes successfully using the model's policy, on tasks/objects seen during post-training. Higher values indicate better in-distribution robustness.

Mathematical Formula

$ SR_{ID} = \frac{\text{Number of successful ID task attempts}}{\text{Total number of ID task attempts}} \times 100\% $

Symbol Explanation

A successful attempt is defined as completing all required steps of the task correctly within the allocated time limit (e.g., placing the cup fully on the coaster without tipping, sorting all fruits into the correct bins).

3. Out-of-Distribution (OOD) Real-Robot Success Rate

Conceptual Definition

Percentage of task attempts the robot completes successfully on tasks/objects never seen during post-training. Higher values indicate better open-world generalization, the core requirement for generalist EFMs.

Mathematical Formula

$ SR_{OOD} = \frac{\text{Number of successful OOD task attempts}}{\text{Total number of OOD task attempts}} \times 100\% $

Symbol Explanation

Uses the same success criteria as the ID success rate, but all tasks/objects are completely novel and never seen during post-training.
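Both success rates are simple ratios over logged attempts. A minimal sketch is given below; the attempt log is hypothetical, since the number of rollout trials is not stated in this analysis.

```python
def success_rate(outcomes):
    """Percentage of successful attempts; `outcomes` is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical log: 10 attempts, 7 successes (numbers chosen for illustration only).
example_outcomes = [True] * 7 + [False] * 3
example_rate = success_rate(example_outcomes)  # 70.0
```

The same function serves for $SR_{ID}$ and $SR_{OOD}$; only the set of attempts passed in differs.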

5.3. Baselines

Two representative baselines are used for comparison, with identical post-training setup as the experimental models:

  1. Wan2.2 (No Pretraining): The vanilla Wan 2.2 video generation model, with no embodied pretraining of any kind, directly post-trained on the AgiBot dataset. This baseline establishes the lower bound of performance without any embodied pretraining signal.
  2. LingBot-VA: A state-of-the-art embodied baseline that fine-tunes the Wan 2.2 backbone on 20,000 hours of real-robot data, 4x the size of the 5,000-hour pretraining datasets used in the study. It sets a high bar, testing whether 5,000 hours of egocentric pretraining can outperform a much larger volume of robot pretraining.

These baselines are representative because they cover both the lower bound of no embodied pretraining and a strong state-of-the-art robot-only pretraining reference, ensuring the experimental results are meaningful.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results are grouped into three key findings:

1. Egocentric Pretraining Scales Consistently

As egocentric pretraining data volume increases from 100 to 5,000 hours, post-training action loss decreases monotonically for both evaluation splits:

  • Seen (ID) tasks: Loss drops from 0.0080 to 0.0067, a 16% reduction, and 35% lower than the no-pretraining baseline.
  • Unseen (OOD) tasks: Loss drops from 0.0234 to 0.0204, a 13% reduction, and 24% lower than the no-pretraining baseline.

The scaling follows a log-linear law with $R^2 = 0.86$ for seen tasks and $R^2 = 0.94$ for unseen tasks and shows no saturation at 5,000 hours, which suggests that further increasing egocentric pretraining data volume will continue to improve performance. The following figure (Figure 3 from the original paper) shows the scaling curves:

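img-9.jpeg Chart of action loss ($L_2$) on seen tasks as a function of pretraining data volume. The cyan curve is egocentric pretraining and the orange curve is robot pretraining. Egocentric action loss falls markedly as pretraining hours increase, and at 5,000 hours it is lower than the robot-pretraining loss.

img-10.jpeg Chart of performance on unseen tasks for the different pretraining data sources. The blue curve is egocentric pretraining and the orange curve is robot pretraining. At 5,000 hours of pretraining data, egocentric pretraining achieves a 24% improvement, supporting the use of human video in the pretraining stage.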


2. Egocentric Pretraining Outperforms Matched-Scale Robot Pretraining

At the same 5,000-hour pretraining scale:

  • Seen (ID) tasks: Egocentric pretraining achieves a loss of 0.0067, compared to 0.0071 for robot pretraining, with slightly better but comparable performance.
  • Unseen (OOD) tasks: Egocentric pretraining achieves a loss of 0.0204, compared to 0.0254 for robot pretraining, a ~20% reduction in loss. Critically, robot pretraining shows no scaling on OOD tasks: its loss remains nearly flat from 100 to 5,000 hours, meaning adding more robot data does not improve OOD generalization.

This performance gap is explained by two factors:

  1. Higher diversity and information density: Egocentric data covers far more scenes, objects, and interaction types than robot data, which is limited to scripted lab tasks. 1 hour of egocentric data also contains far more unique interaction trajectories than 1 hour of robot data, due to lower idle time and faster human motion.
  2. Limited transfer of robot data: Real-robot data collected in constrained lab settings does not transfer to genuinely unseen tasks, as it lacks the open-world diversity needed for OOD generalization.
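The percentage reductions quoted in findings 1 and 2 can be reproduced from the reported loss values with a one-line formula. The helper name below is illustrative; the loss values are the ones given in this analysis.

```python
def relative_reduction(before, after):
    """Decrease from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (before - after) / before

seen_scaling = relative_reduction(0.0080, 0.0067)     # 100 h -> 5,000 h, seen tasks
unseen_scaling = relative_reduction(0.0234, 0.0204)   # 100 h -> 5,000 h, unseen tasks
seen_vs_robot = relative_reduction(0.0071, 0.0067)    # robot -> egocentric, seen tasks
unseen_vs_robot = relative_reduction(0.0254, 0.0204)  # robot -> egocentric, unseen tasks
```

These come out at roughly 16%, 13%, 6%, and 20% respectively, which matches the rounded figures in the text and shows how much smaller the seen-task gap is than the unseen-task gap.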

3. Real-World Robot Rollouts Confirm OOD Generalization Advantage

On physical AgiBot robot rollouts:

  • Egocentric pretrained model: 92.5% ID success rate, 90.0% OOD success rate, a drop of only 2.5 percentage points from ID to OOD.
  • No-pretraining baseline: 40.0% ID success rate, 0.0% OOD success rate, a complete collapse on OOD tasks.

This indicates that the open-world prior learned from egocentric human video transfers to physical robot hardware and makes the model robust to distribution shift, while the model without this prior overfits to the narrow post-training data distribution. The following figure (Figure 5 from the original paper) shows example real-robot rollouts:

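img-13.jpeg Illustration of the robot performing the "pick up fruit" task in two settings. The top half shows an in-distribution example and the bottom half an out-of-distribution one. The comparison shows the pretrained model handling the distribution shift well, while the baseline model fails.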

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper, comparing the scale and coverage of public egocentric and real-robot datasets:

| Corpus | Hours | Acquisition | Coverage |
| --- | --- | --- | --- |
| Egocentric human video | | | |
| EgoDex [14] | 829 | Consumer headset, native hand pose | 194 tabletop manipulation tasks |
| Ego4D [12] | 3,670 | 931 camera wearers, daily life | 74 locations, 9 countries |
| Sekai [21] | 5,000+ | Web harvest (egocentric POV) | Walking/exploration, global |
| Xperience-10M [31] | 10,000 | Wearable capture, 10M interactions | Open-world daily experience |
| EgoScale [39]† | 20,854 | Aggregation + hand retargeting | Dexterous manipulation |
| Egocentric-100K [6] | 100,405 | 14,228 workers, head-mounted glasses | Industrial / factory operations |
| Teleoperated real-robot | | | |
| DROID [18] | 350 | 50 collectors, 12 months | 564 scenes, 84 tasks |
| Galaxea Open-World [16] | 500 | Single embodiment, in-the-wild | Homes, kitchens, retail, offices |
| MolmoAct2 BimanualYAM [11] | 720+ | Bimanual YAM arms | Largest open bimanual release |
| RoboMIND 2.0 [15] | 1,000+ | Bimanual mobile teleop, 310K+ trajs | Bimanual coordination tasks |
| Open X-Embodiment [29]† | ~2,000–3,000 | Pooling of 60 datasets, 1M+ trajs | 22 embodiments, 527 skills |
| AgiBot World [2] | 2,976 | 100-robot fleet, 1M+ trajs | 217 tasks, 106 scenes, 5 domains |
| RoVid-X [10]†† | 10,000+ | Open-source aggregation, 4M robot videos | 1,300+ fine-grained robot skills |
| Being H-0.5 [25]†† | ~35,000 | OXE + AgiBot + RoboMIND + RoboCOIN + ... | 30 embodiments, incl. sim |
| Reference points | | | |
| π0 corpus [4] | >10,000 | Internal fleet, 7 platforms | Proprietary, inaccessible |
| HumanNet [9] | 1,000,000 | Web curation, egocentric + third-person | Open-world, long-tail interaction |

Note: † marks aggregated datasets, †† marks datasets including simulated trajectories.

The following are the results from Table 2 of the original paper, showing real-robot rollout success rates:

| Pretraining | In-distribution | Out-of-distribution |
| --- | --- | --- |
| Wan2.2 (baseline) | 40.0% | 0.0% |
| Egocentric (ours) | 92.5% | 90.0% |
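Differences between success rates are most safely read in percentage points. The snippet below derives them from the Table 2 values; the variable and helper names are illustrative.

```python
# Success rates (%) from Table 2: (in-distribution, out-of-distribution).
rollouts = {
    "Wan2.2 (baseline)": (40.0, 0.0),
    "Egocentric (ours)": (92.5, 90.0),
}

def point_gap(a, b):
    """Difference between two percentages, in percentage points."""
    return a - b

ego_id, ego_ood = rollouts["Egocentric (ours)"]
base_id, base_ood = rollouts["Wan2.2 (baseline)"]

id_gain = point_gap(ego_id, base_id)      # 52.5 points, egocentric vs baseline (ID)
ood_gain = point_gap(ego_ood, base_ood)   # 90.0 points, egocentric vs baseline (OOD)
ego_drop = point_gap(ego_id, ego_ood)     # 2.5 points lost from ID to OOD
base_drop = point_gap(base_id, base_ood)  # 40.0 points lost from ID to OOD
```

A relative improvement on OOD cannot be stated, because the baseline's OOD success rate is 0.0%, which is why the gains are quoted in points rather than as ratios.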

6.3. Ablation Studies / Parameter Analysis

The paper conducted a scaling ablation study, testing 100, 1,000, and 5,000 hours of pretraining data for both egocentric and robot sources. The key results are:

  • For egocentric pretraining: Every increase in data volume leads to lower loss on both ID and OOD tasks, with no saturation observed at 5,000 hours, indicating that further scaling will yield additional performance gains.

  • For robot pretraining: Increasing data volume reduces loss only on ID tasks, with no measurable improvement on OOD tasks, confirming that robot data only improves performance on tasks similar to those in the pretraining set, and does not support open-world generalization.

The paper also conducted a diversity analysis comparing the two pretraining data sources, as shown in Figure 2 from the original paper:

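img-3.jpeg Comparison chart of motion smoothness for human video and real-robot data, using log-normalized jerk, where lower values mean smoother motion. The human video data is clearly smoother than the real-robot data.

img-4.jpeg Box plot of action idle time, comparing the idle fraction of human video and real-robot data. The human video has a clearly lower idle fraction than the real-robot data, whose idle fraction also varies more widely.

img-5.jpeg Workspace coverage of human video (blue) and real-robot data (orange), drawn as contour plots of the data distribution. The human video covers a wider and more varied region.

img-6.jpeg Chart of mean position spread (cm) over normalized time for human video and real-robot data. The human video shows larger position spread at multiple points in time, indicating stronger inter-session variability and supporting its diversity advantage.

img-7.jpeg Bar chart of the number of unique verb-object pairs in human video versus real-robot data: 2,744 for human video and only 107 for real-robot data, a marked advantage in interaction vocabulary.

img-8.jpeg Bar chart of visual scene coverage: human video contains 361 unique scene terms versus only 156 for real-robot data, indicating broader visual scene diversity.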

These results confirm that egocentric data has:

  1. Lower normalized jerk (smoother motion)
  2. Lower action idle time
  3. Wider workspace coverage
  4. Higher inter-session position spread (more varied motion)
  5. 25x more unique verb-object interaction pairs
  6. 2.3x more unique scene terms

The paper links these properties to the superior OOD generalization performance of egocentric pretraining.
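Two of these statistics, idle time and normalized jerk, can be estimated directly from a sequence of end-effector positions. The sketch below uses finite differences; the exact definitions used in the paper are not given in this analysis, so the speed threshold and the normalization (integrated squared jerk times duration^5 divided by path length^2, one common dimensionless convention) are assumptions.

```python
import math

def finite_diff(samples, dt):
    """First-order finite difference of a list of 3-D vectors."""
    return [
        tuple((b - a) / dt for a, b in zip(s0, s1))
        for s0, s1 in zip(samples, samples[1:])
    ]

def idle_fraction(positions, dt, speed_eps=0.005):
    """Share of steps whose speed is below `speed_eps` (m/s, assumed threshold)."""
    velocities = finite_diff(positions, dt)
    idle = sum(1 for v in velocities if math.hypot(*v) < speed_eps)
    return idle / len(velocities)

def log_dimensionless_jerk(positions, dt):
    """ln(integral(|jerk|^2 dt) * duration^5 / path_length^2); lower is smoother."""
    velocities = finite_diff(positions, dt)
    accelerations = finite_diff(velocities, dt)
    jerks = finite_diff(accelerations, dt)
    squared_jerk = sum(sum(c * c for c in j) for j in jerks) * dt
    duration = dt * (len(positions) - 1)
    path_length = sum(math.hypot(*v) for v in velocities) * dt
    return math.log(squared_jerk * duration ** 5 / path_length ** 2)
```

Both functions expect at least four position samples at a fixed time step, and the jerk measure is undefined for motion with exactly zero jerk or zero path length. Finite differences amplify noise, so pseudo-labels from hand pose estimation would normally be smoothed before a comparison like this.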

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper presents the first strictly controlled, matched-scale comparison of egocentric human video and teleoperated real-robot data as pretraining sources for embodied foundation models. Its core conclusions are:

  1. Egocentric human video is not just a viable substitute for real-robot pretraining data, but outperforms same-scale robot data, especially for OOD generalization, the most critical requirement for generalist EFMs.
  2. Egocentric pretraining follows a predictable log-linear scaling law with no saturation at 5,000 hours, meaning further scaling of egocentric data will continue to improve performance.
  3. The long-standing EFM data bottleneck can be solved by a new scalable paradigm: pretrain on large, low-cost, diverse egocentric human video to learn general open-world representations, then adapt to target robot embodiments with a small volume of aligned real-robot data.

This work represents a paradigm shift in embodied AI research, moving away from the default focus on collecting expensive robot data to curating abundant, cheap human video for pretraining, unlocking the potential to scale EFMs in the same way LLMs are scaled using web data.

7.2. Limitations & Future Work

The authors explicitly note the following limitations and future research directions:

  1. Limited pretraining scale: The current study uses only 5,000 hours of egocentric data, limited by the maximum available scale of public real-robot datasets for comparison. Future work will scale egocentric pretraining to 100,000+ hours to test the limits of scaling.
  2. Single architecture tested: All experiments are conducted on World-Action Models (WAMs) with the Wan 2.2 backbone. Future work will validate the paradigm on Vision-Language-Action (VLA) models, the other dominant EFM architecture, to confirm the advantage holds across different model types.
  3. Single robot platform tested: Evaluations are limited to the AgiBot bimanual manipulation robot. Future work will test the paradigm on more diverse embodiments including mobile robots, humanoids, and dexterous hands, to confirm generalization across robot types.
  4. Pseudo-label quality: The current pseudo-action labeling pipeline uses hand pose retargeting, which may introduce noise. Future work will improve retargeting accuracy and test the effect of pseudo-label quality on pretraining performance.

7.3. Personal Insights & Critique

Key Inspirations

This paper upends a core implicit assumption in the embodied AI field: that robot data is always the best pretraining source for robot models. It validates that the same scaling paradigm that revolutionized NLP (large-scale unaligned pretraining + small-scale aligned fine-tuning) applies equally to embodied AI, using human video as the equivalent of web text for LLMs. This could drastically reduce the cost of training generalist EFMs, making state-of-the-art embodied AI accessible to small research teams without access to large robot fleets. The paradigm also has broad applicability beyond manipulation robots: it can be extended to autonomous vehicles (using dashcam video from human drivers as pretraining data), industrial robots (using egocentric video from factory workers), and humanoids (using full-body human motion video).

Potential Limitations and Improvements

  1. Pseudo-label noise ablation: The paper does not quantify the effect of pseudo-action label noise on performance. It would be valuable to conduct an ablation study varying the quality of pseudo-labels to identify the minimum label quality threshold where egocentric pretraining remains superior to robot pretraining.
  2. Limited real-robot task diversity: The real-robot evaluation is limited to 3 simple manipulation tasks. Testing on more diverse tasks (long-horizon tasks, mobile manipulation, contact-rich tasks like cooking) would better validate the generalizability of the results.
  3. Dataset bias: Most public egocentric video datasets are collected from first-world populations performing common daily tasks, which may introduce geographic and cultural bias, limiting generalization to tasks common in other regions or rare tasks. More diverse egocentric data collection is needed to address this.
  4. Comparison with simulated data: The paper does not compare egocentric pretraining to large-scale simulated robot pretraining, another low-cost scalable data source. Combining egocentric human video and simulated robot data may yield even better performance than either source alone.

Overall, this is a landmark work that has the potential to reshape the entire direction of embodied AI research, unlocking the long-awaited scaling of generalist robot models.