
Real-Time Execution of Action Chunking Flow Policies

Published: 06/09/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces a novel algorithm called real-time chunking (RTC) to address inference latency issues in real-time control of vision-language-action models, showing improved task throughput and high success rates in dynamic and real-world tasks.

Abstract

Modern AI systems, especially those interacting with the physical world, increasingly require real-time performance. However, the high latency of state-of-the-art generalist models, including recent vision-language action models (VLAs), poses a significant challenge. While action chunking has enabled temporal consistency in high-frequency control tasks, it does not fully address the latency problem, leading to pauses or out-of-distribution jerky movements at chunk boundaries. This paper presents a novel inference-time algorithm that enables smooth asynchronous execution of action chunking policies. Our method, real-time chunking (RTC), is applicable to any diffusion- or flow-based VLA out of the box with no re-training. It generates the next action chunk while executing the current one, "freezing" actions guaranteed to execute and "inpainting" the rest. To test RTC, we introduce a new benchmark of 12 highly dynamic tasks in the Kinetix simulator, as well as evaluate 6 challenging real-world bimanual manipulation tasks. Results demonstrate that RTC is fast, performant, and uniquely robust to inference delay, significantly improving task throughput and enabling high success rates in precise tasks – such as lighting a match – even in the presence of significant latency. See https://pi.website/research/real_time_chunking for videos.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Real-Time Execution of Action Chunking Flow Policies

The central topic is achieving true real-time control for robots that use high-latency, large, learned policies by introducing an inference-time algorithm called real-time chunking (RTC) that enables smooth, asynchronous execution of action-chunking policies (especially diffusion- or flow-based vision-language-action models, VLAs).

1.2. Authors

Kevin Black (Physical Intelligence; UC Berkeley), Manuel Y. Galliker (Physical Intelligence), Sergey Levine (Physical Intelligence; UC Berkeley).

Background and affiliations:

  • Physical Intelligence is an industrial research group focused on general-purpose robot learning and deployment.
  • UC Berkeley is a leading academic institution in machine learning and robotics.
  • Sergey Levine is widely known for foundational work in reinforcement learning, imitation learning, and robot learning.

1.3. Journal/Conference

arXiv preprint. The paper is currently posted on arXiv (a preprint server), which does not imply peer review but is a common venue for rapid dissemination in machine learning and robotics.

1.4. Publication Year

2025 (Published at UTC: 2025-06-09T01:01:59.000Z)

1.5. Abstract

The paper tackles the mismatch between the need for real-time control in embodied systems and the high inference latency of modern large models (including VLAs). Action chunking (having the model output a sequence of actions per inference) improves temporal consistency but causes pauses or jerkiness at chunk boundaries when latency is non-trivial. The authors propose RTC, a training-free inference-time algorithm that asynchronously generates the next action chunk while executing the current one. RTC “freezes” actions that are guaranteed to execute (due to delay) and “inpaints” the remainder using guidance in a flow/diffusion denoiser, with a soft mask that enforces cross-chunk continuity. They introduce a 12-task dynamic Kinetix simulator benchmark and test 6 real bimanual manipulation tasks. RTC improves speed, robustness to delay, and success rates in precise tasks (e.g., lighting a match), even under significant latency.

2. Executive Summary

2.1. Background & Motivation

  • Core problem: Real-time control with high-latency, large visuomotor policies. Robots must act continuously while the world evolves; pauses or abrupt changes degrade performance and cause distribution shift relative to training.
  • Why important: Large VLAs (billions of parameters) power generalist behaviors but are slow; remote inference adds network delay; traditional synchronous chunking leads to pauses and discontinuity between chunks (jerkiness, mode switching).
  • Gap in prior research:
    • Action chunking improves temporal consistency but not delay tolerance; discontinuities at chunk boundaries persist under latency.
    • Naive asynchronous execution introduces abrupt jumps when switching to a new chunk; temporal ensembling averages actions but can produce invalid actions in multi-modal distributions.
    • Bidirectional Decoding (BID) supports closed-loop control but at high computational cost and underperforms under delay.
  • Entry point/innovation: Treat asynchronous chunk generation as an inpainting problem within iterative denoising (flow/diffusion): freeze guaranteed-to-execute actions, then guide denoising to inpaint the rest. Use soft masking (weights decay from 1 to 0 across overlap) and guidance weight clipping to maintain continuity and stability.

2.2. Main Contributions / Findings

  • Method: Real-time chunking (RTC), an inference-time algorithm for flow/diffusion VLAs that:
    1. Runs inference concurrently with execution;
    2. Freezes actions that will have already been executed before the new chunk is ready;
    3. Inpaints remaining actions via guidance in the denoising steps with a soft mask across all overlapping timesteps.
  • Applicability: Works out-of-the-box for diffusion- or flow-based policies; no retraining required. Diffusion policies can be treated as flows at inference via known equivalences.
  • Benchmark: A new simulator benchmark of 12 highly dynamic tasks in Kinetix to stress delay robustness and closed-loop corrections; also 6 challenging real-world bimanual manipulation tasks.
  • Results:
    • In simulation: RTC is uniquely robust to inference delay, outperforming naive async, temporal ensembling (TE), and BID, with the advantage growing as delay increases. Soft masking further improves continuity and performance over hard masking.
    • In real robots: RTC improves average task throughput (speed × performance), remains robust even under +100ms and +200ms injected latency, where TE variants fail (trigger protective stops), and outperforms synchronous inference both in speed and in earlier task progress completion (fewer retries), with notable gains in precise tasks like lighting a candle with a match.
  • Significance: RTC addresses the real-time requirement for large policies with unavoidable latency, enabling smoother and faster executions without retraining and maintaining closed-loop responsiveness.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Vision-Language-Action models (VLAs): Large models that take visual inputs (images/video), natural language instructions, and produce action outputs (e.g., robot joint positions or torques). They often use transformer backbones and are trained on diverse robot datasets and web-scale data for generalization.
  • Action chunking: A policy outputs a sequence (chunk) of future actions per inference call. If the chunk length is $H$, the controller executes the first $s$ actions (execution horizon), then fetches a new chunk. Chunking improves temporal smoothness but reduces reactivity; short horizons increase responsiveness but risk discontinuities across chunks.
  • Flow matching (conditional flow matching): A generative modeling technique where a learned velocity field maps noise to data by integrating over a continuous "time" variable $\tau \in [0, 1]$. For policies, it generates action chunks by iterative denoising steps using the velocity field conditioned on the observation.
  • Diffusion generative models: Alternative iterative denoising frameworks; diffusion and flow matching are closely related and can be interconverted at inference with appropriate transformations.
  • Inpainting (in generative models): Generating missing or corrupted parts of data (e.g., masked regions of an image) consistent with observed parts using pre-trained generative models, often via guidance added to the denoiser.
  • Guidance via vector-Jacobian product: Adding a gradient-based correction term to the denoiser velocity/score to steer generation toward matching target constraints. Implemented efficiently via automatic differentiation as a vector-Jacobian product.
  • Real-time control constraints:
    • Controller sampling period $\Delta t$: The duration between control steps (e.g., 20 ms for 50 Hz).
    • Inference latency $\delta$: Time for the model to output a new chunk.
    • Inference delay $d = \lfloor \delta / \Delta t \rfloor$: Number of control steps that elapse before the new chunk becomes available (see the timing sketch after this list).
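
To make these quantities concrete, here is a small sketch with illustrative numbers (a hypothetical 50 Hz controller and 120 ms latency, not the paper's measurements):

```python
import math

# Illustrative numbers (not the paper's measurements): a 50 Hz controller
# and a hypothetical 120 ms end-to-end inference latency.
control_hz = 50
dt = 1.0 / control_hz                 # controller sampling period, seconds
delta = 0.120                         # inference latency, seconds
d = math.floor(delta / dt)            # inference delay in controller steps
print(f"dt = {dt * 1000:.0f} ms, latency = {delta * 1000:.0f} ms, d = {d} steps")

# With execution horizon s, a synchronous controller executes s actions and
# then pauses roughly delta seconds while waiting for the next chunk.
s = 25
pause_fraction = delta / (s * dt + delta)
print(f"synchronous execution spends {pause_fraction:.1%} of wall-clock time paused")
```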

3.2. Previous Works

  • Action chunking for imitation learning: Diffusion policies and other expressive models (variational, flow, VQ, tokenization) have delivered strong performance in manipulation with chunked execution, but commonly rely on synchronous inference.
  • Temporal ensembling (TE): Averaging actions from multiple predicted chunks to smooth outputs; problematic in multi-modal distributions, where the average of valid actions can itself be an invalid action, and it fails under high delays.
  • Bidirectional Decoding (BID): Closed-loop resampling using rejection sampling to ensure continuity, but requires multiple samples per step (high compute) and underperforms RTC under delays in simulation.
  • Speed-up strategies: Consistency models/policies, streaming diffusion policies, parallel decoding for VLAs, quantization—these reduce latency but cannot make inference faster than at least one forward pass and do not address overlapping execution/inference continuity.
  • Classical MPC and concurrent control: Plan over a receding horizon and warm-start next plans; effective in narrow domains with explicit dynamics models and cost functions, but not directly applicable to general learned VLA policies without hand-crafted models.
  • Diffuser planning with inpainting: Uses diffusion-based inpainting for planning constraints but not guidance-based denoising for real-time control; comparison suggests RTC’s guidance approach better preserves continuity in this setting.

3.3. Technological Evolution

  • From single-step policies to action chunking for smoothness.
  • From small policies to large VLAs, increasing capability but also latency (due to big backbones and iterative denoising).
  • From synchronous chunking to asynchronous strategies under delay—but naive async causes discontinuities; averaging fails in multi-modal action spaces.
  • RTC extends generative inpainting/guidance techniques to sequential action generation, solving cross-chunk continuity under latency without retraining.

3.4. Differentiation Analysis

  • RTC vs. synchronous chunking: RTC eliminates pauses and reduces chunk-boundary discontinuities by freezing and guided inpainting; maintains reactivity through soft-mask-driven incorporation of new observations across overlapping timesteps.
  • RTC vs. naive async: Ensures continuity by freezing and guiding the overlap rather than jumping to a new mode.
  • RTC vs. TE: Avoids invalid action averaging; keeps actions on-manifold by leveraging the model’s denoiser with guidance.
  • RTC vs. BID: Achieves better performance-delay trade-off with lower compute than BID’s batch-based rejection sampling, particularly under high delays.

4. Methodology

4.1. Principles

Core idea: Treat asynchronous action chunk generation as an inpainting problem inside the iterative denoising process of flow/diffusion policies. When the new chunk arrives, some of its earliest timesteps are already past; those actions must be consistent with what was actually executed. RTC freezes actions that are guaranteed to be executed from the previous chunk and guides the denoiser to produce the remaining actions consistent with that frozen prefix and current observations. Soft masking assigns decaying weights across overlapping timesteps to improve cross-chunk continuity.

4.2. Policy and Chunking Setup

  • Policy: $\pi(\mathbf{A}_t \mid \mathbf{o}_t)$ outputs an action chunk $\mathbf{A}_t = [\mathbf{a}_t, \mathbf{a}_{t+1}, \dots, \mathbf{a}_{t+H-1}]$ of length $H$ given observation $\mathbf{o}_t$ at controller timestep $t$.
  • Execution horizon: At each inference call, only the first $s \leq H$ actions are executed. Larger $s$ increases temporal consistency but reduces reactivity.
  • Flow policy generation: Initialize with Gaussian noise $\mathbf{A}_t^0 \sim \mathcal{N}(0, I)$, then integrate the learned velocity field $\mathbf{v}_{\pi}$ over $n$ denoising steps from $\tau = 0$ to $1$ (a minimal sampling sketch follows the symbol list below): $ \mathbf{A}_{t}^{\tau + \frac{1}{n}} = \mathbf{A}_{t}^{\tau} + \frac{1}{n}\, \mathbf{v}_{\pi}(\mathbf{A}_{t}^{\tau}, \mathbf{o}_{t}, \tau). $ Symbols:
  • $\mathbf{A}_{t}^{\tau}$: Current partially denoised chunk at flow time $\tau$.
  • $n$: Number of denoising steps.
  • $\mathbf{v}_{\pi}$: Learned velocity field (neural network) conditioned on the observation and $\tau$.
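
The loop above is just Euler integration; the following minimal sketch runs it with a stand-in velocity field (the `velocity_field` callable and the toy example are illustrative assumptions, not the paper's model):

```python
import numpy as np

def sample_chunk(velocity_field, obs, horizon, action_dim, n_steps=5, seed=0):
    """Euler integration of a flow policy: noise at tau=0 -> action chunk at tau=1.

    `velocity_field(A_tau, obs, tau)` stands in for the learned network v_pi and
    must return an array with the same shape as `A_tau`."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((horizon, action_dim))       # A_t^0 ~ N(0, I)
    for k in range(n_steps):
        tau = k / n_steps
        A = A + (1.0 / n_steps) * velocity_field(A, obs, tau)
    return A                                             # A_t^1, the denoised chunk

# Toy velocity field that simply pulls the sample toward zero actions.
chunk = sample_chunk(lambda A, obs, tau: -A, obs=None, horizon=8, action_dim=2)
```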

4.3. Real-Time Constraint and Asynchronicity

  • Controller period: $\Delta t$ seconds.

  • Model latency: $\delta$ seconds per chunk inference.

  • Inference delay: $d := \lfloor \delta / \Delta t \rfloor$ controller steps.

  • When executing chunk $\mathbf{A}_0$ with desired execution horizon $s$, an async algorithm must start inference at step $s - d$ so that the next chunk $\mathbf{A}_{s-d}$ is available by the time step $s$ is reached, ensuring an action is always available. However, because $\mathbf{A}_{s-d}$ was generated without knowing exactly how the intermediate steps would unfold, discontinuities may arise between $\mathbf{a}_{s-1|0}$ and $\mathbf{a}_{s|s-d}$; a small numeric check of this timing follows Figure 2 below.

    As illustration, the following figure (Figure 2 from the original paper) shows a typical bifurcation between consecutive chunks and why naive async causes out-of-distribution jumps:

    Figure 2: An illustration of a typical bifurcation between consecutive chunks. Inference is started between timesteps 3 and 4. The original chunk that was executing, $\{a_t\}$ (black), had planned to go above the obstacle, while the newly generated chunk $\{a_t'\}$ (red) goes below the obstacle. However, $\{a_t'\}$ is not available until $d = 7$ steps later. A naive asynchronous algorithm might jump from $a_{10}$ to $a_{11}'$, inducing a very high, out-of-distribution acceleration. Temporal ensembling [68], i.e., interpolating between chunks, reduces the acceleration but produces poor actions.
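
A quick sanity check of the scheduling rule above, with hypothetical values for $H$, $s$, and $d$ (not taken from the paper):

```python
# Hypothetical values satisfying d <= s <= H - d.
H, s, d = 8, 4, 3
start_step = s - d            # launch inference for the next chunk at this step
ready_step = start_step + d   # step at which the new chunk becomes available
assert ready_step == s        # it arrives exactly as the executed prefix runs out
# A naive async controller would then switch from a_{s-1} of the old chunk to
# a_s of the new chunk, which may lie on a different "mode" (Figure 2).
```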

4.4. Inference-Time Inpainting with Flow Matching

RTC builds on training-free inpainting guidance for iterative denoising. The velocity field is corrected at each denoising step with a gradient-based guidance term that enforces consistency with a target $\mathbf{Y}$ (the "corrupted" desired output that encodes the frozen/overlap constraints via masking).

The guidance correction (ΠGDM, specialized to action-chunk generation) is:

$ \mathbf{v}_{\Pi\mathrm{GDM}}(\mathbf{A}_t^{\tau}, \mathbf{o}_t, \tau) = \mathbf{v}(\mathbf{A}_t^{\tau}, \mathbf{o}_t, \tau) + \min\left(\beta, \frac{1 - \tau}{\tau \, r_{\tau}^{2}}\right) \left(\mathbf{Y} - \widehat{\mathbf{A}_t^{1}}\right)^{\top} \mathrm{diag}(\mathbf{W}) \, \frac{\partial \widehat{\mathbf{A}_t^{1}}}{\partial \mathbf{A}_t^{\tau}}, $

where

$ \widehat{\mathbf{A}_t^{1}} = \mathbf{A}_t^{\tau} + (1 - \tau)\, \mathbf{v}(\mathbf{A}_t^{\tau}, \mathbf{o}_t, \tau), \qquad r_{\tau}^{2} = \frac{(1 - \tau)^{2}}{\tau^{2} + (1 - \tau)^{2}}. $

Symbols:

  • $\mathbf{v}$: Base velocity field (same as $\mathbf{v}_{\pi}$).

  • $\beta$: Guidance weight clip that caps the guidance coefficient for stability (especially as $\tau \to 0$).

  • $\mathbf{Y}$: Target vector encoding the desired/frozen overlap region (constructed by right-padding the previous chunk and masking).

  • $\widehat{\mathbf{A}_{t}^{1}}$: One-step estimate of the final denoised chunk from the current state.

  • $\mathbf{W}$: Mask weights (soft mask; explained in Section 4.5).

  • $\mathrm{diag}(\mathbf{W})$: Diagonal matrix with entries from $\mathbf{W}$.

  • $\frac{\partial \widehat{\mathbf{A}_t^{1}}}{\partial \mathbf{A}_t^{\tau}}$: Jacobian of the estimate with respect to the current state; the guidance uses a vector-Jacobian product, implemented via reverse-mode automatic differentiation.

    Interpretation:

  • The correction adds a term pushing the final denoised chunk toward $\mathbf{Y}$, weighted across timesteps by $\mathbf{W}$ and scaled by the schedule $\min\left(\beta, \frac{1-\tau}{\tau r_{\tau}^{2}}\right)$.

  • The Jacobian ensures the guidance properly propagates through the generative flow.

    Practical integration update: $ \mathbf{A}_{t}^{\tau + \frac{1}{n}} = \mathbf{A}_{t}^{\tau} + \frac{1}{n} \left( \mathbf{v}_{\pi}(\mathbf{A}_t^{\tau}, \mathbf{o}_t, \tau) + \min\left(\beta, \frac{1 - \tau}{\tau \, r_{\tau}^{2}}\right) \cdot \mathrm{VJP} \right), $ where VJP denotes the vector-Jacobian product $\left(\mathbf{Y} - \widehat{\mathbf{A}_t^{1}}\right)^{\top} \mathrm{diag}(\mathbf{W}) \, \frac{\partial \widehat{\mathbf{A}_t^{1}}}{\partial \mathbf{A}_t^{\tau}}$.
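
The following is a minimal JAX sketch of this guided denoising loop, using `jax.vjp` to form the vector-Jacobian product. The names `v_pi`, `Y`, and `W` stand for the base velocity field, the right-padded previous chunk, and the soft-mask weights; they, the toy dimensions, and the stand-in velocity field are assumptions for illustration rather than the paper's implementation:

```python
import jax
import jax.numpy as jnp

def guided_denoise(v_pi, obs, Y, W, H, action_dim, n_steps=5, beta=5.0, key=None):
    """Minimal sketch of RTC-style guided (inpainting) sampling.

    `v_pi(A, obs, tau)` stands in for the base flow policy's velocity field,
    `Y` holds the right-padded previous chunk (shape [H, action_dim]), and
    `W` the per-timestep soft-mask weights (shape [H])."""
    key = jax.random.PRNGKey(0) if key is None else key
    A = jax.random.normal(key, (H, action_dim))           # A^0 ~ N(0, I)
    for k in range(n_steps):
        tau = k / n_steps
        # One-step denoising estimate A_hat^1 and its VJP w.r.t. the current state.
        denoise = lambda A_tau: A_tau + (1.0 - tau) * v_pi(A_tau, obs, tau)
        A_hat, vjp_fn = jax.vjp(denoise, A)
        err = (Y - A_hat) * W[:, None]                     # (Y - A_hat)^T diag(W)
        (guidance,) = vjp_fn(err)                          # vector-Jacobian product
        r_sq = (1.0 - tau) ** 2 / (tau ** 2 + (1.0 - tau) ** 2)
        # Clipped schedule min(beta, (1 - tau) / (tau * r_tau^2)); finite at tau = 0.
        coeff = jnp.minimum(beta, (1.0 - tau) / jnp.maximum(tau * r_sq, 1e-8))
        A = A + (1.0 / n_steps) * (v_pi(A, obs, tau) + coeff * guidance)
    return A

# Toy usage with a stand-in velocity field that pulls the sample toward zero.
W = jnp.concatenate([jnp.ones(3), jnp.linspace(1.0, 0.0, 3), jnp.zeros(2)])
Y = jnp.zeros((8, 2))
chunk = guided_denoise(lambda A, obs, tau: -A, obs=None, Y=Y, W=W, H=8, action_dim=2)
```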

4.5. Soft Masking for Cross-Chunk Continuity

Hard masking (only the first $d$ overlapping actions get full weight 1, all others 0) may be too weak when $d$ is small and can still allow strategy switches. RTC instead proposes soft masking across all $H - s$ overlapping timesteps: weights decay from 1 (earliest overlap) to 0 (non-overlap/future region), reflecting growing uncertainty further in time.

Soft-mask definition: $ \mathbf{W}_{i} = \begin{cases} 1 & \text{if } i < d \\ c_{i}\, \frac{e^{c_{i}} - 1}{e - 1} & \text{if } d \leq i < H - s \\ 0 & \text{if } i \geq H - s \end{cases} \quad \text{where } c_{i} = \frac{H - s - i}{H - s - d + 1}, \quad i \in \{0, \ldots, H - 1\}. $ Symbols:

  • $i$: Index into the action chunk (0-based).

  • $H$: Prediction horizon.

  • $s$: Execution horizon in the upcoming chunk (must satisfy $d \leq s \leq H - d$).

  • $d$: Inference delay (overlap length before the new chunk becomes available).

  • $c_i$: Normalized position within the overlap region; the exponential schedule $c_i \frac{e^{c_i}-1}{e-1}$ gives a smooth decay from 1 to 0 (transcribed in the code sketch below).
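
A direct transcription of this schedule (the function name and example values are illustrative):

```python
import numpy as np

def soft_mask(H, s, d):
    """Soft-mask weights over a chunk of length H, following the schedule above:
    weight 1 on the first d (frozen) steps, an exponential decay over the rest of
    the overlap, and weight 0 on the last s (newly generated) steps."""
    W = np.zeros(H)
    for i in range(H):
        if i < d:
            W[i] = 1.0
        elif i < H - s:
            c = (H - s - i) / (H - s - d + 1)
            W[i] = c * (np.exp(c) - 1.0) / (np.e - 1.0)
        # else: i >= H - s, weight stays 0
    return W

# Hypothetical values: chunk length 8, execution horizon 3, delay 2.
print(np.round(soft_mask(H=8, s=3, d=2), 3))
```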

    Empirical comparison shows soft masking better matches the frozen region and reduces fast direction changes versus hard masking. The following figure (Figure 4 from the original paper) visually compares the two:

    Figure 4: A comparison of naive inpainting (hard masking) and our proposed soft masking method: note that hard masking does not match the frozen region very well and produces faster changes in direction.

4.6. Full RTC System Architecture

RTC runs inference asynchronously while execution proceeds, with three key components:

  1. A controller interface (GETACTION) called every $\Delta t$ to consume the next action and provide the latest observation.
  2. A background INFERENCELOOP thread that:
    • Tracks a buffer of past delays $Q$ and conservatively estimates the next $d$ (e.g., the maximum of the buffer).
    • Determines $s$ (the actual execution horizon) as $\max(d, s_{\mathrm{min}})$, where the user specifies a minimum $s_{\mathrm{min}}$.
    • Prepares the previous chunk's remaining actions $\mathbf{A}_{\text{prev}}$ (after removing the $s$ already-executed actions) and calls GUIDEDINFERENCE to inpaint the next chunk with soft masking.
    • Swaps in the newly generated chunk $\mathbf{A}_{\text{cur}}$ as soon as it is available and resets local indexing.
    • Records the observed delay into $Q$.
  3. A GUIDEDINFERENCE routine that:
    • Computes $\mathbf{W}$ via the soft-mask formula.
    • Right-pads $\mathbf{A}_{\text{prev}}$ to length $H$ to form $\mathbf{Y}$.
    • Initializes $\mathbf{A}_{0} \sim \mathcal{N}(0, I)$ and iterates $\tau$ from 0 to 1 with step $1/n$:
      • Defines the denoising estimate $\widehat{\mathbf{A}_t^{1}} = \mathbf{A}^{\tau} + (1 - \tau)\,\mathbf{v}_{\pi}(\mathbf{A}^{\tau}, \mathbf{o}, \tau)$.

      • Computes the weighted error $\mathbf{e} = (\mathbf{A}_{\text{prev}} - \widehat{\mathbf{A}_t^{1}})\,\mathrm{diag}(\mathbf{W})$.

      • Forms the vector-Jacobian product via autodiff.

      • Updates $\mathbf{A}^{\tau}$ using Equation (1) plus the guidance term with clipping $\beta$.

        Visualization of RTC's masking/regions is shown below (Figure 3 from the original paper). The left region is "frozen" (full guidance weight); the middle overlap gets exponentially decaying weights; the last $s$ steps are newly generated (weight 0). This respects that the early overlap must match what will execute, while the later overlap can incorporate new observations smoothly:

        Figure 3: Schematic of RTC's guided inference. The horizontal axis spans actions $a_{-5}$ through $a_{15}$ across the start of inference and subsequent execution; the vertical axis shows the guidance weight decaying from 1 to 0 across the executed, frozen, and newly generated regions, with the execution horizon and inference delay annotated.
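
Putting the three components together, the following is a simplified, single-threaded sketch of the controller-side bookkeeping; the paper runs guided inference in a background thread so execution never pauses, and `guided_inference` is a placeholder for the soft-masked sampler sketched earlier:

```python
from collections import deque
import numpy as np

class RealTimeChunker:
    """Single-threaded sketch of RTC's bookkeeping. In the paper, GUIDEDINFERENCE
    runs in a background thread; here it happens inline so the control flow is
    easy to read. `guided_inference(obs, A_prev, d)` is a placeholder for the
    soft-masked, guided sampler and must return a chunk of shape [H, action_dim]."""

    def __init__(self, guided_inference, s_min=1, buffer_size=10):
        self.guided_inference = guided_inference
        self.s_min = s_min
        self.delays = deque([1], maxlen=buffer_size)  # observed delays (steps)
        self.chunk = None    # currently executing chunk
        self.index = 0       # next action to execute within the chunk

    def get_action(self, obs):
        d = max(self.delays)             # conservative delay estimate
        s = max(d, self.s_min)           # actual execution horizon
        if self.chunk is None or self.index >= s:
            # The not-yet-executed tail of the previous chunk is the inpainting
            # target; its first d actions will be "frozen" in the new chunk.
            A_prev = None if self.chunk is None else self.chunk[self.index:]
            self.chunk = self.guided_inference(obs, A_prev, d)
            self.index = 0
            # In the threaded version, the delay measured for this inference
            # call would be recorded here.
            self.delays.append(d)
        action = self.chunk[self.index]
        self.index += 1
        return action

# Toy usage: a stand-in sampler that ignores its inputs.
rtc = RealTimeChunker(lambda obs, A_prev, d: np.zeros((8, 2)), s_min=3)
first_action = rtc.get_action(obs=None)
```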

Additionally, the paper illustrates the effect of RTC on a precise, dynamic task (lighting a match), showing smoother joint trajectories and improved speed under delay; see below (Figure 1 from the original paper):

Figure 1: Top: Real-time chunking (RTC) enables the robot to perform highly dexterous and dynamic tasks, such as lighting a match, even in the presence of inference delays in excess of 300 milliseconds, corresponding to more than 30% of the model's prediction horizon. Bottom: RTC performs the same robot motion 20% faster than synchronous inference [5, 30, 8, 24, 31, 59], and smoother than all competing methods, including temporal ensembling [68]. The shown positions, velocities, and accelerations correspond to the shoulder joint of one arm, and are taken from the first 10 seconds of a real autonomous match-lighting rollout.

4.7. Practical Considerations

  • Guidance weight clipping $\beta$: Necessary for stability with small $n$ (e.g., the 5 denoising steps common in control), especially to keep the guidance term finite at $\tau = 0$. Empirical ablations suggest $\beta = 5$ works well; too high a $\beta$ induces divergence and jerkiness (see Figure 7 in the Appendix).
  • Diffusion policies: Can be converted to flows at inference time using known relationships; RTC is agnostic to whether the base is diffusion or flow, as long as an iterative denoiser and velocity/score can be guided.
  • Execution horizon bounds: Must satisfy $d \leq s \leq H - d$ to ensure the new chunk arrives before non-overlap steps are needed, preserving real-time availability.
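
As a quick check of these bounds with the real-world chunk length and the delay regimes reported in Section 5.1 ($H = 50$, $d \approx 6/11/16$; the snippet itself is only illustrative):

```python
H = 50                      # real-world prediction horizon
for d in (6, 11, 16):       # approximate delays at +0, +100, +200 ms latency
    lo, hi = d, H - d       # admissible execution horizons: d <= s <= H - d
    print(f"d = {d:2d}: execution horizon s may range from {lo} to {hi}")
```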

5. Experimental Setup

5.1. Datasets and Tasks

  • Simulation (Kinetix benchmark):
    • 12 highly dynamic environments (10 existing + 2 new) involving throwing, catching, balancing, etc.
    • Force-based control with added Gaussian action noise to require closed-loop corrections.
    • Expert data: Trained with Robust Policy Optimization (RPO) [50], 6 seeds per environment; 1M transition dataset combining experts per episode.
    • Policy training: Action chunking flow policies with $H = 8$ and a 4-layer MLP-Mixer architecture; trained for 32 epochs.
    • Evaluation: 2048 rollouts per data point; delays $d \in \{0, \dots, 4\}$ (the maximum supported with $H = 8$).
  • Real-world (bimanual manipulation):
    • Base policy: $\pi_{0.5}$ VLA [24]; $H = 50$, $\Delta t = 20$ ms (50 Hz), $n = 5$ denoising steps.
    • Latency: Baseline model latency 76 ms (vanilla), 97 ms with RTC (additional guidance cost); remote inference over LAN adds 10–20 ms, giving a starting $d \approx 6$. Injected latencies: +100 ms ($d \approx 11$) and +200 ms ($d \approx 16$).
    • Hardware: Two 6-DoF arms with parallel-jaw grippers; position control; separate inference workstation with NVIDIA RTX 4090; robots over LAN WebSockets.
    • Tasks and scoring (each episode’s integer score = number of substeps completed):
      • Light candle (5 steps, 40s cutoff): pick match/matchbox, strike, light candle, drop in bowl.
      • Plug ethernet (6 steps, 120s cutoff): reorient and plug cable ends into server rack (both ends).
      • Make bed, mobile (3 steps, 200s cutoff): move blanket corner and two pillows to head of bed.
      • Shirt folding (1 step, 300s cutoff): fold a flattened shirt.
      • Batch folding (4 steps, 300s cutoff): take varied crumpled clothing from bin, flatten, fold, stack neatly.
      • Dishes in sink, mobile (8 steps, 300s cutoff): move 4 varied items from counter to sink.
    • Trials: 10 trials per task per method × delays, totaling 480 episodes and 28 hours of robot time. Post-hoc annotation of scores and timestamps per step.

5.2. Baselines

  • Simulation:
    • Naive async: Generate new chunk in parallel but ignore the previous chunk’s actions; switch as soon as the new chunk is ready (causes discontinuities).
    • Bidirectional Decoding (BID) [39]: Rejection sampling to ensure continuity; configured with batch size $N = 32$, mode size $K = 3$, and a weak policy trained for 8 epochs. Uses significantly more compute (samples 64 chunks per update).
    • Temporal ensembling (TE) [68]: Keep a buffer of predicted chunks and execute averaged actions per timestep (prone to invalid actions in multi-modal settings).
  • Real-world:
    • Synchronous: Execute $s = 25$ actions, then pause for new chunk generation (the default in prior works).
    • TE, sparse: Execute $s = 25$ while computing the next chunk in parallel; apply TE only on the overlapping executed steps ($H - s - 2d$).
    • TE, dense: Run inference as often as possible, so $s = d$ for each chunk; there are always at least 2 overlapping chunks to ensemble (often more).

5.3. Evaluation Metrics

  • Binary success rate (simulation):
    • Concept: Fraction of episodes that meet the task’s completion criterion (success/failure).
    • Mathematical formula: $ \text{SuccessRate} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{\text{episode } i \text{ succeeds}\}. $ Symbols: $N$ is the total number of episodes; $\mathbb{1}\{\cdot\}$ is the indicator function.
    • Confidence intervals: 95% Wilson score intervals are plotted. For $x$ observed successes in $N$ trials at confidence level $1 - \alpha = 0.95$ (so $z = 1.96$), $ \hat{p} = \frac{x}{N}, \quad \text{CI}_{\text{Wilson}} = \frac{\hat{p} + \frac{z^{2}}{2N} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{N} + \frac{z^{2}}{4N^{2}}}}{1 + \frac{z^{2}}{N}}. $ Symbols: $\hat{p}$ is the empirical proportion; $z$ is the standard normal quantile.
  • Average throughput (real-world):
    • Concept: Balanced measure of speed and performance—proportion of the task completed per unit time (higher means more steps done faster).
    • Mathematical formula: $ \text{Throughput} = \frac{\text{StepsCompleted}}{\text{EpisodeDuration}}. $ Symbols: StepsCompleted = integer count of substeps achieved; EpisodeDuration = total elapsed time including pauses.
    • Error bars: Mean ± 1 standard error of the mean (SEM) across episodes: $ \mathrm{SEM} = \frac{\sigma}{\sqrt{N}}, $ where $\sigma$ is the sample standard deviation and $N$ the number of episodes (both metrics are reproduced in the sketch below).
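
Both metrics are straightforward to reproduce; the sketch below implements the Wilson interval and the throughput mean ± SEM exactly as defined above (the example numbers are illustrative, not the paper's data):

```python
import math
import numpy as np

def wilson_interval(x, N, z=1.96):
    """95% Wilson score interval for x successes out of N trials."""
    p_hat = x / N
    denom = 1 + z**2 / N
    center = (p_hat + z**2 / (2 * N)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / N + z**2 / (4 * N**2)) / denom
    return center - half, center + half

def throughput_mean_sem(steps_completed, durations_s):
    """Per-episode throughput (substeps per second), reported as mean +/- 1 SEM."""
    tp = np.asarray(steps_completed, float) / np.asarray(durations_s, float)
    return tp.mean(), tp.std(ddof=1) / np.sqrt(len(tp))

print(wilson_interval(x=1500, N=2048))                       # e.g. a simulated solve rate
print(throughput_mean_sem([5, 4, 5], [38.0, 40.0, 35.5]))    # e.g. real-world episodes
```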

5.4. Why These Datasets/Tasks?

  • Kinetix dynamic tasks with noise stress delay robustness and closed-loop correction benefits—suitable for testing continuity and reactivity.
  • Real-world tasks include both precise, dynamic manipulation (lighting a match) and longer-horizon mobile/bimanual tasks (bed-making, dishes), demonstrating RTC’s practical benefits and safety under latency.

6. Results & Analysis

6.1. Core Results Analysis

  • Simulation (Figure 5):
    • Delay robustness: RTC achieves the highest solve rates across delays; TE performs poorly even at $d = 0$ (averaging yields invalid actions in these multi-modal benchmarks). RTC outperforms BID, with the gap widening as $d$ increases, despite BID's higher compute.
    • Execution horizon sensitivity: RTC uniquely benefits from decreasing execution horizon (more frequent closed-loop updates) thanks to continuity across chunks; other methods do not consistently improve.
    • Soft vs. hard masking: Soft masking outperforms hard masking, particularly at lower delays and smaller $s$, supporting the need for decaying weights across the overlap.
  • Real-world (Figure 6):
    • Average throughput: RTC is best at all delays; maintains performance under +100ms and +200ms injected latency with statistically significant advantages, whereas TE variants cannot run (protective stop triggered due to oscillations), and synchronous degrades linearly.

    • Early progress and fewer retries: RTC completes more substeps earlier even after removing inference pauses, indicating fewer mistakes and smoother control; synchronous often reaches similar final scores but later, reflecting retries.

    • Precision tasks: In “light candle,” RTC has a large advantage in final score (higher success rate) despite no retrying, showcasing robustness in precise, latency-sensitive actions. In hard tasks like bed-making, RTC shows strong benefits (policy struggles with pillows; RTC mitigates).

      The following figure (Figure 5 from the original paper) summarizes simulated performance across tasks and delays, with RTC outperforming baselines:

      Figure 5: Solve rate under varying execution horizons and inference delays. The top left shows example dynamic Kinetix environments; the remaining panels show per-environment solve-rate curves. RTC (red) outperforms the compared methods at every inference delay.

The following figure (Figure 6 from the original paper) summarizes real-world cumulative progress and throughput under varying delays; note TE variants failing under injected latency:

Figure 6: Top: Controller steps (elapsed time with inference pauses removed, multiplied by 50 Hz) vs. cumulative progress for each task, aggregated across all delays. Progress is measured in discrete steps corresponding to the subsections of each task. Left: Time (including inference pauses) vs. cumulative progress aggregated across all tasks; the x-axis is log scale to better show progress during both short- and long-horizon tasks. Right: Inference delay vs. average throughput, defined as the proportion of the task completed divided by the episode duration, averaged over episodes. Error bars are ±1 SEM. Average throughput gives a balanced view of both speed and performance for each method. Neither TE variant can run at +100 or +200 ms of injected latency, causing such high oscillations that the robot's protective stop is triggered.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Method Latency
RTC (ours) 97ms
BID with N = 16 (no forward model) 115ms
BID with N = 16 (shared backbone) 169ms
BID with N = 16 (full) 223ms
Vanilla π0.5 76ms

Interpretation:

  • RTC adds guidance overhead relative to vanilla (97ms vs. 76ms) but is significantly faster than BID configurations at similar batch sizes, especially full BID (223ms). This supports RTC’s superior real-time viability.

    The following are the results from Table 2 (breakdown of total inference latency by component) of the original paper:

    Component Time (mobile) Time (non-mobile)
    Model 96.89 ± 0.16ms 97.43 ± 0.28ms
    Network 21.20 ± 3.12ms 6.89 ± 2.39ms
    Image resize 11.22 ± 5.00ms 1.44 ± 0.27ms
    Other 9.67 ± 3.20ms 3.00 ± 0.68ms
    Total 138.98 ± 6.71ms 108.76 ± 2.34ms

Interpretation:

  • Network latency is higher in mobile manipulation (robot laptop vs. desktop), and local pre-processing (image resize) contributes non-trivially. Total end-to-end latency varies by deployment setting despite identical model time.

    The following are the results from Table 3 (model inference latency breakdown) of the original paper:

    Component Time (no RTC) Time (with RTC)
    Image encoders (SigLIP) 18ms 18ms
    LLM prefill (Gemma 2B) 44ms 44ms
    Denoising step (x5) 14ms 35ms
    Total 76ms 97ms

Interpretation:

  • RTC incurs a ~2.5× per-denoising-step latency increase due to guidance and autodiff, yielding 97ms total (vs. 76ms baseline). The added cost buys continuity and real-time robustness.

    The following are the results from Table 4 (RTC hyperparameters) of the original paper:

    Hyperparameter Description Simulation Real-world
    n Denoising steps 5 5
    H Prediction horizon 8 50
    s_min Minimum execution horizon 25
    β Guidance weight clipping 5 5
    b Delay buffer size 10

6.3. Ablation Studies / Parameter Analysis

  • Guidance weight clipping $\beta$ (Appendix Figure 7):
    • The top left shows the unclipped guidance schedule $\frac{1 - \tau}{\tau r_{\tau}^{2}}$ diverging at $\tau = 0$; clipping stabilizes it.
    • The ablation indicates no marginal benefit beyond $\beta = 5$; too high a $\beta$ causes divergence and increased jerkiness (maximum acceleration grows), especially with few denoising steps ($n = 5$). This justifies the conservative choice $\beta = 5$.
  • Soft mask schedules (Appendix Figure 8, left):
    • Exponential decay performs best overall; linear is close behind. These support the intuition that stronger early overlap enforcement is beneficial, tapering toward future steps.
  • Comparison to Diffuser inpainting (Appendix Figure 8, right):
    • Diffuser’s overwrite-based inpainting underperforms guidance-based RTC in this control setting, suggesting RTC’s gradient-guided approach better preserves action manifold and continuity.

      Here is Figure 7 from the original paper, illustrating the effect of $\beta$ on stability and jerkiness:

      Figure 7: Top left: the unclipped guidance coefficient as a function of $\tau$. Top right: solve rate vs. $\beta$, grouped by inference delay. Bottom left: per-step deviation from the previous chunk for different numbers of denoising steps. Bottom right: maximum acceleration vs. $\beta$, showing how jerkiness grows with $\beta$.

Here is Figure 8 from the original paper, showing soft-mask schedule ablations and comparison with Diffuser inpainting:

Figure 8: Left: Simulated ablation over different schedules for soft masking weights (Eq. 5). Exponential decay performs the best overall, although linear decay is very close behind. Right: Comparison with the inpainting algorithm from Diffuser [26], which overwrites a portion of the action chunk with the desired actions at each denoising step and is outperformed by our guidance-based approach.

7. Conclusion & Reflections

7.1. Conclusion Summary

  • RTC is a practical, training-free inference-time algorithm enabling real-time, asynchronous execution of high-latency action chunking policies (flow/diffusion VLAs).
  • It frames the problem as inpainting: freeze guaranteed-to-execute actions and guide the denoiser to inpaint the rest, with soft masking over overlapping timesteps for continuity.
  • RTC is fast and robust to delay vs. alternatives; in simulation, it outperforms naive async, TE, and BID, with increasing margins under higher delays; in real robots, it maintains throughput under significant injected latency while TE variants fail and synchronous degrades.
  • It improves both speed and quality (earlier completion of substeps, fewer retries) and achieves high success on precise tasks (lighting a candle with a match).

7.2. Limitations & Future Work

  • Limitations:
    • Computational overhead: RTC adds guidance/autodiff cost per denoising step, increasing latency vs. vanilla sampling.
    • Applicability: Requires iterative denoising (diffusion/flow); not directly applicable to non-iterative policy classes.
    • Real-world scope: While manipulation tasks are diverse, other dynamic domains (e.g., legged locomotion) were not demonstrated in hardware.
  • Future directions:
    • Applying RTC to more dynamic platforms (e.g., legged robots).
    • Combining RTC with model acceleration (e.g., consistency policies) to reduce per-step cost while preserving continuity.
    • Exploring theoretical connections with cascade control/MPC under learned policies (outer loop learned chunking with inner hardware control).
    • Extending RTC-style guidance to hierarchical VLAs (System 2 planning + System 1 control) and evaluating trade-offs.

7.3. Personal Insights & Critique

  • Innovations:
    • The inpainting perspective for action chunks is elegant and practical; soft masking is a crucial addition that respects uncertainty across time and significantly improves continuity.
    • Guidance weight clipping shows careful attention to stability in low-step denoising regimes common in control.
  • Transferability:
    • RTC could be applied to any iterative generative controller (e.g., diffusion-based walking gaits) that needs real-time responsiveness under latency, including AR/VR interaction controllers or autonomous driving submodules that plan action sequences.
    • The masking-and-guidance framework may generalize to other sequential generation tasks (e.g., speech or motion synthesis) where continuity across overlapping windows matters.
  • Potential issues:
    • Compute overhead may still be prohibitive on edge devices without GPU acceleration; combining RTC with distillation/consistency policies could be vital.
    • Soft-mask schedule tuning could be task-dependent; while exponential decay works well broadly, adaptive schedules informed by uncertainty estimates might further improve performance.
    • The real-world evaluation uses position control; extensions to torque/force control (as in simulation) would strengthen claims in broader settings.
  • Overall assessment:
    • RTC addresses a real pain point in deploying large VLAs for embodied control: unavoidable latency. Its training-free, general applicability and robust empirical gains make it a valuable contribution likely to influence how practitioners run diffusion/flow policies in real time.
