Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Published: 03/31/2026

Analysis

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

1. Bibliographic Information

1.1. Title

The title of the paper is "Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis". This title indicates that the research focuses on a unified model architecture that acts as an "agent" to generate images grounded in real-world knowledge, rather than relying solely on internal training data.

1.2. Authors

The authors are Shawn Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yun Long Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, and Nanyun Peng.

Their affiliations are a mix of academia and industry:

  • University of California, Los Angeles (UCLA)
  • Tencent Hunyuan (a leading AI research lab in China)
  • The Chinese University of Hong Kong (CUHK)
  • The Hong Kong University of Science and Technology (HKUST)

1.3. Journal/Conference

The paper is currently a preprint hosted on arXiv (arXiv:2603.29620), published on March 31, 2026. As a preprint, it has not yet appeared in a specific peer-reviewed conference or journal proceedings, but it addresses cutting-edge topics in multimodal learning and generative AI, typical of top-tier venues like CVPR, ICCV, or NeurIPS.

1.4. Publication Year

The paper was published in 2026, released as an arXiv preprint dated March 31, 2026.

1.5. Abstract

The paper proposes Unify-Agent, a unified multimodal agent designed to solve the problem of generating images for rare, long-tail, or knowledge-intensive concepts. Current unified multimodal models (UMMs) struggle with these tasks because they rely on "frozen parametric knowledge" (what they learned during training). Unify-Agent reframes image generation as an agentic pipeline consisting of four steps: THINK (prompt understanding), RESEARCH (multimodal evidence searching), RECAPTION (grounded recaptioning), and GENERATE (final synthesis). To train this model, the authors constructed a data pipeline curating 143K high-quality agent trajectories. They also introduced FactIP, a new benchmark for evaluating factual consistency in long-tail concepts. Experiments show Unify-Agent significantly outperforms base unified models and approaches the capabilities of strong closed-source models.

2. Executive Summary

2.1. Background & Motivation

Core Problem: The core problem is that current state-of-the-art Unified Multimodal Models (UMMs)—models that can both understand and generate images within a single architecture—fail to generate accurate images for "long-tail" or "knowledge-intensive" concepts. These are rare entities, specific cultural icons, or niche objects that do not appear frequently in the training data.

Why is this important? Standard text-to-image models act as "closed-book" systems. They rely entirely on their internal weights (parametric memory). If they haven't seen "Grigori Perelman" or a specific "DUDOO" art toy during training, they will hallucinate or produce generic outputs. This limits the reliability of AI in real-world applications where factual accuracy and identity preservation are crucial.

Innovative Idea: The authors propose moving from a "closed-book" approach to an "open-book" approach using agentic modeling. Instead of just mapping a prompt to an image, the model should act like a human agent: first understand the prompt, realize it doesn't know the subject, actively search the web for text and images (RESEARCH), synthesize this information into a detailed description (RECAPTION), and then generate the image.

2.2. Main Contributions / Findings

  1. Novel Architecture: Unify-Agent, a unified multimodal agent that integrates reasoning, searching, and generation into a single model with a specific pipeline (THINK, RESEARCH, RECAPTION, GENERATE).
  2. Data Pipeline: A tailored data construction process that curates 143K high-quality agent trajectories. This data teaches the model not just what to generate, but how to search for information and how to rewrite prompts based on evidence.
  3. FactIP Benchmark: A new benchmark specifically designed to evaluate long-tail, culturally significant, and factual concepts (e.g., celebrities, landmarks, art toys) across 12 categories.
  4. Superior Performance: Unify-Agent achieves state-of-the-art performance among open-source models on benchmarks like FactIP, WiSE, KiTTEN, and T2I-FactualBench. It proves that tightly coupling reasoning and generation improves factual fidelity.
  5. Architectural Insight: The paper finds that including generation capabilities (specifically VAE tokens) in the understanding/recaption phase improves the model's ability to reason about visual evidence.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, one must be familiar with several key concepts:

  • Unified Multimodal Models (UMMs): These are AI models (like Chameleon, Show-o, or the Bagel model used here) that use a single Transformer architecture to handle multiple modalities—text, images, and sometimes video—simultaneously. Unlike traditional systems that might use a separate language model for understanding and a separate diffusion model for generation, UMMs process everything in one shared space.
  • Parametric Knowledge vs. Non-Parametric Knowledge:
    • Parametric Knowledge: The information stored in the weights of a neural network during training. It is fast to access but static and hard to update.
    • Non-Parametric Knowledge: Information stored externally, such as in a database, a search engine, or a text document. This paper introduces non-parametric knowledge (via search) into the generation process.
  • Flow Matching: This is a modern technique for training generative models (like diffusion models). Instead of learning to denoise noise, the model learns to predict the "velocity field" (the direction and speed of change) that transforms a noise distribution into a data distribution over continuous time. The base model used in this paper, Bagel, uses this objective.
  • Vision Transformer (ViT): A neural network architecture that processes images by splitting them into patches (like a sequence of words) and applying the Transformer self-attention mechanism. It is used here to encode visual features for understanding.
  • Variational Autoencoder (VAE): A type of generative model that learns to compress data into a lower-dimensional latent space and then reconstruct it from that space. In image generation, VAEs are used to compress high-resolution images into smaller "latent" representations that are easier for the main model to process.
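
To make the flow-matching concept concrete, here is a minimal numerical sketch using the common linear-interpolation (rectified-flow) path; the paper does not specify Bagel's exact path, so treat this as an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent dimension; a real VAE compresses images to e.g. 64x64x4 latents.
z_noise = rng.normal(size=(4, 16))  # samples from the noise distribution
z_data = rng.normal(size=(4, 16))   # stand-ins for data latents

def interpolate(z0, z1, t):
    """Linear probability path: z_t = (1 - t) * z0 + t * z1."""
    return (1.0 - t) * z0 + t * z1

def target_velocity(z0, z1):
    """For the linear path, the target velocity field is constant: z1 - z0."""
    return z1 - z0

def flow_matching_loss(u_pred, u_star):
    """L2 regression of the model's velocity prediction onto the target."""
    return float(np.mean((u_pred - u_star) ** 2))

t = 0.3
z_t = interpolate(z_noise, z_data, t)       # state the model would see at time t
u_star = target_velocity(z_noise, z_data)   # the vector field it must predict

# A perfect prediction incurs zero loss.
assert flow_matching_loss(u_star, u_star) == 0.0
```

At inference, integrating the learned velocity field from t = 0 to t = 1 transports a noise sample to a data-like latent, which the VAE decoder then renders as pixels.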

3.2. Previous Works

The paper categorizes related work into three areas:

  1. Unified Multimodal Models: The authors cite models like Chameleon, Show-o, and Janus. These models demonstrate that understanding and generation can be harmonized. However, the authors note they remain "closed-book" and struggle with long-tail concepts.
  2. Agentic Workflows: Previous work like Visual ChatGPT or HuggingGPT often uses a "decoupled" system where a language model plans and calls separate tools (like a Stable Diffusion model) to generate. The authors argue that decoupled systems lack the fine-grained visual understanding that a unified model provides.
  3. World Knowledge and Evaluation: Benchmarks like T2I-CompBench and GenAI Bench evaluate general quality. Newer benchmarks like KiTTEN and WiSE focus on specific knowledge domains (biology, physics, culture). The authors argue that existing evaluations lack a structural focus on "evidence-grounded" generation, which their new FactIP benchmark addresses.

3.3. Technological Evolution

The field has evolved from:

  1. GANs and Early Diffusion: Focusing on visual fidelity, often ignoring text alignment or factual accuracy.
  2. Text-to-Image Diffusion (e.g., Stable Diffusion): Improved text alignment but still relied on frozen internal knowledge.
  3. Unified Multimodal Models: Merged understanding and generation into one transformer, improving coherence but still lacking external knowledge access.
  4. Agentic Generation (Current State): This paper fits here, adding the capability to actively reason and search the web before generating, bridging the gap between static generation and dynamic, real-world knowledge.

3.4. Differentiation Analysis

The core difference between Unify-Agent and previous agentic systems (like Visual ChatGPT) is architectural integration.

  • Previous: A language model (LLM) acts as the controller, sending text prompts to a separate image generator. The LLM doesn't "see" the visual details the same way the generator does.
  • Unify-Agent: Uses a Unified Transformer (based on Bagel). The same model that performs the text search and reasoning also contains the visual generation weights. This allows the "RECAPTION" phase to leverage deep visual features (from the VAE and ViT) to write better prompts, a capability decoupled agents lack.

4. Methodology

4.1. Principles

The core principle of Unify-Agent is to formulate world-grounded image synthesis not as a simple mapping $p(y \mid x)$, but as a joint probabilistic process involving multiple cognitive steps. The model acts as an agent that detects its own ignorance (a cognitive gap), searches for evidence (text and images), rewrites the prompt based on that evidence (recaption), and finally generates the image. This process is trained end-to-end within a unified architecture.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Base Model: Bagel

Unify-Agent is built upon Bagel, a unified multimodal model. Bagel uses a Mixture-of-Experts (MoE) architecture to disentangle understanding and generation.

  • Multimodal Understanding: This is treated as a next-token prediction task. Given a context $C$ (which can include text and images), the model minimizes the negative log-likelihood of the text tokens: $$\mathcal{L}_{\text{text}} = - \sum_{t=1}^T \log p_\theta(x_t \mid x_{<t}, C)$$ Here, $x_t$ is the target text token, and $x_{<t}$ is the history. This loss trains the model to understand and describe inputs.

  • Multimodal Generation: This is treated as a flow-matching task in the latent space of a VAE. The model learns a time-conditioned velocity field $u_\theta$: $$\mathcal{L}_{\text{image}} = \mathbb{E}_{t \sim \mathcal{U}(0,1), z_t} \left\| u_\theta(z_t, t; C) - u^\star(z_t, t) \right\|_2^2$$ Here, $t$ is a continuous timestep, $z_t$ is the latent state, and $u^\star$ is the target vector field. This loss trains the model to generate high-fidelity images.
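
The text loss above can be made concrete with a tiny numerical sketch. The helper below is illustrative only: `token_probs` stands in for the model's probability of each ground-truth token, $p_\theta(x_t \mid x_{<t}, C)$, which a real model would produce from logits.

```python
import math

def text_nll(token_probs):
    """L_text: negative log-likelihood summed over the target text tokens.
    token_probs[t] stands in for p_theta(x_t | x_<t, C), the model's
    probability of the t-th ground-truth token."""
    return -sum(math.log(p) for p in token_probs)

# A model that is certain of every gold token incurs zero loss...
assert text_nll([1.0, 1.0, 1.0]) == 0.0
# ...while lower-probability tokens increase the loss.
assert text_nll([0.5, 0.5]) > text_nll([0.9, 0.9])
```

Minimizing this quantity over a corpus is exactly the next-token-prediction objective described above.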

4.2.2. Problem Formulation

The authors argue that simply conditioning generation on raw retrieved text and images ($p_\theta(y \mid x, K_{\text{text}}, K_{\text{vis}})$) is suboptimal because raw evidence can be noisy or conflicting. Instead, they define the generation process as a factorization of four distinct phases:

$$p_\theta(y, c, \tau_t, \tau_v, g \mid x) = \underbrace{p_\theta(g \mid x)}_{\text{Gap Detection}} \cdot \underbrace{p_\theta(\tau_t, \tau_v \mid x, g)}_{\text{Evidence Acquisition}} \cdot \underbrace{p_\theta(c \mid x, g, \tau_t, \tau_v)}_{\text{Recaptioning}} \cdot \underbrace{p_\theta(y \mid c, \tau_v)}_{\text{Synthesis}}$$

  • Variables:
    • $x$: User prompt.
    • $g$: Cognitive gap (internal reasoning trace).
    • $\tau_t$: Textual evidence trace (search results).
    • $\tau_v$: Visual evidence trace (retrieved images).
    • $c$: Evidence-grounded recaption (the refined prompt).
    • $y$: Final generated image.

This factorization dictates the exact training and inference pipeline.
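
The four-phase factorization can be sketched as a chain of function calls. This is a toy control-flow sketch with invented stub names (`detect_gap`, `acquire_evidence`, etc.); in the real system each stage is a sampling step from the same unified transformer.

```python
# Hypothetical stubs standing in for samples from the unified model.
def detect_gap(x):
    """THINK: p(g | x) -- decide whether external evidence is needed."""
    known_concepts = {"a red apple on a table"}  # toy stand-in for parametric knowledge
    return x not in known_concepts               # g: True means a knowledge gap

def acquire_evidence(x, g):
    """RESEARCH: p(tau_t, tau_v | x, g) -- text search first, then image search."""
    if not g:
        return [], []
    tau_t = [f"text results for: {x}"]    # textual evidence trace
    tau_v = [f"retrieved image for: {x}"]  # visual evidence trace
    return tau_t, tau_v

def recaption(x, g, tau_t, tau_v):
    """RECAPTION: p(c | x, g, tau_t, tau_v) -- fuse evidence into a refined prompt."""
    if not g:
        return x
    return f"{x} [grounded in {len(tau_t)} text + {len(tau_v)} visual items]"

def synthesize(c, tau_v):
    """GENERATE: p(y | c, tau_v) -- conditioned only on recaption and visual anchors."""
    return f"image({c})"

def unify_agent(x):
    g = detect_gap(x)
    tau_t, tau_v = acquire_evidence(x, g)
    c = recaption(x, g, tau_t, tau_v)
    return synthesize(c, tau_v)
```

Note how `synthesize` never sees `tau_t`: the final generation is conditionally independent of the raw search history, exactly as the last factor $p_\theta(y \mid c, \tau_v)$ prescribes.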

4.2.3. Training Data Construction

To train the model to execute this pipeline, the authors constructed a dataset $\mathcal{D}_{\text{SFT}} = \{ (x, \tau_t, \tau_v, c) \}$ of 143K trajectories.

  1. Task Source & Prompt Collection: They gathered 456K examples from 12 domains (Celebrity, Animation, Game, Mythology, etc.) using BabelNet and Wikipedia.
  2. Multimodal Research Trace Construction: For each prompt, they simulated the agent's behavior to create ground truth traces.
    • Textual Trace: A teacher agent formulates a query $q_t$ to search for text. It retrieves evidence $E_t$ (e.g., identity disambiguation).
    • Visual Trace: Based on the text evidence, the agent formulates a visual query $q_v$. It retrieves candidate images $\tilde{E}_v$ and filters them using a VLM (Vision-Language Model) such as Gemini Flash to select the top 2 high-quality images $E_v$.
  3. Evidence-Grounded Recaption Annotation: A teacher model generates a detailed recaption $c$ based on the prompt and evidence. Crucially, this recaption is verified by generating a test image and checking whether it matches the ground truth using GPT-4o. If the generation fails, the recaption is rejected, ensuring high-quality supervision.
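
The two quality gates in this curation pipeline (top-k image filtering and generate-then-verify recaption checking) can be sketched as follows. The `score_fn`, `generate_fn`, and `judge_fn` callables are hypothetical stand-ins for the VLM filter and the GPT-4o verifier described above.

```python
def filter_images(candidates, score_fn, k=2):
    """Keep the top-k candidate images by VLM quality/relevance score,
    mirroring the selection of E_v from the raw candidates."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

def verify_recaption(recaption, generate_fn, judge_fn):
    """Accept a recaption only if a test image generated from it passes the
    judge's ground-truth comparison; otherwise it is rejected."""
    test_image = generate_fn(recaption)
    return judge_fn(test_image)

# Toy usage with invented scores and a string-matching "judge":
candidates = ["img_a", "img_b", "img_c"]
scores = {"img_a": 0.2, "img_b": 0.9, "img_c": 0.7}
top2 = filter_images(candidates, lambda im: scores[im])
assert top2 == ["img_b", "img_c"]

accepted = verify_recaption(
    "a detailed grounded caption",
    generate_fn=lambda c: f"image({c})",
    judge_fn=lambda img: "grounded" in img,  # stand-in for GPT-4o matching
)
assert accepted
```

Only trajectories that pass both gates would enter the 143K SFT set, which is what makes the supervision "high-quality" rather than merely large.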

4.2.4. Inference-Time Pipeline

At inference, Unify-Agent follows the THINK, RESEARCH, RECAPTION, GENERATE pipeline.

The following figure (Figure 4 from the original paper) illustrates this agentic pipeline:

Figure 4: Overview of the agentic pipeline of the method. Given an input prompt, the framework first gathers complementary multimodal evidence through textual evidence searching and visual evidence searching. These grounded constraints are then integrated into an evidence-grounded recaptioning module, which produces a detailed caption for the downstream image generator to synthesize the final image.

Step 1: THINK (Prompt Understanding & Gap Detection). The model interprets the prompt $x$ and determines whether it lacks sufficient visual knowledge: $$g \sim p_\theta(g \mid x)$$ If a gap is detected (e.g., "Who is Grigori Perelman?"), the model proceeds to research.

Step 2: RESEARCH (Sequential Evidence Acquisition). The model actively interacts with external tools (search engines): $$p_\theta(\tau_t, \tau_v \mid x, g)$$

  • First, it performs Text Search to obtain $\tau_t$ (textual evidence).
  • Then, using the text context, it performs Image Search to obtain $\tau_v$ (visual evidence).

Step 3: RECAPTION (Grounded Recaptioning). This is a critical architectural innovation. The model consolidates the prompt and evidence into a refined description $c$: $$c \sim p_\theta(c \mid x, g, \tau_t, \tau_v)$$ Unlike standard text-only prompting, Unify-Agent's recaption module uses a dual VAE + ViT design. It ingests the retrieved reference images not merely as abstract concepts, but as fine-grained visual tokens (VAE latents) alongside semantic tokens (ViT features). This allows it to write a recaption that is visually precise.

Step 4: GENERATE (Synthesis). Finally, the image is generated conditioned on the refined recaption and the visual anchors: $$y \sim p_\theta(y \mid c, \tau_v)$$ Note that the generation is conditionally independent of the noisy search history $\tau_t$, relying only on the processed recaption $c$ and the clean visual anchors $\tau_v$.

4.2.5. Unified Fine-Tuning

The entire pipeline is trained with a unified loss function: $$\mathcal{L}_{\text{SFT}} = \mathcal{L}_{\text{text}} + \mathcal{L}_{\text{image}}$$ The model uses special tokens (such as `<think>` and `<recaption>`) to delineate the different stages of the agent trajectory. A hybrid attention-masking strategy (see Appendix C in the paper, illustrated in Figure 6 below) allows the recaption generation to attend fully to the retrieved visual tokens while maintaining causality for text generation.

Figure 6: Hybrid attention-masking strategy for agentic image generation. The matrix shows causal attention for text tokens, while the latent flow-matching tokens attend only to the relevant reference images and recaptions for high-fidelity synthesis. The schematic depicts the interaction among text search, image search, VAE tokens, and ViT tokens, reflecting how the text input is processed to ultimately generate an image.
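
The hybrid mask can be sketched numerically. This is a toy construction under an assumed four-segment sequence layout (prompt text, reference-image tokens, recaption text, flow-matching latent tokens); the paper's exact layout is given in its Appendix C.

```python
import numpy as np

# Assumed toy layout: [prompt | reference-image tokens | recaption | gen latents].
segments = [("prompt", 4), ("ref_img", 6), ("recaption", 5), ("gen_latent", 8)]
starts, n = {}, 0
for name, length in segments:
    starts[name] = (n, n + length)
    n += length

mask = np.zeros((n, n), dtype=bool)  # mask[i, j] = True: token i may attend to j

# Text tokens (prompt, recaption) attend causally to everything before them,
# so the recaption can see the full retrieved visual context.
for name in ("prompt", "recaption"):
    s, e = starts[name]
    for i in range(s, e):
        mask[i, : i + 1] = True

# Reference-image tokens attend bidirectionally within their own block,
# plus to the preceding prompt.
s, e = starts["ref_img"]
mask[s:e, :e] = True

# Flow-matching latent tokens attend to the reference images, the recaption,
# and one another -- but not to the raw prompt or search text.
s, e = starts["gen_latent"]
mask[s:e, starts["ref_img"][0] : starts["ref_img"][1]] = True
mask[s:e, starts["recaption"][0] : starts["recaption"][1]] = True
mask[s:e, s:e] = True
```

Restricting the latent tokens to the reference images and recaption mirrors the conditional independence of the GENERATE step from the noisy search history.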

5. Experimental Setup

5.1. Datasets

5.1.1. FactIP (Proposed Benchmark)

  • Source: Curated by the authors.

  • Scale: 2,500 samples.

  • Characteristics: Covers 12 categories of culturally significant and long-tail concepts (e.g., Mascots, Mythology, Art Toys). It requires models to generate images that are faithful to specific Intellectual Properties (IPs).

  • Purpose: To evaluate identity consistency and factual faithfulness where standard models usually fail.

    The following figure (Figure 7 from the original paper) shows the category breakdown and performance radar chart for FactIP:

The figure shows the FactIP benchmark: the left side lists its categories (including Animation, Games, Festivals, etc.), and the right side is a radar chart comparing models across these categories, clearly illustrating Unify-Agent's advantage in multiple categories.

5.1.2. Existing Benchmarks

  • WiSE: Evaluates world knowledge across dimensions like Culture, Time, Space, Biology, etc.
  • KiTTEN: Evaluates fine-grained generation for entities like Aircraft, Cuisine, Flowers.
  • T2I-FactualBench: Evaluates Single Knowledge Concept Memorization (SKCM), Single Knowledge Concept Instantiation (SKCI), and Multiple Knowledge Concept Composition (MKCC).

5.2. Evaluation Metrics

5.2.1. FactIP Metrics

The paper uses four dimensions evaluated by an MLLM judge (Seed 2.0):

  1. Clarity: Image sharpness, absence of blur/noise.

  2. Content Quality: Faithfulness to the prompt and semantic coherence.

  3. Aesthetics: Visual appeal, lighting, composition.

  4. Relevance (Text Relevance IP): Crucial metric. Measures IP identity consistency—whether the generated image preserves the specific identity traits (face, costume, colors) of the subject.

    The Overall Score is a weighted sum: $$\text{Overall Score} = \alpha_1 \cdot \text{Clarity} + \alpha_2 \cdot \text{Content} + \alpha_3 \cdot \text{Aesthetics} + \alpha_4 \cdot \text{Relevance}$$ where $\alpha_4 = 0.75$, heavily prioritizing factual relevance.

5.2.2. WiSE Metrics (WiScore)

Evaluated by GPT-4o on three dimensions (Consistency, Realism, Aesthetics): $$\text{WiScore} = \alpha_1 \cdot \text{Consistency} + \alpha_2 \cdot \text{Realism} + \alpha_3 \cdot \text{Aesthetics}$$ with weights $\alpha_1 = 0.7$, $\alpha_2 = 0.2$, $\alpha_3 = 0.1$, prioritizing knowledge consistency.
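
Since all three WiScore weights are given, the aggregation is a one-line computation; the per-dimension scores below are hypothetical examples, not results from the paper.

```python
def wiscore(consistency, realism, aesthetics, weights=(0.7, 0.2, 0.1)):
    """WiScore = 0.7*Consistency + 0.2*Realism + 0.1*Aesthetics,
    with each dimension scored in [0, 1] by the GPT-4o judge."""
    a1, a2, a3 = weights
    return a1 * consistency + a2 * realism + a3 * aesthetics

# Hypothetical per-dimension scores: high consistency dominates the result.
score = wiscore(0.8, 0.7, 0.6)
assert abs(score - 0.76) < 1e-9
```

The weighting means a model can have flawless aesthetics yet a low WiScore if its knowledge consistency is poor, which is exactly the behavior this benchmark is designed to penalize.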

5.2.3. KiTTEN Metrics

Evaluated by GPT-4o on two dimensions:

  1. Text Alignment: How well the image matches the text description.
  2. Entity Alignment: How well the specific entity details are preserved. Scores range from 1 to 5.

5.3. Baselines

The paper compares against three groups of models:

  1. Commercial Models: GPT-Image-1, Seedream-5, Nano Banana-2 (strong closed-source models).
  2. Generation-Only Models: FLUX.1-dev, SD-3.5-large, Qwen-Image (standard diffusion/transformer generators).
  3. Unified MLLMs: Janus-Pro, Emu3.5, Hunyuan-Image-3.0, Bagel (the base model for Unify-Agent).

6. Results & Analysis

6.1. Core Results Analysis

The results demonstrate that Unify-Agent significantly outperforms its base model (Bagel) and other open-source unified models, particularly in tasks requiring factual grounding.

  • FactIP: Unify-Agent achieves an Overall score of 73.2, compared to Bagel's 50.9. The most dramatic improvement is in the Relevance metric (Bagel's per-subset scores range from 34.7 to 44.0, versus an average of 72.4 for Unify-Agent), proving the agent's ability to correctly identify and render rare IPs.
  • WiSE: Unify-Agent achieves a WiScore of 0.77, surpassing Bagel (0.52) and approaching commercial models like Nano Banana (0.89). It excels in "Cultural" and "Logic" dimensions.
  • KiTTEN: Unify-Agent scores 4.08 overall, beating the strong Imagen-3 baseline (3.50).
  • T2I-FactualBench: Unify-Agent leads in SKCI (77.4) and MKCC (71.5), showing it can adapt concepts to new scenes and compose multiple rare concepts effectively.

6.2. Data Presentation (Tables)

6.2.1. FactIP Results

The following are the results from Table 1 of the original paper. Note the significant gap in "Relevance" scores between Unify-Agent and the baselines.

| Model | Clarity (Char / Obj / Scene) | Content (Char / Obj / Scene) | Aesthetics (Char / Obj / Scene) | Relevance (Char / Obj / Scene) | Overall |
|---|---|---|---|---|---|
| **Commercial Models** | | | | | |
| GPT-Image-1 | 94.6 / 94.1 / 96.2 | 83.5 / 71.1 / 78.4 | 89.2 / 88.1 / 94.1 | 61.9 / 60.2 / 76.4 | 69.3 |
| Nano Banana-2 | 96.6 / 88.2 / 90.4 | 86.3 / 96.0 / 86.6 | 92.2 / 92.2 / 95.3 | 85.5 / 93.9 / 89.5 | 88.5 |
| **Unified MLLM** | | | | | |
| Bagel | 91.4 / 91.8 / 90.6 | 68.5 / 59.1 / 65.0 | 81.6 / 83.3 / 87.2 | 39.9 / 44.0 / 34.7 | 50.9 |
| Unify-Agent (Ours) | 92.4 / 91.5 / 90.7 | 75.8 / 76.1 / 73.6 | 83.3 / 86.0 / 86.4 | 67.3 / 71.7 / 78.0 | 73.2 |

6.2.2. WiSE Results

The following are the results from Table 2 of the original paper:

| Model | Cultural | Time | Space | Biology | Physics | Chemistry | Overall |
|---|---|---|---|---|---|---|---|
| **Commercial Models** | | | | | | | |
| Nano Banana | 0.89 | 0.87 | 0.95 | 0.89 | 0.89 | 0.79 | 0.89 |
| **Unified MLLM** | | | | | | | |
| BAGEL | 0.44 | 0.55 | 0.68 | 0.44 | 0.60 | 0.39 | 0.52 |
| BAGEL+CoT | 0.76 | 0.69 | 0.75 | 0.65 | 0.75 | 0.58 | 0.70 |
| Unify-Agent (Ours) | 0.82 | 0.75 | 0.74 | 0.72 | 0.73 | 0.70 | 0.77 |

6.2.3. Ablation Study

The following are the results from Table 5 of the original paper. This table analyzes the contribution of different components of the Unify-Agent pipeline.

| Variant | Clarity | Content | Aesthetics | Relevance | Overall |
|---|---|---|---|---|---|
| Baseline (Vanilla Bagel) | 91.3 | 64.2 | 84.0 | 44.9 | 50.9 |
| **Pipeline Ablations** | | | | | |
| w/o Text-Search | 90.7 | 70.9 | 84.3 | 64.6 | 65.4 |
| w/o Image-Search | 92.1 | 73.1 | 85.0 | 50.8 | 56.2 |
| w/o Recaption | 83.0 | 69.0 | 74.5 | 60.2 | 62.9 |
| **Constraint Ablations** | | | | | |
| Recaption w/o Identity-preserving | 91.5 | 72.6 | 83.4 | 65.9 | 67.7 |
| Recaption w/o Scene-compositional | 90.7 | 70.8 | 80.7 | 68.6 | 68.2 |
| **Recaption Architecture Ablations** | | | | | |
| Recaption w/o VAE | 90.9 | 74.3 | 84.5 | 70.8 | 71.2 |
| Recaption w/o ViT | 88.6 | 68.4 | 81.1 | 58.7 | 61.4 |
| Unify-Agent (Full) | 91.2 | 75.2 | 85.2 | 72.4 | 73.2 |

6.3. Ablation Studies / Parameter Analysis

6.3.1. Importance of Visual Evidence

The ablation study (Table 5) reveals that removing Image Search causes a significant drop in "Relevance" (from 72.4 to 50.8) compared to removing Text Search (drop to 64.6). This confirms the intuition from Figure 2 (in the paper) that visual anchors are more critical for identity fidelity than text alone. The model needs to "see" the rare entity to generate it correctly.

6.3.2. Importance of Recaptioning

Removing the Recaption step entirely drops performance significantly (Clarity 83.0, Relevance 60.2). This validates the paper's claim that raw evidence injection is suboptimal. The model needs to process and rewrite the evidence into a "generation-oriented specification" to bridge the gap between search results and the image generator.

6.3.3. Generation Helps Understanding

A fascinating finding is in the "Recaption Architecture Ablations". Removing the VAE (the visual generation encoder) from the recaption module drops performance (Relevance 70.8 vs 72.4). This suggests that having "generative visual features" (latent tokens) helps the model understand the retrieved reference images better than just semantic features (ViT) alone. This supports the "Unified" approach, where generation capabilities aid understanding.

6.3.4. Qualitative Analysis

The paper provides several qualitative examples (Figures 8-12) showing the model's ability to handle complex prompts.

  • Case 1 (Copper Combustion): The model correctly identifies that burning copper produces green flames, a specific chemical fact.

  • Case 2 (DUDOO Art Toy): It successfully retrieves and generates a specific niche art toy character.

  • Case 3 (William Butler Yeats): It generates a historically plausible scene for the poet, preserving his likeness.

    The following figure (Figure 12 from the original paper) shows a successful generation of Nobel laureate Bruce Beutler, demonstrating the model's ability to handle specific, less famous celebrities accurately.

    Figure 12 | Image generated for Bruce Beutler reviewing data on a monitor. The left side shows the ground-truth image and the right side the generated images; the figure depicts a man performing data analysis with high-tech laboratory equipment, illustrating the fidelity of the generated result.

7. Conclusion & Reflections

7.1. Conclusion Summary

Unify-Agent successfully addresses the limitation of "frozen knowledge" in unified multimodal models by introducing an agentic pipeline. By integrating THINK, RESEARCH, RECAPTION, and GENERATE into a single unified model, it achieves state-of-the-art performance on long-tail and knowledge-intensive benchmarks (FactIP, WiSE, KiTTEN). The work demonstrates that tightly coupling reasoning, searching, and generation is essential for reliable open-world image synthesis.

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  1. Context Window: Current open-source unified models have limited context windows, restricting the number of images (e.g., reference images) that can be processed at once.

  2. Inference Cost: The agentic pipeline (searching + recaptioning) adds latency compared to one-step generation.

  3. Complexity: The model relies on external search tools; if the search fails or returns irrelevant data, generation quality may suffer.

    Future work involves extending the framework to support longer-horizon planning, repeated search, and adaptive reasoning in more complex settings.

7.3. Personal Insights & Critique

  • Validation of the Agentic Paradigm: This paper provides strong empirical evidence that the "agentic" approach is not just a buzzword but a necessary architectural shift for handling the long tail of the real world. The performance gap between Bagel and Unify-Agent on FactIP is substantial.
  • The "Recaption" Bottleneck: The paper highlights that raw retrieval isn't enough; the synthesis of that information into a prompt is key. This suggests future research could focus on making the "recaptioner" even more sophisticated, perhaps using Chain-of-Thought reasoning explicitly during the recaption phase.
  • Architectural Synergy: The finding that "Generation helps Understanding" (via VAE tokens) is a profound insight. It implies that training a model to generate pixels gives it a better internal representation of visual objects than training it to just classify or describe them. This argues against "decoupled" agent designs where the planner is a pure LLM.
  • Data Dependency: The model's success relies heavily on the 143K curated trajectories. While this is a contribution, it raises the question of how scalable this data curation process is. Can we automate the creation of such high-quality trajectories without expensive human-in-the-loop verification (using GPT-4o)?