Paper status: completed

Qwen3-Omni Technical Report

Published: 09/22/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Qwen3-Omni is a single multimodal model achieving state-of-the-art performance across text, image, audio, and video, particularly excelling in audio tasks. It uses a mixture-of-experts architecture, supports multilingual audio understanding and generation, and reduces first-packet latency in streaming speech synthesis with a lightweight causal ConvNet.

Abstract

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The paper is titled "Qwen3-Omni Technical Report." It introduces a single multimodal model designed to achieve state-of-the-art performance across text, image, audio, and video without degradation relative to single-modal counterparts.

1.2. Authors

The paper lists a comprehensive team of authors from the Qwen Team.

Core Contributors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin

Contributors: An Yang, Anfeng Li, Bei Chen, Beichen Zhang, Bin Lin, Binyuan Hui, Bohan Wang, Buxiao Wu, Chenfei Wu, Cheng Chen, Chen Qiang, Chenhan Yuan, Chenxu Lv, Chujie Zheng, Daren Chen, Fei Huang, Gezhengyang Zhu, Guangdong Zhou, Hang Zhang, Hongjian Tu, Humen Zhong, Jialong Zuo, Jianhong Tu, Jianwei Zhang, Jiayi Leng, Jing Zhou, Kai Dang, Kexin Yang, Kun Yan, Lei Xie, Rui Hu, Ruiyang Xu, Shen Li, Shixuan Lu, Xi Deng, Xiaotong Chen, Xiao Li, Xian Yang, Xinyao Niu, Xudong Guo, Xuechun Wang, Xutong Jin, Xuancheng Ren, Yang Fan, Yang Liu, Yang Su, Yantao Liu, Yi Wu, Yichang Zhang, Yilei Chen, Yiming Dong, Yinger Zhang, Yizhong Cao, Yuchong Sun, Yuezhang Wang, Yuhao Wang, Yuqiong Liu, Yuanzhi Zhu, Yuxiang Chen, Yuxuan Cai, Yuxuan Liu, Zeyu Cui, Zheng Li, Zhenghao Xing, Zhenru Zhang, Zihan Qiu, ZiYue Jiang, Zhaohai Li, Zhi Li, Zhibo Yang, Zhihai Wang, Zhipeng Zhou, and others.

The authors are primarily affiliated with the Qwen Team, indicating an internal research and development effort from a major technology entity, likely Alibaba Group, given the QwenLM GitHub repository and ModelScope links. Their backgrounds appear to span large language models, multimodal AI, speech processing, and vision systems.

1.3. Journal/Conference

The paper is published as an arXiv preprint with the identifier arXiv:2509.17765. This indicates it is a pre-publication version, not yet formally peer-reviewed or published in a journal or conference proceedings. arXiv is a reputable platform for sharing new research rapidly in fields like AI and machine learning.

1.4. Publication Year

The paper was published on September 22, 2025 (2025-09-22T13:26:24 UTC) as an arXiv preprint.

1.5. Abstract

The paper introduces Qwen3-Omni, a single multimodal model that achieves state-of-the-art performance across text, image, audio, and video modalities without degradation compared to single-modal models of similar size. It matches Qwen series performance in text and vision, and significantly excels in audio tasks, achieving open-source SOTA on 32 out of 36 audio and audio-visual benchmarks, and overall SOTA on 22, surpassing strong closed-source models like Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe.

Qwen3-Omni utilizes a Thinker-Talker MoE (Mixture-of-Experts) architecture to unify perception and generation across modalities, producing fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19, and speech generation in 10. To reduce first-packet latency in streaming synthesis, the Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. This allows replacing computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling immediate streaming. The model boasts a theoretical end-to-end first-packet latency of 234 ms in cold-start settings.

For enhanced multimodal reasoning, a Thinking model is introduced to explicitly reason over inputs from any modality. Additionally, to address the lack of general-purpose audio captioning models, Qwen3-Omni-30B-A3B was fine-tuned to create Qwen3-Omni-30B-A3B-Captioner, which generates detailed, low-hallucination captions for arbitrary audio inputs. Several variants (Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner) are publicly released under the Apache 2.0 license.

Official Source Link: https://arxiv.org/abs/2509.17765
PDF Link: https://arxiv.org/pdf/2509.17765v1.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the modality trade-off commonly observed in contemporary LLM-centric multimodal models. Prior research often shows that gains in one modality's performance (e.g., text understanding) come at the cost of degradation in others (e.g., image or audio understanding). This limitation prevents multimodal models from achieving parity across all modalities and fully leveraging cross-modal synergy.

The problem is important because human perception and intelligence fundamentally rely on the coordinated use of multiple modalities. Building AI systems that mirror this integrated multimodal learning is crucial for creating more robust, versatile, and human-like AI. Existing challenges include:

  • Performance Degradation: Multimodal models often fail to match the specialized performance of unimodal counterparts.

  • Latency in Real-time Interaction: Especially for speech generation, first-packet latency is a critical barrier for a smooth user experience.

  • Lack of Unified Architectures: Many multimodal systems are cascaded pipelines of separate unimodal components, leading to higher complexity, costs, and limited cross-modal reasoning.

  • Deficiency in Specific Multimodal Tasks: The research community lacks general-purpose models for tasks like audio captioning.

    The paper's entry point or innovative idea is to explore integrated multimodal training within the LLM-based paradigm to demonstrate that joint multimodal training can achieve parity across all modalities without degradation, while simultaneously enhancing cross-modal capabilities. This is achieved through a Thinker-Talker Mixture-of-Experts (MoE) architecture and novel designs for efficient speech generation and multimodal reasoning.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Achieving Multimodal Performance Parity and Non-Degradation: Qwen3-Omni is presented as the first single multimodal model that maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to same-sized single-modal counterparts within the Qwen series. This resolves the common modality trade-off issue.

  • State-of-the-Art Audio Performance: The model particularly excels on audio tasks, achieving open-source SOTA on 32 out of 36 audio and audio-visual benchmarks and overall SOTA on 22, outperforming strong closed-source models.

  • Thinker-Talker MoE Architecture: Introduction of an upgraded Thinker-Talker MoE architecture that unifies perception and generation across modalities, enabling fluent text and natural real-time speech.

    • Upgraded Thinker and Talker to MoE designs: Enhances concurrency and fast inference.
    • Decoupled Thinker-Talker conditioning: Talker conditions only on audio/visual features and conversational history, allowing distinct system prompts for Thinker (response style) and Talker (audio style).
  • Novel Audio Encoder (AuT): Development of AuT (Audio Transformer), trained from scratch on 20 million hours of supervised audio, yielding stronger general-purpose audio representations and employing block-wise window attention for real-time prefill caching.

  • Advanced Speech Generation (Talker) with Ultra-Low Latency:

    • Adoption of a multi-codebook representation for increased capacity and faithful modeling of diverse voices and acoustic cues.
    • Shift from single-track to multi-track codec modeling, autoregressively predicting multiple codebook layers via MTP modules.
    • Replacement of computationally intensive block-wise diffusion with a lightweight causal ConvNet (Code2Wav) for waveform synthesis.
    • Reduced input/output audio code rates to 12.5 Hz, enabling single-frame, immediate speech synthesis.
    • Achieves a theoretical end-to-end first-packet latency of 234 ms in cold-start settings, enabling low-latency speech interaction.
  • Enhanced Multimodal Reasoning with a Thinking Model: Introduction of a dedicated Thinking model that explicitly reasons over inputs from any modality, including audio-video and audio-only scenarios.

  • General-Purpose Audio Captioning Model: Fine-tuning Qwen3-Omni-30B-A3B to create Qwen3-Omni-30B-A3B-Captioner, addressing a gap in the research community by producing detailed, low-hallucination captions for arbitrary audio inputs.

  • Expanded Language Coverage: Supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. It can process audio recordings up to 40 minutes for ASR and spoken-language understanding.

  • Public Release: Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.

    These findings solve the problems of multimodal performance trade-offs, high latency in real-time speech generation, lack of unified multimodal reasoning, and gaps in specific multimodal tasks like audio captioning. They demonstrate that integrated multimodal training can lead to superior performance and efficiency compared to cascaded unimodal systems.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the Qwen3-Omni paper, a reader should be familiar with several foundational concepts in deep learning and multimodal AI:

  • Large Language Models (LLMs): LLMs (e.g., GPT-3, Qwen) are deep learning models trained on massive text datasets, capable of understanding, generating, and reasoning with human language. They typically employ the Transformer architecture and are known for their emergent abilities in various natural language processing (NLP) tasks.
  • Multimodal AI: This field deals with AI systems that can process and understand information from multiple input modalities (e.g., text, images, audio, video) and generate outputs in one or more of these modalities.
  • Transformer Architecture: The Transformer is a neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017). It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence, allowing for parallel processing and capturing long-range dependencies. It consists of an encoder and a decoder stack.
    • Self-Attention: A mechanism that allows the model to weigh the importance of different words in an input sequence when encoding a particular word. For a query Q, keys K, and values V, the attention output is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • Q (Query), K (Key), V (Value) are matrices derived from the input embeddings.
      • QK^T computes the similarity between queries and keys.
      • d_k is the dimension of the key vectors; dividing by its square root keeps the dot products from saturating the softmax.
      • softmax normalizes the scores to produce attention weights.
      • The output is a weighted sum of the values. (A minimal code sketch follows this list.)
  • Mixture-of-Experts (MoE) Architecture: MoE is a technique to improve model capacity and efficiency. Instead of using a single large model, it uses multiple "expert" sub-networks. A gating network learns to select or combine the outputs of these experts based on the input. This allows for conditional computation, where only a subset of the model's parameters is activated for each input, leading to faster inference for a given model size.
  • Audio Encoders: These models convert raw audio waveforms into meaningful numerical representations (embeddings or features) that can be processed by other neural networks. Examples include mel-spectrograms (a visual representation of audio frequency content over time) and learned representations like Whisper encoders or AuT in this paper.
  • Speech Codecs: Algorithms used to encode and decode audio signals efficiently. They compress audio data for storage or transmission and then reconstruct it. Discrete speech codecs transform continuous audio into discrete tokens, which can be predicted autoregressively by generative models.
    • Multi-codebook representation: Instead of a single stream of discrete tokens, multiple "codebooks" are used, where each codebook captures different aspects or resolutions of the audio signal. This allows for a richer and more detailed representation of speech.
  • Autoregressive Models: These models predict future elements in a sequence based on previously generated elements. In speech generation, an autoregressive Talker would predict the next speech codec token based on the preceding ones, ensuring temporal coherence.
  • Diffusion Models: A class of generative models that learn to reverse a diffusion process, gradually transforming noise into data (e.g., images, audio). Block-wise diffusion would apply this process to blocks of data. They are known for high quality but can be computationally intensive and slow for real-time generation.
  • Convolutional Neural Networks (ConvNets): Neural networks that use convolutional layers to automatically and adaptively learn spatial hierarchies of features from input data. Causal ConvNets are designed so that the output at any time step only depends on past inputs, making them suitable for sequential data generation and real-time streaming.
  • Rotary Position Embedding (RoPE): A method for encoding positional information in Transformer models. Unlike traditional absolute or learned positional embeddings, RoPE integrates relative positional information directly into the self-attention mechanism by rotating query and key vectors.
    • Multimodal Rotary Position Embedding (M-RoPE): An extension of RoPE to multimodal contexts, allowing the model to understand positions within different modalities and their relationships.
    • Time-aligned Multimodal Rotary Position Embedding (TM-RoPE): Further extends M-RoPE by explicitly incorporating absolute temporal information, crucial for dynamically sampled video and audio streams.
  • Supervised Fine-Tuning (SFT): A common practice where a pre-trained LLM is further trained on a smaller, task-specific dataset with labeled examples to adapt it to a particular downstream task or instruction-following capabilities.
  • Direct Preference Optimization (DPO): A reinforcement learning (RL) technique for aligning LLMs with human preferences. Instead of explicit reward modeling, DPO directly optimizes the LLM policy using preference pairs (e.g., "response A is better than response B").
  • Group Sequence Policy Optimization (GSPO): A reinforcement learning from human feedback (RLHF) algorithm used to comprehensively enhance model capabilities and stability.
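To make the self-attention formula above concrete, here is a minimal, self-contained sketch of scaled dot-product attention in PyTorch. It is illustrative only and is not the Qwen3-Omni implementation (which additionally uses multi-head projections, rotary position embeddings, and flash attention).

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, the core operation of a Transformer layer."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # similarity of queries and keys
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                # attention weights
    return weights @ v                                 # weighted sum of the values

# Toy self-attention: Q, K, V all come from the same sequence (batch=2, len=5, dim=8).
x = torch.randn(2, 5, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([2, 5, 8])
```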

3.2. Previous Works

The paper builds upon and references several key prior works, particularly within the Qwen series and broader multimodal AI:

  • Qwen Models (Yang et al., 2024; 2025a): Qwen3-Omni is part of the Qwen series, which are powerful LLMs. The LLM component of Qwen3-Omni is initialized with parameters from Qwen3. The Qwen series provides strong text-only capabilities that Qwen3-Omni aims to match or surpass in a multimodal setting.
  • Qwen-VL (Bai et al., 2023b): This was an early large vision-language model within the Qwen family, demonstrating versatile abilities in vision and language. Qwen3-Omni leverages the vision encoder from Qwen3-VL (a likely successor to Qwen-VL). The vision encoder from Qwen3-VL is initialized from SigLIP2-So400m (Tschannen et al., 2025).
  • Qwen-Audio (Chu et al., 2023; 2024): These works explore advancing universal audio understanding via unified large-scale audio-language models. Qwen3-Omni significantly builds upon Qwen-Audio's focus on audio perception and generation.
  • Qwen2.5-Omni (Xu et al., 2025): Qwen3-Omni explicitly builds on the Thinker-Talker architecture introduced in Qwen2.5-Omni and introduces five key upgrades. Qwen2.5-Omni segmented audiovisual representations into fixed 2-second chunks, a limitation addressed by Qwen3-Omni. Qwen2.5-Omni also used block-wise diffusion for waveform generation, which is replaced by a causal ConvNet in Qwen3-Omni. In Qwen2.5-Omni, the Talker consumed the Thinker's high-level text representations, which is decoupled in Qwen3-Omni.
  • GPT-4o (OpenAI, 2024), Gemini-2.5-Pro (Gemini Team, 2024), Seed-ASR, GPT-4o-Transcribe: These are strong closed-source multimodal or unimodal systems that Qwen3-Omni aims to outperform or match, especially on audio benchmarks. GPT-4o is known for its impressive multimodal capabilities, including voice interaction.
  • Whisper (Radford et al., 2022): A widely recognized ASR model. Qwen3-Omni replaces the Whisper audio encoder with its custom AuT encoder, indicating a focus on developing specialized, higher-performing audio components.
  • M-RoPE (Bai et al., 2023b): Qwen3-Omni extends M-RoPE to TM-RoPE, incorporating absolute temporal information for better handling of dynamic multimodal data.

Core Formula in Prior Work (Transformer's Self-Attention): As mentioned in the Foundational Concepts, the Transformer architecture, central to LLMs and MoE models, relies on the self-attention mechanism. Its core formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where Q, K, and V are the query, key, and value matrices, respectively, and d_k is the dimension of the key vectors. This formula computes a weighted sum of the value vectors, with weights determined by the compatibility of each query with the corresponding keys. This mechanism is fundamental for processing sequences in Transformers.

3.3. Technological Evolution

The field of large language models and multimodal AI has evolved rapidly:

  1. Unimodal LLMs: Started with text-only models (e.g., GPT-3, Qwen1) demonstrating powerful language understanding and generation (Brown et al., 2020; OpenAI, 2023).

  2. Vision-Language Models (VLMs): Integration of visual modality, initially through separate vision encoders connected to LLMs (e.g., Flamingo, BLIP, Qwen-VL), allowing for image captioning, visual Q&A.

  3. Audio-Language Models (ALMs): Similar integration for audio, with audio encoders feeding into LLMs (e.g., Whisper for ASR, Qwen-Audio for broader audio understanding).

  4. Early Multimodal LLMs: Combining multiple modalities (text, vision, audio) but often with modality trade-offs or reliance on cascaded pipelines (e.g., GPT-4o, Gemini series, Qwen2.5-Omni). These models often used separate encoders for each modality and then projected their outputs into the LLM's embedding space.

  5. Towards Fully Integrated Multimodal Models: The current frontier aims for true multimodal integration where different modalities are deeply intertwined, trained jointly, and mutually enhance each other without degradation. This often involves unified architectures and specialized multimodal positional embeddings.

    Qwen3-Omni fits into this timeline as a significant step in the fifth stage. It aims to overcome the modality trade-offs seen in earlier multimodal LLMs by demonstrating that deep, integrated multimodal training can achieve performance parity with specialized unimodal models while also enabling novel cross-modal reasoning and real-time interaction capabilities.

3.4. Differentiation Analysis

Compared to previous work, Qwen3-Omni introduces several core innovations:

  • Non-Degradation in Multimodal Training: The most significant differentiation is the claim of achieving state-of-the-art performance across text, image, audio, and video without any degradation relative to same-sized single-modal counterparts. This contrasts with many existing multimodal LLMs that often show performance dips in certain modalities when trained jointly. The paper attributes this to mixing unimodal and cross-modal data from the early stage of text pretraining and careful architectural design.

  • Upgraded Thinker-Talker MoE Architecture:

    • MoE for both Thinker and Talker: Enhances scalability, concurrency, and inference speed, which is a significant upgrade from Qwen2.5-Omni.
    • Decoupled Thinker and Talker: Allows for independent control of text response style and audio style, and more flexible multimodal conditioning for speech generation, directly consuming multimodal features rather than just Thinker's text output. This is a departure from Qwen2.5-Omni.
  • Novel AuT Audio Encoder: Instead of relying on existing encoders like Whisper, Qwen3-Omni introduces a custom AuT encoder trained on a massive 20 million hours of supervised audio. This is designed for stronger, more general-purpose audio representations and incorporates block-wise window attention for real-time prefill caching.

  • Ultra-Low Latency Speech Generation:

    • Multi-codebook representation and MTP module: Allows for richer speech modeling and autoregressive prediction of multiple codebook layers frame-by-frame.
    • Causal ConvNet for Code2Wav: Replaces the computationally intensive block-wise diffusion from Qwen2.5-Omni with a lightweight causal ConvNet, drastically reducing inference latency and enabling immediate, single-frame speech synthesis. This is critical for achieving the stated 234 ms first-packet latency.
    • Reduced code rates: 12.5 Hz token rate for efficient streaming.
  • Enhanced Positional Encoding (TM-RoPE): TM-RoPE explicitly incorporates absolute temporal information and dynamically aligns audiovisual representations based on temporal IDs, moving beyond the fixed 2-second chunking of Qwen2.5-Omni and supporting streaming inputs of arbitrary duration.

  • Dedicated Thinking Model and Audio Captioner: The introduction of a specific Thinking model for explicit reasoning across modalities and a fine-tuned Audio Captioner addresses functional gaps not commonly covered by general multimodal models.

    In essence, Qwen3-Omni differentiates itself by pushing the boundaries of integrated multimodal training to achieve true performance parity, significant efficiency gains in real-time speech interaction, and enhanced reasoning capabilities through specialized architectural components and training strategies.

4. Methodology

4.1. Principles

The core idea behind Qwen3-Omni is to create a single, unified multimodal model that can process and generate information across text, image, audio, and video modalities without any performance degradation relative to specialized unimodal models. This is achieved by leveraging a Thinker-Talker Mixture-of-Experts (MoE) architecture and focusing on deep multimodal integration during training, from early pretraining stages. The theoretical basis is that human-like intelligence arises from the coordinated processing of multiple senses, and AI models should aim to mimic this cross-modal synergy. By training all modalities jointly and designing components specifically for efficiency and real-time interaction, the model can overcome the limitations of cascaded pipelines and modality-specific trade-offs seen in previous approaches. Key principles include:

  • Unified Perception and Generation: A single model handles both understanding (perception) and response generation across all supported modalities.
  • MoE for Scalability and Efficiency: Utilizing a Mixture-of-Experts architecture for both Thinker (text generation/reasoning) and Talker (speech generation) components to enable high concurrency and fast inference.
  • Modality Parity through Joint Training: Integrating unimodal and cross-modal data from the early stages of pretraining to ensure that all modalities mutually enhance each other without causing performance degradation in any single modality.
  • Low-Latency Real-time Interaction: Designing the speech generation pipeline (Talker) with specific optimizations, such as multi-codebook autoregressive prediction and a lightweight causal ConvNet for waveform synthesis, to achieve ultra-low first-packet latency.
  • Explicit Multimodal Reasoning: Introducing a dedicated Thinking model to perform explicit reasoning over complex multimodal inputs.
  • Advanced Positional Encoding: Using Time-aligned Multimodal Rotary Position Embedding (TM-RoPE) to effectively integrate temporal and spatial information across diverse modalities, including dynamically sampled video.

4.2. Core Methodology In-depth (Layer by Layer)

Qwen3-Omni employs a Thinker-Talker architecture, which builds upon previous versions like Qwen2.5-Omni but introduces significant upgrades. The overall system is depicted in Figure 2.

The following figure (Figure 2 from the original paper) illustrates the overall architecture of Qwen3-Omni:


Figure 2: The overview of Qwen3-Omni. Qwen3-Omni adopts the Thinker-Talker architecture. Thinker is tasked with text generation while Talker focuses on generating streaming speech tokens by receiving high-level representations directly from Thinker. To achieve ultra-low-latency streaming, Talker autoregressively predicts a multi-codebook sequence. At each decoding step, an MTP module outputs the residual codebooks for the current frame, after which the Code2Wav renderer incrementally synthesizes the corresponding waveform, enabling frame-by-frame streaming generation.

4.2.1. Overall Architecture (Thinker-Talker MoE)

The Qwen3-Omni architecture is fundamentally composed of two main MoE components: the Thinker and the Talker.

  • Thinker: This component is primarily responsible for text generation and multimodal reasoning. It receives input from all modalities (text, audio, image, video) through various encoders and processes them to generate textual responses. The Thinker itself is an MoE Transformer.

  • Talker: This component focuses on streaming speech generation. Unlike Qwen2.5-Omni, the Talker in Qwen3-Omni does not solely consume the Thinker's high-level text representations. Instead, it conditions on audio and visual multimodal features directly from the Thinker and shares access to the full conversational history. This design allows for audiovisual-coordinated speech generation (e.g., preserving paralinguistic cues like emotion) and enables separate system prompts for the Thinker (response style) and Talker (audio style). The Talker is also an MoE Transformer.

    The Thinker and Talker operate asynchronously. When the Thinker completes processing a chunk of input, its high-level representations are immediately used to prefill the Talker's current chunk, while the Thinker simultaneously processes its next chunk. This chunked prefilling mechanism is crucial for reducing Time-To-First-Token (TTFT) for both components.

4.2.2. Audio Transformer (AuT)

The AuT (Audio Transformer) encoder is a critical component for audio perception.

The following figure (Figure 3 from the original paper) provides an overview of the AuT architecture:


Figure 3: The overview of AuT. AuT is an attention-encoder-decoder based auto-regressive model, which is trained from scratch on 20 million hours of supervised audio. Qwen3-Omni employs the AuT encoder as the audio encoder to obtain general-purpose audio representations at a token rate of 12.5 Hz.

  • Architecture: AuT is an attention-encoder-decoder based auto-regressive model.
    • The encoder consists of 32 self-attention layers and 3 downsampling convolutional layers.
    • The decoder consists of 8 decoder cross-attention layers and self-attention layers.
  • Training: It is trained from scratch on 20 million hours of supervised audio data.
    • During training, filter bank features (specifically, mel-spectrograms) of the audio are downsampled 8 times using Conv2D blocks before the attention layers. This reduces the token rate to 12.5 Hz.
    • The training data mix includes 80% Chinese and English pseudo-labeled ASR data, 10% ASR data from other languages, and 10% audio understanding data to learn stronger, general-purpose audio representations.
  • Efficiency: AuT utilizes flash attention with dynamic attention window sizes (1 to 8 seconds) to balance real-time prefill caching efficiency with performance for offline audio tasks.
  • Role in Qwen3-Omni: AuT serves as the primary audio encoder, providing general-purpose audio representations at a token rate of 12.5 Hz. It has approximately 0.6 billion parameters.

4.2.3. Perception

The Thinker processes various inputs from different modalities:

  • Text Inputs: Uses Qwen's tokenizer (Yang et al., 2025a), which employs byte-level byte-pair encoding (BPE) with a vocabulary of 151,643 regular tokens. This converts raw text into discrete tokens for the Thinker.
  • Audio Inputs (and Audio extracted from Video):
    1. Raw waveform is resampled to 16 kHz.
    2. Converted into a 128-channel mel-spectrogram using a 25 ms window and 10 ms hop.
    3. Processed by the AuT encoder to obtain audio representations. Each frame of this representation corresponds to approximately an 80 ms segment of the original audio signal. A code sketch of this front-end follows the list below.
  • Image and Video (without audio) Inputs:
    1. Employs the vision encoder from Qwen3-VL, which is initialized from SigLIP2-So400m (Tschannen et al., 2025) and has approximately 543 million parameters.
    2. This encoder is trained on a mixture of image and video data.
    3. For video inputs, frames are sampled at a dynamic frame rate to preserve video information while aligning with the audio sampling rate.
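The audio front-end described above (resampling to 16 kHz, a 128-channel mel-spectrogram with a 25 ms window and 10 ms hop, then 8x temporal reduction to 12.5 Hz, i.e. roughly 80 ms per frame) can be sketched as follows. This is an illustration using torchaudio; the actual AuT applies learned Conv2D downsampling blocks rather than the naive frame subsampling used here.

```python
import torch
import torchaudio

def audio_frontend(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Hypothetical front-end mirroring the description above (not the released code)."""
    if sample_rate != 16_000:
        waveform = torchaudio.transforms.Resample(sample_rate, 16_000)(waveform)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000,
        n_fft=400,          # 25 ms window at 16 kHz
        hop_length=160,     # 10 ms hop -> 100 mel frames per second
        n_mels=128,
    )(waveform)             # shape: (channels, 128, T)
    # Stand-in for AuT's Conv2D downsampling blocks: keep every 8th frame,
    # reducing the token rate from 100 Hz to 12.5 Hz (one frame per ~80 ms).
    return mel[..., ::8]

wave = torch.randn(1, 16_000 * 4)       # 4 s of synthetic audio
tokens = audio_frontend(wave, 16_000)
print(tokens.shape[-1])                 # ~50 frames for 4 s (12.5 Hz)
```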

4.2.4. Positional Embedding (Time-aligned Multimodal Rotary Position Embedding - TM-RoPE)

To integrate temporal and spatial information effectively across modalities, Qwen3-Omni uses TM-RoPE, an extension of M-RoPE (Bai et al., 2023b).

  • Factorization: TM-RoPE factorizes the conventional Rotary Position Embedding (RoPE) into three distinct dimensions: temporal, height, and width.
  • Angle Redistribution: To address the limitation of M-RoPE (where initial 16 rotary angles for temporal dependencies captured fine-grained local variations but impeded long-range extrapolation), TM-RoPE modifies the allocation:
    • 24 rotary angles for the temporal dimension.
    • 20 rotary angles for the height dimension.
    • 20 rotary angles for the width dimension. This redistribution aims for a more balanced representation of both local semantics and long-range dependencies.
  • Modality-Specific Application:
    • Text Inputs: All three components (temporal, height, width) share identical position identifiers, making TM-RoPE functionally equivalent to a one-dimensional RoPE (Su et al., 2024).
    • Audio Inputs: Use shared position IDs but are augmented with absolute temporal encodings. Each temporal ID corresponds to a duration of 80 ms.
    • Image Data: A constant temporal ID is assigned to all visual tokens. Height and width IDs are determined by their distinct row and column positions.
    • Multimodal Audiovisual Streams:
      • Audio component: Encoded with a temporal ID for every 80 ms.
      • Video component: Treated as a sequence of frames with monotonically increasing temporal IDs that are dynamically adjusted based on actual timestamps to maintain a consistent temporal resolution of 80 ms per ID. Height and width IDs are assigned as for still images.
  • Contiguous Position Numbering: To prevent positional conflicts between modalities, position numbering is made contiguous, with each subsequent modality starting from one plus the maximum position ID of the preceding modality, as illustrated in the sketch after this list.
  • Arbitrary Duration Support: Unlike Qwen2.5-Omni which segmented audiovisual representations into fixed 2-second chunks, Qwen3-Omni directly aligns these representations using their temporal IDs (explicitly anchored to absolute time), allowing it to support streaming inputs of arbitrary duration.
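As a rough illustration of the TM-RoPE position-ID layout described above, the sketch below assigns (temporal, height, width) IDs to a text segment, an image, and an audio clip, with contiguous numbering across modalities. This is a schematic reconstruction from the description, not the released implementation, and it omits the rotary-angle split (24/20/20) and the absolute temporal encodings.

```python
def text_ids(num_tokens, start):
    # Text: temporal, height, and width IDs are identical per token.
    return [(start + i, start + i, start + i) for i in range(num_tokens)]

def image_ids(rows, cols, start):
    # Image: constant temporal ID; height/width follow the patch grid.
    return [(start, start + r, start + c) for r in range(rows) for c in range(cols)]

def audio_ids(duration_ms, start):
    # Audio: one temporal ID per 80 ms of audio (schematic).
    n = duration_ms // 80
    return [(start + i, start + i, start + i) for i in range(n)]

def next_start(ids):
    # Contiguous numbering: the next modality starts after the maximum ID so far.
    return max(max(triple) for triple in ids) + 1

seq = text_ids(4, start=0)
seq += image_ids(2, 3, start=next_start(seq))   # 2x3 patch grid
seq += audio_ids(400, start=next_start(seq))    # 400 ms of audio -> 5 frames
print(seq)
```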

4.2.5. Speech Generation

The Talker module is responsible for speech synthesis in multi-turn dialogues.

  • Context Conditioning: The Talker is conditioned on a rich context from the Thinker component, including historical textual tokens, multimodal representations, and the current turn's streamed text. This long-context information is vital for adapting acoustic attributes (prosody, loudness, emotion) to the ongoing discourse.
  • RVQ Token Operation: The Talker operates directly on RVQ tokens (Residual Vector Quantization tokens), which are discrete representations of audio.
  • Hierarchical Prediction Scheme:
    1. Backbone: Ingests aggregated codebook features of the current frame.
    2. Linear Head: Uses a linear head to predict the zeroth codebook.
    3. Multi-Token Prediction (MTP) module: Generates all residual codebooks (the subsequent codebooks after the zeroth one). This strategy allows learning a complete representation of acoustic details.
  • Waveform Reconstruction (Code2Wav): The final stage of audio synthesis, Code2Wav, is simplified to a lightweight causal ConvNet. This replaces more complex DiT-based vocoders (Diffusion Transformers used in Qwen2.5-Omni), significantly reducing inference latency and computational cost (FLOPs) while maintaining high audio fidelity.
    • The Code2Wav causal ConvNet allows for streaming from the first codec frame, enabling frame-by-frame streaming generation.

4.2.6. Designs for Streaming and Concurrency

Optimizing first-packet latency and concurrency is crucial for real-time interaction.

  • Chunked Prefilling and MoE Architecture:
    • Chunked Prefilling: Similar to Qwen2.5-Omni, the audio and vision encoders output data in chunks along the temporal dimension. During real-time interaction, Thinker and Talker perform asynchronous prefilling. The Thinker's output representations from a completed chunk immediately prefill the Talker's current chunk, while the Thinker moves to its next chunk. This reduces Time-To-First-Token (TTFT).
    • MoE Architecture: Both Thinker and Talker utilize MoE designs. MoE models activate only a subset of experts per input, which significantly decreases IO consumption from KV cache (Key-Value cache) during processing of long sequences, enhancing throughput (tokens per second).
  • Streaming Multi-Codebook Codec Generation:
    • A left-context-only multi-codebook generation mechanism is employed.
    • Once the Talker generates the first token, the MTP module predicts the remaining tokens for the current frame.
    • These tokens are then decoded into a waveform by a streaming multi-codebook codec decoder that only attends to the left context.
    • Unlike Qwen2.5-Omni that needed sufficient block-context before synthesis, Qwen3-Omni can output the waveform immediately after the Talker generates each token, significantly reducing first-packet latency.
  • Lightweight MTP module and ConvNet:
    • MTP Module: An ultra-lightweight fixed-step autoregressive dense transformer. Its low memory bandwidth requirements and fixed-step autoregressive inference enable efficient batch processing and low latency in high-concurrency scenarios.

    • Codec Decoder (Code2Wav ConvNet): Achieves high throughput with low latency due to its convolutional architecture which benefits from extensive hardware acceleration and efficient batched inference.

      The theoretical end-to-end first-packet latency is the sum of the Thinker-Talker tail-packet preprocessing latency, the Thinker time-to-first-token, the Talker time-to-first-token, the MTP module time cost per token, and the codec decoder time cost per code. For Qwen3-Omni-30B-A3B at concurrency 1, this is 72/160 ms (preprocessing) + 88/160 ms (Thinker) + 57/210 ms (Talker) + 14 ms (MTP) + 3 ms (codec) = 234 ms for audio input / 547 ms for video input, where the first value in each pair is for audio and the second for video. The Generation Real Time Factor (RTF) remains below 1 across varying concurrency levels, ensuring continuous streaming.
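A trivial sketch of the latency accounting above, using the reported per-component values for Qwen3-Omni-30B-A3B at concurrency 1 (audio-input and video-input cases):

```python
# First-packet latency components in milliseconds, as reported in the paper.
components_audio = {
    "tail-packet preprocessing": 72,
    "Thinker time-to-first-token": 88,
    "Talker time-to-first-token": 57,
    "MTP module per token": 14,
    "codec decoder per code": 3,
}
# Video input only changes the first three components.
components_video = {**components_audio,
                    "tail-packet preprocessing": 160,
                    "Thinker time-to-first-token": 160,
                    "Talker time-to-first-token": 210}

print(sum(components_audio.values()))  # 234 ms (audio)
print(sum(components_video.values()))  # 547 ms (video)
```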

4.2.7. Pretraining

The pretraining of Qwen3-Omni is structured into three distinct stages:

  1. Encoder Alignment Stage (S1):

    • Initialization: The LLM component is initialized with parameters from Qwen3 (Yang et al., 2025a). The vision encoder is adopted from Qwen3-VL, and the audio encoder is initialized with AuT.
    • Training Strategy: The two encoders (vision and audio) are trained separately on the fixed LLM. Training initially focuses on their respective adapters before training the encoders themselves.
    • Rationale: This approach avoids a common issue where joint training of encoders and adapters with a frozen LLM might cause encoders to compensate for LLM limitations, leading to degraded perception capabilities.
  2. General Stage (S2):

    • Unfreezing: All parameters (including the LLM and encoders) are unfrozen.
    • Dataset: Utilizes a large-scale dataset of approximately 2 trillion tokens with a diverse distribution across modalities:
      • Text: 0.57 trillion tokens
      • Audio: 0.77 trillion tokens
      • Image: 0.82 trillion tokens
      • Video: 0.05 trillion tokens
      • Video-Audio: 0.05 trillion tokens
    • Objective: To enhance the model's understanding and interaction capabilities across auditory, visual, textual, and audiovisual information through a wider range of multimodal data and tasks.
  3. Long Context Stage (S3):

    • Increased Sequence Length: The maximum token length is increased from 8,192 to 32,768.

    • Data Proportion Adjustment: The proportion of long audio and long video in the training data is increased.

    • Objective: To improve the model's ability to understand complex long-sequence data.

      During pretraining, a wider range of natural language prompts is used compared to Qwen2.5-Omni (which used a single prompt per task), enhancing generalization and instruction-following capabilities.

4.2.8. Post-training (Thinker)

The Thinker undergoes a three-stage post-training process to acquire instruction-following capabilities:

  1. Supervised Fine-Tuning (SFT):

    • Purpose: To bridge the gap between pretrained representations and downstream task requirements through targeted instruction optimization.
    • Data: Designed in ChatML (OpenAI, 2022) format, including pure text-based dialogue, visual modality conversation, audio modality conversation, and mixed-modality conversation data.
    • Strategy: Lightweight SFT that deliberately diverges from the pretraining data schema but maintains architectural consistency, enabling efficient knowledge transfer.
  2. Strong-to-Weak Distillation:

    • Pipeline: Adopts the pipeline described in Qwen3 (Yang et al., 2025a).
    • Off-policy Distillation:
      • Process: Combines outputs generated by teacher models to provide response distillation.
      • Goal: Helps lightweight student models (like Qwen3-Omni) acquire fundamental reasoning abilities.
    • On-policy Distillation:
      • Process: The student model generates responses based on sampled prompts. These on-policy sequences are then used for fine-tuning.
      • Objective: Align the student's predicted logits with those of a teacher model (e.g., Qwen3-32B or Qwen3-235B-A22B) by minimizing the KL divergence $ D_{KL}(P \| Q) = \sum_{i} P(i) \log\left(\frac{P(i)}{Q(i)}\right) $ where P is the probability distribution of the teacher model and Q is the probability distribution of the student model. The KL divergence measures how one probability distribution diverges from a second, reference distribution. (A minimal loss sketch follows this list.)
  3. Group Sequence Policy Optimization (GSPO):

    • Purpose: To comprehensively enhance the model's capabilities and stability across various modalities (text, image, video, audio).
    • Feedback: Uses two types of rewards:
      • Rule-based Reward: For verifiable multimodal tasks (e.g., mathematics, coding, instruction following). Reward signal derived from predefined rules, assessing output correctness and preventing reward hacking.
      • Model-based Reward: For multimodal tasks lacking objective, predefined evaluation metrics. Uses an LLM-as-a-judge protocol.
        • Evaluator: Qwen3 for general tasks, Qwen2.5-VL for visually-grounded tasks.
        • Robustness: The LLM evaluator is furnished with ground-truth or reference answers where applicable.
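For illustration, a generic logit-distillation loss matching the KL objective above might look like the following PyTorch sketch. The temperatures, masking, and sequence weighting used in the actual Qwen3 distillation pipeline are not specified in the report and are omitted here.

```python
import torch
import torch.nn.functional as F

def distillation_kl_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged over the batch.

    A generic sketch of logit-matching distillation, not the paper's exact recipe.
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)           # P (teacher)
    student_log_probs = F.log_softmax(student_logits, dim=-1)   # log Q (student)
    # KL(P || Q) = sum_i P(i) * (log P(i) - log Q(i))
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy shapes: (batch * sequence positions, vocabulary size)
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
loss = distillation_kl_loss(student, teacher)
print(loss.item())
```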

4.2.9. Post-training (Talker)

The Talker undergoes a four-stage training process for speech generation:

  1. Initial Training:

    • Data: Leverages hundreds of millions of speech data with multimodal context.
    • Objective: Establish a monotonic mapping from multimodal representation to speech.
  2. Continual Pretraining (CPT) and Long-context Training:

    • CPT: Performed with high-quality data to alleviate hallucinations caused by noisy data from the first stage and significantly improve generated speech quality.
    • Long-context Training: Concurrently conducted to enhance the Talker's ability to process extended and complex inputs and generate contextually appropriate speech responses.
  3. Direct Preference Optimization (DPO):

    • Purpose: Improve generalization of multilingual speech generation and system stability.
    • Process: Constructs preference pairs from diverse multilingual speech samples and optimizes the model using DPO (Rafailov et al., 2023).
      • DPO directly optimizes the policy π_θ to maximize the likelihood of preferred responses over dispreferred ones, given a reference policy π_r: $ \mathcal{L}_{DPO}(\pi_\theta; (\mathbf{x}, \mathbf{y}_w, \mathbf{y}_l)) = -\log \sigma\left(\beta \left[ \log \frac{\pi_\theta(\mathbf{y}_w|\mathbf{x})}{\pi_r(\mathbf{y}_w|\mathbf{x})} - \log \frac{\pi_\theta(\mathbf{y}_l|\mathbf{x})}{\pi_r(\mathbf{y}_l|\mathbf{x})} \right]\right) $ Where:
        • x is the prompt.
        • y_w is the preferred (winning) response.
        • y_l is the dispreferred (losing) response.
        • π_θ is the current policy (model).
        • π_r is the reference policy (usually the SFT model).
        • β is a temperature hyperparameter.
        • σ is the sigmoid function.
        • Minimizing this loss pushes the policy to prefer the winning response over the losing one, relative to the reference policy. (A minimal code sketch follows this list.)
  4. Speaker Fine-Tuning:

    • Purpose: Enables the Talker to adopt specific voices while refining the naturalness, expressiveness, and controllability of its speech response.
    • Process: Applied to the aforementioned base model.
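A minimal sketch of the DPO objective above, computed per preference pair. It assumes the summed log-probabilities of the winning and losing responses under the policy and the frozen reference model have already been computed; the β value here is illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one preference pair, following the formula above."""
    # Margin: how much more the policy prefers the winner over the loser,
    # measured relative to the reference policy.
    margin = (logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref)
    return -F.logsigmoid(beta * margin)

# Toy example with scalar sequence log-probabilities.
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-13.0), torch.tensor(-14.5))
print(loss.item())
```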

4.2.10. Captioner

The Qwen3-Omni-30B-A3B-Captioner is developed to address the lack of general-purpose audio captioning models.

  • Development: Created by fine-tuning the base Qwen3-Omni-30B-A3B model.
  • Data: Fine-tuned on a large-scale dataset of detailed audio descriptions.
  • Output: Produces detailed, low-hallucination captions for arbitrary audio inputs.

5. Experimental Setup

The experiments evaluate Qwen3-Omni's ability to comprehend various multimodal inputs and generate textual or speech responses. The evaluation includes Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and two Flash variants (Qwen3-Omni-Flash-Instruct, Qwen3-Omni-Flash-Thinking) designed for improved computational efficiency and performance, notably supporting various dialects.

5.1. Datasets

The evaluation covers a wide range of benchmarks across different modalities:

  • Text-Text:

    • General Tasks: MMLURedux (Gema et al., 2024), GPQA (Rein et al., 2023).
    • Reasoning: AIME25 (AIME, 2025), ZebraLogic (Lin et al., 2025), LiveBench 20241125.
    • Coding: MultiPL-E (Cassano et al., 2023).
    • Alignment: IFEval (Zhou et al., 2023), Creative Writing V3 (Paech, 2024), WritingBench (Wu et al., 2025b), Arena-Hard v2.
    • Agent: BFCL-v3 (Yan et al., 2024).
    • Multilingual: MultiF (He et al., 2024), PolyMath (Wang et al., 2025c), MGSM, INCLUDE.
  • Audio-Text:

    • Basic Audio Tasks: Automatic Speech Recognition (ASR), Speech-to-Text (S2TT).
      • English & Chinese ASR: Wenetspeech test-net | test-meeting, Librispeech clean | other, CV15-en, CV15-zh, Fleurs-en, Fleurs-zh.
      • Multilingual ASR: Fleurs-avg (19 languages: Arabic, German, English, Spanish, French, Indonesian, Italian, Japanese, Korean, Malay, Dutch, Portuguese, Russian, Thai, Turkish, Urdu, Vietnamese, Cantonese, Mandarin).
      • Lyric ASR: MIR-1K (vocal-only), Opencpop-test.
      • S2TT: Fleurs-en2xx, Fleurs-xx2en, Fleurs-zh2xx, Fleurs-xx2zh (where xx denotes other languages).
    • Advanced Audio Tasks:
      • Voice Chatting: VoiceBench (Chen et al., 2024b) which includes sub-benchmarks like AlpacaEval, CommonEval, WildVoice, SDD-QA, MMSU, OpenBookQA, BH, IFEval, AdvBench.
      • Audio Reasoning: MMAU (Sakshi et al., 2024), MMSU (Wang et al., 2025a).
      • Music Understanding: RUL-MuchoMusic (Zang et al., 2025), GTZAN (Tzanetakis & Cook, 2002), four subsets of MTG-Jamendo (Bogdanov et al., 2019), MagnaTagATune (Law et al., 2009). The evaluation set composition for GTZAN, MTG-Jamendo and MagnaTagATune follows MARBLE (Yuan et al., 2023).
  • Vision-Text:

    • General Visual Question Answering: MMStar (Chen et al., 2024a), HallusionBench (Guan et al., 2024), MM-MT-Bench (Agrawal et al., 2024), RealWorldQA (avg).
    • Mathematical & STEM Reasoning: MathVista (Lu et al., 2024), MathVision (Wang et al., 2024a), MMMU (Yue et al., 2023), MMMU-Pro (Yue et al., 2024).
    • Document Understanding: AI2D (Kembhavi et al., 2016), ChartQA (Masry et al., 2022), TextVQA (val), DocVQA (test), InfoVQA (test).
    • Counting: CountBench (Paiss et al., 2023).
    • Video Understanding: Video-MME (Fu et al., 2024), LVBench (Wang et al., 2024b), MLVU (Zhou et al., 2025a), MVBench.
    • OCR-related Tasks: AI2D, TextVQA (val), DocVQA (test), InfoVQA (test), ChartQA (test avg), OCRBench.
  • AudioVisual Video-Text:

    • General Understanding: WorldSense (Hong et al., 2025).
    • Audiovisual Reasoning: DailyOmni (Zhou et al., 2025b), VideoHolmes (Cheng et al., 2025).
  • Speech Generation:

    • Zero-Shot Speech Generation: SEED (Anastassiou et al., 2024) (test-zh, test-en).
    • Multilingual Speech Generation: MiniMax multilingual test set (Zhang et al., 2025).
    • Cross-Lingual Speech Generation: CV3-Eval (Du et al., 2025).

Example of Data Sample (from qualitative results section):

The paper provides an example of expressive speech for Qwen3-Omni-30B-A3B-Captioner: "The audio clip opens in a studio setting, marked by a faint, persistent electronic hiss and a subtle low-frequency hum... The male speaker, whose voice is delivered in a clear, energetic, and highly theatrical manner, begins with an assertive 'Right!', delivered with a sharp, rising intonation..." This illustrates how the model processes complex audio scenes and generates detailed descriptions.

These datasets are chosen to validate the model's performance across a broad spectrum of tasks, from basic perception (ASR, image recognition) to complex reasoning (multimodal Q&A, audio reasoning) and generation (text, speech). They represent a comprehensive suite of benchmarks in the multimodal AI field.

5.2. Evaluation Metrics

The paper uses various evaluation metrics tailored to each task:

  • Word Error Rate (WER): Primarily used for Automatic Speech Recognition (ASR) tasks. It measures the number of errors (substitutions, insertions, deletions) in a transcribed sequence compared to a reference sequence. A lower WER indicates better performance.

    • Conceptual Definition: WER quantifies the accuracy of a speech recognition system by comparing the machine-generated transcription to a human-created reference transcription. It is a direct measure of how many words were incorrectly identified, added, or missed.
    • Mathematical Formula: $ WER = \frac{S + D + I}{N} $ Where:
      • S is the number of substitutions (incorrectly recognized words).
      • D is the number of deletions (words in the reference that were missed by the system).
      • I is the number of insertions (words added by the system that were not in the reference).
      • N is the total number of words in the reference transcription. (A minimal reference implementation follows this list.)
  • Bilingual Evaluation Understudy (BLEU): Used for Speech-to-Text Translation (S2TT) tasks, specifically for evaluating the quality of machine-translated text against human-translated references. A higher BLEU score indicates better translation quality.

    • Conceptual Definition: BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. It measures the correspondence between a machine's output and a human's output (reference translations) by counting the number of matching n-grams.
    • Mathematical Formula: $ BLEU = BP \cdot \exp \left( \sum_{n=1}^{N} w_n \log p_n \right) $ Where:
      • BP is the brevity penalty, which penalizes short machine translations.
      • N is the maximum n-gram length (typically 4).
      • w_n is the weight for each n-gram order (typically 1/N).
      • p_n is the modified n-gram precision, calculated as: $ p_n = \frac{\sum_{\text{sentence} \in \text{candidate}} \sum_{n\text{-gram} \in \text{sentence}} \text{Count}_{\text{clip}}(n\text{-gram})}{\sum_{\text{sentence} \in \text{candidate}} \sum_{n\text{-gram} \in \text{sentence}} \text{Count}(n\text{-gram})} $ Where Count_clip ensures that an n-gram is counted no more times than it appears in any single reference translation.
  • Micro F1 Score (Micro F1): Used for multi-label classification tasks like music understanding (e.g., genre, mood, instrument recognition). It's a measure of a test's accuracy, considering both precision and recall. A higher Micro F1 indicates better performance.

    • Conceptual Definition: Micro F1 is a global version of the F1 score that aggregates the contributions of all classes to compute the average metric. It's particularly useful in multi-label classification where the sum of true positives, false positives, and false negatives are calculated across all labels.
    • Mathematical Formula: $ \text{Micro-}F1 = \frac{2 \times \text{MicroPrecision} \times \text{MicroRecall}}{\text{MicroPrecision} + \text{MicroRecall}} $ Where:
      • $ \text{MicroPrecision} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} TP_c + \sum_{c=1}^{C} FP_c} $
      • $ \text{MicroRecall} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} TP_c + \sum_{c=1}^{C} FN_c} $
      • TP_c (True Positives), FP_c (False Positives), and FN_c (False Negatives) are counted per class c.
      • C is the total number of classes.
  • Accuracy (Acc.): Used for classification tasks (e.g., GTZAN music genre identification) and general task evaluation (e.g., MMLU, GPQA). It measures the proportion of correctly predicted instances among the total instances. A higher Accuracy indicates better performance.

    • Conceptual Definition: Accuracy is the ratio of correctly predicted observations to the total observations. It's a straightforward measure of overall correctness in a classification problem.
    • Mathematical Formula: $ Accuracy = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  • Speaker Similarity (SIM): Used for speech generation tasks, particularly in zero-shot, multilingual, and cross-lingual voice cloning. This metric quantifies how well the generated speech's voice characteristics (e.g., timbre, pitch, intonation) match those of a reference speaker. The paper does not provide an explicit formula but implies it's a quantitative measure where higher values indicate better similarity. Typically, SIM is measured using cosine similarity between speaker embeddings extracted from a pre-trained speaker verification model.

    • Conceptual Definition: Speaker Similarity assesses the degree to which a synthesized voice retains the unique identity and characteristics of a target speaker from a given reference audio. This is crucial for applications like voice cloning.
  • Generation Real Time Factor (RTF): Used to evaluate the efficiency of speech generation systems in streaming scenarios. RTF is the ratio of the time taken to generate an audio segment to the actual duration of that audio segment. An RTF less than 1 means the system can generate audio faster than real-time.

    • Conceptual Definition: RTF indicates how fast a speech synthesis system can generate audio relative to the length of the audio it produces. An RTF of 0.5 means the system takes half the duration of the audio to generate it, implying it can stream speech in real-time.
    • Mathematical Formula (as implied by the paper): $ RTF = \frac{\text{Time taken to generate 80ms audio}}{\text{80ms}} $ Where the "Time taken to generate 80ms audio" includes the combined time for Thinker and Talker to generate one token, plus the processing time for the MTP Module and Codec Decoder per token.
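For reference, here is the minimal WER implementation mentioned above, computed as a word-level edit distance over substitutions, deletions, and insertions. It is a generic illustration, not the evaluation toolkit used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via word-level edit distance (standard definition)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.33
```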

5.3. Baselines

Qwen3-Omni is compared against a wide array of strong baselines, including both open-source and closed-source models, as well as specialist and generalist models:

  • Text-Text:

    • GPT-4o-0327 (OpenAI, 2024): A strong closed-source multimodal LLM.
    • Gemini-2.5-Flash Thinking (Gemini Team, 2024): A closed-source multimodal LLM with reasoning capabilities.
    • Qwen3-235B-A22B Non-Thinking & Thinking (Yang et al., 2025a): Larger, text-only Qwen3 variants.
    • Qwen3-30B-A3B-Instruct-2507 & Thinking-2507: Same-sized text-only Qwen3 variants, serving as direct unimodal counterparts.
    • InternVL-3.5-241B-A28B: A large vision-language reasoning model.
  • Audio-Text (ASR & S2TT):

    • Seed-ASR: A specialist ASR model.
    • Voxtral-Mini, Voxtral-Small: Specialist ASR models.
    • GPT-4o-Transcribe (OpenAI, 2024): ASR component of GPT-4o.
    • Gemini-2.5-Pro (Gemini Team, 2024): A strong closed-source multimodal model.
    • Qwen2.5-Omni (Xu et al., 2025): The predecessor multimodal model from the same team.
  • Audio-Text (Voice Interaction & Audio Reasoning):

    • GPT-4o-Audio (OpenAI, 2024): Audio capabilities of GPT-4o.
    • Gemini-2.5-Flash, Gemini-2.5-Pro (Gemini Team, 2024): Closed-source multimodal models.
    • Qwen2.5-Omni (Xu et al., 2025): Predecessor model.
  • Audio-Text (Music Understanding):

    • Best Specialist Models: Includes Audio Flamingo 3 (Goel et al., 2025), CLaMP 3 (Wu et al., 2025a), MuQ-MuLan (Zhu et al., 2025), MuQ (Zhu et al., 2025). These are highly specialized models for music understanding tasks.
    • GPT-4o-Audio (OpenAI, 2024), Gemini-2.5-Pro (Gemini Team, 2024), Qwen2.5-Omni (Xu et al., 2025): Generalist multimodal models.
  • Vision-Text:

    • GPT-4o (OpenAI, 2024), Gemini-2.0-Flash (Gemini Team, 2024): Closed-source vision-language models.
    • Qwen2.5-VL-72B (Bai et al., 2025): A larger vision-language model from the Qwen series.
    • Gemini-2.5-Flash-Thinking, InternVL-3.5-241B-A28B: Strong reasoning-focused vision-language models.
  • AudioVisual Video Text:

    • Previous Open-source SoTA: Specific reference to (Yang et al., 2025b) and (Tang et al., 2025) for WorldSense, DailyOmni, and VideoHolmes.
    • Gemini-2.5-Flash (Gemini Team, 2024): Closed-source multimodal model.
    • Qwen2.5-Omni (Xu et al., 2025): Predecessor model.
  • Speech Generation:

    • Seed-TTS (ICL), Seed-TTS (RL) (Anastassiou et al., 2024): State-of-the-art zero-shot TTS systems.

    • MaskGCT (Wang et al., 2024c), E2 TTS (Eskimez et al., 2024), F5-TTS (Chen et al., 2024c), Spark TTS (Wang et al., 2025b), CosyVoice 2 (Du et al., 2024), CosyVoice 3 (Du et al., 2025), Qwen2.5-Omni-7B (Xu et al., 2025): Various TTS models.

    • MiniMax-Speech (Zhang et al., 2025), ElevenLabs Multilingual v2: Multilingual speech generation models.

    • CosyVoice2, CosyVoice3: Cross-lingual speech generation models.

      These baselines are chosen to provide a comprehensive comparison against both general-purpose powerful LLMs and specialized models across different modalities, ensuring a robust evaluation of Qwen3-Omni's state-of-the-art claims and its non-degradation property.

6. Results & Analysis

The experimental evaluation comprehensively assesses Qwen3-Omni's performance across various multimodal tasks, including Text-Text, Audio-Text, Vision-Text, AudioVisual Video Text, and Speech Generation. The results consistently demonstrate its strong capabilities and validate the core claims of non-degradation and state-of-the-art performance, particularly in audio.

6.1. Core Results Analysis

6.1.1. Performance of Text-Text

The following are the results from Table 4 of the original paper:

| Benchmark | GPT-4o-0327 | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
| --- | --- | --- | --- | --- | --- |
| MMLU-Redux | 91.3 | 89.2 | 89.3 | 86.6 | 86.8 |
| GPQA | 66.9 | 62.9 | 70.4 | 69.6 | 69.7 |
| AIME25 | 26.7 | 24.7 | 61.3 | 65.0 | 65.9 |
| ZebraLogic | 52.6 | 37.7 | 90.0 | 76.0 | 76.1 |
| MultiPL-E | 82.7 | 79.3 | 83.8 | 81.4 | 81.5 |
| IFEval | 83.9 | 83.2 | 84.7 | 81.0 | 81.7 |
| Creative Writing v3 | 84.9 | 80.4 | 86.0 | 80.6 | 81.8 |
| WritingBench | 75.5 | 77.0 | 85.5 | 82.6 | 83.0 |
| BFCL-v3 | 66.5 | 68.0 | 65.1 | 64.4 | 65.0 |
| MultiIF | 70.4 | 70.2 | 67.9 | 64.0 | 64.7 |
| PolyMATH | 25.5 | 27.0 | 43.1 | 37.9 | 39.3 |

The following are the results from Table 5 of the original paper:

| Benchmark | Gemini-2.5-Flash-Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B-Thinking-2507 | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
| --- | --- | --- | --- | --- | --- |
| MMLU-Redux | 92.1 | 92.7 | 91.4 | 88.8 | 89.7 |
| GPQA | 82.8 | 71.1 | 73.4 | 73.1 | 73.1 |
| AIME25 | 72.0 | 81.5 | 85.0 | 73.7 | 74.0 |
| LiveBench 20241125 | 74.3 | 77.1 | 76.8 | 71.8 | 70.3 |
| MultiPL-E | 84.5 | 79.9 | 81.3 | 80.6 | 81.0 |
| IFEval | 89.8 | 83.4 | 88.9 | 85.1 | 85.2 |
| Arena-Hard v2 | 56.7 | 61.5 | 56.0 | 55.1 | 57.8 |
| Creative Writing v3 | 85.0 | 84.6 | 84.4 | 82.5 | 83.6 |
| WritingBench | 83.9 | 80.3 | 85.0 | 85.5 | 85.9 |
| BFCL-v3 | 68.6 | 70.8 | 72.4 | 63.2 | 64.5 |
| MultiIF | 74.4 | 71.9 | 76.4 | 72.9 | 73.2 |
| PolyMATH | 49.8 | 54.7 | 52.6 | 47.1 | 48.7 |
  • Qwen3-Omni-30B-A3B-Instruct demonstrates strong performance, surpassing the larger open-source Qwen3-235B-A22B Non-Thinking and even the closed-source GPT-4o-0327 on several benchmarks like GPQA, AIME25, ZebraLogic, WritingBench, and PolyMath. This is a significant finding, indicating that the multimodal nature does not degrade text performance and can even enhance it.
  • The Instruct variant also performs comparably to its text-only counterpart, Qwen3-30B-A3B-Instruct-2507, reinforcing the non-degradation claim for text capabilities.
  • Qwen3-Omni-30B-A3B-Thinking exhibits performance comparable to Gemini-2.5-Flash-Thinking and Qwen3-235B-A22B Thinking, especially on reasoning tasks like AIME25.
  • The Flash variants (Qwen3-Omni-Flash-Instruct and Qwen3-Omni-Flash-Thinking) generally maintain performance close to their non-Flash counterparts, suggesting improved efficiency without significant accuracy loss.

6.1.2. Performance of Audio-Text

The following are the results from Table 6 of the original paper:

| Benchmark | Seed-ASR | Voxtral-Mini | Voxtral-Small | GPT-4o-Transcribe | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EN & ZH ASR (WER) |  |  |  |  |  |  |  |  |
| WenetSpeech net / meeting | 4.66 / 5.69 | 24.30 / 31.53 | 20.33 / 26.08 | 15.30 / 32.27 | 14.43 / 13.47 | 5.91 / 17.65 | 4.69 / 5.89 | 4.62 / 5.75 |
| LibriSpeech clean / other | 1.58 / 2.84 | 1.88 / 4.12 | 1.56 / 3.30 | 1.39 / 3.75 | 2.89 / 3.56 | 1.74 / 3.45 | 1.22 / 2.48 | 1.27 / 2.44 |
| CV15-en | – | 9.47 | 7.79 | 10.01 | 9.89 | 7.61 | 6.05 | 5.94 |
| CV15-zh | – | 24.67 | 19.30 | 9.84 | 8.00 | 5.13 | 4.31 | 4.28 |
| Fleurs-en | 3.40 | 3.96 | 3.77 | 3.32 | 2.94 | 3.77 | 2.72 | 2.74 |
| Fleurs-zh | 2.69 | 12.22 | 7.98 | 2.44 | 2.71 | 2.54 | 2.20 | 2.19 |
| Multilingual ASR (WER) |  |  |  |  |  |  |  |  |
| Fleurs-avg (19 lang) | – | 15.67 | 8.09 | 4.48 | 5.55 | 14.04 | 5.33 | 5.31 |
| Lyric ASR (WER) |  |  |  |  |  |  |  |  |
| MIR-1K (vocal-only) | 6.45 | 23.33 | 18.73 | 11.87 | 9.85 | 8.15 | 5.90 | 5.85 |
| Opencpop-test | 2.98 | 31.01 | 16.06 | 7.93 | 6.49 | 2.84 | 1.54 | 2.02 |
| S2TT (BLEU) |  |  |  |  |  |  |  |  |
| Fleurs-en2xx | – | 30.35 | 37.85 | – | 39.25 | 29.22 | 37.50 | 36.22 |
| Fleurs-xx2en | – | 27.54 | 32.81 | – | 35.41 | 28.61 | 31.08 | 30.71 |
| Fleurs-zh2xx | – | 17.03 | 22.05 | – | 26.63 | 17.97 | 25.17 | 25.10 |
| Fleurs-xx2zh | – | 28.75 | 34.82 | – | 37.50 | 27.68 | 33.13 | 31.19 |
  • ASR (Word Error Rate - WER): Qwen3-Omni-Instruct achieves state-of-the-art performance across both English and Chinese ASR benchmarks (Librispeech, Wenetspeech, Fleurs, CommonVoice). For instance, on Librispeech clean, it scores 1.22, outperforming Seed-ASR (1.58), GPT-4o-Transcribe (1.39), and Gemini-2.5-Pro (2.89).

  • Multilingual ASR: It shows competitive performance on Fleurs-avg (19 lang) with 5.33 WER, outperforming Gemini-2.5-Pro (5.55) and significantly Qwen2.5-Omni (14.04).

  • Lyric ASR: Achieves SOTA on MIR-1K (vocal-only) (5.90 WER) and Opencpop-test (1.54 WER), again outperforming strong baselines.

  • S2TT (BLEU): Qwen3-Omni-Instruct delivers better or comparable BLEU scores for Speech-to-Text Translation, such as Fleurs-en2xx (37.50) where it is comparable to Gemini-2.5-Pro (39.25) and outperforms Voxtral-Small (37.85).

    The following are the results from Table 7 of the original paper:

| Benchmark | GPT-4o-Audio | Gemini-2.5-Flash | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Instruct | Qwen3-Omni-Flash-Thinking |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VoiceBench |  |  |  |  |  |  |  |  |
| AlpacaEval | 95.6 | 96.1 | 94.3 | 89.9 | 94.8 | 96.4 | 95.4 | 96.8 |
| CommonEval | 89.8 | 88.3 | 88.4 | 76.7 | 90.8 | 90.5 | 91.0 | 90.9 |
| WildVoice | 91.6 | 92.1 | 93.4 | 77.7 | 91.6 | 90.5 | 92.3 | 90.9 |
| SDD-QA | 75.5 | 84.5 | 90.1 | 56.4 | 76.9 | 78.1 | 76.8 | 78.5 |
| MMSU | 80.3 | 66.1 | 71.1 | 61.7 | 68.1 | 83.0 | 68.4 | 84.3 |
| OpenBookQA | 89.2 | 56.9 | 92.3 | 80.9 | 89.7 | 94.3 | 91.4 | 95.0 |
| BBH | 84.1 | 83.9 | 92.6 | 66.7 | 80.4 | 88.9 | 80.6 | 89.6 |
| IFEval | 76.0 | 83.8 | 85.7 | 53.5 | 77.8 | 80.6 | 75.2 | 80.8 |
| AdvBench | 98.7 | 98.9 | 98.1 | 99.2 | 99.3 | 97.2 | 99.4 | 98.9 |
| Overall | 86.8 | 83.4 | 89.6 | 73.6 | 85.5 | 88.8 | 85.6 | 89.5 |
| Audio Reasoning |  |  |  |  |  |  |  |  |
| MMAU-v05.15.25 | 62.5 | 71.8 | 77.4 | 65.5 | 77.5 | 75.4 | 77.6 | 76.5 |
| MMSU | 56.4 | 70.2 | 77.7 | 62.6 | 69.0 | 70.2 | 69.1 | 71.3 |
  • VoiceBench: Qwen3-Omni-Flash-Thinking achieves an impressive average score of 89.5 (88.8 for Qwen3-Omni-30B-A3B-Thinking), almost matching Gemini-2.5-Pro (89.6) and surpassing all other audio language models. This highlights the models' strong capabilities in speech interaction.

  • Audio Reasoning: Qwen3-Omni (both Instruct and Thinking variants) demonstrates impressive performance, outperforming Gemini-2.5-Pro and Gemini-2.5-Flash on MMAU and Gemini-2.5-Flash and GPT-4o-Audio on MMSU. For instance, Qwen3-Omni-Flash-Thinking scores 76.5 on MMAU-v05.15.25 and 71.3 on MMSU, outperforming most baselines. This confirms its powerful general audio understanding and reasoning abilities.

    The following are the results from Table 8 of the original paper:

| Benchmark | Best Specialist Model | GPT-4o-Audio | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
| --- | --- | --- | --- | --- | --- | --- |
| RUL-MuchoMusic | 47.6 (Audio Flamingo 3; Goel et al., 2025) | 36.1 | 49.4 | 47.3 | 52.0 | 52.1 |
| GTZAN Acc. | 87.9 (CLaMP 3; Wu et al., 2025a) | 76.5 | 81.0 | 81.7 | 93.0 | 93.1 |
| MTG Genre Micro F1 | 35.8 (MuQ-MuLan; Zhu et al., 2025) | 25.3 | 32.6 | 32.5 | 39.0 | 39.5 |
| MTG Mood/Theme Micro F1 | 10.9 (MuQ-MuLan; Zhu et al., 2025) | 11.3 | 14.1 | 8.9 | 21.0 | 21.7 |
| MTG Instrument Micro F1 | 39.8 (MuQ-MuLan; Zhu et al., 2025) | 34.2 | 33.0 | 22.6 | 40.5 | 40.7 |
| MTG Top50 Micro F1 | 33.2 (MuQ-MuLan; Zhu et al., 2025) | 25.0 | 26.1 | 21.6 | 36.7 | 36.9 |
| MagnaTagATune Micro F1 | 41.6 (MuQ; Zhu et al., 2025) | 29.2 | 28.1 | 30.1 | 44.3 | 46.8 |
  • Music Understanding: Qwen3-Omni-Instruct achieves SOTA on RUL-MuchoMusic (52.0), surpassing specialist models like Audio Flamingo 3 (47.6) and strong generalists like Gemini-2.5-Pro (49.4).

  • It also significantly outperforms other audio language models and even self-supervised music specialist models on GTZAN (93.0 Acc.), MTG-Jamendo (various Micro F1 scores), and MagnaTagATune (44.3 Micro F1). This demonstrates superior capabilities across diverse music understanding tasks.

6.1.3. Performance of Vision-Text

The following are the results from Table 9 of the original paper:

| Dataset | GPT-4o | Gemini-2.0-Flash | Qwen2.5-VL-72B | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
| --- | --- | --- | --- | --- | --- |
| General Visual Question Answering |  |  |  |  |  |
| MMStar | 64.7 | 71.4 | 70.8 | 68.5 | 69.3 |
| HallusionBench | 55.0 | 56.3 | 55.2 | 59.7 | 60.4 |
| MM-MT-Bench | 7.7 | 6.7 | 7.6 | 7.4 | 7.6 |
| Math & STEM |  |  |  |  |  |
| MMMU (val) | 69.1 | 71.3 | 70.2 | 69.1 | 69.8 |
| MMMU-Pro (overall) | 51.9 | 56.1 | 51.1 | 57.0 | 58.2 |
| MathVista (mini) | 63.8 | 71.4 | 74.8 | 75.9 | 77.4 |
| MATH-Vision (full) | 30.4 | 48.6 | 38.1 | 56.3 | 57.3 |
| Document Understanding |  |  |  |  |  |
| AI2D (w. M.) | 84.6 | 86.7 | 88.7 | 85.2 | 86.4 |
| ChartQA (test avg.) | 86.7 | 64.6 | 89.5 | 86.8 | 87.1 |
| Counting |  |  |  |  |  |
| CountBench | 87.9 | 91.2 | 93.6 | 90.0 | 90.0 |
| Video Understanding |  |  |  |  |  |
| Video-MME (w/o sub) | 71.9 | 72.4 | 73.3 | 70.5 | 71.4 |
| LVBench | 30.8 | 57.9 | 47.3 | 50.2 | 51.1 |
| MLVU | 64.6 | 71.0 | 74.6 | 75.2 | 75.7 |

The following are the results from Table 10 of the original paper:

| Dataset | Gemini-2.5-Flash-Thinking | InternVL-3.5-241B-A28B | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
| --- | --- | --- | --- | --- |
| General Visual Question Answering |  |  |  |  |
| MMStar | 75.5 | 77.9 | 74.9 | 75.5 |
| HallusionBench | 61.1 | 57.3 | 62.8 | 63.4 |
| MM-MT-Bench | 7.8 | – | 8.0 | 8.0 |
| Math & STEM |  |  |  |  |
| MMMU (val) | 76.9 | 77.7 | 75.6 | 75.0 |
| MMMU-Pro (overall) | 65.8 | – | 60.5 | 60.8 |
| MathVista (mini) | 77.6 | 82.7 | 80.0 | 81.2 |
| MATH-Vision (full) | 62.3 | 63.9 | 62.9 | 63.8 |
| Document Understanding |  |  |  |  |
| AI2D (w. M.) | 88.6 | 87.3 | 86.1 | 86.8 |
| ChartQA (test avg.) | 88.0 | – | 89.5 | 89.3 |
| Counting |  |  |  |  |
| CountBench | 88.6 | – | 88.6 | 92.5 |
| Video Understanding |  |  |  |  |
| Video-MME (w/o sub) | 79.6 | 72.9 | 69.7 | 69.8 |
| LVBench | 64.5 | – | 49.0 | 49.5 |
| MLVU | 82.1 | 78.2 | 72.9 | 73.9 |
  • Qwen3-Omni-Instruct demonstrates performance comparable to Qwen2.5-VL-72B, and on certain benchmarks like HallusionBench (59.7), MMMU-Pro overall (57.0), MathVistamini (75.9), and MATH-Visionfull (56.3), it outperforms GPT-4o and Gemini-2.0-Flash. This indicates excellent image understanding and reasoning.
  • Qwen3-Omni-Thinking shows significant advancements, outperforming the Instruct baseline by 4.4 points on Math and STEM benchmarks. It achieves performance levels on par with substantially larger baselines, showcasing its effectiveness and computational efficiency. For instance, Qwen3-Omni-Flash-Thinking scores 63.4 on HallusionBench and 81.2 on MathVistamini.
  • Limitation: The current model shows suboptimal performance on long video benchmarks (Video-MME, LVBench, MLVU), attributed to limited positional extrapolation capacity and restricted context length.

6.1.4. Performance of AudioVisual Video Text

The following are the results from Table 11 of the original paper:

| Dataset | Previous Open-source SoTA | Gemini-2.5-Flash | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
| --- | --- | --- | --- | --- | --- |
| WorldSense | 47.1 (Yang et al., 2025b) | 50.9 | 45.4 | 54.0 | 54.1 |

The following are the results from Table 12 of the original paper:

| Dataset | Previous Open-source SoTA | Gemini-2.5-Flash-Thinking | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
| --- | --- | --- | --- | --- |
| DailyOmni | 69.8 (Tang et al., 2025) | 72.7 | 75.8 | 76.2 |
| VideoHolmes | 55.6 (Tang et al., 2025) | 49.5 | 57.3 | 57.3 |
  • General Understanding: Qwen3-Omni-Instruct achieves state-of-the-art performance on the WorldSense benchmark (54.0), significantly surpassing other Omni models (e.g., Qwen2.5-Omni at 45.4). This demonstrates its efficacy in foundational multimodal integration.
  • Complex Reasoning: The Thinking variants exhibit enhanced performance on audiovisual reasoning tasks. Qwen3-Omni-Flash-Thinking scores 76.2 on DailyOmni and 57.3 on VideoHolmes, outperforming previous open-source SOTA and Gemini-2.5-Flash. These results highlight Qwen3-Omni's potential for advanced perception and reasoning in real-world contexts.

6.1.5. Performance of Speech Generation

The following are the results from Table 13 of the original paper:

Content Consistency (WER) on the SEED test sets:

| Model | test-zh | test-en |
| --- | --- | --- |
| Seed-TTS (ICL) (Anastassiou et al., 2024) | 1.11 | 2.24 |
| Seed-TTS (RL) (Anastassiou et al., 2024) | 1.00 | 1.94 |
| MaskGCT (Wang et al., 2024c) | 2.27 | 2.62 |
| E2 TTS (Eskimez et al., 2024) | 1.97 | 2.19 |
| F5-TTS (Chen et al., 2024c) | 1.56 | 1.83 |
| Spark TTS (Wang et al., 2025b) | 1.20 | 1.98 |
| CosyVoice 2 (Du et al., 2024) | 1.45 | 2.57 |
| CosyVoice 3 (Du et al., 2025) | 0.71 | 1.45 |
| Qwen2.5-Omni-7B (Xu et al., 2025) | 1.42 | 2.33 |
| Qwen3-Omni-30B-A3B | 1.07 | 1.39 |
  • Zero-Shot Speech Generation (Content Consistency - WER): Qwen3-Omni-30B-A3B demonstrates highly competitive performance on SEED test-zh (1.07) and test-en (1.39), achieving the best performance on test-en. This indicates robust speech understanding and generation. The RL optimization yields significant improvements in generation stability.

    The following are the results from Table 14 of the original paper:

Content Consistency (CC, WER; lower is better) and Speaker Similarity (SIM; higher is better):

| Language | Qwen3-Omni-30B-A3B (CC) | MiniMax (CC) | ElevenLabs (CC) | Qwen3-Omni-30B-A3B (SIM) | MiniMax (SIM) | ElevenLabs (SIM) |
| --- | --- | --- | --- | --- | --- | --- |
| Chinese | 0.716 | 2.252 | 16.026 | 0.772 | 0.780 | 0.677 |
| English | 1.069 | 2.164 | 2.339 | 0.773 | 0.756 | 0.613 |
| German | 0.777 | 1.906 | 0.572 | 0.738 | 0.733 | 0.614 |
| Italian | 1.067 | 1.543 | 1.743 | 0.742 | 0.699 | 0.579 |
| Portuguese | 1.872 | 1.877 | 1.331 | 0.770 | 0.805 | 0.711 |
| Spanish | 1.765 | 1.029 | 1.084 | 0.744 | 0.762 | 0.615 |
| Japanese | 3.631 | 3.519 | 10.646 | 0.763 | 0.776 | 0.738 |
| Korean | 1.670 | 1.747 | 1.865 | 0.778 | 0.776 | 0.700 |
| French | 2.505 | 4.099 | 5.216 | 0.689 | 0.628 | 0.535 |
| Russian | 3.986 | 4.281 | 3.878 | 0.759 | 0.761 | 0.676 |
  • Multilingual Speech Generation: Qwen3-Omni surpasses MiniMax-Speech and ElevenLabs Multilingual v2 in Content Consistency for languages like Chinese (0.716), English (1.069), and French (2.505). It also delivers competitive results in other languages, demonstrating stable and human-like voice generation across 10 supported languages, with high Speaker Similarity.

    The following are the results from Table 15 of the original paper:

| Direction | Qwen3-Omni-30B-A3B | CosyVoice3 | CosyVoice2 |
| --- | --- | --- | --- |
| en-to-zh | 5.37 | 5.09 | 13.5 |
| ja-to-zh | 3.32 | 3.05 | 48.1 |
| ko-to-zh | 0.99 | 1.06 | 7.70 |
| zh-to-en | 2.76 | 2.98 | 6.47 |
| ja-to-en | 3.31 | 4.20 | 17.1 |
| ko-to-en | 3.34 | 4.19 | 11.2 |
| zh-to-ja | 8.29 | 7.08 | 13.1 |
| en-to-ja | 7.53 | 6.80 | 14.9 |
| ko-to-ja | 4.24 | 3.93 | 5.86 |
| zh-to-ko | 5.13 | 14.4 | 24.8 |
| en-to-ko | 4.96 | 5.87 | 21.9 |
| ja-to-ko | 6.23 | 7.92 | 21.5 |
  • Cross-Lingual Speech Generation (Content Consistency - WER): Qwen3-Omni generally outperforms CosyVoice3 in any-to-en (e.g., ja-to-en 3.31 vs 4.20) and any-to-ko voice cloning (e.g., zh-to-ko 5.13 vs 14.4). It also achieves comparable performance to CosyVoice3 in any-to-ja tasks, even without text normalization, highlighting its adaptability across diverse linguistic contexts.
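The Content Consistency numbers in Tables 13 to 15 are WER-style scores: the synthesized speech is typically transcribed with an ASR system and the transcript is scored against the target text. The ASR step is out of scope here, but the word-level scoring it feeds can be sketched in a few lines (the function name and toy strings below are illustrative, not from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions) over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Toy example: target text vs. an ASR transcript of the synthesized audio.
print(word_error_rate("we get there when we get there", "we get here when we get there"))  # ~0.143
```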

6.1.6. Non-Degradation Across Modalities

The following are the results from Table 16 of the original paper:

| Dataset | Qwen3-30B-A3B-Base-202507 | Qwen3-VL-30B-A3B-Base-202507 | Qwen3-Omni-30B-A3B-Base-202507 |
| --- | --- | --- | --- |
| General Tasks |  |  |  |
| MMLU | 81.24 | – | 81.69 |
| MMLU-Redux | 80.17 | – | 80.60 |
| MMLU-Pro | 61.81 | – | 61.57 |
| SuperGPQA | 38.24 | – | 40.14 |
| BBH | 83.79 | – | 83.53 |
| Math & STEM Tasks |  |  |  |
| GSM8K | 90.83 | – | 91.36 |
| MATH | 60.84 | – | 60.42 |
| Coding Tasks |  |  |  |
| EvalPlus | 69.70 | – | 73.96 |
| MultiPL-E | 65.75 | – | 64.79 |
| MBPP | 72.60 | – | 72.60 |
| CRUX-O | 66.94 | – | 69.06 |
| Multilingual Tasks |  |  |  |
| MGSM | 78.75 | – | 79.93 |
| INCLUDE | 65.17 | – | 64.73 |
| College-level Problems |  |  |  |
| MMMU (val) | – | 57.22 | 59.33 |
| General Visual Question Answering |  |  |  |
| MMStar | – | 67.2 | 69.6 |
| RealWorldQA (avg) | – | 73.98 | 71.89 |
| OCR-related Tasks |  |  |  |
| AI2D | – | 85.88 | 86.62 |
| TextVQA (val) | – | 81.67 | 81.65 |
| DocVQA (test) | – | 95.19 | 95.27 |
| InfoVQA (test) | – | 81.17 | 83.31 |
| ChartQA (test avg.) | – | 87.12 | 87.52 |
| OCRBench | – | 85.8 | 86.0 |
| Video Understanding Tasks |  |  |  |
| Video-MME (w/o sub) | – | 69.22 | 69.25 |
| MVBench | – | 71.87 | 69.50 |
| LVBench | – | 48.61 | 51.07 |

A controlled comparative study was conducted with three models of identical parameter counts (30B-A3B) and matched training compute (FLOPs): a text-only baseline (Qwen3-30B-A3B-Base), a vision-only baseline (Qwen3-VL-30B-A3B-Base), and the multimodal Qwen3-Omni-30B-A3B-Base. The Omni model's sole differentiating factor was the inclusion of supplementary audio and audio-visual data.

  • Text Modality: Qwen3-Omni-30B-A3B-Base shows comparable or slightly better performance on many text benchmarks (e.g., MMLU, SuperGPQA, GSM8K, EvalPlus, MGSM) compared to the Qwen3-30B-A3B-Base (text-only). This supports the claim that early multimodal integration does not degrade language capability.
  • Vision Modality: Qwen3-Omni-30B-A3B-Base consistently outperforms the Qwen3-VL-30B-A3B-Base (vision-only) on various vision benchmarks, including MMMUval (59.33 vs 57.22), MMStar (69.6 vs 67.2), and most OCR-related tasks (AI2D, InfoVQAtest, ChartQAtest Avg). This indicates that joint multimodal training leads to mutual enhancement, improving performance even in single modalities.
  • Observations from Authors:
    1. Early multimodal integration during pretraining allows language models to be co-trained with vision or audio without any degradation in language capability.
    2. The inclusion of the text modality substantially improves performance in the vision and audio modalities.
    3. No measurable gains in language ability are observed from adding visual or audio signals.
    4. Empirically, adding audio data consistently improves vision performance on the MMMU benchmark and OCR-related tasks.

6.2. Ablation Studies / Parameter Analysis

While the paper doesn't present explicit ablation studies in a dedicated section with detailed tables, the comparison between Instruct and Thinking variants, and the Flash variants, implicitly serves as a form of parameter/variant analysis.

  • Instruct vs. Thinking Models: The Thinking models generally show enhanced reasoning capabilities, especially on complex tasks (e.g., Math & STEM in Vision-Text, Audiovisual Reasoning). However, for purely perception-based tasks like ASR/S2TT and Music Understanding, the Thinking model is sometimes outperformed by its Instruct counterpart (as shown in Appendix Tables 17 and 18). This suggests that complex reasoning processes might not always yield gains for straightforward perceptual tasks and could even introduce hallucinations.

  • Flash Models: The Flash models are designed for computational efficiency while maintaining high performance. Results show they generally achieve comparable performance to their non-Flash counterparts (e.g., Qwen3-Omni-Flash-Instruct vs Qwen3-Omni-30B-A3B-Instruct), indicating a good trade-off between speed and accuracy.

  • MoE Architecture: The MoE design for both Thinker and Talker is highlighted as crucial for high concurrency and fast inference, particularly in maintaining low prefill latency and TTFT under varying load (Table 2).

  • Multi-codebook & ConvNet in Talker: The shift to multi-codebook representation and causal ConvNet for Code2Wav is directly tied to achieving ultra-low first-packet latency (234 ms). This architectural choice dramatically reduces computational overhead compared to block-wise diffusion.

    These analyses demonstrate the effectiveness of specific architectural choices and training methodologies in achieving Qwen3-Omni's stated goals.

6.3. Latency and Concurrency

The following are the results from Table 1 of the original paper:

| Module | Architecture | Params | Streaming |
| --- | --- | --- | --- |
| Audio Encoder | AuT | 650M |  |
| Vision Encoder | SigLIP2-So400M | 540M |  |
| Thinker | MoE Transformer | 30B-A3B |  |
| Talker | MoE Transformer | 3B-A0.3B | ✓ |
| MTP | Dense Transformer | 80M |  |
| Code2wav | ConvNet | 200M |  |

End-to-End First-Packet Latency: 234/547 ms (Audio/Video)
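To make the handoff between the modules in Table 1 concrete, here is a deliberately simplified sketch of the streaming loop. Every class, method, and signature is hypothetical, and the assumption that the MTP module expands each Talker codec token into a full multi-codebook frame follows this report's high-level description rather than the actual implementation.

```python
from typing import Iterable, Iterator

def stream_speech(thinker, talker, mtp, code2wav, user_input) -> Iterator[bytes]:
    """Schematic Thinker -> Talker -> MTP -> Code2wav streaming chain (hypothetical API)."""
    text_tokens: Iterable[int] = thinker.stream_text_tokens(user_input)   # Thinker: ~75 tokens/s
    for codec_token in talker.stream_codec_tokens(text_tokens):           # Talker: ~140 tokens/s
        frame = mtp.expand(codec_token)   # assumed: fill in the remaining codebooks (~14 ms/token)
        yield code2wav.decode(frame)      # causal ConvNet emits the next ~80 ms audio chunk (~3 ms)
```

Because every stage consumes and emits a stream, audio can start playing as soon as the first codec frame is decoded, which is what enables the 234 ms first-packet figure above.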

The following are the results from Table 2 of the original paper:

| Qwen3-Omni-30B-A3B | 1 Concurrency | 4 Concurrency | 6 Concurrency |
| --- | --- | --- | --- |
| Thinker-Talker Tail Packet Preprocessing Latency | 72/160 ms | 94/180 ms | 100/200 ms |
| Thinker Time-to-First-Token (TTFT) | 88/160 ms | 468/866 ms | 673/1330 ms |
| Talker Time-to-First-Token (TTFT) | 57/210 ms | 145/450 ms | 376/734 ms |
| MTP Module Time Cost per Token | 14 ms | 16 ms | 18 ms |
| Codec Decoder Time Cost per Code | 3 ms | 5 ms | 5 ms |
| Overall Latency (Audio/Video) | 234/547 ms | 728/1517 ms | 1172/2284 ms |
| Thinker Token Generation Rate (TPS) | 75 tokens/s | 63 tokens/s | 53 tokens/s |
| Talker Token Generation Rate (TPS) | 140 tokens/s | 125 tokens/s | 110 tokens/s |
| Generation RTF (Real-Time Factor) | 0.47 | 0.56 | 0.66 |
  • First-Packet Latency: Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms for audio and 547 ms for video in cold-start settings with 1 concurrency. This is a critical achievement for real-time interactive applications.
  • Concurrency: Tail-packet preprocessing latency grows only modestly with load (72/160 ms at 1 concurrency to 100/200 ms at 6 concurrency), and the lightweight MTP Module and Codec Decoder add only marginal per-token/per-code overhead (14 ms to 18 ms and 3 ms to 5 ms). Thinker and Talker TTFT rise more noticeably under load (e.g., Thinker TTFT goes from 88 ms to 673 ms at 6 concurrency), but generation throughput degrades gracefully (Thinker 75 to 53 tokens/s, Talker 140 to 110 tokens/s), which the authors attribute to the MoE design of both modules.
  • Real Time Factor (RTF): The RTF consistently remains below 1 across varying concurrency levels (0.47 at 1 concurrency, 0.66 at 6 concurrency). This guarantees that users receive continuously streaming audio responses faster than real-time, which is essential for smooth conversational experiences.
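As a sanity check, the 234 ms audio-path figure at 1 concurrency is consistent with simply summing the per-stage components in Table 2, and the same per-token costs reproduce the reported RTF of 0.47. A quick sketch using only numbers from the table:

```python
# Audio-path components at 1 concurrency, taken from Table 2 (all values in ms).
preprocessing = 72        # Thinker-Talker tail packet preprocessing latency
thinker_ttft = 88         # Thinker time-to-first-token
talker_ttft = 57          # Talker time-to-first-token
mtp_per_token = 14        # MTP Module cost per token
codec_per_code = 3        # Codec Decoder cost per code

first_packet_ms = preprocessing + thinker_ttft + talker_ttft + mtp_per_token + codec_per_code
print(first_packet_ms)    # 234, matching the reported end-to-end first-packet latency

# Steady-state RTF: per-frame generation cost over the 80 ms of audio each codec frame covers.
per_frame_ms = 1000 / 75 + 1000 / 140 + mtp_per_token + codec_per_code  # Thinker 75 tok/s, Talker 140 tok/s
print(round(per_frame_ms / 80, 2))  # 0.47, matching the reported Generation RTF
```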

6.4. Qualitative Results from Qwen3-Omni-30B-A3B-Captioner

The paper includes detailed qualitative results for the fine-tuned Audio Captioner, demonstrating its ability to produce rich, low-hallucination descriptions for diverse audio inputs.

  • Analysis of Expressive Speech: The model accurately identifies a "studio setting," "faint, persistent electronic hiss," "male speaker," "clear, energetic, highly theatrical manner," specific Chinese phrases, "exaggerated emphasis," "comedic contrast," "central positioning in stereo field," and "digital reverb for dramatic effect." It even deduces the "comedic, over-the-top manner" and "theatrical nature," showcasing strong semantic understanding and paralinguistic analysis.

  • Analysis of Complex Scene Sound Effect: The captioner describes a "highly produced, cinematic soundscape," noting "deep, resonant musical drone," "metallic clank," "slow, rhythmic, ominous beat," "swelling orchestral strings," "thunderous, mechanical roar of a massive engine," "high-pitched, metallic screech," "colossal, explosive impact," "shattering and debris," and "heavy, strained breathing" post-impact. It infers context like "imminent danger," "immense machinery," "vast, hard-walled environment," and "catastrophic event," demonstrating excellent audio event detection and scene understanding.

  • Analysis of Mixed Speech, Audio, and Music: For a composite audio, it details "deep, resonant metallic clang," "low-frequency rumble," "mechanical whirring," "electrical arcs or energy discharges," "distant and high-pitched female voice" asking "Are we there yet?", "deeper, gravelly male voice" responding "We get there when we get there," and "synthesized musical sting." It correctly interprets "familial banter," "playful annoyance," and "science fiction or fantasy context," showcasing complex audio scene analysis and multimodal reasoning to piece together narrative elements.

    These qualitative results underscore the Captioner's ability to provide granular, contextually rich, and narrative-driven descriptions of audio, which is a significant advancement in the field.

7. Conclusion & Reflections

7.1. Conclusion Summary

The Qwen3-Omni technical report introduces a groundbreaking family of multimodal models (Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, Qwen3-Omni-Flash-Instruct, and Qwen3-Omni-Flash-Thinking) that achieve a significant milestone in multimodal AI. For the first time, these models demonstrate that fully integrated, end-to-end multimodal training can be achieved without degrading core capabilities in any single modality.

Key findings include:

  • Performance Parity and Beyond: Qwen3-Omni-30B-A3B matches or surpasses same-sized unimodal Qwen models on text and vision benchmarks, directly refuting the common modality trade-off.

  • SOTA Audio Performance: It sets new state-of-the-art records on audio processing and dialogue benchmarks, achieving open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming even strong proprietary systems like Gemini-2.5-Pro.

  • Enhanced Reasoning: The Thinking variants (Qwen3-Omni-30B-A3B-Thinking) further boost performance on complex multimodal reasoning tasks involving text, vision, and audio-visual inputs.

  • Ultra-Low Latency Speech Interaction: Through innovations like the Thinker-Talker MoE architecture, multi-codebook autoregressive prediction, and a lightweight causal ConvNet for waveform synthesis, Qwen3-Omni achieves an impressive end-to-end first-packet latency of 234 ms, enabling fluent and natural real-time speech generation across 10 languages.

  • Broad Language and Modality Coverage: The model supports 119 text languages, 19 languages for speech understanding, and 10 for speech synthesis, and can process long audio inputs (up to 40 minutes).

  • Novel Audio Captioning: The introduction of Qwen3-Omni-30B-A3B-Captioner addresses a gap in the community by providing a model capable of generating detailed, low-hallucination captions for arbitrary audio inputs.

  • Public Availability: The release of key models under the Apache 2.0 license promotes further research and application.

    The paper concludes that Qwen3-Omni represents a critical step towards truly integrated multimodal AI, offering advantages over cascaded pipelines in cross-modal reasoning, lower end-to-end latency, and reduced system complexity and cost.

7.2. Limitations & Future Work

The authors acknowledge a key limitation:

  • Suboptimal Long Video Performance: The current model exhibits suboptimal performance on long video benchmarks. This is attributed to two architectural constraints:
    1. Limited capacity for positional extrapolation.

    2. Restricted context length.

      Future research directions suggested by the authors include:

  • Multi-speaker ASR: Improving Automatic Speech Recognition for scenarios with multiple speakers.
  • Video OCR: Enhancing Optical Character Recognition capabilities within video.
  • Audiovisual Proactive Learning: Developing models that can proactively learn from audiovisual cues.
  • Enhanced Agent-based Workflows and Function Calling: Further integrating the model with agentic capabilities and tool-use.

7.3. Personal Insights & Critique

The Qwen3-Omni paper presents a compelling case for the feasibility and benefits of deeply integrated multimodal learning. The central claim of non-degradation across modalities while achieving SOTA performance is highly significant, as it addresses a long-standing challenge in multimodal AI. This suggests that with careful architectural design (e.g., MoE, TM-RoPE) and comprehensive training strategies (early multimodal integration, diverse data), the "cost" of multimodality can be effectively mitigated or even turned into a benefit.

The achievement of ultra-low first-packet latency for speech generation is a crucial practical innovation. For real-time human-AI interaction, latency is paramount, and demonstrating streaming from the first codec frame with a causal ConvNet is a clever engineering solution that significantly improves user experience compared to previous block-wise diffusion methods. This architectural shift from a heavy generative model (diffusion) to a lightweight causal network for the final waveform synthesis is particularly insightful, showing how to optimize different stages of a complex generation pipeline.

The distinction between Instruct and Thinking models, and the observation that Thinking models can sometimes perform worse on purely perceptual tasks like ASR/S2TT, offers valuable insight. It suggests that complex reasoning pathways, while beneficial for higher-level cognitive tasks, might introduce unnecessary overhead or hallucinations when the task is primarily about accurate perception and straightforward mapping. This implies that specialized model variants or adaptive routing within an MoE could be crucial for optimal performance across the full spectrum of tasks.

The explicit creation and release of an Audio Captioner is a commendable contribution to the research community, filling a recognized gap. The qualitative results provided are genuinely impressive, showcasing a rich understanding of sound events, context, and even paralinguistic cues, which goes beyond simple sound classification.

Potential Issues or Unverified Assumptions:

  1. Complexity of Training: While the paper outlines the training stages, the sheer scale of data (2 trillion tokens, 20 million hours of audio) and the intricate multi-stage pretraining and post-training processes imply immense computational resources. Replicability for smaller research groups might be challenging.
  2. Flash Variants vs. Full Models: While Flash models maintain good performance, the exact trade-offs in specific nuanced tasks (e.g., very long-context understanding or subtle emotional generation) might warrant further detailed investigation.
  3. Generalization to Novel Modalities/Tasks: While Qwen3-Omni is impressive, it's unclear how easily it can incorporate entirely new modalities or adapt to tasks significantly different from those it was trained on, beyond the current perception-generation-reasoning paradigm.
  4. "Overall SOTA" Definition: While the paper states "overall SOTA on 22" benchmarks, the exact criteria for this "overall SOTA" across a diverse set of benchmarks could be more explicitly defined for rigorous comparison.

Transferability and Application: The methods and conclusions of Qwen3-Omni are highly transferable. The Thinker-Talker MoE architecture and the optimized streaming speech generation pipeline could be applied to other multimodal LLMs to improve real-time interaction. The TM-RoPE for flexible multimodal positional encoding is a valuable contribution for any model dealing with dynamic, time-aligned multimodal streams. The success of early multimodal integration suggests a powerful recipe for future foundation models aiming for broad applicability without sacrificing specialized performance. The Captioner itself has direct applications in accessibility, content indexing, and even creating synthetic datasets for further audio research. This work inspires confidence that truly general-purpose multimodal AI is within reach.
