Qwen3-Omni Technical Report
TL;DR Summary
Qwen3-Omni is a single multimodal model achieving state-of-the-art performance across text, image, audio, and video, particularly excelling in audio tasks. It uses a mixture-of-experts architecture, supports multilingual audio understanding and generation, and reduces streaming latency by replacing block-wise diffusion with a lightweight causal ConvNet over multi-codebook speech tokens, reaching a theoretical end-to-end first-packet latency of 234 ms.
Abstract
We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Qwen3-Omni Technical Report," which introduces a novel single multimodal model designed to achieve state-of-the-art performance across various modalities without performance degradation relative to single-modal counterparts.
1.2. Authors
The paper lists a comprehensive team of authors from the Qwen Team. Core contributors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, and Junyang Lin, together with a long list of additional contributors (the complete contributor list is given in the original report).
The authors are primarily affiliated with the Qwen Team, indicating an internal research and development effort from a major technology entity, likely Alibaba Group, given the QwenLM GitHub repository and ModelScope links. Their backgrounds appear to span large language models, multimodal AI, speech processing, and vision systems.
1.3. Journal/Conference
The paper is published as an arXiv preprint with the identifier arXiv:2509.17765. This indicates it is a pre-publication version, not yet formally peer-reviewed or published in journal or conference proceedings. arXiv is a reputable platform for rapidly sharing new research in fields such as AI and machine learning.
1.4. Publication Year
The paper was submitted to arXiv on 2025-09-22 (UTC).
1.5. Abstract
The paper introduces Qwen3-Omni, a single multimodal model that achieves state-of-the-art performance across text, image, audio, and video modalities without degradation compared to single-modal models of similar size. It matches Qwen series performance in text and vision, and significantly excels in audio tasks, achieving open-source SOTA on 32 out of 36 audio and audio-visual benchmarks, and overall SOTA on 22, surpassing strong closed-source models like Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe.
Qwen3-Omni utilizes a Thinker-Talker MoE (Mixture-of-Experts) architecture to unify perception and generation across modalities, producing fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19, and speech generation in 10. To reduce first-packet latency in streaming synthesis, the Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. This allows replacing computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling immediate streaming. The model boasts a theoretical end-to-end first-packet latency of 234 ms in cold-start settings.
For enhanced multimodal reasoning, a Thinking model is introduced to explicitly reason over inputs from any modality. Additionally, to address the lack of general-purpose audio captioning models, Qwen3-Omni-30B-A3B was fine-tuned to create Qwen3-Omni-30B-A3B-Captioner, which generates detailed, low-hallucination captions for arbitrary audio inputs. Several variants (Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner) are publicly released under the Apache 2.0 license.
1.6. Original Source Link
Official Source Link: https://arxiv.org/abs/2509.17765
PDF Link: https://arxiv.org/pdf/2509.17765v1.pdf
Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the modality trade-off commonly observed in contemporary LLM-centric multimodal models. Prior research often shows that gains in one modality's performance (e.g., text understanding) come at the cost of degradation in others (e.g., image or audio understanding). This limitation prevents multimodal models from achieving parity across all modalities and fully leveraging cross-modal synergy.
The problem is important because human perception and intelligence fundamentally rely on the coordinated use of multiple modalities. Building AI systems that mirror this integrated multimodal learning is crucial for creating more robust, versatile, and human-like AI. Existing challenges include:
- Performance Degradation: Multimodal models often fail to match the specialized performance of unimodal counterparts.
- Latency in Real-time Interaction: Especially for speech generation, first-packet latency is a critical barrier to a smooth user experience.
- Lack of Unified Architectures: Many multimodal systems are cascaded pipelines of separate unimodal components, leading to higher complexity, cost, and limited cross-modal reasoning.
- Deficiency in Specific Multimodal Tasks: The research community lacks general-purpose models for tasks such as audio captioning.

The paper's entry point, and its innovative idea, is to explore integrated multimodal training within the LLM-based paradigm and demonstrate that joint multimodal training can achieve parity across all modalities without degradation while simultaneously enhancing cross-modal capabilities. This is achieved through a Thinker-Talker Mixture-of-Experts (MoE) architecture and novel designs for efficient speech generation and multimodal reasoning.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Achieving Multimodal Performance Parity and Non-Degradation: Qwen3-Omni is presented as the first single multimodal model that maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to same-sized single-modal counterparts within the Qwen series, resolving the common modality trade-off issue.
- State-of-the-Art Audio Performance: The model particularly excels on audio tasks, achieving open-source SOTA on 32 out of 36 audio and audio-visual benchmarks and overall SOTA on 22, outperforming strong closed-source models.
- Thinker-Talker MoE Architecture: Introduction of an upgraded Thinker-Talker MoE architecture that unifies perception and generation across modalities, enabling fluent text and natural real-time speech.
  - Upgraded Thinker and Talker to MoE designs: enhances concurrency and fast inference.
  - Decoupled Thinker-Talker conditioning: the Talker conditions only on audio/visual features and conversational history, allowing distinct system prompts for the Thinker (response style) and the Talker (audio style).
- Novel Audio Encoder (AuT): Development of AuT (Audio Transformer), trained from scratch on 20 million hours of supervised audio, yielding stronger general-purpose audio representations and employing block-wise window attention for real-time prefill caching.
- Advanced Speech Generation (Talker) with Ultra-Low Latency:
  - Adoption of a multi-codebook representation for increased capacity and faithful modeling of diverse voices and acoustic cues.
  - Shift from single-track to multi-track codec modeling, autoregressively predicting multiple codebook layers via MTP modules.
  - Replacement of computationally intensive block-wise diffusion with a lightweight causal ConvNet (Code2Wav) for waveform synthesis.
  - Reduced input/output audio code rates to 12.5 Hz, enabling single-frame, immediate speech synthesis.
  - A theoretical end-to-end first-packet latency of 234 ms in cold-start settings, enabling low-latency speech interaction.
- Enhanced Multimodal Reasoning with a Thinking Model: Introduction of a dedicated Thinking model that explicitly reasons over inputs from any modality, including audio-video and audio-only scenarios.
- General-Purpose Audio Captioning Model: Fine-tuning Qwen3-Omni-30B-A3B to create Qwen3-Omni-30B-A3B-Captioner, addressing a gap in the research community by producing detailed, low-hallucination captions for arbitrary audio inputs.
- Expanded Language Coverage: Supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. It can process audio recordings up to 40 minutes long for ASR and spoken-language understanding.
- Public Release: Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.

These findings address multimodal performance trade-offs, high latency in real-time speech generation, the lack of unified multimodal reasoning, and gaps in specific multimodal tasks such as audio captioning. They demonstrate that integrated multimodal training can deliver superior performance and efficiency compared to cascaded unimodal systems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the Qwen3-Omni paper, a reader should be familiar with several foundational concepts in deep learning and multimodal AI:
- Large Language Models (LLMs): LLMs (e.g., GPT-3, Qwen) are deep learning models trained on massive text datasets, capable of understanding, generating, and reasoning with human language. They typically employ the Transformer architecture and are known for their emergent abilities across natural language processing (NLP) tasks.
- Multimodal AI: This field deals with AI systems that can process and understand information from multiple input modalities (e.g., text, images, audio, video) and generate outputs in one or more of these modalities.
- Transformer Architecture: The Transformer is a neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017). It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence, allowing for parallel processing and capturing long-range dependencies. It consists of an encoder and a decoder stack.
  - Self-Attention: A mechanism that allows the model to weigh the importance of different tokens in an input sequence when encoding a particular token. For queries $Q$, keys $K$, and values $V$, the attention output is calculated as:
    $
    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
    $
    Where:
    - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
    - $QK^T$ computes the similarity between queries and keys.
    - $d_k$ is the dimension of the key vectors, used for scaling to prevent vanishing gradients.
    - $\mathrm{softmax}$ normalizes the scores to produce attention weights.
    - The output is a weighted sum of the values.
- Mixture-of-Experts (MoE) Architecture: MoE is a technique for improving model capacity and efficiency. Instead of a single large dense network, it uses multiple "expert" sub-networks; a gating network learns to select or combine the outputs of these experts based on the input. This allows for conditional computation, where only a subset of the model's parameters is activated for each input, leading to faster inference for a given model size (a toy routing sketch follows this list).
- Audio Encoders: These models convert raw audio waveforms into meaningful numerical representations (embeddings or features) that can be processed by other neural networks. Examples include mel-spectrograms (a visual representation of audio frequency content over time) and learned representations such as Whisper encoders or AuT in this paper.
- Speech Codecs: Algorithms used to encode and decode audio signals efficiently. They compress audio data for storage or transmission and then reconstruct it. Discrete speech codecs transform continuous audio into discrete tokens, which can be predicted autoregressively by generative models.
  - Multi-codebook representation: Instead of a single stream of discrete tokens, multiple "codebooks" are used, where each codebook captures different aspects or resolutions of the audio signal. This allows for a richer and more detailed representation of speech.
- Autoregressive Models: These models predict future elements of a sequence based on previously generated elements. In speech generation, an autoregressive Talker predicts the next speech codec token based on the preceding ones, ensuring temporal coherence.
- Diffusion Models: A class of generative models that learn to reverse a diffusion process, gradually transforming noise into data (e.g., images, audio). Block-wise diffusion applies this process to blocks of data. Such models are known for high quality but can be computationally intensive and slow for real-time generation.
- Convolutional Neural Networks (ConvNets): Neural networks that use convolutional layers to automatically and adaptively learn hierarchies of features from input data. Causal ConvNets are designed so that the output at any time step depends only on past inputs, making them suitable for sequential data generation and real-time streaming.
- Rotary Position Embedding (RoPE): A method for encoding positional information in Transformer models. Unlike traditional absolute or learned positional embeddings, RoPE integrates relative positional information directly into the self-attention mechanism by rotating query and key vectors.
  - Multimodal Rotary Position Embedding (M-RoPE): An extension of RoPE to multimodal contexts, allowing the model to understand positions within different modalities and their relationships.
  - Time-aligned Multimodal Rotary Position Embedding (TM-RoPE): Further extends M-RoPE by explicitly incorporating absolute temporal information, crucial for dynamically sampled video and audio streams.
- Supervised Fine-Tuning (SFT): A common practice where a pre-trained LLM is further trained on a smaller, task-specific dataset with labeled examples to adapt it to a particular downstream task or to instruction-following.
- Direct Preference Optimization (DPO): A reinforcement learning (RL) technique for aligning LLMs with human preferences. Instead of explicit reward modeling, DPO directly optimizes the LLM policy using preference pairs (e.g., "response A is better than response B").
- Generative Sequence Policy Optimization (GSPO): A reinforcement learning from human feedback (RLHF) algorithm mentioned for comprehensively enhancing model capabilities and stability.
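As referenced in the MoE entry above, here is a minimal sketch of top-k expert routing; the layer sizes, number of experts, and the top-2 choice are illustrative assumptions, not Qwen3-Omni's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k MoE layer: a gate picks k experts per token and mixes their outputs."""
    def __init__(self, d_model=64, d_ff=128, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.gate(x)                    # (tokens, n_experts)
        topv, topi = logits.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)        # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # only the chosen experts run per token
            idx = topi[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

x = torch.randn(10, 64)
print(TinyMoE()(x).shape)  # torch.Size([10, 64])
```

The key property this illustrates is conditional computation: each token touches only k of the experts, so capacity grows with the number of experts while per-token compute stays roughly constant.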
3.2. Previous Works
The paper builds upon and references several key prior works, particularly within the Qwen series and broader multimodal AI:
- Qwen Models (Yang et al., 2024; 2025a): Qwen3-Omni is part of the Qwen series of powerful LLMs. The LLM component of Qwen3-Omni is initialized with parameters from Qwen3. The Qwen series provides strong text-only capabilities that Qwen3-Omni aims to match or surpass in a multimodal setting.
- Qwen-VL (Bai et al., 2023b): An early large vision-language model within the Qwen family, demonstrating versatile abilities in vision and language. Qwen3-Omni leverages the vision encoder from Qwen3-VL (a successor to Qwen-VL), which is initialized from SigLIP2-So400m (Tschannen et al., 2025).
- Qwen-Audio (Chu et al., 2023; 2024): These works advance universal audio understanding via unified large-scale audio-language models. Qwen3-Omni builds significantly on Qwen-Audio's focus on audio perception and generation.
- Qwen2.5-Omni (Xu et al., 2025): Qwen3-Omni explicitly builds on the Thinker-Talker architecture introduced in Qwen2.5-Omni and introduces five key upgrades. Qwen2.5-Omni segmented audiovisual representations into fixed 2-second chunks, a limitation addressed by Qwen3-Omni; it also used block-wise diffusion for waveform generation, which is replaced by a causal ConvNet in Qwen3-Omni. In Qwen2.5-Omni, the Talker consumed the Thinker's high-level text representations, a coupling that is removed in Qwen3-Omni.
- GPT-4o (OpenAI, 2024), Gemini-2.5-Pro (Gemini Team, 2024), Seed-ASR, GPT-4o-Transcribe: Strong closed-source multimodal or unimodal systems that Qwen3-Omni aims to outperform or match, especially on audio benchmarks. GPT-4o is known for its impressive multimodal capabilities, including voice interaction.
- Whisper (Radford et al., 2022): A widely recognized ASR model. Qwen3-Omni replaces the Whisper audio encoder with its custom AuT encoder, indicating a focus on developing specialized, higher-performing audio components.
- M-RoPE (Bai et al., 2023b): Qwen3-Omni extends M-RoPE to TM-RoPE, incorporating absolute temporal information for better handling of dynamic multimodal data.
Core Formula in Prior Work (Transformer's Self-Attention):
As mentioned in the Foundational Concepts, the Transformer architecture, central to LLMs and MoE models, relies on the self-attention mechanism. Its core formula is:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, and $d_k$ is the dimension of the key vectors. This formula computes a weighted sum of value vectors, with weights determined by the compatibility of each query with the corresponding keys. This mechanism is fundamental to how Transformers process sequences.
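To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and random inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of the values

# Illustrative shapes: 4 query tokens, 6 key/value tokens, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```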
3.3. Technological Evolution
The field of large language models and multimodal AI has evolved rapidly:
- Unimodal LLMs: Text-only models (e.g., GPT-3, Qwen) demonstrating powerful language understanding and generation (Brown et al., 2020; OpenAI, 2023).
- Vision-Language Models (VLMs): Integration of the visual modality, initially through separate vision encoders connected to LLMs (e.g., Flamingo, BLIP, Qwen-VL), enabling image captioning and visual question answering.
- Audio-Language Models (ALMs): Similar integration for audio, with audio encoders feeding into LLMs (e.g., Whisper for ASR, Qwen-Audio for broader audio understanding).
- Early Multimodal LLMs: Combining multiple modalities (text, vision, audio) but often with modality trade-offs or reliance on cascaded pipelines (e.g., GPT-4o, the Gemini series, Qwen2.5-Omni). These models typically used separate encoders for each modality and projected their outputs into the LLM's embedding space.
- Towards Fully Integrated Multimodal Models: The current frontier aims for true multimodal integration, where modalities are deeply intertwined, trained jointly, and mutually enhance each other without degradation, often via unified architectures and specialized multimodal positional embeddings.

Qwen3-Omni fits into this timeline as a significant step in the fifth stage. It aims to overcome the modality trade-offs seen in earlier multimodal LLMs by demonstrating that deep, integrated multimodal training can achieve performance parity with specialized unimodal models while also enabling novel cross-modal reasoning and real-time interaction capabilities.
3.4. Differentiation Analysis
Compared to previous work, Qwen3-Omni introduces several core innovations:
- Non-Degradation in Multimodal Training: The most significant differentiation is the claim of achieving state-of-the-art performance across text, image, audio, and video without any degradation relative to same-sized single-modal counterparts. This contrasts with many existing multimodal LLMs that show performance dips in certain modalities when trained jointly. The paper attributes this to mixing unimodal and cross-modal data from the early stage of text pretraining and to careful architectural design.
- Upgraded Thinker-Talker MoE Architecture:
  - MoE for both Thinker and Talker: enhances scalability, concurrency, and inference speed, a significant upgrade over Qwen2.5-Omni.
  - Decoupled Thinker and Talker: allows independent control of text response style and audio style, and more flexible multimodal conditioning for speech generation, with the Talker directly consuming multimodal features rather than only the Thinker's text output. This is a departure from Qwen2.5-Omni.
- Novel AuT Audio Encoder: Instead of relying on existing encoders such as Whisper, Qwen3-Omni introduces a custom AuT encoder trained on 20 million hours of supervised audio. It is designed for stronger, more general-purpose audio representations and incorporates block-wise window attention for real-time prefill caching.
- Ultra-Low Latency Speech Generation:
  - Multi-codebook representation and MTP module: allows richer speech modeling and autoregressive prediction of multiple codebook layers frame by frame.
  - Causal ConvNet for Code2Wav: replaces the computationally intensive block-wise diffusion of Qwen2.5-Omni with a lightweight causal ConvNet, drastically reducing inference latency and enabling immediate, single-frame speech synthesis. This is critical for the stated 234 ms first-packet latency.
  - Reduced code rates: a 12.5 Hz token rate for efficient streaming.
- Enhanced Positional Encoding (TM-RoPE): TM-RoPE explicitly incorporates absolute temporal information and dynamically aligns audiovisual representations based on temporal IDs, moving beyond the fixed 2-second chunking of Qwen2.5-Omni and supporting streaming inputs of arbitrary duration.
- Dedicated Thinking Model and Audio Captioner: A specific Thinking model for explicit reasoning across modalities and a fine-tuned Audio Captioner address functional gaps not commonly covered by general multimodal models.

In essence, Qwen3-Omni differentiates itself by pushing integrated multimodal training to true performance parity, achieving significant efficiency gains in real-time speech interaction, and enhancing reasoning capabilities through specialized architectural components and training strategies.
4. Methodology
4.1. Principles
The core idea behind Qwen3-Omni is to create a single, unified multimodal model that can process and generate information across text, image, audio, and video modalities without any performance degradation relative to specialized unimodal models. This is achieved by leveraging a Thinker-Talker Mixture-of-Experts (MoE) architecture and focusing on deep multimodal integration during training, from early pretraining stages. The theoretical basis is that human-like intelligence arises from the coordinated processing of multiple senses, and AI models should aim to mimic this cross-modal synergy. By training all modalities jointly and designing components specifically for efficiency and real-time interaction, the model can overcome the limitations of cascaded pipelines and modality-specific trade-offs seen in previous approaches. Key principles include:
- Unified Perception and Generation: A single model handles both understanding (perception) and response generation across all supported modalities.
- MoE for Scalability and Efficiency: Utilizing a Mixture-of-Experts architecture for both the Thinker (text generation/reasoning) and the Talker (speech generation) to enable high concurrency and fast inference.
- Modality Parity through Joint Training: Integrating unimodal and cross-modal data from the early stages of pretraining so that all modalities mutually enhance each other without degrading any single modality.
- Low-Latency Real-time Interaction: Designing the speech generation pipeline (Talker) with specific optimizations, such as multi-codebook autoregressive prediction and a lightweight causal ConvNet for waveform synthesis, to achieve ultra-low first-packet latency.
- Explicit Multimodal Reasoning: Introducing a dedicated Thinking model to perform explicit reasoning over complex multimodal inputs.
- Advanced Positional Encoding: Using Time-aligned Multimodal Rotary Position Embedding (TM-RoPE) to effectively integrate temporal and spatial information across diverse modalities, including dynamically sampled video.
4.2. Core Methodology In-depth (Layer by Layer)
Qwen3-Omni employs a Thinker-Talker architecture, which builds upon previous versions like Qwen2.5-Omni but introduces significant upgrades. The overall system is depicted in Figure 2.
The following figure (Figure 2 from the original paper) illustrates the overall architecture of Qwen3-Omni:

Figure 2: Overview of Qwen3-Omni. Qwen3-Omni adopts the Thinker-Talker architecture: the Thinker is tasked with text generation, while the Talker focuses on generating streaming speech tokens by receiving high-level representations directly from the Thinker. To achieve ultra-low-latency streaming, the Talker autoregressively predicts a multi-codebook sequence. At each decoding step, an MTP module outputs the residual codebooks for the current frame, after which the Code2Wav renderer incrementally synthesizes the corresponding waveform, enabling frame-by-frame streaming generation.
4.2.1. Overall Architecture (Thinker-Talker MoE)
The Qwen3-Omni architecture is fundamentally composed of two main MoE components: the Thinker and the Talker.
- Thinker: Primarily responsible for text generation and multimodal reasoning. It receives input from all modalities (text, audio, image, video) through the corresponding encoders and processes them to generate textual responses. The Thinker itself is an MoE Transformer.
- Talker: Focuses on streaming speech generation. Unlike Qwen2.5-Omni, the Talker in Qwen3-Omni does not consume the Thinker's high-level text representations alone. Instead, it conditions on audio and visual multimodal features directly from the Thinker and shares access to the full conversational history. This design allows audiovisual-coordinated speech generation (e.g., preserving paralinguistic cues such as emotion) and enables separate system prompts for the Thinker (response style) and the Talker (audio style). The Talker is also an MoE Transformer.

The Thinker and Talker operate asynchronously. When the Thinker completes processing a chunk of input, its high-level representations immediately prefill the Talker's current chunk while the Thinker processes its next chunk. This chunked prefilling mechanism is crucial for reducing Time-To-First-Token (TTFT) for both components.
4.2.2. Audio Transformer (AuT)
The AuT (Audio Transformer) encoder is a critical component for audio perception.
The following figure (Figure 3 from the original paper) provides an overview of the AuT architecture:

Figure 3: Overview of AuT. AuT is an attention-based encoder-decoder autoregressive model trained from scratch on 20 million hours of supervised audio; Qwen3-Omni employs the AuT encoder as its audio encoder to obtain general-purpose audio representations at a token rate of 12.5 Hz. The encoder comprises 32 self-attention layers and 3 downsampling convolutional layers, the decoder contains 8 layers of cross-attention and self-attention, and the input is FBank features with a 10 ms frame shift.
- Architecture: AuT is an attention-based encoder-decoder autoregressive model.
  - The encoder consists of 32 self-attention layers and 3 downsampling convolutional layers.
  - The decoder consists of 8 layers of cross-attention and self-attention.
- Training: It is trained from scratch on 20 million hours of supervised audio data.
  - During training, filter-bank features (mel-spectrograms) of the audio are downsampled 8 times by Conv2D blocks before the attention layers, reducing the token rate to 12.5 Hz.
  - The training mix includes 80% Chinese and English pseudo-labeled ASR data, 10% ASR data from other languages, and 10% audio understanding data, to learn stronger, general-purpose audio representations.
- Efficiency: AuT utilizes flash attention with dynamic attention window sizes (1 to 8 seconds) to balance real-time prefill-caching efficiency against performance on offline audio tasks.
- Role in Qwen3-Omni: AuT serves as the primary audio encoder, providing general-purpose audio representations at a token rate of 12.5 Hz. It has approximately 0.6 billion parameters.
4.2.3. Perception
The Thinker processes various inputs from different modalities:
- Text Inputs: Uses Qwen's tokenizer (Yang et al., 2025a), which employs byte-level byte-pair encoding (BPE) with a vocabulary of 151,643 regular tokens. This converts raw text into discrete tokens for the Thinker.
- Audio Inputs (and Audio Extracted from Video):
  - The raw waveform is resampled to 16 kHz.
  - It is converted into a 128-channel mel-spectrogram using a 25 ms window and a 10 ms hop.
  - It is then processed by the AuT encoder to obtain audio representations; each frame of this representation corresponds to roughly an 80 ms segment of the original audio signal.
- Image and Video (without audio) Inputs:
  - Employ the vision encoder from Qwen3-VL, which is initialized from SigLIP2-So400m (Tschannen et al., 2025) and has approximately 543 million parameters.
  - This encoder is trained on a mixture of image and video data.
  - For video inputs, frames are sampled at a dynamic frame rate to preserve video information while aligning with the audio sampling rate.
4.2.4. Positional Embedding (Time-aligned Multimodal Rotary Position Embedding - TM-RoPE)
To integrate temporal and spatial information effectively across modalities, Qwen3-Omni uses TM-RoPE, an extension of M-RoPE (Bai et al., 2023b).
- Factorization: TM-RoPE factorizes the conventional Rotary Position Embedding (RoPE) into three distinct dimensions: temporal, height, and width.
- Angle Redistribution: To address a limitation of M-RoPE (whose initial 16 rotary angles for temporal dependencies captured fine-grained local variations but impeded long-range extrapolation), TM-RoPE modifies the allocation:
  - 24 rotary angles for the temporal dimension.
  - 20 rotary angles for the height dimension.
  - 20 rotary angles for the width dimension.
  This redistribution aims for a more balanced representation of both local semantics and long-range dependencies.
- Modality-Specific Application:
  - Text Inputs: All three components (temporal, height, width) share identical position identifiers, making TM-RoPE functionally equivalent to one-dimensional RoPE (Su et al., 2024).
  - Audio Inputs: Use shared position IDs but are augmented with absolute temporal encodings; each temporal ID corresponds to a duration of 80 ms.
  - Image Data: A constant temporal ID is assigned to all visual tokens; height and width IDs are determined by each token's row and column position.
  - Multimodal Audiovisual Streams:
    - Audio component: encoded with one temporal ID per 80 ms.
    - Video component: treated as a sequence of frames with monotonically increasing temporal IDs that are dynamically adjusted based on actual timestamps to maintain a consistent temporal resolution of 80 ms per ID; height and width IDs are assigned as for still images.
- Contiguous Position Numbering: To prevent positional conflicts between modalities, position numbering is made contiguous, with each subsequent modality starting from one plus the maximum position ID of the preceding modality.
- Arbitrary Duration Support: Unlike Qwen2.5-Omni, which segmented audiovisual representations into fixed 2-second chunks, Qwen3-Omni directly aligns these representations using their temporal IDs (explicitly anchored to absolute time), allowing it to support streaming inputs of arbitrary duration.
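A toy sketch of how (temporal, height, width) position IDs along the lines described above could be laid out for an interleaved audio-video segment. The helper and its parameters are illustrative assumptions, not the paper's implementation.

```python
def audiovisual_position_ids(audio_ms, video_frame_ts_ms, grid_h, grid_w, start_id=0):
    """Toy (modality, temporal, height, width) layout: one temporal ID per 80 ms.

    audio_ms:           duration of the audio segment in milliseconds.
    video_frame_ts_ms:  timestamps (ms) of the dynamically sampled video frames.
    grid_h, grid_w:     patch grid of each video frame.
    """
    ids = []
    # Audio tokens: the temporal ID advances every 80 ms; height/width carry no extra
    # information for audio, so they are set equal to the temporal ID here.
    for i in range(int(audio_ms // 80)):
        t = start_id + i
        ids.append(("audio", t, t, t))
    # Video tokens: the temporal ID is derived from the frame timestamp at 80 ms
    # resolution, while height/width IDs come from the patch position in the frame.
    for ts in video_frame_ts_ms:
        t = start_id + int(ts // 80)
        for h in range(grid_h):
            for w in range(grid_w):
                ids.append(("video", t, h, w))
    return ids

ids = audiovisual_position_ids(audio_ms=400, video_frame_ts_ms=[0, 160, 320], grid_h=2, grid_w=2)
print(len(ids), ids[:2], ids[5])
```

Anchoring temporal IDs to absolute time (80 ms per ID) is what lets audio and video tokens from the same moment share a temporal index without fixed-length chunking.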
4.2.5. Speech Generation
The Talker module is responsible for speech synthesis in multi-turn dialogues.
- Context Conditioning: The Talker is conditioned on rich context from the Thinker, including historical textual tokens, multimodal representations, and the current turn's streamed text. This long-context information is vital for adapting acoustic attributes (prosody, loudness, emotion) to the ongoing discourse.
- RVQ Token Operation: The Talker operates directly on RVQ (Residual Vector Quantization) tokens, which are discrete representations of audio.
- Hierarchical Prediction Scheme:
  - Backbone: ingests the aggregated codebook features of the current frame.
  - Linear head: predicts the zeroth codebook.
  - Multi-Token Prediction (MTP) module: generates all residual codebooks (those following the zeroth one). This strategy allows the model to learn a complete representation of acoustic detail.
- Waveform Reconstruction (Code2Wav): The final stage of audio synthesis, Code2Wav, is simplified to a lightweight causal ConvNet. This replaces the more complex DiT-based vocoder (the Diffusion Transformer used in Qwen2.5-Omni), significantly reducing inference latency and computational cost (FLOPs) while maintaining high audio fidelity. The causal Code2Wav ConvNet allows streaming from the first codec frame, enabling frame-by-frame streaming generation.
4.2.6. Designs for Streaming and Concurrency
Optimizing first-packet latency and concurrency is crucial for real-time interaction.
- Chunked Prefilling and MoE Architecture:
  - Chunked Prefilling: As in Qwen2.5-Omni, the audio and vision encoders output data in chunks along the temporal dimension. During real-time interaction, the Thinker and Talker perform asynchronous prefilling: the Thinker's output representations from a completed chunk immediately prefill the Talker's current chunk while the Thinker moves on to its next chunk. This reduces Time-To-First-Token (TTFT).
  - MoE Architecture: Both Thinker and Talker use MoE designs. MoE models activate only a subset of experts per input, which significantly decreases IO consumption from the KV cache (key-value cache) when processing long sequences, improving throughput (tokens per second).
- Streaming Multi-Codebook Codec Generation:
  - A left-context-only multi-codebook generation mechanism is employed.
  - Once the Talker generates the first token, the MTP module predicts the remaining tokens for the current frame.
  - These tokens are then decoded into a waveform by a streaming multi-codebook codec decoder that attends only to the left context.
  - Unlike Qwen2.5-Omni, which required sufficient block context before synthesis, Qwen3-Omni can output the waveform immediately after the Talker generates each token, significantly reducing first-packet latency.
- Lightweight MTP Module and ConvNet:
  - MTP module: an ultra-lightweight, fixed-step autoregressive dense transformer. Its low memory-bandwidth requirements and fixed-step autoregressive inference enable efficient batch processing and low latency in high-concurrency scenarios.
  - Codec decoder (Code2Wav ConvNet): achieves high throughput with low latency because its convolutional architecture benefits from extensive hardware acceleration and efficient batched inference.

The theoretical end-to-end first-packet latency is the sum of the Thinker-Talker tail-packet preprocessing latency, the Thinker time-to-first-token, the Talker time-to-first-token, the MTP module's per-token time cost (14 ms), and the codec decoder's per-code time cost (3 ms). For Qwen3-Omni-30B-A3B at concurrency 1, this totals 234 ms for audio input (547 ms for video). The generation Real-Time Factor (RTF) remains below 1 across varying concurrency levels, ensuring continuous streaming (a small back-of-the-envelope check follows this section).
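As referenced above, the streaming condition can be checked with simple arithmetic: as long as the total per-token generation time stays under the 80 ms of audio each token represents, the RTF stays below 1 and playback never stalls. The Thinker/Talker per-token timings below are placeholders, not measurements from the paper; only the 14 ms (MTP) and 3 ms (codec) figures come from the text.

```python
MS_AUDIO_PER_TOKEN = 80.0   # each codec token covers 80 ms of audio (12.5 Hz token rate)

def generation_rtf(thinker_ms, talker_ms, mtp_ms=14.0, codec_ms=3.0):
    """RTF = time to produce one token's worth of audio / the 80 ms it represents."""
    per_token_ms = thinker_ms + talker_ms + mtp_ms + codec_ms
    return per_token_ms / MS_AUDIO_PER_TOKEN

# Hypothetical per-token decode times for the Thinker and Talker (not from the paper):
rtf = generation_rtf(thinker_ms=20.0, talker_ms=15.0)
print(f"RTF = {rtf:.2f} -> {'streams in real time' if rtf < 1 else 'falls behind real time'}")
```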
4.2.7. Pretraining
The pretraining of Qwen3-Omni is structured into three distinct stages:
- Encoder Alignment Stage (S1):
  - Initialization: The LLM component is initialized with parameters from Qwen3 (Yang et al., 2025a); the vision encoder is adopted from Qwen3-VL, and the audio encoder is initialized with AuT.
  - Training Strategy: The two encoders (vision and audio) are trained separately on the fixed LLM; training initially focuses on their respective adapters before training the encoders themselves.
  - Rationale: This avoids a common issue where jointly training encoders and adapters against a frozen LLM lets the encoders compensate for LLM limitations, degrading perception capabilities.
- General Stage (S2):
  - Unfreezing: All parameters (the LLM and the encoders) are unfrozen.
  - Dataset: A large-scale dataset of approximately 2 trillion tokens with a diverse distribution across modalities:
    - Text: 0.57 trillion tokens
    - Audio: 0.77 trillion tokens
    - Image: 0.82 trillion tokens
    - Video: 0.05 trillion tokens
    - Video-Audio: 0.05 trillion tokens
  - Objective: To enhance the model's understanding and interaction capabilities across auditory, visual, textual, and audiovisual information through a wider range of multimodal data and tasks.
- Long Context Stage (S3):
  - Increased Sequence Length: The maximum token length is increased from 8,192 to 32,768.
  - Data Proportion Adjustment: The proportion of long audio and long video in the training data is increased.
  - Objective: To improve the model's ability to understand complex long-sequence data.

During pretraining, a wider range of natural language prompts is used than in Qwen2.5-Omni (which used a single prompt per task) to enhance generalization and instruction-following capabilities.
4.2.8. Post-training (Thinker)
The Thinker undergoes a three-stage post-training process to acquire instruction-following capabilities:
- Supervised Fine-Tuning (SFT):
  - Purpose: To bridge the gap between pretrained representations and downstream task requirements through targeted instruction optimization.
  - Data: Designed in ChatML (OpenAI, 2022) format, covering pure-text dialogue, visual-modality conversation, audio-modality conversation, and mixed-modality conversation data.
  - Strategy: A lightweight SFT that deliberately diverges from the pretraining data schema while maintaining architectural consistency, enabling efficient knowledge transfer.
- Strong-to-Weak Distillation: Adopts the pipeline described in Qwen3 (Yang et al., 2025a); a minimal sketch of the on-policy loss follows this list.
  - Off-policy Distillation:
    - Process: Combines outputs generated by teacher models to provide response distillation.
    - Goal: Helps lightweight student models (like Qwen3-Omni) acquire fundamental reasoning abilities.
  - On-policy Distillation:
    - Process: The student model generates responses from sampled prompts; these on-policy sequences are then used for fine-tuning.
    - Objective: Align the student's predicted logits with those of a teacher model (e.g., Qwen3-32B or Qwen3-235B-A22B) by minimizing the KL divergence
      $
      D_{KL}(P \| Q) = \sum_{i} P(i) \log\left(\frac{P(i)}{Q(i)}\right)
      $
      where $P$ is the probability distribution of the teacher model and $Q$ is that of the student model. The KL divergence measures how one probability distribution diverges from a second, expected distribution.
- Generative Sequence Policy Optimization (GSPO):
  - Purpose: To comprehensively enhance the model's capabilities and stability across modalities (text, image, video, audio).
  - Feedback: Uses two types of rewards:
    - Rule-based Reward: For verifiable multimodal tasks (e.g., mathematics, coding, instruction following). The reward signal is derived from predefined rules, assessing output correctness and preventing reward hacking.
    - Model-based Reward: For multimodal tasks lacking objective, predefined evaluation metrics, using an LLM-as-a-judge protocol.
      - Evaluator: Qwen3 for general tasks, Qwen2.5-VL for visually-grounded tasks.
      - Robustness: The LLM evaluator is furnished with ground-truth or reference answers where applicable.
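As referenced in the distillation stage above, a minimal sketch of an on-policy logit-distillation step: the student's per-token distribution is pulled toward the teacher's with a KL term. The shapes, temperature, and tensors are illustrative assumptions, not the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary, averaged across tokens."""
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # F.kl_div takes the student log-probs as input and the teacher as a (log) target.
    return F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")

# Illustrative shapes: 10 token positions, vocabulary of 100.
student_logits = torch.randn(10, 100, requires_grad=True)
teacher_logits = torch.randn(10, 100)
loss = distillation_kl_loss(student_logits, teacher_logits)
loss.backward()                      # gradients flow only into the student
print(float(loss))
```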
4.2.9. Post-training (Talker)
The Talker undergoes a four-stage training process for speech generation:
- Initial Training:
  - Data: Hundreds of millions of speech samples paired with multimodal context.
  - Objective: Establish a monotonic mapping from multimodal representations to speech.
- Continual Pretraining (CPT) and Long-context Training:
  - CPT: Performed with high-quality data to alleviate hallucinations caused by noisy data in the first stage and to significantly improve generated speech quality.
  - Long-context Training: Conducted concurrently to enhance the Talker's ability to process extended, complex inputs and to generate contextually appropriate speech responses.
- Direct Preference Optimization (DPO):
  - Purpose: Improve the generalization of multilingual speech generation and system stability.
  - Process: Constructs preference pairs from diverse multilingual speech samples and optimizes the model with DPO (Rafailov et al., 2023), which directly optimizes the policy to prefer winning responses over losing ones relative to a reference policy $\pi_r$ (a code sketch of this loss follows this list):
    $
    \mathcal{L}_{DPO}(\pi_\theta; (\mathbf{x}, \mathbf{y}_w, \mathbf{y}_l)) = -\log \sigma\left(\beta \left[ \log \frac{\pi_\theta(\mathbf{y}_w|\mathbf{x})}{\pi_r(\mathbf{y}_w|\mathbf{x})} - \log \frac{\pi_\theta(\mathbf{y}_l|\mathbf{x})}{\pi_r(\mathbf{y}_l|\mathbf{x})} \right]\right)
    $
    Where:
    - $\mathbf{x}$ is the prompt.
    - $\mathbf{y}_w$ is the preferred (winning) response.
    - $\mathbf{y}_l$ is the dispreferred (losing) response.
    - $\pi_\theta$ is the current policy (model).
    - $\pi_r$ is the reference policy (usually the SFT model).
    - $\beta$ is a temperature hyperparameter.
    - $\sigma$ is the sigmoid function.
    - The loss penalizes assigning relatively higher likelihood to the dispreferred response than to the preferred one.
- Speaker Fine-Tuning:
  - Purpose: Enables the Talker to adopt specific voices while refining the naturalness, expressiveness, and controllability of its speech responses.
  - Process: Applied on top of the base model obtained from the preceding stages.
4.2.10. Captioner
The Qwen3-Omni-30B-A3B-Captioner is developed to address the lack of general-purpose audio captioning models.
- Development: Created by fine-tuning the base Qwen3-Omni-30B-A3B model.
- Data: Fine-tuned on a large-scale dataset of detailed audio descriptions.
- Output: Produces detailed, low-hallucination captions for arbitrary audio inputs.
5. Experimental Setup
The experiments evaluate Qwen3-Omni's ability to comprehend various multimodal inputs and generate textual or speech responses. The evaluation includes Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and two Flash variants (Qwen3-Omni-Flash-Instruct, Qwen3-Omni-Flash-Thinking) designed for improved computational efficiency and performance, notably supporting various dialects.
5.1. Datasets
The evaluation covers a wide range of benchmarks across different modalities:
- Text-Text:
  - General Tasks: MMLU-Redux (Gema et al., 2024), GPQA (Rein et al., 2023).
  - Reasoning: AIME25 (AIME, 2025), ZebraLogic (Lin et al., 2025), LiveBench 20241125.
  - Coding: MultiPL-E (Cassano et al., 2023).
  - Alignment: IFEval (Zhou et al., 2023), Creative Writing V3 (Paech, 2024), WritingBench (Wu et al., 2025b), Arena-Hard v2.
  - Agent: BFCL-v3 (Yan et al., 2024).
  - Multilingual: MultiIF (He et al., 2024), PolyMath (Wang et al., 2025c), MGSM, INCLUDE.
- Audio-Text:
  - Basic Audio Tasks: Automatic Speech Recognition (ASR) and Speech-to-Text Translation (S2TT).
    - English & Chinese ASR: Wenetspeech (net and meeting), CV15-en, CV15-zh, Fleurs-en, Fleurs-zh.
    - Multilingual ASR: Fleurs-avg over 19 languages (Arabic, German, English, Spanish, French, Indonesian, Italian, Japanese, Korean, Malay, Dutch, Portuguese, Russian, Thai, Turkish, Urdu, Vietnamese, Cantonese, Mandarin).
    - Lyric ASR: MIR-1K (vocal-only), Opencpop-test.
    - S2TT: Fleurs-en2xx, Fleurs-xx2en, Fleurs-zh2xx, Fleurs-xx2zh (where xx denotes other languages).
  - Advanced Audio Tasks:
    - Voice Chatting: VoiceBench (Chen et al., 2024b), which includes sub-benchmarks such as AlpacaEval, CommonEval, WildVoice, SDD-QA, MMSU, OpenBookQA, BH, IFEval, and AdvBench.
    - Audio Reasoning: MMAU (Sakshi et al., 2024), MMSU (Wang et al., 2025a).
    - Music Understanding: RUL-MuchoMusic (Zang et al., 2025), GTZAN (Tzanetakis & Cook, 2002), four subsets of MTG-Jamendo (Bogdanov et al., 2019), and MagnaTagATune (Law et al., 2009). The evaluation-set composition for GTZAN, MTG-Jamendo, and MagnaTagATune follows MARBLE (Yuan et al., 2023).
- Vision-Text:
  - General Visual Question Answering: MMStar (Chen et al., 2024a), HallusionBench (Guan et al., 2024), MM-MT-Bench (Agrawal et al., 2024), RealWorldQA (avg).
  - Mathematical & STEM Reasoning: MathVista (Lu et al., 2024), MathVision (Wang et al., 2024a), MMMU (Yue et al., 2023), MMMU-Pro (Yue et al., 2024).
  - Document Understanding: AI2D (Kembhavi et al., 2016), ChartQA (Masry et al., 2022), TextVQA-val, DocVQA-test, InfoVQA-test.
  - Counting: CountBench (Paiss et al., 2023).
  - Video Understanding: Video-MME (Fu et al., 2024), LVBench (Wang et al., 2024b), MLVU (Zhou et al., 2025a), MVBench.
  - OCR-related Tasks: AI2D, TextVQA-val, DocVQA-test, InfoVQA-test, ChartQA-test (avg), OCRBench.
- AudioVisual Video-Text:
  - General Understanding: WorldSense (Hong et al., 2025).
  - Audiovisual Reasoning: DailyOmni (Zhou et al., 2025b), VideoHolmes (Cheng et al., 2025).
- Speech Generation:
  - Zero-Shot Speech Generation: SEED (Anastassiou et al., 2024) (test-zh, test-en).
  - Multilingual Speech Generation: MiniMax multilingual test set (Zhang et al., 2025).
  - Cross-Lingual Speech Generation: CV3-Eval (Du et al., 2025).
Example of Data Sample (from qualitative results section):
The paper provides an example of expressive speech for Qwen3-Omni-30B-A3B-Captioner:
"The audio clip opens in a studio setting, marked by a faint, persistent electronic hiss and a subtle low-frequency hum... The male speaker, whose voice is delivered in a clear, energetic, and highly theatrical manner, begins with an assertive 'Right!', delivered with a sharp, rising intonation..." This illustrates how the model processes complex audio scenes and generates detailed descriptions.
These datasets are chosen to validate the model's performance across a broad spectrum of tasks, from basic perception (ASR, image recognition) to complex reasoning (multimodal Q&A, audio reasoning) and generation (text, speech). They represent a comprehensive suite of benchmarks in the multimodal AI field.
5.2. Evaluation Metrics
The paper uses various evaluation metrics tailored to each task:
- Word Error Rate (WER): Primarily used for Automatic Speech Recognition (ASR) tasks. It measures the number of errors (substitutions, insertions, deletions) in a transcribed sequence compared to a reference sequence; a lower WER indicates better performance. (A small worked implementation appears after this list.)
  - Conceptual Definition: WER quantifies the accuracy of a speech recognition system by comparing the machine-generated transcription to a human-created reference transcription. It directly measures how many words were incorrectly recognized, added, or missed.
  - Mathematical Formula:
    $
    WER = \frac{S + D + I}{N}
    $
    Where:
    - $S$ is the number of substitutions (incorrectly recognized words).
    - $D$ is the number of deletions (reference words missed by the system).
    - $I$ is the number of insertions (words added by the system that are not in the reference).
    - $N$ is the total number of words in the reference transcription.
- Bilingual Evaluation Understudy (BLEU): Used for Speech-to-Text Translation (S2TT) tasks to evaluate machine-translated text against human reference translations; a higher BLEU score indicates better translation quality.
  - Conceptual Definition: BLEU evaluates the quality of text machine-translated from one natural language to another by measuring the correspondence between the machine's output and human reference translations via matching n-grams.
  - Mathematical Formula:
    $
    BLEU = BP \cdot \exp \left( \sum_{n=1}^{N} w_n \log p_n \right)
    $
    Where:
    - $BP$ is the brevity penalty, which penalizes overly short machine translations.
    - $N$ is the maximum n-gram length (typically 4).
    - $w_n$ is the weight for each n-gram order (typically $1/N$).
    - $p_n$ is the clipped n-gram precision:
      $
      p_n = \frac{\sum_{\text{sentence} \in \text{candidate}} \sum_{n\text{-gram} \in \text{sentence}} \text{Count}_{\text{clip}}(n\text{-gram})}{\sum_{\text{sentence} \in \text{candidate}} \sum_{n\text{-gram} \in \text{sentence}} \text{Count}(n\text{-gram})}
      $
      where $\text{Count}_{\text{clip}}$ ensures that an n-gram is counted no more times than it appears in any single reference translation.
- Micro F1 Score (Micro F1): Used for multi-label classification tasks such as music understanding (e.g., genre, mood, instrument recognition); a higher Micro F1 indicates better performance.
  - Conceptual Definition: Micro F1 is a global version of the F1 score that aggregates the contributions of all classes, summing true positives, false positives, and false negatives across labels before computing precision and recall, which makes it well suited to multi-label classification.
  - Mathematical Formula:
    $
    \mathrm{MicroF1} = \frac{2 \times \mathrm{MicroPrecision} \times \mathrm{MicroRecall}}{\mathrm{MicroPrecision} + \mathrm{MicroRecall}}
    $
    where micro precision and micro recall are computed from the true positives ($TP_i$), false positives ($FP_i$), and false negatives ($FN_i$) summed over all classes $i$.
- Accuracy (Acc.): Used for classification tasks (e.g., GTZAN music genre identification) and general task evaluation (e.g., MMLU, GPQA). A higher accuracy indicates better performance.
  - Conceptual Definition: Accuracy is the ratio of correctly predicted observations to the total observations; it is a straightforward measure of overall correctness in a classification problem.
  - Mathematical Formula:
    $
    Accuracy = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
    $
- Speaker Similarity (SIM): Used for speech generation tasks, particularly zero-shot, multilingual, and cross-lingual voice cloning. It quantifies how well the generated speech's voice characteristics (e.g., timbre, pitch, intonation) match those of a reference speaker; higher values indicate better similarity. The paper does not provide an explicit formula, but SIM is typically the cosine similarity between speaker embeddings extracted by a pre-trained speaker verification model.
  - Conceptual Definition: Speaker Similarity assesses the degree to which a synthesized voice retains the identity and characteristics of a target speaker from a given reference audio, which is crucial for applications such as voice cloning.
- Generation Real Time Factor (RTF): Used to evaluate the efficiency of speech generation systems in streaming scenarios. RTF is the ratio of the time taken to generate an audio segment to the actual duration of that segment; an RTF below 1 means the system generates audio faster than real time.
  - Conceptual Definition: RTF indicates how fast a speech synthesis system can produce audio relative to the length of the audio it produces; an RTF of 0.5 means the system needs half the audio's duration to generate it, implying real-time streaming is possible.
  - Mathematical Formula (as implied by the paper):
    $
    RTF = \frac{\text{Time taken to generate 80 ms of audio}}{\text{80 ms}}
    $
    where the generation time includes the combined time for the Thinker and Talker to generate one token, plus the per-token processing time of the MTP module and the codec decoder.
5.3. Baselines
Qwen3-Omni is compared against a wide array of strong baselines, including both open-source and closed-source models, as well as specialist and generalist models:
- Text-Text:
  - GPT-4o-0327 (OpenAI, 2024): a strong closed-source multimodal LLM.
  - Gemini-2.5-Flash Thinking (Gemini Team, 2024): a closed-source multimodal LLM with reasoning capabilities.
  - Qwen3-235B-A22B Non-Thinking & Thinking (Yang et al., 2025a): larger, text-only Qwen3 variants.
  - Qwen3-30B-A3B-Instruct-2507 & Thinking-2507: same-sized text-only Qwen3 variants, serving as direct unimodal counterparts.
  - InternVL-3.5-241B-A28B: a large vision-language reasoning model.
- Audio-Text (ASR & S2TT):
  - Seed-ASR: a specialist ASR model.
  - Voxtral-Mini, Voxtral-Small: specialist ASR models.
  - GPT-4o-Transcribe (OpenAI, 2024): the ASR component of GPT-4o.
  - Gemini-2.5-Pro (Gemini Team, 2024): a strong closed-source multimodal model.
  - Qwen2.5-Omni (Xu et al., 2025): the predecessor multimodal model from the same team.
- Audio-Text (Voice Interaction & Audio Reasoning):
  - GPT-4o-Audio (OpenAI, 2024): the audio capabilities of GPT-4o.
  - Gemini-2.5-Flash, Gemini-2.5-Pro (Gemini Team, 2024): closed-source multimodal models.
  - Qwen2.5-Omni (Xu et al., 2025): the predecessor model.
- Audio-Text (Music Understanding):
  - Best specialist models: Audio Flamingo 3 (Goel et al., 2025), CLaMP 3 (Wu et al., 2025a), MuQ-MuLan (Zhu et al., 2025), and MuQ (Zhu et al., 2025), highly specialized models for music understanding tasks.
  - GPT-4o-Audio (OpenAI, 2024), Gemini-2.5-Pro (Gemini Team, 2024), Qwen2.5-Omni (Xu et al., 2025): generalist multimodal models.
- Vision-Text:
  - GPT-4o (OpenAI, 2024), Gemini-2.0-Flash (Gemini Team, 2024): closed-source vision-language models.
  - Qwen2.5-VL-72B (Bai et al., 2025): a larger vision-language model from the Qwen series.
  - Gemini-2.5-Flash-Thinking, InternVL-3.5-241B-A28B: strong reasoning-focused vision-language models.
- AudioVisual Video-Text:
  - Previous open-source SoTA: (Yang et al., 2025b) and (Tang et al., 2025) for WorldSense, DailyOmni, and VideoHolmes.
  - Gemini-2.5-Flash (Gemini Team, 2024): closed-source multimodal model.
  - Qwen2.5-Omni (Xu et al., 2025): the predecessor model.
- Speech Generation:
  - Seed-TTS (ICL and RL variants) (Anastassiou et al., 2024): state-of-the-art zero-shot TTS systems.
  - MaskGCT (Wang et al., 2024c), E2 TTS (Eskimez et al., 2024), F5-TTS (Chen et al., 2024c), Spark TTS (Wang et al., 2025b), CosyVoice 2 (Du et al., 2024), CosyVoice 3 (Du et al., 2025), Qwen2.5-Omni-7B (Xu et al., 2025): various TTS models.
  - MiniMax-Speech (Zhang et al., 2025), ElevenLabs Multilingual v2: multilingual speech generation models.
  - CosyVoice 2, CosyVoice 3: cross-lingual speech generation baselines.

These baselines provide a comprehensive comparison against both general-purpose LLMs and specialized models across modalities, supporting a robust evaluation of Qwen3-Omni's state-of-the-art claims and its non-degradation property.
6. Results & Analysis
The experimental evaluation comprehensively assesses Qwen3-Omni's performance across various multimodal tasks, including Text-Text, Audio-Text, Vision-Text, AudioVisual Video Text, and Speech Generation. The results consistently demonstrate its strong capabilities and validate the core claims of non-degradation and state-of-the-art performance, particularly in audio.
6.1. Core Results Analysis
6.1.1. Performance of Text-Text
The following are the results from Table 4 of the original paper:
| Category | Benchmark | GPT-4o-0327 | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
| General Tasks | MMLU-Redux | 91.3 | 89.2 | 89.3 | 86.6 | 86.8 |
| | GPQA | 66.9 | 62.9 | 70.4 | 69.6 | 69.7 |
| Reasoning | AIME25 | 26.7 | 24.7 | 61.3 | 65.0 | 65.9 |
| | ZebraLogic | 52.6 | 37.7 | 90.0 | 76.0 | 76.1 |
| Code | MultiPL-E | 82.7 | 79.3 | 83.8 | 81.4 | 81.5 |
| Alignment Tasks | IFEval | 83.9 | 83.2 | 84.7 | 81.0 | 81.7 |
| | Creative Writing v3 | 84.9 | 80.4 | 86.0 | 80.6 | 81.8 |
| Agent | WritingBench | 75.5 | 77.0 | 85.5 | 82.6 | 83.0 |
| | BFCL-v3 | 66.5 | 68.0 | 65.1 | 64.4 | 65.0 |
| Multilingual Tasks | MultiIF | 70.4 | 70.2 | 67.9 | 64.0 | 64.7 |
| | PolyMATH | 25.5 | 27.0 | 43.1 | 37.9 | 39.3 |
The following are the results from Table 5 of the original paper:
| Category | Benchmark | Gemini-2.5-Flash Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B-Thinking-2507 | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
| General Tasks | MMLU-Redux | 92.1 | 92.7 | 91.4 | 88.8 | 89.7 |
| | GPQA | 82.8 | 71.1 | 73.4 | 73.1 | 73.1 |
| Reasoning | AIME25 | 72.0 | 81.5 | 85.0 | 73.7 | 74.0 |
| | LiveBench 20241125 | 74.3 | 77.1 | 76.8 | 71.8 | 70.3 |
| Code | MultiPL-E | 84.5 | 79.9 | 81.3 | 80.6 | 81.0 |
| Alignment Tasks | IFEval | 89.8 | 83.4 | 88.9 | 85.1 | 85.2 |
| | Arena-Hard v2 | 56.7 | 61.5 | 56.0 | 55.1 | 57.8 |
| | Creative Writing v3 | 85.0 | 84.6 | 84.4 | 82.5 | 83.6 |
| Agent | WritingBench | 83.9 | 80.3 | 85.0 | 85.5 | 85.9 |
| | BFCL-v3 | 68.6 | 70.8 | 72.4 | 63.2 | 64.5 |
| Multilingual Tasks | MultiIF | 74.4 | 71.9 | 76.4 | 72.9 | 73.2 |
| | PolyMATH | 49.8 | 54.7 | 52.6 | 47.1 | 48.7 |
- Qwen3-Omni-30B-A3B-Instruct demonstrates strong performance, surpassing the larger open-source Qwen3-235B-A22B Non-Thinking and even the closed-source GPT-4o-0327 on several benchmarks such as GPQA, AIME25, ZebraLogic, WritingBench, and PolyMATH. This is a significant finding, indicating that the multimodal nature does not degrade text performance and can even enhance it.
- The Instruct variant also performs comparably to its text-only counterpart, Qwen3-30B-A3B-Instruct-2507, reinforcing the non-degradation claim for text capabilities.
- Qwen3-Omni-30B-A3B-Thinking exhibits performance comparable to Gemini-2.5-Flash-Thinking and Qwen3-235B-A22B Thinking, especially on reasoning tasks like AIME25.
- The Flash variants (Qwen3-Omni-Flash-Instruct and Qwen3-Omni-Flash-Thinking) generally maintain performance close to their non-Flash counterparts, suggesting improved efficiency without significant accuracy loss.
6.1.2. Performance of Audio-Text
The following are the results from Table 6 of the original paper:
| Benchmark | Seed-ASR | Voxtral-Mini | Voxtral-Small | GPT-4o-Transcribe | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
| EN & ZH ASR (WER) | | | | | | | | |
| Wenetspeech net / meeting | 4.66 / 5.69 | 24.30 / 31.53 | 20.33 / 26.08 | 15.30 / 32.27 | 14.43 / 13.47 | 5.91 / 17.65 | 4.69 / 5.89 | 4.62 / 5.75 |
| Librispeech clean / other | 1.58 / 2.84 | 1.88 / 4.12 | 1.56 / 3.30 | 1.39 / 3.75 | 2.89 / 3.56 | 1.74 / 3.45 | 1.22 / 2.48 | 1.27 / 2.44 |
| CV15-en | - | 9.47 | 7.79 | 10.01 | 9.89 | 7.61 | 6.05 | 5.94 |
| CV15-zh | - | 24.67 | 19.30 | 9.84 | 8.00 | 5.13 | 4.31 | 4.28 |
| Fleurs-en | 3.40 | 3.96 | 3.77 | 3.32 | 2.94 | 3.77 | 2.72 | 2.74 |
| Fleurs-zh | 2.69 | 12.22 | 7.98 | 2.44 | 2.71 | 2.54 | 2.20 | 2.19 |
| Multilingual ASR (WER) | | | | | | | | |
| Fleurs-avg (19 lang) | - | 15.67 | 8.09 | 4.48 | 5.55 | 14.04 | 5.33 | 5.31 |
| Lyric ASR (WER) | | | | | | | | |
| MIR-1K (vocal-only) | 6.45 | 23.33 | 18.73 | 11.87 | 9.85 | 8.15 | 5.90 | 5.85 |
| Opencpop-test | 2.98 | 31.01 | 16.06 | 7.93 | 6.49 | 2.84 | 1.54 | 2.02 |
| S2TT (BLEU) | | | | | | | | |
| Fleurs-en2xx | - | 30.35 | 37.85 | - | 39.25 | 29.22 | 37.50 | 36.22 |
| Fleurs-xx2en | - | 27.54 | 32.81 | - | 35.41 | 28.61 | 31.08 | 30.71 |
| Fleurs-zh2xx | - | 17.03 | 22.05 | - | 26.63 | 17.97 | 25.17 | 25.10 |
| Fleurs-xx2zh | - | 28.75 | 34.82 | - | 37.50 | 27.68 | 33.13 | 31.19 |
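For context on how the numbers above are produced: WER is the word-level edit distance between hypothesis and reference transcripts normalized by the reference length, and S2TT is scored with corpus-level BLEU. A minimal sketch using the commonly used jiwer and sacrebleu packages (an illustrative assumption; the report does not name its scoring toolchain):

```python
# How WER / BLEU figures like those in Table 6 are conventionally computed.
# jiwer and sacrebleu are assumed here for illustration only.
import jiwer
import sacrebleu

# ASR: WER = (substitutions + deletions + insertions) / reference word count.
reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(f"WER = {jiwer.wer(reference, hypothesis):.2%}")  # 2 edits / 9 words ≈ 22.22%

# S2TT: corpus-level BLEU between translation hypotheses and references.
hyps = ["the cat sits on the mat"]
refs = [["the cat sat on the mat"]]  # one reference stream, parallel to hyps
print(f"BLEU = {sacrebleu.corpus_bleu(hyps, refs).score:.1f}")
```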
- ASR (Word Error Rate, WER): Qwen3-Omni-Instruct achieves state-of-the-art performance across both English and Chinese ASR benchmarks (Librispeech, Wenetspeech, Fleurs, CommonVoice). For instance, on Librispeech clean it scores 1.22, outperforming Seed-ASR (1.58), GPT-4o-Transcribe (1.39), and Gemini-2.5-Pro (2.89).
- Multilingual ASR: It shows competitive performance on Fleurs-avg (19 languages) with 5.33 WER, outperforming Gemini-2.5-Pro (5.55) and, by a large margin, Qwen2.5-Omni (14.04).
- Lyric ASR: It achieves SOTA on MIR-1K (vocal-only) (5.90 WER) and Opencpop-test (1.54 WER), again outperforming strong baselines.
- S2TT (BLEU): Qwen3-Omni-Instruct delivers better or comparable BLEU scores for Speech-to-Text Translation; on Fleurs-en2xx (37.50) it is comparable to Gemini-2.5-Pro (39.25) and outperforms Voxtral-Small (37.85).

The following are the results from Table 7 of the original paper:
| Benchmark | GPT-4o-Audio | Gemini-2.5-Flash | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Instruct | Qwen3-Omni-Flash-Thinking |
| VoiceBench | | | | | | | | |
| AlpacaEval | 95.6 | 96.1 | 94.3 | 89.9 | 94.8 | 96.4 | 95.4 | 96.8 |
| CommonEval | 89.8 | 88.3 | 88.4 | 76.7 | 90.8 | 90.5 | 91.0 | 90.9 |
| WildVoice | 91.6 | 92.1 | 93.4 | 77.7 | 91.6 | 90.5 | 92.3 | 90.9 |
| SDD-QA | 75.5 | 84.5 | 90.1 | 56.4 | 76.9 | 78.1 | 76.8 | 78.5 |
| MMSU | 80.3 | 66.1 | 71.1 | 61.7 | 68.1 | 83.0 | 68.4 | 84.3 |
| OpenBookQA | 89.2 | 56.9 | 92.3 | 80.9 | 89.7 | 94.3 | 88.9 | 91.4 |
| AdvBench | 98.7 | 98.9 | 98.1 | 99.2 | 99.3 | 97.2 | 99.4 | 98.9 |
| Overall | 86.8 | 83.4 | 89.6 | 73.6 | 85.5 | 88.8 | 85.6 | 89.5 |
| Audio Reasoning | | | | | | | | |
| MMAU-v05.15.25 | 62.5 | 71.8 | 77.4 | 65.5 | 77.5 | 75.4 | 77.6 | 76.5 |
| MMSU | 56.4 | 70.2 | 77.7 | 62.6 | 69.0 | 70.2 | 69.1 | 71.3 |

(BBH and IFEval rows omitted; see Table 7 of the original paper.)
- VoiceBench: Qwen3-Omni-Thinking achieves an impressive average score of 89.5, almost matching Gemini-2.5-Pro (89.6) and surpassing all other audio language models. This highlights its strong capabilities in speech interaction.
- Audio Reasoning: Qwen3-Omni (both Instruct and Thinking variants) demonstrates impressive performance, outperforming Gemini-2.5-Pro and Gemini-2.5-Flash on MMAU, and Gemini-2.5-Flash and GPT-4o-Audio on MMSU. For instance, Qwen3-Omni-Flash-Thinking scores 76.5 on MMAU-v05.15.25 and 71.3 on MMSU, outperforming most baselines. This confirms its powerful general audio understanding and reasoning abilities.

The following are the results from Table 8 of the original paper:
| Benchmark | Best Specialist Model | GPT-4o-Audio | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
| RUL-MuchoMusic | 47.6 (Audio Flamingo 3; Goel et al., 2025) | 36.1 | 49.4 | 47.3 | 52.0 | 52.1 |
| GTZAN Acc. | 87.9 (CLaMP 3; Wu et al., 2025a) | 76.5 | 81.0 | 81.7 | 93.0 | 93.1 |
| MTG Genre Micro F1 | 35.8 (MuQ-MuLan; Zhu et al., 2025) | 25.3 | 32.6 | 32.5 | 39.0 | 39.5 |
| MTG Mood/Theme Micro F1 | 10.9 (MuQ-MuLan; Zhu et al., 2025) | 11.3 | 14.1 | 8.9 | 21.0 | 21.7 |
| MTG Instrument Micro F1 | 39.8 (MuQ-MuLan; Zhu et al., 2025) | 34.2 | 33.0 | 22.6 | 40.5 | 40.7 |
| MTG Top50 Micro F1 | 33.2 (MuQ-MuLan; Zhu et al., 2025) | 25.0 | 26.1 | 21.6 | 36.7 | 36.9 |
| MagnaTagATune Micro F1 | 41.6 (MuQ; Zhu et al., 2025) | 29.2 | 28.1 | 30.1 | 44.3 | 46.8 |
- Music Understanding: Qwen3-Omni-Instruct achieves SOTA on RUL-MuchoMusic (52.0), surpassing specialist models like Audio Flamingo 3 (47.6) and strong generalists like Gemini-2.5-Pro (49.4).
- It also significantly outperforms other audio language models and even self-supervised music specialist models on GTZAN (93.0 accuracy), MTG-Jamendo (various micro-F1 scores; see the brief metric sketch below), and MagnaTagATune (44.3 micro F1). This demonstrates superior capabilities across diverse music understanding tasks.
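The MTG-Jamendo and MagnaTagATune scores above are micro-averaged F1 over multi-label tag predictions. A minimal sketch of that metric, with scikit-learn used purely as an illustrative assumption:

```python
# Micro-averaged F1 for multi-label music tagging (MTG-Jamendo, MagnaTagATune).
# scikit-learn is used here only as an illustrative assumption.
import numpy as np
from sklearn.metrics import f1_score

# Rows = audio clips, columns = tags (genre / mood / instrument labels).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])

# Micro averaging pools true/false positives and false negatives over every
# clip-tag pair before computing F1, so frequent tags dominate the score.
print(round(f1_score(y_true, y_pred, average="micro"), 3))  # 0.833
```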
6.1.3. Performance of Vision-Text
The following are the results from Table 9 of the original paper:
| Datasets | GPT-4o | Gemini-2.0-Flash | Qwen2.5-VL-72B | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
| General Visual Question Answering | | | | | |
| MMStar | 64.7 | 71.4 | 70.8 | 68.5 | 69.3 |
| HallusionBench | 55.0 | 56.3 | 55.2 | 59.7 | 60.4 |
| MM-MT-Bench | 7.7 | 6.7 | 7.6 | 7.4 | 7.6 |
| Math & STEM | |||||
| MMMUval | 69.1 | 71.3 | 70.2 | 69.1 | 69.8 |
| MMMU-Prooverall | 51.9 | 56.1 | 51.1 | 57.0 | 58.2 |
| MathVistamini | 63.8 | 71.4 | 74.8 | 75.9 | 77.4 |
| MATH-Visionfull | 30.4 | 48.6 | 38.1 | 56.3 | 57.3 |
| Documentation Understanding | |||||
| AI2Dw.M. | 84.6 | 86.7 | 88.7 | 85.2 | 86.4 |
| ChartQAtest Avg. | 86.7 | 64.6 | 89.5 | 86.8 | 87.1 |
| Counting | |||||
| CountBench | 87.9 | 91.2 | 93.6 | 90.0 | 90.0 |
| Video Understanding | |||||
| Video-MMEw/o sub | 71.9 | 72.4 | 73.3 | 70.5 | 71.4 |
| LVBench | 30.8 | 57.9 | 47.3 | 50.2 | 51.1 |
| MLVU | 64.6 | 71.0 | 74.6 | 75.2 | 75.7 |
The following are the results from Table 10 of the original paper:
| Datasets | Gemini-2.5-Flash-Thinking | InternVL-3.5-241B-A28B | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
| General Visual Question Answering | ||||
| MMStar | 75.5 | 77.9 | 74.9 | 75.5 |
| HallusionBench | 61.1 | 57.3 | 62.8 | 63.4 |
| MM-MT-Bench | 7.8 | − | 8.0 | 8.0 |
| Math & STEM | ||||
| MMMUval | 76.9 | 77.7 | 75.6 | 75.0 |
| MMMU-Prooverall | 65.8 | 60.5 | 60.8 | |
| MathVistamini | 77.6 | 82.7 | 80.0 | 81.2 |
| MATH-Visionfull | 62.3 | 63.9 | 62.9 | 63.8 |
| Documentation Understanding | ||||
| AI2Dw.M. | 88.6 | 87.3 | 86.1 | 86.8 |
| ChartQAtest Avg. | 88.0 | 89.5 | 89.3 | |
| Counting | ||||
| CountBench | 88.6 | 88.6 | 92.5 | |
| Video Understanding | ||||
| Video-MMEw/o sub | 79.6 | 72.9 | 69.7 | 69.8 |
| LVBench | 64.5 | 49.0 | 49.5 | |
| MLVU | 82.1 | 78.2 | 72.9 | 73.9 |
- Qwen3-Omni-Instruct demonstrates performance comparable to Qwen2.5-VL-72B, and on certain benchmarks such as HallusionBench (59.7), MMMU-Pro overall (57.0), MathVistamini (75.9), and MATH-Visionfull (56.3) it outperforms GPT-4o and Gemini-2.0-Flash. This indicates excellent image understanding and reasoning.
- Qwen3-Omni-Thinking shows significant advancements, outperforming the Instruct baseline by 4.4 points on Math and STEM benchmarks. It achieves performance levels on par with substantially larger baselines, showcasing its effectiveness and computational efficiency. For instance, Qwen3-Omni-Flash-Thinking scores 63.4 on HallusionBench and 81.2 on MathVistamini.
- Limitation: The current model shows suboptimal performance on long video benchmarks (Video-MME, LVBench, MLVU), attributed to limited positional extrapolation capacity and restricted context length.
6.1.4. Performance of AudioVisual Video Text
The following are the results from Table 11 of the original paper:
| Datasets | Previous Open-source SoTA | Gemini-2.5-Flash | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
| WorldSense | 47.1(Yang et al., 2025b) | 50.9 | 45.4 | 54.0 | 54.1 |
The following are the results from Table 12 of the original paper:
| Datasets | Previous Open-source SoTA | Gemini-2.5-Flash-Thinking | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
| DailyOmni | 69.8(Tang et al., 2025) | 72.7 | 75.8 | 76.2 |
| VideoHolmes | 55.6(Tang et al., 2025) | 49.5 | 57.3 | 57.3 |
- General Understanding: Qwen3-Omni-Instruct achieves state-of-the-art performance on the WorldSense benchmark (54.0), significantly surpassing other Omni models (e.g., Qwen2.5-Omni at 45.4). This demonstrates its efficacy in foundational multimodal integration.
- Complex Reasoning: The Thinking variants exhibit enhanced performance on audiovisual reasoning tasks. Qwen3-Omni-Flash-Thinking scores 76.2 on DailyOmni and 57.3 on VideoHolmes, outperforming the previous open-source SOTA and Gemini-2.5-Flash. These results highlight Qwen3-Omni's potential for advanced perception and reasoning in real-world contexts.
6.1.5. Performance of Speech Generation
The following are the results from Table 13 of the original paper:
| Model | SEED test-zh | SEED test-en |
| Content Consistency (WER) | | |
| Seed-TTS ICL (Anastassiou et al., 2024) | 1.11 | 2.24 |
| Seed-TTS RL (Anastassiou et al., 2024) | 1.00 | 1.94 |
| MaskGCT (Wang et al., 2024c) | 2.27 | 2.62 |
| E2 TTS (Eskimez et al., 2024) | 1.97 | 2.19 |
| F5-TTS (Chen et al., 2024c) | 1.56 | 1.83 |
| Spark TTS (Wang et al., 2025b) | 1.20 | 1.98 |
| CosyVoice 2 (Du et al., 2024) | 1.45 | 2.57 |
| CosyVoice 3 (Du et al., 2025) | 0.71 | 1.45 |
| Qwen2.5-Omni-7B (Xu et al., 2025) | 1.42 | 2.33 |
| Qwen3-Omni-30B-A3B | 1.07 | 1.39 |
- Zero-Shot Speech Generation (Content Consistency, WER): Qwen3-Omni-30B-A3B demonstrates highly competitive performance on SEED test-zh (1.07) and test-en (1.39), achieving the best performance on test-en. This indicates robust speech understanding and generation. The RL optimization yields significant improvements in generation stability.

The following are the results from Table 14 of the original paper:
| Language | Content Consistency (Qwen3-Omni-30B-A3B) | Content Consistency (MiniMax) | Content Consistency (ElevenLabs) | Speaker Similarity (Qwen3-Omni-30B-A3B) | Speaker Similarity (MiniMax) | Speaker Similarity (ElevenLabs) |
| Chinese | 0.716 | 2.252 | 16.026 | 0.772 | 0.780 | 0.677 |
| English | 1.069 | 2.164 | 2.339 | 0.773 | 0.756 | 0.613 |
| German | 0.777 | 1.906 | 0.572 | 0.738 | 0.733 | 0.614 |
| Italian | 1.067 | 1.543 | 1.743 | 0.742 | 0.699 | 0.579 |
| Portuguese | 1.872 | 1.877 | 1.331 | 0.770 | 0.805 | 0.711 |
| Spanish | 1.765 | 1.029 | 1.084 | 0.744 | 0.762 | 0.615 |
| Japanese | 3.631 | 3.519 | 10.646 | 0.763 | 0.776 | 0.738 |
| Korean | 1.670 | 1.747 | 1.865 | 0.778 | 0.776 | 0.700 |
| French | 2.505 | 4.099 | 5.216 | 0.689 | 0.628 | 0.535 |
| Russian | 3.986 | 4.281 | 3.878 | 0.759 | 0.761 | 0.676 |
- Multilingual Speech Generation: Qwen3-Omni surpasses MiniMax-Speech and ElevenLabs Multilingual v2 in Content Consistency for languages such as Chinese (0.716), English (1.069), and French (2.505). It also delivers competitive results in the other languages, demonstrating stable and human-like voice generation across the 10 supported languages, with high Speaker Similarity.

The following are the results from Table 15 of the original paper:
| Language pair | Qwen3-Omni-30B-A3B | CosyVoice3 | CosyVoice2 |
| en-to-zh | 5.37 | 5.09 | 13.5 |
| ja-to-zh | 3.32 | 3.05 | 48.1 |
| ko-to-zh | 0.99 | 1.06 | 7.70 |
| zh-to-en | 2.76 | 2.98 | 6.47 |
| ja-to-en | 3.31 | 4.20 | 17.1 |
| ko-to-en | 3.34 | 4.19 | 11.2 |
| zh-to-ja | 8.29 | 7.08 | 13.1 |
| en-to-ja | 7.53 | 6.80 | 14.9 |
| ko-to-ja | 4.24 | 3.93 | 5.86 |
| zh-to-ko | 5.13 | 14.4 | 24.8 |
| en-to-ko | 4.96 | 5.87 | 21.9 |
| ja-to-ko | 6.23 | 7.92 | 21.5 |
- Cross-Lingual Speech Generation (Content Consistency, WER): Qwen3-Omni generally outperforms CosyVoice3 in any-to-en (e.g., ja-to-en 3.31 vs 4.20) and any-to-ko voice cloning (e.g., zh-to-ko 5.13 vs 14.4). It also achieves performance comparable to CosyVoice3 on any-to-ja tasks, even without text normalization, highlighting its adaptability across diverse linguistic contexts.
6.1.6. Non-Degradation Across Modalities
The following are the results from Table 16 of the original paper:
| Category | Datasets | Qwen3-30B-A3B-Base-202507 | Qwen3-VL-30B-A3B-Base-202507 | Qwen3-Omni-30B-A3B-Base-202507 |
| General Tasks | MMLU | 81.24 | - | 81.69 |
| | MMLU-Redux | 80.17 | - | 80.60 |
| | MMLU-Pro | 61.81 | - | 61.57 |
| | SuperGPQA | 38.24 | - | 40.14 |
| | BBH | 83.79 | - | 83.53 |
| Math & STEM Tasks | GSM8K | 90.83 | - | 91.36 |
| | MATH | 60.84 | - | 60.42 |
| Coding Tasks | EvalPlus | 69.70 | - | 73.96 |
| | MultiPL-E | 65.75 | - | 64.79 |
| | MBPP | 72.60 | - | 72.60 |
| | CRUX-O | 66.94 | - | 69.06 |
| Multilingual Tasks | MGSM | 78.75 | - | 79.93 |
| | INCLUDE | 65.17 | - | 64.73 |
| College-level Problems | MMMUval | - | 57.22 | 59.33 |
| General Visual Question Answering | MMStar | - | 67.2 | 69.6 |
| | RealWorldQAavg | - | 73.98 | 71.89 |
| OCR-related Tasks | AI2D | - | 85.88 | 86.62 |
| | TextVQAval | - | 81.67 | 81.65 |
| | DocVQAtest | - | 95.19 | 95.27 |
| | InfoVQAtest | - | 81.17 | 83.31 |
| | ChartQAtest Avg | - | 87.12 | 87.52 |
| | OCRBench | - | 85.8 | 86.0 |
| Video Understanding Tasks | Video-MMEw/o sub | - | 69.22 | 69.25 |
| | MVBench | - | 71.87 | 69.50 |
| | LVBench | - | 48.61 | 51.07 |
A controlled comparative study was conducted with three models of identical parameter counts (30B-A3B) and matched training compute (FLOPs): a text-only baseline (Qwen3-30B-A3B-Base), a vision-only baseline (Qwen3-VL-30B-A3B-Base), and the multimodal Qwen3-Omni-30B-A3B-Base. The Omni model's sole differentiating factor was the inclusion of supplementary audio and audio-visual data.
- Text Modality: Qwen3-Omni-30B-A3B-Base shows comparable or slightly better performance on many text benchmarks (e.g., MMLU, SuperGPQA, GSM8K, EvalPlus, MGSM) compared to the text-only Qwen3-30B-A3B-Base. This supports the claim that early multimodal integration does not degrade language capability.
- Vision Modality: Qwen3-Omni-30B-A3B-Base consistently outperforms the vision-only Qwen3-VL-30B-A3B-Base on various vision benchmarks, including MMMUval (59.33 vs 57.22), MMStar (69.6 vs 67.2), and most OCR-related tasks (AI2D, InfoVQAtest, ChartQAtest Avg). This indicates that joint multimodal training leads to mutual enhancement, improving performance even within single modalities.
- Observations from the Authors:
  - Early multimodal integration during pretraining allows language models to be co-trained with vision or audio without any degradation in language capability.
  - The inclusion of the text modality substantially improves performance in the vision and audio modalities.
  - No measurable gains in language ability are observed from adding visual or audio signals.
  - Empirically, adding audio data consistently improves vision performance on the MMMU benchmark and OCR-related tasks.
6.2. Ablation Studies / Parameter Analysis
While the paper doesn't present explicit ablation studies in a dedicated section with detailed tables, the comparison between Instruct and Thinking variants, and the Flash variants, implicitly serves as a form of parameter/variant analysis.
- Instruct vs. Thinking Models: The Thinking models generally show enhanced reasoning capabilities, especially on complex tasks (e.g., Math & STEM in Vision-Text, audiovisual reasoning). However, for purely perception-based tasks such as Music Understanding, the Thinking model is sometimes outperformed by its Instruct counterpart (as shown in Appendix Tables 17 and 18). This suggests that complex reasoning processes might not always yield gains for straightforward perceptual tasks and could even introduce hallucinations.
- Flash Models: The Flash models are designed for computational efficiency while maintaining high performance. Results show they generally achieve performance comparable to their non-Flash counterparts (e.g., Qwen3-Omni-Flash-Instruct vs Qwen3-Omni-30B-A3B-Instruct), indicating a good trade-off between speed and accuracy.
- MoE Architecture: The MoE design for both Thinker and Talker is highlighted as crucial for high concurrency and fast inference, particularly in maintaining low prefill latency and TTFT under varying load (Table 2).
- Multi-codebook & ConvNet in Talker: The shift to a multi-codebook representation and a causal ConvNet for Code2wav is directly tied to achieving ultra-low first-packet latency (234 ms). This architectural choice dramatically reduces computational overhead compared to block-wise diffusion; a minimal sketch of the causal-convolution idea follows below.

These analyses demonstrate the effectiveness of specific architectural choices and training methodologies in achieving Qwen3-Omni's stated goals.
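To make the Code2wav point concrete, the sketch below illustrates the causal-convolution idea that lets waveform synthesis start from the first codec frame: every layer pads only on the past side, so an output frame never depends on future codes. Layer sizes, channel counts, and the upsampling factor here are illustrative assumptions, not the actual Qwen3-Omni Code2wav design.

```python
# Minimal sketch of a causal 1-D convolution stack of the kind a streaming
# Code2wav decoder can be built from. Illustration only, not the Qwen3-Omni
# implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # pad only on the past side
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))


class TinyCode2Wav(nn.Module):
    """Toy causal decoder: codec-frame features -> waveform samples."""
    def __init__(self, code_dim: int = 64, hidden: int = 128, upsample: int = 160):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(code_dim, hidden, kernel_size=3), nn.GELU(),
            CausalConv1d(hidden, hidden, kernel_size=3, dilation=2), nn.GELU(),
            CausalConv1d(hidden, upsample, kernel_size=3),  # samples per codec frame
        )

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        out = self.net(codes)                                   # (batch, upsample, frames)
        return out.transpose(1, 2).reshape(codes.size(0), -1)   # (batch, samples)


frames = torch.randn(1, 64, 5)        # 5 codec frames of 64-dim features
print(TinyCode2Wav()(frames).shape)   # torch.Size([1, 800])
```

Because the receptive field only looks backwards, each new codec frame can be decoded to audio as soon as it arrives, which is what enables streaming from the first frame.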
6.3. Latency and Concurrency
The following are the results from Table 1 of the original paper:
| Module | Architecture | Params | Streaming |
| Audio Encoder | AuT | 650M | ✓ |
| Vision Encoder | SigLIP2-S0400M | 540M | |
| Thinker | MoE Transformer | 30B-A3B | ✓ |
| Talker | MoE Transformer | 3B-A0.3B | ✓ |
| MTP | Dense Transformer | 80M | ✓ |
| Code2wav | ConvNet | 200M | |
| End-to-End First-Packet Latency: 234/547ms | |||
The following are the results from Table 2 of the original paper:
| Qwen3-Omni-30B-A3B | 1 Concurrency | 4 Concurrency | 6 Concurrency |
| Thinker-Talker Tail Packet Preprocessing Latency | 72/160ms | 94/180ms | 100/200ms |
| Thinker Time-to-First-Token (TTFT) | 88/160ms | 468/866ms | 673/1330ms |
| Talker Time-to-First-Token (TTFT) | 57/210ms | 145/450ms | 376/734ms |
| MTP Module Time Cost Per Token | 14ms | 16ms | 18ms |
| Codec Decoder Time Cost Per Code | 3ms | 5ms | 5ms |
| Overall Latency (Audio/Video) | 234/547ms | 728/1517ms | 1172/2284ms |
| Thinker Token Generation Rate (TPS) | 75 tokens/s | 63 tokens/s | 53 tokens/s |
| Talker Token Generation Rate (TPS) | 140 tokens/s | 125 tokens/s | 110 tokens/s |
| Generation RTF (Real Time Factor) | 0.47 | 0.56 | 0.66 |
- First-Packet Latency: Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms for audio and 547 ms for video in cold-start settings at 1 concurrency. This is a critical achievement for real-time interactive applications; the figure decomposes into the per-stage costs shown in the sketch after this list.
- Concurrency: The MoE architecture ensures that prefill latency and TTFT for Thinker and Talker remain largely unaffected under high concurrency (e.g., 6 concurrency). The lightweight MTP Module and Codec Decoder also minimize overhead, showing only marginal increases in time cost per token/code under higher loads.
- Real Time Factor (RTF): The RTF consistently remains below 1 across varying concurrency levels (0.47 at 1 concurrency, 0.66 at 6 concurrency). This guarantees that users receive continuously streaming audio responses faster than real time, which is essential for smooth conversational experiences.
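Summing the 1-concurrency stage costs from Table 2 reproduces the reported 234 ms (audio) and 547 ms (video) first-packet figures; the additive decomposition below is our reading of the table rather than a formula quoted from the report.

```python
# First-packet latency at 1 concurrency, read off Table 2 (audio, video), in ms.
stages_ms = {
    "preprocessing (Thinker-Talker tail packet)": (72, 160),
    "Thinker time-to-first-token":                (88, 160),
    "Talker time-to-first-token":                 (57, 210),
    "MTP module (first token)":                   (14, 14),
    "codec decoder (first code)":                 (3, 3),
}

audio_ms = sum(a for a, _ in stages_ms.values())
video_ms = sum(v for _, v in stages_ms.values())
print(audio_ms, video_ms)  # 234 547
```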
6.4. Qualitative Results from Qwen3-Omni-30B-A3B-Captioner
The paper includes detailed qualitative results for the fine-tuned Audio Captioner, demonstrating its ability to produce rich, low-hallucination descriptions for diverse audio inputs.
- Analysis of Expressive Speech: The model accurately identifies a "studio setting," "faint, persistent electronic hiss," "male speaker," "clear, energetic, highly theatrical manner," specific Chinese phrases, "exaggerated emphasis," "comedic contrast," "central positioning in stereo field," and "digital reverb for dramatic effect." It even deduces the "comedic, over-the-top manner" and "theatrical nature," showcasing strong semantic understanding and paralinguistic analysis.
- Analysis of a Complex Scene Sound Effect: The captioner describes a "highly produced, cinematic soundscape," noting a "deep, resonant musical drone," "metallic clank," "slow, rhythmic, ominous beat," "swelling orchestral strings," "thunderous, mechanical roar of a massive engine," "high-pitched, metallic screech," "colossal, explosive impact," "shattering and debris," and "heavy, strained breathing" post-impact. It infers context like "imminent danger," "immense machinery," a "vast, hard-walled environment," and a "catastrophic event," demonstrating excellent audio event detection and scene understanding.
- Analysis of Mixed Speech, Audio, and Music: For a composite audio, it details a "deep, resonant metallic clang," "low-frequency rumble," "mechanical whirring," "electrical arcs or energy discharges," a "distant and high-pitched female voice" asking "Are we there yet?", a "deeper, gravelly male voice" responding "We get there when we get there," and a "synthesized musical sting." It correctly interprets "familial banter," "playful annoyance," and a "science fiction or fantasy context," showcasing complex audio scene analysis and multimodal reasoning to piece together narrative elements.

These qualitative results underscore the Captioner's ability to provide granular, contextually rich, and narrative-driven descriptions of audio, which is a significant advancement in the field.
7. Conclusion & Reflections
7.1. Conclusion Summary
The Qwen3-Omni technical report introduces a groundbreaking family of multimodal models (Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, Qwen3-Omni-Flash-Instruct, and Qwen3-Omni-Flash-Thinking) that achieve a significant milestone in multimodal AI. For the first time, these models demonstrate that fully integrated, end-to-end multimodal training can be achieved without degrading core capabilities in any single modality.
Key findings include:
- Performance Parity and Beyond: Qwen3-Omni-30B-A3B matches or surpasses same-sized unimodal Qwen models on text and vision benchmarks, directly refuting the common modality trade-off.
- SOTA Audio Performance: It sets new state-of-the-art records on audio processing and dialogue benchmarks, achieving open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming even strong proprietary systems like Gemini-2.5-Pro.
- Enhanced Reasoning: The Thinking variants (Qwen3-Omni-30B-A3B-Thinking) further boost performance on complex multimodal reasoning tasks involving text, vision, and audio-visual inputs.
- Ultra-Low Latency Speech Interaction: Through innovations like the Thinker-Talker MoE architecture, multi-codebook autoregressive prediction, and a lightweight causal ConvNet for waveform synthesis, Qwen3-Omni achieves an impressive end-to-end first-packet latency of 234 ms, enabling fluent and natural real-time speech generation across 10 languages.
- Broad Language and Modality Coverage: The model supports 119 text languages, 19 languages for speech understanding, and 10 for speech synthesis, and can process long audio inputs (up to 40 minutes).
- Novel Audio Captioning: The introduction of Qwen3-Omni-30B-A3B-Captioner addresses a gap in the community by providing a model capable of generating detailed, low-hallucination captions for arbitrary audio inputs.
- Public Availability: The release of key models under the Apache 2.0 license promotes further research and application.

The paper concludes that Qwen3-Omni represents a critical step towards truly integrated multimodal AI, offering advantages over cascaded pipelines in cross-modal reasoning, lower end-to-end latency, and reduced system complexity and cost.
7.2. Limitations & Future Work
The authors acknowledge a key limitation:
- Suboptimal Long Video Performance: The current model exhibits suboptimal performance on long video benchmarks. This is attributed to two architectural constraints:
  - Limited capacity for positional extrapolation.
  - Restricted context length.

Future research directions suggested by the authors include:

- Multi-speaker ASR: Improving Automatic Speech Recognition for scenarios with multiple speakers.
- Video OCR: Enhancing Optical Character Recognition capabilities within video.
- Audiovisual Proactive Learning: Developing models that can proactively learn from audiovisual cues.
- Enhanced Agent-based Workflows and Function Calling: Further integrating the model with agentic capabilities and tool use.
7.3. Personal Insights & Critique
The Qwen3-Omni paper presents a compelling case for the feasibility and benefits of deeply integrated multimodal learning. The central claim of non-degradation across modalities while achieving SOTA performance is highly significant, as it addresses a long-standing challenge in multimodal AI. This suggests that with careful architectural design (e.g., MoE, TM-RoPE) and comprehensive training strategies (early multimodal integration, diverse data), the "cost" of multimodality can be effectively mitigated or even turned into a benefit.
The achievement of ultra-low first-packet latency for speech generation is a crucial practical innovation. For real-time human-AI interaction, latency is paramount, and demonstrating streaming from the first codec frame with a causal ConvNet is a clever engineering solution that significantly improves user experience compared to previous block-wise diffusion methods. This architectural shift from a heavy generative model (diffusion) to a lightweight causal network for the final waveform synthesis is particularly insightful, showing how to optimize different stages of a complex generation pipeline.
The distinction between Instruct and Thinking models, and the observation that Thinking models can sometimes perform worse on purely perceptual tasks, offers valuable insight. It suggests that complex reasoning pathways, while beneficial for higher-level cognitive tasks, might introduce unnecessary overhead or hallucinations when the task is primarily about accurate perception and straightforward mapping. This implies that specialized model variants or adaptive routing within an MoE could be crucial for optimal performance across the full spectrum of tasks.
The explicit creation and release of an Audio Captioner is a commendable contribution to the research community, filling a recognized gap. The qualitative results provided are genuinely impressive, showcasing a rich understanding of sound events, context, and even paralinguistic cues, which goes beyond simple sound classification.
Potential Issues or Unverified Assumptions:
- Complexity of Training: While the paper outlines the training stages, the sheer scale of data (2 trillion tokens, 20 million hours of audio) and the intricate multi-stage pretraining and post-training processes imply immense computational resources. Replicability for smaller research groups might be challenging.
- Flash Variants vs. Full Models: While Flash models maintain good performance, the exact trade-offs on specific nuanced tasks (e.g., very long-context understanding or subtle emotional generation) might warrant further detailed investigation.
- Generalization to Novel Modalities/Tasks: While Qwen3-Omni is impressive, it is unclear how easily it can incorporate entirely new modalities or adapt to tasks significantly different from those it was trained on, beyond the current perception-generation-reasoning paradigm.
- "Overall SOTA" Definition: While the paper states "overall SOTA on 22" benchmarks, the exact criteria for this "overall SOTA" across a diverse set of benchmarks could be more explicitly defined for rigorous comparison.
Transferability and Application:
The methods and conclusions of Qwen3-Omni are highly transferable. The Thinker-Talker MoE architecture and the optimized streaming speech generation pipeline could be applied to other multimodal LLMs to improve real-time interaction. The TM-RoPE for flexible multimodal positional encoding is a valuable contribution for any model dealing with dynamic, time-aligned multimodal streams. The success of early multimodal integration suggests a powerful recipe for future foundation models aiming for broad applicability without sacrificing specialized performance. The Captioner itself has direct applications in accessibility, content indexing, and even creating synthetic datasets for further audio research. This work inspires confidence that truly general-purpose multimodal AI is within reach.