Qwen3.5-Omni, the latest iteration of the Qwen-Omni model family, represents a significant advance over its predecessors. The new model scales to hundreds of billions of parameters and supports a 256k-token context length.
Leveraging a vast dataset of heterogeneous text-vision pairs and over 100 million hours of audio-visual content, Qwen3.5-Omni demonstrates robust omni-modality capabilities. Specifically, Qwen3.5-Omni-plus achieves state-of-the-art (SOTA) results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, notably surpassing Gemini-3.1 Pro on key audio tasks and matching its performance in comprehensive audio-visual understanding.
Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) design for both its "Thinker" and "Talker" components, enabling efficient long-sequence inference. The model is engineered for sophisticated interaction, supporting audio understanding over inputs exceeding 10 hours and processing 400 seconds of 720p video at 1 FPS.
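The routing scheme behind these MoE layers is not detailed here, but the general mechanism is a learned router that sends each token to a small subset of expert feed-forward networks. The following PyTorch sketch of a generic top-k routed MoE block is purely illustrative; the class name MoEFeedForward, the dimensions, and the expert count are assumptions rather than the model's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Generic top-k routed mixture-of-experts feed-forward block (illustrative sizes)."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        # Score every expert per token, keep the top-k, and renormalise.
        scores, idx = self.router(x).topk(self.top_k, dim=-1)
        scores = F.softmax(scores, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(MoEFeedForward()(tokens).shape)            # torch.Size([16, 512])
```

Because each token activates only top_k of the n_experts networks, total parameter count can grow far faster than per-token compute, which is the usual motivation for MoE at this scale.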
To address the instability and unnatural prosody often observed in streaming speech synthesis, which can stem from the differing encoding rates of text and speech tokenizers, Qwen3.5-Omni introduces ARIA. This mechanism dynamically aligns text and speech units, significantly improving the stability and prosody of conversational speech with minimal latency impact.
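ARIA itself is not specified beyond this description, so the sketch below illustrates only the underlying problem: a text tokenizer and a speech codec emit units at different rates, and a streaming loop must interleave them so the speech stream neither stalls nor drifts far behind the text. The function stream_align, the fixed speech_units_per_text ratio, and the max_lag threshold are assumptions made for this example, not the actual ARIA mechanism.

```python
from collections import deque

def stream_align(text_tokens, speech_units_per_text=3, max_lag=4):
    """Interleave text tokens with speech units so the speech stream never
    falls more than `max_lag` expected units behind the text already emitted.

    `speech_units_per_text` is a crude stand-in for the rate mismatch between
    text and speech tokenizers (a speech codec typically needs several units
    per text token).
    """
    out, pending, emitted = [], deque(), 0
    for i, tok in enumerate(text_tokens, start=1):
        out.append(("text", tok))
        pending.append(tok)
        expected = i * speech_units_per_text      # speech units "owed" so far
        # Catch the speech stream up whenever it lags too far behind.
        while expected - emitted > max_lag and pending:
            src = pending.popleft()
            for k in range(speech_units_per_text):
                out.append(("speech", f"{src}#{k}"))
                emitted += 1
    # Flush any speech still owed at the end of the utterance.
    while pending:
        src = pending.popleft()
        for k in range(speech_units_per_text):
            out.append(("speech", f"{src}#{k}"))
    return out

print(stream_align(["hel", "lo", " wor", "ld"]))
```

Keeping the lag bound small is what limits added latency; adapting the emission schedule to the actual token streams, rather than the fixed ratio used here, is presumably where the dynamic alignment comes in.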
Furthermore, Qwen3.5-Omni expands its linguistic reach, offering multilingual understanding and speech generation across 10 languages with human-like emotional nuance. The model also exhibits superior audio-visual grounding and can generate script-level structured captions with precise temporal synchronization and automated scene segmentation.
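No concrete output format is given for these script-level captions, but the description implies timestamped events grouped into scenes. The dataclasses below are one plausible representation; every name and field (Scene, CaptionEvent, speaker, and so on) is an assumption for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class CaptionEvent:
    start_s: float           # event onset, seconds from video start
    end_s: float             # event offset
    speaker: str             # e.g. "Narrator", "Speaker 1"
    text: str                # spoken or described content

@dataclass
class Scene:
    start_s: float
    end_s: float
    title: str               # short scene description from automatic segmentation
    events: list[CaptionEvent] = field(default_factory=list)

# A tiny hand-written example of the kind of structure described above.
script = [
    Scene(0.0, 12.5, "Kitchen, morning", [
        CaptionEvent(0.8, 3.2, "Speaker 1", "Let's start with the recipe."),
        CaptionEvent(4.0, 11.9, "Narrator", "Ingredients are laid out on the counter."),
    ]),
    Scene(12.5, 30.0, "Close-up of the stove", [
        CaptionEvent(13.1, 18.4, "Speaker 1", "Heat the pan on medium."),
    ]),
]
for scene in script:
    print(f"[{scene.start_s:6.1f}-{scene.end_s:6.1f}] {scene.title} ({len(scene.events)} events)")
```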
Remarkably, the development of Qwen3.5-Omni has revealed an emergent capability of omnimodal models: writing code directly from audio-visual instructions. This feature has been termed "Audio-Visual Vibe Coding," hinting at new paradigms for human-AI interaction and development.