Kuaishou has released Keye-VL-2.0-30B-A3B, its self-developed multimodal large language model, marking a significant step toward deep reasoning in multimodal understanding. When processing long, dynamic videos, such as a 9-minute Iceland travel vlog with drastic scene changes, conventional vision-language models often produce superficial summaries based on subtitles and simple frame tags. Keye-VL-2.0-30B-A3B shows a distinctly different level of understanding: it not only sees the visuals but also grasps the underlying causality. When analyzing that vlog, the model infers from a "cold hands" detail that travelers should pack warm gloves; on hearing complaints about exotic food, it offers the emotionally intelligent recommendation to "experience local culture"; and it identifies a "snow accident" scene and advises "guided tours over self-driving" for safety. This capability goes beyond simple tag recognition: the model untangles causal chains across a continuous timeline and plans in a way that follows human logic.
In the evolution of multimodal large models from "basic perception" to "deep reasoning," the Kuaishou team focused on two key challenges: overcoming the computational bottleneck of ultra-long visual contexts in video understanding tasks and transforming the model from a mere "observer" into an "actor" capable of solving complex real-world problems. Keye-VL-2.0-30B-A3B, as the latest 30B-class flagship foundation model of the Keye family, is the first to integrate the DSA (DeepSeek Sparse Attention) mechanism into multimodal understanding scenarios. This integration successfully unlocks deep perception for 256K ultra-long contexts, achieving near-lossless inference capabilities for long video temporal perception. Furthermore, this release marks the first time the Keye series has enabled an Agent collaboration mechanism, showcasing robust system-level collaboration and execution potential in complex scenarios involving Code, Tool, and Search.
Five Technical Engines Reshaping the Multimodal Foundation
DSA's debut in multimodal contexts addresses the bottleneck of long video understanding. A primary challenge in video understanding is that ultra-long visual contexts sharply increase computational overhead (the cost of dense attention grows quadratically with sequence length) and dilute core information. Keye-VL-2.0-30B-A3B makes a critical architectural leap by implementing DSA (DeepSeek Sparse Attention) within its multimodal understanding framework. By combining sparse attention with highly targeted feature aggregation, the model can filter useful information out of noisy input, capture keyframes precisely, and discern dynamic patterns, even when processing video sequences hours in length.
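To make the idea of sparse attention concrete, here is a minimal sketch of generic top-k attention for a single query, in plain Python. It is an illustration only and does not reproduce the actual DSA implementation or Keye's feature aggregation; the function name `topk_sparse_attention` and its parameters are hypothetical. The point it shows is that the softmax and the weighted sum run over only `k` selected keys instead of the whole context.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query vector to only its k highest-scoring keys.

    Illustrative sketch: a production system would pick the k candidates
    with a cheaper scoring step rather than full dot products.
    """
    scale = 1.0 / math.sqrt(len(query))
    # Scaled dot-product score of the query against every key.
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k largest scores; all other positions are ignored.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only (max subtracted for stability).
    top = max(scores[i] for i in keep)
    weights = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(weights.values())
    # Weighted sum of the selected value vectors.
    out = [0.0] * len(values[0])
    for i, w in weights.items():
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out, sorted(keep)
```

With `k` equal to the number of keys this reduces to ordinary dense attention; with a small fixed `k`, the softmax and aggregation work per query no longer grows with the context length.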
This architectural advantage has been validated on TimeLens, the latest benchmark for fine-grained video temporal understanding. The Kuaishou team compared the model against Gemini 3 Flash (measured internally) and Gemini 2.5 Pro (officially reported figures), adhering strictly to the same evaluation methodology for a fair comparison:
- Everyday Action Temporal Parsing (Charades-TimeLens): The model achieved an mIoU of 58.4, within 2.8 points of the measured closed-source baseline, Gemini 3 Flash (61.2).
- Video Action Localization (ActivityNet-TimeLens): With an mIoU of 58.5, the model surpassed both the official Gemini 2.5 Pro figure (58.1) and the measured Gemini 3 Flash (57.0).
- Highlight Moment Extraction (QVHighlights-TimeLens): The model's mIoU reached 70.1, rivaling top-tier closed-source models on the official leaderboard and significantly outperforming the measured Gemini 3 Flash (49.5).
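For readers unfamiliar with the metric, the sketch below shows how a mean temporal IoU of this kind is commonly computed. It assumes the standard definition (intersection over union of predicted and ground-truth time intervals, averaged over samples and scaled to 0-100); the exact TimeLens protocol is not described here, and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Mean temporal IoU over paired predictions, scaled to 0-100."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(ious) / len(ious)
```

Under this definition, a prediction of 0-10 s against a ground truth of 5-15 s overlaps for 5 s out of a 15 s union, giving an IoU of one third; the reported score is the average of such values across all queries.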
Consider a video detailing the intricate process of making a ceramic cup. Keye-VL-2.0-30B-A3B showed surgical precision in frame-level judgment, outputting a complete breakdown of the craftsmanship with accurate timestamps:
- Calcite Raw Material Processing: Crushing raw stones into small pieces with a hammer; repeatedly rinsing in a bamboo sieve in a stream to remove impurities.
- Calcite Calcination and Pulping: Calcining in an earth kiln with charcoal at high temperatures (approx. 950℃); extracting white powder from the kiln; grinding with water to form a fine slurry (elutriation process).
- Clay Collection and Treatment: Excavating reddish-brown clay from mountainous terrain; pouring it into a vat and stirring with water to remove impurities.
- Teacup Body Forming and Decoration: Hand-throwing on a potter's wheel for shaping; fine-tuning thickness and form; attaching a square seal mark to the bottom and trimming.
- Glaze Preparation and Application: Weighing quartz, feldspar, and other raw materials in proportion and stirring with water to create a slurry; repeatedly dipping the body into the glaze slurry and natural air drying.
- Firing and Finished Product Display: Stacking in the kiln; firing with wood to 1200℃; removing from kiln, cleaning, and immersing in aged tea water for oxidation to adjust glaze color; finally presenting glaze features like crackles and iron feet.
The model accurately identified every professional manual step throughout the process, achieving millisecond-level synchronization with the video timeline.
In analyzing a high-octane "Honor of Kings" gameplay video, Keye-VL-2.0-30B-A3B moved beyond the mechanical logic of traditional AI, which typically extracts only kill notifications or scenes with drastic visual changes. Instead, it delivered precise highlight judgments based on visual intensity, audio-visual synergy, and a deep understanding of esports narratives:
- Dual Burst of Visuals and Rhythm: The model keenly identified the most intense team fight scenes, not only recognizing "golden and purple light effects intertwined" but also accurately reading specific dynamic damage values like "276" and "132." It used the density of these visual elements as direct evidence of intense battle rhythm, demonstrating strong dynamic visual parsing capabilities.
- Dramatic Tension Built Through Audio-Visual Synergy: The model was not limited to the game screen itself; it cross-modally captured the English lyric subtitles at the bottom. It successfully understood the connection between the high-energy lyrics and the fierce match, pointing out how this "audio-visual synergy" elevated the video's dramatic tension.
- Emotional Resonance of a "Comeback from the Brink": This aspect best reflects the model's depth. By reading "27 vs 35" on the screen, it inferred that one team was trailing and, combining this with the closely fought team fight, accurately extracted the core esports narrative of a "comeback from the brink." This shows that it can not only understand the visuals but also grasp the emotional impact and entertainment value behind game videos.
- Ruling Out Alternatives from a Global Perspective: The model did not analyze the highlight in isolation; it took a global view of the whole video. It proactively compared the highlight segment with the earlier combat and chase segments (00:00-00:16 / 00:17-00:58) and argued why the chosen time frame could not be replaced by either, along three dimensions: effect intensity, rhythmic tension, and narrative significance. The result is a closed, convincing line of reasoning.
This analysis showcases Keye-VL-2.0-30B-A3B's depth in video understanding and complex reasoning capabilities.