Traditional multimodal large models often produce fragmented summaries stitched together from subtitles and frame tags when processing long videos. Understanding long-form video requires not just identifying individual frames but also capturing causal chains across a continuous temporal flow. Kuaishou's self-developed multimodal large language model, Keye-VL-2.0-30B-A3B, targets this deeper level of comprehension, moving AI from basic perception toward deep reasoning.
Kuaishou has officially released Keye-VL-2.0-30B-A3B. As the latest 30B-class foundation model in the Keye family, it is the first to integrate the DSA (DeepSeek Sparse Attention) mechanism into multimodal scenarios, unlocking a 256K ultra-long context window and enabling near-lossless temporal perception for long videos. This release also marks the first time the Keye series integrates Agent collaboration mechanisms, with capabilities demonstrated in Code, Tool, and Search environments.
A major bottleneck in video understanding has been the quadratic growth in attention cost, together with the dilution of core information, caused by ultra-long visual contexts. Keye-VL-2.0-30B-A3B addresses this at the architectural level. By combining sparse attention with targeted feature aggregation, the model filters noise, captures keyframes, and tracks dynamic patterns across hours of video. This advantage has been validated on TimeLens, a fine-grained temporal video understanding benchmark, where it went head-to-head with Gemini models:
- Action Temporal Parsing (Charades-TimeLens): Achieved an mIoU of 58.4, trailing the closed-source Gemini 3 Flash (61.2) by 2.8 points.
- Video Action Localization (ActivityNet-TimeLens): Reached an mIoU of 58.5, ahead of both the officially reported Gemini-2.5-Pro score (58.1) and an internal evaluation of Gemini 3 Flash (57.0).
- Highlight Extraction (QVHighlights-TimeLens): Reached an mIoU of 70.1, on par with top closed-source models and well above Gemini 3 Flash (49.5).
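For readers unfamiliar with the metric, the sketch below shows how a mean temporal IoU (mIoU) score of this kind is commonly computed: each predicted time span is compared with its ground-truth span, and the per-sample overlaps are averaged. It assumes the usual definition (intersection over union of two intervals, averaged and scaled to 0-100); the exact TimeLens evaluation protocol is not described here, and the function names are illustrative only.

```python
from typing import List, Tuple

# A segment is (start_seconds, end_seconds). Illustrative type, not from TimeLens.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average IoU over paired predictions and ground truths, scaled to 0-100."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(preds)


if __name__ == "__main__":
    # Hypothetical predictions: one exact match, one half-shifted span.
    predictions = [(0.0, 10.0), (0.0, 10.0)]
    ground_truth = [(0.0, 10.0), (5.0, 15.0)]
    print(f"mIoU = {mean_iou(predictions, ground_truth):.1f}")
```

Under this definition, a score of 58.4 means the predicted and annotated spans overlap by a little under 60% on average, so a gap of a few points corresponds to small boundary shifts rather than missed events.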
In real-world tests, the model demonstrates millisecond-level precision. When analyzing a complex, multi-step ceramic cup crafting video, Keye-VL-2.0-30B-A3B not only identified specialized steps (raw calcite calcination at 950°C, clay purification, wheel throwing, glazing at 1200°C, and tea water oxidation) but also mapped each step precisely to the video timeline.
For esports gameplay analysis (e.g., Honor of Kings), Keye-VL-2.0-30B-A3B went beyond simplistic tag-matching. Instead, it analyzed visual tension, audiovisual alignment, and narrative arcs. It read dynamic damage numbers, matched localized subtitles with shifts in background music, and identified emotional climaxes such as "clutch victories" from live team scores, comparing moments across the whole match rather than judging clips in isolation.
Beyond visual perception, Keye-VL-2.0-30B-A3B introduces agentic capabilities. When processing a 9-minute Iceland travel vlog, the model did not merely summarize events; it picked up details like "freezing hands" to recommend winter gear and analyzed snowy road hazards to suggest structured safety plans (e.g., group tours over self-driving). This "slow-thinking" ability marks a paradigm shift toward active AI Agents.
[AgentUpdate Depth Analysis] By integrating DeepSeek Sparse Attention (DSA) into a multimodal framework, Kuaishou's Keye-VL 2.0 addresses a fundamental bottleneck in the AI Agent ecosystem: processing long-term visual dependencies efficiently. Standard attention has a computational cost that grows quadratically with sequence length, which restricts Agent context windows. DSA's sparse token selection lets Keye-VL 2.0 navigate a 256K context without losing granular sequence details. This shift enables AI Agents to move beyond simple reactive queries and become proactive planners capable of long-horizon causal reasoning over hours of visual data. Furthermore, by pairing this perception engine with agentic capabilities (Code, Tool, Search), Keye-VL 2.0 paves the way for advanced multimodal workflows. The future of AI Agents will rely on such efficient temporal perception backbones to execute complex tasks in robotics, autonomous driving, and automated video workflows, shifting the paradigm from static understanding to active, long-term environmental interaction.
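To make the quadratic-versus-sparse argument concrete, here is a minimal sketch of top-k sparse attention for a single query, plus a count of query-key pairs at a 256K context. It is an illustration of the general idea only, not Keye-VL's or DeepSeek's implementation: the `top_k` value of 2048, the reading of 256K as 262,144 tokens, and the use of the plain dot product as the selection score are all assumptions made for this example.

```python
import math
from typing import List

Vector = List[float]


def dense_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full attention: every token attends to every token."""
    return seq_len * seq_len


def sparse_pairs(seq_len: int, top_k: int) -> int:
    """Query-key pairs in the main attention when each query keeps only top_k keys."""
    return seq_len * min(top_k, seq_len)


def topk_sparse_attention(query: Vector, keys: List[Vector],
                          values: List[Vector], top_k: int) -> Vector:
    """Attend over only the top_k highest-scoring keys for one query.

    Assumption: keys are ranked by scaled dot product. A production system
    would use a much cheaper scoring step to choose which keys to keep.
    """
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Indices of the top_k best-scoring keys.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the kept keys.
    max_score = max(scores[i] for i in keep)
    weights = [math.exp(scores[i] - max_score) for i in keep]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) / total
            for d in range(dim)]


if __name__ == "__main__":
    context = 256 * 1024   # assumed token count for a "256K" window
    k = 2048               # hypothetical number of keys kept per query
    ratio = dense_pairs(context) // sparse_pairs(context, k)
    print(f"dense pairs:  {dense_pairs(context):,}")
    print(f"sparse pairs: {sparse_pairs(context, k):,}")
    print(f"reduction:    {ratio}x")
```

With these assumed numbers the main attention touches 128 times fewer query-key pairs than dense attention. Note that the sketch still scores every key to pick the top k, so the saving depends on that selection step being far cheaper than full attention.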