MiniMax M3: The Sparse Attention Breakthrough Driving 1M-Token Context

On June 1, 2026, Shanghai-based AI lab MiniMax released M3, the first open-weight model to deliver three frontier capabilities simultaneously: 59.0% on SWE-Bench Pro (edging GPT-5.5's 58.6%), a 1M-token context window, and native multimodal understanding of text, images, and video. The core enabler of this combination is MiniMax Sparse Attention (MSA), an attention architecture that makes 1M-token inference computationally practical on standard modern hardware.

Standard softmax attention scales quadratically, O(n²), with context length: doubling the context quadruples the compute. At 1M tokens, a single dense forward pass becomes prohibitively expensive on today's accelerators. The industry has explored KV-cache compression and linear attention variants, but these often introduce major performance trade-offs. MSA's approach is more direct: instead of attending to all tokens, it identifies the few that actually matter for each query and computes attention over those alone. The problem is not unique to MiniMax; other models, such as Switzerland's Apertus 70B, face identical scaling laws.
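The quadratic growth described above can be checked with a few lines of arithmetic. This is a minimal sketch that counts only query-key score pairs for one attention head; the function name `dense_attention_pairs` and the sample context lengths are illustrative choices, not figures from MiniMax.

```python
def dense_attention_pairs(context_len: int) -> int:
    """Query-key score pairs computed by dense attention over context_len tokens."""
    return context_len * context_len


# Illustrative context lengths (assumed for this example).
for n in (128_000, 256_000, 1_000_000):
    print(f"{n:>9,} tokens -> {dense_attention_pairs(n):.2e} score pairs")

# Doubling the context quadruples the number of score pairs.
growth = dense_attention_pairs(256_000) / dense_attention_pairs(128_000)
print(f"growth when doubling the context: {growth}x")
```

At 1M tokens the count reaches 10^12 pairs per head per layer, which is why a fixed attention budget matters more as contexts grow.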

MSA performs block selection in two stages. First, an Index Branch divides the KV cache into 128-token blocks and selects the 16 most relevant blocks per GQA (Grouped-Query Attention) group. This group-specific sparsity differentiates MSA from uniform approaches. Then, the Main Branch runs exact attention over only those roughly 2,048 KV tokens (16 blocks × 128 tokens), a fixed budget regardless of context length. The result is sub-quadratic scaling and a 28.4x reduction in compute at 1M context.
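The two stages can be sketched in plain Python. This is an illustration under stated assumptions, not MiniMax's implementation: the article does not say how the Index Branch scores blocks, so the sketch assumes a dot product between the query and each block's mean key, and it uses a single query vector to stand in for one GQA group. The names `select_blocks` and `sparse_attention` are hypothetical.

```python
import math
from typing import List, Tuple

BLOCK_SIZE = 128    # tokens per KV block (from the article)
TOP_K_BLOCKS = 16   # blocks kept per GQA group (from the article)

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_blocks(query: Vector, keys: List[Vector],
                  block_size: int = BLOCK_SIZE,
                  top_k: int = TOP_K_BLOCKS) -> List[int]:
    """Index branch (ASSUMED scoring): rank blocks by query . mean(keys in block)."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(len(query))]
        scores.append(dot(query, mean_key))
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])  # keep the top_k blocks, in positional order


def sparse_attention(query: Vector, keys: List[Vector], values: List[Vector],
                     block_size: int = BLOCK_SIZE,
                     top_k: int = TOP_K_BLOCKS) -> Tuple[Vector, int]:
    """Main branch: exact softmax attention over the selected blocks only."""
    token_ids = [
        i
        for b in select_blocks(query, keys, block_size, top_k)
        for i in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
        for d in range(len(values[0]))
    ]
    return out, len(token_ids)  # second item: number of KV tokens attended to
```

With the default settings the second return value never exceeds 16 × 128 = 2,048, however long `keys` is. That is the fixed budget the Main Branch works with.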

To translate algorithmic sparsity into actual GPU speedups, MiniMax co-designed a custom kernel featuring exp-free top-k selection, KV-outer sparse attention (batching queries that need the same KV block), and contiguous memory access. This is a genuine architectural fork from DeepSeek's MLA (Multi-head Latent Attention): while DeepSeek compresses KV data into a low-rank latent space, MSA physically filters out non-essential tokens.
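The KV-outer idea, batching queries that need the same block, amounts to inverting the query-to-blocks selection into a block-to-queries schedule, so that each KV block is read once and reused by every query that selected it. The sketch below shows only that bookkeeping step in Python. It is an assumption about the scheduling logic, not MiniMax's kernel code, and `group_queries_by_block` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def group_queries_by_block(selected: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Invert {query_id: [block_ids]} into {block_id: [query_ids]}."""
    by_block: Dict[int, List[int]] = defaultdict(list)
    for query_id, block_ids in selected.items():
        for block_id in block_ids:
            by_block[block_id].append(query_id)
    return dict(by_block)


# Toy selection: three queries, each keeping a few KV blocks.
selected = {0: [3, 7], 1: [3, 5], 2: [7]}
schedule = group_queries_by_block(selected)
print(schedule)  # {3: [0, 1], 7: [0, 2], 5: [1]}
```

Iterating over `schedule` rather than over queries means block 3 is loaded once for queries 0 and 1 instead of twice. This is the kind of memory-traffic saving the kernel design is aiming for.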

Priced aggressively at $0.30 per million input tokens (promotional rate), M3 costs only 5-10% of what rivals like Claude 4.8 Opus or GPT-5.5 charge ($5.00/M), making it a highly cost-effective coding model. However, users should note that the benchmarks are self-reported, the license restricts commercial self-hosting, and abstract reasoning remains its primary weakness.
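The price gap is easy to verify from the two input-token rates quoted above. This short sketch uses only those figures; the helper `input_cost` and the 1M-token prompt size are illustrative assumptions, and output-token pricing is not covered.

```python
M3_INPUT_PRICE = 0.30     # USD per million input tokens (promotional rate, from the article)
RIVAL_INPUT_PRICE = 5.00  # USD per million input tokens (from the article)


def input_cost(tokens: int, price_per_million: float) -> float:
    """Input-side cost in USD for a prompt of the given size."""
    return tokens / 1_000_000 * price_per_million


ratio = M3_INPUT_PRICE / RIVAL_INPUT_PRICE
print(f"M3 input price as a share of the rival rate: {ratio:.0%}")

prompt_tokens = 1_000_000  # one full-context prompt (assumed for illustration)
print(f"M3:    ${input_cost(prompt_tokens, M3_INPUT_PRICE):.2f}")
print(f"Rival: ${input_cost(prompt_tokens, RIVAL_INPUT_PRICE):.2f}")
```

At these rates the ratio is 6%, which falls inside the 5-10% range quoted above.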

[AgentUpdate Depth Analysis] MiniMax M3 and its MSA architecture mark a significant milestone for the AI agent ecosystem. For complex, multimodal agent workflows, which often require absorbing massive code repositories or lengthy video feeds, context window size and inference cost are the main bottlenecks. By cutting compute requirements by over 28x and pricing input tokens at a fraction of competitors' rates, M3 enables continuous, low-cost "long-thought" agent iterations that were previously economically infeasible. Unlike compression-heavy approaches such as DeepSeek's MLA, MSA's physical filtering preserves semantic precision over long horizons, supporting high-fidelity memory retrieval. This combination of very low cost, 1M context, and native multimodality will directly accelerate the deployment of production-grade software engineering agents, autonomous analytical systems, and multimodal reasoners, shifting the AI agent bottleneck from computational limits to workflow design.
