Cohere Releases Command A+: A 218B Sparse MoE for Agentic Workflows

Cohere has released Command A+, a new open-source model designed for enterprise agentic workflows. Available under the Apache 2.0 license, Command A+ is a Mixture-of-Experts (MoE) model built to deliver strong agentic capabilities with low compute overhead. The model is optimized for reasoning, agentic workflows, Retrieval-Augmented Generation (RAG), multilingual use, and multimodal document processing. It unifies the capabilities of four previous models: Command A, Command A Reasoning, Command A Vision, and Command A Translate.

Architecturally, Command A+ is a decoder-only sparse MoE Transformer with 218 billion total parameters and 25 billion active parameters. It has 128 experts, with 8 active per token, alongside a single shared expert applied to all tokens. This design keeps active compute at a 25B-parameter scale during inference. Its attention mechanism interleaves sliding-window attention layers that use Rotary Positional Embeddings (RoPE) with global attention layers that use no positional embeddings, in a 3:1 ratio. The sparse MoE layer is trained in a fully dropless manner and uses a token-choice router with a normalized sigmoid over top-k expert logits. The model accepts text, image, and tool use as inputs, and outputs text, reasoning, and tool use, supporting a 128K input context length and a 64K maximum generation length.
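The routing step described above can be illustrated with a minimal Python sketch. It reflects one reading of "normalized sigmoid over top-k expert logits": pick the 8 highest logits per token, apply a sigmoid to those, and rescale so the gate weights sum to 1. The function name `route_token` and the random logits are hypothetical and not taken from Cohere's code.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article

def route_token(logits, top_k=TOP_K):
    """Return (expert_ids, gate_weights) for a single token.

    Assumed interpretation: sigmoid is applied to the selected top-k
    logits only, then the scores are normalized to sum to 1.
    """
    # Token-choice routing: the token selects its own top-k experts by logit.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Sigmoid score for each selected expert.
    scores = [1.0 / (1.0 + math.exp(-logits[i])) for i in top_ids]
    # Normalize so the gate weights form a convex combination.
    total = sum(scores)
    weights = [s / total for s in scores]
    return top_ids, weights

# Illustrative router logits for one token (random, not real model output).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
expert_ids, gate_weights = route_token(logits)
print(expert_ids, [round(w, 3) for w in gate_weights])
```

Because routing is dropless, every token is processed by all 8 of its selected experts with no capacity limit. The shared expert runs on every token regardless of the router; how its output is combined with the routed experts is not described in the article, so it is left out of this sketch.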

To lower deployment barriers, Cohere provides three precision options. The BF16 (16-bit) configuration requires 4x B200 or 8x H100 GPUs; FP8 (8-bit) requires 2x B200 or 4x H100 GPUs; and W4A4 (4-bit) runs on a single B200 or 2x H100 GPUs. Benchmarks show negligible quality differences across these versions, leading Cohere to recommend W4A4 for most enterprise deployments.
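A rough back-of-the-envelope check helps explain these GPU counts. The sketch below estimates weight memory only, from the 218B parameter count and the bit width of each format. The per-GPU memory figures (80 GB for H100, 192 GB for B200) are assumed nominal capacities, not numbers from the article, and the W4A4 figure is a lower bound because only the experts are stored at 4 bits. KV cache, activations, and scale factors are ignored.

```python
TOTAL_PARAMS = 218e9  # from the article

# Assumed nominal HBM capacity per GPU in GB (not stated in the article).
GPU_MEMORY_GB = {"H100": 80, "B200": 192}

# Bytes per weight. Simplification: W4A4 is treated as 4 bits for every
# weight, although the article says only the MoE experts are quantized.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "W4A4": 0.5}

# GPU configurations as listed in the article.
CONFIGS = {
    "BF16": [("B200", 4), ("H100", 8)],
    "FP8": [("B200", 2), ("H100", 4)],
    "W4A4": [("B200", 1), ("H100", 2)],
}

def weight_gb(fmt):
    """Approximate weight footprint in GB for a given format."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[fmt] / 1e9

def headroom_gb(fmt, gpu, count):
    """Memory left for KV cache and activations after loading weights."""
    return GPU_MEMORY_GB[gpu] * count - weight_gb(fmt)

for fmt, options in CONFIGS.items():
    for gpu, count in options:
        print(f"{fmt:5s} {count}x {gpu}: weights ~{weight_gb(fmt):.0f} GB, "
              f"headroom ~{headroom_gb(fmt, gpu, count):.0f} GB")
```

Under these assumptions each listed configuration fits the weights with room to spare, which is consistent with the hardware requirements quoted above. The real headroom depends on context length and batch size.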

For the W4A4 variant, Cohere uses NVFP4 quantization: 4-bit weights and activations with two-level scaling, applied only to the MoE experts. The attention path (including Q/K/V/O projections, KV cache, and attention compute) is kept at full precision. To close the remaining quality gap, Cohere applies Quantization-Aware Distillation (QAD) during post-training, in which the quantized student model is trained to match the full-precision teacher's output distribution, using fake quantization operators in the forward pass and straight-through estimators in the backward pass.
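To make "fake quantization" concrete, here is a simplified Python sketch of the forward pass: values are snapped to a 4-bit floating-point grid and immediately converted back, so the rest of the network sees the rounding error while still computing in ordinary floats. The grid is the set of magnitudes representable in FP4 E2M1. The block size of 16 and the single per-block scale are assumptions for illustration; the second, per-tensor scaling level and the rounding of the scales themselves are omitted, and nothing here is taken from Cohere's implementation.

```python
import math

# Magnitudes representable in 4-bit E2M1 floating point.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed block size sharing one scale

def fake_quant_block(block):
    """Quantize one block to the E2M1 grid and dequantize it again."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    # Per-block scale maps the largest magnitude onto the grid maximum (6.0).
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in block:
        # Round the scaled magnitude to the nearest grid point.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        # Dequantize and restore the sign.
        out.append(math.copysign(mag * scale, v))
    return out

def fake_quant(values, block=BLOCK):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), block):
        out.extend(fake_quant_block(values[start:start + block]))
    return out

weights = [6.0, 3.1, -0.2, 0.0]
print(fake_quant(weights))
```

In training, a straight-through estimator treats this rounding as the identity function in the backward pass, so gradients flow to the underlying full-precision weights even though the rounding itself has zero gradient almost everywhere.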

Benchmarks show large gains over predecessor models. On the τ²-Bench Telecom agent benchmark, scores rose from 37% (Command A Reasoning) to 85%, while Terminal-Bench Hard agentic coding performance rose from 3% to 25%. On internal North platform evaluations (scored via LLM-as-a-judge), Agentic Question Answering, which measures the model's ability to answer enterprise queries using MCP-connected cloud file systems, improved by 20% over Command A Reasoning.

[AgentUpdate Depth Analysis] The release of Command A+ is a notable milestone for enterprise AI agent infrastructure. By combining a 218B sparse MoE architecture with W4A4 quantization, Cohere addresses a key bottleneck: running high-capacity, agent-optimized models on modest hardware (as few as two H100 GPUs in the W4A4 configuration). Compared with heavyweights like Llama 3.1 405B or closed-source APIs, Cohere's targeted design choices (such as MCP support and MoE-only quantization) appear to offer a cost-effective alternative. This approach, which distributes computation through a dropless MoE while keeping the attention block in high precision, sets a strong reference point for on-premise and private cloud agent deployments. It will likely accelerate the transition of AI agents from experimental pilots to low-latency, production-ready enterprise systems.
