Cohere Debuts Command A+: A 218B Sparse MoE Model Optimized for Enterprise Agentic Workflows

Cohere has released Command A+, an open-source model aimed at enterprise agentic workflows. Available under the Apache 2.0 license, Command A+ is a Mixture-of-Experts (MoE) model designed to handle agentic tasks while keeping per-token compute low. The model is optimized for reasoning, agentic workflows, RAG, multilingual processing, and multimodal document analysis. It unifies the specialized capabilities of four previous models (Command A, Command A Reasoning, Command A Vision, and Command A Translate) in a single architecture.

Architecturally, Command A+ is a decoder-only sparse MoE Transformer with 218B total parameters, of which only 25B are active for any given token during inference. It has 128 experts and routes each token to 8 of them, alongside a single shared expert that is applied to all tokens. The model therefore draws on a large parameter pool while per-token compute stays at the 25B-parameter scale. The attention mechanism interleaves sliding-window layers (using Rotary Position Embeddings, RoPE) with global attention layers in a 3:1 ratio. The sparse MoE layer is trained in a fully dropless manner, meaning no tokens are discarded when an expert is oversubscribed, and uses a token-choice router with a normalized sigmoid over the top-k expert logits.
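To make the routing description concrete, here is a minimal Python sketch. It is not Cohere's implementation. It assumes scalar toy experts, that the sigmoid gates of the selected experts are rescaled to sum to 1, that the shared expert's output is added unweighted, and that every group of four attention layers runs three sliding-window layers followed by one global layer. All function names are hypothetical.

```python
import math


def route_token(logits, k=8):
    """Token-choice routing for one token: pick the k experts with the
    highest router logits, apply a sigmoid to those logits, and
    normalize the resulting gates so they sum to 1."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = [1.0 / (1.0 + math.exp(-logits[i])) for i in top]
    total = sum(gates)
    return [(i, g / total) for i, g in zip(top, gates)]


def moe_output(x, logits, experts, shared_expert, k=8):
    """Shared expert (always active) plus the gate-weighted sum of the
    k routed experts. Adding the shared output unweighted is an assumption."""
    out = shared_expert(x)
    for idx, weight in route_token(logits, k):
        out += weight * experts[idx](x)
    return out


def layer_pattern(n_layers, ratio=3):
    """Interleave `ratio` sliding-window layers with one global layer.
    The ordering inside each group is an assumption."""
    return ["global" if (i + 1) % (ratio + 1) == 0 else "sliding"
            for i in range(n_layers)]


if __name__ == "__main__":
    logits = [0.1 * i for i in range(128)]           # toy router logits
    experts = [lambda x, s=i: s * x for i in range(128)]
    print(route_token(logits))
    print(moe_output(2.0, logits, experts, lambda x: x))
    print(layer_pattern(8))
```

Because only 8 of the 128 routed experts run for each token, the cost of the loop in `moe_output` is independent of the total expert count, which is the property the article describes as keeping active compute at the 25B scale.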

The model accepts text, image, and tool-use inputs and produces text, reasoning, and tool-use outputs. It supports a 128K input context length and a 64K maximum generation length. To facilitate deployment, Cohere offers three precision variants: BF16 (requires 8x H100s), FP8 (requires 4x H100s), and W4A4 (runs on 2x H100s). Cohere recommends W4A4 for most enterprise deployments, reporting that benchmark quality remains consistent across all three variants.
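The GPU counts line up with a back-of-the-envelope estimate of weight memory. The sketch below is a rough sanity check, not Cohere's sizing method. It assumes 80 GB of memory per H100, counts weights only (no KV cache or activations), and treats W4A4 as 0.5 bytes for every parameter, which is a lower bound because the attention path is not quantized to 4 bits.

```python
TOTAL_PARAMS = 218e9   # total parameters, from the article
H100_GB = 80           # assumed memory per H100 (80 GB variant)


def weight_gb(bytes_per_param):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return TOTAL_PARAMS * bytes_per_param / 1e9


def fits(bytes_per_param, gpus):
    """True if the weights alone fit in the combined GPU memory."""
    return weight_gb(bytes_per_param) <= gpus * H100_GB


if __name__ == "__main__":
    # (variant, bytes per parameter, GPU count quoted in the article)
    for name, bpp, gpus in [("BF16", 2.0, 8), ("FP8", 1.0, 4), ("W4A4", 0.5, 2)]:
        print(f"{name}: {weight_gb(bpp):.0f} GB of weights, "
              f"{gpus * H100_GB} GB available, fits={fits(bpp, gpus)}")
```

Under these assumptions the weights take roughly 436 GB, 218 GB, and 109 GB, against 640 GB, 320 GB, and 160 GB of GPU memory. Each variant would not fit on half as many GPUs, so the quoted counts leave the remaining headroom for the KV cache and activations.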

For the W4A4 variant, Cohere applies NVFP4 quantization (4-bit weights and activations with two-level scaling) only to the MoE experts. The attention path, including the Q/K/V/O projections and the KV cache, is kept at full precision. To close the precision gap, Cohere used Quantization-Aware Distillation (QAD) during post-training: the quantized student model is trained to mimic the full-precision teacher's output distribution, using fake quantization operators in the forward pass and straight-through estimators to pass gradients through the non-differentiable rounding step.
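The three ingredients of QAD named above can be sketched in plain Python. This is an illustration under stated assumptions, not Cohere's training code. It assumes the 4-bit values follow the E2M1 grid with 16-element blocks, keeps the per-block scale in full precision, and omits the second, per-tensor scale level for brevity. All function names are hypothetical.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # assumed E2M1 magnitudes
BLOCK = 16                                            # assumed block size


def fake_quant(weights):
    """Quantize then dequantize, block by block, so the forward pass
    sees 4-bit rounding error while values stay in floating point."""
    out = []
    for start in range(0, len(weights), BLOCK):
        block = weights[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            out.extend(block)
            continue
        step = amax / FP4_GRID[-1]  # per-block scale: largest value maps to 6
        for v in block:
            q = min(FP4_GRID, key=lambda g: abs(abs(v) / step - g))
            out.append(math.copysign(q * step, v))
    return out


def ste_backward(grad_output):
    """Straight-through estimator: treat rounding as the identity in the
    backward pass, so the incoming gradient is passed through unchanged."""
    return list(grad_output)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the output distribution."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    print(fake_quant([0.0, 0.3, -0.6, 1.2, 6.0]))
    print(distill_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

In an actual training loop, the student's expert weights would pass through something like `fake_quant` on every forward step, the loss would be computed against the teacher's logits, and the gradient would reach the underlying full-precision weights via the straight-through rule.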

In terms of performance, Command A+ shows dramatic improvements over its predecessors. On τ²-Bench Telecom, scores rose from 37% to 85% compared to Command A Reasoning. Agentic coding performance on Terminal-Bench Hard jumped from 3% to 25%. Internal evaluations on the North platform, utilizing LLM-as-a-judge techniques, showed a 20% improvement in Agentic Question Answering, which specifically measures the model's ability to navigate enterprise questions using MCP-connected cloud file systems.
