News

Meet OpenMythos: Open-Source PyTorch Reconstruction of Claude Mythos Proposes Recurrent-Depth Transformer Achieving 1.3B Performance with 770M Parameters

Anthropic has never published a technical paper on Claude Mythos, yet that hasn't deterred the research community from theorizing about its architecture. A new open-source project named OpenMythos, released on GitHub by Kye Gomez, undertakes an ambitious task: a first-principles theoretical reconstruction of what the Claude Mythos architecture might be, built entirely in PyTorch and grounded in peer-reviewed research.

This project is not a leaked model, a fine-tune, or a distillation. It is a hypothesis rendered in code, and a hypothesis specific enough to be falsifiable, which makes it particularly compelling.

The Main Claim: Claude Mythos Is a Recurrent-Depth Transformer

OpenMythos's central claim is that Claude Mythos belongs to a class of architectures known as Recurrent-Depth Transformers (RDTs), also referred to in the literature as Looped Transformers. This design differs fundamentally from a standard transformer stack. In a conventional transformer (e.g., GPT, LLaMA, Mistral), input passes through a sequence of unique layers, each with its own weights; greater capability typically means more layers and more parameters. In an RDT, by contrast, a fixed set of weights is applied iteratively across T loop steps within a single forward pass. The same weights execute multiple times, so reasoning depth is a function of inference-time iterations rather than stored parameters.

One can conceptualize this not as reading a book, but as refining a draft: the model repeatedly engages with the same computational block, improving its internal representation with each pass.
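The contrast between a conventional stack and a weight-shared loop can be sketched in a few lines of PyTorch. This is a minimal illustration only: the block below is a simplified feedforward unit (no attention), and none of the names or sizes come from the OpenMythos repository.

```python
import torch
import torch.nn as nn

d_model = 64

class Block(nn.Module):
    """Simplified stand-in for a transformer block (FFN + residual + norm)."""
    def __init__(self):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                nn.Linear(d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h):
        return self.norm(h + self.ff(h))

# Conventional stack: 8 layers, 8 independent sets of weights.
stack = nn.Sequential(*[Block() for _ in range(8)])

# Recurrent depth: ONE set of weights, looped T times at inference.
shared = Block()

def looped_forward(h, T=8):
    for _ in range(T):
        h = shared(h)          # same parameters reused each iteration
    return h

x = torch.randn(2, 16, d_model)
y_stack = stack(x)             # depth comes from stored parameters
y_loop = looped_forward(x)     # depth comes from inference-time iterations

n_stack = sum(p.numel() for p in stack.parameters())
n_loop = sum(p.numel() for p in shared.parameters())
# Same effective depth, but the looped model stores 1/8 the parameters.
```

This parameter saving is exactly the lever behind the headline claim: effective depth is purchased with compute at inference time rather than with weights.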

How the Architecture is Structured

OpenMythos instantiates this architecture as a three-part structure: Prelude → Recurrent Block → Coda. The Prelude and Coda are standard transformer layers that execute once. The Recurrent Block forms the computational core, looped up to T=16 times.
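The three-part layout can be sketched as follows. The stand-in modules here are single linear layers purely for illustration; the real Prelude, Recurrent Block, and Coda are full transformer layers, and these names and sizes are assumptions, not code from the repository.

```python
import torch
import torch.nn as nn

d, T = 64, 16

prelude = nn.Linear(d, d)      # stand-in for the once-executed input layers
recurrent = nn.Linear(d, d)    # stand-in for the shared, looped core block
coda = nn.Linear(d, d)         # stand-in for the once-executed output layers

def forward(x, T=T):
    e = prelude(x)                           # encode the input once
    h = torch.zeros_like(e)                  # initial hidden state
    for _ in range(T):                       # loop the SAME block up to T=16 times
        h = torch.tanh(recurrent(h + e))     # e re-injected at every step
    return coda(h)                           # decode once

out = forward(torch.randn(2, 16, d))
```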

At each loop step t, the hidden state is updated according to the rule:

h_{t+1} = A·h_t + B·e + Transformer(h_t, e)

Here, h_t denotes the hidden state after loop iteration t, and e is the encoded input from the Prelude, re-injected at every step. This re-injection is critical: without it, the hidden state would drift away from the original input signal over deep loops. The learned matrices A and B govern how much of the previous hidden state and of the encoded input is carried forward at each step.
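A minimal sketch of one loop step follows. The attention-plus-FFN module is a generic stand-in for the "Transformer(h_t, e)" term, and A and B are implemented as learned linear maps; all names here are illustrative assumptions, not the repository's actual classes.

```python
import torch
import torch.nn as nn

d = 64

class RecurrentStep(nn.Module):
    """One application of h_{t+1} = A·h_t + B·e + Transformer(h_t, e)."""
    def __init__(self):
        super().__init__()
        self.A = nn.Linear(d, d, bias=False)   # learned carry of previous state
        self.B = nn.Linear(d, d, bias=False)   # learned re-injection of input
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, h, e):
        # Stand-in "Transformer(h_t, e)": attend from the state to the input.
        a, _ = self.attn(h, e, e)
        return self.A(h) + self.B(e) + self.ff(a)

step = RecurrentStep()
e = torch.randn(2, 16, d)      # Prelude output, fixed across all loop steps
h = torch.zeros_like(e)        # initial hidden state
for t in range(16):            # T = 16 iterations of the SAME weights
    h = step(h, e)             # e re-injected every step, preventing drift
```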

The FFN within the Recurrent Block is not a standard feedforward layer. OpenMythos replaces it with a Mixture-of-Experts (MoE) layer, following the design introduced in DeepSeekMoE. This involves a large pool of fine-grained routed experts, where only a sparse top-K subset is activated per token, alongside a small set of always-active shared experts that absorb common cross-domain patterns. Crucially, the router selects distinct expert subsets at each loop depth, implying that each iteration is computationally unique despite sharing the same base weights.
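A hedged sketch of such an MoE layer is below: a pool of small routed experts with sparse top-K selection per token, plus always-active shared experts. The expert counts, sizes, and the use of plain linear layers as experts are illustrative assumptions; the actual OpenMythos configuration may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_routed, n_shared, top_k = 32, 8, 2, 2   # illustrative sizes only

class MoE(nn.Module):
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(d, n_routed, bias=False)
        self.routed = nn.ModuleList(nn.Linear(d, d) for _ in range(n_routed))
        self.shared = nn.ModuleList(nn.Linear(d, d) for _ in range(n_shared))

    def forward(self, x):                          # x: (tokens, d)
        scores = F.softmax(self.router(x), dim=-1)
        w, idx = scores.topk(top_k, dim=-1)        # sparse top-K routing per token
        out = sum(s(x) for s in self.shared)       # shared experts: always active
        for j, expert in enumerate(self.routed):
            # Gate is zero for tokens that did not select expert j.
            gate = (w * (idx == j)).sum(-1, keepdim=True)
            out = out + gate * expert(x)
        return out

moe = MoE()
h = torch.randn(5, d)
# Routing depends on the current hidden state, so the same MoE weights can
# activate a different expert subset at each loop depth.
y = moe(h)
```

Because the router reads the evolving hidden state, two loop iterations through identical weights can still take different computational paths, which is what makes each depth step "unique" despite full weight sharing.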
