The open model landscape has become highly competitive, a stark contrast to previous years. When Llama 3 was released, researchers were primarily working with Llama 2, and the update was met with widespread enthusiasm. Similarly, Qwen 3 arrived shortly after Llama 4's troubled launch, landing in a research community already doing RL work on Qwen 2.5, which made the upgrade an obvious choice. Today, however, any new open model release must compete with a multitude of established players such as Qwen 3.5, Kimi K2.5, GLM 5, MiniMax M2.5, GPT-OSS, Arcee Large, Nemotron 3, and Olmo 3.
Despite the crowded market, the potential of open models remains immense, akin to dark matter: a vast, widely recognized potential that still lacks clear methods or worked examples for unlocking it. Innovations in agentic AI and OpenClaw are expected to spur widespread experimentation with open models, complementing rather than replacing offerings like Claude and Codex.
For open models, initial benchmark results at release present an incomplete picture. That incompleteness can be exciting, since open models carry higher variance and more capacity for surprise, but it also points to structural challenges in building businesses and robust AI experiences around them compared to closed alternatives. When a new closed model like Claude Opus or GPT is released, a few hours of testing within agentic workflows can effectively gauge its capabilities. Applying the same "vibe test" to open models, however, is a category error: at release, their serving stacks and integrations are often immature, so early impressions conflate the model with its tooling.
Another significant aspect of open models in the era of agents is that they let evaluators set aside integration complexity, harnesses, and tools and assess the model's inherent capabilities more directly. While certain functionalities, such as search, naturally require tools, the ability to precisely measure a model's standalone progress is a welcome simplification in an often opaque AI space.
When evaluating a new open-weight model for investment, the following factors are crucial:
- Model performance (and size): How the model performs on relevant benchmarks and how it compares to other models of similar scale.
- Country of origin: The provenance of a model, such as whether it was developed in China, can be a critical consideration for some businesses.
- Model license: Models requiring complex legal approvals for use will experience slower adoption rates within mid-sized and large enterprises.
- Tooling at release: Because many models push architectural or tooling boundaries, they ship with implementations in popular software like vLLM, Transformers, or SGLang that are incomplete or substantially slower than they will eventually be.
- Model fine-tunability: How easy or hard it is to adapt the model to specific use cases in practice.
The primary challenge is that while some of these factors, such as general performance, licensing, and origin, are immediately apparent at release, others, like tooling maturity and fine-tunability, often require days or weeks to fully assess.
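For the tooling question in particular, a cheap first-pass check exists even when a full assessment takes weeks: try to load the model and generate through the standard stack on day one. Below is a minimal smoke-test sketch using Hugging Face Transformers; the model identifier is a hypothetical placeholder, and the trust_remote_code flag is an assumption for architectures whose support has not yet landed upstream.

```python
# Minimal day-one smoke test for a newly released open-weight model.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some-org/new-open-model"  # hypothetical placeholder ID

# New architectures often ship custom modeling code before native
# support lands in Transformers, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # place weights across available devices
    trust_remote_code=True,
)

# A missing or broken chat template is itself an early signal of
# immature release tooling.
messages = [{"role": "user", "content": "Explain attention in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

If the custom code fails to load, generation is unexpectedly slow, or the chat template is absent, those are exactly the day-one tooling gaps described above, separate from the model's underlying quality.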