GLM-OCR

by zai-org

About

GLM-OCR is a multimodal OCR model built on the GLM-V encoder-decoder architecture and designed for complex document understanding. It pairs an advanced CogViT visual encoder with a GLM-0.5B language decoder, and is trained with a Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization.

GLM-OCR achieves state-of-the-art performance, scoring 94.62 on OmniDocBench V1.5, and excels at formula recognition, table recognition, and information extraction, with particular optimization for challenging real-world layouts.

With only 0.9B parameters, the model supports efficient deployment via vLLM, SGLang, and Ollama, offering low inference latency and reduced compute cost. This makes it well suited to high-concurrency services and edge deployments. GLM-OCR is fully open source and ships with a comprehensive SDK, making it easy to integrate robust, high-quality OCR into diverse real-world business scenarios.
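To illustrate the vLLM deployment path, here is a minimal sketch of building a request for a vLLM server's OpenAI-compatible chat-completions endpoint. The model id `zai-org/GLM-OCR`, the prompt text, and the endpoint URL are illustrative assumptions, not confirmed by this page; substitute whatever your deployment actually serves.

```python
import base64
import json

def build_ocr_request(image_bytes: bytes, model: str = "zai-org/GLM-OCR") -> dict:
    """Build an OpenAI-style chat-completions payload carrying one image.

    The model id and prompt below are placeholders; replace them with the
    values your own GLM-OCR deployment uses.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Image is inlined as a base64 data URL.
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text",
                     "text": "Extract all text from this document."},
                ],
            }
        ],
    }

# POST this payload to the server, e.g. with requests:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
payload = build_ocr_request(b"\x89PNG\r\n...")
print(json.dumps(payload)[:60])
```

The same payload shape works against any OpenAI-compatible serving stack, which is what makes the vLLM and SGLang routes interchangeable from the client's point of view.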