News

Zhipu AI Unveils GLM-4.6V Multimodal Models: Native Tool Calling Powers Next-Gen AI Agents

Zhipu AI has introduced the latest additions to its GLM Vision family: the GLM-4.6V and GLM-4.6V-Flash multimodal language models, built to provide robust multimodal capabilities for AI agent workflows. Despite being smaller than competitors such as Qwen3-VL and Step3, the GLM-4.6V series delivers superior performance.

Key features of the GLM-4.6V series include:

  • Model Sizes and Deployment: The series comprises two models: GLM-4.6V (106B parameters), the foundational model designed for cloud GPU clusters and offering the best response quality, and GLM-4.6V-Flash (9B parameters), a lightweight version suited to local deployment.
  • Extended Context Length: Both models boast a 128K context length, enabling them to process lengthy documents and engage in extended, coherent conversations with users.
  • Native Multimodal Tool Use: A significant advancement, the GLM-4.6V models natively support multimodal tool calling, which removes the need to chain a text-focused LLM with separate vision models: a single model can accept image, video, and document inputs, produce text output, and call tools as needed. The models can also interpret the outputs tools return, such as rendered webpages, search results, and statistical charts. This streamlines the entire workflow from perception through reasoning to execution and significantly strengthens their potential for AI agent applications; a request sketch follows this list.
  • Rich Text Understanding: The GLM-4.6V models can accept complex inputs like research papers, reports, and slide decks to generate structured output, demonstrating their advanced text comprehension capabilities.
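
To make the tool-calling flow concrete, here is a minimal sketch of a single multimodal tool-calling turn. It assumes an OpenAI-compatible chat endpoint; the base URL, the model id "glm-4.6v", and the search_product tool are illustrative placeholders, not confirmed details of Zhipu's API.

```python
# Minimal sketch: one multimodal tool-calling turn against an
# OpenAI-compatible endpoint. The base URL, model id, and tool schema
# below are assumptions for illustration only.
from openai import OpenAI

client = OpenAI(
    base_url="https://open.bigmodel.cn/api/paas/v4/",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "search_product",  # hypothetical tool
        "description": "Look up a product by name and return price data.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

# One request carries both an image input and a tool definition: the
# model reads the photo, then decides whether to call the tool.
response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/shelf.jpg"}},
            {"type": "text",
             "text": "Identify the product in this photo and fetch its current price."},
        ],
    }],
    tools=tools,
)

# If the model chose to call the tool, the arguments arrive as JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

The point of the pattern is that a single request carries both the image and the tool schema, so the model can ground its tool arguments directly in what it sees rather than relying on a separate captioning model.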

Developers can run inference with the GLM-4.6V models through the Hugging Face Transformers library, and the models lend themselves to Gradio applications for use cases such as OCR (Optical Character Recognition) and image-to-HTML conversion; sketches of both follow.
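
As a starting point, here is a minimal local-inference sketch with Transformers. The checkpoint id zai-org/GLM-4.6V-Flash mirrors Zhipu's naming for earlier GLM-V releases but is an assumption; check the official model card for the exact id, the required transformers version, and recommended generation settings.

```python
# Minimal local-inference sketch with Hugging Face Transformers.
# The checkpoint id is an assumption based on Zhipu's earlier GLM-V
# naming; consult the model card before running.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "zai-org/GLM-4.6V-Flash"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Chat-template messages with an image part; apply_chat_template handles
# tokenization and image preprocessing in one call.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/receipt.png"},
        {"type": "text", "text": "Transcribe all text in this image (OCR)."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```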
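
Building on that, a hypothetical Gradio wrapper can turn the same model into a simple OCR demo. The ocr function below reuses the model and processor from the previous sketch, and the prompt text is illustrative.

```python
# Hypothetical Gradio OCR demo reusing `model` and `processor` from the
# sketch above. gr.Interface, gr.Image, and gr.Textbox are standard
# Gradio components.
import gradio as gr

def ocr(image):
    # Build a prompt with an image placeholder, then let the processor
    # pair it with the uploaded PIL image.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Transcribe all text in this image."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    return processor.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

gr.Interface(fn=ocr, inputs=gr.Image(type="pil"), outputs=gr.Textbox()).launch()
```

Swapping the prompt for "Convert this screenshot into HTML" would adapt the same skeleton to the image-to-HTML use case.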

The introduction of GLM-4.6V marks a substantial step forward in multimodal understanding and execution, laying a strong foundation for more intelligent and autonomous local AI agent workflows.

↗ Read original source