The GLM-4.6V series is Z.ai’s next iteration of multimodal large language models. GLM-4.6V scales its context window to 128K tokens during training and achieves state-of-the-art (SOTA) performance in visual understanding among models of similar parameter scale. Crucially, GLM-4.6V integrates native Function Calling capabilities for the first time. This effectively bridges the gap between “visual perception” and “executable action,” providing a unified technical foundation for multimodal agents in real-world business scenarios.
Traditional tool use in LLMs often relies on text alone, requiring multiple intermediate conversions when handling images, videos, or complex documents, a process that introduces information loss and engineering complexity.
GLM-4.6V, by contrast, is equipped with native multimodal tool-calling capability.
This native support allows GLM-4.6V to close the loop from perception to understanding to execution, enabling complex tasks like mixed text-image content creation and visual web searching.
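To make this concrete, below is a minimal sketch of a multimodal tool call against the chat-completions endpoint used later in this post. The tools field is assumed to follow the common OpenAI-style function-calling schema; the web_search function, its parameter schema, and the example image URL are illustrative placeholders rather than part of the official API, so consult the Z.ai documentation for the exact supported fields.

# Sketch only: the web_search tool, its schema, and the image URL are placeholders.
# The tools field is assumed to follow the OpenAI-style function-calling format.
curl -X POST \
  https://api.z.ai/api/paas/v4/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.6v",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {"url": "https://example.com/product-photo.jpg"}
          },
          {
            "type": "text",
            "text": "Identify this product and find its current price online."
          }
        ]
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "web_search",
          "description": "Search the web for up-to-date information",
          "parameters": {
            "type": "object",
            "properties": {
              "query": {"type": "string", "description": "Search query text"}
            },
            "required": ["query"]
          }
        }
      }
    ]
  }'

If the model decides the image alone is not enough to answer, it can respond with a call to web_search instead of a final answer; the calling application then executes the search and feeds the results back in a follow-up message.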
We evaluated GLM-4.6V on over 20 mainstream multimodal benchmarks, including MMBench, MathVista, and OCRBench. The model achieves SOTA performance among open-source models of comparable scale in key capabilities such as multimodal interaction, logical reasoning, and long-context understanding.
The model supports native multimodality, enabling direct processing of documents that contain visual elements such as images, tables, and charts. This eliminates the need for cumbersome and error-prone preprocessing steps such as OCR and document parsing.
In addition to generating text, the model can independently decide which pages and regions contain the relevant content, and can invoke tools via MCP to capture and embed screenshots, producing well-illustrated reports (a sketch of such a tool definition follows below).
Building on in-depth paper reading and on analyzing and consolidating the gathered information, the model applies its reasoning capabilities to offer its own “insights” on specific topics.
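The exact tools available to the model are defined by whichever MCP server is attached, not by GLM-4.6V itself, but an OpenAI-style function definition for a hypothetical screenshot tool might look roughly like this (all names and fields are placeholders):

{
  "type": "function",
  "function": {
    "name": "capture_screenshot",
    "description": "Hypothetical placeholder tool: capture a region of a document page as an image for embedding in the report",
    "parameters": {
      "type": "object",
      "properties": {
        "page": {"type": "integer", "description": "1-based page number"},
        "bbox": {
          "type": "array",
          "items": {"type": "number"},
          "description": "Region as [xmin, ymin, xmax, ymax] in page coordinates"
        }
      },
      "required": ["page", "bbox"]
    }
  }
}

The request below shows a simpler end-to-end call to the same chat-completions endpoint: a visual-grounding query that asks the model to return bounding-box coordinates for an object in an image.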
curl -X POST \
  https://api.z.ai/api/paas/v4/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.6v",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG"
            }
          },
          {
            "type": "text",
            "text": "Where is the second bottle of beer from the right on the table? Provide coordinates in [[xmin,ymin,xmax,ymax]] format"
          }
        ]
      }
    ],
    "thinking": {
      "type": "enabled"
    }
  }'
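If the JSON payload above is saved to a file (named grounding_request.json here purely for illustration), and assuming the response follows the usual OpenAI-compatible chat-completions schema, the model's answer, including the [[xmin,ymin,xmax,ymax]] box, can be extracted with jq:

# Assumes an OpenAI-compatible response body; adjust the jq path if the
# actual schema differs (see the Z.ai API reference).
curl -s -X POST https://api.z.ai/api/paas/v4/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d @grounding_request.json | jq -r '.choices[0].message.content'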
GLM-4.6V is a state-of-the-art multimodal large language model from Z.ai. It features a 128k context window, native function calling capabilities, and superior visual understanding. It comes in two versions: GLM-4.6V (106B) for cloud/cluster scenarios and GLM-4.6V-Flash (9B) for local/low-latency deployment.
You can access GLM-4.6V through the Z.ai Open Platform API, the online demo at chat.z.ai, or by downloading the model from Hugging Face for local deployment using SGLang or vLLM.
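For local serving, a minimal vLLM sketch is shown below. The Hugging Face repository id is a placeholder; check the official GLM-4.6V model card for the exact name and any recommended serving flags.

# Placeholder repo id; see the official model card for the exact Hugging Face
# name and recommended flags before serving.
pip install vllm
vllm serve zai-org/GLM-4.6V-Flash --port 8000

Once the server is running, the earlier curl request can be pointed at http://localhost:8000/v1/chat/completions, with the model field set to the name of the locally served checkpoint.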
While GLM-4.6V achieves SOTA performance, it still has some limitations: pure-text QA capabilities have room for improvement, the model may occasionally overthink or repeat itself, and its perception has limits in counting accuracy and in identifying specific individuals.