Vision Language Models
GLM-4.6V Series

GLM-4.6V

The GLM-4.6V series is Z.ai's latest iteration of multimodal large language models. GLM-4.6V scales its context window to 128K tokens during training and achieves state-of-the-art (SoTA) performance in visual understanding among models of similar parameter scale. Crucially, GLM-4.6V integrates native Function Calling capabilities for the first time. This effectively bridges the gap between "visual perception" and "executable action," providing a unified technical foundation for multimodal agents in real-world business scenarios.

Positioning: For cloud and high-performance cluster scenarios
Input Modality: Video / Image / Text / File
Output Modality: Text
Context Length: 128K

Resources

API Documentation: Learn how to call the API.

Introducing GLM-4.6V

1. Native Multimodal Tool Use

Traditional tool usage in LLMs often relies on pure text, requiring multiple intermediate conversions when dealing with images, videos, or complex documents—a process that introduces information loss and engineering complexity.

GLM-4.6V is equipped with native multimodal tool calling capability:

  • Multimodal Input: Images, screenshots, and document pages can be passed directly as tool parameters without being converted to text descriptions first, minimizing signal loss.
  • Multimodal Output: The model can visually comprehend results returned by tools—such as search results, statistical charts, rendered web screenshots, or retrieved product images—and incorporate them into subsequent reasoning chains.

This native support allows GLM-4.6V to close the loop from perception to understanding to execution, enabling complex tasks like mixed text-image content creation and visual web searching.
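
As a concrete illustration, here is a minimal Python sketch of a multimodal tool-calling request against the chat completions endpoint shown in the Quick Start below. The OpenAI-style tools schema is an assumption, and the search_product_images function and the screenshot URL are hypothetical stand-ins, not part of the official API:

import requests

# Hypothetical tool: the model can fill in its arguments directly from the
# screenshot, with no intermediate text-conversion step.
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_product_images",  # hypothetical example tool
            "description": "Search for candidate product images by keyword.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search keywords"}
                },
                "required": ["query"],
            },
        },
    }
]

payload = {
    "model": "glm-4.6v",
    "messages": [
        {
            "role": "user",
            "content": [
                # The screenshot is passed as-is, not as a text description.
                {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
                {"type": "text", "text": "Find similar product images for the item shown."},
            ],
        }
    ],
    "tools": tools,
}

resp = requests.post(
    "https://api.z.ai/api/paas/v4/chat/completions",
    headers={"Authorization": "Bearer your-api-key"},
    json=payload,
)
# If the model chooses to call the tool, its arguments appear in the message.
print(resp.json()["choices"][0]["message"])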

2. Capabilities & Scenarios

  • Intelligent Image-Text Content Creation & Layout: GLM-4.6V can accept multimodal inputs (mixed text/image papers, reports, or slides) and automatically generate high-quality, structured, image-text interleaved content.
  • Complex Document Understanding: Accurately understands structured information from documents containing text, charts, figures, and formulas.
  • Visual Tool Retrieval: During generation, the model can autonomously call search tools to find candidate images or crop key visuals from the source multimodal context.
  • Visual Audit & Layout: The model performs a "visual audit" on candidate images to assess relevance and quality, filtering out noise to produce structured articles ready for social media or knowledge bases.

3. Overall Performance

We evaluated GLM-4.6V on over 20 mainstream multimodal benchmarks, including MMBench, MathVista, and OCRBench. The model achieves SoTA performance among open-source models of comparable scale in key capabilities such as multimodal interaction, logical reasoning, and long-context understanding.

Examples

Description

Supports native multimodality, enabling direct processing of documents containing visual elements such as images, tables, and plots. This eliminates the need for cumbersome and error-prone preprocessing steps such as OCR and document parsing.

In addition to text output, the model can independently decide where relevant content resides, locating the specific pages and regions. It can also invoke tools via MCP to capture and embed screenshots, generating well-illustrated reports.

Building on in-depth paper reading and consolidation of the extracted information, the model's reasoning capabilities allow it to express its own "insights" on specific topics.

Prompt

Based on the key visualizations from the two papers, produce a comprehensive, illustration-integrated research report analyzing the benchmark described in the literature.


Quick Start

cURL
curl -X POST \
    https://api.z.ai/api/paas/v4/chat/completions \
    -H "Authorization: Bearer your-api-key" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "glm-4.6v",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG"
                        }
                    },
                    {
                        "type": "text",
                        "text": "Where is the second bottle of beer from the right on the table?  Provide coordinates in [[xmin,ymin,xmax,ymax]] format"
                    }
                ]
            }
        ],
        "thinking": {
            "type":"enabled"
        }
    }'
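
The same request in Python with the requests library (a direct translation of the cURL call above; reading the reply assumes the standard chat-completions response shape):

Python
import requests

url = "https://api.z.ai/api/paas/v4/chat/completions"
headers = {"Authorization": "Bearer your-api-key"}

payload = {
    "model": "glm-4.6v",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG"
                    },
                },
                {
                    "type": "text",
                    "text": "Where is the second bottle of beer from the right on the table? Provide coordinates in [[xmin,ymin,xmax,ymax]] format",
                },
            ],
        }
    ],
    "thinking": {"type": "enabled"},  # enables the model's thinking mode
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
# Assumed standard chat-completions response shape: choices -> message -> content
print(response.json()["choices"][0]["message"]["content"])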

FAQ

What is GLM-4.6V?

GLM-4.6V is a state-of-the-art multimodal large language model from Z.ai. It features a 128K context window, native function calling capabilities, and superior visual understanding. It comes in two versions: GLM-4.6V (106B) for cloud/cluster scenarios and GLM-4.6V-Flash (9B) for local/low-latency deployment.

What are the key features of GLM-4.6V?

  • Native Multimodal Function Calling: Directly process images/screenshots as tool inputs and interpret visual outputs.
  • Interleaved Image-Text Content Generation: Create high-quality mixed media content from complex multimodal inputs.
  • Multimodal Document Understanding: Process up to 128K tokens of multi-document input, understanding text, layout, charts, and tables without OCR.
  • Frontend Replication & Visual Editing: Reconstruct and edit HTML/CSS from UI screenshots via natural language.

How can I use GLM-4.6V?

You can access GLM-4.6V through the Z.ai Open Platform API, the online demo at chat.z.ai, or by downloading the model from Hugging Face for local deployment using SGLang or vLLM.
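
For local deployment, here is a minimal sketch assuming an OpenAI-compatible server (as exposed by vLLM or SGLang) is already running; the port, the served model name, and the image URL are assumptions to adapt to your own setup:

Python
from openai import OpenAI

# Point the standard OpenAI client at a locally served model.
# Assumes an OpenAI-compatible server (e.g. started with vLLM or SGLang)
# listening on port 8000; adjust base_url and model name to your setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.6v",  # whatever name the local server registered
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/page.png"}},
                {"type": "text", "text": "Summarize the chart on this page."},
            ],
        }
    ],
)
print(response.choices[0].message.content)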

What are the known limitations?

While GLM-4.6V achieves SoTA performance, it still has some limitations: pure-text QA capabilities have room for improvement, the model may occasionally overthink or repeat itself, and there are certain perception limitations in counting accuracy and in identifying specific individuals.