QwenLM / qwen3-vl

Qwen3-VL: The most powerful vision-language model in the Qwen family, offering advanced multimodal capabilities including visual agent functionality, superior text performance, and long context understanding.

Models

Model                    Size   Context Length   Input Modalities
qwen3-vl-235b            235B   256K             Text
qwen3-vl-235b-instruct   235B   256K             Text

Qwen3-VL: Next-Generation Vision-Language Models

Qwen3-VL represents the most powerful vision-language model in the Qwen model family to date. This generation introduces significant improvements across multiple dimensions: text understanding and generation, visual content perception and reasoning, spatial relationship understanding, dynamic video analysis, and AI agent interactions.

Key Features

  • Visual Agent Capabilities: Qwen3-VL can operate computer and mobile interfaces, recognize GUI elements, understand button functions, call tools, and complete complex tasks. It achieves top global performance on benchmarks like OSWorld, and tool usage significantly enhances its performance on fine-grained perception tasks (see the tool-calling sketch after this list).
  • Superior Text-Centric Performance: Through early-stage joint pretraining of text and visual modalities, Qwen3-VL continuously strengthens its language capabilities. Its performance on text-based tasks matches that of Qwen3-235B-A22B-2507, making it a truly "text-grounded, multimodal powerhouse."
  • Greatly Improved Visual Coding: The model can generate code from images or videos, turning design mockups into Draw.io diagrams or HTML/CSS/JavaScript code and making "what you see is what you get" visual programming a reality.
  • Much Better Spatial Understanding: Qwen3-VL supports 2D grounding from absolute to relative coordinates and can judge object positions, viewpoint changes, and occlusion relationships. It also supports 3D grounding, enabling complex spatial reasoning and embodied AI applications.
  • Long Context & Long Video Understanding: All models natively support 256K tokens of context, expandable up to 1 million tokens. This enables processing hundreds of pages of technical documents, entire textbooks, or even two-hour videos while maintaining accurate detail retrieval down to the exact second.
  • Stronger Multimodal Reasoning: The Thinking model is specially optimized for STEM and math reasoning, capable of noticing fine details, breaking down problems step-by-step, analyzing cause and effect, and providing logical, evidence-based answers.
  • Upgraded Visual Perception & Recognition: Improved pre-training data quality and diversity enables recognition of a wide range of objects including celebrities, anime characters, products, landmarks, animals, and plants - covering both everyday life and professional "recognize anything" needs.
  • Better OCR Across More Languages & Complex Scenes: OCR now supports 32 languages (up from 10), with improved performance under challenging conditions like poor lighting, blur, or tilted text. Enhanced recognition accuracy for rare characters, ancient scripts, and technical terms.
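The visual agent behavior above is typically driven through tool (function) calling. The sketch below is a minimal, hypothetical example against an OpenAI-compatible chat-completions endpoint: the base URL, the single click tool, and the assumption that your provider exposes tool calling for Qwen3-VL are illustrative, not part of any official API.

```python
import base64
import json
import os

from openai import OpenAI  # any OpenAI-compatible client

# Hypothetical provider endpoint; substitute your provider's base URL and key.
client = OpenAI(base_url="https://your-provider.example/v1",
                api_key=os.environ["API_KEY"])

def to_data_url(path: str) -> str:
    """Encode a local screenshot as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

# One illustrative GUI action the model is allowed to call.
tools = [{
    "type": "function",
    "function": {
        "name": "click",
        "description": "Click a point on the screen.",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    tools=tools,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": to_data_url("screenshot.png")}},
            {"type": "text", "text": "Open the Settings menu."},
        ],
    }],
)

# If the model decides to act, it returns a tool call with pixel coordinates.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```

In a real agent loop you would execute the returned action, capture a fresh screenshot, and send it back as the next turn.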

Model Variants

Name                     Size   Context   Input Modalities   Description
qwen3-vl-235b            235B   256K      Text               235B parameter model
qwen3-vl-235b-instruct   235B   256K      Text               Instruction-tuned

Technical Capabilities

Multimodal Understanding

Qwen3-VL models excel at processing and reasoning about various visual inputs including:

  • Images: High-resolution photographs, diagrams, screenshots
  • Documents: Technical manuals, research papers, forms
  • Interfaces: GUI screenshots, mobile app interfaces
  • Videos: Dynamic scenes, temporal understanding
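Providers differ in how (and whether) they accept video directly; a common workaround is to sample frames ahead of time and send them as an ordered image sequence. The sketch below assumes an OpenAI-compatible endpoint; the endpoint and frame paths are placeholders.

```python
import base64
import os

from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1",  # placeholder endpoint
                api_key=os.environ["API_KEY"])

def image_part(path: str) -> dict:
    """Wrap a local image as an image_url content part (base64 data URL)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

# Frames sampled from a video beforehand (e.g. one per second); paths are illustrative.
frames = [f"frames/frame_{i:04d}.jpg" for i in range(8)]

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    messages=[{
        "role": "user",
        "content": [*map(image_part, frames),
                    {"type": "text",
                     "text": "These frames are in chronological order. "
                             "Describe what happens and roughly when."}],
    }],
)
print(response.choices[0].message.content)
```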

Advanced OCR

The models feature state-of-the-art optical character recognition with:

  • Support for 32 languages
  • Robust performance in challenging conditions (low light, blur, tilted text)
  • Enhanced recognition of rare characters and technical terminology
  • Improved document structure understanding
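As a hedged illustration, the snippet below asks the model to transcribe a scanned document while preserving its structure, again through an assumed OpenAI-compatible endpoint; the file name and endpoint are placeholders.

```python
import base64
import os

from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1",  # placeholder endpoint
                api_key=os.environ["API_KEY"])

with open("invoice_scan.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Ask for a faithful transcription that keeps the document structure.
response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": "Transcribe all text in this scan. Preserve headings, tables, "
                     "and reading order, and keep the original language."},
        ],
    }],
)
print(response.choices[0].message.content)
```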

Spatial and Temporal Reasoning

Qwen3-VL demonstrates advanced capabilities in:

  • 2D Grounding: Understanding absolute and relative coordinates
  • 3D Understanding: Analyzing spatial relationships and occlusion
  • Video Analysis: Temporal understanding of dynamic scenes
  • Object Relationships: Judging positions, viewpoints, and interactions
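A typical way to exercise 2D grounding is to request bounding boxes in a fixed JSON shape, as in the sketch below. The endpoint, the bbox_2d key, and the coordinate convention (absolute pixels vs. normalized values) are assumptions here; check your provider's documentation or the model card for the exact grounding format.

```python
import base64
import os

from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1",  # placeholder endpoint
                api_key=os.environ["API_KEY"])

with open("shelf.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": 'Locate every bottle in the image. Reply with JSON only: a list '
                     'of objects like {"label": "...", "bbox_2d": [x1, y1, x2, y2]}.'},
        ],
    }],
)

reply = response.choices[0].message.content
# The model may wrap the JSON in a markdown code fence; strip it before parsing.
print(reply)
```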

Use Cases

Visual Programming

Transform design mockups into functional code:

  • Convert UI screenshots to HTML/CSS/JavaScript
  • Generate Draw.io diagrams from sketches
  • Create technical documentation from visual assets
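For example, a mockup-to-HTML request can be as small as the sketch below; the endpoint and file names are placeholders, and the returned code may need light cleanup such as stripping surrounding code fences.

```python
import base64
import os

from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1",  # placeholder endpoint
                api_key=os.environ["API_KEY"])

with open("mockup.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": "Recreate this design as a single self-contained HTML file "
                     "with inline CSS and JavaScript. Return only the code."},
        ],
    }],
)

with open("mockup.html", "w", encoding="utf-8") as out:
    out.write(response.choices[0].message.content)
```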

Enterprise Applications

  • Document Processing: Extract and analyze information from complex documents
  • Customer Support: Visual question answering for product support
  • Data Analysis: Interpret charts, graphs, and visual data representations
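One way to approach document processing is schema-guided extraction, sketched below; the schema, file name, and endpoint are illustrative, and production use would validate the returned JSON before acting on it.

```python
import base64
import json
import os

from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1",  # placeholder endpoint
                api_key=os.environ["API_KEY"])

with open("purchase_order.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

# Illustrative target shape for the extraction; adapt the fields to your documents.
schema = {
    "vendor": "string",
    "date": "YYYY-MM-DD",
    "line_items": [{"description": "string", "qty": "int", "unit_price": "float"}],
    "total": "float",
}

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": "Extract the fields below from this document and reply with "
                     "JSON matching this shape exactly:\n" + json.dumps(schema)},
        ],
    }],
)
print(response.choices[0].message.content)
```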

Research and Education

  • STEM Problem Solving: Visual math and science problem solving
  • Multilingual Content: OCR and translation across 32 languages
  • Accessibility: Visual assistance for accessibility applications

Getting Started

Qwen3-VL models are available through various API providers; for exact model identifiers and usage details, consult your provider's documentation.
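Many providers expose Qwen models through an OpenAI-compatible chat-completions endpoint; if yours does, a first request might look like the hedged sketch below, with the base URL as a placeholder and the model identifier taken from the table above (adjust it to your provider's catalog).

```python
import os

from openai import OpenAI

# Placeholder endpoint; point this at your provider's OpenAI-compatible base URL.
client = OpenAI(base_url="https://your-provider.example/v1",
                api_key=os.environ["API_KEY"])

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",  # adjust to the name your provider uses
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "What is in this picture?"},
        ],
    }],
)
print(response.choices[0].message.content)
```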