QwenLM / qwen3-vl

Qwen3-VL: The most powerful vision-language model in the Qwen family, offering advanced multimodal capabilities including visual agent functionality, superior text performance, and long context understanding.

Models

Model                    Size   Context Length   Input Modalities
qwen3-vl-235b            235B   256K             Text
qwen3-vl-235b-instruct   235B   256K             Text

Qwen3-VL: Next-Generation Vision-Language Models

Qwen3-VL represents the most powerful vision-language model in the Qwen model family to date. This generation introduces significant improvements across multiple dimensions: text understanding and generation, visual content perception and reasoning, spatial relationship understanding, dynamic video analysis, and AI agent interactions.

Key Features

  • Visual Agent Capabilities: Qwen3-VL can operate computer and mobile interfaces, recognize GUI elements, understand button functions, call tools, and complete complex tasks. It achieves top global performance on benchmarks like OSWorld, and tool usage significantly enhances its performance on fine-grained perception tasks (see the tool-calling sketch after this list).
  • Superior Text-Centric Performance: Through early-stage joint pretraining of text and visual modalities, Qwen3-VL continuously strengthens its language capabilities. Its performance on text-based tasks matches that of Qwen3-235B-A22B-2507, making it a truly "text-grounded, multimodal powerhouse."
  • Greatly Improved Visual Coding: The model can generate code from images or videos, turning design mockups into Draw.io diagrams or HTML/CSS/JavaScript code and making "what you see is what you get" visual programming a reality.
  • Much Better Spatial Understanding: Qwen3-VL supports 2D grounding from absolute to relative coordinates and can judge object positions, viewpoint changes, and occlusion relationships. It also supports 3D grounding, enabling complex spatial reasoning and embodied AI applications.
  • Long Context & Long Video Understanding: All models natively support 256K tokens of context, expandable up to 1 million tokens. This enables processing hundreds of pages of technical documents, entire textbooks, or even two-hour videos while maintaining accurate detail retrieval down to the exact second.
  • Stronger Multimodal Reasoning: The Thinking model is specially optimized for STEM and math reasoning, capable of noticing fine details, breaking down problems step-by-step, analyzing cause and effect, and providing logical, evidence-based answers.
  • Upgraded Visual Perception & Recognition: Improved pre-training data quality and diversity enables recognition of a wide range of objects including celebrities, anime characters, products, landmarks, animals, and plants - covering both everyday life and professional "recognize anything" needs.
  • Better OCR Across More Languages & Complex Scenes: OCR now supports 32 languages (up from 10), with improved performance under challenging conditions like poor lighting, blur, or tilted text. Enhanced recognition accuracy for rare characters, ancient scripts, and technical terms.
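The visual agent behavior above is typically driven through tool (function) calling. The sketch below is a minimal, hypothetical example against an OpenAI-compatible chat-completions endpoint: the base URL, the single click tool, and the assumption that your provider exposes tool calling for Qwen3-VL are illustrative, not part of any official API.

```python
import base64
import json
import os

from openai import OpenAI  # any OpenAI-compatible client

# Hypothetical provider endpoint; substitute your provider's base URL and key.
client = OpenAI(base_url="https://your-provider.example/v1",
                api_key=os.environ["API_KEY"])

def to_data_url(path: str) -> str:
    """Encode a local screenshot as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

# One illustrative GUI action the model is allowed to call.
tools = [{
    "type": "function",
    "function": {
        "name": "click",
        "description": "Click a point on the screen.",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    tools=tools,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": to_data_url("screenshot.png")}},
            {"type": "text", "text": "Open the Settings menu."},
        ],
    }],
)

# If the model decides to act, it returns a tool call with pixel coordinates.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```

In a real agent loop you would execute the returned action, capture a fresh screenshot, and send it back as the next turn.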

Model Variants

Name                     Size   Context   Input Modalities   Description
qwen3-vl-235b            235B   256K      Text               235B parameter model
qwen3-vl-235b-instruct   235B   256K      Text               Instruction-tuned

Technical Capabilities

Multimodal Understanding

Qwen3-VL models excel at processing and reasoning about various visual inputs including:

  • Images: High-resolution photographs, diagrams, screenshots
  • Documents: Technical manuals, research papers, forms
  • Interfaces: GUI screenshots, mobile app interfaces
  • Videos: Dynamic scenes, temporal understanding
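Providers differ in how (and whether) they accept video directly; a common workaround is to sample frames ahead of time and send them as an ordered image sequence. The sketch below assumes an OpenAI-compatible endpoint; the endpoint and frame paths are placeholders.

```python
import base64
import os

from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1",  # placeholder endpoint
                api_key=os.environ["API_KEY"])

def image_part(path: str) -> dict:
    """Wrap a local image as an image_url content part (base64 data URL)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

# Frames sampled from a video beforehand (e.g. one per second); paths are illustrative.
frames = [f"frames/frame_{i:04d}.jpg" for i in range(8)]

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    messages=[{
        "role": "user",
        "content": [*map(image_part, frames),
                    {"type": "text",
                     "text": "These frames are in chronological order. "
                             "Describe what happens and roughly when."}],
    }],
)
print(response.choices[0].message.content)
```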

Advanced OCR

The models feature state-of-the-art optical character recognition with:

  • Support for 32 languages
  • Robust performance in challenging conditions (low light, blur, tilted text)
  • Enhanced recognition of rare characters and technical terminology
  • Improved document structure understanding
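As a hedged illustration, the snippet below asks the model to transcribe a scanned document while preserving its structure, again through an assumed OpenAI-compatible endpoint; the file name and endpoint are placeholders.

```python
import base64
import os

from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1",  # placeholder endpoint
                api_key=os.environ["API_KEY"])

with open("invoice_scan.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Ask for a faithful transcription that keeps the document structure.
response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": "Transcribe all text in this scan. Preserve headings, tables, "
                     "and reading order, and keep the original language."},
        ],
    }],
)
print(response.choices[0].message.content)
```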

Spatial and Temporal Reasoning

Qwen3-VL demonstrates advanced capabilities in:

  • 2D Grounding: Understanding absolute and relative coordinates
  • 3D Understanding: Analyzing spatial relationships and occlusion
  • Video Analysis: Temporal understanding of dynamic scenes
  • Object Relationships: Judging positions, viewpoints, and interactions
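A typical way to exercise 2D grounding is to request bounding boxes in a fixed JSON shape, as in the sketch below. The endpoint, the bbox_2d key, and the coordinate convention (absolute pixels vs. normalized values) are assumptions here; check your provider's documentation or the model card for the exact grounding format.

```python
import base64
import os

from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1",  # placeholder endpoint
                api_key=os.environ["API_KEY"])

with open("shelf.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": 'Locate every bottle in the image. Reply with JSON only: a list '
                     'of objects like {"label": "...", "bbox_2d": [x1, y1, x2, y2]}.'},
        ],
    }],
)

reply = response.choices[0].message.content
# The model may wrap the JSON in a markdown code fence; strip it before parsing.
print(reply)
```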

Use Cases

Visual Programming

Transform design mockups into functional code:

  • Convert UI screenshots to HTML/CSS/JavaScript
  • Generate Draw.io diagrams from sketches
  • Create technical documentation from visual assets
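For example, a mockup-to-HTML request can be as small as the sketch below; the endpoint and file names are placeholders, and the returned code may need light cleanup such as stripping surrounding code fences.

```python
import base64
import os

from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1",  # placeholder endpoint
                api_key=os.environ["API_KEY"])

with open("mockup.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": "Recreate this design as a single self-contained HTML file "
                     "with inline CSS and JavaScript. Return only the code."},
        ],
    }],
)

with open("mockup.html", "w", encoding="utf-8") as out:
    out.write(response.choices[0].message.content)
```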

Enterprise Applications

  • Document Processing: Extract and analyze information from complex documents
  • Customer Support: Visual question answering for product support
  • Data Analysis: Interpret charts, graphs, and visual data representations
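One way to approach document processing is schema-guided extraction, sketched below; the schema, file name, and endpoint are illustrative, and production use would validate the returned JSON before acting on it.

```python
import base64
import json
import os

from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1",  # placeholder endpoint
                api_key=os.environ["API_KEY"])

with open("purchase_order.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

# Illustrative target shape for the extraction; adapt the fields to your documents.
schema = {
    "vendor": "string",
    "date": "YYYY-MM-DD",
    "line_items": [{"description": "string", "qty": "int", "unit_price": "float"}],
    "total": "float",
}

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": "Extract the fields below from this document and reply with "
                     "JSON matching this shape exactly:\n" + json.dumps(schema)},
        ],
    }],
)
print(response.choices[0].message.content)
```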

Research and Education

  • STEM Problem Solving: Visual math and science problem solving
  • Multilingual Content: OCR and translation across 32 languages
  • Accessibility: Visual assistance for accessibility applications

Getting Started

Qwen3-VL models are available through various API providers; for exact model identifiers and usage details, consult your provider's documentation.
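Many providers expose Qwen models through an OpenAI-compatible chat-completions endpoint; if yours does, a first request might look like the hedged sketch below, with the base URL as a placeholder and the model identifier taken from the table above (adjust it to your provider's catalog).

```python
import os

from openai import OpenAI

# Placeholder endpoint; point this at your provider's OpenAI-compatible base URL.
client = OpenAI(base_url="https://your-provider.example/v1",
                api_key=os.environ["API_KEY"])

response = client.chat.completions.create(
    model="qwen3-vl-235b-instruct",  # adjust to the name your provider uses
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "What is in this picture?"},
        ],
    }],
)
print(response.choices[0].message.content)
```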