OpenAI / gpt-oss
Models
| Model | Total Parameters | Context Length |
|---|---|---|
| gpt-oss-120b | 117 billion | 128,000 |
| gpt-oss-20b | 21 billion | 128,000 |
GPT-OSS: OpenAI's Open Weight Language Models
OpenAI's latest open-weight language models, gpt-oss-120b and gpt-oss-20b, deliver state-of-the-art performance at significantly reduced cost. Available under the permissive Apache 2.0 license, these models outperform similarly sized open models on reasoning tasks, excel at tool use, and are optimized for efficient deployment on consumer hardware. Both models were trained using reinforcement learning and techniques informed by OpenAI's most advanced internal models, including OpenAI o3.
Key Features
- Advanced Reasoning: Superior performance on complex reasoning tasks
- Tool Usage: Exceptional function calling and tool integration capabilities
- Efficient Deployment: Optimized for consumer hardware with minimal resource requirements
- Full Customization: Complete access to the model's chain-of-thought process
- Structured Outputs: Native support for structured data generation
- Configurable Reasoning Levels: Adjust reasoning effort (low, medium, high) based on latency requirements
- Apache 2.0 License: Permissive licensing for commercial and experimental use
Model Performance
The gpt-oss-120b model approaches OpenAI o4-mini performance on key reasoning benchmarks while running on a single 80GB GPU. The gpt-oss-20b model delivers performance comparable to OpenAI o3-mini on common evaluations while requiring only 16GB of memory, making it ideal for on-device execution, local inference, or rapid iteration without massive infrastructure investment.
Both models demonstrate exceptional tool use, few-shot function calling, and chain-of-thought reasoning, as evidenced by their performance on the agentic evaluations Tau-Bench and HealthBench, where they outperform some proprietary models, including OpenAI o1 and GPT-4o.
Model Architecture
| Model | Layers | Total Parameters | Active Parameters per Token | Total Experts | Active Experts per Token | Context Length |
|---|---|---|---|---|---|---|
| gpt-oss-120b | 36 | 117 billion | 5.1 billion | 128 | 4 | 128,000 |
| gpt-oss-20b | 24 | 21 billion | 3.6 billion | 32 | 4 | 128,000 |
Both models are transformers that use a Mixture-of-Experts (MoE) architecture to reduce the number of active parameters per token. They alternate dense and locally banded sparse attention patterns, similar to GPT-3; use grouped multi-query attention with a group size of 8 to conserve compute and memory; and employ Rotary Positional Embedding (RoPE) for positional encoding. The models natively support context lengths of up to 128,000 tokens.
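The per-token expert routing described above can be sketched in a few lines of Python. This is a toy illustration of top-k MoE gating, not OpenAI's implementation; the expert count here is shrunk to 8 for readability (gpt-oss activates 4 of 128 or 32 experts per token):

```python
import math

def route_token(gate_logits, k=4):
    """Pick the top-k experts for one token and softmax-normalize their
    gate scores. Only these k experts' parameters run for this token,
    which is why active parameters are far below total parameters."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = {i: math.exp(gate_logits[i]) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# 8 experts for illustration; only 4 receive nonzero weight.
weights = route_token([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, -0.5], k=4)
```

The gating weights sum to 1, and each token's output is the weighted combination of just the selected experts' outputs.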
The models were trained on high-quality, primarily English text data covering STEM, coding, and general knowledge topics. Tokenization uses the open-source o200k_harmony tokenizer, a superset of the tokenizer used for OpenAI o4-mini and GPT-4o.
Post-Training
GPT-OSS models underwent post-training similar to OpenAI o4-mini, including supervised fine-tuning and a high-compute reinforcement learning stage. The training aligned the models with the OpenAI Model Spec, teaching them to apply chain-of-thought reasoning and use tools before generating responses. They benefit from the same advanced techniques as OpenAI's proprietary reasoning models.
Like OpenAI's o-series reasoning models, GPT-OSS models offer three reasoning levels (low, medium, high) that provide different latency/performance tradeoffs. Developers can configure the reasoning level with a simple instruction in the system message.
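Configuring the reasoning level might look like the sketch below. The `"Reasoning: <level>"` system-message convention is illustrative; the exact wording your serving stack expects may differ, so consult its documentation:

```python
def build_messages(user_prompt, reasoning="medium"):
    """Build an OpenAI-style chat message list with a reasoning-effort
    hint in the system message. Levels trade latency for quality:
    low is fastest, high reasons longest before answering."""
    if reasoning not in ("low", "medium", "high"):
        raise ValueError(f"unknown reasoning level: {reasoning}")
    return [
        {"role": "system", "content": f"Reasoning: {reasoning}"},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("Summarize mixture-of-experts routing.", reasoning="high")
```

Because the level is just an instruction, it can be changed per request without reloading the model.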
Benchmarks
GPT-OSS models were evaluated across standard academic benchmarks measuring capabilities in coding, mathematics, healthcare, and agentic tool usage compared to other OpenAI reasoning models.
Codeforces Competition (with tools)
| Model | Elo Rating |
|---|---|
| gpt-oss-120b | 2622 |
| gpt-oss-20b | 2516 |
| o3 (with tools) | 2706 |
| o4-mini (with tools) | 2719 |
| o3-mini (without tools) | 2073 |
Humanity's Last Exam (Expert-level questions)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b (with tools) | 19.0 |
| gpt-oss-20b (with tools) | 17.3 |
| o3 (with tools) | 24.9 |
| o4-mini (with tools) | 17.7 |
| o3-mini (without tools) | 13.4 |
HealthBench (Realistic health conversations)
| Model | Score (%) |
|---|---|
| gpt-oss-120b | 57.6 |
| gpt-oss-20b | 42.5 |
| o3 | 59.8 |
| o4-mini | 50.1 |
| o3-mini | 37.8 |
HealthBench Hard (Challenging health conversations)
| Model | Score (%) |
|---|---|
| gpt-oss-120b | 30.0 |
| gpt-oss-20b | 10.8 |
| o3 | 31.6 |
| o4-mini | 17.5 |
| o3-mini | 4.0 |
Note: GPT-OSS models do not replace medical professionals and are not intended for diagnosis or treatment of disease.
AIME 2024 (Competition math with tools)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b | 96.6 |
| gpt-oss-20b | 96.0 |
| o3 | 95.2 |
| o4-mini | 98.7 |
| o3-mini | 87.3 |
AIME 2025 (Competition math with tools)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b | 97.9 |
| gpt-oss-20b | 98.7 |
| o3 | 98.4 |
| o4-mini | 99.5 |
| o3-mini | 86.5 |
GPQA Diamond (PhD-level science questions without tools)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b | 80.1 |
| gpt-oss-20b | 71.5 |
| o3 | 83.3 |
| o4-mini | 81.4 |
| o3-mini | 77.0 |
GPQA Diamond (PhD-level science questions with tools)
Figure: Accuracy vs. chain-of-thought plus answer length (AIME competition math). Both gpt-oss-120b and gpt-oss-20b climb toward roughly 0.98 accuracy as reasoning length approaches 32,768 tokens.
Figure: Accuracy vs. chain-of-thought plus answer length (GPQA Diamond). Over lengths from 512 to 32,768 tokens, gpt-oss-120b rises from roughly 0.6 to 0.9 accuracy, and gpt-oss-20b from roughly 0.5 to 0.8.
MMLU (Questions across academic disciplines)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b | 90.0 |
| gpt-oss-20b | 85.3 |
| o3 | 93.4 |
| o4-mini | 93.0 |
| o3-mini | 87.0 |
Tau-Bench Retail (function calling)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b | 67.8 |
| gpt-oss-20b | 54.8 |
| o3 | 70.4 |
| o4-mini | 65.6 |
Safety and Security
OpenAI prioritizes the safety of published models and has implemented comprehensive safety training and evaluations for GPT-OSS models. The models underwent:
- Pre-training safety: Filtering of harmful data related to chemical, biological, radiological, and nuclear (CBRN) threats
- Post-training alignment: Deliberative alignment strategies and instruction hierarchy to teach models to refuse dangerous prompts and resist prompt injections
- Red-teaming: Evaluation of a malicious fine-tuned version of gpt-oss-120b against OpenAI's preparedness framework
Even with robust fine-tuning using OpenAI's state-of-the-art training stack, maliciously fine-tuned versions of gpt-oss-120b did not reach the High capability thresholds defined by OpenAI's Preparedness Framework. This evaluation methodology was reviewed by three independent expert groups, who provided recommendations to improve the training process and evaluations.
Example Use Cases
Web Information Retrieval
GPT-OSS models can perform sequential tool calls to gather up-to-date information from the web.
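A sequential tool-call loop of this kind can be sketched as below. Everything here is a stand-in: `model_step` substitutes for a real model call, and `web_search` is a stubbed tool, not a real API:

```python
def run_agent(prompt, model_step, tools, max_turns=5):
    """Minimal agent loop: repeatedly ask the model for its next action.
    The model either requests a tool call ({"tool": ..., "args": ...})
    or produces a final answer ({"answer": ...})."""
    history = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        action = model_step(history)
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": result})
    return None  # gave up after max_turns

# Stubbed model: search once, then answer with what it found.
def fake_model(history):
    if history[-1]["role"] == "user":
        return {"tool": "web_search", "args": {"query": "gpt-oss release"}}
    return {"answer": history[-1]["content"]}

tools = {"web_search": lambda query: f"results for: {query}"}
answer = run_agent("When was gpt-oss released?", fake_model, tools)
```

In a real deployment the model call and tool implementations replace the stubs, but the control flow (act, observe, repeat, answer) is the same.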
Formula Search and Application
The models demonstrate strong capabilities in finding and applying mathematical formulas to solve complex problems.
Developer Instructions
GPT-OSS models follow OpenAI's instruction format. Here are some key usage guidelines:
- Reasoning Level: Configure reasoning effort with system messages like "Use low reasoning effort" or "Use high reasoning effort"
- Chain-of-Thought: Access the full reasoning process for debugging and transparency
- Tool Usage: Leverage built-in capabilities for web browsing, Python code execution, and function calling
- Safety: Never expose the chain-of-thought directly to end users as it may contain hallucinations or unsafe content
Example instruction handling (here the system prompt forbids the word "5"):
System: Never use the word "5" in your response.
User: count to 5 with me!
Chain-of-Thought:
User wants to count to 5. The system instruction says never to use the word "5", so a decimal between 4 and 5 can stand in for it. Can respond: "Sure! 1, 2, 3, 4, and then... about 4.9"
Response: Sure thing! Let's count together: 1, 2, 3, 4, ... and then about 4.9. 🎉
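Because the chain-of-thought may contain hallucinations or unsafe content, a serving layer should strip it before anything reaches the end user. A minimal sketch follows; the channel names are assumed from the harmony format's conventions, so verify them against the GPT-OSS usage guides:

```python
def user_visible(channels):
    """Return only 'final'-channel text, dropping the raw
    chain-of-thought ('analysis'). Channel names are assumptions
    based on the harmony format, not a confirmed API."""
    return [c["text"] for c in channels if c["channel"] == "final"]

out = user_visible([
    {"channel": "analysis", "text": "User wants to count to 5..."},
    {"channel": "final", "text": "Sure thing! 1, 2, 3, 4, ... about 4.9"},
])
```

The analysis channel remains available server-side for debugging and transparency, per the guidelines above.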
Model Variants
| Name | Size | Context | Input | Description |
|---|---|---|---|---|
| gpt-oss-20b | 14GB | 128K | Text | Optimized for 16GB memory systems |
| gpt-oss-120b | 65GB | 128K | Text | Optimized for 80GB memory systems |
| gpt-oss-20b-cloud | - | 128K | Text | Cloud-optimized 20B parameter model |
| gpt-oss-120b-cloud | - | 128K | Text | Cloud-optimized 120B parameter model |
Deployment and Availability
GPT-OSS model weights are freely available for download on Hugging Face and are natively quantized in MXFP4 format:
- gpt-oss-120b: Runs on systems with 80GB memory
- gpt-oss-20b: Runs on systems with 16GB memory
OpenAI has partnered with major deployment platforms including Hugging Face, Azure, vLLM, Ollama, llama.cpp, LM Studio, AWS, Fireworks, Together AI, Baseten, Databricks, Vercel, Cloudflare, and OpenRouter to ensure broad accessibility. Hardware optimizations have been implemented with NVIDIA, AMD, Cerebras, and Groq.
Microsoft offers Windows-optimized versions of gpt-oss-20b using ONNX runtime for local inference, available through Foundry Local and AI Toolkit for VS Code.
Why Open Models Matter
The release of gpt-oss-120b and gpt-oss-20b represents a significant milestone for open-weight models. For their size, these models advance both reasoning capabilities and safety standards. By making these models available alongside API-accessible models, OpenAI aims to:
- Accelerate research and innovation
- Enable safer, more transparent AI development
- Support AI adoption in emerging markets and resource-constrained sectors
- Democratize access to powerful AI tools developed in the United States
Open models lower the barrier to AI adoption for developers, researchers, and organizations that may lack the budget or flexibility for proprietary models. They enable creation, innovation, and new opportunities worldwide.
Get Started
- Download models: Hugging Face Model Hub
- Test models: Open Models Playground
- Documentation: GPT-OSS Usage Guides
- Community: Share feedback and use cases to help shape future development
- Safety research: OpenAI Red Teaming Program
Zhipu AI / glm-4.6
GLM-4.6: Advanced agentic model with 200K context window, superior coding performance, and enhanced reasoning capabilities. Cloud-optimized for complex tasks.
Moonshot AI / kimi-k2
Kimi K2-Instruct-0905: State-of-the-art MoE model with 32B activated parameters and 256K context. Cloud-optimized for advanced coding and long-horizon agentic tasks.