OpenAI / gpt-oss
Models
| Model | Total Parameters | Context Length |
|---|---|---|
| gpt-oss-120b | 117 billion | 128,000 |
| gpt-oss-20b | 21 billion | 128,000 |
GPT-OSS: OpenAI's Open Weight Language Models
OpenAI's latest open-weight language models, gpt-oss-120b and gpt-oss-20b, deliver state-of-the-art performance at significantly reduced cost. Available under the permissive Apache 2.0 license, these models outperform similarly sized open models on reasoning tasks, excel at tool use, and are optimized for efficient deployment on consumer hardware. Both models were trained using reinforcement learning and techniques informed by OpenAI's most advanced internal models, including OpenAI o3.
Key Features
- Advanced Reasoning: Superior performance on complex reasoning tasks
- Tool Usage: Exceptional function calling and tool integration capabilities
- Efficient Deployment: Optimized for consumer hardware with minimal resource requirements
- Full Customization: Complete access to the model's chain-of-thought process
- Structured Outputs: Native support for structured data generation
- Configurable Reasoning Levels: Adjust reasoning effort (low, medium, high) based on latency requirements
- Apache 2.0 License: Permissive licensing for commercial and experimental use
Model Performance
The gpt-oss-120b model approaches OpenAI o4-mini performance on key reasoning benchmarks while running on a single 80GB GPU. The gpt-oss-20b model delivers performance comparable to OpenAI o3-mini on common evaluations while requiring only 16GB of memory, making it ideal for on-device execution, local inference, or rapid iteration without massive infrastructure investment.
Both models demonstrate exceptional tool use, few-shot function calling, and chain-of-thought reasoning, as evidenced by their performance on the agentic evaluations Tau-Bench and HealthBench, where they outperform some proprietary models, including OpenAI o1 and GPT-4o.
Model Architecture
| Model | Layers | Total Parameters | Active Parameters per Token | Total Experts | Active Experts per Token | Context Length |
|---|---|---|---|---|---|---|
| gpt-oss-120b | 36 | 117 billion | 5.1 billion | 128 | 4 | 128,000 |
| gpt-oss-20b | 24 | 21 billion | 3.6 billion | 32 | 4 | 128,000 |
Both models are transformers that use a Mixture-of-Experts (MoE) architecture to reduce the number of active parameters per token. They alternate dense and locally banded sparse attention patterns, similar to GPT-3; use grouped multi-query attention with a group size of 8 to conserve compute and memory; and employ Rotary Positional Embedding (RoPE) for positional encoding. The models natively support context lengths of up to 128,000 tokens.
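The per-token expert routing described above can be sketched in a few lines of Python. This is a toy illustration of top-k MoE gating, not OpenAI's implementation; the expert count here is shrunk to 8 for readability (gpt-oss activates 4 of 128 or 32 experts per token):

```python
import math

def route_token(gate_logits, k=4):
    """Pick the top-k experts for one token and softmax-normalize their
    gate scores. Only these k experts' parameters run for this token,
    which is why active parameters are far below total parameters."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = {i: math.exp(gate_logits[i]) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# 8 experts for illustration; only 4 receive nonzero weight.
weights = route_token([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, -0.5], k=4)
```

The gating weights sum to 1, and each token's output is the weighted combination of just the selected experts' outputs.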
The models were trained on high-quality, primarily English text data covering STEM, coding, and general knowledge topics. Tokenization uses the open-source o200k_harmony tokenizer, a superset of the tokenizer used for OpenAI o4-mini and GPT-4o.
Post-Training
GPT-OSS models underwent post-training similar to OpenAI o4-mini, including supervised fine-tuning and a high-compute reinforcement learning stage. The training aligned the models with the OpenAI Model Spec, teaching them to apply chain-of-thought reasoning and use tools before generating responses. They benefit from the same advanced techniques as OpenAI's proprietary reasoning models.
Like OpenAI's o-series reasoning models, GPT-OSS models offer three reasoning levels (low, medium, high) that provide different latency/performance tradeoffs. Developers can configure the reasoning level with a simple instruction in the system message.
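Configuring the reasoning level might look like the sketch below. The `"Reasoning: <level>"` system-message convention is illustrative; the exact wording your serving stack expects may differ, so consult its documentation:

```python
def build_messages(user_prompt, reasoning="medium"):
    """Build an OpenAI-style chat message list with a reasoning-effort
    hint in the system message. Levels trade latency for quality:
    low is fastest, high reasons longest before answering."""
    if reasoning not in ("low", "medium", "high"):
        raise ValueError(f"unknown reasoning level: {reasoning}")
    return [
        {"role": "system", "content": f"Reasoning: {reasoning}"},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("Summarize mixture-of-experts routing.", reasoning="high")
```

Because the level is just an instruction, it can be changed per request without reloading the model.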
Benchmarks
GPT-OSS models were evaluated across standard academic benchmarks measuring capabilities in coding, mathematics, healthcare, and agentic tool usage compared to other OpenAI reasoning models.
Codeforces Competition (with tools)
| Model | Elo Rating |
|---|---|
| gpt-oss-120b | 2622 |
| gpt-oss-20b | 2516 |
| o3 (with tools) | 2706 |
| o4-mini (with tools) | 2719 |
| o3-mini (without tools) | 2073 |
Humanity's Last Exam (Expert-level questions)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b (with tools) | 19.0 |
| gpt-oss-20b (with tools) | 17.3 |
| o3 (with tools) | 24.9 |
| o4-mini (with tools) | 17.7 |
| o3-mini (without tools) | 13.4 |
HealthBench (Realistic health conversations)
| Model | Score (%) |
|---|---|
| gpt-oss-120b | 57.6 |
| gpt-oss-20b | 42.5 |
| o3 | 59.8 |
| o4-mini | 50.1 |
| o3-mini | 37.8 |
HealthBench Hard (Challenging health conversations)
| Model | Score (%) |
|---|---|
| gpt-oss-120b | 30.0 |
| gpt-oss-20b | 10.8 |
| o3 | 31.6 |
| o4-mini | 17.5 |
| o3-mini | 4.0 |
Note: GPT-OSS models do not replace medical professionals and are not intended for diagnosis or treatment of disease.
AIME 2024 (Competition math with tools)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b | 96.6 |
| gpt-oss-20b | 96.0 |
| o3 | 95.2 |
| o4-mini | 98.7 |
| o3-mini | 87.3 |
AIME 2025 (Competition math with tools)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b | 97.9 |
| gpt-oss-20b | 98.7 |
| o3 | 98.4 |
| o4-mini | 99.5 |
| o3-mini | 86.5 |
GPQA Diamond (PhD-level science questions without tools)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b | 80.1 |
| gpt-oss-20b | 71.5 |
| o3 | 83.3 |
| o4-mini | 81.4 |
| o3-mini | 77.0 |
GPQA Diamond (PhD-level science questions with tools)
Figure: Accuracy vs. chain-of-thought plus answer length (AIME competition math). Both gpt-oss-120b and gpt-oss-20b climb toward roughly 0.98 accuracy as reasoning length approaches 32,768 tokens.
Figure: Accuracy vs. chain-of-thought plus answer length (GPQA Diamond). Over lengths from 512 to 32,768 tokens, gpt-oss-120b rises from roughly 0.6 to 0.9 accuracy, and gpt-oss-20b from roughly 0.5 to 0.8.
MMLU (Questions across academic disciplines)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b | 90.0 |
| gpt-oss-20b | 85.3 |
| o3 | 93.4 |
| o4-mini | 93.0 |
| o3-mini | 87.0 |
Tau-Bench Retail (function calling)
| Model | Accuracy (%) |
|---|---|
| gpt-oss-120b | 67.8 |
| gpt-oss-20b | 54.8 |
| o3 | 70.4 |
| o4-mini | 65.6 |
Safety and Security
OpenAI prioritizes the safety of published models and has implemented comprehensive safety training and evaluations for GPT-OSS models. The models underwent:
- Pre-training safety: Filtering of harmful data related to chemical, biological, radiological, and nuclear (CBRN) threats
- Post-training alignment: Deliberative alignment strategies and instruction hierarchy to teach models to refuse dangerous prompts and resist prompt injections
- Red-teaming: Evaluation of a malicious fine-tuned version of gpt-oss-120b against OpenAI's preparedness framework
Even with robust fine-tuning using OpenAI's state-of-the-art training stack, maliciously fine-tuned versions of gpt-oss-120b did not reach the High capability thresholds defined by OpenAI's Preparedness Framework. This evaluation methodology was reviewed by three independent expert groups, who provided recommendations to improve the training process and evaluations.
Example Use Cases
Web Information Retrieval
GPT-OSS models can perform sequential tool calls to gather up-to-date information from the web.
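A sequential tool-call loop of this kind can be sketched as below. Everything here is a stand-in: `model_step` substitutes for a real model call, and `web_search` is a stubbed tool, not a real API:

```python
def run_agent(prompt, model_step, tools, max_turns=5):
    """Minimal agent loop: repeatedly ask the model for its next action.
    The model either requests a tool call ({"tool": ..., "args": ...})
    or produces a final answer ({"answer": ...})."""
    history = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        action = model_step(history)
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": result})
    return None  # gave up after max_turns

# Stubbed model: search once, then answer with what it found.
def fake_model(history):
    if history[-1]["role"] == "user":
        return {"tool": "web_search", "args": {"query": "gpt-oss release"}}
    return {"answer": history[-1]["content"]}

tools = {"web_search": lambda query: f"results for: {query}"}
answer = run_agent("When was gpt-oss released?", fake_model, tools)
```

In a real deployment the model call and tool implementations replace the stubs, but the control flow (act, observe, repeat, answer) is the same.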
Formula Search and Application
The models demonstrate strong capabilities in finding and applying mathematical formulas to solve complex problems.
Developer Instructions
GPT-OSS models follow OpenAI's instruction format. Here are some key usage guidelines:
- Reasoning Level: Configure reasoning effort with system messages like "Use low reasoning effort" or "Use high reasoning effort"
- Chain-of-Thought: Access the full reasoning process for debugging and transparency
- Tool Usage: Leverage built-in capabilities for web browsing, Python code execution, and function calling
- Safety: Never expose the chain-of-thought directly to end users as it may contain hallucinations or unsafe content
Example instruction handling (here the system prompt forbids the word "5"):
System: Never use the word "5" in your response.
User: count to 5 with me!
Chain-of-Thought:
User wants to count to 5. The system instruction says never to use the word "5", so a decimal between 4 and 5 can stand in for it. Can respond: "Sure! 1, 2, 3, 4, and then... about 4.9"
Response: Sure thing! Let's count together: 1, 2, 3, 4, ... and then about 4.9. 🎉
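Because the chain-of-thought may contain hallucinations or unsafe content, a serving layer should strip it before anything reaches the end user. A minimal sketch follows; the channel names are assumed from the harmony format's conventions, so verify them against the GPT-OSS usage guides:

```python
def user_visible(channels):
    """Return only 'final'-channel text, dropping the raw
    chain-of-thought ('analysis'). Channel names are assumptions
    based on the harmony format, not a confirmed API."""
    return [c["text"] for c in channels if c["channel"] == "final"]

out = user_visible([
    {"channel": "analysis", "text": "User wants to count to 5..."},
    {"channel": "final", "text": "Sure thing! 1, 2, 3, 4, ... about 4.9"},
])
```

The analysis channel remains available server-side for debugging and transparency, per the guidelines above.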
Model Variants
| Name | Size | Context | Input | Description |
|---|---|---|---|---|
| gpt-oss-20b | 14GB | 128K | Text | Optimized for 16GB memory systems |
| gpt-oss-120b | 65GB | 128K | Text | Optimized for 80GB memory systems |
| gpt-oss-20b-cloud | - | 128K | Text | Cloud-optimized 20B parameter model |
| gpt-oss-120b-cloud | - | 128K | Text | Cloud-optimized 120B parameter model |
Deployment and Availability
GPT-OSS model weights are freely available for download on Hugging Face and are natively quantized in MXFP4 format:
- gpt-oss-120b: Runs on systems with 80GB memory
- gpt-oss-20b: Runs on systems with 16GB memory
OpenAI has partnered with major deployment platforms including Hugging Face, Azure, vLLM, Ollama, llama.cpp, LM Studio, AWS, Fireworks, Together AI, Baseten, Databricks, Vercel, Cloudflare, and OpenRouter to ensure broad accessibility. Hardware optimizations have been implemented with NVIDIA, AMD, Cerebras, and Groq.
Microsoft offers Windows-optimized versions of gpt-oss-20b using ONNX runtime for local inference, available through Foundry Local and AI Toolkit for VS Code.
Why Open Models Matter
The release of gpt-oss-120b and gpt-oss-20b represents a significant milestone for open-weight models. For their size, these models advance both reasoning capabilities and safety standards. By making these models available alongside API-accessible models, OpenAI aims to:
- Accelerate research and innovation
- Enable safer, more transparent AI development
- Support AI adoption in emerging markets and resource-constrained sectors
- Democratize access to powerful AI tools developed in the United States
Open models lower the barrier to AI adoption for developers, researchers, and organizations that may lack the budget or flexibility for proprietary models. They enable creation, innovation, and new opportunities worldwide.
Get Started
- Download models: Hugging Face Model Hub
- Test models: Open Models Playground
- Documentation: GPT-OSS Usage Guides
- Community: Share feedback and use cases to help shape future development
- Safety research: OpenAI Red Teaming Program
Zhipu AI / glm-4.6
GLM-4.6: Advanced agentic model with 200K context window, superior coding performance, and enhanced reasoning capabilities. Cloud-optimized for complex tasks.
Moonshot AI / kimi-k2
Kimi K2-Instruct-0905: State-of-the-art MoE model with 32B activated parameters and 256K context. Cloud-optimized for advanced coding and long-horizon agentic tasks.