OpenAI / gpt-oss

OpenAI's latest open-weight language models (20B and 120B), offering state-of-the-art reasoning, tool use, and efficient deployment. Available under the Apache 2.0 license with full customization and chain-of-thought access.

Models

| Model        | Total Parameters | Context Length |
|--------------|------------------|----------------|
| gpt-oss-120b | 117 billion      | 128,000        |
| gpt-oss-20b  | 21 billion       | 128,000        |

GPT-OSS: OpenAI's Open-Weight Language Models

OpenAI's latest open-weight language models, gpt-oss-120b and gpt-oss-20b, deliver state-of-the-art performance at significantly reduced costs. Available under the permissive Apache 2.0 license, these models outperform comparably sized models in reasoning tasks, excel at tool use, and are optimized for efficient deployment on consumer hardware. Both models were trained using reinforcement learning and techniques informed by OpenAI's most advanced internal models, including OpenAI o3.

Key Features

  • Advanced Reasoning: Superior performance on complex reasoning tasks
  • Tool Usage: Exceptional function calling and tool integration capabilities
  • Efficient Deployment: Optimized for consumer hardware with minimal resource requirements
  • Full Customization: Complete access to the model's chain-of-thought process
  • Structured Outputs: Native support for structured data generation (see the sketch after this list)
  • Configurable Reasoning Levels: Adjust reasoning effort (low, medium, high) based on latency requirements
  • Apache 2.0 License: Permissive licensing for commercial and experimental use
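
As a quick illustration of structured outputs, the sketch below requests JSON conforming to a schema through an OpenAI-compatible chat endpoint. The base URL, API key, and model name are placeholder assumptions for a local deployment (such as Ollama or vLLM), and exact structured-output support varies by server.

```python
# Hedged sketch: structured output via an OpenAI-compatible endpoint.
# base_url, api_key, and the model name are assumptions for a local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "What is the largest city in Japan?"}],
    # JSON-schema style structured outputs; support varies by server.
    response_format={"type": "json_schema",
                     "json_schema": {"name": "city_info", "schema": schema}},
)
print(response.choices[0].message.content)  # JSON matching the schema
```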

Model Performance

The gpt-oss-120b model approaches OpenAI o4-mini performance on key reasoning benchmarks while running on a single 80GB GPU. The gpt-oss-20b model delivers performance comparable to OpenAI o3-mini on common evaluations while requiring only 16GB of memory, making it ideal for on-device execution, local inference, or rapid iteration without massive infrastructure investment.

Both models demonstrate exceptional tool use, few-shot function calling, and chain-of-thought reasoning, as evidenced by their performance on the agentic evaluation Tau-Bench and on HealthBench, where they outperform some proprietary models such as OpenAI o1 and GPT-4o.

Model Architecture

| Model        | Layers | Total Parameters | Active Parameters per Token | Total Experts | Active Experts per Token | Context Length |
|--------------|--------|------------------|-----------------------------|---------------|--------------------------|----------------|
| gpt-oss-120b | 36     | 117 billion      | 5.1 billion                 | 128           | 4                        | 128,000        |
| gpt-oss-20b  | 24     | 21 billion       | 3.6 billion                 | 32            | 4                        | 128,000        |

Both models are transformer-based architectures that use Mixture-of-Experts (MoE) layers to reduce the number of parameters active per token. They alternate between dense and locally banded sparse attention patterns, similar to GPT-3, use grouped multi-query attention with a group size of 8 to conserve compute and memory, and employ Rotary Positional Embedding (RoPE) for positional encoding. The models natively support contexts up to 128,000 tokens.
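
To make the MoE idea concrete, here is a toy sketch of top-k expert routing, the mechanism that lets gpt-oss-120b activate only 4 of its 128 experts per token. It illustrates the general technique, not the actual gpt-oss implementation; the dimensions are made up.

```python
# Toy sketch of top-k Mixture-of-Experts routing (illustrative only; not the
# actual gpt-oss code). Each token is routed to k experts out of n_experts,
# so only a small fraction of parameters is active per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 128, k: int = 4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # best k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique().tolist():  # run each chosen expert once
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```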

The models were trained on high-quality, primarily English text data covering STEM, coding, and general knowledge topics. Tokenization uses the open-source o200k_harmony tokenizer, a superset of the tokenizer used for OpenAI o4-mini and GPT-4o.
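
The o200k_harmony encoding ships with recent releases of the open-source tiktoken library; the short sketch below assumes such a version is installed.

```python
# Sketch: tokenizing text with o200k_harmony. Assumes a tiktoken release that
# includes this encoding (added around the gpt-oss launch).
import tiktoken

enc = tiktoken.get_encoding("o200k_harmony")
tokens = enc.encode("Hello from gpt-oss!")
print(len(tokens), tokens)
print(enc.decode(tokens))
```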

Post-Training

GPT-OSS models underwent post-training similar to OpenAI o4-mini, including a supervised fine-tuning stage and a high-compute reinforcement learning stage. The training aligned the models with the OpenAI Model Spec, teaching them to apply chain-of-thought reasoning and use tools before generating responses. They benefit from the same advanced techniques as OpenAI's proprietary reasoning models.

Like OpenAI's o-series reasoning models, GPT-OSS models offer three reasoning levels (low, medium, high) that provide different latency/performance tradeoffs. Developers can configure the reasoning level with a simple instruction in the system message.
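
For example, here is a minimal sketch of setting the reasoning level through the system message of an OpenAI-compatible chat endpoint. The base URL, API key, and model name are placeholder assumptions for a local deployment (such as Ollama or vLLM); adjust them for your setup.

```python
# Minimal sketch: configuring reasoning effort via the system message.
# base_url, api_key, and the model name are assumptions for a local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

response = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},  # low | medium | high
        {"role": "user", "content": "How many primes are there below 100?"},
    ],
)
print(response.choices[0].message.content)
```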

Benchmarks

GPT-OSS models were evaluated across standard academic benchmarks measuring capabilities in coding, mathematics, healthcare, and agentic tool usage compared to other OpenAI reasoning models.

Codeforces Competition (with tools)

| Model                   | Elo Rating |
|-------------------------|------------|
| gpt-oss-120b            | 2622       |
| gpt-oss-20b             | 2516       |
| o3 (with tools)         | 2706       |
| o4-mini (with tools)    | 2719       |
| o3-mini (without tools) | 2073       |

Humanity's Last Exam (Expert-level questions)

| Model                     | Accuracy (%) |
|---------------------------|--------------|
| gpt-oss-120b (with tools) | 19.0         |
| gpt-oss-20b (with tools)  | 17.3         |
| o3 (with tools)           | 24.9         |
| o4-mini (with tools)      | 17.7         |
| o3-mini (without tools)   | 13.4         |

HealthBench (Realistic health conversations)

| Model        | Score (%) |
|--------------|-----------|
| gpt-oss-120b | 57.6      |
| gpt-oss-20b  | 42.5      |
| o3           | 59.8      |
| o4-mini      | 50.1      |
| o3-mini      | 37.8      |

HealthBench Hard (Challenging health conversations)

| Model        | Score (%) |
|--------------|-----------|
| gpt-oss-120b | 30.0      |
| gpt-oss-20b  | 10.8      |
| o3           | 31.6      |
| o4-mini      | 17.5      |
| o3-mini      | 4.0       |

Note: GPT-OSS models do not replace medical professionals and are not intended for diagnosis or treatment of disease.

AIME 2024 (Competition math with tools)

| Model        | Accuracy (%) |
|--------------|--------------|
| gpt-oss-120b | 96.6         |
| gpt-oss-20b  | 96.0         |
| o3           | 95.2         |
| o4-mini      | 98.7         |
| o3-mini      | 87.3         |

AIME 2025 (Competition math with tools)

| Model        | Accuracy (%) |
|--------------|--------------|
| gpt-oss-120b | 97.9         |
| gpt-oss-20b  | 98.7         |
| o3           | 98.4         |
| o4-mini      | 99.5         |
| o3-mini      | 86.5         |

GPQA Diamond (PhD-level science questions without tools)

| Model        | Accuracy (%) |
|--------------|--------------|
| gpt-oss-120b | 80.1         |
| gpt-oss-20b  | 71.5         |
| o3           | 83.3         |
| o4-mini      | 81.4         |
| o3-mini      | 77.0         |

Test-Time Scaling (Accuracy vs. Reasoning Length)

[Figure: Accuracy as a function of chain-of-thought plus answer length, in tokens. On AIME competition math, both gpt-oss models improve steadily with longer reasoning, reaching roughly 0.96–0.98 accuracy near 32,768 tokens. On GPQA Diamond, gpt-oss-120b climbs from roughly 0.6 to 0.9 and gpt-oss-20b from roughly 0.5 to 0.8 as the reasoning budget grows from 512 to 32,768 tokens.]

MMLU (Questions across academic disciplines)

| Model        | Accuracy (%) |
|--------------|--------------|
| gpt-oss-120b | 90.0         |
| gpt-oss-20b  | 85.3         |
| o3           | 93.4         |
| o4-mini      | 93.0         |
| o3-mini      | 87.0         |

Tau-Bench (Retail Function calling)

| Model        | Accuracy (%) |
|--------------|--------------|
| gpt-oss-120b | 67.8         |
| gpt-oss-20b  | 54.8         |
| o3           | 70.4         |
| o4-mini      | 65.6         |

Safety and Security

OpenAI prioritizes the safety of published models and has implemented comprehensive safety training and evaluations for GPT-OSS models. The models underwent:

  • Pre-training safety: Removal of dangerous data related to nuclear, radiological, biological, and chemical weapons
  • Post-training alignment: Deliberative alignment strategies and instruction hierarchy to teach models to refuse dangerous prompts and resist prompt injections
  • Red-teaming: Evaluation of a maliciously fine-tuned version of gpt-oss-120b under OpenAI's Preparedness Framework

Even with robust fine-tuning that leveraged OpenAI's state-of-the-art training stack, the maliciously fine-tuned models could not reach the high capability levels defined by OpenAI's Preparedness Framework. This evaluation methodology was reviewed by three independent expert groups, who provided recommendations to improve the training process and evaluations.

Example Use Cases

Web Information Retrieval

GPT-OSS models can perform sequential tool calls to gather up-to-date information from the web.
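
A minimal sketch of such a loop is shown below. The endpoint, model name, and the web_search tool are illustrative assumptions, not part of the release; the stub must be wired to a real search backend.

```python
# Hedged sketch of a sequential tool-call loop: the model may request the
# (hypothetical) web_search tool several times before producing an answer.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def web_search(query: str) -> str:
    # Stub: replace with a call to a real search backend.
    return f"(search results for {query!r} would go here)"

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the latest stable Linux kernel?"}]
while True:
    reply = client.chat.completions.create(
        model="gpt-oss-20b", messages=messages, tools=tools
    ).choices[0].message
    messages.append(reply)
    if not reply.tool_calls:       # no more tool requests: final answer reached
        break
    for call in reply.tool_calls:  # the model may chain several searches
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": web_search(**args),
        })
print(reply.content)
```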

Formula Search and Application

The models demonstrate strong capabilities in finding and applying mathematical formulas to solve complex problems.

Developer Instructions

GPT-OSS models follow OpenAI's instruction format. Here are some key usage guidelines:

  1. Reasoning Level: Configure reasoning effort with system messages like "Use low reasoning effort" or "Use high reasoning effort"
  2. Chain-of-Thought: Access the full reasoning process for debugging and transparency
  3. Tool Usage: Leverage built-in capabilities for web browsing, Python code execution, and function calling
  4. Safety: Never expose the chain-of-thought directly to end users as it may contain hallucinations or unsafe content

Example instruction handling:

System: Never use the word "5".

User: count to 5 with me!

Chain-of-Thought:
User wants to count to 5. The system instruction says never use the word "5". Must use a decimal between 4 and 5 instead. Can respond: "Sure! 1, 2, 3, 4, and then... about 4.9"

Response: Sure thing! Let's count together: 1, 2, 3, 4, ... and then about 4.9. 🎉
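
Programmatically, servers that expose the chain-of-thought usually return it in a field separate from the final answer. The field name varies by server, so the attribute used below (reasoning_content) is an assumption about the deployment, as are the endpoint and model name.

```python
# Sketch: retrieving the chain-of-thought separately from the answer.
# The reasoning field name varies by server (assumed here: reasoning_content).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

reply = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[
        {"role": "system", "content": 'Reasoning: medium. Never use the word "5".'},
        {"role": "user", "content": "count to 5 with me!"},
    ],
).choices[0].message

print(getattr(reply, "reasoning_content", None))  # CoT: for debugging only
print(reply.content)                              # the user-facing answer
```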

Model Variants

| Name               | Size | Context | Input | Description                           |
|--------------------|------|---------|-------|---------------------------------------|
| gpt-oss-20b        | 14GB | 128K    | Text  | Optimized for 16GB memory systems     |
| gpt-oss-120b       | 65GB | 128K    | Text  | Optimized for 80GB memory systems     |
| gpt-oss-20b-cloud  | -    | 128K    | Text  | Cloud-optimized 20B parameter model   |
| gpt-oss-120b-cloud | -    | 128K    | Text  | Cloud-optimized 120B parameter model  |

Deployment and Availability

GPT-OSS model weights are freely available for download on Hugging Face and are natively quantized in MXFP4 format:

  • gpt-oss-120b: Runs on systems with 80GB memory
  • gpt-oss-20b: Runs on systems with 16GB memory
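
As a small sketch, the weights can be fetched with the huggingface_hub client; the repository id follows the public release naming.

```python
# Sketch: downloading the MXFP4-quantized weights from Hugging Face.
# Requires the huggingface_hub package; repo id follows the public release.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="openai/gpt-oss-20b")
print(f"Weights downloaded to {local_dir}")
```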

OpenAI has partnered with major deployment platforms including Hugging Face, Azure, vLLM, Ollama, llama.cpp, LM Studio, AWS, Fireworks, Together AI, Baseten, Databricks, Vercel, Cloudflare, and OpenRouter to ensure broad accessibility. Hardware optimizations have been implemented with NVIDIA, AMD, Cerebras, and Groq.

Microsoft offers Windows-optimized versions of gpt-oss-20b using ONNX runtime for local inference, available through Foundry Local and AI Toolkit for VS Code.

Why Open Models Matter

The release of gpt-oss-120b and gpt-oss-20b represents a significant milestone for open-weight models. For their size, these models advance both reasoning capabilities and safety standards. By making these models available alongside API-accessible models, OpenAI aims to:

  • Accelerate research and innovation
  • Enable safer, more transparent AI development
  • Support AI adoption in emerging markets and resource-constrained sectors
  • Democratize access to powerful AI tools developed in the United States

Open models lower the barrier to AI adoption for developers, researchers, and organizations that may lack the budget or flexibility for proprietary models. They enable creation, innovation, and new opportunities worldwide.

Get Started