What Are Reasoning Models?
OpenAI's reasoning models (the o1, o3, and o4-mini families) represent a paradigm in which the model explicitly "thinks" before answering. Like standard models, they generate responses token by token, but they first work through an internal chain-of-thought process that consumes additional tokens, commonly called thinking (or reasoning) tokens.
Two Types of Tokens in Reasoning Models
- Input Tokens — The tokens from your prompt, system message, and conversation history. These are the same as in standard models.
- Thinking Tokens — Internal reasoning tokens generated by the model during its chain-of-thought process. These are not visible in the final output but are still billed, at the model's output token rate.
Understanding Token Types
Input Tokens
Input tokens work identically to standard models:
- Your prompt text is tokenized using the o200k_base tokenizer
- System messages, conversation history, and function definitions all count
- Billed at the model's input token rate
- You have full control over input token count by adjusting your prompt
Thinking Tokens
Thinking tokens are unique to reasoning models:
- Generated internally during the model's chain-of-thought reasoning
- Not visible in the API response (hidden from the output)
- Count varies based on problem complexity (can range from hundreds to tens of thousands)
- Billed at the output token rate, which is several times the input rate
- Cannot be directly controlled, but problem framing affects thinking length
Important: Thinking tokens can significantly increase costs. A simple question might use 500 thinking tokens, while a complex math or coding problem could use 10,000-50,000+ thinking tokens. Always monitor thinking token usage in production.
Reasoning Model Comparison
Each reasoning model offers different trade-offs between capability, cost, and speed.
| Model | Context Window | Input Cost | Thinking Cost | Best For |
|---|---|---|---|---|
| o1 | 200,000 | $15/1M tokens | $60/1M tokens | Complex reasoning, research |
| o1-mini | 128,000 | $3/1M tokens | $12/1M tokens | STEM tasks, coding |
| o3 | 200,000 | $10/1M tokens | $40/1M tokens | Advanced reasoning, math |
| o3-mini | 200,000 | $1.10/1M tokens | $4.40/1M tokens | Efficient reasoning tasks |
| o4-mini | 200,000 | $1.10/1M tokens | $4.40/1M tokens | Cost-effective reasoning |
| GPT-5 (reasoning) | 256,000 | $2/1M tokens | $8/1M tokens | General + reasoning hybrid |
| GPT-4o (standard) | 128,000 | $2.50/1M tokens | N/A | Standard tasks (no thinking) |
How to Count Tokens for Reasoning Models
Token counting for reasoning models requires accounting for both input and thinking tokens. Here is a step-by-step approach:
- Count your input tokens using the o200k_base tokenizer (same as GPT-4.1/GPT-5)
- Estimate thinking tokens based on task complexity (see guidelines below)
- Add output tokens for the visible response the model generates
- Calculate total cost using the per-token rates for each category
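The steps above can be sketched in a few lines of Python. The default rates are the o3-mini figures from the table (dollars per 1M tokens), and thinking tokens are assumed to bill at the same rate as output tokens, as the worked example in the next section does:

```python
def estimate_cost(input_tokens: int, thinking_tokens: int, output_tokens: int,
                  input_rate: float = 1.10,      # $/1M input tokens (o3-mini)
                  thinking_rate: float = 4.40,   # $/1M thinking tokens (o3-mini)
                  output_rate: float = 4.40) -> float:  # $/1M output tokens
    """Estimate the dollar cost of one reasoning-model request."""
    per_million = 1_000_000
    return (input_tokens * input_rate
            + thinking_tokens * thinking_rate
            + output_tokens * output_rate) / per_million

# ~18 input, ~3,000 thinking, ~500 output tokens:
print(round(estimate_cost(18, 3_000, 500), 4))  # 0.0154
```

Swap in the rates from the comparison table to estimate any other model; only the per-million prices change.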
Use our token counter tool to get accurate input token counts using the o200k_base tokenizer.
Example: Cost Calculation for o3-mini
Prompt: "Solve this calculus problem step by step: Find the integral of x^2 * sin(x) dx"
- Input tokens: ~18 tokens = $0.0000198
- Thinking tokens: ~3,000 tokens (estimated) = $0.0132
- Output tokens: ~500 tokens = $0.0022
- Total cost: ~$0.0154 per request
Pricing and Cost Estimation
Cost Structure
Reasoning model costs are split into three categories:
- Input tokens: Lowest cost per token. You control the count directly through your prompt.
- Thinking tokens: Billed at the output rate, several times the input rate. Model-determined, varies by complexity.
- Output tokens: Billed at the same per-token rate as thinking tokens. The visible response tokens.
Cost Comparison by Scenario
| Scenario | Input Tokens | Est. Thinking Tokens | o3-mini Cost | o1 Cost |
|---|---|---|---|---|
| Simple Q&A | 100 | 500 | $0.003 | $0.032 |
| Code review | 2,000 | 5,000 | $0.024 | $0.330 |
| Math problem | 500 | 10,000 | $0.045 | $0.608 |
| Research analysis | 10,000 | 30,000 | $0.143 | $1.950 |
Tips for Cost Management
- Use mini models first: o3-mini and o4-mini are 5-10x cheaper than full models
- Be specific in prompts: Clearer prompts reduce unnecessary thinking
- Set max token limits: Use the max_completion_tokens parameter to cap spending
- Monitor thinking tokens: Check API response metadata for actual thinking token usage
- Route by complexity: Use standard models for simple tasks, reasoning models only when needed
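The capping and monitoring tips can be sketched against the Chat Completions API. The call itself is illustrative (it needs an API key and the openai package), but the parsing helper works on any usage-shaped dict; completion_tokens_details.reasoning_tokens is where the API reports thinking-token usage:

```python
# Cap spending with max_completion_tokens, then read back how many
# thinking (reasoning) tokens the request actually consumed.

def reasoning_tokens_used(usage: dict) -> int:
    """Extract the reasoning/thinking token count from a usage payload."""
    details = usage.get("completion_tokens_details") or {}
    return details.get("reasoning_tokens", 0)

# Real call (requires `pip install openai` and OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="o3-mini",
#     messages=[{"role": "user", "content": "Prove sqrt(2) is irrational."}],
#     max_completion_tokens=4_000,  # caps thinking + visible output together
# )
# print(reasoning_tokens_used(resp.usage.model_dump()))

# Payload shaped like the API's response metadata:
sample_usage = {
    "prompt_tokens": 18,
    "completion_tokens": 3_500,
    "completion_tokens_details": {"reasoning_tokens": 3_000},
}
print(reasoning_tokens_used(sample_usage))  # 3000
```

Note that max_completion_tokens bounds thinking and visible output together, so a cap set too low can exhaust the budget on thinking and return an empty or truncated answer.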
Try Reasoning Model Token Counter
Count your input tokens accurately with the o200k_base tokenizer, then estimate total costs including thinking tokens for any reasoning model.
When to Use Reasoning Models
Best use cases:
- Complex mathematical proofs and calculations
- Multi-step logical reasoning and analysis
- Code debugging and architecture design
- Scientific research and hypothesis evaluation
- Legal document analysis requiring careful interpretation
- Strategic planning with multiple trade-offs
When NOT to use reasoning models:
- Simple text generation or summarization (use GPT-4o or GPT-4.1)
- Translation tasks (standard models are equally effective)
- High-throughput applications where latency matters (thinking adds delay)
- Classification or extraction tasks (no reasoning needed)
- Cost-sensitive applications with simple queries (thinking tokens add unnecessary cost)
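The split above lends itself to a simple routing dispatcher. The task categories and model choices here mirror this guide's tables but are assumptions to adapt, not a fixed taxonomy:

```python
# Route by complexity: standard model for simple work, reasoning models
# only where the thinking tokens earn their cost.
ROUTING = {
    "summarization": "gpt-4o",        # simple generation: no thinking needed
    "translation": "gpt-4o",
    "classification": "gpt-4o",
    "code_review": "o4-mini",         # everyday reasoning, cost-efficient
    "math_proof": "o3",               # peak reasoning capability
    "research_analysis": "o3",
}

def pick_model(task_type: str) -> str:
    """Return a model for the task, defaulting to the cheap standard model."""
    return ROUTING.get(task_type, "gpt-4o")

print(pick_model("math_proof"))  # o3
print(pick_model("chitchat"))    # gpt-4o (default)
```

Defaulting unknown task types to the standard model keeps the failure mode cheap: a misclassified request costs a retry, not tens of thousands of thinking tokens.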
Common Questions
Can I see the thinking tokens in the response?
By default, the full chain of thought is hidden from the API response. OpenAI can return a summary of the reasoning alongside the answer (via reasoning summaries in the Responses API), but the raw chain-of-thought is not exposed. You can, however, see exactly how many thinking tokens were used in the response's usage metadata.
How do I estimate thinking token usage?
Thinking token usage varies widely by task complexity. As a rough guide: simple questions use 200-1,000 thinking tokens, moderate problems use 1,000-10,000, and complex multi-step reasoning can use 10,000-50,000+ thinking tokens. The only way to get exact counts is to run the request and check the response metadata.
Are thinking tokens counted against the context window?
Yes. Thinking tokens consume context window space alongside input and output tokens. For a model with a 200K context window, the total of input + thinking + output tokens cannot exceed 200,000. This is an important consideration for prompts with large inputs.
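A quick sketch of the budgeting this implies: before sending a large prompt, check that the input plus a thinking-token allowance plus the expected output still fits the window. The 50K default allowance is an assumption drawn from the upper end of the usage ranges above:

```python
CONTEXT_WINDOW = 200_000  # o1 / o3 / o3-mini / o4-mini per the table above

def fits_context(input_tokens: int, expected_output: int,
                 thinking_budget: int = 50_000) -> bool:
    """True if input + a thinking allowance + output fit the context window."""
    return input_tokens + thinking_budget + expected_output <= CONTEXT_WINDOW

print(fits_context(10_000, 2_000))    # True: plenty of headroom
print(fits_context(160_000, 5_000))   # False: the thinking allowance overflows
```

The second case is the trap: a 160K-token prompt fits the window on its own, but leaves too little room for heavy reasoning plus the answer.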
Can I control the amount of thinking?
You cannot directly control thinking token usage, but you can influence it. Providing clearer, more structured prompts with explicit constraints tends to reduce thinking. You can also use the max_completion_tokens parameter to set an upper bound on total output (thinking + visible) tokens.
Should I use o3 or o4-mini?
Choose o3 for tasks requiring the highest reasoning capability, such as complex math, advanced coding, and research analysis. Use o4-mini for everyday reasoning tasks where cost efficiency matters more than peak performance. The o4-mini model provides excellent reasoning at a fraction of the cost of o3.
Model Selection Guide
| Task Type | Recommended Model | Why |
|---|---|---|
| Complex math/science | o3 | Highest reasoning capability for STEM |
| Code generation/review | o3-mini or o4-mini | Strong coding reasoning at lower cost |
| General reasoning | o4-mini | Best cost-to-reasoning ratio |
| Simple Q&A | GPT-4o or GPT-4.1 | No thinking tokens needed, lower cost |
| Research analysis | o1 or o3 | Deep reasoning for nuanced analysis |