Reasoning Model Tokens Explained: o1, o3, and o4-mini

Understanding input tokens and thinking tokens in OpenAI's reasoning models

What Are Reasoning Models?

OpenAI's reasoning models (o1, o3, o4-mini) represent a new paradigm in AI where the model explicitly "thinks" before answering. Unlike standard models, which begin generating the visible response immediately, reasoning models first work through an internal chain-of-thought process that consumes additional tokens called thinking tokens.

Two Types of Tokens in Reasoning Models

  • Input Tokens — The tokens from your prompt, system message, and conversation history. These are the same as in standard models.
  • Thinking Tokens — Internal reasoning tokens generated by the model during its chain-of-thought process. These are not visible in the final output, but they are still billed in addition to the visible response.

Understanding Token Types

Input Tokens

Input tokens work identically to standard models:

  • Your prompt text is tokenized using the o200k_base tokenizer
  • System messages, conversation history, and function definitions all count
  • Billed at the model's input token rate
  • You have full control over input token count by adjusting your prompt
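Before sending a request, a rough character-based heuristic (about 4 characters per token for English text) gives a ballpark input count. The function below is an illustrative sketch, not an exact tokenizer; for exact counts, encode the text with the o200k_base encoding via the tiktoken library.

```python
def estimate_input_tokens(text: str) -> int:
    """Rough estimate: English prose averages about 4 characters per token.
    For exact counts, use the o200k_base encoding from the tiktoken library."""
    return max(1, len(text) // 4)

prompt = "Solve this calculus problem step by step."
print(estimate_input_tokens(prompt))  # ballpark figure, not an exact count
```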

Thinking Tokens

Thinking tokens are unique to reasoning models:

  • Generated internally during the model's chain-of-thought reasoning
  • Not visible in the API response (hidden from the output)
  • Count varies based on problem complexity (can range from hundreds to tens of thousands)
  • Billed at the output token rate, which is typically several times the input rate
  • Cannot be directly controlled, but problem framing affects thinking length

Important: Thinking tokens can significantly increase costs. A simple question might use 500 thinking tokens, while a complex math or coding problem could use 10,000-50,000+ thinking tokens. Always monitor thinking token usage in production.
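In practice, monitoring means reading the usage block the API returns: reasoning tokens are reported under `completion_tokens_details` and are included in the overall completion count. A minimal sketch, with a hard-coded dict standing in for a real API response:

```python
# Usage block shaped like the one returned on a chat completion;
# thinking (reasoning) tokens are counted inside completion_tokens.
usage = {
    "prompt_tokens": 18,
    "completion_tokens": 3_500,  # thinking + visible output combined
    "completion_tokens_details": {"reasoning_tokens": 3_000},
}

thinking = usage["completion_tokens_details"]["reasoning_tokens"]
visible = usage["completion_tokens"] - thinking
print(thinking, visible)  # 3000 500
```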

Reasoning Model Comparison

Each reasoning model offers different trade-offs between capability, cost, and speed.

| Model | Context Window | Input Cost | Thinking Cost | Best For |
|---|---|---|---|---|
| o1 | 200,000 | $15/1M tokens | $60/1M tokens | Complex reasoning, research |
| o1-mini | 128,000 | $3/1M tokens | $12/1M tokens | STEM tasks, coding |
| o3 | 200,000 | $10/1M tokens | $40/1M tokens | Advanced reasoning, math |
| o3-mini | 200,000 | $1.10/1M tokens | $4.40/1M tokens | Efficient reasoning tasks |
| o4-mini | 200,000 | $1.10/1M tokens | $4.40/1M tokens | Cost-effective reasoning |
| GPT-5 (reasoning) | 256,000 | $2/1M tokens | $8/1M tokens | General + reasoning hybrid |
| GPT-4o (standard) | 128,000 | $2.50/1M tokens | N/A | Standard tasks (no thinking) |

How to Count Tokens for Reasoning Models

Token counting for reasoning models requires accounting for both input and thinking tokens. Here is a step-by-step approach:

  1. Count your input tokens using the o200k_base tokenizer (same as GPT-4.1/GPT-5)
  2. Estimate thinking tokens based on task complexity (see guidelines below)
  3. Add output tokens for the visible response the model generates
  4. Calculate total cost using the per-token rates for each category
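The four steps above can be sketched as a small estimator. The rates dict hard-codes the example prices from the comparison table; treat them as illustrative and check OpenAI's pricing page for current values.

```python
# USD per 1M tokens: (input rate, thinking/output rate) -- example values only
RATES = {
    "o1": (15.00, 60.00),
    "o3-mini": (1.10, 4.40),
    "o4-mini": (1.10, 4.40),
}

def estimate_cost(model: str, input_tokens: int,
                  thinking_tokens: int, output_tokens: int) -> float:
    """Total request cost in USD; thinking and visible output share one rate."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate
            + (thinking_tokens + output_tokens) * out_rate) / 1_000_000
```

For the o3-mini calculus example below, `estimate_cost("o3-mini", 18, 3000, 500)` comes to about $0.0154.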

Use our token counter tool to get accurate input token counts using the o200k_base tokenizer.

Example: Cost Calculation for o3-mini

Prompt: "Solve this calculus problem step by step: Find the integral of x^2 * sin(x) dx"

  • Input tokens: ~18 tokens = $0.0000198
  • Thinking tokens: ~3,000 tokens (estimated) = $0.0132
  • Output tokens: ~500 tokens = $0.0022
  • Total cost: ~$0.0154 per request

Pricing and Cost Estimation

Cost Structure

Reasoning model costs are split into three categories, each billed at different rates:

  • Input tokens: Lowest cost per token. You control the count.
  • Thinking tokens: Highest cost per token. Model-determined, varies by complexity.
  • Output tokens: Mid-range cost. The visible response tokens.

Cost Comparison by Scenario

| Scenario | Input Tokens | Est. Thinking Tokens | o3-mini Cost | o1 Cost |
|---|---|---|---|---|
| Simple Q&A | 100 | 500 | $0.002 | $0.032 |
| Code review | 2,000 | 5,000 | $0.024 | $0.330 |
| Math problem | 500 | 10,000 | $0.045 | $0.608 |
| Research analysis | 10,000 | 30,000 | $0.143 | $1.950 |
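The scenario figures can be reproduced with the same per-token arithmetic (visible output tokens omitted, as in the table); the rates are the example values from the comparison table above.

```python
scenarios = {  # name: (input_tokens, estimated_thinking_tokens)
    "Simple Q&A": (100, 500),
    "Code review": (2_000, 5_000),
    "Math problem": (500, 10_000),
    "Research analysis": (10_000, 30_000),
}

def scenario_cost(input_tokens: int, thinking_tokens: int,
                  in_rate: float, think_rate: float) -> float:
    """Cost in USD given per-1M-token rates; visible output excluded."""
    return (input_tokens * in_rate + thinking_tokens * think_rate) / 1_000_000

for name, (inp, think) in scenarios.items():
    print(f"{name}: o3-mini ${scenario_cost(inp, think, 1.10, 4.40):.3f}, "
          f"o1 ${scenario_cost(inp, think, 15.00, 60.00):.3f}")
```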

Tips for Cost Management

  • Use mini models first: o3-mini and o4-mini are 5-10x cheaper than full models
  • Be specific in prompts: Clearer prompts reduce unnecessary thinking
  • Set max token limits: Use the max_completion_tokens parameter to cap spending
  • Monitor thinking tokens: Check API response metadata for actual thinking token usage
  • Route by complexity: Use standard models for simple tasks, reasoning models only when needed
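The last tip, routing by complexity, can be as simple as a lookup. The tiers and model choices below are an illustrative mapping, not a prescription:

```python
def route_model(complexity: str) -> str:
    """Pick a model tier by estimated task complexity (illustrative mapping)."""
    routes = {
        "simple": "gpt-4o",     # standard model: no thinking tokens, no delay
        "moderate": "o4-mini",  # cost-effective reasoning
        "complex": "o3",        # full reasoning capability
    }
    return routes[complexity]

print(route_model("moderate"))  # o4-mini
```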

Try Reasoning Model Token Counter

Count your input tokens accurately with the o200k_base tokenizer, then estimate total costs including thinking tokens for any reasoning model.

When to Use Reasoning Models

Best use cases:

  • Complex mathematical proofs and calculations
  • Multi-step logical reasoning and analysis
  • Code debugging and architecture design
  • Scientific research and hypothesis evaluation
  • Legal document analysis requiring careful interpretation
  • Strategic planning with multiple trade-offs

When NOT to use reasoning models:

  • Simple text generation or summarization (use GPT-4o or GPT-4.1)
  • Translation tasks (standard models are equally effective)
  • High-throughput applications where latency matters (thinking adds delay)
  • Classification or extraction tasks (no reasoning needed)
  • Cost-sensitive applications with simple queries (thinking tokens add unnecessary cost)

Common Questions

Can I see the thinking tokens in the response?

By default, the full chain-of-thought is hidden from the API response. OpenAI can return a summary of the reasoning process (via the Responses API), but the raw thinking tokens themselves are never exposed. You can, however, always see how many thinking tokens were used in the response's usage metadata, under completion_tokens_details.reasoning_tokens.

How do I estimate thinking token usage?

Thinking token usage varies widely by task complexity. As a rough guide: simple questions use 200-1,000 thinking tokens, moderate problems use 1,000-10,000, and complex multi-step reasoning can use 10,000-50,000+ thinking tokens. The only way to get exact counts is to run the request and check the response metadata.
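Those rough ranges can be encoded as a pre-flight budget; picking the top of each range keeps the estimate pessimistic. The tier names and boundaries below just restate the guide above, they are not an API contract:

```python
THINKING_RANGES = {  # rough guide: (low, high) thinking tokens per request
    "simple": (200, 1_000),
    "moderate": (1_000, 10_000),
    "complex": (10_000, 50_000),
}

def thinking_budget(tier: str, pessimistic: bool = True) -> int:
    """Pre-flight estimate; only the response's usage metadata is exact."""
    low, high = THINKING_RANGES[tier]
    return high if pessimistic else low
```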

Are thinking tokens counted against the context window?

Yes. Thinking tokens consume context window space alongside input and output tokens. For a model with a 200K context window, the total of input + thinking + output tokens cannot exceed 200,000. This is an important consideration for prompts with large inputs.
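A quick pre-flight check for this constraint, assuming you budget thinking tokens up front:

```python
def fits_context(input_tokens: int, est_thinking_tokens: int,
                 max_output_tokens: int, window: int = 200_000) -> bool:
    """Input + thinking + visible output must all fit in the context window."""
    return input_tokens + est_thinking_tokens + max_output_tokens <= window

print(fits_context(150_000, 40_000, 5_000))  # True:  195,000 <= 200,000
print(fits_context(150_000, 60_000, 5_000))  # False: 215,000 >  200,000
```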

Can I control the amount of thinking?

You cannot directly control thinking token usage, but you can influence it. Providing clearer, more structured prompts with explicit constraints tends to reduce thinking. On supported models, the reasoning_effort parameter (low, medium, or high) trades thinking depth for cost and latency, and the max_completion_tokens parameter sets an upper bound on total output (thinking + visible) tokens.

Should I use o3 or o4-mini?

Choose o3 for tasks requiring the highest reasoning capability, such as complex math, advanced coding, and research analysis. Use o4-mini for everyday reasoning tasks where cost efficiency matters more than peak performance. The o4-mini model provides excellent reasoning at a fraction of the cost of o3.

Model Selection Guide

| Task Type | Recommended Model | Why |
|---|---|---|
| Complex math/science | o3 | Highest reasoning capability for STEM |
| Code generation/review | o3-mini or o4-mini | Strong coding reasoning at lower cost |
| General reasoning | o4-mini | Best cost-to-reasoning ratio |
| Simple Q&A | GPT-4o or GPT-4.1 | No thinking tokens needed, lower cost |
| Research analysis | o1 or o3 | Deep reasoning for nuanced analysis |