What is cl100k_base?
The cl100k_base tokenizer is OpenAI's encoding scheme used by the GPT-4 and GPT-3.5 family of models. It converts text into numerical tokens that the model can process, and it forms the foundation of how these models understand and generate language.
cl100k_base at a Glance
- Vocabulary size: 100,256 base tokens (plus special tokens such as <|endoftext|>)
- Encoding method: Byte Pair Encoding (BPE)
- Unicode support: Full UTF-8 coverage
- Average efficiency: ~4 characters per token (English)
Models Using cl100k_base
The cl100k_base tokenizer is shared across the GPT-3.5 and GPT-4 model family. Understanding which models use this tokenizer is essential for accurate token counting and cost estimation. Note that GPT-4o and later models switched to the newer o200k_base encoding.
| Model | Context Window | Release |
|---|---|---|
| GPT-3.5-Turbo | 16,385 tokens | 2023 |
| GPT-4 | 8,192 tokens | 2023 |
| GPT-4-32k | 32,768 tokens | 2023 |
| GPT-4 Turbo | 128,000 tokens | 2024 |
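As a rough sanity check, the context windows above can be combined with the ~4 characters/token English average to estimate whether a prompt is likely to fit. This is a hypothetical helper for illustration, not an official API:

```python
# Context windows from the table above (a sketch; exact counts need the tokenizer).
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4": 8_192,
    "gpt-4-32k": 32_768,
    "gpt-4-turbo": 128_000,
}

def fits_context(text: str, model: str, chars_per_token: float = 4.0) -> bool:
    """Rough estimate: does `text` fit in `model`'s context window?"""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= CONTEXT_WINDOWS[model]
```

In practice you would leave headroom for the model's response tokens as well, since the context window covers prompt and completion combined.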
How cl100k_base Token Counting Works
The cl100k_base tokenizer uses Byte Pair Encoding (BPE), an algorithm that iteratively merges the most frequent pairs of bytes or characters in a corpus to build a vocabulary of subword units. This allows the tokenizer to handle any text, including rare words and multilingual content.
BPE Encoding Process
- Text is first converted to UTF-8 bytes
- Common byte pairs are merged iteratively based on frequency
- The process repeats until the target vocabulary size (~100K) is reached
- Each resulting subword unit becomes a token in the vocabulary
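The merge loop above can be sketched in miniature. This toy trainer works on characters rather than UTF-8 bytes and on a tiny corpus, so it illustrates only the algorithm, not the real cl100k_base training run:

```python
from collections import Counter

def learn_bpe_merges(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair.

    The real cl100k_base vocabulary was trained on UTF-8 bytes over a huge
    corpus with ~100K merges; this sketch shows the core loop only.
    """
    symbols = list(corpus)  # start from single characters (real BPE: bytes)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with its merged symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges
```

Each learned merge becomes a vocabulary entry; encoding later applies the same merges, in order, to new text.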
Token Count Examples
Here are some examples of how cl100k_base tokenizes common text:
- `Hello, world!` → 4 tokens
- `The quick brown fox` → 4 tokens
- `Artificial intelligence` → 2 tokens
- `tokenization` → 2 tokens
- `supercalifragilisticexpialidocious` → 7 tokens
Efficiency Characteristics
For English text, cl100k_base averages approximately 4 characters per token. This ratio varies by language and content type:
- English prose: ~4 characters/token
- Source code: ~3 characters/token (more whitespace and symbols)
- Chinese/Japanese/Korean: ~1.5-2 characters/token
- Structured data (JSON/XML): ~3.5 characters/token
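These averages allow quick back-of-envelope estimates without loading the tokenizer. The values below are just the ratios listed above; exact counts still require the tokenizer itself:

```python
# Approximate chars-per-token ratios for cl100k_base, from the list above.
CHARS_PER_TOKEN = {
    "english": 4.0,
    "code": 3.0,
    "cjk": 1.75,        # midpoint of the ~1.5-2 range
    "structured": 3.5,  # JSON/XML
}

def estimate_tokens(text: str, content_type: str = "english") -> int:
    """Ballpark cl100k_base token count based on content type."""
    return round(len(text) / CHARS_PER_TOKEN[content_type])
```

Such estimates are typically within 10-20% for long, homogeneous text, but can be far off for short strings or mixed content.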
Comparison with Other Tokenizers
Understanding how cl100k_base compares to other tokenizers helps you choose the right model and estimate costs accurately.
| Tokenizer | Vocabulary | Chars/Token | Models |
|---|---|---|---|
| cl100k_base | ~100K | ~4 | GPT-3.5, GPT-4, GPT-4 Turbo |
| o200k_base | ~200K | ~5 | GPT-4o, GPT-4.1, GPT-5, o1, o3, o4 |
| Claude | ~100K | ~4-5 | Claude 3, Claude 4, Claude 4.5 |
| Gemini | ~256K | ~4 | Gemini 1.5, Gemini 2.0, Gemini 2.5 |
| Llama (tiktoken) | ~128K | ~4 | Llama 3, Llama 4 |
When to Use cl100k_base vs o200k_base
Use cl100k_base when:
- Working with GPT-3.5-Turbo, GPT-4, or GPT-4 Turbo
- You need exact token counts for these specific models
- Estimating costs for existing GPT-4-based applications
- Maintaining backward compatibility with deployed systems
- Building applications that target the GPT-4 API
Use o200k_base when:
- Working with GPT-4o, GPT-4.1, GPT-5, or reasoning models (o1, o3, o4)
- You want better token efficiency (fewer tokens for the same text)
- Building new applications targeting the latest models
- Working extensively with multilingual content
- Optimizing for cost with next-generation models
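The decision above boils down to a model-to-encoding lookup. tiktoken's own `encoding_for_model` does this authoritatively; the sketch below only mirrors the guidance in this section for a few unambiguous models:

```python
# A minimal model-to-encoding lookup, per the lists above.
# For real use, prefer tiktoken.encoding_for_model(), which stays current.
MODEL_ENCODINGS = {
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-4": "cl100k_base",
    "gpt-4-turbo": "cl100k_base",
    "gpt-4.1": "o200k_base",
    "o1": "o200k_base",
    "o3": "o200k_base",
}

def encoding_for(model: str) -> str:
    """Return the tokenizer encoding name for a known model."""
    return MODEL_ENCODINGS[model]
```

Keeping the lookup in one place means a model upgrade only changes one table, and token-counting code downstream picks the right encoding automatically.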