cl100k_base Tokenizer Explained

The tokenizer used by GPT-4, GPT-4 Turbo, and GPT-3.5-Turbo

What is cl100k_base?

The cl100k_base tokenizer is OpenAI's encoding scheme used by the GPT-4 and GPT-3.5 family of models. It converts text into numerical tokens that the model can process, and it forms the foundation of how these models understand and generate language.

cl100k_base at a Glance

  • Vocabulary size: 100,256 base tokens (plus a small number of special tokens such as <|endoftext|>)
  • Encoding method: Byte Pair Encoding (BPE)
  • Unicode support: Full UTF-8 coverage
  • Average efficiency: ~4 characters per token (English)

Models Using cl100k_base

The cl100k_base tokenizer is shared across the GPT-3.5 and GPT-4 model families, with one notable exception: GPT-4o moved to the newer o200k_base encoding. Knowing which models use this tokenizer is essential for accurate token counting and cost estimation.

| Model | Context Window | Release |
| --- | --- | --- |
| GPT-3.5-Turbo | 16,385 tokens | 2023 |
| GPT-4 | 8,192 tokens | 2023 |
| GPT-4-32k | 32,768 tokens | 2023 |
| GPT-4 Turbo | 128,000 tokens | 2024 |

Note: GPT-4o (2024) is often grouped with this family, but it uses the newer o200k_base encoding.
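
A common practical use of the context-window figures above is checking whether a prompt will fit before sending a request. The sketch below is illustrative: the model names and window sizes are taken from the table, and `fits_in_context` is a hypothetical helper, not part of any OpenAI SDK.

```python
# Context-window sizes (in tokens) from the table above. Illustrative only.
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4": 8_192,
    "gpt-4-32k": 32_768,
    "gpt-4-turbo": 128_000,
}

def fits_in_context(model: str, prompt_tokens: int, max_output_tokens: int = 0) -> bool:
    """Return True if the prompt plus reserved output space fits in the model's window."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOWS[model]
```

Remember that the window is shared between input and output, so reserve room for the completion you expect back.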

How cl100k_base Token Counting Works

The cl100k_base tokenizer uses Byte Pair Encoding (BPE), an algorithm that iteratively merges the most frequent pairs of bytes or characters in a corpus to build a vocabulary of subword units. This allows the tokenizer to handle any text, including rare words and multilingual content.

BPE Encoding Process

  1. Text is first converted to UTF-8 bytes
  2. Common byte pairs are merged iteratively based on frequency
  3. The process repeats until the target vocabulary size (~100K) is reached
  4. Each resulting subword unit becomes a token in the vocabulary
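
The four training steps above can be sketched in a toy implementation. This is a simplified illustration of the BPE algorithm, not OpenAI's actual (heavily optimized) training code; real vocabularies are learned from enormous corpora, and production tokenizers also apply a regex pre-split before merging.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one (or None)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def bpe_train(text: str, num_merges: int):
    """Learn `num_merges` BPE merge rules from raw UTF-8 bytes (toy version)."""
    tokens = list(text.encode("utf-8"))          # step 1: convert to UTF-8 bytes
    merges = []
    next_id = 256                                # byte values occupy ids 0-255
    for _ in range(num_merges):                  # steps 2-3: merge iteratively
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merges.append((pair, next_id))
        # Replace every occurrence of the winning pair with the new token id
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1                             # step 4: a new vocabulary entry
    return tokens, merges
```

Running `bpe_train("aaabdaaabac", 2)` first merges the most frequent byte pair `(97, 97)` (i.e. "aa") into a new token, then merges again, shrinking the sequence with each pass.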

Token Count Examples

Here are some examples of how cl100k_base tokenizes common text:

  • Hello, world! → 4 tokens
  • The quick brown fox → 4 tokens
  • Artificial intelligence → 2 tokens
  • tokenization → 2 tokens
  • supercalifragilisticexpialidocious → 7 tokens

Efficiency Characteristics

For English text, cl100k_base averages approximately 4 characters per token. This ratio varies by language and content type:

  • English prose: ~4 characters/token
  • Source code: ~3 characters/token (more whitespace and symbols)
  • Chinese/Japanese/Korean: ~1.5-2 characters/token
  • Structured data (JSON/XML): ~3.5 characters/token
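
When an exact tokenizer is unavailable, the ratios above can drive a rough estimator. This is a hypothetical helper built only from the averages listed here; real token counts vary with the specific text, so use tiktoken when precision matters.

```python
# Rough characters-per-token ratios from the list above; actual counts vary.
CHARS_PER_TOKEN = {
    "english": 4.0,
    "code": 3.0,
    "cjk": 1.75,        # midpoint of the ~1.5-2 range for Chinese/Japanese/Korean
    "structured": 3.5,  # JSON/XML
}

def estimate_tokens(text: str, content_type: str = "english") -> int:
    """Ballpark cl100k_base token estimate; use a real tokenizer for exact counts."""
    ratio = CHARS_PER_TOKEN[content_type]
    return max(1, round(len(text) / ratio))
```

A 400-character English paragraph, for instance, estimates to roughly 100 tokens.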

Comparison with Other Tokenizers

Understanding how cl100k_base compares to other tokenizers helps you choose the right model and estimate costs accurately.

| Tokenizer | Vocabulary | Chars/Token | Models |
| --- | --- | --- | --- |
| cl100k_base | ~100K | ~4 | GPT-3.5, GPT-4, GPT-4 Turbo |
| o200k_base | ~200K | ~5 | GPT-4o, GPT-4.1, GPT-5, o1, o3, o4 |
| Claude | ~100K | ~4-5 | Claude 3, Claude 4, Claude 4.5 |
| Gemini | ~256K | ~4 | Gemini 1.5, Gemini 2.0, Gemini 2.5 |
| Llama (tiktoken-based) | ~128K | ~4 | Llama 3, Llama 4 |

When to Use cl100k_base vs o200k_base

Use cl100k_base when:

  • Working with GPT-3.5-Turbo, GPT-4, or GPT-4 Turbo
  • You need exact token counts for these specific models
  • Estimating costs for existing GPT-4-based applications
  • Maintaining backward compatibility with deployed systems
  • Building applications that target the GPT-4 API

Use o200k_base when:

  • Working with GPT-4o, GPT-4.1, GPT-5, or the reasoning models (o1, o3, o4)
  • You want better token efficiency (fewer tokens for the same text)
  • Building new applications targeting the latest models
  • Working extensively with multilingual content
  • Optimizing for cost with next-generation models