Tokenization: definition + examples

Tokenization is the process of breaking raw text into smaller, discrete units called tokens, which are then mapped to integer IDs that a language model can process. It is the first step in virtually every NLP pipeline and has a profound impact on model behavior, training efficiency, and downstream performance.

How it works. A tokenizer first learns a vocabulary from a training corpus. Common subword tokenization algorithms include Byte-Pair Encoding (BPE), WordPiece, and Unigram. BPE (used by GPT models) iteratively merges the most frequent adjacent symbol pairs, starting from characters or bytes, until a desired vocabulary size is reached. WordPiece (used by BERT) merges pairs that most increase the likelihood of the training data. Unigram (used by T5 and XLNet) starts with a large vocabulary and prunes the tokens whose removal least reduces likelihood. SentencePiece (used by Llama 2, T5) is a library implementing BPE and Unigram that treats the input as a raw text stream with no language-specific pre-tokenization, making it language-agnostic. After learning, the tokenizer splits input text into subwords: e.g., "unhappiness" might become ["un", "happiness"] or ["un", "happi", "ness"] depending on the vocabulary. Each token is then assigned a unique integer ID from a lookup table.
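To make the BPE step concrete, here is a minimal, illustrative sketch (not any production tokenizer) of learning merge rules by repeatedly fusing the most frequent adjacent symbol pair:

```python
# Toy BPE vocabulary learning: repeatedly merge the most frequent
# adjacent symbol pair until the target number of merges is reached.
from collections import Counter

def learn_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start from character-level symbols for each word.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge everywhere it occurs.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

# Example: "low", "lower", "lowest" quickly yield a merged "low" symbol.
print(learn_bpe(["low", "lower", "lowest", "low"], num_merges=3))
```

A real tokenizer additionally records the merge order so the same rules can be replayed deterministically at inference time, and maps each final symbol to an integer ID.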

Why it matters. Tokenization directly controls how much *semantic* content fits in the model's context window — a model with a 128K-token context limit can process more actual text if its tokenizer is efficient. For example, Llama 3 uses a 128K-entry vocabulary and achieves roughly 1.5× better token efficiency than GPT-2's 50K-entry vocabulary, i.e., fewer tokens per word on average, which reduces memory use and latency. Tokenization also determines how the model handles rare words, spelling, and multilingual text. Poor tokenization can inflate sequence lengths, waste compute, and degrade performance on low-resource languages.
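A rough way to see this effect is the sketch below (assuming the tiktoken package is installed), which counts tokens for the same sentence under GPT-2's ~50K-token encoding and the ~100K-token cl100k_base encoding; the sample sentence is arbitrary:

```python
# Compare token counts for the same text under two BPE vocabularies.
import tiktoken

text = "Tokenization directly controls a model's effective context window."

gpt2_enc = tiktoken.get_encoding("gpt2")           # ~50K-token vocabulary
cl100k_enc = tiktoken.get_encoding("cl100k_base")  # ~100K-token vocabulary

print(len(gpt2_enc.encode(text)), "tokens with gpt2")
print(len(cl100k_enc.encode(text)), "tokens with cl100k_base")
# The larger vocabulary typically needs fewer tokens for the same text,
# leaving more of a fixed context window for actual content.
```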

When it's used vs. alternatives. Tokenization is used at both training and inference time. Alternatives include character-level models (e.g., CANINE) and byte-level models (e.g., ByT5, MegaByte), which avoid a learned subword vocabulary entirely but require much longer sequences; MegaByte mitigates this by grouping bytes into patches. As of 2026, subword tokenization remains dominant due to its balance of vocabulary size and sequence length, though byte-level methods are gaining traction for multilingual and noisy text.

Common pitfalls. A frequent mistake is using a tokenizer that does not match the model's pretrained vocabulary (e.g., running GPT-2's tokenizer against a BERT checkpoint). Another is ignoring tokenization fairness: English text typically tokenizes into fewer tokens per word than, say, Swahili or Thai, leading to systematic bias in compute allocation and performance. Overly aggressive merging can also leave rare, under-trained subword tokens in the vocabulary that the model handles unpredictably, producing plausible-looking but wrong outputs.
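A quick, illustrative way to check the fairness issue is to measure tokens per character ("fertility") for comparable sentences in several languages with a single tokenizer; the sample strings below are hypothetical stand-ins:

```python
# Measure tokenization fertility (tokens per character) across languages.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Good morning, how are you today?",
    "Swahili": "Habari ya asubuhi, hujambo leo?",
    "Thai": "สวัสดีตอนเช้า วันนี้เป็นอย่างไรบ้าง",
}

for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang}: {n_tokens} tokens, {n_tokens / len(text):.2f} tokens/char")
# Non-Latin scripts often show much higher tokens-per-character ratios,
# which translates into higher cost and shorter effective context.
```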

Current state of the art (2026). State-of-the-art tokenizers are trained on massive, multilingual corpora with vocabularies of 128K–256K tokens. Llama 3.1 and GPT-4o use byte-level BPE: Llama 3 replaced Llama 2's SentencePiece tokenizer with a tiktoken-style 128K-token encoding, and GPT-4o uses the roughly 200K-token o200k_base encoding. Research focuses on dynamic tokenization (e.g., adaptive tokenizers that adjust per domain) and tokenizer-free architectures like MegaByte, which process raw bytes in patches. OpenAI's tiktoken library has become a de facto standard for fast BPE tokenization in production. The trend is toward larger vocabularies (to shorten sequences) with byte-level fallbacks (to improve robustness and cross-lingual parity).
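The byte-level fallback is what removes out-of-vocabulary failures: any string ultimately reduces to UTF-8 bytes, so encoding and decoding round-trip exactly even for text the tokenizer never saw in training. A small sketch with tiktoken (the test string is arbitrary):

```python
# Byte-level BPE has no <UNK> token: unseen strings fall back to raw bytes.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # byte-level BPE encoding

weird = "Zörgő-ναι-🚀 tokenization"
ids = enc.encode(weird)

assert enc.decode(ids) == weird  # round-trips exactly, no unknown token
print(len(ids), "tokens for", len(weird.encode("utf-8")), "UTF-8 bytes")
```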

Examples

  • Llama 3.1 405B uses a 128K-token byte-level BPE tokenizer (tiktoken-style, replacing Llama 2's SentencePiece tokenizer), giving ~1.5× better token efficiency than GPT-2's 50K vocabulary.
  • BERT-base uses a 30K-token WordPiece tokenizer; 'unhappiness' is tokenized as ['un', '##happiness'].
  • GPT-4 uses the ~100K-token cl100k_base BPE encoding via tiktoken; 'hello world' becomes [15339, 1917]. GPT-4o moves to the larger ~200K-token o200k_base encoding.
  • ByT5 (2021) operates on UTF-8 bytes directly, avoiding tokenization, but requires sequences 4–8× longer than subword models.
  • MegaByte (2023) processes bytes in fixed-size patches, achieving competitive perplexity on long-context byte-level language modeling with faster decoding than a standard byte-level Transformer.

Related terms

Byte-Pair Encoding · SentencePiece · Vocabulary · Subword Regularization · Byte-Level Model

FAQ

What is Tokenization?

Tokenization converts raw text into discrete units (tokens) — words, subwords, or characters — that a model can process. It determines vocabulary size, sequence length, and how out-of-vocabulary words are handled, directly impacting training efficiency and model quality.

How does Tokenization work?

A tokenizer first learns a vocabulary from a training corpus using an algorithm such as BPE, WordPiece, or Unigram. At run time it splits input text into subwords from that vocabulary (e.g., 'unhappiness' → ['un', 'happi', 'ness']) and maps each subword to an integer ID via a lookup table; those IDs are what the model actually consumes.

Where is Tokenization used in 2026?

Llama 3.1 405B uses a 128K-token byte-level BPE tokenizer, giving ~1.5× better token efficiency than GPT-2's 50K vocabulary. BERT-base uses a 30K-token WordPiece tokenizer; 'unhappiness' is tokenized as ['un', '##happiness']. GPT-4 uses the ~100K-token cl100k_base BPE encoding via tiktoken ('hello world' → [15339, 1917]), while GPT-4o uses the larger ~200K-token o200k_base encoding.