Byte Pair Encoding (BPE) is a data compression–inspired algorithm adapted for tokenization in natural language processing and large language models. Originally introduced by Philip Gage in 1994 for byte-level compression, it was repurposed for NLP by Rico Sennrich et al. in their 2016 paper "Neural Machine Translation of Rare Words with Subword Units." BPE works by starting with a vocabulary of all individual characters (or bytes) in the training corpus, then iteratively counting all adjacent token pairs, merging the most frequent pair into a new token, and adding it to the vocabulary. This process repeats until a predefined vocabulary size (e.g., 32,000 or 50,000 tokens) is reached.
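The learning loop described above can be sketched in a few lines of Python. This is a toy implementation over a word-frequency dictionary in the style of Sennrich et al.'s reference code, not a production tokenizer; the function names (`learn_bpe`, `get_pair_counts`, `merge_pair`) are illustrative, and ties in pair frequency are broken by first occurrence:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word, fusing occurrences of `pair` into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge rules from a {word: frequency} dict."""
    # Start at the character level: each word is a space-separated symbol string.
    vocab = {" ".join(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

On the tiny corpus `{"low": 5, "lower": 2}`, two merge steps produce the rules `[('l', 'o'), ('lo', 'w')]` — first the most frequent pair is fused, then the loop repeats on the updated corpus, exactly as the paragraph above describes.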
Technically, the algorithm operates in two phases: learning and encoding. In the learning phase, the corpus is scanned to build frequency counts of all consecutive token pairs. The most frequent pair is merged, and the process repeats. The final result is a set of merge rules (a lookup table). During encoding, new text is tokenized by greedily applying the learned merge rules in the same order they were learned, starting from the character/byte level. For example, given the words "low" and "lower", BPE might first merge "lo" and "w" if they appear frequently together, then merge "low" and "er" to create a token for "lower".
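The encoding phase can be sketched as a replay of the merge table, applying each rule in the order it was learned. The `merges` list below is an assumed, hand-written table for illustration:

```python
def encode(word, merges):
    """Tokenize one word by replaying the learned merges in learned order."""
    symbols = list(word)  # start at the character level
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            # Fuse the pair wherever it occurs; otherwise copy the symbol through.
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# A hypothetical merge table, in the order the rules were learned:
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
print(encode("lower", merges))   # ['lower']
print(encode("lowest", merges))  # ['low', 'e', 's', 't']
```

Note the graceful degradation: "lower" collapses to a single token, while the unseen word "lowest" still encodes, falling back to the partial merge "low" plus single characters.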
Why it matters: BPE provides a fixed-size vocabulary that can represent any input string without out-of-vocabulary tokens, a critical improvement over word-level tokenizers. It strikes a balance between character-level tokenization (which produces very long sequences) and word-level tokenization (which leaves many rare words unrepresentable). BPE is used in many influential models: GPT-2, GPT-3, GPT-4, RoBERTa, BART, Llama 2, Llama 3, and Mistral all use BPE variants. For example, GPT-2 uses a BPE tokenizer with a vocabulary of 50,257 tokens, while Llama 3 uses a BPE tokenizer with roughly 128,000 tokens.
When used vs alternatives: BPE is the dominant tokenizer for English and other languages with alphabetic scripts. Alternatives include WordPiece (used by BERT), Unigram (used by T5 and XLNet), and SentencePiece (a framework that can implement BPE or Unigram). BPE tends to produce more compact tokenizations for common words but can be less consistent for rare words compared to Unigram. For multilingual models, SentencePiece with Unigram is often preferred because it can handle arbitrary languages without pre-tokenization (e.g., XLM-R uses SentencePiece Unigram with a 250k vocabulary).
Common pitfalls: (1) BPE is sensitive to the order of merges; greedy merging can lead to suboptimal tokenizations for very rare sequences. (2) It requires a pre-tokenization step (e.g., splitting on whitespace), which can be problematic for languages without clear word boundaries (e.g., Chinese, Japanese). (3) BPE can produce tokens that are not linguistically meaningful, such as splitting words at non-morphemic boundaries. (4) The tokenizer must be trained on a representative corpus; otherwise, it will perform poorly on domain-specific text (e.g., code vs. medical text).
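Pitfall (2) is easy to see with a deliberately naive whitespace pre-tokenizer (a sketch for illustration; real BPE tokenizers such as GPT-2's use a much richer regex that also splits off punctuation and contractions):

```python
import re

def pretokenize(text):
    """Naive whitespace pre-tokenization: split on runs of non-space characters."""
    return re.findall(r"\S+", text)

print(pretokenize("the quick fox"))    # ['the', 'quick', 'fox']
print(pretokenize("我喜欢自然语言处理"))  # the whole sentence comes back as one chunk
```

For space-delimited scripts this yields sensible word chunks for BPE to operate within, but for Chinese or Japanese the entire sentence arrives as a single undivided unit, so the merge statistics and the resulting tokens become much less meaningful.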
Current state of the art (2026): BPE remains widely used. One important variant is byte-level BPE, introduced with GPT-2 and also used by GPT-4 and Llama 3, which operates directly on UTF-8 bytes: the base vocabulary is the 256 possible byte values, so any string in any language can be encoded with no unknown tokens. Another trend is vocabulary expansion for specific domains: OpenAI's Codex, for instance, extended the GPT-3 BPE vocabulary with tokens for runs of whitespace so that indented code tokenizes more compactly. Research continues on adaptive tokenization and dynamic vocabulary adjustment during training, though BPE's simplicity and efficiency keep it dominant in production systems. Hugging Face's Tokenizers library provides highly optimized BPE implementations that can process millions of tokens per second on a single CPU.
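The byte-level idea can be illustrated in one line: before any merges are applied, a string's initial token sequence is simply its UTF-8 bytes, so every possible input falls inside the fixed 256-symbol base vocabulary. This shows the principle only, not any particular model's tokenizer:

```python
def byte_level_base_tokens(text):
    """Initial token sequence for byte-level BPE: the string's UTF-8 bytes.

    Every input maps into a fixed 256-symbol base vocabulary, so nothing
    is ever out of vocabulary before merges are even applied.
    """
    return list(text.encode("utf-8"))

print(byte_level_base_tokens("é"))  # [195, 169] -- one character, two byte tokens
```

A character outside ASCII simply becomes several byte tokens; learned merges then fuse frequent byte sequences back into longer units, exactly as character-level BPE fuses characters.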