What’s Byte Pair Encoding (BPE), and where is it used?

  • The BPE tokenizer is a subword tokenizer, which means it can split words into smaller parts.
  • The BPE tokenizer was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.
    • It has a total vocabulary size of 50,257 tokens, with <|endoftext|> being assigned the largest token ID (50,256); see the quick check after this list.
  • The BPE tokenizer can handle any unknown word.
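
A quick check of these numbers, as a minimal sketch using OpenAI's tiktoken library (assuming it is installed; "gpt2" is the BPE encoding used for GPT-2):

```python
# Inspect the GPT-2 BPE tokenizer via tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")

print(enc.n_vocab)  # 50257 -- total vocabulary size
# <|endoftext|> is a special token and gets the largest ID in the vocabulary.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]
```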

How does BPE work and handle unknown words?

  • If the tokenizer encounters an unknown word, it can represent it as a sequence of subword tokens or characters.
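
For example, a made-up word such as "someunknownPlace" is still encoded, just as several smaller pieces. A sketch, again assuming tiktoken is available (the exact IDs and splits depend on the learned merges):

```python
# An unfamiliar word is broken into known subword tokens rather than mapped to an <UNK> token.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("someunknownPlace")  # a made-up word, chosen here for illustration
for token_id in ids:
    # Decode each ID individually to see the subword piece it stands for.
    print(token_id, repr(enc.decode([token_id])))
```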

Extract from Wikipedia

Suppose we are encoding the previous example of “aaabdaaabac” with a specified vocabulary size of 6. Then it would first be encoded as “0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 3” with a vocabulary of “a=0, b=1, d=2, c=3”. Then it would proceed as before and obtain “4, 5, 2, 4, 5, 0, 3” with a vocabulary of “a=0, b=1, d=2, c=3, aa=4, ab=5”.

So far this is essentially the same as before. However, if we had specified a vocabulary size of only 5, then the process would stop at the vocabulary “a=0, b=1, d=2, c=3, aa=4”, so that the example would be encoded as “4, 0, 1, 2, 4, 0, 1, 0, 3”. Conversely, if we had specified a vocabulary size of 8, then it would be encoded as “7, 6, 0, 3”, with a vocabulary of “a=0, b=1, d=2, c=3, aa=4, ab=5, aaab=6, aaabd=7”. This is not maximally efficient, but the modified BPE does not aim to maximally compress a dataset; rather, it aims to encode it efficiently for language model training.
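
The merge procedure above can be sketched in a few lines of Python. This is a toy implementation (not the GPT-2/tiktoken code); bpe_train is a name chosen here for illustration, and ties between equally frequent pairs are broken arbitrarily, so merge orders beyond the size-5 case may differ from the walkthrough:

```python
# Toy BPE training: repeatedly merge the most frequent adjacent pair of tokens
# until the vocabulary reaches the target size.
from collections import Counter

def bpe_train(text: str, vocab_size: int):
    tokens = list(text)          # start with one token per character
    vocab = {}                   # token string -> token ID
    for ch in tokens:
        if ch not in vocab:
            vocab[ch] = len(vocab)
    while len(vocab) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))   # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]        # most frequent pair
        merged = a + b
        vocab[merged] = len(vocab)
        # Replace non-overlapping occurrences of the pair with the merged token.
        new_tokens, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return tokens, vocab

tokens, vocab = bpe_train("aaabdaaabac", vocab_size=5)
print(vocab)                       # {'a': 0, 'b': 1, 'd': 2, 'c': 3, 'aa': 4}
print([vocab[t] for t in tokens])  # [4, 0, 1, 2, 4, 0, 1, 0, 3]
```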

BPE example: Simplified

Additional resources