What’s Byte Pair Encoding (BPE), and where is it used?
- The BPE tokenizer is a subword tokenizer, which means it can split words into smaller parts.
- The BPE tokenizer was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.
- Has a total vocabulary size of 50,257 tokens, with <|endoftext|> being assigned the largest token ID (see the code sketch after this list).
- The BPE tokenizer can handle any unknown words.
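A minimal sketch of these points, assuming the open-source tiktoken package is installed (any GPT-2 BPE implementation would report the same numbers):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2's BPE tokenizer

print(enc.n_vocab)                    # 50257 -- total vocabulary size

# <|endoftext|> must be explicitly allowed as a special token when encoding
ids = enc.encode("Hello, world. <|endoftext|>",
                 allowed_special={"<|endoftext|>"})
print(ids[-1])                        # 50256 -- the largest token ID
print(enc.decode(ids))                # round-trips back to the input text
```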
How does BPE work and handle unknown words?
- If the tokenizer encounters an unknown word, it can represent it as a sequence of subword tokens or characters.
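As a quick illustration (again assuming tiktoken; the word below is arbitrary nonsense chosen for this example, not taken from the source), decoding each token ID individually reveals the subword pieces an unknown word gets split into:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# "Akwirw ier" is a made-up string that is not in the vocabulary as a whole word
ids = enc.encode("Akwirw ier")
print(ids)
print([enc.decode([i]) for i in ids])  # the individual subword/character pieces
```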
Extract from Wikipedia:
Suppose we are encoding the previous example of “aaabdaaabac”, with a specified vocabulary size of 6. The text would first be encoded as “0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 3” with a vocabulary of “a=0, b=1, d=2, c=3”. Then it would proceed as before and obtain “4, 5, 2, 4, 5, 0, 3” with a vocabulary of “a=0, b=1, d=2, c=3, aa=4, ab=5”.
So far this is essentially the same as before. However, if we had only specified a vocabulary size of 5, then the process would stop at the vocabulary “a=0, b=1, d=2, c=3, aa=4”, so that the example would be encoded as “4, 0, 1, 2, 4, 0, 1, 0, 3”. Conversely, if we had specified a vocabulary size of 8, then it would be encoded as “7, 6, 0, 3”, with a vocabulary of “a=0, b=1, d=2, c=3, aa=4, ab=5, aaab=6, aaabd=7”. This is not maximally efficient, but the modified BPE does not aim to maximally compress a dataset; rather, it aims to encode it efficiently for language-model training.
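A minimal sketch of the merge loop described in the extract above. Two details are assumptions not stated there: characters get IDs in order of first appearance, and ties between equally frequent pairs are broken by picking the numerically smaller ID pair (which happens to reproduce the vocabulary-size-6 result; other tie-break rules may yield different merges):

```python
from collections import Counter

def bpe_train(text, vocab_size):
    # assign IDs to characters in order of first appearance: a=0, b=1, d=2, c=3
    vocab = {ch: i for i, ch in enumerate(dict.fromkeys(text))}
    inv = {i: tok for tok, i in vocab.items()}
    ids = [vocab[ch] for ch in text]

    while len(vocab) < vocab_size:
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break
        best = max(pair_counts.values())
        # most frequent pair; break ties by the smaller ID pair (an assumption)
        a, b = min(p for p, c in pair_counts.items() if c == best)
        new_id = len(vocab)
        vocab[inv[a] + inv[b]] = new_id
        inv[new_id] = inv[a] + inv[b]

        # replace non-overlapping occurrences of (a, b) with the new ID
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == (a, b):
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids, vocab

print(bpe_train("aaabdaaabac", vocab_size=6))
# ([4, 5, 2, 4, 5, 0, 3], {'a': 0, 'b': 1, 'd': 2, 'c': 3, 'aa': 4, 'ab': 5})
```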