Compute Time, Parameter Count, Embedding Size, Context Window
Given that an NVIDIA A100 has a peak throughput of ~312 TFLOPS at FP16 precision (dense, Tensor Cores):
| Year | Model | Est. Parameter Count | Embedding Size | Context Window | Vocabulary Size | Est. Compute† (FLOPs) | Est. Training Duration (GPUs)* | Est. Transformer Layers | Est. Compute Growth | Est. Parameter Growth |
|---|---|---|---|---|---|---|---|---|---|---|
| 2018 | GPT-1 | 117M | 768 | ~512 tokens | | | | | | |
| | GPT-2 Small | 117M | 768 | 1024 tokens | 50257† | | | | | |
| | GPT-2 Medium | 345M | 1024 | 1024 tokens | 50257† | | | | | |
| | GPT-2 Large | 762M | 1280 | 1024 tokens | 50257† | | | | | |
| 2019 | GPT-2 XL | 1.5B | 1600 | 1024 tokens | 50257† | | | | | |
| | GPT-3 (Ada) | ~350M | 1024 | 2048 tokens | 50257† | | | | | |
| | GPT-3 (Babbage) | ~1.3B | 2048 | 2048 tokens | 50257† | | | | | |
| | GPT-3 (Curie) | ~6.7B | 4096 | 2048 tokens | 50257† | | | | | |
| 2020 | GPT-3 (Davinci) | 175B | 12288 | 2048 tokens | 50257† | | | 96 | + ~2 OOMs | + ~2 OOMs |
| 2023 | GPT-4 (varies) | ~1-2 trillion | Up to 12288 | 8192-32768 tokens | 100256† | | | | + ~1.5-2 OOMs | + ~1 OOM |
| 2024 | GPT-4o-Mini | | | ~128000 tokens | 200019† | | | | | |
- \*Without factoring in memory bandwidth, interconnect latency, power efficiency, etc.
- †Sources:
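The empty compute and duration columns above can be filled in with the widely used approximation C ≈ 6 · N · D (total training FLOPs ≈ 6 per parameter per token, covering forward and backward passes). A minimal sketch, assuming the A100 figure given above and an assumed 30% model-FLOPs utilization; the GPU count and utilization are illustrative, not official figures:

```python
# Rough sketch: estimate training compute and A100 wall-clock time using the
# common C ~= 6 * N * D approximation (N = parameters, D = training tokens).
# MFU (model FLOPs utilization) of 30% is an assumption for illustration.

A100_FP16_TFLOPS = 312   # peak dense FP16 Tensor Core throughput
MFU = 0.3                # assumed fraction of peak actually achieved

def train_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs: ~6 per parameter per token."""
    return 6.0 * params * tokens

def a100_days(flops: float, n_gpus: int, mfu: float = MFU) -> float:
    """Days to run `flops` on `n_gpus` A100s at the assumed utilization."""
    effective_rate = n_gpus * A100_FP16_TFLOPS * 1e12 * mfu  # FLOPs/s
    return flops / effective_rate / 86_400

# Example: GPT-3 Davinci, 175B params, ~300B training tokens (paper figure).
c = train_flops(175e9, 300e9)
print(f"{c:.2e} FLOPs, ~{a100_days(c, 1024):.0f} days on 1024 A100s")
```

This yields roughly 3.15e23 FLOPs for Davinci; the duration estimate scales inversely with both GPU count and utilization, which is why the footnote about memory bandwidth and interconnect latency matters.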
Visual comparison
```mermaid
graph LR
    A[GPT Models] --> GPT1[GPT Original]
    GPT1 --> G1D[768 dimensions, 512 tokens]
    A --> GPT2[GPT-2]
    GPT2 --> G2S[Small: 768 dimensions, 1024 tokens]
    GPT2 --> G2M[Medium: 1024 dimensions, 1024 tokens]
    GPT2 --> G2L[Large: 1280 dimensions, 1024 tokens]
    GPT2 --> G2XL[XL: 1600 dimensions, 1024 tokens]
    A --> GPT3[GPT-3]
    GPT3 --> G3A[Ada: 1024 dimensions, 2048 tokens]
    GPT3 --> G3B[Babbage: 2048 dimensions, 2048 tokens]
    GPT3 --> G3C[Curie: 4096 dimensions, 2048 tokens]
    GPT3 --> G3D[Davinci: 12288 dimensions, 2048 tokens]
    A --> GPT4[GPT-4]
    GPT4 --> G4V[Up to 12288 dimensions, 8192-32768 tokens]
```
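The "orders of magnitude" (OOM) growth figures in the table are just log10 ratios between successive flagship parameter counts. A quick check, using 1.5 trillion as an assumed midpoint of the ~1-2 trillion GPT-4 estimate:

```python
# Sketch: derive the table's OOM growth figures as log10 ratios of
# parameter counts. The GPT-4 count (1.5e12) is an assumed midpoint
# of the ~1-2 trillion estimate, not an official figure.
import math

def oom_growth(old_params: float, new_params: float) -> float:
    """Orders-of-magnitude growth from old to new parameter count."""
    return math.log10(new_params / old_params)

print(f"GPT-2 XL -> Davinci: +{oom_growth(1.5e9, 175e9):.1f} OOMs")
print(f"Davinci  -> GPT-4:   +{oom_growth(175e9, 1.5e12):.1f} OOMs")
```

This reproduces the table's "+ ~2 OOMs" (GPT-2 XL to Davinci, ~2.1) and "+ ~1 OOM" (Davinci to GPT-4, ~0.9) parameter-growth entries.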