Compute time, Parameter count, Embedding Size, Context Window

Given that an NVIDIA A100 has a peak throughput of ~312 TFLOPS for FP16 (Tensor Core) precision:

| Year | Model | Est. Parameter Count | Embedding Size | Context Window | Vocabulary Size | Est. Compute† (FLOPs) | Est. Training Duration (GPUs) | Est. Transformer Layers | Est. Compute Growth | Est. Parameter Growth |
|---|---|---|---|---|---|---|---|---|---|---|
| 2018 | GPT-1 | 117M | 768 | ~512 tokens | | | | | | |
| | GPT-2 Small | 117M | 768 | 1024 tokens | 50257† | | | | | |
| | GPT-2 Medium | 345M | 1024 | 1024 tokens | 50257† | | | | | |
| | GPT-2 Large | 762M | 1280 | 1024 tokens | 50257† | | | | | |
| 2019 | GPT-2 XL | 1.5B | 1600 | 1024 tokens | 50257† | | | | | |
| | GPT-3 (Ada) | ~350M | 768 | 2048 tokens | 50257† | | | | | |
| | GPT-3 (Babbage) | ~1.3B | 1024 | 2048 tokens | 50257† | | | | | |
| | GPT-3 (Curie) | ~6.7B | 1600 | 2048 tokens | 50257† | | | | | |
| 2020 | GPT-3 (Davinci) | 175B | 12288 | 2048 tokens | 100256† | | | 96 | + ~2 OOMs | + ~2 OOMs |
| 2023 | GPT-4 (varies) | ~1-2 trillion | Up to 12288 | 8192-32768 tokens | 100256† | | | | + ~1.5-2 OOMs | + ~1 OOM |
| | GPT-4o-Mini | | | | 200019† | | | | | |
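The "Est. Compute" and "Est. Training Duration" columns can be filled in with the common C ≈ 6·N·D approximation (N = parameters, D = training tokens) and the A100 throughput above. A minimal sketch, where the GPU count and 30% utilization are illustrative assumptions rather than published figures:

```python
# Back-of-envelope training compute via C ≈ 6·N·D, and wall-clock time
# on A100s. GPU count and utilization below are assumed, not published.

A100_FP16_FLOPS = 312e12  # peak FP16 Tensor Core throughput per A100


def training_flops(params, tokens):
    """Approximate total training FLOPs via C ≈ 6·N·D."""
    return 6 * params * tokens


def training_days(total_flops, num_gpus, utilization=0.3):
    """Wall-clock days at a given GPU count and hardware utilization."""
    effective_flops = num_gpus * A100_FP16_FLOPS * utilization
    return total_flops / effective_flops / 86_400  # seconds per day


# GPT-3 (Davinci): 175B parameters, ~300B training tokens
c = training_flops(175e9, 300e9)
print(f"{c:.2e} FLOPs")                      # ≈ 3.15e+23
print(f"{training_days(c, 1024):.0f} days")  # ≈ 38 days on 1024 A100s
```

This reproduces the widely cited ~3×10²³ FLOPs estimate for GPT-3; the training-duration figure scales inversely with the assumed GPU count and utilization.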

Visual comparison

```mermaid
graph LR
    A[GPT Models] --> GPT1[GPT-1]
    GPT1 --> G1D[768 dimensions, 512 tokens]

    A --> GPT2[GPT-2]
    GPT2 --> G2S[Small: 768 dimensions, 1024 tokens]
    GPT2 --> G2M[Medium: 1024 dimensions, 1024 tokens]
    GPT2 --> G2L[Large: 1280 dimensions, 1024 tokens]
    GPT2 --> G2XL[XL: 1600 dimensions, 1024 tokens]

    A --> GPT3[GPT-3]
    GPT3 --> G3A[Ada: 768 dimensions, 2048 tokens]
    GPT3 --> G3B[Babbage: 1024 dimensions, 2048 tokens]
    GPT3 --> G3C[Curie: 1600 dimensions, 2048 tokens]
    GPT3 --> G3D[Davinci: 12288 dimensions, 2048 tokens]

    A --> GPT4[GPT-4]
    GPT4 --> G4V[Up to 12288 dimensions, 8192-32768 tokens]
```
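The parameter counts and embedding sizes above are linked: for a standard GPT-style decoder, the weights come to roughly 12·L·d² for the attention and MLP blocks plus V·d for the token embeddings (layer norms and biases ignored). A rough sketch, assuming GPT-3 Davinci's 96 layers and the 50257-token GPT-2/3 BPE vocabulary:

```python
# Rough transformer parameter count from architecture dimensions,
# assuming the standard GPT decoder layout: ~12·L·d² weights in the
# attention + MLP blocks plus a V·d token-embedding matrix.

def approx_params(n_layers, d_model, vocab_size):
    """Estimate parameter count from layer count, width, and vocab."""
    block_params = 12 * n_layers * d_model**2  # attention + MLP weights
    embed_params = vocab_size * d_model        # token embeddings
    return block_params + embed_params


# GPT-3 (Davinci): 96 layers, 12288-dim embeddings, 50257-token vocab
print(f"{approx_params(96, 12288, 50257) / 1e9:.1f}B")  # ≈ 174.6B ≈ 175B
```

The same formula recovers GPT-2 Small's ~124M parameters from 12 layers and 768 dimensions, which is why embedding size is such a strong predictor of model scale.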

Pre-training dataset and cost

| Year | Model | Dataset (size) | Cost (in terms of cloud compute) |
|---|---|---|---|
| 2018 | GPT-1 | | |
| 2019 | GPT-2 | | |
| 2020 | GPT-3 | 300 billion tokens† (60% of the original 499 billion tokens) | ~$4.6 million† |
| 2023 | GPT-4 | | |
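A cost estimate of this kind follows directly from the compute figure: GPU-hours needed at a given throughput and utilization, times an hourly rental price. The price and utilization below are placeholder assumptions; the ~$4.6 million figure above was derived under different (2020-era V100) pricing, so the output here is illustrative only:

```python
# Illustrative cloud-cost estimate from total training compute.
# Hourly price and utilization are assumed placeholders, not quotes.

def cloud_cost(total_flops, gpu_flops, price_per_hour, utilization=0.3):
    """Dollar cost of renting enough GPU-hours for a training run."""
    gpu_seconds = total_flops / (gpu_flops * utilization)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * price_per_hour


# GPT-3-scale run (~3.15e23 FLOPs) on A100s at an assumed $2/GPU-hour
print(f"${cloud_cost(3.15e23, 312e12, 2.0):,.0f}")  # ≈ $1.9M
```

The spread between this A100-based number and the ~$4.6M V100-era estimate shows how sensitive such figures are to hardware generation, utilization, and pricing assumptions.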
*Sources: