Read the book here: situational-awareness.ai or find the full series as a 165-page PDF here.
I have summarised & highlighted the important concepts covered in the book for speed reading.
Chapter 1: From GPT-4 to AGI: Counting the OOMs
TL;DR
The author outright claims that “achieving AGI by 2027 is plausible”, which took me aback. However, the data points presented in this chapter convinced me of his argument.
GPT-2 to GPT-4 took us from ~preschooler to ~smart high-schooler abilities in 4 years. Tracing the trend lines in three categories of scale-up, we should expect another preschooler-to-high-schooler-sized qualitative jump by 2027:
- Compute (~0.5 OOMs/year)
- Algorithmic efficiencies (~0.5 OOMs/year)
- Unhobbling (from chatbots to agents)
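To make the trend-line arithmetic concrete, here is a quick back-of-the-envelope sketch (my own illustration, not from the book): it simply compounds the per-year rates listed above and ignores unhobbling, which is harder to quantify.

```python
# Rough sketch: project "effective compute" growth by compounding the two
# quantifiable trend lines above. Rates are the rough ones quoted in this summary.

COMPUTE_OOMS_PER_YEAR = 0.5   # physical compute trend (~0.5 OOMs/year)
ALGO_OOMS_PER_YEAR = 0.5      # algorithmic efficiency trend (~0.5 OOMs/year)

def effective_compute_gain(years: float) -> float:
    """Total effective-compute gain (as a multiplier) after `years`,
    ignoring unhobbling gains."""
    total_ooms = years * (COMPUTE_OOMS_PER_YEAR + ALGO_OOMS_PER_YEAR)
    return 10 ** total_ooms

# From 2023 to 2027 (~4 years): ~4 OOMs, i.e. ~10,000x effective compute,
# roughly another GPT-2-to-GPT-4-sized jump.
print(f"{effective_compute_gain(4):,.0f}x")   # -> 10,000x
```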
Here are some of the figures used by the author to support his claim.
Figure: Rough estimates of past and future scaleup of effective compute (both physical compute and algorithmic efficiencies).
Figure: Progress over just 4 years.
Figure: GPT-4 scores on standardized tests.
Figure: Gray: Professional forecasts, made in August 2021, for June 2022 performance on the MATH benchmark (difficult mathematics problems from high-school math competitions). Red star: actual state-of-the-art performance by June 2022, far exceeding even the upper range forecasters gave.
Compute
GPT-2 to GPT-4 involved training compute growth of roughly ~4 OOMs (10,000x) in less than 4 years.
Table: Estimates of compute in FLOP for GPT-2 to GPT-4 by Epoch AI.
GPT comparison
Comparing compute, parameter count, embedding size, and context window across GPT generations. The estimated compute and training-duration figures below assume NVIDIA A100-class throughput for FP16 precision.*
Year | Model | Est. Parameter Count | Embedding Size | Context Window | Vocabulary Size | Est. Compute† (in FLOPs & TFLOPs) | Est. Training Duration (with GPUs) | Est. Transformer Layers | Est. Compute Growth | Est. Parameter Growth
---|---|---|---|---|---|---|---|---|---|---
2018 | GPT-1 | 117M | 768 | ~512 tokens | | | | | |
 | GPT-2 Small | 117M | 768 | 1024 tokens | 50257† | | | | |
 | GPT-2 Medium | 345M | 1024 | 1024 tokens | 50257† | | | | |
 | GPT-2 Large | 762M | 1280 | 1024 tokens | 50257† | | | | |
2019 | GPT-2 XL | 1.5B | 1600 | 1024 tokens | 50257† | | | | |
 | GPT-3 (Ada) | ~350M | 768 | 2048 tokens | 50257† | | | | |
 | GPT-3 (Babbage) | ~1.3B | 1024 | 2048 tokens | 50257† | | | | |
 | GPT-3 (Curie) | ~6.7B | 1600 | 2048 tokens | 50257† | | | | |
2020 | GPT-3 (Davinci) | 175B | 12288 | 2048 tokens | 100256† | | | 96 | + ~2 OOMs | + ~2 OOMs
2023 | GPT-4 (varies) | ~1-2 trillion | up to 12288 | 8192-32768 tokens | 100256† | | | | + ~1.5-2 OOMs | + ~1 OOM
 | GPT-4o-Mini | | | | 200019† | | | | |
- *without factoring in memory bandwidth, latency, power efficiency, etc.
- †Sources:
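The "Est. Compute" and "Est. Training Duration" columns can be roughly reproduced with the widely used 6 × parameters × tokens rule of thumb. Here is a minimal sketch (my own, not from the original table); the ~312 TFLOPS A100 FP16 Tensor Core figure and the ~40% utilization are assumptions:

```python
# Rule-of-thumb estimates for the table's compute and duration columns.
# Assumptions (not from the original table): training FLOPs ~ 6*N*D,
# A100 dense FP16 Tensor Core peak ~312 TFLOPS, ~40% hardware utilization.

A100_FP16_PEAK_FLOPS = 312e12   # ~312 TFLOPS
ASSUMED_UTILIZATION = 0.40      # typical-ish utilization; a guess, not a measurement

def training_flops(params: float, tokens: float) -> float:
    """Rule of thumb: ~6 FLOPs per parameter per training token."""
    return 6 * params * tokens

def training_days(flops: float, n_gpus: int) -> float:
    """Wall-clock days to burn `flops` on `n_gpus` A100s at the assumed utilization."""
    effective_throughput = n_gpus * A100_FP16_PEAK_FLOPS * ASSUMED_UTILIZATION
    return flops / effective_throughput / 86_400

# GPT-3 (Davinci): 175B parameters, ~300B training tokens (from the tables here).
flops = training_flops(175e9, 300e9)
print(f"{flops:.2e} FLOP")                                       # ~3.15e+23
print(f"{training_days(flops, 1000):.0f} days on 1,000 A100s")   # ~29 days
```

Plugging in GPT-3 Davinci's 175B parameters and the ~300B training tokens from the dataset table below lands close to the commonly cited ~3.14e23 FLOP.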
Visual comparison
```mermaid
graph LR
    A[GPT Models] --> GPT1[GPT Original]
    GPT1 --> G1D[768 dimensions, 512 tokens]
    A --> GPT2[GPT-2]
    GPT2 --> G2S[Small: 768 dimensions, 1024 tokens]
    GPT2 --> G2M[Medium: 1024 dimensions, 1024 tokens]
    GPT2 --> G2L[Large: 1280 dimensions, 1024 tokens]
    GPT2 --> G2XL[XL: 1600 dimensions, 1024 tokens]
    A --> GPT3[GPT-3]
    GPT3 --> G3A[Ada: 768 dimensions, 2048 tokens]
    GPT3 --> G3B[Babbage: 1024 dimensions, 2048 tokens]
    GPT3 --> G3C[Curie: 1600 dimensions, 2048 tokens]
    GPT3 --> G3D[Davinci: 12288 dimensions, 2048 tokens]
    A --> GPT4[GPT-4]
    GPT4 --> G4V[Up to 12288 dimensions, 8192-32768 tokens]
```
Pre-training dataset and cost
Year | Model | Dataset (size) | Cost (in terms of cloud compute)
---|---|---|---
2018 | GPT-1 | |
2019 | GPT-2 | |
2020 | GPT-3 | 300 billion tokens† (60% of the original 499-billion-token dataset) | ~$4.6 million†
2023 | GPT-4 | |

†Sources:
Source: Epoch AI
In a nutshell, the GPT-2 to GPT-4 jump included 3.5-4 OOMs of compute gains over a 4-year period (i.e., ~1 OOM/year of compute scale-up).
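As a sanity check on that ~3.5-4 OOMs figure, here is the arithmetic with rough public training-compute estimates (my own numbers, in the style of Epoch AI's database; the GPT-4 figure in particular is an external estimate, never officially disclosed):

```python
import math

# Approximate training-compute estimates (treat as rough, not authoritative):
estimates = {"GPT-2": 1.5e21, "GPT-3": 3.1e23, "GPT-4": 2e25}  # FLOP

ooms = math.log10(estimates["GPT-4"] / estimates["GPT-2"])
print(f"GPT-2 -> GPT-4: ~{ooms:.1f} OOMs of training compute")  # ~4.1 OOMs
print(f"~{ooms / 4:.1f} OOMs/year over the ~4-year gap")        # ~1.0 OOMs/year
```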
Algorithmic efficiencies
There have been many tweaks and gains in architecture, data, the training stack, etc., collectively called algorithmic progress, which is probably a similarly important driver of progress as compute. However, unlike compute, algorithmic progress doesn't get nearly as much attention and is dramatically underrated.
Inference efficiency improved by nearly 3 OOMs (1000x) in less than 2 years.
Figure: Rough estimate on relative inference cost of attaining ~50% MATH performance.
Date | Model | Cost per 1M Input Tokens | Cost per 1M Output Tokens |
---|---|---|---|
N.A. (source) | GPT-3 | $30.00 | $60.00 |
Dec 2024 (source) | GPT-4o* | $2.50 (12x reduction) | $10.00 (6x reduction) |

*Cost drops by a further half with the Batch API.
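Expressed as OOMs (my own arithmetic on the prices quoted above; note this ignores that GPT-4o is also far more capable than GPT-3, so it understates the real efficiency gain):

```python
import math

# Price drop from the table above, per 1M tokens, expressed in OOMs.
old_in, old_out = 30.0, 60.0   # earlier pricing
new_in, new_out = 2.5, 10.0    # GPT-4o (Dec 2024) pricing

for label, old, new in [("input", old_in, new_in), ("output", old_out, new_out)]:
    factor = old / new
    print(f"{label}: {factor:.0f}x cheaper (~{math.log10(factor):.1f} OOMs)")
# input: 12x cheaper (~1.1 OOMs); output: 6x cheaper (~0.8 OOMs)
```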
Figure: Decomposing progress: compute and algorithmic efficiencies. (Rough illustration)
In a nutshell, the GPT-2 to GPT-4 jump included 1-2 OOMs of algorithmic efficiency gains over a 4-year period (i.e., ~0.5 OOMs/year of algorithmic efficiency).
Data wall
We’re running out of internet data.
Frontier models are already trained on much of the internet. For example, Llama 3 was trained on over 15T tokens. Common Crawl, a dump of much of the internet used for LLM training, is >100T tokens raw, and relatively simple deduplication brings that down to ~30T tokens. For more specific domains like code, there are far fewer tokens; public GitHub repos, for example, are estimated at low trillions of tokens.
Repetition only goes so far: after about 16 epochs of repeating the pre-training data, returns diminish extremely quickly.
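A bit of illustrative arithmetic on why the wall bites (the numbers are the rough ones quoted above, not precise measurements):

```python
# Rough data-wall arithmetic: ~30T unique tokens after deduplicating Common Crawl,
# a Llama-3-scale run already uses ~15T tokens, and repetition beyond ~16 epochs
# stops helping much.
UNIQUE_TOKENS = 30e12
MAX_USEFUL_EPOCHS = 16

def epochs_needed(target_training_tokens: float) -> float:
    """Epochs over the unique-token pool needed to reach a training-token budget."""
    return target_training_tokens / UNIQUE_TOKENS

for scaleup in (1, 10, 100):           # 1x, 10x, 100x a 15T-token run
    epochs = epochs_needed(15e12 * scaleup)
    status = "hits the wall" if epochs > MAX_USEFUL_EPOCHS else "ok"
    print(f"{scaleup:>3}x data: {epochs:.1f} epochs over unique tokens ({status})")
# 1x: 0.5 epochs (ok), 10x: 5.0 epochs (ok), 100x: 50.0 epochs (hits the wall)
```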
That said, all of the labs are rumoured to be making massive bets on new algorithmic improvements and approaches to get around this.
What a modern LLM does during training is, essentially, very quickly skim the textbook, the words just flying by, without spending much brain power on it. Just reading the same textbook over and over again might result in memorization, not understanding. This is in contrast to how we read a (say, math) textbook: we read a couple of pages slowly; digest them, ponder over them; discuss them with friends; try a few practice problems; fail, try a different way, get some feedback, try again until we get it right; and so on, until it “clicks”.

So there’s a “missing middle” between pre-training and in-context learning. When a human learns from a textbook, they distill their short-term learnings into long-term memory and long-term skills through practice; however, we don’t yet have an equivalent way to distill a model’s in-context learning “back into the weights.” Synthetic data, self-play, RL, etc. are trying to fix that: let the model learn by itself, think about what it learned, and practice it, distilling that learning back into the weights. A canonical example is AlphaGo, which was first trained by imitation learning on expert human Go games, and then improved by playing millions of games against itself.
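Here is a deliberately toy sketch of the loop described above: generate attempts, verify them, and distill the verified ones back into the weights. Every function here is a stand-in of my own (there is no real model or training stack); only the shape of the loop matters.

```python
import random

# Schematic generate -> verify -> distill loop (synthetic data / self-play style).
# All three functions are toy stand-ins, not real components.

def model_attempt(problem: str) -> str:
    """Stand-in for sampling a candidate solution from the current model."""
    return f"candidate solution to {problem!r} #{random.randint(0, 9)}"

def verifier(problem: str, attempt: str) -> bool:
    """Stand-in for a checker (unit tests, a proof checker, a game outcome, ...)."""
    return attempt.endswith(("3", "7"))   # pretend ~20% of attempts check out

def finetune_on(examples: list[tuple[str, str]]) -> None:
    """Stand-in for a gradient update that bakes the examples into the weights."""
    print(f"distilling {len(examples)} verified examples back into the weights")

problems = [f"problem-{i}" for i in range(5)]
for _round in range(3):                   # AlphaGo-style iteration
    verified = []
    for p in problems:
        for _ in range(8):                # multiple attempts per problem
            attempt = model_attempt(p)
            if verifier(p, attempt):
                verified.append((p, attempt))
    finetune_on(verified)                 # "practice" becomes long-term skill
```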
In a nutshell, though we are hitting the data wall, we will likely uncover enough (near- and long-term) tricks to keep up the ~0.5 OOMs/year of algorithmic progress in this space.
Unhobbling
Despite excellent raw capabilities, LLMs are much worse than they could be because they are hobbled; it often takes a relatively simple algorithmic tweak (e.g., chain-of-thought prompting) to unlock much greater capabilities.
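As a tiny illustration of unhobbling via chain-of-thought, here is a sketch using the official `openai` Python client (v1+): same model, same question, but one prompt asks for step-by-step reasoning first. The model name, prompts, and question are my own illustrative choices, not from the book.

```python
from openai import OpenAI  # official openai package, v1+; needs OPENAI_API_KEY set

client = OpenAI()
question = "A train travels 120 km in 1.5 hours, then 80 km in 0.5 hours. Average speed?"

def ask(system_prompt: str) -> str:
    """Send the same question under a different system prompt and return the reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

direct = ask("Answer with just the final number.")            # "hobbled" direct answer
cot = ask("Reason step by step, then give the final answer.")  # chain-of-thought
print(direct)
print(cot)
```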
Key terminologies
Term | Explanation | Example |
---|---|---|
OOM | Order of Magnitude (a factor of 10). | 3x is ~0.5 OOM; 10x is 1 OOM (i.e., 1 order of magnitude). |
FLOP | Floating Point Operation | |
FLOPS | Floating Point Operations per Second | |