Presentation

Llama 1 Models

A very useful insight comes from the original Llama 1 paper1. Llama 1 was trained on 1.4T tokens (which is low by 2025 standards).

Figure (llama-model-paper-loss-going-down-with-more-training.png): Llama 1 pre-training - training loss keeps going down with more training data and bigger models.

In the figure below (from a tweet posted on Twitter by Maxime in July 2025), we can see that the open-source trend line has a steeper slope than the closed-source one. In terms of MMLU (5-shot), open-source models have come close to closed-source ones (e.g., Llama 3.1 405B vs. GPT-4).

Additionally, we can find model performance comparisons over time on other platforms as well:

Around September 2024, open-source and closed-source models converged in terms of MMLU accuracy. Then, with the launch of OpenAI's o1-preview, model performance and capability (reasoning, long reasoning traces) shot up. For about four months, the open-source community stalled. Then, in January 2025, DeepSeek R1 arrived and changed the entire world's perspective: open-source models can indeed perform as well as closed-source ones.

Even before the ChatGPT launch in November 2022, LLMs existed, but the pre-trained base models performed poorly. However, ChatGPT showed that we can make LLMs very useful with the following (a minimal SFT data sketch follows the list):

  • good data
  • good instructions
  • good answers
  • good SFT
  • good RL
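
As a concrete illustration of the "good data / good instructions / good answers" items above, here is a minimal, hypothetical sketch of what a single SFT (instruction fine-tuning) example and its prompt formatting might look like. The field names and the template are my own assumptions, not any specific dataset's schema.

```python
# Hypothetical SFT example; field names and template are illustrative assumptions.
sft_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are pre-trained on trillions of tokens of text ...",
    "output": "LLMs learn broad language skills from massive pre-training corpora.",
}

def format_example(example: dict) -> tuple[str, str]:
    """Render one SFT example into (prompt, full training text)."""
    prompt = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        "### Response:\n"
    )
    return prompt, prompt + example["output"]

prompt, full_text = format_example(sft_example)
# During SFT, the loss is typically computed only on the response tokens;
# the prompt tokens' labels are masked out (e.g., set to -100 in PyTorch).
```

The point of the sketch is the data shape: an instruction, optional input context, and a high-quality answer.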

Open-source models always try to catch up to closed-source models.

Two huge jumps for open-source models

  1. SFT, RLHF jump
  2. RL jump

What's the next jump?

We don’t know yet!

Yann LeCun's Analogy

  • Pre-training is the “cake”,
  • SFT is the “icing”,
  • Reinforcement Learning is the “cherry on top”.

Training Phases

In the past

  • Pretraining
  • SFT (e.g., Instruction fine-tuning)
  • Post Training

Recently,

  • Pretraining - Predict the next word
  • “Mid” Training - High quality data, long context extension, etc.
  • SFT
  • Preference Finetuning (DPO, RLHF)
  • Reinforcement Finetuning (RLVR)
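
Since DPO is named above as a preference-finetuning method, here is a minimal sketch of the DPO loss, assuming we already have the summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model (the variable names are my own, not from any particular library):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one batch of preference pairs.

    Each argument is a tensor of summed log-probs of a full response
    (shape: [batch]). beta controls how far the policy may drift from the
    reference model.
    """
    # Implicit reward margins: how much more the policy prefers each response
    # than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Maximize the gap between chosen and rejected margins.
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

# Toy usage with random numbers standing in for real log-probs:
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
```

Intuitively, the loss pushes the policy to prefer the chosen response more strongly than the reference model does, with beta controlling how far it may drift.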

How do we get to our final model?

  0. Random init
  1. Pretraining
  2. SFT (IFT) - Relatively little data is needed for SFT
  3. PrefFT
  4. RLVR - e.g., OpenAI's o1 and o3 models (a minimal verifiable-reward sketch follows this list)
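
RLVR (reinforcement learning with verifiable rewards) replaces a learned reward model with a programmatic check. A minimal sketch, assuming a math-style task where the reference answer is known and the model is asked to wrap its final answer in a \boxed{...} tag (the tag convention and function names are my assumptions):

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the model's final answer out of a \\boxed{...} tag, if present."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# Example: the completion contains reasoning followed by a boxed final answer.
print(verifiable_reward("... so the result is \\boxed{42}", "42"))  # 1.0
```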

New Paradigm: Bypass SFT and PrefFT

DeepSeek R1 has demonstrated this (most directly with the DeepSeek-R1-Zero variant, which was trained with pure RL on the base model, skipping SFT entirely).
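
DeepSeek R1's RL stage uses GRPO, which samples a group of completions per prompt, scores each with a (largely verifiable) reward, and turns those scores into group-relative advantages instead of relying on a learned value function. A minimal sketch of that advantage computation (a simplification of the full GRPO objective; names are my own):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each group of completions sampled for the same prompt.

    rewards: [num_prompts, group_size] rewards (e.g., 0/1 correctness).
    Returns advantages of the same shape: (r - mean) / std, computed per prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy usage: 2 prompts, 4 sampled completions each, binary correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
adv = group_relative_advantages(rewards)
```

Completions that beat their group's average reward get a positive advantage and are reinforced; the rest are pushed down.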

Hands-on

Appendix

Footnotes

  1. https://arxiv.org/pdf/2302.13971