- Source: YouTube, [LinkedIn](https://www.linkedin.com/posts/danielhanchen_full-workshop-reinforcement-learning-kernels-activity-7353056071583748096-Srg_?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAcOLFMBUn1o8NEoEvAqJrA0ZzVgH3csPQ0), Google Slides, Unsloth
- Speaker: Daniel Han, Unsloth (LinkedIn)
- Additional Resources: Unsloth - Reinforcement Learning (RL) Guide
Presentation
Llama 1 Models
A very useful insight comes from the original Llama 1 paper; its models were trained on 1.4T tokens (which is low by 2025 standards).

Figure: Llama pre-training data - Training loss going down with more training data and bigger models.
In the figure below, we can see that the improvement slope of open-source models is steeper than that of closed-source models. In terms of MMLU (5-shot), open-source models have come close to closed-source ones (e.g., Llama 3.1 405B vs. GPT-4).
Tweet posted on Twitter/X by Maxime in July 2025.
Additionally, we can find model performance comparison over time in other platforms as well:
- https://artificialanalysis.ai/#frontier-language-model-intelligence-over-time
- https://openlm.ai/chatbot-arena/
- https://lmarena.ai/leaderboard
Around Sep 2024, open-source and closed-source models converged in terms of MMLU accuracy. Then, with the launch of OpenAI’s o1-preview, model performance and capability (reasoning, long reasoning traces) suddenly shot up. For about 4 months, the open-source community stalled. Then, in Jan 2025, DeepSeek R1 arrived and changed the entire world’s perspective: open-source models can indeed perform as well as closed-source ones.
Even before ChatGPT’s launch in Nov 2022, LLMs existed, but the pre-trained base models performed poorly. ChatGPT showed that we can make LLMs very useful with
- good data
- good instructions
- good answers
- good SFT
- good RL
Open-source models are always trying to catch up to closed-source models.
Two huge jumps for open-source models:
- SFT, RLHF jump
- RL jump
What's the next jump?
We don’t know yet!
Yann LeCun's Analogy
- Pre-training is the “cake”,
- SFT is the “icing”,
- Reinforcement Learning is the “cherry on top”.
Training Phases
In the past
- Pretraining
- SFT (e.g., Instruction fine-tuning)
- Post Training
Recently,
- Pretraining - Predict the next word
- “Mid” Training - High-quality data, long-context extension, etc.
- SFT
- Preference Finetuning (DPO, RLHF)
- Reinforcement Finetuning (RLVR)
How do we get to our final model? Starting from random initialization:
- Pretraining
- SFT (IFT) - requires relatively little data compared to pretraining
- PrefFT
- RLVR - e.g., OpenAI’s o1 and o3 models
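For concreteness, the SFT and preference-finetuning stages above map onto off-the-shelf trainers. Below is a minimal sketch using Hugging Face TRL; the base model and datasets are placeholder picks from TRL's documentation, not from the talk, and pretraining/mid-training are assumed to have already produced the base checkpoint.

```python
# Illustrative sketch of the SFT -> preference finetuning (DPO) stages using TRL.
# Model id and datasets are placeholders; pretraining / mid-training are assumed done.
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

base_model = "Qwen/Qwen2.5-0.5B"  # small base-model stand-in

# SFT (instruction finetuning): a comparatively small, high-quality chat dataset.
sft_trainer = SFTTrainer(
    model=base_model,
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
    args=SFTConfig(output_dir="sft-model", max_steps=500),
)
sft_trainer.train()

# Preference finetuning (DPO): prompt + chosen/rejected pairs, starting from the SFT model.
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    processing_class=sft_trainer.processing_class,
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
    args=DPOConfig(output_dir="dpo-model", beta=0.1, max_steps=500),
)
dpo_trainer.train()
```

The RLVR stage (e.g., GRPO) is sketched in the Hands-on section below.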
New Paradigm: Bypass SFT and PrefFT
DeepSeek R1 (specifically, the R1-Zero variant) demonstrated this.
Hands-on
- GitHub code: prasanth-ntu/qwen3_-4b-grpo.ipynb
- Note: Open in Colab for free GPU access
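For reference, a minimal GRPO training loop in the same spirit as the notebook might look like the sketch below. It assumes Unsloth's Qwen3-4B checkpoint, GSM8K as the verifiable-reward dataset, and a crude substring-match reward; the actual notebook may use different data, hyperparameters, and reward functions.

```python
# Minimal GRPO (RLVR) sketch with Unsloth + TRL. Assumptions: the unsloth/Qwen3-4B
# checkpoint, GSM8K prompts, and a crude substring-match reward; the real notebook
# may differ in model, data, and reward design.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Load Qwen3-4B in 4-bit and add LoRA adapters so training fits on a free Colab GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# GSM8K provides prompts plus reference answers that can be verified programmatically.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

# Verifiable reward: 1.0 if the completion contains the final reference answer, else 0.0.
def correctness_reward(completions, answer, **kwargs):
    finals = [a.split("####")[-1].strip() for a in answer]
    return [1.0 if f in c else 0.0 for c, f in zip(completions, finals)]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward],
    train_dataset=dataset,
    args=GRPOConfig(
        output_dir="qwen3-4b-grpo",
        max_steps=100,
        num_generations=4,          # completions sampled per prompt (the GRPO "group")
        max_completion_length=256,
        learning_rate=5e-6,
    ),
)
trainer.train()
```

The key design choice in GRPO is that the advantage is computed by comparing each completion's reward against the other completions sampled for the same prompt (the group), which removes the need for a separate value model.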