Presentation

Introduction

Why fine-tune LLMs?

  • Add new knowledge
    • medical / legal / industrial / company secrets
    • domain-specific knowledge
  • Improve performance
    • storytelling
    • document classification
    • information extraction
    • summarisation
  • Give AI assistants a personality
  • Improve the usability of local models
  • Overcome guardrails
    • political bias
    • potentially dangerous research

Why not just prompt?

  • Sometimes hard/impossible to write an instruction/prompt
  • LLMs have limited context size & performance drops when context gets larger
  • Prompted behaviour might not meet performance requirements

How about RAG?

  • RAG is easier if we want to answer questions about a knowledge base that changes frequently
  • Quality of RAG is very dependent on retrieval process

Most interesting combination: RAG + Finetuning

LLM Training

Pre-training & Post-training

How are LLMs trained?

It all starts with Transformers

Figure: Transformer, full architecture (by dvgodoy, https://github.com/dvgodoy/dl-visuals/?tab=readme-ov-file, CC BY 4.0)

1) Pre-training

Phase

  • Base model
  • Variant: Completions

Dataset

  • Standard data (Common Crawl, GitHub, Wikipedia, Books, ArXiv, Stack Exchange, etc.)

Purpose & Task

  • Next-token prediction
  • Gain language understanding
  • Gain background knowledge

Resources

2.1) Post-training

Phase

  • Supervised Fine Tuning (SFT)
  • Variants: 1) Instruction, 2) Chat

Dataset

  • Structured data that demonstrates intended behaviour

Purpose & Task

  • Learn to follow instructions
  • Learn to answer questions about a domain
  • Learn to have conversations
  • Demonstrate intended behaviour

Resources

2.2) Post-training

Phase

  • Model Alignment (RL): DPO, PPO/RLHF, RLOO

Dataset

  • Dataset of rejected and preferred assistant responses

Purpose & Task

  • Make the model better follow human preferences
  • Make the model safer

Resources

2.3) Post-training

Phase

  • Reasoning (RL): GRPO

Dataset

  • Dataset of prompts and expected answers

Purpose & Tasks

  • Create ‘reasoning models’ for inference time reasoning
  • For quantitative domains (science, math, coding, etc.)

Resources


LoRA vs. QLoRA

How do we actually train?

Compute: Llama 3 405B was trained on up to 16K H100 GPUs

Each H100 GPU costs ~30K USD

We can fine-tune LLMs efficiently using

  • LoRA
  • QLoRA

Low-rank Matrix Decomposition, LoRA & QLoRA

  • Instead of fine-tuning the weights of the actual model, we fine-tune adapters, which consist of low-rank matrices
    • At inference time, these adapters are put on top of the actual weights of the base model and the two are summed together. This way, we don't have to optimise the base model itself.
  • In the example below, instead of fine-tuning a 5×5 matrix (25 weights), we only have to fine-tune a 5×1 and a 1×5 matrix (10 weights)
  • QLoRA is the variant where the base model is also quantised (e.g., weights reduced from 16-bit to 4-bit)

For more details, refer to the QLoRA: Efficient Finetuning of Quantized LLMs paper.
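To make the parameter counting concrete, here is a tiny sketch of the 5×5 example above (illustrative numbers only; real LoRA targets the attention/MLP weight matrices and uses larger ranks):

import numpy as np

d, r = 5, 1                      # weight dimension and adapter rank from the example above

W = np.random.randn(d, d)        # frozen base weight: 5 x 5 = 25 parameters
B = np.zeros((d, r))             # trainable LoRA matrix B: 5 x 1 = 5 parameters
A = np.random.randn(r, d)        # trainable LoRA matrix A: 1 x 5 = 5 parameters

print("full fine-tuning would touch:", W.size, "weights")           # 25
print("LoRA fine-tuning touches:", B.size + A.size, "weights")       # 10

# At inference time the adapter is summed onto the frozen base weights
W_effective = W + B @ A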

Which open-source models? Which variants?

Which models to fine-tune?

  • Google (Gemma)
  • Meta (Llama)
  • Mistral AI
  • DeepSeek
  • Qwen

How large a model (how many parameters)?

Depends on the available RAM/VRAM (see the rough estimate below)
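As a rough rule of thumb (my own back-of-the-envelope numbers, not from the talk), the weights alone take about params × bytes-per-parameter of memory:

# Rough memory needed just to hold the weights
# (ignores activations, KV cache, LoRA adapters and optimiser state)
def approx_weight_memory_gb(params_in_billions: float, bits_per_param: float) -> float:
    return params_in_billions * bits_per_param / 8  # 1e9 params x bytes per param = GB

print(approx_weight_memory_gb(8, 16))  # 8B model in 16-bit: ~16 GB
print(approx_weight_memory_gb(8, 4))   # 8B model 4-bit quantized (QLoRA-style): ~4 GB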

Which model variants to choose?

Base model

  • model after the pre-training phase
  • no instruction-following or chat capability yet (no supervised fine-tuning)
  • requires fine-tuning for downstream usage

instruct / chat variant

  • supervised finetuning already done
  • capable of chat / instruction following

multi-modal / vision variant

  • can take both image and text input

gguf variant

  • for inference use only (not trainable)

When to use which variant?

base model:

  • you don’t need chat / instruction-following capabilities (e.g., ASCII art generation)
  • you have lots of data (e.g., > 2000 samples)
  • you want to train with your own chat template

instruct / chat variant:

  • you want to leverage the supervised finetuning already done
  • you don’t have a lot of data

Note: you need to format your training set according to the chat template that was used during the original SFT (see the sketch below)
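A minimal sketch of that formatting with the Hugging Face tokenizer's apply_chat_template (the model name and messages here are just placeholders):

from transformers import AutoTokenizer

# Instruct/chat variants ship with the chat template that was used during their SFT
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I'm Paul Graham, essayist and investor."},
]

# Render one training sample exactly as the original SFT data was formatted
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)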

multi-modal variant:

  • 🤓

gguf variant

  • never for fine-tuning; only for inference

Model naming conventions explained

| Model Name | Parameters | Variant | Modality |
| --- | --- | --- | --- |
| google/gemma-3-4b-it | 4b | instruct | multi-modal (supports vision input) |
| google/gemma-3-1b-pt | 1b | pretrained | text-only |
| unsloth/gemma-3-12b-it-GGUF | 12b | instruct, GGUF | multi-modal (supports vision input) |
| meta-llama/Llama-3.3-70B-Instruct | 70b | instruct | text-only |
| unsloth/Llama-3.1-8B-unsloth-bnb-4bit | 8b | pretrained, 4-bit quantized (Bits and Bytes) | text-only |

Collecting and Structuring Data

How to collect and structure data?

==Depends on the task==

Pre-training: Continued pre-training

Post-training: Chat / conversation models using SFT

Post-training: Model alignment using DPO / PPO / RLHF

Post-training: Reasoning models using GRPO
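Roughly, one record per stage looks like this (illustrative examples, not the actual project data):

# Continued pre-training: raw domain text
pretraining_sample = {"text": "Section 12.3 of the maintenance manual states that ..."}

# SFT (chat / conversation): role-tagged messages
sft_sample = {
    "conversations": [
        {"role": "user", "content": "How do I reset the device?"},
        {"role": "assistant", "content": "Hold the power button for 10 seconds, then ..."},
    ]
}

# Model alignment (DPO / PPO / RLHF): a prompt with a preferred and a rejected response
dpo_sample = {
    "prompt": "Explain LoRA in one sentence.",
    "chosen": "LoRA trains small low-rank adapter matrices instead of the full weights.",
    "rejected": "LoRA is a long-range radio protocol.",
}

# Reasoning (GRPO): a prompt plus a verifiable expected answer
grpo_sample = {"prompt": "What is 17 * 23?", "answer": "391"}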


How to fine-tune?

  • Libraries: unsloth and PEFT (see the sketch below)
  • Compute:
    • Paid platforms: GCP, AWS, etc.
    • Free-tier platforms: Google Colab, Modal, AWS SageMaker, etc.
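A minimal Unsloth + LoRA training sketch (the hyperparameters and toy dataset are illustrative, and exact argument names can differ between unsloth/trl versions):

from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load the base model, optionally 4-bit quantized (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-3B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these low-rank matrices are trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Tiny toy dataset with a single "text" column
dataset = Dataset.from_list([{"text": "An ASCII cat:\n /\\_/\\\n( o.o )\n > ^ <"}])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=5,
        output_dir="outputs",
    ),
)
trainer.train()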

How to save model?

  • Use the GGUF format, the most popular one; it contains both the tokenizer and the model weights. In other words, it contains everything our system needs to run the model.
  • It also allows the model to run on CPU (e.g., MacBooks)
  • Model merging or LoRA on top: we have two options when saving the model in GGUF
    1. Convert the base model and the adapters to GGUF separately, and load the adapter on top of the base model at inference time. This allows us to swap different adapters at inference time.
    2. Merge the base model with the adapters by adding the weights together, and turn the merged model into GGUF. This makes it easier to share our specific model with others.
  • Quantize the GGUF so that it fits in our VRAM
    • e.g., tools/quantize/quantize.cpp
      • Recommended to use { "Q4_K_M", LLAMA_FTYPE_MOSTLY_Q4_K_M, "4.58G, +0.1754 ppl @ Llama-3-8B" }, where
        • Q4 implies 4-bit quantization
  • Quantization and conversion to GGUF are done with
    • llama.cpp
      • Unsloth has built-in compatibility for this (see the sketch after this list)
      • gguf-my-repo: Hugging Face Space to convert an HF repo into GGUF
  • Inference of GGUF
    • llama.cpp: the library that most other inference tools are built on
    • ollama: much more user-friendly
    • open-webui
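With Unsloth, the adapter-only and merged-GGUF options look roughly like this (method names follow Unsloth's documented helpers; treat it as a sketch, and the repo name is a placeholder):

# Continuing from the fine-tuning sketch above (model and tokenizer come from Unsloth)

# Option 1: save only the LoRA adapter (PEFT format) - small and swappable at inference time
model.save_pretrained("lora_adapter")
tokenizer.save_pretrained("lora_adapter")

# Option 2: merge the adapter into the base weights and export a quantized GGUF
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="q4_k_m")

# Or push the merged, quantized GGUF straight to the Hugging Face Hub
model.push_to_hub_gguf("<username>/<model-name>-q4_k_m-GGUF", tokenizer,
                       quantization_method="q4_k_m")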

Hands-on

1. Ascii Art - Completion fine-tuning

Notebook explained

The notebook (GitHub) walks through the steps of using the Unsloth library for parameter-efficient finetuning (specifically using LoRA) of a large language model (LLM) on a custom dataset. The goal is to train the model to generate ASCII art.

Key contents of the notebook

  1. Installation of Libraries
  2. Loading the Base Model
    • meta-llama/Llama-3.2-3B
  3. Adding LoRA Adapter and Patching with Unsloth
  4. Dataset Preparation & Visualization
  5. Training the Model

    Interesting stats

    • Num examples = 201 | Num Epochs = 5 | Total steps = 130
    • How is Total steps computed? Steps per epoch = ceil(201 / (2 x 4 x 1)) = 26 (the final partial batch of each epoch still counts as a step), so 26 x 5 epochs = 130 total steps (see the calculation sketch after this list)
    • Batch size per device = 2 | Gradient accumulation steps = 4
    • Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
    • Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)
    • Only <1% of the model params are trained
  6. Inference
  7. Saving the Model
  8. Converting the base model to GGUF format
    1. Using standalone Python code locally and Modal in the cloud: gguf-conversion/gguf_base_model.py
  9. Saving the LoRA adapter (in PeFT LoRA & GGUF)
    1. From Google Colab directly
    2. Using HF Spaces GUI (ggml-org/gguf-my-lora)
  10. Merging the model with LoRA weights and saving to GGUF
    1. From Google Colab directly
  11. Loading the saved model (for continued finetuning or inference)
    1. From Google Colab directly
    2. Using Jupyter notebook code locally
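The same 130 falls out of a quick calculation (assuming the trainer counts the final partial batch of each epoch as a full step, which matches the logs above):

import math

num_examples = 201
num_epochs = 5
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus  # 8
steps_per_epoch = math.ceil(num_examples / effective_batch_size)  # ceil(25.125) = 26
total_steps = steps_per_epoch * num_epochs                        # 26 * 5 = 130
print(total_steps)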

Due to a llama-cpp-python installation issue, I could not run this inference completion notebook locally on my Mac

Sample output generated during inference:

Running the models locally on a Mac

To run the merged GGUF model locally on a Mac,

  • Install llama.cpp using brew install llama.cpp

llama-cli

  • Option 1: Run the model using llama-cli command
    • Run llama-cli --hf-repo prasanthntu/Llama-3.2-3B-ascii-cats-lora-q4_k_m-GGUF --hf-file unsloth.Q4_K_M.gguf -p ""
    • Example
...
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
 
 
       /)
       ((
        ))
   ,   //,
  /,\="=/,\
 //,   Y   ,
\__ ,_T_ ,__
  (   '   )
  `-----'
 [end of text]
 
 
llama_perf_sampler_print:    sampling time =       5.18 ms /    46 runs   (    0.11 ms per token,  8882.02 tokens per second)
llama_perf_context_print:        load time =     604.14 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     808.86 ms /    45 runs   (   17.97 ms per token,    55.63 tokens per second)
llama_perf_context_print:       total time =     821.98 ms /    46 tokens
llama_perf_context_print:    graphs reused =          0
ggml_metal_free: deallocating
...

llama-server

  • Option 2: Run the model using the llama-server command so that we get an HTTP API interface
    • Run llama-server --hf-repo prasanthntu/Llama-3.2-3B-ascii-cats-lora-q4_k_m-GGUF --hf-file unsloth.Q4_K_M.gguf
    • API example (a Python client sketch follows the response below)
curl --location 'http://127.0.0.1:8080/v1/completions' --header 'Content-Type: application/json' --data '{"prompt":"","max_tokens":100}'
{
    "choices": [
        {
            "text": "\n /\\     /\\\n( o o   o )\n==   w   ==\n \\   _  /\n  / x \\__\n /     //\n|     |\n \\_/(\n",
            "index": 0,
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "created": 1753939267,
    "model": "gpt-3.5-turbo",
    "system_fingerprint": "b6030-1e15bfd4",
    "object": "text_completion",
    "usage": {
        "completion_tokens": 36,
        "prompt_tokens": 1,
        "total_tokens": 37
    },
    "id": "chatcmpl-0Df2Dbkng2a4gAi8W5bdhB0GEzgz5eyY",
    "timings": {
        "prompt_n": 1,
        "prompt_ms": 41.834,
        "prompt_per_token_ms": 41.834,
        "prompt_per_second": 23.904001529856096,
        "predicted_n": 36,
        "predicted_ms": 716.109,
        "predicted_per_token_ms": 19.891916666666667,
        "predicted_per_second": 50.27167651851883
    }
}
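The same endpoint can also be called from Python with any OpenAI-compatible client, for example (assumes the openai package is installed and llama-server is running on the default port):

from openai import OpenAI

# llama-server exposes an OpenAI-compatible API; the api_key is not checked locally
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

response = client.completions.create(
    model="any-model",  # ignored by llama-server, which serves the loaded GGUF
    prompt="",
    max_tokens=100,
)
print(response.choices[0].text)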

HF models explained


2. Paul Graham - Conversation model fine-tuning

Notebook explained

The notebook (GitHub) walks through the steps of using the Unsloth library for parameter-efficient finetuning (specifically using LoRA) of a large language model (LLM) on a custom dataset. The goal is to fine-tune the instruction/chat model to behave like Paul Graham.

Key contents of the notebook

  1. Installation of Libraries
  2. Loading the Base Model
    • meta-llama/Llama-3.1-8B-Instruct-bnb-4bit (or)
    • meta-llama/Llama-3.1-8B-Instruct, with 4-bit quantization enabled
  3. Adding LoRA Adapter to the Base Model and Patching with Unsloth
  4. Dataset Preparation & Visualization
  5. Training the Model

    Interesting stats

    • Num examples = 484 | Num Epochs = 5 | Total steps = 305
    • How is Total steps computed? Steps per epoch = ceil(484 / (2 x 4 x 1)) = 61, so 61 x 5 epochs = 305 total steps (same calculation as in the ASCII art notebook)
    • Batch size per device = 2 | Gradient accumulation steps = 4
    • Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
    • Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)
    • Only ~0.5% of the model params are trained
    • Took ~45 mins on a Google Colab T4 GPU instance
  6. Inference
  7. Saving the Model
  8. Saving the LoRA adapter (in PeFT LoRA)
    1. From Google Colab directly
  9. Merging the model with LoRA weights and saving to GGUF
    1. From Google Colab directly
  10. Loading the saved model (for continued finetuning or inference)
    1. From Google Colab directly

Due to a llama-cpp-python installation issue, I could not run this inference notebook locally on my Mac

Sample output generated during inference:

Running the models locally on a Mac

To run the merged GGUF model locally on a Mac,

ollama in terminal

  • Get the latest llama3.1 template from ollama
  • git clone https://huggingface.co/prasanthntu/Meta-Llama-3.1-8B-q4_k_m-paul-graham-guide-GGUF
  • Update the ModelfileLlama31 (code) with the downloaded model path
  • Run: ollama create pg-guide-prasanthntu -f ModelfileLlama31
$ ollama list 
NAME                           ID              SIZE      MODIFIED      
pg-guide-prasanthntu:latest    196e0303707e    4.9 GB    7 seconds ago    
pg-guide-model:latest          76bea6ae06ac    4.9 GB    6 days ago  
  • Run the model in terminal
(.venv) ➜  pookie-llm-finetuning-resources git:(main) ✗ ollama run pg-guide-model:latest 
>>> hi, who are you?
Nice to meet you! My name is Paul Graham, and I'm a venture capitalist, essayist, and entrepreneur. I co-founded Y Combinator, one of the most successful startup 
accelerators in the world, back in 2005. When I'm not busy investing in and mentoring startups, I love to write about technology, entrepreneurship, and various other 
topics that interest me.

As you might have guessed, I'm a bit of a nerd at heart, and I've always been passionate about understanding how the world works and how we can make it a better 
place. In my free time, I enjoy reading, writing, and exploring new ideas – often in the beautiful countryside of England, where I reside.

What about you? What brings you here today?

>>> Send a message (/? for help)

ollama with built-in UI

ollama with open-webui

llama-server

  • Run the model using the llama-server command
    • Command: llama-server --hf-repo prasanthntu/Meta-Llama-3.1-8B-q4_k_m-paul-graham-guide-GGUF --hf-file unsloth.Q4_K_M.gguf
    • API Example:
curl --location 'http://127.0.0.1:8080/v1/chat/completions' --header 'Content-Type: application/json' --data '{
    "model": "any-model", 
    "messages": [
      {
        "role": "user",
        "content": "Hello, who are you?"
      }
    ],
    "max_tokens": 100
  }'
{
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Nice to meet you! I'm Paul Graham, nice and simple. Born and raised in Weymouth, Dorset, England. I'm an entrepreneur, venture capitalist, and essayist, which means I spend most of my time thinking about how to make startups successful and writing about my ideas. When I'm not doing that, I enjoy spending time in my adopted home of England, where I've lived for most of my life. I'm also a bit of a tech enthusiast, having co"
            }
        }
    ],
    "created": 1754012271,
    "model": "any-model",
    "system_fingerprint": "b6030-1e15bfd4",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 100,
        "prompt_tokens": 16,
        "total_tokens": 116
    },
    "id": "chatcmpl-wjtEEmV0vMIrVJMyvOL3C6yrJkGIrDI7",
    "timings": {
        "prompt_n": 1,
        "prompt_ms": 145.798,
        "prompt_per_token_ms": 145.798,
        "prompt_per_second": 6.858804647526029,
        "predicted_n": 100,
        "predicted_ms": 3591.79,
        "predicted_per_token_ms": 35.9179,
        "predicted_per_second": 27.841271343814643
    }
}
curl --location 'http://127.0.0.1:8080/v1/chat/completions' --header 'Content-Type: application/json' --data '{
    "model": "any-model", 
    "messages": [
        {
            "role": "user",
            "content": "Hello, who are you?"
        },
        {
            "role": "assistant",
            "content": "Nice to meet you! I'\''m Paul Graham, nice and simple. Born and raised in Weymouth, Dorset, England. I'\''m an entrepreneur, venture capitalist, and essayist, which means I spend most of my time thinking about how to make startups successful and writing about my ideas. When I'\''m not doing that, I enjoy spending time in my adopted home of England, where I'\''ve lived for most of my life. I'\''m also a bit of a tech enthusiast, having co"
        },
        {
            "role": "user",
            "content": "Can you provide me some guidance/advise on where to start? I am data scientist, and I am planning to do a healthtech startup. Where should I start? Problem, Market, Sub-domain, Pain-points, Talk to poeple ? Bit lost and overwhlemen on where and now to start efficiently to set myself for success."
        }
    ],
    "max_tokens": 500
}'
{
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "You're feeling a bit lost, huh? Well, let me tell you, my friend, that's completely normal. Starting a startup is like navigating a maze blindfolded while being attacked by a swarm of bees. But don't worry, I'm here to help you find your way out of this mess.\n\nFirst of all, congratulations on deciding to start a healthtech startup! That's a fantastic idea. Now, let's break down the process into manageable chunks, shall we?\n\n**Start by identifying a problem you're passionate about solving**. As a data scientist, you're likely familiar with the importance of solving real-world problems. So, think about the healthtech space and what specific problem you'd love to tackle. What bugs you? What do you wish someone would do about it?\n\n**Next, research the market**. Who's already working on this problem? What are they doing? What's their approach? Are there any gaps in the market that you can fill? Be honest with yourself, and don't be afraid to learn from others.\n\n**Now, drill down to a sub-domain**. Instead of looking at the entire healthtech space, zoom in on a smaller area that you're interested in. This could be, for example, personalized medicine, telemedicine, or medical imaging analysis. Focus on a specific aspect that you're passionate about.\n\n**Pain-points, you say? Ah, yes!** Talk to people in the industry, potential customers, and domain experts. Ask them about their pain-points, what they struggle with, and what they wish someone would solve for them. This will help you validate your idea and ensure you're building something people actually need.\n\n**And don't forget to talk to users!** As a data scientist, you're likely comfortable with data, but don't underestimate the importance of human interaction. Talk to people, listen to their problems, and use that feedback to shape your product.\n\nHere's a rough outline to get you started:\n\n1. Identify a problem you're passionate about solving\n2. Research the market and existing solutions\n3. Drill down to a specific sub-domain\n4. Talk to people in the industry and potential customers\n5. Validate your idea and iterate based on feedback\n\nRemember, my friend, starting a startup is a journey, not a destination. It's okay to take your time, experiment, and learn as you go. Don't be afraid to pivot or change direction when needed."
            }
        }
    ],
    "created": 1754012687,
    "model": "any-model",
    "system_fingerprint": "b6030-1e15bfd4",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 500,
        "prompt_tokens": 196,
        "total_tokens": 696
    },
    "id": "chatcmpl-r1KuRQGInLhRNiTAfbqzZHTOfhDJNVaj",
    "timings": {
        "prompt_n": 1,
        "prompt_ms": 138.387,
        "prompt_per_token_ms": 138.387,
        "prompt_per_second": 7.226112279332596,
        "predicted_n": 500,
        "predicted_ms": 18842.146,
        "predicted_per_token_ms": 37.684292,
        "predicted_per_second": 26.536255477481173
    }
}

HF models explained


Appendix

Environment

Exporting the Google Colab Environment

import subprocess
 
# Generate requirements.txt
try:
    with open('requirements.txt', 'w') as f:
        subprocess.run(['pip', 'freeze'], stdout=f, check=True)
    print("✅ requirements.txt generated successfully.")
except Exception as e:
    print(f"❌ Error generating requirements.txt: {e}")

GRPO vs. SFT

TIP

  • SFT = Learn from examples.
  • GRPO = Learn from rewards based on desired outcomes/behaviors.

For a reasoning model, where the process is as important as the final answer and where you want the model to potentially generalize reasoning steps, GRPO offers a more direct way to optimize for those specific reasoning behaviors through a reward signal.
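That reward signal is just a function you write and hand to the trainer; a minimal sketch of a correctness reward (loosely in the style of TRL's GRPOTrainer reward functions; the names and signature here are illustrative):

def correctness_reward(completions, answers, **kwargs):
    """Return 1.0 for each sampled completion whose final token matches the expected answer, else 0.0.

    `completions` are the model's sampled outputs for a batch of prompts;
    `answers` are the expected answers from the GRPO dataset described above.
    """
    rewards = []
    for completion, expected in zip(completions, answers):
        tokens = completion.strip().split()
        predicted = tokens[-1] if tokens else ""
        rewards.append(1.0 if predicted == expected.strip() else 0.0)
    return rewards


# Example: arithmetic prompts with answers taken from the dataset
print(correctness_reward(
    completions=["Let me think step by step... 17 * 23 = 391", "It is 400"],
    answers=["391", "391"],
))  # [1.0, 0.0]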


To clarify

  • “Curate datasets with questions and answers” for post-training a reasoning model vs. “Curate conversational data in a pre-defined template” for post-training a chat model: how are these two approaches different?
  • Understand more about how the dataset is created for both projects (ascii-cat and pg-chat)