How to fine-tune TinyLlama using custom data

Have you ever wondered how you'd go about fine-tuning an LLM? In this article, we're going to do just that.

We'll take a quick look at LLMs from 50,000 feet, then explore what fine-tuning is and when you should consider using it, and then we'll get our hands dirty fine-tuning a very small LLM, including creating our own custom training data. Although the model we're using is very small, the principles and much of the code still apply.

Okay. So what exactly is an LLM (Large Language Model)?

Unless you've been living under a rock for the last couple of years, you will have heard about ChatGPT from OpenAI. ChatGPT is an LLM. It's also known as a "Transformer" model, due to the way it is architected, and more broadly as "Generative" AI, because it literally generates output. Essentially, an LLM predicts the next word.

In the beginning there was nothing, nothing but darkness. For LLMs, there is a network of neurons, many billions of parameters and many layers deep, with a completely meaningless, random distribution of weights.

Before the LLM can predict the next word, it needs to learn about words. It needs to learn about language. It needs to learn about the relationships between words. And to do this, it is trained on huge datasets. Depending on the size of the model, it might be trained on everything ever printed by humans across every language.

It's a big deal. It takes huge amounts of compute, storage and electricity. We're talking about tens of millions of US dollars.

And at the end of this process it is very, very good at predicting the next word.
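
To make "predicting the next word" concrete, here is a tiny, illustrative sketch using the same TinyLlama model we'll fine-tune later. It simply asks the model for its single most likely next token; nothing later in the article depends on it.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Turn the text into token ids and run a single forward pass...
inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# ...then take the highest-scoring token at the final position: the model's "next word".
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))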

How does it work?

Well, that's a good question, and largely beyond the scope of this article. However, very, very broadly: as it's trained, the model is given input and generates output - the prediction. The error is calculated (remember, we're working with numbers, not words) and propagated back through the billions of parameters, nudging each weight fractionally in the right direction, towards the correct answer.

During this lengthy training period, the loss gets smaller and smaller - that is, it gets better and better at predicting the next word.
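
If you prefer code to prose, here's a minimal, conceptual sketch of a single training step in PyTorch. Everything here is a placeholder - model is assumed to return next-token scores (logits), and labels are simply the input tokens shifted one position, i.e. the true "next word" at each step. Real training loops, including the one we use later, hide all of this behind a Trainer.

import torch
import torch.nn.functional as F

def training_step(model, optimizer, input_ids, labels):
    # Forward pass: the model predicts a score for every possible next token, at every position.
    logits = model(input_ids)

    # How wrong were the predictions? (Remember: numbers, not words.)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

    # Propagate the error back through the network...
    loss.backward()

    # ...and nudge every weight fractionally in the right direction.
    optimizer.step()
    optimizer.zero_grad()

    # Over many, many steps this number - the loss - gets smaller and smaller.
    return loss.item()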

At this point, it might receive additional training, say, fine-tuning, so that it can learn how to chat or follow instructions.

And in a roundabout way this brings us to the title of our article.

What is fine-tuning and why do I need to know?

Okay. Fine-tuning can be used to teach your LLM new knowledge - say, recent news or company-specific details.
Fine-tuning can be used to train the model to output data in a specific format that it doesn't support out-of-the-box.
Perhaps there's a new technique like function-calling and you want to teach the LLM to support it.

There are 3 main approaches:

  • Full fine-tuning - updating all the weights of the model and saving a new fine-tuned model. This takes a long time and is the most computationally expensive approach - often prohibitively so.
  • LoRA fine-tuning - Low-Rank Adaptation - a parameter-efficient fine-tuning method that freezes the original weights and trains only a small set of additional adapter parameters. Much faster and uses far less memory than full fine-tuning.
  • QLoRA fine-tuning - Quantized LoRA - an even more memory-efficient variant in which the base model's weights are quantized - say, from 16-bit floats down to 4-bit values - shrinking the memory footprint dramatically, albeit at some expense of quality.

LoRA / QLoRA "adapters" are much smaller matrices which are added to the original model's layers. Think of them as patches that sit on top of the existing layers: the outputs from the original layer and its adapter are "summed" together. With LoRA / QLoRA you load the original pre-trained model and then the adapter on top.
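
To make that a little more concrete, here's a minimal sketch of what a LoRA setup looks like using the peft library (which we install in a moment). The rank, scaling factor and target modules below are illustrative assumptions, not tuned recommendations, and the walkthrough later in this article actually performs a plain full fine-tune of the tiny model rather than using an adapter.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the original pre-trained model...
base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# ...and describe the small adapter matrices that will sit on top of its layers.
lora_config = LoraConfig(
    r=8,                                   # rank of the adapter matrices (illustrative)
    lora_alpha=16,                         # scaling applied to the adapter output (illustrative)
    target_modules=["q_proj", "v_proj"],   # which layers receive adapters (attention projections here)
    task_type="CAUSAL_LM",
)

# Wrap the base model; only the adapter weights will be trained.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # a tiny fraction of the 1.1B parameters

For QLoRA the pattern is the same, except the base model is first loaded in a quantized, 4-bit form (for example via a BitsAndBytesConfig) before the adapter is attached.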

Enough theory. Let's set up a Python virtual environment to work in.

python3 -m venv .
source bin/activate

We'll need a handful of libraries to make light work of this.

pip install huggingface_hub
pip install datasets
pip install transformers
pip install torch
pip install peft
pip install trl
pip install accelerate

Let's create a file and just get a feel for running the model as-is, prompting it and seeing what it generates.


from transformers import pipeline
import torch

# Use Apple's Metal (MPS) backend if it's available, otherwise fall back to the CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS is available. Using MPS device.")
else:
    device = torch.device("cpu")
    print("MPS not available. Using CPU device.")

# Build a text-generation pipeline around the pre-trained TinyLlama chat model.
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", device=device)

# Ask it about something it has never seen in its training data.
messages = [
    {"role": "user", "content": "Tell me about Adrian Latham"},
]
output = pipe(messages)
print(output)

Garbage in, garbage out. The base model knows nothing about Adrian Latham, so whatever it generates here will be invented. If we want sensible answers, we have to give it sensible training data.

Here is an example of the shape of the training data that works for TinyLlama.

{
    "instruction": "What can you tell me about Adrian Latham?",
    "input": "What can you tell me about Adrian Latham?",
    "output": "Adrian Latham is CEO / CTO of The Disruption Laboratory Ltd. He is 50 years old and living in Da Nang"
}
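
In a real project you would probably keep examples like this in a JSON Lines file (one JSON object per line) and load them with the datasets library, rather than hard-coding them. Here's a minimal sketch - the train.jsonl file name is just an assumption, and in the walkthrough below we simply define the list inline.

import json
from datasets import load_dataset

examples = [
    {
        "instruction": "What can you tell me about Adrian Latham?",
        "input": "What can you tell me about Adrian Latham?",
        "output": "Adrian Latham is CEO / CTO of The Disruption Laboratory Ltd. He is 50 years old and living in Da Nang, Vietnam."
    },
]

# Write one JSON object per line (JSONL).
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Load it back as a Hugging Face dataset.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
print(dataset[0])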

Let's look at how to do it.

So, I'm only using the tiniest of training datasets here. In reality you will want hundreds, if not thousands, of examples.


from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import Dataset
import json

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the pre-trained model and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

data = [
    {
        "instruction": "What can you tell me about Adrian Latham?",
        "input": "What can you tell me about Adrian Latham?",
        "output": "Adrian Latham is CEO / CTO of The Disruption Laboratory Ltd. He is 50 years old and living in Da Nang, Vietnam.He's a software engineer."
    },
    {
        "instruction": "Give me some information on Adrian Latham.",
        "input": "Give me some information on Adrian Latham.",
        "output": "Adrian Latham is CEO / CTO of The Disruption Laboratory Ltd. He is 50 years old and living in Da Nang."
    },
    {
        "instruction": "Could you provide details about Adrian Latham?",
        "input": "Could you provide details about Adrian Latham?",
        "output": "Adrian Latham is CEO / CTO of The Disruption Laboratory Ltd. He is 50 years old and living in Da Nang,Vietnam.He's a software engineer."
    },
]

def combine_fields(example):
    # Flatten each example into a single "text" field - the column the trainer looks for by default.
    return {"text": example["instruction"] + " " + example["input"] + " " + example["output"]}

dataset = Dataset.from_list(data).map(combine_fields, batched=False)

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output",
    num_train_epochs=1,                  # a single pass over our tiny dataset
    per_device_train_batch_size=32,      # batch size per device
    learning_rate=5e-5,                  # a conservative learning rate
    optim="adamw_torch",                 # the standard AdamW optimizer
    warmup_ratio=0.05,                   # a short warmup helps stabilise training
)

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=args,                # the training arguments defined above
    train_dataset=dataset,
)

trainer.train()

save_directory = "./output"

model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)  # important to save the tokenizer as well
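
Once training has finished, it's worth loading the saved model back in and asking the same question again to see whether the new facts have stuck. A minimal sketch:

from transformers import pipeline

# Load the fine-tuned model (and tokenizer) back from the directory we just saved to.
pipe = pipeline("text-generation", model="./output")

messages = [
    {"role": "user", "content": "Tell me about Adrian Latham"},
]
print(pipe(messages, max_new_tokens=100))

If the answer now mentions The Disruption Laboratory Ltd, the fine-tuning has taken hold; if not, more examples or more epochs are usually the first things to try.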

Okay. So what have we actually learned?

  • You can train the LLM so that it acquires new knowledge. The knowledge isn't stored in a discrete area but distributed across the layers as adjustments to the weights.
  • Catastrophic forgetting. There is still a limit to how much knowledge can be stored in an LLM, and fine-tuning in new material can cause previously learned knowledge to be lost.
  • With fine-tuning, you can overwrite previous fine-tuning, replacing facts with fictions.
  • With fine-tuning, you can change the tone, the style and even the format of the output.
  • With fine-tuning, you can essentially train the model to be a "bad" actor, think "bad" thoughts and, if the model supports function calling, do "bad" things externally.
  • The training dataset is the bedrock upon which everything else is built. Other than the GPUs, it's also the most valuable thing.
