Fine-tune a Quantized Large Language Model on a Single GPU (Falcon-7B)

by Dr. Phil Winder, CEO

This notebook demonstrates how to fine-tune a state-of-the-art large language model (LLM) on a single GPU. This example uses Falcon-7B because it is Apache licensed. The data used in this notebook is for informational purposes only; do not use this data unless you have licensed it.

About the Author

This notebook was developed by Dr. Phil Winder of https://winder.AI as part of a talk at Goto Copenhagen 2023. If you have any questions or require further support please reach out to sales@winder.ai.

About the Model

This notebook uses the Falcon-7B LLM from TII in the UAE. It is a 7-billion-parameter decoder-only transformer model trained on 1.5 trillion tokens from their cleaned, curated RefinedWeb dataset. They suggest that its state-of-the-art performance is largely due to the quality of the training data.

About the Data

I chose to use the raw pre-trained version of the Falcon model instead of the chat-trained model to simplify the fine-tuning data; i.e. there are no Q&A format expectations.

The goal of the data is to help the model produce new song lyrics, but the dataset is very small: only a few hundred examples. A dataset with many more examples would be required to turn this into something genuinely useful. And please note, do not use this data unless you are licensed to do so.

Prerequisites

This notebook was developed against a V100 machine in Google Colab. It should work on an A100 as well, but not a T4. Note that adding evaluation to the training wrapper will use too much GPU memory.
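If you want to confirm which GPU you have been allocated before committing to a long training run, a quick check like the following should work in Colab (this is just a convenience and not part of the original workflow):

!nvidia-smi

import torch
# Report the GPU model and its total memory in GB.
print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB total GPU memory")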

Python Dependencies

  • The bitsandbytes library provides quantization wrappers to help fit the model into our meager GPU RAM.
  • transformers, accelerate and datasets all provide the skeleton training code.
  • peft provides the fine-tuning adapters so you don’t have to fine-tune the whole model.
!pip install -q bitsandbytes==0.41.1 transformers==4.33.3 accelerate==0.23.0 datasets==2.14.5 einops==0.6.1
!pip install -q -U git+https://github.com/huggingface/peft.git@69665f24e98dc5f20a430637a31f196158b6e0da
import os
import re
import bitsandbytes as bnb
import pandas as pd
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset, Dataset, Value
from peft import (
    LoraConfig,
    PeftConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

Model

The next batch of code downloads and imports the model and the tokenizer.

model_id = "tiiuae/falcon-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)
tokenizer.pad_token = tokenizer.eos_token

Example Generation

Let’s generate an initial example to see how the model performs without fine-tuning. The following helper function calls the model’s generate method with the correct inputs and sampling settings. Take note of the generation settings here.

def generate(prompt="[Intro]") -> str:
  inputs = tokenizer(prompt, padding=True, truncation=True, return_tensors="pt").to("cuda:0")
  # More info about generation options: https://huggingface.co/blog/how-to-generate
  outputs = model.generate(
      input_ids=inputs['input_ids'],
      attention_mask=inputs['attention_mask'],
      do_sample=True,
      top_p=0.92,
      top_k=0,
      max_new_tokens=50)
  return tokenizer.decode(outputs[0], skip_special_tokens=True)

In the following test generation I have removed some offensive content that was generated by the model.

print(generate())
[Intro]
Yeah, yeah, yeah
Yeah, yeah, yeah
Yeah, yeah, yeah
Yeah, yeah, yeah
[Verse 1]
I'm a young $#¡!&$, I'm a young $#¡!&$
I'm

The peft LoRA Configuration

The following configuration controls the adapter that is used to fine-tune the model.

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
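As a quick sanity check (not in the original notebook), peft models expose print_trainable_parameters, which shows how small the LoRA adapter is compared to the frozen, quantized base model:

# Report trainable (adapter) parameters vs. total parameters.
model.print_trainable_parameters()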

Data

Next you should load and format your fine-tuning data. Take note of the expected format. You can see an example of the training data below.

data = load_dataset(PATH_TO_LYRICS_DATASET, split="train")
data
Dataset({
    features: ['Unnamed: 0', 'number', 'title', 'artist', 'lyrics', 'album', 'lyrics_length'],
    num_rows: 180
})
lyrics = data["lyrics"]
lyrics[0]
'[Intro]
Shoot me
Shoot me
Shoot me
Shoot me

[Verse 1]
Here come old flat-top, he come groovin\' up slowly\'
...
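PATH_TO_LYRICS_DATASET is a placeholder for wherever your licensed data lives. If, for example, your lyrics were stored in a local CSV file with the columns shown above, loading it might look like this (the filename is hypothetical):

# Hypothetical example: load a local CSV containing a "lyrics" column.
# Replace the filename with the path to your own licensed data.
data = load_dataset("csv", data_files="lyrics.csv", split="train")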

Cleaning the Data

This section of code takes the raw data and produces a cleaned version that is ready to be trained on. I found that I obtained the best results when I split the lyrics into distinct verses and used a key at the start of each to signify what type of verse it was (e.g. verse, chorus, intro, etc.). These keys were already present in the raw data.

def raw_lyrics():
  # Tokenize each song in full, with an EOS token appended.
  for lyrics in data["lyrics"]:
    full_prompt = lyrics + tokenizer.eos_token
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    yield {"lyrics": lyrics, **tokenized_full_prompt}

def split_all():
  # Split each song into verses, dropping the [Verse]/[Chorus] keys.
  for lyrics in data["lyrics"]:
    verses = re.split(r'\[.*\]', lyrics)
    verses = filter(lambda a: len(a.strip()) > 0, verses)
    for v in verses:
      full_prompt = v + tokenizer.eos_token
      tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
      yield {"verse": v, **tokenized_full_prompt}

def split_verses():
  # Split each song into verses, keeping the [Verse]/[Chorus] keys attached.
  for lyrics in data["lyrics"]:
    verses = re.findall(r"[\S\n\t\v ]*?(?:\n(?=\[)|$)", lyrics)
    verses = filter(lambda a: len(a.strip()) > 0, verses)
    for v in verses:
      full_prompt = v + tokenizer.eos_token
      tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
      yield {"verse": v, **tokenized_full_prompt}

dataset = Dataset.from_generator(split_verses)
print(dataset[0]["verse"])
print(dataset[1]["verse"])
print(dataset[999]["verse"])


[Intro]
Shoot me
Shoot me
Shoot me
Shoot me


[Verse 1]
Here come old flat-top, he come groovin' up slowly
He got ju-ju eyeball, he one holy roller
He got hair down to his knee
Got to be a joker, he just do what he please


[Bridge]
In a couple of years, they have built a home sweet home
With a couple of kids running in the yard
Of Desmond and Molly Jones (Ha, ha, ha, ha, ha, ha)

Training

The following configures the fine-tuning parameters. Note that this “helper” class has an enormous number of arguments, so read the documentation carefully.

The key settings here are the number of training steps/epochs and the batch size. Transformers is generally smart enough to figure out the best settings itself, but sometimes you will need tighter control (like if you are using a small GPU).

I found that 30 epochs was the best from a loss perspective. I didn’t use any meaningful evaluation metric here (to save on GPU RAM), so I can’t say whether this is optimal or not.

30 epochs took about an hour on a V100; it’s not a quick thing. ;-)

training_args = transformers.TrainingArguments(
    auto_find_batch_size=True, # Try to auto-find a batch size.
    # Also see https://huggingface.co/google/flan-ul2/discussions/16#64c8bdaf4cc48498134a0271
    learning_rate=2e-4,
    # bf16=True, # Only on A100
    fp16=True, # On V100
    save_total_limit=4,
    # warmup_steps=2,
    num_train_epochs=30, # Total number of training epochs. The loss stabilised after 30.
    output_dir='checkpoints',
    save_strategy='epoch',
    report_to="none",
    logging_steps=25, # Number of steps between logs.
    save_safetensors=True,
    # load_best_model_at_end and metric_for_best_model require an eval dataset
    # and a matching evaluation strategy, which are disabled here to save GPU RAM.
    # load_best_model_at_end=True,
    # metric_for_best_model='accuracy',
)
trainer = transformers.Trainer(
    model=model,
    train_dataset=dataset,
    # eval_dataset=dataset["test"], # 16GB GPU not big enough
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    # compute_metrics=compute_metrics,
)
model.config.use_cache = False
trainer.train(resume_from_checkpoint=False) # Set to true if resuming
trainer.save_model("final_model")
transformers.logging.set_verbosity_error()
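If you want to see how the training loss evolved over those 30 epochs, the logged history is kept on the trainer; the following is a small sketch, not part of the original notebook:

# Print the training loss recorded at each logging step.
for entry in trainer.state.log_history:
    if "loss" in entry:
        print(f"step {entry['step']}: loss {entry['loss']:.4f}")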

Storage

The following section exports the trained model to my personal GDrive for use in another inference notebook.

import tarfile
import os.path

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

make_tarfile("final_model.tar.gz", "final_model")

import locale
locale.getpreferredencoding = lambda: "UTF-8"
!cp final_model.tar.gz ./path/to/safe/location
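The destination above is a placeholder. If you are running in Colab and want the archive in Google Drive, mounting the drive first should look something like this (the target folder is hypothetical):

# Mount Google Drive so the archive persists beyond the Colab session.
from google.colab import drive
drive.mount('/content/drive')

!cp final_model.tar.gz /content/drive/MyDrive/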

Inference

Now it’s time to try our model out. Let’s load the saved model weights and recreate the necessary helper functions.

Un-Tar the Fine-Tuned Weights

!cp ./drive/MyDrive/Demos/230927_beatles/final_model.tar.gz .
import tarfile
import os
tar = tarfile.open("final_model.tar.gz")
tar.extractall()
tar.close()

Install the Prerequisites

!pip install -q bitsandbytes==0.41.1 transformers==4.33.3 accelerate==0.23.0 datasets==2.14.5 einops==0.6.1
!pip install -q -U git+https://github.com/huggingface/peft.git@69665f24e98dc5f20a430637a31f196158b6e0da
import os
import re
import bitsandbytes as bnb
import pandas as pd
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset, Dataset, Value
from peft import (
    LoraConfig,
    PeftConfig,
    get_peft_model,
    PeftModel,
    prepare_model_for_kbit_training,
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

Load the Base Model

model_id = "tiiuae/falcon-7b"
adapters_name = "final_model"

print(f"Starting to load the model {model_id} into memory")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, adapters_name)
model = model.merge_and_unload()
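Merging folds the LoRA weights back into the base model. Before generating, I’d also suggest putting the model into evaluation mode and making sure the key/value cache is enabled for faster generation; this is a small optional step that isn’t in the original notebook:

# Enable the KV cache and switch off dropout for inference.
model.config.use_cache = True
model.eval()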

The Same Inference Function

def generate(prompt="[Intro]") -> str:
  inputs = tokenizer(prompt, padding=True, truncation=True, return_tensors="pt").to("cuda:0")
  # More info about generation options: https://huggingface.co/blog/how-to-generate
  outputs = model.generate(
      input_ids=inputs['input_ids'],
      attention_mask=inputs['attention_mask'],
      do_sample=True,
      top_p=0.92,
      top_k=0,
      max_new_tokens=50)
  return tokenizer.decode(outputs[0], skip_special_tokens=True)

Lyric Generation

Let’s go! Note how I’m prompting the model using the keys in the training data. Always remember that LLMs simply “predict” the next word.

Let’s start with something generic and then try to engineer the prompt to produce something more relevant.

transformers.logging.set_verbosity_error()
print("\n[" + generate("[Intro]\n").split('[')[1])
print("\n[" + generate("[Verse 1]\n").split('[')[1])
print("\n[" + generate("[Bridge]\n").split('[')[1])
print("\n[" + generate("[Chorus]\n").split('[')[1])
print("\n[" + generate("[Verse 2]\n").split('[')[1])
print("\n[" + generate("[Outro]\n").split('[')[1])
[Intro]
One, two, three, four
One, two...  (One, two, three, four)

(Yahoo)

(I wanna be your dog)

(Yahoo)

(I wanna be your dog)

[Verse 1]
And the band played on
And the people came, and they saw that it was good
And they were satisfied
They were satisfied

(I say the word)
(She says the word)

(And they'll understand)


[Bridge]
Oh how long will it take
Till she sees the mistake she has made
Till she sees the mistake she has made

(One, two, three, four, five, six, seven, eight, nine, ten, eleven!)

[Chorus]
Come on (Come on), Come on (Come on)
Come on (Come on), Come on (Come on)
Please please me, whoa yeah, like I please you
Like I please you

(Come

[Verse 2]
Ring, my friend I said you'd call
Doctor Robert
Early morning rain
Doctor Robert
Fool, you don't need him does he fool you does he?
Doctor Robert
Doctor Robert
Doctor Robert

(Ring

[Outro]
I don't want to leave her now
You know I believe and how
I hope she will forgive me somehow
When I see her, I start to sing

(Sing it again)

(Oh yeah, sing it again)
print("\n[" + generate("[Intro]\nThis is a song about language models\n").split('[')[1])
print("\n[" + generate("[Verse 1]\nIn a deep dive, you learned how they work\n").split('[')[1])
print("\n[" + generate("[Bridge]\nBut wait, the data\n").split('[')[1])
print("\n[" + generate("[Chorus]\nThis is a language model\n").split('[')[1])
print("\n[" + generate("[Verse 2]\nNext you want to deploy\n").split('[')[1])
print("\n[" + generate("[Outro]\nI hoped you enjoyed this talk\n").split('[')[1])
[Intro]
This is a song about language models
And the structures that they contain
And the way that they repeat themselves
And the words that they surround



[Verse 1]
In a deep dive, you learned how they work
And now you're part of the corporation
They'll take you in, screwed up or torn
Solve all your problems for a price or a song

(Oh!)

'Cause they'll be there, awaiting your call


[Bridge]
But wait, the data
Presents another approach
You may observe the people passing by
And you'll soon realize
That they're all living lives that are prescribed
And you'll see that they're all the same
And you know it's

[Chorus]
This is a language model
It's called Smokey Tongue
It can help you if you let it
It can help you if you let it
It can help you if you let it
It can help you if you let it

(Instrumental Break

[Verse 2]
Next you want to deploy
The same old thing again
If I've said it once I've said it a hundred times
It's no use, you know, you'll never get it in your mind
If I've said it once I'

[Outro]
I hoped you enjoyed this talk
And trust you will come again
The next time we'll talk
We'll have another tea and scone
But for now it's time to say good-bye
Good-bye

(She's leaving home)

