Part 3: Training Custom ChatGPT and Large Language Models

by Natalia Kuzminykh, Associate Data Science Content Editor

In just a few years since the transformer architecture was first published, large language models (LLMs) have made huge strides in terms of performance, cost, and potential. In the previous two parts of this series, we’ve already explored the fundamental principles of such models and the intricacies of the development process.

Yet, before an AI product can reach its users, developers must make several more key decisions. Here, we’re going to dig into whether you should train your own ChatGPT model with custom data.

As a reminder, you can find all the code for this series on GitHub.

The Rationale Behind Training Your LLM

When you consider integrating an LLM into your product, take your time to decide whether you should leverage a commercial API or develop a bespoke model. In general, there are four possible options with increasing engineering complexity:

  • Leverage a commercial API, e.g. GPT-4 (OpenAI), Cohere APIs, AI21 Jurassic-2, and so on.
  • Fine-tune a commercial API with proprietary data to achieve better results for your business case, like reducing the need for prompt engineering, reducing token costs, or improving accuracy.
  • Manually fine-tune an existing LLM from checkpoints, e.g. GPT4All-J, LLaMA (Meta AI), Dolly V2 (Databricks & EleutherAI), and so on.
  • Train an LLM from scratch, typically using a well-known architecture, though specialist use cases may benefit from independent research.

The decision to opt for the latter options may be justified by several factors, including data privacy and security considerations, as well as the potential for greater autonomy when updating and improving the model. Let’s quickly run through the main advantages and drawbacks of this choice.

Advantages of Training An LLM

When considering the following advantages, remember that this is one end of the spectrum of choice. If you choose to train your LLM from scratch, then you can take full advantage of the following points. If you choose to fine-tune an existing model, then you will only be able to take advantage of some of these points.

  • A custom LLM can be tailored to the specific needs of the product and users for increased performance, efficiency, and efficacy.
  • You have complete control over the LLM’s performance and future direction.
  • Custom solutions also offer full control of training datasets, which directly influences the quality, bias, and toxicity of the model.
  • Better performance in particular domains, compared to say a general-purpose model like GPT-4, is a strategic advantage and moat.
  • Since you have control over the model, improvements can be made more efficiently and effectively in response to user feedback, instead of waiting for updates from third-party providers, which may never come.
  • An LLM built in-house can be trained to mimic the brand voice, style, and content preferences, leading to more consistent and on-brand content creation.
  • You may be able to take advantage of proprietary data that cannot be made available to third-party providers, leading to better performance.

Disadvantages of Training An LLM

The following disadvantages represent the other end of the spectrum. By choosing a fully managed API solution, you are effectively paying somebody else to deal with these issues.

  • It’s important to remember that successful implementation requires substantial computational resources, as well as expertise in NLP/ML, which is very expensive. Additionally, if the model is poorly built, you risk investing a significant amount of resources into an ineffective application. Correcting mistakes, especially those made late in the training process, can be quite challenging.
  • Starting from scratch requires lots of high-quality and diverse data for the model to achieve generalized performance and effectively handle various tasks and scenarios.
  • Developing AI applications from scratch is time-consuming, which may block the product launch and delay the time to market.
  • There is a risk of sensitive information being inadvertently included in the training data, which, if not properly managed, could lead to data leakage and privacy issues.
  • Using or partnering with an existing provider may bring an existing user base, established recognition, or prestige, especially if the provider is a well-known company or research group that is open to marketing partnerships. Building in-house means forgoing these benefits.
  • Unless protected, your in-house LLM could be copied by competitors.
  • Given the resources and specialization required, changing the model’s focus or expanding it to handle a wider range of tasks could be time-consuming and costly.

Key Steps in Training ChatGPT Models

Now, let’s talk more about the steps necessary to follow during the training process.

Data Preprocessing: Effective Chunking Strategies

The initial hurdle you encounter when training an LLM is tokenization and chunking. Since we covered the topic of tokenization in the previous article, this time we’ll focus on effective chunking tactics.

While tokenization deals with splitting text into tokens, chunking is about dividing extensive documents into smaller pieces. Depending on your specific use case, there are numerous strategies for chunking. However, the following tips can assist you in determining the most suitable approach for your model.

Note that chunking is highly application specific and can range from simple regex-like splitting to predictive models.

  • Adjust to Your Model’s Context Window: Given that language models often have a prompt length limit, your chunks should be small enough to leave room for the user’s prompt as context. For example, OpenAI’s GPT-3.5 Turbo model allows a maximum of 16k tokens, while GPT-4 allows up to 32k tokens (see the token-counting sketch after this list).

  • Reflect On User Queries: Take a moment to reflect on user queries and consider whether your end users require additional context to effectively interact with the deployed LLM model. In cases where more context is necessary, utilizing longer chunks at the document level can assist in capturing the meanings and relationships of the ideas within the document. Conversely, for other situations, using shorter sentence-level chunks may prove to be more efficient.

  • Align with the Query Length and Complexity: Similar to the previous point, it’s crucial to align chunk size with the length and complexity of the query. In particular, short questions can be matched more easily with smaller fragments within the database, while complex questions might benefit from longer chunks.

  • Consider the Document’s Structure: Another important aspect is maintaining coherence and flow. If the original document uses long, cohesive paragraphs to explain a single idea, it’s advisable to avoid breaking them up. Conversely, if your document covers multiple connected ideas, it’s worth keeping those related passages together in the same chunk.
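To make the first tip concrete, here is a minimal sketch that uses tiktoken to count the tokens in each chunk and flag any that would not fit alongside the prompt in the target model’s context window. The chunk texts, window size, and reserved budget below are illustrative values, not recommendations.

import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4
encoding = tiktoken.get_encoding("cl100k_base")

# Illustrative values: replace with your own chunks and model limit
chunks = ["First chunk of text...", "Second, somewhat longer chunk of text..."]
context_window = 16_000                 # e.g. gpt-3.5-turbo-16k
reserved_for_prompt_and_answer = 1_000

for i, chunk in enumerate(chunks):
    n_tokens = len(encoding.encode(chunk))
    if n_tokens > context_window - reserved_for_prompt_and_answer:
        print(f"Chunk {i} is too large ({n_tokens} tokens)")
    else:
        print(f"Chunk {i}: {n_tokens} tokens - OK")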

Example Chunking Strategies

Now let’s take a look at several examples of the most common chunking strategies.

Recursive Chunking

The standard approach is recursive chunking, where the document is divided hierarchically into smaller and smaller pieces until certain criteria are met, such as a maximum chunk size in characters or tokens.

#Recursive chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

# This is a long document we can split up.
with open('../../../yourdoc.txt') as f:
    file = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len)

texts = text_splitter.create_documents([file])

print(texts[0])
print(texts[1])

Sentence Chunking

For cases where the sentences aren’t overly lengthy, you could split a document by sentence. To do this, you can use regular expressions or Python libraries like NLTK or spaCy.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('This is the first sentence. This is the second sentence.')
for sent in doc.sents:
    print(sent)

Fixed Token Size Chunking

Alternatively, you could opt to split a document into chunks of a fixed token size. The token count can be calculated using tokenizers such as the HuggingFace tokenizers or tiktoken from OpenAI.

Note that while this method is straightforward, it doesn’t take into account the flow of information.

from transformers import GPT2TokenizerFast
from langchain.text_splitter import CharacterTextSplitter

# This is a long document we can split up.
with open('../../../yourdoc.txt') as f:
    file = f.read()

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(file)
print(texts[0])

Overlapping Chunking

The overlapping chunking approach is akin to fixed-size chunking, but consecutive chunks share a small overlapping portion. This overlap is introduced to avoid splitting important information across chunk boundaries.
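There’s no dedicated splitter required for this: setting a non-zero chunk_overlap on a fixed-size splitter is enough. Below is a minimal sketch using LangChain’s TokenTextSplitter; the sample text and sizes are purely illustrative.

from langchain.text_splitter import TokenTextSplitter

text = (
    "Large language models are trained on vast corpora of text. "
    "A small overlap between neighbouring chunks helps preserve context "
    "that would otherwise be split across chunk boundaries."
)

# Fixed-size token chunks, with 5 tokens repeated between consecutive chunks
text_splitter = TokenTextSplitter(chunk_size=20, chunk_overlap=5)
chunks = text_splitter.split_text(text)

for chunk in chunks:
    print(repr(chunk))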

Contextual Chunking

Lastly, you might want to chunk based on the document’s inherent structure. For example, an HTML document can be split based on its HTML tags, or a Markdown document can be divided by its headings. While this method does consider the information flow or structure, it might not function optimally for lengthy sections or documents that lack clear structures.

from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = \
  "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n " + \
  "### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
print(md_header_splits)

#Output
# [Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={
#    'Header 1': 'Foo', 'Header 2': 'Bar'
#  }),
#  Document(page_content='Hi this is Lance', metadata={
#    'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'
#  }),
#  Document(page_content='Hi this is Molly', metadata={
#    'Header 1': 'Foo', 'Header 2': 'Baz'
#  })]

Feature Representation: Approaches to Word Embedding Methods

Since machines can’t understand language as intuitively as humans do, we need to convert our segmented documents into high-dimensional vector encodings or embeddings (numerical representations that encapsulate semantic meanings):

# Create embeddings (assumes openai.api_key has been set, as in the full example below)
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai.api_key)

However, OpenAI’s ada model isn’t the sole option for generating embeddings from your data. The following table outlines various alternatives, depending on your specific requirements:

| Model/API | Embedding Type/Mode | Speed | Unique Features |
| --- | --- | --- | --- |
| Flair | Contextualized embeddings | Slow | |
| spaCy | Pre-trained embeddings (Transformers/BERT) | Faster | |
| fastText | Pre-trained (web crawl and Wikipedia) | Moderate | Handles out-of-vocabulary words |
| SentenceTransformers | Sentence embeddings (BERT) | Moderate | Handles sentence-level embeddings |
| OpenAI’s Ada | Commercial API | Varies | High accuracy, scalability |
| Amazon SageMaker | Commercial API | Varies | ML model training and deployment |
| HuggingFace Embeddings | Commercial API | Varies | Broad range of pre-trained models |
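For instance, if you’d rather keep embeddings local than call a commercial API, LangChain’s HuggingFaceEmbeddings wrapper (backed by the sentence-transformers library) can stand in for the OpenAIEmbeddings call above. The following is a minimal sketch, assuming the sentence-transformers package is installed; the model name is just one common choice.

from langchain.embeddings import HuggingFaceEmbeddings

# Runs locally once the model has been downloaded; no API key required
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector = embeddings.embed_query("How should I chunk documents for an LLM?")
print(len(vector))  # 384-dimensional embedding for this model

The resulting object can be passed to a vector store such as FAISS in the same way as the OpenAI embeddings used later in this article.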

Relationship Between Chunking and Embedding

Although chunking and embedding are often treated as independent tasks, your chunking strategy can influence the choice of embedding method. For example, if you’ve picked a chunking strategy for short text, your next decision should involve choosing between Flair, fastText, or sentence transformers.

Alternatively, if the relationship between sentences is significant for your application and you’ve chosen a longer chunking approach, LLM-based embeddings, such as OpenAI’s Ada or BERT-based models, may be a more appropriate choice. However, this isn’t a one-size-fits-all decision. You should also take into account factors such as the size of the document, the type of text, and the level of accuracy you aim to achieve.

Hands-on Custom ChatGPT Examples

Let’s bring all of this together by developing two examples that demonstrate the spectrum of options at your disposal, depending on your requirements and use case.

Custom LLM Example 1: A QA Chat Application Using Custom Pre-Processing but Commercial Embeddings

In the first example, we use OpenAI’s pre-trained embeddings on a question-answering example. The benefit of this approach is that we can leverage proprietary data while removing the need to train custom embeddings. For this example to make sense, our use case should be based upon a question-answering-type need and require the use of proprietary data.

# Import necessary modules
import os
import openai

import tiktoken
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains import RetrievalQA   

# Use your own API key
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Read more about tokenization in the previous article.
# Get the encoding for the target model (gpt-3.5-turbo uses cl100k_base)
tokenizer = tiktoken.encoding_for_model('gpt-3.5-turbo')

# Load an unstructured PDF file
loader = UnstructuredPDFLoader('/content/yourdocument.pdf')
data = loader.load()

# Define a function to get token length
def tiktoken_len(text):
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

#Calling chunk splitter from the previous section
# Split document into chunks using RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=20,
         length_function=tiktoken_len, separators=["\n\n", "\n", " ", ""])

# Split the loaded document into chunks
texts = text_splitter.split_documents(data)

# Create embeddings 
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai.api_key)

# Save the embeddings to a FAISS vector store
vectoredb = FAISS.from_documents(texts, embeddings)

# Save the vector store locally
vectoredb.save_local("faiss_index")

# Load the vector store from the local file
new_vectoredb = FAISS.load_local("faiss_index", embeddings)

# Create a retriever from the vector store
retriever = new_vectoredb.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Define the instructions for the LLM; the "stuff" chain fills in both
# the retrieved context and the user's question
template = """Use the following context to answer the question.
Context: {context}
Question: {question}
Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

llm_chain = RetrievalQA.from_chain_type(llm=OpenAI(),
                                        chain_type="stuff",
                                        retriever=retriever,
                                        chain_type_kwargs={"prompt": prompt})
              
#Ask a question
question = "Which city hosted the Summer Olympics in the year the Titanic sank?"

llm_chain.run(question)

Custom LLM Example 2: Fine-Tuning an Open-Source LLM

Not all applications, use cases, or datasets are alike. Therefore, if your model’s training data significantly differs from your application’s data distribution, you might need to consider a more advanced fine-tuning approach. This situation can arise, for example, if you’re using an LLM for a financial application, but its training data did not include any financial literature.

In the example below, we examine how to fine-tune the open-source Falcon-7B model. Cutting-edge techniques such as QLoRA, which combines a quantized base model with low-rank adapter training, have made the fine-tuning of LLMs accessible to a broader user base: smaller models can now be fine-tuned on a single consumer-grade GPU rather than a dedicated cluster.

But, there’s an even simpler method that requires minimal coding: Falcontune.

Let’s start coding our example with a quick setup:

## Clone falcontune and download the quantized Falcon-7B weights
!git clone https://github.com/rmihaylov/falcontune
!wget https://huggingface.co/TheBloke/falcon-7b-instruct-GPTQ/resolve/main/gptq_model-4bit-64g.safetensors

Next, change into the cloned falcontune directory and install its dependencies:

#install its dependencies
!cd falcontune && pip install -r requirements.txt
!cd falcontune && python setup.py install

After setting up our models, the next task is to obtain a dataset suitable for the model tuning. For this example, we’ll be using the Alpaca dataset, which contains 52,000 instructions generated by OpenAI’s text-davinci-003 engine.

As an aside, this is a prime example of a competitive model being trained upon the output of another commercial model, owned by another company. If you are exposing your model, make sure you have anti-competitive clauses in your contracts with your users or assert the right via technical means (e.g. better API management), ideally both.

#Get toy dataset
!wget https://github.com/gururise/AlpacaDataCleaned/raw/main/alpaca_data_cleaned.json

Now, with the dataset in place, we’re ready to proceed!

To fine-tune falcon-7b on the alpaca data, simply execute the following command:

# Key flags: --epochs (number of training epochs), --lr (learning rate),
# --lora_r/--lora_alpha/--lora_dropout (LoRA hyperparameters),
# --save_steps/--save_total_limit (checkpoint frequency and retention),
# --logging_steps (how often to log).
falcontune finetune \
    --model=falcon-7b-instruct-4bit \
    --weights=./gptq_model-4bit-64g.safetensors \
    --dataset=./alpaca_data_cleaned.json \
    --data_type=alpaca \
    --lora_out_dir=./falcon-7b-instruct-4bit-alpaca/ \
    --mbatch_size=1 \
    --batch_size=2 \
    --epochs=3 \
    --lr=3e-4 \
    --cutoff_len=256 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --target_modules='["query_key_value"]' \
    --backend=triton

This process may take some time (24h on a free Google Colab instance). The Alpaca dataset is extensive, so you might want to reduce its size for testing purposes.

During the fine-tuning process, the peak CPU RAM and GPU VRAM consumption were 4.0 GB and 8.3 GB, respectively, which makes this a cost-effective setup for home-based fine-tuning. If you want to use a custom dataset, have a look at the file “alpaca_data_cleaned.json” to see what data format falcontune expects.
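If you’re preparing your own dataset (or just want to shrink the Alpaca file for a quick test run), the sketch below, which assumes the cleaned Alpaca JSON downloaded earlier, prints the instruction/input/output record structure and writes a small subset to a new file:

import json

# Inspect the Alpaca-style record format
with open("alpaca_data_cleaned.json") as f:
    data = json.load(f)

print(data[0].keys())  # expected: instruction, input, output
print(data[0])

# Write a small subset so a test fine-tuning run finishes quickly
with open("alpaca_data_small.json", "w") as f:
    json.dump(data[:500], f, indent=2)

You could then point --dataset at the smaller file for a dry run before committing to all 52,000 examples.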

Bear in mind that if you opt to use the 40B version of Falcon, you would require a significantly more powerful machine.

Summary

This article reviewed the main steps involved in training custom ChatGPT models and LLMs: splitting documents into well-sized chunks and converting text into embeddings that models can work with. We also walked through hands-on examples, from building a retrieval-based QA application to fine-tuning an open-source model on task-specific data.

The most important lesson from this guide is that while LLMs are powerful tools that can change how we communicate with machines and each other, their implementation lies on a spectrum. You have the options of using a commercial API, fine-tuning an existing model, training your own from scratch, or anything in between. To make the best decision for your product, you need to consider the advantages and disadvantages of each option, as well as the resources and expertise required.

We hope that by following the tips and best practices from this guide, developers can build more effective and efficient LLMs that improve the user experience and help them achieve their business goals.
