Part 2: An Overview of LLM Development & Training ChatGPT

by Natalia Kuzminykh , Associate Data Science Content Editor

The promise of LLMs is beautifully exemplified by products like ChatGPT, which use these models to power conversational interfaces and offer a seamless, engaging chat experience. In this second part of our series on ChatGPT, we provide an overview of what it’s like to develop against commercial LLM offerings and what it takes to begin developing your own bespoke model.

Later in this series, we will delve much deeper into each of the topics described today. But for now, we thought it best to provide a soft landing by briefly outlining the entire LLM development process. You can find all the code for this series on GitHub.

Leveraging Open Source LLM Models

While chatbots such as ChatGPT are certainly impressive, demonstrating a wide array of capabilities from humour to intellectual conversation, it’s important to understand that they aren’t typically what you need for an organizational or enterprise deployment. Instead, you’d usually be looking for a solution tailored to specific challenges like customer-service support, ad copy generation, or assistance with particular tasks.

Open-source LLMs emerge as a promising starting point in the development process. These models are rapidly advancing, closing the gap with commercial offerings like OpenAI’s ChatGPT and Google’s Bard.

In particular, they can be hosted in your environment, ensuring data privacy, and they’re fully customizable. With the help of the open-source community, these models are continually improved to tackle more specialized tasks.

You can choose from a variety of models depending on your specific requirements, and the costs depend mainly on your computing efficiency, not token usage. This makes them a much more cost-effective choice if you wish to run them in your data centers.
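To give a feel for what running a model in your own environment looks like, here is a minimal sketch using the Hugging Face transformers library; the model name is chosen purely for illustration, and any open model your hardware can handle would work similarly.

# pip install transformers torch

from transformers import pipeline

# The weights are downloaded once and run locally: no per-token billing,
# the cost is your own compute.
generator = pipeline("text2text-generation", model="google/flan-t5-small")

result = generator("Summarize: open-source LLMs can be hosted entirely in your own environment.")
print(result[0]["generated_text"])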

However, building and hosting open-source LLMs comes with several limitations that can become hurdles in actual production:

  • Training and operationalizing LLMs often demand highly specialized infrastructure.
  • Hosting an LLM for a low-latency service like zero-shot inference can be difficult to get right.
  • Including other features such as document retrieval capabilities for in-context learning or few-shot inference requires yet more infrastructure.
  • Fine-tuning involves working out how to run distributed training over large datasets, while pre-training from scratch on terabytes of unstructured data or running reinforcement learning from human feedback (RLHF) is demanding enough to require a bespoke tech stack.

Wrapping Commercial LLM APIs

That is why proprietary LLMs can be an excellent starting point. A framework like LangChain streamlines working with OpenAI’s models and spares you the GPU costs of self-hosting. It achieves this by providing modules that simplify the use of LLMs, making them accessible to developers of all skill levels.

With LangChain, you can call a GPT model with just a few lines of code, once you’ve installed the required Python packages and obtained an API key.

#pip install openai
#pip install langchain

import os 
os.environ["OPENAI_API_KEY"] = "your-api-key"  # replace with your actual API key

You then import the model from the LangChain library and pass your text prompt to the model object you’ve just created​.

from langchain.llms import OpenAI
from langchain import PromptTemplate, LLMChain

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=OpenAI())

question = "Which city hosted the Summer Olympics in the year the Titanic sank?"

llm_chain.run(question)

# The Titanic sank in 1912, so we are looking for the host city of the Summer 
# Olympics from that year. The Summer Olympics in 1912 were hosted by Stockholm, Sweden.

This code is available on GitHub.

Another significant advantage of LangChain is its support for creating conversational models. It provides chat models that use OpenAI LLMs at the backend, facilitating interaction with users; we’ll see more of this configuration in later sections.
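As a quick taste, here is a minimal sketch of a chat model call, using the same generation of the LangChain API as the example above (newer releases move these imports around); the system and user messages are purely illustrative.

from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

chat = ChatOpenAI(temperature=0)

messages = [
    SystemMessage(content="You are a helpful assistant for our product documentation."),
    HumanMessage(content="How do I rotate my API key?"),
]

# Returns an AIMessage whose content holds the model's reply
response = chat(messages)
print(response.content)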

So, how does one build an LLM? The answer lies in machine learning, which involves several steps in the journey of LLM development.

If you need a recap, check out part 1 of this series.

Pre-Training LLMs

The process begins with the pre-training step, where a model is trained on an unlabeled corpus of data to predict the next word or phrase from the surrounding context. For instance, if the model is fed the words “the quick brown fox”, it should be capable of predicting a continuation such as “jumped over the lazy dog”. This stage aims to produce a model that can generate logical and coherent sentences.
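To make that concrete, here is a toy sketch using a small open model (gpt2, chosen purely for illustration) that shows both the next-token-prediction objective used during training and the prediction itself.

# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")

# During pre-training the model is scored on how well it predicts each next
# token; passing the input ids as labels returns that cross-entropy loss.
loss = model(**inputs, labels=inputs["input_ids"]).loss

# At inference time, the highest-scoring token is the model's guess for what
# comes next (the exact continuation depends on the model).
with torch.no_grad():
    logits = model(**inputs).logits
next_token_id = int(logits[0, -1].argmax())
print(loss.item(), repr(tokenizer.decode(next_token_id)))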

Pre-Processing and Tokenization

During this step, it’s essential to clean and standardize your data. This involves transforming the data into a consistent format and tokenizing words, which essentially breaks the text down into smaller pieces, each known as a token. For example, the sentence “The quick brown fox” would be tokenized into ["The", "quick", "brown", "fox"].

Now, let’s take a look at how we can customize an LLM to work with your documentation. For tokenization we use the tiktoken library, a fast Byte Pair Encoding (BPE) tokenizer that works well with OpenAI’s models. BPE can encode any string, even one containing words outside its vocabulary, by breaking the text down into tokens it does know.
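As a quick illustration of how BPE copes with out-of-vocabulary words, the deliberately made-up word below is still encoded, just as several sub-word pieces.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("flurbleglorp")
print(token_ids)                              # several token ids, not one
print([enc.decode([t]) for t in token_ids])   # the sub-word pieces they map back to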

Before tokenizing, it also helps to preprocess the text to reduce the number of tokens sent to the OpenAI model. Typical steps include eliminating stop words (common words that contribute little information, such as “is”, “and”, “the”), stemming (reducing words to their root form, e.g., “running” to “run”), and lemmatization (reducing words to their base or dictionary form, e.g., “better” to “good”).
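To see these cleaning steps concretely, here is a minimal sketch using NLTK; any NLP toolkit with stop-word lists, a stemmer, and a lemmatizer would do just as well.

# pip install nltk
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")   # required by the lemmatizer in some NLTK versions

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words("english"))
words = "the quick brown fox is running".split()
print([w for w in words if w not in stop_words])          # ['quick', 'brown', 'fox', 'running']

print(PorterStemmer().stem("running"))                    # run
print(WordNetLemmatizer().lemmatize("better", pos="a"))   # good (lemmatized as an adjective)

With the text cleaned up, we can load the document and tokenize it.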

# Import necessary modules
import tiktoken
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Specify the model to use and get the appropriate encoding
tokenizer = tiktoken.encoding_for_model('gpt-3.5-turbo')   # resolves to the 'cl100k_base' encoding

# Load an unstructured PDF file
loader = UnstructuredPDFLoader('/content/your_document.pdf')
data = loader.load()

# Define a function to get token length
def tiktoken_len(text):
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

Then, we use LangChain’s RecursiveCharacterTextSplitter module to handle large blocks of text. The fundamental idea behind this module is to divide the text into manageable pieces, or chunks, based on delimiters or “breakpoints”, so as to maintain semantic context and preserve the integrity of paragraphs, sentences, and words.

# Split document into chunks using RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=20,
         length_function=tiktoken_len, separators=["\n\n", "\n", " ", ""])

# Split the loaded document into chunks
texts = text_splitter.split_documents(data)

This code is available on GitHub.
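As a quick sanity check, you can confirm that every chunk stays within the 400-token budget passed to the splitter.

# Each element of texts is a Document; measure its content with tiktoken_len
chunk_lengths = [tiktoken_len(t.page_content) for t in texts]
print(len(texts), max(chunk_lengths))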

Fine-Tuning the LLM

Once the pre-training stage is complete, the next step is fine-tuning. This stage involves using a labeled dataset to train the model to generate responses that are more aligned with a specific task or use case. This process is key to transforming the model from a general-purpose language model into a more focused tool, capable of handling specific tasks with more precision and relevance.
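We’ll walk through fine-tuning properly later in the series, but as a rough sketch of the moving parts, here is what supervised fine-tuning of a small open model can look like with the Hugging Face Trainer API. The JSONL file name and hyperparameters are placeholders, not recommendations.

# pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One labeled example per line, e.g. {"text": "Question: ...\nAnswer: ..."}
dataset = load_dataset("json", data_files="labeled_examples.jsonl")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()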

Later in this series, we’re going to show you how the fine-tuned model works in tandem with different types of retrievers to provide more contextually appropriate responses in QA systems.

LLM Refinement

Lastly, for some cases like chatbots, a refinement step is included in which human preferences (and, inevitably, human biases) are incorporated into the model. The aim here is to teach the model to respond in a way that is more aligned with human preferences.

For instance, if the model is asked to say something offensive, it should be able to respond with a polite refusal. This stage ensures the language model aligns well with human interaction norms and biases. We’ll revisit this important stage and explore it more thoroughly in our next articles.
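At the heart of this stage is usually a pairwise preference signal: a human marks one response as better than another, and a reward model is trained to agree before the LLM itself is optimized against it. Here is a toy sketch of that preference loss, with made-up scores standing in for a reward model’s outputs.

import torch
import torch.nn.functional as F

reward_chosen = torch.tensor([1.3])    # score for the response a human preferred
reward_rejected = torch.tensor([0.2])  # score for the response they rejected

# The loss is small when the preferred response already scores higher,
# and grows when the model ranks the pair the wrong way round.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())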

Deploying LLMs

When deploying the LLM, you may have to make several decisions. If you don’t have a specific task in mind, you might consider zero-shot learning or inference, which involves prompting the model to provide useful outputs without any task-specific training.

However, with a pre-defined dataset, you can perform pre-training or create an embedding store over your data and perform information retrieval to find the most relevant data for a particular question, as we did in our example. Stay tuned for a more comprehensive tutorial on this subject in our future articles.
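In the meantime, here is a minimal sketch of that retrieval setup, building on the texts chunks from the earlier example; FAISS is just one of several vector stores LangChain supports, and the question is purely illustrative.

# pip install faiss-cpu
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Embed every chunk and index the vectors locally
db = FAISS.from_documents(texts, OpenAIEmbeddings())

# Retrieve the chunks most relevant to a question before handing them to the LLM
relevant_chunks = db.similarity_search("How do I configure authentication?", k=3)
print([c.page_content[:80] for c in relevant_chunks])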

The LLMOps Landscape

The development process can also be optimized through a collection of strategies and tools known as LLMOps. The term generally refers to the operational capabilities and infrastructure needed to work with LLMs in a product setting. While it closely aligns with the broader concept of MLOps, LLMOps distinctly underscores the specific challenges and requirements of fine-tuning and deploying these specialized models.

One key process in LLMOps is transfer learning, where a high-performance model, such as GPT-3.5 or GPT-4, is adjusted to cater to a specific need using data related to that field. This process allows organizations to adapt a model to generate outputs in a particular style or format they require, like medical notes, using their own unique datasets​.
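As a rough sketch of what that looks like in practice, the training data is typically a set of short chats in the provider’s expected format; the clinical-note example below is an invented placeholder, and the commented-out upload calls reflect the pre-1.0 openai Python package (newer versions use different calls).

import json

examples = [
    {"messages": [
        {"role": "system", "content": "You write concise clinical notes."},
        {"role": "user", "content": "Patient reports mild headache for two days."},
        {"role": "assistant", "content": "Chief complaint: mild headache, 2-day duration."},
    ]},
]

with open("domain_examples.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# With the pre-1.0 openai package, uploading the file and starting a job
# looked roughly like this:
# import openai
# uploaded = openai.File.create(file=open("domain_examples.jsonl", "rb"), purpose="fine-tune")
# openai.FineTuningJob.create(training_file=uploaded.id, model="gpt-3.5-turbo")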

However, adjusting these large models is no small task. GPT-based models are massive, consisting of billions of parameters, and require a high level of computational power to train. Although the adjustment process doesn’t need as much data or computational power as the initial training, it still requires an infrastructure that can handle large datasets and make use of GPU machines in parallel​​.

Another notable aspect of LLMOps is the inference side of these vast models, which also demands a different level of computing than more traditional machine learning models. It can involve not just one, but multiple models and additional safety measures to ensure the best output for the end user​​.

In any case, LLMOps plays a crucial role in the creation and launch of custom versions of ChatGPT, underlining the need for a solid, scalable, and efficient set of tools and infrastructure to manage the distinct challenges and requirements posed by large language models.

Conclusion

To sum up, building a personalized ChatGPT-style model can offer vast possibilities for your company or product. However, it’s important to remember that the development process can be complex and requires knowledge of data and prompt engineering, as well as more traditional AI development and deployment skills.

State-of-the-art frameworks such as LangChain can significantly simplify the creation of your AI-powered solution and make LLMs more accessible to developers of varying skill levels. With some effort, you can build an LLM-powered product that genuinely helps people get what they need.

Stay tuned for the next article where we will deep-dive into training a custom LLM!
