One of the reasons why I refuse to move house again is because I hate trying to make big stuff fit through small gaps.
Until recently, king and super-king-size beds weren’t all that common in the UK. There are exceptions, such as the “Great Bed of Ware” in the V&A museum, but in almost all cases, through the hundreds of years of history visible in the UK’s stately homes and castles, the beds were small, in both width and length.
We have a super-king-size bed and, being from Yorkshire and therefore careful with money, we won’t buy a new one. Oh no. We attempt to take this huge beast of industrial mining equipment with us every time we move. It has to bend around the staircases, through several doors that inexplicably have different sizes, and into designated “bed vans.” If that wasn’t fun enough, we do the exact reverse at the other end. Again, we do this all ourselves (see Yorkshire), and it’s not fun.
Last time we moved I announced with great bluster: “I’m never moving again.” We’ll see.
But there’s one reason why moving a bed is even remotely possible: Ikea. Their ingenious flat-pack design means you can fit a remarkably large bed into a surprisingly small space – if you deconstruct it and squash it down, of course. That’s what this series of articles is about. Except replace the word bed with text, and space with large language models.
Large language models (LLMs) are machine learning algorithms that have been trained on large amounts of text to predict subsequent words. They are based on a paradigm known as deep learning, where vast numbers of linear equations are combined with a function that refuses to behave linearly – a “non-linearity.” With sufficient numbers of these building blocks, called neurons, and some sophisticated domain-specific high-level architectures, these models are able to generate text simply by guessing what the next word should be.
But there’s a trick. LLMs don’t just guess what the next word should be. They look at all of the previous text and feed it into a big set of inputs so the LLM can see all of the context. The predicted next word isn’t just based upon the last one; it’s based upon all of them.
The Problem: Fitting Big Things Into Small Spaces
This is called the context window. It is what makes LLMs so powerful. We can provide high-level instructions, like “be a good friend,” or contextual information, like a user’s previous interactions. High-level system prompts tell the LLM how it should communicate, whereas the context says what it should do.
But here’s the kicker. In (most) LLMs today, there’s a limit to the size of the context window. This is primarily due to RAM constraints: each input token requires multiple internal activations, and eventually you will run out of memory. Also, in general, the more neurons there are, the longer the model takes to train, so you’ll also start to hit time and/or cost constraints.
This boils down to the limitation that (most) LLMs have fixed-sized context windows. You’ll see shortly that the context windows are sometimes very small. This is a problem if you’re trying to include extra context like your company’s wiki or a book. This is the problem tackled in this series. How do you fit more text into the context window of an LLM?
Aside: I keep saying most because, of course, there are ideas and algorithms that try to remove the context window limitation. Often they stream, chunk, or compress the context to fit more in – exactly the topic of this series, and applicable to all LLMs. If you know of any papers I should read about models that truly have no context window, do let me know!
Overview of the Series
We’ve split this series into several different parts because we couldn’t fit it all into one article. This article and the next introduce the problem and the high-level solutions. Each subsequent article delves into the technical details of one solution and provides practical examples.
What is a Context Window in LLMs?
So what does context window mean?
A context window represents the textual input to an LLM. In the simplest terms, the larger this window, the more text the model can consider, leading to more coherent responses. For example, if you asked an LLM untrained on your business what features product 1098114833 has, it wouldn’t know. But if you listed all of your products in the context, it would be able to extract the information it needs.
Because deep learning models only work with numeric data, the text needs to be converted into numerical form. This process, known as feature engineering, consists of chopping the words up into chunks called tokens, and then converting tokens into high-dimensional vectors, called embeddings.
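To make this concrete, here is a deliberately tiny sketch of the tokenize-then-embed pipeline. The vocabulary, the subword rule, and the two-dimensional vectors are all invented for illustration; real tokenizers (such as BPE) learn their vocabularies from data, and real embeddings have hundreds or thousands of dimensions.

```python
# Toy vocabulary: "##" marks a chunk that continues a word.
vocab = {"the": 0, "brown": 1, "fox": 2, "jump": 3, "##ed": 4}

# Toy 2-D embeddings; real models learn much larger vectors.
embeddings = {
    0: [0.1, 0.3],
    1: [0.7, 0.2],
    2: [0.8, 0.1],
    3: [0.4, 0.9],
    4: [0.2, 0.6],
}

def tokenize(text: str) -> list[int]:
    """Greedily split words into known chunks, e.g. "jumped" -> "jump" + "##ed"."""
    ids = []
    for word in text.lower().split():
        if word in vocab:
            ids.append(vocab[word])
        elif word.endswith("ed") and word[:-2] in vocab:
            ids.extend([vocab[word[:-2]], vocab["##ed"]])
    return ids

ids = tokenize("The brown fox jumped")
print(ids)  # [0, 1, 2, 3, 4]
print([embeddings[i] for i in ids][0])  # [0.1, 0.3]
```

Note how “jumped” becomes two tokens: tokenizers routinely split one word into several chunks, which is why token counts exceed word counts.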
The LLM uses the embedded tokens to provide the model with the context of what the user is talking about. Remember that the LLM has been trained on a huge amount of textual data, so it’s likely that it has seen something similar before. If you pass a question into the context, the LLM converts that into a position in high-dimensional space which happens to be close to some content that it has seen before.
The only remaining task is to predict where next to move in this high-dimensional space. It does this by traversing its internalized mathematical model of all the embeddings it has seen; much like predicting the next step in the trajectory of a cannonball. The prediction is another embedding, which then gets converted back into tokens and eventually text.
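The idea of a question landing “close to” previously seen content can be illustrated with cosine similarity, the standard closeness measure in embedding space. The embeddings and labels below are invented for illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: values near 1.0 mean pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented embeddings for content the model has "seen before".
seen = {
    "cooking recipes":  [0.9, 0.1, 0.0],
    "python tutorials": [0.1, 0.9, 0.2],
    "tax law":          [0.0, 0.2, 0.9],
}

# An invented embedding for the user's question.
question = [0.2, 0.85, 0.3]

# The question lands closest to the content it most resembles.
closest = max(seen, key=lambda k: cosine(question, seen[k]))
print(closest)  # python tutorials
```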
Where it gets cool/scary, is this output is then fed back into the model to predict the next token, and the next, and so on.
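That feedback loop can be sketched with a toy bigram model. It only conditions on the last token, whereas an LLM conditions on the whole context, but the generate-append-repeat loop is the same shape. The corpus is a single invented sentence:

```python
import random

# One invented sentence stands in for the training data.
corpus = "the brown fox jumped over the lazy dog".split()

# "Train" a bigram model: record which word follows which.
bigrams: dict[str, list[str]] = {}
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams.setdefault(prev, []).append(nxt)

def generate(start: str, steps: int) -> list[str]:
    out = [start]
    for _ in range(steps):
        candidates = bigrams.get(out[-1])
        if not candidates:
            break  # nothing ever followed this token in training
        # Predict the next token, then feed it back in as context.
        out.append(random.choice(candidates))
    return out

print(generate("the", 5))
```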
LLM Context Window Explained
To understand why context windows matter, you need to appreciate why LLMs are so powerful. In some ways, the prediction process is very much like following a path down a mountain.
Imagine you’re standing at the top. With no context, you can go in any direction. You’d probably end up heading away from where you parked your car. This is the situation when you give an LLM no context. The LLM will go gallivanting off in any which way.
Aside: Typically LLMs are provided with hidden system prompts to form a baseline level of service. So they’re never truly starting from scratch.
Instead, if you tell your legs to start following the path you just came up, then you’re much more likely to end up back near your car. At least you’ll be on the right side of the mountain. As you get further down, you gain momentum; looking back at your trajectory down the hill, you can project the path forward. It might zigzag, but it’s going in the right direction.
Your next step, or at least the prediction of your next step, is based upon your trajectory down the hill and what you can see around you. If you can see a path, you should probably take that. But you don’t have to, you could go off piste.
This is what LLMs are doing in high-dimensional, embedded space. They’re moving around and are given momentum via the context. They are predicting the next step based upon the well-trodden path of all of the internet’s knowledge.
Providing context makes it easier for the LLM to predict the next best word. In this analogy, with no context you start at the top of the mountain and could go in any direction. As you add more context you can predict easier and move faster.
Momentum and well-trodden paths – what does this mean? “Well-trodden paths” is an analogy for natural language. Given “the brown fox jumped over the lazy”, what is the most common next word? This suffices for most business problems, because that’s what you’re trying to do: answer questions correctly, classify things as they should be. It only becomes a challenge in domains that aren’t well sampled on the internet – highly technical or legal domains, for example. There, there isn’t a well-trodden path to follow, because the training process hasn’t seen any paths. You can mitigate this effect by providing more context, but the problem is really due to the LLM’s lack of internalized knowledge. That’s not what we’re talking about here.
Aside: Many industry problems are based on the premise that you’re trying to predict the most common next thing. But to be pedantic, you could do exactly the opposite. What is the least common, worst next thing? When I typed this into ChatGPT, “Tell me what is the least probable word for the following sentence. The brown fox jumped over the lazy” it waffled and struggled: “… the least probable word to follow would be a word that is typically not associated with foxes or the context of the sentence… a word like “nebula” or “algorithm” would be highly improbable in this context.” It hasn’t actually answered the question, probably because it’s statistically impossible.
Momentum is provided by the context. We’re trying to push the LLM to move in the right direction by giving it relevant context, even if it hasn’t fully sampled this trajectory in the wild. Think of the context as a soft guard rail that constrains the space which the LLM can travel. The more precise you can be, the better the context, the better the final result.
LLM Context Window Comparison
With all of this in mind, let’s look at a few popular models and see how big their context windows are. This table is correct as of December 2023.
| LLM | Context Window Size (Tokens) |
|---|---|
There are two general themes that appear from this table. The first is that older models tend to have smaller context sizes. The second is that open-source LLM context windows tend to be smaller than commercial ones.
Note that some models, like Mistral, have a sliding window, which means you can pass in more than the stated number of tokens, but tokens outside the window aren’t directly attended to in the prediction. For the purposes of these articles, assume that we are including lots of context because it’s necessary for the application.
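A minimal sketch of the simplest way to fit conversation history into a fixed window: always keep the system prompt, then truncate the oldest history. The token lists here are placeholder strings standing in for real token sequences:

```python
def fit_context(system: list[str], history: list[str],
                window: int) -> list[str]:
    """Always keep the system prompt; fill the remaining budget
    with the most recent history, dropping the oldest first."""
    budget = window - len(system)
    if budget <= 0:
        return system[:window]  # the system prompt alone fills the window
    return system + history[-budget:]

turns = ["q1", "a1", "q2", "a2", "q3"]
print(fit_context(["sys"], turns, 4))  # ['sys', 'q2', 'a2', 'q3']
```

Dropping the oldest turns is crude – later articles in this series cover smarter strategies – but it guarantees the input always fits.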
Calculating Token Sizes for LLMs
Finally, we’re ready to state the problem definition. How can we provide enough of the right context to ensure that the LLM provides the best answer?
The first problem you’ll hit is “how many tokens does this text use up?” Tokens are not words. They are usually parts of a word. The tokenizer is trained so that it can represent most of the language by splitting words into representative codes, or tokens.
Another article will provide an explanation of how to calculate token lengths. But for now assume that you run the text through the tokenizer and count the length of the array that comes out of the other side.
How Many Tokens Do You Need?
Once you’re able to calculate the number of tokens, you then need to figure out how much context you need to solve your problem. You do this by establishing a metric that quantifies how good a result is. If you’re feeling lazy, a qualitative “which answer is better?” is good enough when you’re hacking around.
Many tasks can be performed with little context. But undoubtedly you’ll eventually come across a problem that requires more. For example, if you’re trying to answer domain or business specific questions, you need to pass that information into the context. Corporate databases are huge and will probably not fit.
When you’re working at scale, à la Big Data, it’s unlikely to fit.
Sometimes you might not even have control over the context data. The data might come from an external resource like the internet. How many tokens are there on the internet?
LLM Performance Considerations
Granted, some of the newer models have massive or mutating context windows that allow a lot of wiggle room. But in general, more context means slower response times. Everything from the transfer time of the data to the complexity of the problem is proportional to the length of the context. So even if your context is small enough to fit within your chosen LLM, it is still worth attempting to reduce the size of the context for performance reasons.
Using just one of the proposed strategies can reduce costs by a factor of 10 and reduce latencies by a factor of 2. Combine multiple strategies and you can probably raise these numbers by another order of magnitude.
| Model / Benchmark | Using Compression Strategies |
|---|---|

Inference costs per 1,000 samples in dollars for various datasets. From https://arxiv.org/abs/2310.06839.
Context Window Overheads
Just before we move on, remember that there are overheads that you must also take into account. As I mentioned earlier, system prompts eat into the context window budget. These include system prompts that you design, but also the system prompts that are inherent to the model.
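You can account for these overheads with simple arithmetic: the window must hold the system prompt, your context, and the generated response. All the numbers below are hypothetical:

```python
def available_context(window: int, system_tokens: int,
                      response_reserve: int) -> int:
    """Tokens left for your own context after subtracting the
    system prompt and space reserved for the model's response."""
    remaining = window - system_tokens - response_reserve
    if remaining < 0:
        raise ValueError("overheads exceed the context window")
    return remaining

# Hypothetical numbers: a 4,096-token window, a 500-token system
# prompt, and 1,000 tokens reserved for the answer.
print(available_context(4096, 500, 1000))  # 2596
```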
The performance of the tokenizer is also important. Using the tokenizer provided with the LLM is a must, but it is likely to be sub-optimal for your specific domain. Parts of words that are deemed unimportant in general might be vital for your application: the difference between “customer” and “customers,” for example.
LLM context windows are used to input relevant information. They are often very small. But even when they are not, it is still wise to reduce the amount of context required, to improve cost and latency. Current open-source models use context window sizes in the thousands of tokens. GPT-4 leads the way commercially with 128 thousand tokens.
In part 2 of this introduction to LLM context windows, I will summarize the strategies used to deal with larger datasets. Later in the series we will provide examples and code.
Frequently asked questions
What is a context window?
Context windows are the inputs to large language models. They provide the LLM with the context it needs to perform tasks.

What is long context?
Long context refers to the length of the text that is passed to the input of a large language model.

What is the context limit of an LLM?
The context limit of an LLM depends on the model. Smaller models start from a few thousand tokens (approximately 1,000 words), up to 128 thousand tokens for GPT-4.

What counts towards the context window?
The context window consists of all prompts added together. That includes system prompts, user context, and any other extra information that you wish to pass to the LLM.

What problems do small context windows cause?
The most obvious problem is that you might not be able to fit all the context you need to be able to answer your users’ questions well.

Can the context window be expanded?
It can’t be expanded without changing the underlying model architecture, but you can work around the size limitation with the strategies outlined in these articles.

How do LLMs decide which context to keep?
They don’t. You need to specify a strategy or algorithm to decide which data to keep.

How do context limits affect training?
During training, LLMs might not be able to observe all the relevant context necessary to learn high-level abstractions.

What happens if you exceed the context window?
The LLM might not be able to provide a full answer.