Are There Any New Research Developments Aimed at Overcoming Context Window Limitations in LLMs?

Many developments are based upon old tricks, but researchers are beginning to automatically incorporate context window optimization into the LLM pipeline for ease of use. For example, leveraging the amount of self-entropy in a context window is a semi-automatic method.

How Does the Transformer Architecture Inherent in LLMs Influence the Design and Constraints of Context Windows?

A transformer architecture is designed to work with a fixed input length so that it can leverage the context and predict a token in a single calculation. To develop a model that could truly handle any context length, an iterative approach would be required (like recurrent neural networks).

Are There Theoretical Limits to Context Window Sizes Based on Current Hardware, and What Are the Implications for Future Scalability?

There is no theoretical limit, but it is limited by hardware availability and cost. Performance is related to hardware requirements logarithmically, so at some point the minor performance gain is not worth the huge extra cost.

What Are the Computational Trade-Offs Involved in Managing Larger Context Windows Within LLMs?

LLMs can take advantage of sparsity when tokens are not present. But if they are present, the LLM will inspect them, which requires memory and computation time. Therefore there is a positive correlation between the size of the context and the resources/time required to predict a result.

Could Alternative Data Structures to the Standard Token Array Provide More Efficient Context Window Management?

If you have a domain where the tokenizer isn’t capable of accurately interpreting words, then a different tokenizer might improve the performance of your LLM.

How Do Context Windows Handle the Contextuality of Language That Spans Beyond Their Fixed Size, Especially in Nuanced Text Like Literature or Legal Documents?

They use one of the strategies outlined in this article.

Is It Possible to Design an Adaptive Context Window That Adjusts Its Size Based on the Complexity or Demands of a Given Task?

It might be possible, but LLMs are already able to take advantage of sparsity, so simply not including tokens amounts to the same thing.

How Does the Limitation of Context Windows Affect the LLM's Ability to Understand and Generate Coherent Narratives or Arguments?

LLMs can only coherently answer questions if they have the data to be able to do so. Much of that information may already be trained into the model, but local, contextual information is not.

What Are the Latest Advancements in Memory Management Techniques That Could Potentially Enhance or Replace the Current Context Window Approach in LLMs?

Check out this article to find links to recent papers.

The Problem of Big Data in Small Context Windows (Part 2)

In part 1 of this series I introduced the challenges of working with small context windows of large language models. To recap, the challenges are:

You often have relevant data that is larger than the context window size.
Too much context increases latencies, costs more, and consumes more resources.

Before reading on, you may also find it useful to brush up on how LLM tokens are calculated, since every strategy below ultimately trades tokens for accuracy or cost.

Strategies for Using Big Data in LLMs

In this article I present a summary of the strategies that you can use to overcome these challenges. For now, I will only summarize each solution. Subsequent articles will provide practical examples and in-depth explanations.

Strategy 1: Use a Model With a Bigger Context Window

One of the simplest solutions is to find a model with a larger context window. You saw in the table above that there is a large disparity in context sizes. The newest iteration of GPT-4, for instance, has a whopping 128k token context window. This might be large enough for your application, but remember that tokens cost both money and latency. So you might still want to consider alternative approaches.

Strategy 2: Sliding Methods

We use the term sliding methods to denote any technique that slides over the data. In signal processing, sliding windows are used to apply a function over a small subset of the data. Convolutions and rolling averages are good examples. A sliding window can be thought of as something similar to a context window, and therefore the same ideas can be applied.

Chunking

The simplest sliding method is to run the generation against a very small chunk of data. Chunking, in the natural language processing sense, is typically performed on a sentence or paragraph level. You can split your text into chunks and then submit each chunk to the LLM with the same question.

This works best when the question doesn’t require any global context about the document. For example, if you’re asking questions like “are there any people mentioned in this text” (named entity recognition) this works well because it doesn’t require any information that would be contained elsewhere in the document. However, questions like “provide an explanation of this rule” would not work well, because another part of the document could conflict with the current chunk.

Figure: Chunking text involves splitting documents into discrete blocks of text, or chunks.
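To make the idea concrete, here is a minimal sketch of paragraph-level chunking. It assumes paragraphs are separated by blank lines, and `ask_llm` is a hypothetical callable standing in for whatever client you use to call your model.

```python
import re


def chunk_paragraphs(text: str) -> list[str]:
    """Split text into paragraph chunks on blank lines, dropping empty chunks."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def ask_each_chunk(text: str, question: str, ask_llm) -> list:
    """Submit the same question alongside every chunk.

    ask_llm is a hypothetical function (prompt -> answer); swap in your own client.
    """
    return [ask_llm(f"{question}\n\n{chunk}") for chunk in chunk_paragraphs(text)]
```

Each call is independent, which is exactly why this only suits questions that need no global context.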

Sliding

Chunking can be inefficient when chunk sizes are small. The latency of the request and the response can add a significant overhead when running at scale.

You can minimize the latency overhead by maximizing the number of words in the chunk and then sliding over the chunks with some overlap. The overlap ensures that any context that is missed because a sentence or paragraph was chopped in half is captured by the next chunk.

Figure: Sliding jumps over the document, like chunking, but each chunk overlaps with the previous one.
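A rough sketch of sliding with overlap might look like the following. It works on a list of words for simplicity; in practice you would probably count tokens instead, and the `size` and `overlap` values are yours to tune.

```python
def sliding_windows(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Return windows of up to `size` words, each sharing `overlap` words with the last."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        # Stop once a window reaches the end, so no redundant tail window is emitted.
        if start + size >= len(words):
            break
    return windows
```

For example, ten words with `size=4` and `overlap=1` produce three windows, where the last word of each window is repeated as the first word of the next.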

Strategy 3: Filtering Methods

Instead of sending the full context to the LLM, which may be larger than the allowable size, you can filter out unnecessary context. There’s no point wasting tokens on context that is irrelevant.

Simple Filtering

The first strategy you should try is using a simple, heuristic filter. The benefit of this approach is that it is easy to understand and implement. It is also fast to execute. The downside is that the filter may not be expressive enough to efficiently and precisely remove context.

Some examples of simple filtering are:

Keyword filtering
Structured filtering (e.g. only send a small part of the structured data)
Length filtering (e.g. truncate context to x amount of tokens)
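As a minimal sketch, keyword and length filtering can be as small as the two functions below. Whitespace-separated words are used as a crude stand-in for tokens, which is an assumption; a real tokenizer will count differently.

```python
def keyword_filter(paragraphs: list[str], keywords: list[str]) -> list[str]:
    """Keep only paragraphs that mention at least one keyword (case-insensitive)."""
    wanted = {k.lower() for k in keywords}
    return [p for p in paragraphs if any(k in p.lower() for k in wanted)]


def truncate_words(text: str, max_words: int) -> str:
    """Truncate to the first max_words whitespace-separated words (a rough token proxy)."""
    return " ".join(text.split()[:max_words])
```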

Lexical filtering

Natural language processing has an extensive library of feature engineering tools. spaCy is one of the most celebrated libraries available and is highly flexible. You can use NLP libraries to develop filters to remove unnecessary parts of the text.

Some examples of lexical filtering are:

Removing parts of speech that have low informative power
Removing specific tags or entity types that are not relevant to your problem (e.g. names, places, etc.).
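The sketch below illustrates the idea with a hand-written list of low-information words. This is a deliberately simplified stand-in: an NLP library would give you real part-of-speech tags to filter on, and the `LOW_INFO_WORDS` set is purely illustrative.

```python
import re

# Illustrative only: a tiny stand-in for filtering on real part-of-speech tags.
LOW_INFO_WORDS = {"a", "an", "the", "of", "to", "is", "are", "it", "in", "on", "and", "that"}


def drop_low_information_words(text: str) -> str:
    """Remove words from LOW_INFO_WORDS, keeping other words and punctuation."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = [t for t in tokens if t.lower() not in LOW_INFO_WORDS]
    return " ".join(kept)
```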

Retrieval based filtering

One common retrieval pattern is to submit a user intent along with institutional context that provides the answer. For example, when querying a business catalogue, you would send the query along with the most relevant items in that catalogue. It is often referred to as retrieval augmented generation (RAG).

This can be thought of as a filtering technique because it removes the need for the vast majority of the context.

RAG does, however, introduce an additional dependency, and the final results depend greatly on the search relevancy. But the prevalence of search, and the industry's experience with it, means this technique is a very popular form of knowledge augmentation.
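Here is a toy sketch of the retrieval step. To stay self-contained it ranks documents by cosine similarity over simple word counts; a production system would more likely use learned embeddings and a search engine or vector database, so treat the scoring here as an assumption.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-cased word counts, used here as a crude embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Combine the user intent with only the most relevant context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Only the top `k` items ever reach the prompt, which is what makes this a filtering technique.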

Strategy 4: Compression Methods

Compression methods take a piece of text and attempt to reduce its size. You can achieve this by filtering, but compression tends to view the problem as attempting to retain some semblance of the original, at least to an outside observer. In general there are two common methods of compression, lossy and non-lossy, but newer research methods are attempting to perform what can be described as minification.

Lossy Compression

Lossy compression is perhaps the most common. This is where a function or a model summarizes the key points from a text. This summary is then sent to the LLM along with the intent. This is how the “auto-summarizer” features work in online search engines.
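As a rough illustration, the sketch below performs a very naive extractive summary by keeping the sentences whose words are most frequent in the text. This is an assumption made to keep the example dependency-free; in practice you would more likely ask a model to write the summary.

```python
import re
from collections import Counter


def summarize(text: str, max_sentences: int = 2) -> str:
    """Keep the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence: str) -> float:
        # Average document-wide frequency of the words in this sentence.
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(s for s in sentences if s in top)
```

The result is lossy by design: whatever does not make the cut never reaches the LLM.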

Non-Lossy Compression

As an engineer, I seem to have a natural fondness for brevity. I don’t like to waste people’s time with superfluous fluff. My wife is often shocked at how curt some of my emails are. I apologize to anyone that has been on the receiving end of one!

Because of the verbosity of language, it is often possible to rewrite text with no loss of informative content. For example, you could rewrite notes as bullet points and retain all the information. Even if you remove every other word in this sentence, you can still probably get the gist. Let’s try!

of verbosity language, it often to text no of content. example, could notes bullet and all information. if remove other in sentence, can probably the. try!

Ok that’s a bit extreme, but you get the point. A simple example of this in action is to ask an LLM to rewrite the content in bullet form. More sophisticated users might train a dedicated attention model to learn the most important parts of text in their domain based upon information gain, or fine-tune with a low-rank adapter.

Minification

This form of compression is a bit of an outlier. Some new approaches take the idea of non-lossy compression and use well known information scoring techniques to quantitatively specify how important parts of individual texts are. This is another form of non-lossy compression, but I feel it is more akin to minifying HTML or CSS content in web compiler pipelines.
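To give a flavour of information scoring, the sketch below assigns each word a self-information score, $-\log_2 p(\text{word})$, and keeps only the highest-scoring words. The probabilities come from word counts in a reference text with add-one smoothing, which is my simplifying assumption; published approaches typically take these probabilities from a language model instead.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def self_information(tokens: list[str], reference_counts: Counter) -> list[float]:
    """-log2 p(token), with add-one smoothing so unseen tokens score highest."""
    total = sum(reference_counts.values())
    vocab = len(reference_counts)
    return [-math.log2((reference_counts[t] + 1) / (total + vocab + 1)) for t in tokens]


def minify(text: str, reference_text: str, keep_ratio: float = 0.5) -> str:
    """Keep roughly the top keep_ratio of tokens by self-information, in order."""
    tokens = tokenize(text)
    if not tokens:
        return ""
    scores = self_information(tokens, Counter(tokenize(reference_text)))
    n_keep = max(1, round(len(tokens) * keep_ratio))
    threshold = sorted(scores, reverse=True)[n_keep - 1]
    # Ties at the threshold are all kept, so the output can slightly exceed n_keep.
    return " ".join(t for t, s in zip(tokens, scores) if s >= threshold)
```

Common words carry little information and are dropped first, while rare, domain-specific words survive.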

Strategy 5: Efficiency Methods

The final selection of techniques is what I like to call efficiency methods. These are techniques that can be applied holistically. This section is a catch-all for techniques that don’t directly alter the context.

Better Prompts

It is perhaps obvious, but if your application is heavily reliant on prompts, then the quality and performance of the prompt has a huge bearing on the required context.

For starters, remember that the system prompt forms part of the context budget. So sharpen that up. If your prompt has the power to call external tools or functions, then make sure that the results of those calls are succinct. And make sure that they are only called when absolutely necessary.

How the prompt uses the context might be important. If you can embed or inline simple or obvious information then you might not need to spend effort parsing and extracting the bulk of the context. This works especially well when combined with other techniques like filtering.

You can also work on the formatting of the output, making it more concise or even structured. This will reduce latency and memory usage.

Switch Prompts for Models

It is incredibly tempting to shove all functionality into a prompt, but while doing so may help expediency, it’s probably not the best use of an LLM. Heresy!

AI has decades of experience of building language algorithms and models that can perform feats of amazing dexterity. A lot of the time, you really don’t need to use an LLM to do what you’re trying to do.

If you find yourself trying to do obvious machine learning things like classification or any kind of analysis, stop, and switch to a traditional ML model like gradient boosting, SVMs, MLPs, and the like. If latency or resource usage is of any concern at all, I strongly recommend looking at non-LLM approaches.

Practically speaking, I would normally use an LLM to prototype an idea and then, if it has legs, convert that into a bona fide ML model. I’d even use an LLM to generate the training data!

Allow the LLM to Call Databases

The idea here is that you can store parts-of-context and past interactions in a vector database, then either retrieve that reference at prompt-time or allow the LLM to query the database itself. The embeddings themselves are often called contextual embeddings, because they can contain prior interactions or other information directly related to the current context. This is essentially a rehash of the retrieval based filtering, except it is used as a dynamic source of new information if the LLM requires it. This is the inspiration for all of OpenAI’s “functions” in ChatGPT.

This is a powerful pattern. You’re giving the LLM the ability to query for knowledge. But releasing control places more emphasis on the prompt, ensuring that you make efficient use of the query engine. You wouldn’t want the LLM to query the database for all data all of the time, which would be the same as including all data in the prompt. But get it right and this can be a very elegant, simple solution that doesn’t need updating often.

Out of all the strategies listed here, this is probably the most exciting, but also the hardest to keep control of. Like a new puppy, you’ll need to keep it on a tight leash.
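One way to keep that leash tight is to cap both how often the LLM may query and how much each query returns. The sketch below assumes the model emits a JSON request with a `query` field, and `search` is a hypothetical function over your own database; neither reflects any specific vendor API.

```python
import json


def make_tool_runner(search, max_calls: int = 3, max_chars: int = 500):
    """Wrap a search function with a call budget and a cap on result size.

    search is a hypothetical callable: query string -> list of result strings.
    """
    calls = {"n": 0}

    def run_tool(request_json: str) -> str:
        request = json.loads(request_json)
        if calls["n"] >= max_calls:
            # Tell the model to stop querying rather than silently failing.
            return "Query budget exhausted; answer with what you have."
        calls["n"] += 1
        results = search(request["query"])
        # Truncate so a broad query cannot flood the context window.
        return "\n".join(results)[:max_chars]

    return run_tool
```

The text returned by `run_tool` is what you would append to the conversation before asking the model to continue.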

Best Practices

Now that I have outlined individual strategies for dealing with larger contexts, let’s take a more holistic look at how you would implement these. The goal of this section is to provide a couple of scenarios and walk you through the design process. But first, we need some example scenarios.

Here at Winder.AI we predominantly build AI products for other businesses, so we really need to get into the mind of the user. In my experience I’ve seen two broad types of customer. The first is a hacker, an entrepreneur, a consultant, someone who is building a proof of concept. The second is a long-term professional user, thinking about how they can leverage language models in an enterprise environment. The different underlying requirements of these two personas lead to stark engineering contrasts.

The Entrepreneur

Building a proof of concept (POC) is a great way to de-risk both the technical feasibility and the market viability of a product. Many POCs are thrown away or are rewritten, but they are a very useful tool to help drive product direction.

The overall aim of a POC is to tease out the key risks in the shortest amount of time. Robustness, testing, and operational concerns all take a back seat. The fastest route to delivery is the way.

These types of applications are often LLM wrappers, which leverage the underlying power of a foundation model and expose a curated experience.

To begin with, context limits might not be a problem. But eventually, when you start including contextual data to improve or implement your service, the API calls to the LLM will slow down, it will start to cost more, and finally you will see errors returned saying you’ve exceeded the limit. You should now consider one or more of:

Switching to a model with a larger context
Simple filtering
Chunking

In that order. These are the fastest ways to squeeze your bed into a box truck. These changes will have little effect on your output.

Once you start to move past the POC phase, you will probably want to start optimizing the context for latency, cost, and stability. At this point you should consider:

Better prompts
Lossy compression
Switch prompts for models

Finally, when you’re optimizing for production use, you should start thinking about high-level architectural improvements like:

Using local (or co-located) LLM models
Caching
Fine-tuning

Bear in mind that you need a reasonable level of traffic/data for these to make sense.

The Enterprise

The enterprise customer is typically expected to deliver a more robust service that serves more users and leverages more data, principally due to the fact that the organization is larger. But there are often non-functional IT requirements as well.

One of the most common use cases at the moment is to expose large amounts of proprietary data in an interface that sounds more natural. For example, allowing staff to “talk to” their Sharepoint, database, internal wiki, CRM, etc. This use case has its challenges (call us if you’d like to know more!), like when people try to ask for analysis, but it is a great use of the technology.

The interesting thing in this scenario is that it is highly likely that you have way too much data right from the off. It’s like trying to move the Bed of Ware that we met in part 1. You have to deal with the context issue immediately.

Thankfully, your data is probably already quite well chunked. If it sits in a database you have records. If it’s in Sharepoint, you have documents. You might want to perform further chunking, but usually it’s ok to start with the high level structure of whatever the repository exposes. And you can start with retrieval based filtering.

The challenge with this approach is that you will need to have a pre-computed cache of the embeddings to compare against. This sounds straightforward, but unfortunately it adds an annoyingly fickle dependency. What happens if you change your embedding or tokenizer function? Or if you change your chunking strategy? How do you detect changes to content? Etc.
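One hedged way to tame this dependency is to key every cached embedding on a hash of the content plus the versions of whatever produced it. In the sketch below, `embed` is a hypothetical embedding function and the version strings are labels you would maintain yourself.

```python
import hashlib


def cache_key(content: str, embedder_version: str, chunker_version: str) -> str:
    """Key that changes whenever the content, embedder, or chunking strategy changes."""
    payload = "\x1f".join([embedder_version, chunker_version, content])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """In-memory cache; a real system would persist this alongside the vectors."""

    def __init__(self, embed, embedder_version: str, chunker_version: str):
        self._embed = embed  # hypothetical: text -> vector
        self._versions = (embedder_version, chunker_version)
        self._store = {}

    def get(self, content: str):
        key = cache_key(content, *self._versions)
        if key not in self._store:
            # Cache miss: the content is new, or it changed since it was embedded.
            self._store[key] = self._embed(content)
        return self._store[key]
```

Changed content or a bumped version simply produces a new key, so stale embeddings are never served, although you still need a separate process to clean up old entries.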

The alternative strategy is allowing your LLM to query the repository directly and consume the result. This works really well with content that is already well structured (e.g. an API) or already has a good search capability (e.g. document databases). All you need to do is give the LLM the ability to call it. Using this method it doesn’t matter if anything changes with the underlying data, because it’s computed live. The downside is that it can’t take advantage of the internalized knowledge provided by the LLM.

Once one of these approaches is working, then you might want to consider low level optimizations such as compression methods and other forms of contextual filtering. High level optimizations such as co-locating LLMs, streaming, caching, and so forth are also appropriate.

The Future of LLM Context Windows

In the near future the context window size will keep increasing, to the point where it is no longer a practical limit. GPT-4 is already there. I haven’t managed to find a POC-type use case that requires 128k tokens yet. Open source models are lagging behind but we expect them to catch up within the next six months or so.

But that doesn’t mean you don’t need to worry about the number of tokens being used. As mentioned earlier, it still takes time (GPU) to process that data and storage (RAM) to buffer it. Both of these lead to increased latencies and cost. This is an obvious area for research.

Indeed, recent research has already shown that GPT-4 is overkill for the vast majority of queries. One interesting deployment architecture involves a small student LLM that acts like a smart cache and sits in front of a parent foundation model. If it thinks it can answer a question itself, it does so, much more efficiently. This approach is perfectly suited to situations where you might consider building your own dedicated model, for classification purposes, for example.
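The routing logic behind such an architecture can be sketched in a few lines. Here `student` and `teacher` are hypothetical callables, and the assumption that the student returns a confidence score alongside its answer is mine; how you obtain a trustworthy confidence is the hard part.

```python
def cascade(question: str, student, teacher, threshold: float = 0.8):
    """Answer with the small model when it is confident, else fall back to the large one.

    student: hypothetical callable, question -> (answer, confidence in [0, 1])
    teacher: hypothetical callable, question -> answer
    Returns (answer, name of the model that produced it).
    """
    answer, confidence = student(question)
    if confidence >= threshold:
        return answer, "student"
    return teacher(question), "teacher"
```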

The streaming architectures that are prevalent today are surely the next latency barrier to fall. People have already begun experimenting with different model architectures with the aim of improving efficiency with reasonable results. And it is already possible to batch multiple generation requests together, rather than one at a time, to improve operator efficiency. This optimizes utilization for the operator, but doesn’t really help the user. User requests can be parallelized already.

The next big jump will come from batching token generation for individual requests. I.e. rather than recursively generating the next token, generate 10, or 100 in parallel. One hot off the press library called lookahead decoding is pointing in this direction, reporting approximately 2x speed ups. But I predict that we should see something dramatic in this space in 2024. Researchers love tinkering with different model architectures and someone is surely going to stumble on something eventually.

I also expect improved ways of integrating with data repositories and improving contextual quality. For example, it’s quite likely that the paradigm of allowing LLMs dynamic access to data sources is going to continue. Recent research tends to focus on using text-to-language techniques to generate SQL and Python and the like to subsequently interrogate data. There’s no reason why this emerging field could not be expanded to arbitrary APIs or DSLs, provided you have a viable sandbox technology to run the code.

It’s already been shown that better quality, clean data leads to better performing models. This also applies to downstream data repositories so I expect improvements in context curation, like the selective context method that I called minification earlier.

Data Has Gravity (And Potential Energy)

In the first article in this series I started by complaining about the amount of energy required to move a big thing into a small thing. Potential energy is a powerful force.

It reminds me of the old adage that data has gravity, from Dave McCrory. The idea is that data attracts applications, compute, other data, people, etc. If you sign up to this idea, then that means that data also has the potential to have energy, i.e. potential energy. It takes energy to move data, to use data. The energy used is measured in terms of time, resources, emissions.

So optimizing the amount of data that you need to achieve your results pays in more ways than one. It makes applications faster. It makes them cheaper. Practically, it might make the difference between a project being viable and not.

In practice the specific steps that you take will depend on the application, your context, and frankly, your budget. I recommend that you view the strategies outlined in this article as potential optimizations that you could make to improve your application. As with any optimization, avoid doing it prematurely. And weigh it up with competing optimizations required in other parts of the stack.

If you do decide to do it, then make sure you have the ability to measure the effectiveness of the change you are about to make. At Winder.AI we often see people struggling with the UX of LLMs, latency in particular, but it’s surprising how often the ability to time or trace is neglected. I get that monitoring might fall by the wayside in POCs, because if you have no users then there’s no point measuring lots of zeros. But monitoring becomes very important very quickly. In the case of LLMs, where we know latency is a big problem, it might even be more important than testing. More heresy!
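Timing does not need heavy tooling to get started. The sketch below records wall-clock durations per named step into a plain dictionary; the step names and the `sink` dictionary are illustrative, and in production you would likely send these numbers to your tracing or metrics system instead.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(name: str, sink: dict):
    """Record how long the wrapped block takes, in seconds, under sink[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed calls are still measured.
        sink.setdefault(name, []).append(time.perf_counter() - start)
```

You would wrap each stage separately, for example `with timed("retrieval", timings): ...` and `with timed("llm_call", timings): ...`, and then compare the totals before and after each optimization.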

How You Can Help

As always, thank you for making it this far. If you have any thoughts or questions, then please do get in touch. If you’d like to help me, then I’d appreciate it if you could share this with your colleagues or network.
