Intro to Vision RAG: Smarter Retrieval for Visual Content in PDFs

by Dr. Phil Winder, CEO

As visual data becomes increasingly central to enterprise content, traditional retrieval-augmented generation (RAG) systems often fall short when faced with richly visual documents like PDFs filled with charts, diagrams, and infographics. Vision RAG is a cutting-edge pipeline that leverages vision models to generate image embeddings, enabling intelligent indexing and retrieval of visual content.

In this session, you’ll explore the state of the art in visual RAG, see a live demo using open-source tools like vLLM and custom Python components, and learn how to integrate this capability into your own GenAI stack. The presentation will also highlight Helix, our secure GenAI platform, showcasing how Vision RAG fits into a scalable, enterprise-ready solution.

Download the slides and the Vision RAG demo Python code.

Below you can find a written description of the presentation.

Introduction to Vision RAG

Retrieval-augmented generation (RAG) has proven useful in production business applications because it helps language models generate better answers with minimal latency. But most businesses are drowning in slide decks, infographics, and PDFs, and this content is simply not being used in current RAG implementations. That leads to hallucinations and just plain poor performance.

This is why vision RAG (VRAG) blew my mind. I was asking questions of some finance text and expected the same dull, average hallucinations. But VRAG gave a precise, exact answer to my pinpoint question. It was at this point that I realised I needed to write this up.

Recap

RAG is a process of retrieving information for the purposes of enriching a language model’s prompt. There are a variety of RAG architectures but the most common is where text is converted into chunks, indexed via a text embedding model, and inserted into a database for future retrieval. At query time, the user’s query is embedded with the same model and that value is compared to what is in the database. The results are then passed into the context of the language model.

A depiction of a retrieval-augmented generation (RAG) pipeline
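
If you want to see that recap as code, here is a minimal, illustrative sketch using sentence-transformers and an in-memory index. The model name, the example documents, and the final language model call are all placeholders, not the exact setup from the demo.

```python
# A minimal text RAG sketch: chunk, embed, index in memory, retrieve by
# cosine similarity. Model names and the final LLM call are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["Revenue grew 12% year on year.", "Headcount remained flat."]
chunks = documents  # in practice you would split long documents into chunks
index = embedder.encode(chunks, normalize_embeddings=True)  # (n_chunks, dim)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Embed the query with the same model and return the top-k chunks."""
    q = embedder.encode([query], normalize_embeddings=True)
    scores = index @ q.T  # cosine similarity because embeddings are normalised
    top = np.argsort(-scores[:, 0])[:k]
    return [chunks[i] for i in top]

context = "\n".join(retrieve("How fast did revenue grow?"))
prompt = f"Answer using this context:\n{context}\n\nQuestion: How fast did revenue grow?"
# The prompt is then passed to your language model of choice.
```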

Traditional Methods

When it comes to working with data sources that have visual content, like PDFs with pictures of tables and slide decks, traditional methods attempted to extract information with bespoke pipelines. Optical character recognition (OCR) models would pull text out of images, and further bespoke steps would try to recover the structure. But these pipelines never really interpreted the information the way humans do.

Traditional document ingestion from https://arxiv.org/abs/2407.01449
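
To make that concrete, here is a rough sketch of the classic approach: render each PDF page to an image and run OCR over it. It assumes pdf2image and pytesseract are installed (plus the poppler and tesseract system packages), and it is not the exact pipeline from the paper above.

```python
# A sketch of a traditional ingestion step: render PDF pages and OCR the text.
# Assumes the poppler and tesseract binaries are installed on the system.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("report.pdf", dpi=200)  # list of PIL images

for i, page in enumerate(pages):
    text = pytesseract.image_to_string(page)
    print(f"--- page {i + 1} ---")
    print(text)
    # Layout, tables and charts are largely lost at this point, which is
    # exactly the weakness vision RAG addresses.
```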

Vision RAG

Instead, we can use vision language models (VLMs). VLMs are multimodal language models that have been fine-tuned on visual content. The most impressive are those that have been trained with data that matches your domain. In these examples, I chose a model that was trained on extracting information from PDFs.
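
If you serve such a model behind an OpenAI-compatible endpoint (vLLM can do this), asking it a question about a page image looks roughly like the sketch below. The base URL, model name, and image path are placeholders.

```python
# A sketch of asking a vision language model about a page image via an
# OpenAI-compatible endpoint (for example one served locally by vLLM).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("page_3.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="your-vision-language-model",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What was the operating margin in 2023?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```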

What does “fine-tuned on” mean?

If you gather a dataset of visual elements, like small images of documents, along with representative questions and answers based upon that content, then you can teach a pre-trained large language model to produce answers in the same style. Learn more about the basics of training language models.
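
Purely as an illustration, a couple of training records for that kind of fine-tune might look like this; the field names, image paths, and answers are all made up.

```python
# Hypothetical (image, question, answer) records for fine-tuning a VLM on
# document question answering. Every value here is illustrative.
training_examples = [
    {
        "image": "data/annual_report_p12.png",
        "question": "What was total revenue in 2023?",
        "answer": "Total revenue in 2023 was $4.2bn.",
    },
    {
        "image": "data/sales_deck_slide_7.png",
        "question": "Which region grew fastest?",
        "answer": "EMEA grew fastest, at 18% year on year.",
    },
]
```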

This also applies to embedding models. Given a dataset of images, you can fine-tune an embedding model to produce similar embeddings for two pictures of cats, but different embeddings for a cat and a panda.
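
Here is a sketch of that cats-versus-pandas idea, using CLIP as a readily available stand-in for a document-specialised vision embedding model. The image paths are placeholders.

```python
# A sketch of the cats-versus-pandas idea using CLIP as a stand-in for a
# document-specialised vision embedding model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["cat_1.jpg", "cat_2.jpg", "panda.jpg"]]
inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    embeddings = model.get_image_features(**inputs)
embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)

similarity = embeddings @ embeddings.T
print(similarity[0, 1])  # cat vs cat: expected to be relatively high
print(similarity[0, 2])  # cat vs panda: expected to be lower
```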

This means we can replicate the previous retrieval pipeline, except we switch out the various text models with vision models and alter the indexing pipeline to index images, not text.

A depiction of a vision RAG pipeline.
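
Putting it together, a minimal vision RAG loop might look like the sketch below: render PDF pages to images, embed them, embed the query with the matching text encoder, retrieve the best page, and hand that page to the VLM call sketched earlier. CLIP is again a stand-in; for real documents you would want a model trained specifically for document retrieval.

```python
# A minimal vision RAG sketch: embed page images, retrieve the best page for a
# query, then ask a VLM about that page. CLIP is a stand-in embedding model.
import torch
from pdf2image import convert_from_path
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Indexing: one embedding per page image instead of per text chunk.
pages = convert_from_path("report.pdf", dpi=150)
inputs = processor(images=pages, return_tensors="pt")
with torch.no_grad():
    page_embeddings = model.get_image_features(**inputs)
page_embeddings = page_embeddings / page_embeddings.norm(dim=-1, keepdim=True)

# Retrieval: embed the query with the matching text encoder and compare.
query = "What was the operating margin in 2023?"
text_inputs = processor(text=[query], return_tensors="pt", padding=True)
with torch.no_grad():
    query_embedding = model.get_text_features(**text_inputs)
query_embedding = query_embedding / query_embedding.norm(dim=-1, keepdim=True)

scores = (page_embeddings @ query_embedding.T).squeeze(-1)
best_page = pages[int(scores.argmax())]
best_page.save("retrieved_page.png")
# Generation: pass retrieved_page.png to the VLM call sketched earlier.
```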

Code to Perform Vision RAG

I’ve created a repository that demonstrates the vision RAG process. You’ll need to spin up the vision embedding and language models yourself. Take a look at the video if you want a walkthrough of the code.
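
As a starting point, one way to host the generation side yourself is to serve an open VLM with vLLM and point the OpenAI client at it. The little check below (with a placeholder endpoint) confirms the server is up; the embedding model runs in-process as in the earlier sketches.

```python
# Quick check that a self-hosted, OpenAI-compatible VLM endpoint is running.
# Start one first, e.g. with vLLM:  vllm serve <your-vision-language-model>
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
for m in client.models.list():
    print(m.id)
```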

This was a surprising finding. There is no well-known public API that gives you access to a vision embedding model. And even the vision language models are occasionally broken. I prefer models you can run yourself anyway, but in this case it’s absolutely necessary.

Production Scale

This demonstration code is not production ready and will not scale. Although the techniques are quite simple, hardening such a pipeline is more difficult. What if certain users are not allowed to access the information contained within certain documents? What about the privacy and security of your documents? How do you scale the indexing pipeline? And so on.

If you’re interested in doing this at scale, take a look at one of our products, Helix.ml. We developed Helix to operate in secure, private environments and to solve your LLM hosting woes overnight.

Afterthoughts

Modern language systems need access to more than just text. We’re building solutions for our clients that are able to see information, just like you do. Please contact us to find out more.

  • VRAG systems architecturally align with traditional RAG systems, which makes them easy to adopt.
  • Open source models are preferred here, because no commercial APIs offer vision embeddings.
  • I encourage you to try this with your own PDFs, infographics, and slides, and let me know how it goes.
  • And use Helix.ml for commercial deployments.
