Best LLMOps Tools: Comparison of Open-Source LLM Production Frameworks
by Natalia Kuzminykh, Associate Data Science Content Editor
In the second part of this series, we shift our focus to the operational aspects of deploying open-source LLMs.
In the previous article, we explored how to integrate different frameworks for pipelining; here, we delve into the critical infrastructure components that power these models in production environments. We examine tools that:
- enable efficient serving of LLMs
- orchestrate their deployment
- provide observability for performance monitoring
- offer robust evaluation capabilities
Together, these techniques form the backbone of a successful LLMOps strategy, helping engineering teams to create and manage large models within AI applications more effectively.
Serving Frameworks
Let’s start our conversation with serving frameworks—the tools that help ensure that models are delivered with optimal performance, handling challenges from latency optimization to resource management.
vLLM
- Licence: Apache-2.0
- Stars: 26.8k
- Contributors: 539
- Release: v0.6.1
Our first framework is vLLM. It’s a high-performance inference engine designed to assist with the deployment of computationally intensive LLMs through efficient memory management techniques and optimised algorithms.
While traditional serving approaches often suffer from slow inference times and high memory usage, vLLM is built on the PagedAttention algorithm, which allows attention keys and values to be stored non-contiguously. This approach significantly boosts serving performance, especially in scenarios involving longer sequences.
To maximise hardware utilisation, vLLM also employs continuous batching, which groups incoming requests as they arrive, reducing waiting time and making better use of resources. Additionally, support for reduced-precision formats such as FP16, alongside quantisation, helps minimise memory usage by representing data in fewer bits, resulting in faster computations.
Another key feature of vLLM is its user-friendly APIs, which are compatible with OpenAI models, making it easy for teams to migrate existing setups quickly and seamlessly. For example, below is a brief overview of how it can be used in Python:
from openai import OpenAI

# Adjust OpenAI's API key and API base URL to use vLLM's API server.
# This assumes a vLLM OpenAI-compatible server is already running locally,
# e.g. started with: vllm serve Qwen/Qwen2-VL-7B-Instruct
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to a large language model."},
    ],
)
print("Chat response:", chat_response.choices[0].message.content)
Ollama
- Licence: MIT
- Stars: 89.3k
- Contributors: 306
- Release: v0.3.10
Ollama is an advanced and user-friendly platform that simplifies the process of running large language models on your local machine. With just a few steps, you can set up an open-source, general-purpose model or choose a specialised LLM tailored for specific tasks. It doesn’t matter which operating system you use, as Ollama supports Windows, macOS and Linux, making it accessible to most users.
One of Ollama’s key advantages is its focus on customization and performance optimization. Users can adjust model parameters (such as temperature or context length) and settings to shape the behaviour of LLMs according to their specific needs. The platform efficiently leverages available hardware resources, including CPUs and GPUs, to accelerate model inference and ensure smooth operation, even with large-scale models. Additionally, Ollama operates entirely offline, enhancing data privacy by keeping all computations and data within your local environment.
Beyond running experiments directly in your terminal, Ollama also offers API integration, allowing you to seamlessly embed the locally configured model into your application. For example:
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required, but unused
)

chat_response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to a large language model."},
    ],
)
print("Chat response:", chat_response.choices[0].message.content)
LocalAI
- Licence: MIT
- Stars: 23.3k
- Contributors: 110
- Release: v2.20.1
LocalAI presents itself as an open-source alternative to OpenAI, offering a powerful toolkit that operates without the need for expensive GPUs. It supports a wide range of model families and architectures, making it an ideal choice for anyone eager to experiment with AI while avoiding high cloud-processing costs.
Furthermore, this framework offers versatile APIs that let you explore text generation with models like llama.cpp and gpt4all.cpp, transcribe audio with whisper.cpp, or even generate images using Stable Diffusion. Plus, its design prioritises efficiency, keeping models loaded in memory to enable faster inference and ensure that your AI projects run seamlessly from start to finish.
To start exploring this framework, you would need to install it with the following command:
curl https://localai.io/install.sh | sh
Next, you should download the model from the gallery:
local-ai run hermes-2-theta-llama-3-8b
After this, you can query the model via a simple API call:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "hermes-2-theta-llama-3-8b", "messages": [{"role": "user", "content": "How are you doing?"}], "temperature": 0.1 }'
Orchestration Frameworks
Next, we turn our attention to orchestration frameworks, which are essential for managing how and when your LLMs are deployed. These frameworks take care of scaling, load balancing and automating deployment workflows, allowing you to run your models reliably across diverse environments.
Standard DevOps Tools
Standard DevOps tools like Kubernetes and Docker Compose play a crucial role in the deployment and management of various models:
- Kubernetes is a powerful orchestration platform that automates the deployment, scaling and management of containerized applications, making it ideal for handling complex workloads across multiple nodes.
- Docker Compose simplifies the process of defining and running multi-container Docker applications, allowing developers to set up isolated environments quickly and consistently.
- Virtual Machines offer a more traditional approach, providing full operating system virtualization that can be tailored to the specific needs of an application.
Many of the frameworks we discuss in this article offer support for one of these standard DevOps tools, leveraging their unique strengths to optimise the deployment, management and scaling of LLMs.
BentoML / OpenLLM
- Licence: Apache-2.0
- Stars: 9.7k
- Contributors: 31
- Release: v2.20.1
OpenLLM is a good example of this kind of framework, as it’s a traditional AI platform with Kubernetes helpers that streamline the deployment of LLMs in the cloud.
OpenLLM optimises model serving with advanced inference techniques from vLLM and BentoML, ensuring low latency and high throughput, even under demanding workloads. Unlike local-focused solutions such as Ollama, which struggle to scale beyond low request rates, OpenLLM excels at handling multiple concurrent users, reportedly reaching throughput levels nearly eight times higher than Ollama on similar hardware.
With OpenAI-compatible APIs, OpenLLM allows seamless integration of various open-source models, such as Llama 3 and Qwen2, and provides a built-in chat interface for interactive LLM use.
These capabilities make OpenLLM a robust choice for cloud-based AI applications, delivering both the ease of use of traditional platforms and the advanced performance needed for real-time, multi-user scenarios.
To start, just run the following:
pip install -qU openllm
openllm hello
# Start an LLM server
openllm serve llama3:8b
The server will be available at http://localhost:3000, exposing OpenAI-compatible endpoints that you can call from any framework or tool that supports the OpenAI API.
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{
        "role": "user",
        "content": "Give me a short introduction to a large language model."
    }],
    stream=True,
)
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
AutoGen
- Licence: CC-BY-4.0, MIT
- Stars: 30.8k
- Contributors: 346
- Release: v0.2.35
AutoGen redefines how developers can build and manage AI applications by introducing a versatile multi-agent framework.
At its core, AutoGen works with agents that interact with one another to perform a variety of tasks. These agents can be customised and enhanced with prompt engineering and supplementary tools (e.g. the Google Search API), enabling them to execute code, retrieve information and collaborate to solve complex tasks, autonomously or with human feedback. This approach not only improves the orchestration and automation of workflows involving LLMs, but also maximises their performance while overcoming inherent limitations.
AutoGen’s flexibility supports diverse conversation patterns, from fully autonomous dialogues to human-in-the-loop problem-solving, making it ideal for building next-generation LLM applications. The framework’s agents, such as the Assistant Agent and User Proxy Agent, can be configured to carry out specific functions like coding, reviewing or incorporating human input into decision-making processes.
To install AutoGen, run:
pip install -q pyautogen
Then, you could start building your versatile agent app, for example:
# Establish the API endpoint
import os

import autogen
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"model": "gpt-4", "api_key": os.environ["OPENAI_API_KEY"]}

assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    code_execution_config={"executor": autogen.coding.LocalCommandLineCodeExecutor(work_dir="coding")},
)

# Start the chat
user_proxy.initiate_chat(
    assistant,
    message="Give me a short introduction to a large language model.",
)
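For the multi-agent conversation patterns described above, AutoGen also provides GroupChat and GroupChatManager. The sketch below is a minimal example with a hypothetical coder/reviewer pair:
import os

import autogen
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"model": "gpt-4", "api_key": os.environ["OPENAI_API_KEY"]}

coder = AssistantAgent("coder", llm_config=llm_config)
reviewer = AssistantAgent(
    "reviewer",
    llm_config=llm_config,
    system_message="You review code and suggest improvements.",
)
user_proxy = UserProxyAgent("user_proxy", human_input_mode="NEVER", code_execution_config=False)

# Agents take turns in a shared conversation, coordinated by the manager
groupchat = autogen.GroupChat(agents=[user_proxy, coder, reviewer], messages=[], max_round=6)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user_proxy.initiate_chat(manager, message="Write and review a Python function that reverses a string.")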
API Gateways
API gateways help manage the flow of data between your LLMs and external applications. These gateways not only handle routing and security but also simplify integration, making your models easier to work with and more adaptable to existing systems.
LiteLLM Proxy Server
- Licence: MIT
- Stars: 12.2k
- Contributors: 311
- Release: v1.46.0
The LiteLLM Proxy Server offers a handy way to manage AI model access across various applications. It acts as an intermediary between client requests and numerous LLM providers, such as OpenAI, Anthropic and Hugging Face, providing a unified interface for API interactions.
This proxy server facilitates dynamic switching between different AI models without requiring significant code modifications, making it easier for businesses to work with their AI-driven applications. By providing features like load balancing, logging and monitoring, LiteLLM helps developers manage multiple AI models, ensuring that each task uses the most appropriate model for performance and cost-efficiency.
One of the key advantages of LiteLLM is its ability to enable smart routing, allowing organisations to handle varying levels of demand and prevent service disruptions. This architecture supports scalable deployments, often through containerized environments like Kubernetes.
pip install -q 'litellm[proxy]'
litellm --model huggingface/Qwen/Qwen2-VL-7B-Instruct
#INFO: Proxy running on http://0.0.0.0:4000
To run a REPL to test inference:
litellm --test
Or to run a test using an OpenAI client:
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# The request is sent to the model configured on the LiteLLM proxy (`litellm --model`)
response = client.chat.completions.create(model="gpt-4o", messages=[{
    "role": "user",
    "content": "Give me a short introduction to a large language model."
}])

print(response)
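The load balancing mentioned above can also be driven directly from Python via LiteLLM's Router class. A minimal sketch with two hypothetical deployments registered under the same public model name:
import os

from litellm import Router

# Requests to "gpt-4o" are spread across the deployments listed here
router = Router(model_list=[
    {
        "model_name": "gpt-4o",
        "litellm_params": {"model": "openai/gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]},
    },
    {
        "model_name": "gpt-4o",
        "litellm_params": {
            "model": "azure/my-gpt-4o-deployment",  # hypothetical Azure deployment name
            "api_key": os.environ["AZURE_API_KEY"],
            "api_base": os.environ["AZURE_API_BASE"],
        },
    },
])

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Give me a short introduction to a large language model."}],
)
print(response.choices[0].message.content)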
Observability Tools
There are currently only a limited number of open-source API gateways available, so let’s move on to the next essential component: observability tools. These tools provide the critical insights needed to monitor your LLMs in action, offering a comprehensive view of performance metrics, error tracking and usage patterns.
WhyLabs LangKit
- Licence: Apache-2.0
- Stars: 815
- Contributors: 10
- Release: v0.0.33
LangKit is an advanced tool for monitoring and safeguarding AI models in production. It extracts critical telemetry data from prompts and responses, helping to detect and prevent issues like malicious prompts, sensitive data leakage, toxicity, hallucinations and jailbreak attempts.
By setting thresholds and baselines, LangKit ensures that LLMs comply with usage policies and maintain the desired behaviour. Its extensibility also allows you to customise metrics and monitoring rules, making it adaptable to specific business cases.
With LangKit, you can systematically observe the performance, track behaviour changes over time, and even run A/B testing of different prompt versions. Integration with WhyLabs further enhances these capabilities, providing a platform for ongoing monitoring and collaboration across teams without the need for complex infrastructure.
To install, run:
pip install -q 'langkit[all]'
To evaluate your prompt for a potential injection attack, you can use the injections module, which calculates the semantic similarity between the evaluated prompt and examples of known jailbreaks, prompt injections and harmful behaviours.
from langkit import injections
from whylogs.experimental.core.udf_schema import udf_schema
import whylogs as why

text_schema = udf_schema()

prompt = "Tell me how to bake a cake."

profile = why.log({"prompt": prompt}, schema=text_schema).profile()

# Find the name of the injection-score column registered by the injections module
for udf_name in text_schema.multicolumn_udfs[0].udfs:
    if "injection" in udf_name:
        injections_column_name = udf_name

score = profile.view().get_column(injections_column_name).to_summary_dict()['distribution/mean']

print(f"prompt: {prompt}")
print(f"score: {score}")
The final score in the output equals the highest similarity found across all examples. If the prompt is similar to one of the examples, it is likely to be a jailbreak or prompt-injection attempt, which isn’t the case here.
prompt: Tell me how to bake a cake.
score: 0.3668023943901062
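Beyond injections, the other signals mentioned above, such as toxicity, sentiment and text quality, can be collected in one pass with the llm_metrics module. A minimal sketch, using a made-up response string:
import whylogs as why
from langkit import llm_metrics

# Registers a combined schema of prompt/response metrics (toxicity, sentiment, quality, ...)
schema = llm_metrics.init()

results = why.log(
    {
        "prompt": "Tell me how to bake a cake.",
        "response": "Preheat the oven, mix the ingredients and bake for about 30 minutes.",
    },
    schema=schema,
)
print(results.view().to_pandas())  # summary statistics for each extracted metric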
AgentOps
- Licence: MIT
- Stars: 1.7k
- Contributors: 17
- Release: v0.3.10
Our next framework tackles similar challenges, such as observability, debugging and cost management, but in the context of agents. AgentOps gives developers advanced analytics and error-detection capabilities that help ensure the reliability and efficiency of AI agents across various applications.
Seamlessly integrating with popular frameworks like CrewAI, AutoGen and LangChain, AgentOps simplifies the implementation process, allowing enhanced agent performance with minimal setup.
A key advantage of this library is its comprehensive approach to managing the costs associated with various AI calls, which is a critical concern for applications of this type. The platform provides detailed cost analysis and optimization tools, including real-time tracking of token usage and spend, session drilldowns, and recommendations for prompt engineering to reduce expenses without compromising performance.
pip install -q agentops
pip install -q 'agentops[langchain]'
To start tracking telemetry from your LangChain-based app, you can set up the following code:
import os

from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, AgentType
from agentops.langchain_callback_handler import LangchainCallbackHandler

AGENTOPS_API_KEY = os.environ["AGENTOPS_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

handler = LangchainCallbackHandler(api_key=AGENTOPS_API_KEY, tags=['Langchain Example'])

llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY,
                 callbacks=[handler],
                 model='gpt-4o')

tools = []  # add the LangChain tools your agent should be able to call

agent = initialize_agent(tools,
                         llm,
                         agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION,
                         verbose=True,
                         callbacks=[handler],  # you must pass in a callback handler to record your agent
                         handle_parsing_errors=True)
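AgentOps can also be used without LangChain; after calling agentops.init(), calls made through supported LLM clients are recorded automatically. A minimal sketch, assuming the AGENTOPS_API_KEY and OPENAI_API_KEY environment variables are set:
import agentops
from openai import OpenAI

# Picks up AGENTOPS_API_KEY from the environment if no key is passed explicitly
agentops.init()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Give me a short introduction to a large language model."}],
)
print(response.choices[0].message.content)

# Mark the session as finished so it appears as completed in the dashboard
agentops.end_session("Success")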
Arize Phoenix
- Licence: Elastic (ELv2)
- Stars: 3.5k
- Contributors: 162
- Release: v4.33.2
Arize Phoenix is a platform that provides full observability into every layer of an LLM-based application. By integrating robust debugging, experimentation, evaluation and prompt-tracking tools, Phoenix empowers teams to efficiently build, optimise and maintain high-quality AI-driven systems. In the development phase, Phoenix’s advanced tracing capabilities offer deep insights into the application’s execution flow, aiding rapid issue identification and resolution. Teams can also conduct detailed experiments and even visualise data embeddings to fine-tune search and retrieval processes in RAG-based use cases.
As applications progress into testing, staging and production environments, Phoenix continues to support comprehensive evaluation and benchmarking features. To demonstrate, let’s build a simple LlamaIndex application with Phoenix integration.
To download the library:
pip install -q arize-phoenix
Launch your Phoenix instance using:
import phoenix as px
px.launch_app().view()
Create an endpoint to catch traces:
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from phoenix.otel import register
tracer_provider = register(endpoint="http://127.0.0.1:6006/v1/traces")
LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)
Build a simple application:
from gcsfs import GCSFileSystem
from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from tqdm import tqdm

import phoenix as px
file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
fs=file_system, persist_dir=index_path,
)
Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
queries = [
"Give me a short introduction to a large language model.",
"How do I fine-tune an LLM?",
"How much does an enterprise licence of ChatGPT cost?",
]
for query in tqdm(queries):
    response = query_engine.query(query)
    print(f"Query: {query}")
    print(f"Response: {response}")

# Print the URL of the running Phoenix UI
print("The Phoenix UI:", px.active_session().url)
Evaluation Tools
Now, let’s dive into evaluation tools, which are crucial for assessing the performance, accuracy and reliability of your LLMs. These tools help you test and validate your models, offering the feedback needed to fine-tune them before they go live.
DeepEval
- Licence: Apache-2.0
- Stars: 3k
- Contributors: 57
- Release: v0.21.74
Moving beyond traditional metrics, DeepEval offers a holistic assessment by incorporating a wide array of evaluation techniques that address effectiveness, reliability and ethical considerations.
Its modular design allows users to create customizable evaluation pipelines, much like unit testing in software development, enabling tailored evaluations that suit specific needs and contexts.
A key strength of DeepEval lies in its extensive collection of metrics and benchmarks. It includes over 14 research-backed metrics that cover various aspects of AI performance. The framework also integrates state-of-the-art benchmarks like MMLU, HumanEval and GSM8K to standardise performance measurement across diverse tasks.
Additionally, DeepEval features a synthetic data generator that leverages LLMs to create complex and realistic datasets, facilitating thorough testing across different scenarios.
Install DeepEval with:
pip install -qU deepeval
Then codify each test in a Python script, like this:
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_case():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="Give me a short introduction to a large language model.",
        # Replace this with the actual output from your LLM application
        actual_output="In simpler terms, an LLM is a computer program that has been fed enough examples to be able to recognize and interpret human language or other types of complex data.",
        retrieval_context=["A large language model (LLM) is a deep learning algorithm that can perform a variety of natural language processing (NLP) tasks"]
    )
    assert_test(test_case, [answer_relevancy_metric])
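Tests like this are normally run through pytest or the deepeval CLI, but metrics can also be executed directly in a script with the evaluate helper. A minimal sketch reusing the same kind of test case:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Give me a short introduction to a large language model.",
    actual_output="An LLM is a deep learning model trained on large text corpora to understand and generate language.",
    retrieval_context=["A large language model (LLM) is a deep learning algorithm that can perform a variety of natural language processing (NLP) tasks"],
)

# Runs the metric outside of pytest and prints a summary of the results
evaluate([test_case], [AnswerRelevancyMetric(threshold=0.5)])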
Evidently OSS
- Licence: Apache-2.0
- Stars: 5.1k
- Contributors: 66
- Release: v0.4.37
Evidently OSS offers a diverse suite of tools for evaluating, testing, and monitoring models from validation to production stages. Tailored for data scientists and ML engineers, it supports various data types—including tabular data, text, embeddings, LLMs and generative models—providing a consistent API and a rich library of metrics, tests, and visualisations.
The platform adopts a modular approach with three main components:
- Reports: generate interactive visualisations for exploratory analysis and debugging (see the sketch after this list)
- Test Suites: allow for structured, automated batch checks using customizable conditions
- Monitoring Dashboard: enables continuous tracking of model performance and data quality over time, integrating with tools like Grafana for real-time monitoring
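As an illustration of the Reports component, here is a minimal sketch of a data drift report; the CSV file names are placeholders for your own reference and production datasets:
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Hypothetical reference and current datasets sharing the same schema
reference = pd.read_csv("reference.csv")
current = pd.read_csv("current.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # interactive HTML report for debugging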
Firewall and Guardrails
Lastly, we discuss firewalls and guardrails. These tools help enforce ethical guidelines and compliance standards, protecting your models from generating undesirable outputs and safeguarding your AI applications.
Guidance
- Licence: MIT
- Stars: 18.7k
- Contributors: 70
- Release: v0.1.16
Guidance is a programming paradigm developed by Microsoft that aims to give developers more control over large language models. It lets them seamlessly combine text generation, prompting and logical control structures in a way that mirrors the model’s own text processing.
pip install -q guidance
from guidance import models, select

# Load a local Llama 2 model (the path to the weights file is a placeholder)
llama2 = models.LlamaCpp("path/to/llama2.gguf")

# A simple select between two options
llama2 + f'Do you want a joke or a poem? A ' + select(['joke', 'poem'])

# Output
# Do you want a joke or a poem? A poem
One of its key strengths is the ability to produce structured outputs—such as JSON or Pydantic objects—that strictly follow a specified schema. By enforcing the output format, Guidance enables the LLM to concentrate on content generation while eliminating common parsing issues associated with unstructured text. This is especially useful when working with smaller or less robust language models, which may struggle to produce well-formed hierarchical data due to limited training on source code.
In the context of applications like LlamaIndex, Guidance can be integrated to simplify the creation of structured outputs like Pydantic objects. For example, developers can define data models for complex objects like albums and songs, and use Guidance to generate instances that adhere to these models.
from typing import List

from pydantic import BaseModel

# Import paths may vary slightly between llama-index and guidance versions
from llama_index.program.guidance import GuidancePydanticProgram
from guidance.llms import OpenAI

class Song(BaseModel):
    title: str
    length_seconds: int

class Album(BaseModel):
    name: str
    artist: str
    songs: List[Song]

program = GuidancePydanticProgram(
    output_cls=Album,
    prompt_template_str="Generate an example album, with an artist and a list of songs. Using the movie {{movie_name}} as inspiration",
    guidance_llm=OpenAI("gpt-4o"),
    verbose=True,
)

output = program(movie_name="The Shining")
In addition, Guidance can improve the robustness of query engines within LlamaIndex by ensuring that intermediate responses conform to expected formats. By plugging a Guidance-based question generator into a sub-question query engine, developers can achieve more accurate results compared to default settings.
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.question_gen.guidance import GuidanceQuestionGenerator
from guidance.llms import OpenAI as GuidanceOpenAI

# Define a guidance-based question generator
question_gen = GuidanceQuestionGenerator.from_defaults(
    guidance_llm=GuidanceOpenAI("gpt-4o"), verbose=False
)

# Define query engine tools
query_engine_tools = ...

# Construct the sub-question query engine
s_engine = SubQuestionQueryEngine.from_defaults(
    question_gen=question_gen,
    query_engine_tools=query_engine_tools,
)
Outlines
- Licence: Apache-2.0
- Stars: 8.2k
- Contributors: 102
- Release: v0.0.46
Structured generation involves transforming the raw text produced by an LLM into a predefined format or schema, which is particularly useful when working with structured data. By ensuring that generated text conforms to specific formats like JSON or CSV, Outlines makes it easier to integrate LLM outputs with other systems, automate parsing processes, and enhance the clarity and context of the information presented.
Key benefits of Outlines include the ability to make any open-source LLM return a JSON object that follows a user-defined structure, which is invaluable for tasks like parsing responses, storing data or triggering functions based on the results. It also offers compatibility with vLLM in JSON mode, allowing for the deployment of LLM services that produce structured JSON outputs. Additionally, Outlines enables LLMs to generate text that matches specified regular expressions, ensuring conformity to desired patterns. The library also simplifies prompt management through powerful prompt templating, using Python functions with embedded templates that are populated with argument values when invoked.
pip install -q outlines
import outlines
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?
Review: This restaurant is just awesome!
"""
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
print(answer)  # most likely "Positive"
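The JSON mode mentioned earlier works in much the same way: describe the target schema with a Pydantic model, and Outlines constrains decoding so the output always parses. A minimal sketch, with a made-up Review schema:
from pydantic import BaseModel

import outlines

class Review(BaseModel):
    sentiment: str
    score: int

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# The generator is guaranteed to return a valid Review instance
generator = outlines.generate.json(model, Review)
review = generator("Extract the sentiment and a score from 1 to 5 for: 'This restaurant is just awesome!'")
print(review.sentiment, review.score)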
Conclusions
In summary, deploying open-source LLMs in production environments requires a robust and well-orchestrated operational framework.
The frameworks discussed collectively provide the necessary infrastructure to ensure efficient, scalable and reliable AI applications: high-performance serving solutions like vLLM and Ollama, orchestration tools such as OpenLLM and AutoGen, API gateways like LiteLLM Proxy Server, and observability platforms like LangKit and Arize Phoenix.
Additionally, evaluation tools like DeepEval and Evidently OSS, alongside guardrails such as Guidance and Outlines, play a crucial role in maintaining model performance, ethical standards, and seamless integration with existing systems. By leveraging these open-source tools, engineering teams can effectively implement a comprehensive LLMOps strategy, enhancing their ability to manage large language models and deliver high-quality AI-driven solutions.
Frequently asked questions
LLMOps refers to the practice of operationalising language models for long-term production use. Building upon the principles pioneered by DevOps and MLOps, LLMOps focuses on the unique challenges of deploying, monitoring and maintaining large language models (LLMs) in real-world applications.
The best LLMOps tools are unique to your specific use case and business context. But some of the top open-source LLMOps tools include vLLM, Ollama, LocalAI, OpenLLM, AutoGen, LiteLLM Proxy Server, LangKit, AgentOps, and Arize Phoenix.
First, realise that DevOps, MLOps and LLMOps all try to achieve the same thing: reliable, scalable, maintainable software. Look at the business context. What scale are you expecting? What team do you have? What do customers expect? Then look at the use case. How complicated is the problem? What are the requirements? What are the constraints? What level of quality is expected? Based on the answers to these questions, you can then evaluate how extreme your LLMOps efforts need to be.