In the second part of this series, we shift our focus to the operational aspects of deploying open-source LLMs.
In the previous article we explored how to integrate different frameworks for pipelining; here, we delve into the critical
infrastructure components that power these models in production environments. We examine tools that:
enable efficient serving of LLMs
orchestrate their deployment
provide observability for performance monitoring
offer robust evaluation capabilities
Together, these techniques form the backbone of a successful LLMOps strategy, helping engineering teams to create
and manage large models within AI applications more effectively.
Serving Frameworks
Let’s start our conversation with serving frameworks—the tools that help ensure that models are delivered with optimal
performance, handling challenges from latency optimization to resource management.
vLLM
Licence: Apache-2.0
Stars: 26.8k
Contributors: 539
Release: v0.6.1
Our first framework is vLLM. It’s a high-performance inference engine designed to assist with the deployment of
computationally intensive LLMs through efficient memory management techniques and optimised algorithms.
Where traditional serving stacks often suffer from slow inference and high memory usage, vLLM is built on the
PagedAttention algorithm, which stores attention keys and values non-contiguously. This approach
significantly boosts serving performance, especially for longer sequences.
To maximise hardware utilisation, vLLM also employs continuous batching, which groups incoming requests to reduce
waiting time and make better use of resources. Additionally, reduced-precision formats such as FP16 help minimise memory
usage, resulting in faster computations.
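Assuming vLLM is installed, its OpenAI-compatible server can be started from the command line; the model name below is only an example and any supported model identifier works:
pip install vllm
vllm serve Qwen/Qwen2-VL-7B-Instruct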
Another key feature of vLLM is its user-friendly APIs, which are compatible with OpenAI models, making it easy for
teams to migrate existing setups quickly and seamlessly. For example, below is a brief overview of how it can be used in
Python:
from openai import OpenAI

# Adjust OpenAI's API key and API base URL to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to a large language model."},
    ],
)
print(chat_response.choices[0].message.content)
Ollama
Ollama is an advanced and user-friendly platform that simplifies the process of running large language models on your
local machine. With just a few steps, you can set up an open-source, general-purpose model, or choose a specialised
LLM tailored for specific tasks. It doesn’t matter what system you use as Ollama supports Windows, macOS, and
Linux, making it accessible for most users.
One of Ollama’s key advantages is its focus on customization and performance optimization. Users can fine-tune
model parameters and adjust settings to shape the behaviour of LLMs according to their specific needs. The platform
efficiently leverages available hardware resources, including CPUs and GPUs, to accelerate model inference and ensure
smooth operation, even with large-scale models. Additionally, Ollama operates entirely offline, enhancing data privacy by
keeping all computations and data within your local environment.
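Getting started only takes a couple of terminal commands; assuming Ollama is already installed, you can pull and run a model like so (llama2 is just one example from the model library):
ollama pull llama2
ollama run llama2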
Beyond running experiments directly in your terminal, Ollama also offers API integration, allowing you to seamlessly
embed the locally configured model into your application. For example:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but unused by Ollama
)

chat_response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to a large language model."},
    ],
)
LocalAI
Licence: MIT
Stars: 23.3k
Contributors: 110
Release: v2.20.1
LocalAI presents itself as an open-source alternative to OpenAI, offering a powerful toolkit that operates without the
need for expensive GPUs. It supports a wide range of model families and architectures, making it an ideal choice for
anyone eager to experiment with AI while avoiding high cloud-processing costs.
Furthermore, this framework offers versatile APIs that let you explore text generation with backends like
`llama.cpp` and `gpt4all.cpp`, transcribe audio with `whisper.cpp`, or even generate images using Stable Diffusion. Plus,
its design prioritises efficiency, keeping models loaded in memory to enable faster inference and ensure that your AI
projects run seamlessly from start to finish.
To start exploring this framework, you would need to install it with the following command:
curl https://localai.io/install.sh | sh
After this, you can finally enjoy your model via simple API usage:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "hermes-2-theta-llama-3-8b", "messages": [{"role": "user", "content": "How are you doing?", "tem
Orchestration Frameworks
Next, we turn our attention to orchestration frameworks, which are essential for managing how and when your LLMs are
deployed. These frameworks take care of scaling, load balancing and automating deployment workflows, allowing you
to run your models reliably across diverse environments.
BentoML / OpenLLM
Licence: Apache-2.0
Stars: 9.7k
Contributors: 31
Release: v2.20.1
OpenLLM is a good example of this kind of framework, as it’s a traditional AI platform with Kubernetes helpers that
streamline the deployment of LLMs in the cloud.
OpenLLM optimises model serving with advanced inference techniques from vLLM and BentoML, ensuring low latency
and high throughput, even under demanding workloads. Unlike local-focused solutions like Ollama, which struggle to
scale beyond low request rates, OpenLLM excels at handling multiple concurrent users, reaching throughput levels nearly
eight times higher than Ollama on similar hardware setups.
With OpenAI-compatible APIs, OpenLLM allows seamless integration of various open-source models, such as Llama 3
and Qwen2, and provides a built-in chat interface for interactive LLM use.
These capabilities make OpenLLM a robust choice for cloud-based AI applications, delivering both the ease of use of
traditional platforms and the advanced performance needed for real-time, multi-user scenarios.
To start, just run the following:
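The commands below are a rough sketch based on the OpenLLM README; the model tag is an example and depends on the OpenLLM version you have installed:
pip install openllm
openllm serve llama3:8b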
The server will be available at http://localhost:3000, offering OpenAI-compatible APIs for interaction. You can connect
with these endpoints using various frameworks and tools that support OpenAI-compatible APIs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{
        "role": "user",
        "content": "Give me a short introduction to a large language model.",
    }],
    stream=True,
)
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
AutoGen
Licence: CC-BY-4.0, MIT
Stars: 30.8k
Contributors: 346
Release: v0.2.35
AutoGen redefines how developers can build and manage AI applications by introducing a versatile multi-agent
framework.
At its core, AutoGen works with agents, which interact together to perform a variety of tasks. These agents can be
customised and enhanced with prompt engineering and supplementary tools (e.g. the Google Search API), enabling them to
execute code, retrieve information, and collaborate to solve complex tasks, autonomously or with human feedback.
This approach not only improves the orchestration and automation of workflows involving LLMs, but also maximises
their performance while overcoming inherent limitations.
AutoGen’s flexibility supports diverse conversation patterns, from fully autonomous dialogues to human-in-the-loop
problem-solving, making it ideal for building next-generation LLM applications. The framework’s agents, such as the
Assistant Agent and User Proxy Agent, can be configured to carry out specific functions like coding, reviewing or
incorporating human input into decision-making processes.
To install AutoGen, run:
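pip install pyautogen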
Then, you could start building your versatile agent app, for example:
from autogen import UserProxyAgent
from autogen.coding import LocalCommandLineCodeExecutor
user_proxy = UserProxyAgent(
    "user_proxy",
    code_execution_config={"executor": LocalCommandLineCodeExecutor(work_dir="coding")},
)
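To illustrate how agents collaborate, here is a minimal sketch that pairs the user proxy above with an assistant agent; the model name, API key placeholder and task are assumptions rather than part of the original example:
from autogen import AssistantAgent

assistant = AssistantAgent(
    "assistant",
    llm_config={"config_list": [{"model": "gpt-4o", "api_key": "YOUR_API_KEY"}]},
)

# The proxy sends the task, executes any code the assistant writes in ./coding,
# and feeds the results back until the task is solved.
user_proxy.initiate_chat(
    assistant,
    message="Plot NVDA's closing price for the last 30 days and save it to a file.",
)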
API Gateways
API gateways help manage the flow of data between your LLMs and external applications. These gateways not only
handle routing and security but also simplify integration, making your models easier to work with and more adaptable to
existing systems.
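LiteLLM Proxy Server
A popular example is the LiteLLM Proxy Server, which puts a single OpenAI-compatible endpoint in front of many different model providers. As a rough sketch (the model name is only illustrative), the proxy can be installed and started with:
pip install 'litellm[proxy]'
litellm --model gpt-3.5-turbo
Once the proxy is up, you can fire a quick built-in test request at it: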
litellm --test
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# Send a request through the proxy, which forwards it to the configured model.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Give me a short introduction to a large language model."}],
)
print(response)
Observability Tools
Currently, there are limited open-source API gateway systems available, which makes it necessary to explore the next
essential component: observability tools. These tools provide the critical insights needed to monitor your LLMs in
action, offering a comprehensive view of performance metrics, error tracking and usage patterns.
WhyLabs LangKit
Licence: Apache-2.0
Stars: 815
Contributors: 10
Release: v0.0.33
LangKit is an advanced tool for monitoring and safeguarding AI models in production. It extracts critical telemetry data
from prompts and responses, helping to detect and prevent issues like malicious prompts, sensitive data leakage,
toxicity, hallucinations and jailbreak attempts.
By setting thresholds and baselines, LangKit ensures that LLMs comply with usage policies and maintain the desired
behaviour. Its extensibility also allows you to customise metrics and monitoring rules, making it adaptable to specific
business cases.
With LangKit, you can systematically observe the performance, track behaviour changes over time, and even run A/B
testing of different prompt versions. Integration with WhyLabs further enhances these capabilities, providing a platform
for ongoing monitoring and collaboration across teams without the need for complex infrastructure.
To install, run:
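pip install langkit[all]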
To evaluate your prompt for any potential injection attack, you could use the injections module, which calculates the
semantic similarity between the evaluated prompt and examples of known jailbreaks, prompt injections and harmful
behaviours.
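The snippet below sketches the setup that the scoring lines assume: the prompt is the article's running example, and the metric column name is an assumption that may differ between LangKit versions.
import whylogs as why
from langkit import injections

schema = injections.init()
prompt = "Give me a short introduction to a large language model."
profile = why.log(row={"prompt": prompt}, schema=schema).profile()
injections_column_name = "prompt.injection"  # assumed metric name; check your LangKit version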
score = profile.view().get_column(injections_column_name).to_summary_dict()['distribution/mean']
print(f"prompt: {prompt}")
print(f"score: {score}")
The final score in the output is equal to the highest similarity found across all examples. If the prompt closely matches one
of those examples, it is likely a jailbreak or prompt injection attempt, which is not the case for the benign prompt used here.
AgentOps
Licence: MIT
Stars: 1.7k
Contributors: 17
Release: v0.3.10
Our next framework tackles similar challenges, such as observability, debugging and cost management, but in the
context of agents. AgentOps provides developers with advanced analytics and error-detection capabilities that help
ensure the reliability and efficiency of AI agents across various applications.
Seamlessly integrating with popular frameworks like CrewAI, AutoGen and LangChain, AgentOps simplifies the
implementation process, allowing enhanced agent performance with minimal setup.
A key advantage of this library is its comprehensive approach to managing the costs associated with various AI calls,
which is a critical concern for applications of this type. The platform provides detailed cost analysis and optimization
tools, including real-time tracking of token usage and spend, session drilldowns, and recommendations for prompt
engineering to reduce expenses without compromising performance.
To start tracking telemetry from your Langchain-based app, you could set up this code:
import os
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, load_tools, AgentType
from agentops.langchain_callback_handler import LangchainCallbackHandler

# The callback handler opens an AgentOps session and records every LLM call.
handler = LangchainCallbackHandler(api_key=os.environ["AGENTOPS_API_KEY"], tags=["Langchain Example"])

llm = ChatOpenAI(openai_api_key=os.environ["OPENAI_API_KEY"],
                 callbacks=[handler],
                 model='gpt-4o')

tools = load_tools(["ddg-search"], llm=llm)  # any Langchain tools will do

agent = initialize_agent(tools,
                         llm,
                         agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION,
                         verbose=True,
                         callbacks=[handler],  # you must pass in a callback handler to record your agent
                         handle_parsing_errors=True)
Arize Phoenix
Licence: Elastic (ELv2)
Stars: 3.5k
Contributors: 162
Release: v4.33.2
Arize Phoenix is a platform that provides full observability into every layer of LLM-based applications. By integrating
robust debugging, experimentation, evaluation and prompt tracking tools, Phoenix empowers teams to efficiently build,
optimise and maintain high-quality AI-driven systems. In the development phase, Phoenix’s advanced tracing
capabilities offer deep insights into the application’s execution flow, aiding rapid issue identification and resolution.
Teams can also conduct detailed experiments and visualise even data embeddings to fine-tune search and retrieval
processes in RAG-based cases.
As applications progress into testing, staging and production environments, Phoenix continues to support
comprehensive evaluation and benchmarking features. To demonstrate, let's build a simple LlamaIndex application with
Phoenix integration.
To download the library:
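The example also relies on LlamaIndex, the OpenInference LlamaIndex instrumentation and gcsfs, so a reasonable install line (the exact package set may vary with your Phoenix version) is:
pip install arize-phoenix llama-index openinference-instrumentation-llama-index gcsfs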
import phoenix as px
from gcsfs import GCSFileSystem
from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from phoenix.otel import register  # import paths can differ slightly between Phoenix versions

# Launch the Phoenix app and instrument LlamaIndex so traces are sent to it.
px.launch_app().view()
tracer_provider = register(endpoint="http://127.0.0.1:6006/v1/traces")
LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)

# Load a pre-built index of the Arize documentation from public cloud storage.
file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
    fs=file_system, persist_dir=index_path,
)

Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()

queries = [
    "Give me a short introduction to a large language model.",
    "How do I fine-tune an LLM?",
    "How much does an enterprise licence of ChatGPT cost?",
]

# Run the queries so their traces appear in Phoenix, then print the UI URL.
for query in queries:
    query_engine.query(query)
print("The Phoenix UI:", px.active_session().url)
Evaluation Tools
Now, let’s dive into evaluation tools, which are crucial for assessing the performance, accuracy and reliability of your
LLMs. These tools help you test and validate your models, offering the feedback needed to fine-tune them before they
go live.
DeepEval
Licence: Apache-2.0
Stars: 3k
Contributors: 57
Release: v0.21.74
Moving beyond traditional metrics, DeepEval offers a holistic assessment by incorporating a wide array of evaluation
techniques that address effectiveness, reliability and ethical considerations.
Its modular design allows users to create customizable evaluation pipelines, much like unit testing in software
development, enabling tailored evaluations that suit specific needs and contexts.
A key strength of DeepEval lies in its extensive collection of metrics and benchmarks. It includes over 14 research-
backed metrics that cover various aspects of AI performance. The framework also integrates state-of-the-art
benchmarks like MMLU, HumanEval and GSM8K to standardise performance measurement across diverse tasks.
Additionally, DeepEval features a synthetic data generator that leverages LLMs to create complex and realistic
datasets, facilitating thorough testing across different scenarios.
Install DeepEval with:
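pip install -U deepeval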
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_case():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="Give me a short introduction to a large language model.",
        # Replace this with the actual output from your LLM application
        actual_output="In simpler terms, an LLM is a computer program that has been fed enough examples to be able to recognise and generate human-like text.",
        retrieval_context=["A large language model (LLM) is a deep learning algorithm that can perform a variety of natural language processing tasks."],
    )
    assert_test(test_case, [answer_relevancy_metric])
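Assuming the test above is saved in a file such as test_llm.py (a placeholder name), it can be executed through DeepEval's pytest integration:
deepeval test run test_llm.py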
Evidently OSS
Licence: Apache-2.0
Stars: 5.1k
Contributors: 66
Release: v0.4.37
Evidently OSS offers a diverse suite of tools for evaluating, testing, and monitoring models from validation to
production stages. Tailored for data scientists and ML engineers, it supports various data types—including tabular data,
text, embeddings, LLMs and generative models—providing a consistent API and a rich library of metrics, tests, and
visualisations.
The platform adopts a modular approach with three main components:
Reports: generate interactive visualisations for exploratory analysis and debugging (see the sketch after this list)
Test Suites: allow for structured, automated batch checks using customizable conditions
Monitoring Dashboard: enables continuous tracking of model performance and data quality over time, integrating
with tools like Grafana for real-time monitoring
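As a brief illustration of the Reports component, the sketch below compares a reference dataset against current production data for drift; the CSV file names are placeholders:
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Placeholder datasets: a reference sample and recent production data.
reference = pd.read_csv("reference.csv")
current = pd.read_csv("current.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # interactive report for debugging and sharing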
Guidance
Licence: MIT
Stars: 18.7k
Contributors: 70
Release: v0.1.16
Guidance is a special programming language developed by Microsoft that aims to enhance control over large models. It
helps developers to seamlessly combine text generation, prompting and logical control structures in a way that mirrors
the language model’s own text processing.
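As a minimal sketch of this interleaving (the model path is a placeholder for any local llama.cpp-compatible weights file), Guidance can constrain the next tokens to a fixed set of choices:
from guidance import models, select

lm = models.LlamaCpp("path/to/model.gguf")  # placeholder model path

# Generation here is constrained to exactly one of the two options.
lm += "Do you want a joke or a poem? A " + select(["joke", "poem"])
print(lm)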
# Output
# Do you want a joke or a poem? A poem
One of its key strengths is the ability to produce structured outputs—such as JSON or Pydantic objects—that strictly
follow a specified schema. By enforcing the output format, Guidance enables the LLM to concentrate on content
generation while eliminating common parsing issues associated with unstructured text. This is especially useful when
working with smaller or less robust language models, which may struggle to produce well-formed hierarchical data due to
limited training on source code.
In the context of applications like LlamaIndex, Guidance can be integrated to simplify the creation of structured
outputs like Pydantic objects. For example, developers can define data models for complex objects like albums and
songs, and use Guidance to generate instances that adhere to these models.
from typing import List
from pydantic import BaseModel
from guidance.models import OpenAI as GuidanceOpenAI  # import path may vary with your guidance version
from llama_index.program.guidance import GuidancePydanticProgram

class Song(BaseModel):
    title: str
    length_seconds: int

class Album(BaseModel):
    name: str
    artist: str
    songs: List[Song]

program = GuidancePydanticProgram(
    output_cls=Album,
    prompt_template_str="Generate an example album, with an artist and a list of songs. Using the movie {{movie_name}} as inspiration.",
    guidance_llm=GuidanceOpenAI("gpt-4o"),
    verbose=True,
)
output = program(movie_name="The Shining")
In addition, Guidance can improve the robustness of query engines within LlamaIndex by ensuring that intermediate
responses conform to expected formats. By plugging a Guidance-based question generator into a sub-question query
engine, developers can achieve more accurate results compared to default settings.
Outlines
Licence: Apache-2.0
Stars: 8.2k
Contributors: 102
Release: v0.0.46
Structured generation involves transforming the raw text produced by an LLM into a predefined format or schema,
which is particularly useful when working with structured data. By ensuring that generated text conforms to specific
formats like JSON or CSV, Outlines makes it easier to integrate LLM outputs with other systems, automate parsing
processes, and enhance the clarity and context of the information presented.
Strong benefits of Outlines include the ability to make any open-source LLM return a JSON object following a user-
defined structure, which is invaluable for tasks like parsing responses, storing data or triggering functions based on the
results. It also offers compatibility with vLLM in JSON mode, allowing for the deployment of LLM services that produce
structured JSON outputs. Additionally, Outlines enables LLMs to generate text that matches specified regular
expressions, ensuring conformity to desired patterns. The library also simplifies prompt management through powerful
prompt templating, using Python functions with embedded templates that populate with argument values when
invoked.
import outlines
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
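Continuing from the model loaded above, a minimal sketch of schema-constrained generation could look like this; the Pydantic model and prompt are purely illustrative:
from pydantic import BaseModel
import outlines

class Character(BaseModel):
    name: str
    age: int

# The generator is constrained to emit JSON that validates against Character.
generator = outlines.generate.json(model, Character)
character = generator("Invent a fantasy character with a name and an age.")
print(character)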
Conclusions
In summary, deploying open-source LLMs in production environments requires a robust and well-orchestrated
operational framework.
The frameworks discussed collectively provide the necessary infrastructure to ensure efficient, scalable and reliable AI
applications, from high-performance serving solutions like vLLM and Ollama, to orchestration tools such as OpenLLM
and AutoGen, gateway components like the LiteLLM Proxy Server, and observability platforms like LangKit and Arize
Phoenix.
Additionally, evaluation tools like DeepEval and Evidently OSS, alongside guardrails such as Guidance and Outlines, play
a crucial role in maintaining model performance, ethical standards, and seamless integration with existing systems. By
leveraging these open-source tools, engineering teams can effectively implement a comprehensive LLMOps strategy,
enhancing their ability to manage large language models and deliver high-quality AI-driven solutions.