How To Build AI Driven Knowledge Assistants
June 2024
Table of Contents
1. Summary
2. Introduction to Key Concepts for Generative AI
3. Architecture of Knowledge Assistants
4. Pivotal Role of CrateDB in Unified Data Management
5. Vector Store Implementation with CrateDB
6. Comprehensive Use Case: TGW Logistics Group
1. Summary
This white paper explores how CrateDB provides a scalable platform for building
Generative AI applications that cover the requirements of modern applications,
such as AI-driven knowledge assistants. CrateDB does not just handle vectors; it
provides, in a single storage engine, a unique combination of all the data types
needed for end-to-end applications, including RAG pipelines.
2. Introduction to Key Concepts for
Generative AI
What is Generative AI?
Challenges of Generative AI
Quality & reliability: LLMs tend to hallucinate, so quality and reliability are
crucial factors in the content generated by AI models. Enforcing them involves
maintaining accuracy and considering the timeliness of data input. The goal is
to produce information that is not only relevant but also accurate and
trustworthy.
Ethical & societal: Generative AI raises ethical considerations, such as the
creation of deepfakes, which could lead to serious privacy concerns.
Foundational models are trained on publicly available content. There are different
ways to provide custom context to these models. The list below is ordered by
increasing level of difficulty (combining development effort, AI skills, compute
costs, and hardware needs):
prompt engineering, Retrieval-Augmented Generation (RAG), fine-tuning, and
training a custom model.
RAG, for example, provides particular context for response generation based on
private, i.e. company-owned, data. Knowledge is not incorporated into the LLM
itself. Access control can be implemented to manage who is allowed to access
which context.
Structure of a RAG Pipeline
3. Architecture of Knowledge
Assistants
The overall architecture of a knowledge assistant usually consists of four parts:
Context Data, LLM Gateway, Chatbot, and Monitoring and Reporting.
Context Data
Contextual data is the foundation for knowledge assistants: vast amounts of data
are processed and prepared for retrieval, providing the enterprise-specific
intelligence. This data is derived from various sources, chunked, and stored
alongside embeddings in a vector store. Access to this data needs to be
controlled and monitored.
Context data is usually prepared following common principles for creating data
pipelines. A landing zone stores incoming data in various formats, which can be
structured, semi-structured, or unstructured, and sometimes even binary. The
input data is then split into smaller consumable chunks to generate embeddings.
Chunks and vectors are stored together, so that each piece of contextual
information can be traced back to its source. Data access should be carefully
governed to avoid unauthorized access, for example by creating multiple search
indexes secured with privileges at the database or application level, as
sketched below.
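A minimal sketch, with hypothetical table and user names, of per-source vector
indexes secured with database-level privileges in CrateDB:

-- One vector index (table) per data source; users are assumed to exist.
CREATE TABLE hr_chunks (
    chunk TEXT,
    embedding FLOAT_VECTOR(1536)
);

CREATE TABLE legal_chunks (
    chunk TEXT,
    embedding FLOAT_VECTOR(1536)
);

-- Grant read-only access selectively, per user group.
GRANT DQL ON TABLE doc.hr_chunks TO assistant_hr;
GRANT DQL ON TABLE doc.legal_chunks TO assistant_legal;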
For more complex data pipelines, knowledge APIs provide access to additional
data sources to vectorize (e.g. wikis), or to directory services for data access
control.
LLM Gateway
LLM logging mainly tracks the costs associated with using LLMs (e.g. tokens
generated, subscriptions). It helps manage the operational budget and optimize
resource allocation. Additionally, all interactions are logged to understand usage
patterns and to help with troubleshooting and improvements.
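As a minimal sketch (the schema is an assumption, not a prescribed format), such
logging data can live in a plain CrateDB table and be aggregated with SQL:

-- Hypothetical LLM interaction log.
CREATE TABLE llm_log (
    ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    user_id TEXT,
    model TEXT,
    prompt_tokens INTEGER,
    completion_tokens INTEGER
);

-- Daily token consumption per model, e.g. for cost reports.
SELECT date_trunc('day', ts) AS day,
       model,
       sum(prompt_tokens + completion_tokens) AS total_tokens
FROM llm_log
GROUP BY date_trunc('day', ts), model
ORDER BY day DESC;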
Chatbot
The input handler analyses the request and enforces guardrails (there may be
questions we do not want to answer).
The response formation retrieves and enriches the context.
The output handler enforces final guardrails and grounds the results to avoid
undesired answers and reduce hallucinations.
Configuration stores and operational stores hold conversation history, user
settings, feedback, and other operational data essential for the knowledge
assistant to function. Conversation history is particularly important for
providing historical context to the LLM and enhancing the relevance of responses
in ongoing interactions.
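A conversation history store can be as simple as the following sketch (the
schema is illustrative):

CREATE TABLE conversation_history (
    session_id TEXT,
    ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    role TEXT,     -- 'user' or 'assistant'
    message TEXT
);

-- Fetch the most recent turns of a session as context for the LLM.
SELECT role, message
FROM conversation_history
WHERE session_id = 'abc-123'
ORDER BY ts DESC
LIMIT 10;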
Monitoring and reporting are crucial to understand the actual system usage
(usage reports), the costs incurred by the different components and users (cost
reports), and to gain insights into the data sources used (data reports).
Usage monitoring aims to closely track how the solution is utilized across
the organization (metrics: number of user interactions, peak usage times,
types of queries being processed). Understanding usage patterns is crucial for
effective scaling and for meeting the evolving needs of the company; a
usage-report sketch follows after this list.
Cost analysis serves to track and analyze all operational expenses (token
consumption by LLMs, data processing, and other computational resources).
This promotes effective budget management and assists in identifying
opportunities for cost optimization.
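Building on the hypothetical llm_log table sketched earlier, a simple usage
report can be derived directly in SQL:

-- Interactions per hour, to identify peak usage times.
SELECT date_trunc('hour', ts) AS hour,
       count(*) AS interactions
FROM llm_log
GROUP BY date_trunc('hour', ts)
ORDER BY interactions DESC
LIMIT 10;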
4. Pivotal Role of CrateDB in Unified
Data Management
CrateDB can be employed to great benefit in the architecture for knowledge
assistants outlined below, providing a unified data platform for landing zones,
chunks, embeddings, configurations, operational stores, and logging and
reporting functionalities. This greatly simplifies the architecture, replacing
the need for multiple different database technologies with a single solution.
Each of these technologies typically comes with its own query language, which
drastically increases the effort needed to develop new applications.
This has significant impacts in terms of people, time, and money: highly skilled
people need to be hired for each language and technology, and keeping all
systems in sync requires considerable effort. Both time to market and time for
changes increase significantly, resulting in a high total cost of ownership.
As AI adoption continues to grow, the need for databases that can adapt to
complex data landscapes becomes paramount. A multi-model database capable of
managing structured, semi-structured, and unstructured data is an ideal
foundation for data modelling and application development in AI/ML scenarios.
It is an enabler of complex, context-rich, and real-time intelligent
applications.
Unified data management with CrateDB
CrateDB combines diverse data types into single records accessible via SQL,
making it easy for developers already familiar with relational databases to
adopt.
AI Ecosystem Integration
SQL is the most popular query language and enables many third-party
integrations, which is crucial when building a complex AI/ML architecture.
CrateDB's compatibility with SQL enables seamless integration into a wide array
of ecosystems, whether for data ingestion or integration with familiar tools
like Kafka, NiFi, Flink, or any other SQL-compatible tool. CrateDB also supports
writing custom code, catering to specific needs.
CrateDB offers robust Python integration for model training and inference.
Other languages and frameworks, such as Java and Spark, are also supported,
broadening the scope for application development.
LangChain Integration
LangChain can easily integrate with CrateDB, and the integration offers several
capabilities for building RAG applications.
5. Vector Store Implementation with
CrateDB
In the context of Generative AI, multimodal vector embeddings are becoming
increasingly popular. No matter the kind of source data (text, images, audio, or
video), an embedding algorithm of your choice translates the given data into a
vector representation. This vector comprises numerous values, and its length can
vary based on the algorithm used. These vectors, along with chunks of the source
data, are then stored in a vector store.
Vector databases are ideal for tasks such as similarity search, natural
language processing, and computer vision. They provide a structured way to
comprehend intricate patterns within large volumes of data. The process of
integrating this vector data with CrateDB is straightforward, thanks to its native
SQL interface.
CrateDB offers a FLOAT_VECTOR(n) data type, where you specify the length of
the vector. This creates an HNSW (Hierarchical Navigable Small World) graph in
the background for efficient nearest neighbour search. The KNN_MATCH
function executes an approximate K-nearest neighbour (KNN) search and uses
the Euclidean distance algorithm to determine similar vectors. You just need to
input the target vector and specify the number of nearest neighbours you wish to
discover.
The example below illustrates the creation of a table with both a text field and
a 4-dimensional embedding field, the insertion of records with a simple
INSERT INTO statement, and the use of the KNN_MATCH function to perform a
similarity search.
CREATE TABLE word_embeddings (
text STRING PRIMARY KEY,
embedding FLOAT_VECTOR(4)
);
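For illustration, records can be inserted and searched as follows; the
four-dimensional embedding values below are made up, and real ones would come
from your embedding model:

-- Illustrative records; real embeddings come from your embedding model.
INSERT INTO word_embeddings (text, embedding) VALUES
    ('Exploring the cosmos', [0.1, 0.5, 0.3, 0.9]),
    ('Discovering galaxies', [0.2, 0.4, 0.3, 0.8]),
    ('Discovering moon', [0.2, 0.4, 0.1, 0.7]),
    ('Sending the mission', [0.9, 0.1, 0.5, 0.2]);

-- Approximate KNN search for the 4 nearest neighbours of a target vector.
SELECT text, _score
FROM word_embeddings
WHERE knn_match(embedding, [0.3, 0.6, 0.0, 0.9], 4)
ORDER BY _score DESC;

With real embeddings, such a search yields similarity scores like the following: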
| text | _score |
|---------------------|-----------|
| Discovering galaxies| 0.917431 |
| Discovering moon | 0.909090 |
| Exploring the cosmos| 0.909090 |
| Sending the mission | 0.270270 |
The example below shows how to search for data similar to 'Discovering
galaxies'. For that, you use the KNN_MATCH function combined with a sub-select
query that returns the embedding associated with 'Discovering galaxies'.
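A sketch of such a query (the number of neighbours is illustrative):

SELECT text, _score
FROM word_embeddings
WHERE knn_match(
    embedding,
    (SELECT embedding FROM word_embeddings
     WHERE text = 'Discovering galaxies'),
    4
)
ORDER BY _score DESC;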
| text | _score |
|---------------------|-----------|
| Discovering galaxies| 1 |
| Discovering moon | 0.952381 |
| Exploring the cosmos| 0.840336 |
| Sending the mission | 0.250626 |
Combining Vectors, Source and Contextual Information
Combining your vector data (vectorized chunks of your source data) with the
original data and some additional contextual information is very powerful.
As we will outline in this chapter, a JSON payload offers the most flexible way
to store and query your metadata. A typical table schema contains a
FLOAT_VECTOR column for the embedding and an OBJECT column for the source and
contextual information.
In the example below, the table contains a FLOAT_VECTOR column with 1536
dimensions. If you are using multiple embedding algorithms, you can add new
columns with a different vector length value.
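A minimal sketch of such a schema, matching the INSERT statement shown below:

CREATE TABLE input_values (
    source OBJECT(DYNAMIC),  -- JSON payload; new attributes become new columns
    embedding FLOAT_VECTOR(1536)
);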
In an INSERT statement, you can simply pass your existing JSON data, such as a
chunk of text extracted from a PDF file or any other information source. Then,
you use your preferred algorithm to generate an embedding, which is inserted
into the table. If subsequent source data pieces carry different annotations,
context information, or metadata, you can simply add them to your JSON document;
the new attributes are automatically reflected as new columns in the table.
INSERT INTO input_values (source, embedding) VALUES (
'{ "id": "chunk_001",
"text": "This is the first chunk of text. It contains some
information that will be vectorized.",
"metadata": {
"author": "Author A",
"date": "2024-03-15",
"category": "Education"
},
"annotations": [
{ "type": "keyword", "value": "vectorized" },
{ "type": "sentiment", "value": "neutral" }
],
"context": {
"previous_chunk": "",
"next_chunk": "chunk_002",
"related_topics": ["Data Processing", "Machine Learning"]
}
}',
[1.2, 2.1, ..., 3.2] -- Embedding created by your favorite algorithm
);
Adding Filters to Similarity Search
You can also easily add query filters to your similarity search.
The example below shows you how to search for similar text snippets in the
'Education' category. CrateDB’s flexibility allows you to use any other filters as
per your needs, such as geospatial shapes, making it adaptable to your specific
use case requirements.
SELECT
    source['id'],
    source['text']
FROM
    input_values
WHERE
    knn_match(embedding, ?, 10) -- embedding to search for
    AND source['metadata']['category'] = 'Education'
ORDER BY
    _score DESC
LIMIT 10;
6. Comprehensive Use Case:
Automated Warehouse Operations
TGW Logistics Group is one of the leading international suppliers of material
handling solutions. For more than 50 years, the Austrian specialist has
implemented automated systems for its international customers, including brands
from A as in Adidas to Z as in Zalando. As a systems integrator, TGW plans,
produces, and implements complex logistics centres, from mechatronic products
and robots to control systems and software.
The use case for TGW is to expedite the aggregation of, and access to, large
amounts of varied data collected in real time from warehouse systems worldwide.
Their warehouse solutions typically consist of the shuttle engine (the actual
warehouse), a conveyor network, and pick centres where goods are packaged
for shipping.
The Digital Twin of the warehouses is used to offer Digital Assistants in the
following ways:
Process Monitoring
Identifies anomalies in the processes, such as a decrease in picking
performance.
Notifies the operator when actions need to be taken to rectify the anomalies.
Recommends actions to remove the identified anomalies.
Anomalies can arise from the interruption in the supply of source totes,
affecting picking performance.
Interruption in the supply of target totes can also be an anomaly.
The presence of a slow picker or an unplanned picker break can also lead to
decreased performance.
For each of these different sources, a vector index (i.e. a table in CrateDB)
has been created in order to build an application on top that implements a RAG
pipeline for various user groups, ranging from maintenance workers and sales
staff to any employee of the company searching for information. This approach
allows a multi-index search by querying different tables in CrateDB and
providing context from one or more datasets in the RAG pipeline. Separating the
information makes it possible to find more relevant and precise context, and to
define privileges at the database level to protect sensitive information such
as legal documents.
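As a hedged sketch (table and column names are illustrative, not TGW's actual
schema), such a multi-index search could combine results from several vector
tables:

-- '?' stands for the search embedding passed in by the application.
SELECT 'manuals' AS source, doc['text'] AS snippet, _score AS score
FROM maintenance_manuals
WHERE knn_match(embedding, ?, 5)
UNION ALL
SELECT 'sales' AS source, doc['text'] AS snippet, _score AS score
FROM sales_documents
WHERE knn_match(embedding, ?, 5)
ORDER BY score DESC
LIMIT 5;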
CrateDB is the enterprise database for time series, documents,
and vectors. It combines the simplicity of SQL and the
performance of NoSQL, providing instant insights into these
different types of data. It is enabled for AI and is used for a
large variety of use cases, including real-time analytics, AI/ML,
chatbots, IoT, digital twins, log analysis, cyber security,
application monitoring, and database consolidation.