Vector Database
CONTENTS
Getting Started
• Key Concepts of Vector Databases
− Embeddings and Dimensions
− Distance Metrics and Similarity
• Conclusion
Vector databases are specialized databases designed for scenarios
where understanding the context, similarity, or pattern is more
important than matching exact values. They leverage the mathematics
of vectors and the principles of geometry to understand and organize
data, capabilities that are essential to boosting the power of
analytical and generative artificial intelligence (AI).
Figure 1: Vector database overview
Vector embeddings are created using ML models that translate the semantic and qualitative value of an object into a numerical representation. There are embedding models for each data type, such as text, audio, and images.
A vector database is not required to generate or use vector embeddings: many vector index libraries focus on storing embeddings with in-memory indexes. However, vector databases are highly recommended for enterprise architectures, production workloads, and scenarios with high concurrency and data volume.
Nowadays, vector databases are designed to support the association of each embedding with the object metadata, which can include a variety of information such as the structured definition and object definition. Having this information alongside vectors enables more sophisticated querying, filtering, and management capabilities, similar to the queries made in traditional databases. This makes vector databases easier to integrate, more versatile, and more interpretable for end users and within data architectures.
Figure 2: Metadata
• Support complex queries and APIs: Enable complex queries that combine vector similarity searches with traditional database queries.
• Security and access control: Contain built-in security features, such as authentication and authorization, data encryption, data isolation, and access control mechanisms, that are essential for enterprise applications and compliance with data protection regulations.
• Seamless integration and SDKs: Integrate seamlessly with existing data ecosystems, providing integration libraries for several programming languages, a variety of APIs (e.g., GraphQL, RESTful), and integrations with Apache Kafka.
• Support for CRUD operations: Vector databases allow you to add, update, and delete objects with their vectors, so that users don't have to reindex the entire database when any underlying data changes.
TRADITIONAL RELATIONAL vs. VECTOR DATABASE
Traditional or relational databases are indispensable for applications that work with structured and semi-structured data and need queries to return exact matches. These databases store information in rows or documents, where each record provides structured information such as product attributes or customer details.
Vector databases, on the other hand, are optimized for storing and searching through high-dimensional vector data and return items based on similarity metrics rather than exact matches.
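To make the contrast concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the `products` list are made-up illustrative values, not real embeddings. The exact-match lookup stands in for a traditional query, and the cosine-similarity ranking stands in for a vector search.

```python
import math

# Hypothetical toy data: hand-made 3-dimensional "embeddings", not real model output.
products = [
    {"name": "Relaxed Fit Tee", "vector": [0.9, 0.1, 0.0]},
    {"name": "The Perfect Tee", "vector": [0.8, 0.2, 0.1]},
    {"name": "Trucker Jacket", "vector": [0.1, 0.9, 0.2]},
]


def exact_match(items, name):
    """Traditional lookup: return only items whose name equals the query."""
    return [p["name"] for p in items if p["name"] == name]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(items, query_vector, k=2):
    """Vector lookup: rank every item by similarity and keep the top k."""
    ranked = sorted(
        items,
        key=lambda p: cosine_similarity(p["vector"], query_vector),
        reverse=True,
    )
    return [p["name"] for p in ranked[:k]]


# No product is literally named "Red T-Shirt", so the exact lookup finds nothing.
print(exact_match(products, "Red T-Shirt"))
# The similarity lookup still returns the items closest in direction to the query.
print(most_similar(products, [1.0, 0.0, 0.0]))
```

The exact-match function either finds the literal value or returns nothing, while the similarity function always returns the closest candidates, which is the behavior a vector database optimizes at scale.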
KEY CONCEPTS OF VECTOR DATABASES
Using vector databases involves understanding their fundamental concepts: embeddings, indexes, and distance and similarity.
EMBEDDINGS AND DIMENSIONS
As explained previously, embeddings are numerical representations of objects that capture their semantic meaning and relationships in a high-dimensional space, including contextual usage and features. This numerical representation is an array of numbers in which each element corresponds to a specific dimension.
Figure 4: Embedding representation
Obviously, with two dimensions, we cannot capture the essence of the products. Dimensionality plays a crucial role in how well these embeddings can capture the relevant features of the products. More dimensions may provide more accuracy but also require more resources in terms of compute, memory, latency, and cost.
VECTOR EMBEDDING MODELS INTEGRATION
Some vector databases integrate directly with embedding models, allowing us to generate vector embeddings from raw data and bring ML models into database operations. This feature simplifies the development process and abstracts away the complexities involved in generating and using vector embeddings for both data insertion and querying.
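As a rough illustration of the resource side of the dimensionality trade-off, the sketch below estimates the raw storage needed for the vectors alone. It assumes each dimension is stored as a 4-byte float32 value and ignores index structures and metadata. The helper name and the example sizes are illustrative and not taken from any specific database.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store the vectors alone, assuming float32 values by default."""
    return num_vectors * dimensions * bytes_per_value


# Illustrative sizes only: one million objects at a few example dimensionalities.
for dims in (2, 384, 1536):
    size_mib = raw_vector_bytes(1_000_000, dims) / (1024 ** 2)
    print(f"{dims:>5} dimensions -> {size_mib:,.1f} MiB")
```

Storage grows linearly with the number of dimensions, so moving from a few hundred to a few thousand dimensions multiplies the memory footprint accordingly, before any index overhead is counted.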
SCALABILITY
Vector databases are usually highly scalable solutions that support
vertical and horizontal scaling. Horizontal scaling is based on two
fundamental strategies: sharding and replication. Both strategies are
crucial for managing large-scale and distributed databases.
… points in Euclidean space.
Figure 9: Euclidean
SHARDING
Sharding divides the database into smaller pieces called shards. Each shard contains a subset of the database's data, making it responsible for a particular segment of the data.
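A minimal sketch of how objects could be routed to shards, assuming a simple hash-based scheme over the object ID. Real vector databases may use different partitioning strategies. The function name and the IDs here are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(object_id: str, num_shards: int) -> int:
    """Map an object ID to a shard index using a stable hash."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical object IDs distributed over three shards.
shards = defaultdict(list)
for i in range(12):
    object_id = f"product-{i}"
    shards[shard_for(object_id, num_shards=3)].append(object_id)

for shard_index in sorted(shards):
    print(shard_index, shards[shard_index])
```

Because the hash is stable, the same ID always lands on the same shard, so reads and writes for an object can be sent to a single node. How evenly the IDs spread across shards depends on the hash and the data.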
Table 2: Key sharding advantages and considerations
ADVANTAGES | CONSIDERATIONS
Sharding allows a database to scale by adding more shards across additional servers, effectively handling more data and users without degradation in performance. | Ensuring even distribution of data and avoiding hotspots where one shard receives significantly more queries than others can be challenging.
It can be more cost effective to add servers with moderate specifications than to scale up a single server with high specifications. | Query throughput does not improve when adding more sharded nodes.
REPLICATION
Replication involves creating copies of a database on multiple nodes within the cluster.
Table 3: Key advantages and considerations for replication
ADVANTAGES | CONSIDERATIONS
Replication ensures that the database remains available for read operations even if some servers are down. | Maintaining data consistency across replicas, especially in write-heavy environments, can be challenging and may require sophisticated synchronization mechanisms.
VECTOR DATA IN GENERATIVE AI: RETRIEVAL-AUGMENTED GENERATION
Generative AI and large language models (LLMs) have certain limitations given they must be trained with a large amount of data. Training imposes high costs in terms of time, resources, and money. As a result, these models are usually trained with general contexts and are not constantly updated with the latest information.
Retrieval-augmented generation (RAG) plays a crucial role here: it was developed to improve response quality in specific contexts by incorporating an external source of relevant and updated information into the generative process. A vector database is particularly well suited for implementing RAG due to its capabilities in handling high-dimensional data, performing efficient similarity searches, and integrating with AI/ML workflows.
Figure 11: Overview of RAG architecture
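The retrieval step of RAG can be sketched in a few lines. This is a toy illustration under stated assumptions: the document vectors are hand-made rather than produced by an embedding model, the in-memory list stands in for a vector database, and `build_prompt` is a hypothetical helper. No LLM is called.

```python
import math

# Hypothetical knowledge base: (text, hand-made vector) pairs standing in for
# documents and their embeddings stored in a vector database.
documents = [
    ("Returns are accepted within 30 days.", [0.9, 0.1, 0.1]),
    ("Standard shipping takes 3 to 5 days.", [0.1, 0.9, 0.1]),
    ("Gift cards never expire.", [0.1, 0.1, 0.9]),
]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vector, k=1):
    """Return the k document texts most similar to the query vector."""
    ranked = sorted(
        documents,
        key=lambda d: cosine_similarity(d[1], query_vector),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Combine retrieved context with the user question for the generative model."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# The query vector would normally come from the same embedding model as the documents.
question = "How long do I have to return an item?"
query_vector = [0.8, 0.2, 0.1]
prompt = build_prompt(question, retrieve(query_vector, k=1))
print(prompt)
```

The generative model then receives a prompt that already contains the most relevant, up-to-date passage, which is the part of the flow a vector database is responsible for.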
… generation is of high relevance and quality.
• Seamless integration: Vector databases provide APIs, SDKs, and tools that make it easy to integrate with various AI/ML frameworks. This flexibility facilitates the development and deployment of RAG models, allowing researchers and developers to focus on model optimization rather than data management challenges.
• Context generation: Vector embeddings capture the semantic essence of text, images, videos, and more, enabling AI models to understand context and generate new content that is contextually similar or related.
• Scalability: Vector databases provide a scalable solution that can manage large-scale information without compromising retrieval performance.
Searches are performed by calculating the similarity between the query vector and the document vectors in the database, using one of the previously explained metrics, such as cosine similarity. Some applications are:
• Recommendation systems: Perform similarity searches to find items that match a user's interests, providing accurate and timely recommendations to enhance the user experience.
• Customer support: Retrieve the most relevant information to resolve customers' doubts, questions, or problems.
• Knowledge management: Find relevant information quickly in the organization's knowledge base, composed of documents, slides, videos, or reports in enterprise systems.
Vector databases provide the technological foundation necessary for the effective implementation of RAG models and make them an optimal choice for interaction with large-scale knowledge bases.
OTHER SPECIFIC USE CASES
Beyond the main use cases discussed above are several others, such as:
• Anomaly detection: Embeddings capture nuanced relationships and patterns within data, making it possible to detect anomalies that might not be evident through traditional methods.
GETTING STARTED
To get started, the practical exercise below demonstrates the use of a vector database for identifying comparable products in a fashion retail scenario (i.e., a semantic search use case). We'll go through setting up the environment, loading fashion product data into the open-source vector database, and querying it to find similar items.
For the environment, ensure the following tools are installed:
DATA SAMPLE
The following sample dataset is used throughout this practical exercise, based on the concepts explained in previous sections:
NAME | SECTION | FAMILY | FIT | COMPOSITION | COLOR
Relaxed Fit Tee | Men | T-shirts | Non-stretch, Relaxed fit | 100% cotton. Jersey. Crewneck, Short sleeves | Red
Relaxed Fit Tee | Men | T-shirts | Non-stretch, Relaxed fit | 100% cotton. Jersey. Crewneck, Short sleeves | Green
Trucker Jacket | Men | Jackets | Standard fit | 100% cotton, Denim, Point collar, Long sleeves | Gray
Slim Welt Pocket Jeans | Women | Jeans | Mid rise: 8 3/4'', Inseam: 30'', Leg opening: 13'' | 62% cotton~28% viscose, ECOVERO™)~8% elastomultiester~2% elastane, Denim, Stretch, Zip fly, 5-pocket styling | Black
Baggy Dad Utility Pants | Women | Jeans | Mid rise, Straight leg | 95% cotton, 5% recycled cotton, Denim, No Stretch | Green
The Perfect Tee | Women | T-shirts | Standard fit, Model wears a size small | 100% cotton, Crewneck, Short sleeves | White
Lelou Shrunken Moto Jacket | Women | Jackets | Slim fit | 100% polyurethane - releases plastic microfibers into the environment during washing, Long sleeves | Black
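One way the rows above could be represented in code is as a list of dictionaries. The key names are assumptions borrowed from the property names used when objects are inserted later in this exercise (name, section, family, fit, composition, color). Only the first three rows are shown, and the variable name `sample_products` is illustrative.

```python
# Illustrative subset of the sample data; keys are assumed from the property
# names used in the insertion step, values are copied from the first three rows.
sample_products = [
    {
        "name": "Relaxed Fit Tee",
        "section": "Men",
        "family": "T-shirts",
        "fit": "Non-stretch, Relaxed fit",
        "composition": "100% cotton. Jersey. Crewneck, Short sleeves",
        "color": "Red",
    },
    {
        "name": "Relaxed Fit Tee",
        "section": "Men",
        "family": "T-shirts",
        "fit": "Non-stretch, Relaxed fit",
        "composition": "100% cotton. Jersey. Crewneck, Short sleeves",
        "color": "Green",
    },
    {
        "name": "Trucker Jacket",
        "section": "Men",
        "family": "Jackets",
        "fit": "Standard fit",
        "composition": "100% cotton, Denim, Point collar, Long sleeves",
        "color": "Gray",
    },
]

# Every row should carry the same six fields.
expected_keys = {"name", "section", "family", "fit", "composition", "color"}
for row in sample_products:
    assert set(row) == expected_keys

print([row["color"] for row in sample_products])
```

Keeping every row to the same set of keys makes it straightforward to map each dictionary to one object in the collection.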
In this case, we are going to use the second option, using Weaviate as our example:

    import weaviate
    import weaviate.classes as wvc

    # Defined previously in Step 3
    products_data = [{....}]

    # Connect with default parameters
    client = weaviate.connect_to_local()

    try:
        # Check if the connection was successful
        if not client.is_ready():
            raise RuntimeError("Weaviate is not ready.")
        print("Successfully connected to Weaviate.")

        # Create the collection; text is vectorized by the text2vec-transformers module
        products_collection = client.collections.create(
            name="Products",
            vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_transformers(
                vectorize_collection_name=True
            ),
        )

        # Build one object per product with the properties to store
        products_objs = list()
        for d in products_data:
            products_objs.append({
                "name": d["name"],
                "section": d["section"],
                "family": d["family"],
                "fit": d["fit"],
                "composition": d["composition"],
                "color": d["color"],
            })

        products_collection.data.insert_many(products_objs)
    finally:
        # Always release the connection
        client.close()

STEP 5: SIMILARITY QUERY
Once your data is indexed, you can query for similar products using Weaviate's vector search capabilities. For example, to find products similar to a "Red T-Shirt" or "Jeans for women," you can use a search query with that description:

    import weaviate
    import weaviate.classes as wvc

    # Connect with default parameters
    client = weaviate.connect_to_local()

    try:
        # Same readiness check as in the loading step
        if client.is_ready():
            print("Successfully connected to Weaviate.")

        products = client.collections.get("Products")

        # Retrieve the two products closest to the query text
        response = products.query.near_text(
            query="Red T-Shirt",
            return_metadata=wvc.query.MetadataQuery(distance=True),
            limit=2,
            return_properties=["name", "family", "color"],
        )

        for o in response.objects:
            print(o.properties)
            print(o.metadata.distance)
    finally:
        client.close()

This query uses the near_text function to find products whose descriptions are similar to the given concept. Weaviate returns the products it considers semantically similar based on the vector embeddings of their descriptions.

STEP 6: OUTPUT
The query returns the two closest products, including some of the object properties and the distance:

    Successfully connected to Weaviate.
    {'family': 'T-SHIRTS', 'color': 'Red', 'name': 'Relaxed Fit Tee'}
    0.0
    {'family': 'T-SHIRTS', 'color': 'White', 'name': 'THE PERFECT TEE'}
    0.0

CONCLUSION
This Refcard provides an overview of vector database fundamentals as well as a practical application in fashion retail. By customizing the dataset and queries, you can explore the full potential of vector databases for similarity searches and other AI-driven applications.
This is just a starting point in the world of vectors. ML models and vectors are powerful tools in machine learning and artificial intelligence, offering a nuanced and high-dimensional representation of complex data. Vector databases are not a magical solution that provides immediate value; as with a good wine, getting the most from them takes careful experimentation, parameter optimization, and ongoing evaluation.

WRITTEN BY MIGUEL GARCÍA LORENZO, VP OF ENGINEERING, NEXTAIL
Miguel is VP of Engineering at Nextail. He has 10+ years of experience in the data space, leading teams and building high-performance solutions. He is a book lover and an advocate of platform design as a service and data as a product.

Copyright © 2024 DZone. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means of electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.