Vector Store Basics

Creating the Deep Lake Vector Store

How to Create a Deep Lake Vector Store

Let's create a Deep Lake Vector Store for storing and searching information about the Twitter OSS recommendation algorithm.

Downloading and Preprocessing the Data

First, let's import the necessary packages and make sure the Activeloop and OpenAI API keys are stored in the environment variables ACTIVELOOP_TOKEN and OPENAI_API_KEY.

from deeplake.core.vectorstore import VectorStore
import openai
import os
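
If the keys are not already set in your shell, they can be set from Python before continuing. The values below are placeholders rather than real credentials:

# Placeholders -- substitute your own keys.
# os.environ['ACTIVELOOP_TOKEN'] = '<your_activeloop_token>'
# os.environ['OPENAI_API_KEY'] = '<your_openai_api_key>'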

Next, let's clone the Twitter OSS recommendation algorithm repository and define paths for the source data and the Vector Store.

!git clone https://github.com/twitter/the-algorithm
vector_store_path = '/vector_store_getting_started'
repo_path = '/the-algorithm'

Next, let's load all the files from the repo into a list of data that will be added to the Vector Store (chunked_text and metadata). We use simple text chunking based on a constant number of characters.

CHUNK_SIZE = 1000

chunked_text = []
metadata = []
for dirpath, dirnames, filenames in os.walk(repo_path):
    for file in filenames:
        try:
            # Read each file and split it into fixed-size character chunks.
            full_path = os.path.join(dirpath, file)
            with open(full_path, 'r') as f:
                text = f.read()
            new_chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
            chunked_text += new_chunked_text
            metadata += [{'filepath': full_path} for i in range(len(new_chunked_text))]
        except Exception as e:
            # Skip files that cannot be read as text (e.g. binary files).
            print(e)
            pass
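
As a quick sanity check, each chunk should have a matching metadata entry (the exact count depends on the state of the repo when you clone it):

print(len(chunked_text), len(metadata))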

Next, let's define an embedding function using OpenAI. It must work for both a single string and a list of strings, so that it can be used to embed a prompt as well as a batch of texts.

def embedding_function(texts, model="text-embedding-ada-002"):

    if isinstance(texts, str):
        texts = [texts]

    texts = [t.replace("\n", " ") for t in texts]

    return [data.embedding for data in openai.embeddings.create(input = texts, model=model).data]
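
As a quick check, the function returns one vector per input whether it receives a single string or a list (for text-embedding-ada-002, each vector has 1536 dimensions):

single = embedding_function('What do trust and safety models do?')
batch = embedding_function(['first text', 'second text'])

print(len(single), len(batch), len(single[0]))

# Expected: 1 2 1536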

Finally, let's create the Deep Lake Vector Store and populate it with data. We use a default tensor configuration, which creates tensors with text (str), metadata (json), id (str, auto-populated), and embedding (float32). Learn more about tensor customizability here.

vector_store = VectorStore(
    path = vector_store_path,
)

vector_store.add(text = chunked_text,
                 embedding_function = embedding_function,
                 embedding_data = chunked_text,
                 metadata = metadata
)
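
If a non-default schema is needed, the constructor also accepts a tensor configuration. The sketch below assumes the tensor_params argument and simply mirrors the default tensors, so treat it as illustrative rather than required:

# Illustrative only -- the default configuration already creates these tensors.
# vector_store = VectorStore(
#     path = vector_store_path,
#     tensor_params = [{'name': 'text', 'htype': 'text'},
#                      {'name': 'metadata', 'htype': 'json'},
#                      {'name': 'embedding', 'htype': 'embedding'},
#                      {'name': 'id', 'htype': 'text'}],
# )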

The Vector Store's data structure can be summarized using vector_store.summary(), which shows 4 tensors with 21055 samples:
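vector_store.summary()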

  tensor      htype        shape       dtype  compression
  -------    -------      -------     -------  ------- 
 embedding  embedding  (21055, 1536)  float32   None   
    id        text      (21055, 1)      str     None   
 metadata     json      (21055, 1)      str     None   
   text       text      (21055, 1)      str     None   

To create a Vector Store using pre-computed embeddings, you may pass embedding directly instead of embedding_data and embedding_function:

# vector_store.add(text = chunked_text, 
#                  embedding = <list_of_embeddings>, 
#                  metadata = [{"source": source_text}]*len(chunked_text))
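
For instance, the <list_of_embeddings> placeholder above could be produced by batching the embedding function over the chunks yourself (the batch size here is arbitrary):

# precomputed = []
# for i in range(0, len(chunked_text), 128):
#     precomputed += embedding_function(chunked_text[i:i+128])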

Performing Vector Search

Deep Lake offers highly flexible vector search and hybrid search options. First, let's show a simple example of vector search using default options, which performs a cosine similarity search in Python on the client (your machine).

prompt = "What do trust and safety models do?"

search_results = vector_store.search(embedding_data=prompt, embedding_function=embedding_function)

search_results is a dictionary with keys for the text, score, id, and metadata, with data ordered by score. By default, the search returns the top 4 results, which can be verified using:

len(search_results['text']) 

# Returns 4
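
The other keys of the dictionary can be inspected directly; index 0 always holds the best match, since results are ordered by score:

print(search_results.keys())

# Expected keys: 'text', 'score', 'id', 'metadata'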

If we examine the first returned text, it appears to contain the text about trust and safety models that is relevant to the prompt.

search_results['text'][0]

Returns:

Trust and Safety Models
=======================

We decided to open source the training code of the following models:
- pNSFWMedia: Model to detect tweets with NSFW images. This includes adult and porn content.
- pNSFWText: Model to detect tweets with NSFW text, adult/sexual topics.
- pToxicity: Model to detect toxic tweets. Toxicity includes marginal content like insults and certain types of harassment. Toxic content does not violate Twitter's terms of service.
- pAbuse: Model to detect abusive content. This includes violations of Twitter's terms of service, including hate speech, targeted harassment and abusive behavior.

We have several more models and rules that we are not going to open source at this time because of the adversarial nature of this area. The team is considering open sourcing more models going forward and will keep the community posted accordingly.

We can also retrieve the corresponding filename from the metadata, which shows the top result came from the README.

search_results['metadata'][0]

# Returns: {'filepath': '/the-algorithm/trust_and_safety_models/README.md'}

Customization of Vector Search

You can customize your vector search with simple parameters, such as selecting the distance_metric and top k results:

search_results = vector_store.search(embedding_data=prompt,
                                     embedding_function=embedding_function,
                                     k=10,
                                     distance_metric='l2')

The search now returns 10 search results:

len(search_results['text']) 

# Returns: 10

The first search result with the L2 distance metric returns the same text as the previous cosine similarity search:

search_results['text'][0]

Returns:

Trust and Safety Models
=======================

We decided to open source the training code of the following models:
- pNSFWMedia: Model to detect tweets with NSFW images. This includes adult and porn content.
- pNSFWText: Model to detect tweets with NSFW text, adult/sexual topics.
- pToxicity: Model to detect toxic tweets. Toxicity includes marginal content like insults and certain types of harassment. Toxic content does not violate Twitter's terms of service.
- pAbuse: Model to detect abusive content. This includes violations of Twitter's terms of service, including hate speech, targeted harassment and abusive behavior.

We have several more models and rules that we are not going to open source at this time because of the adversarial nature of this area. The team is considering open sourcing more models going forward and will keep the community posted accordingly.
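
Beyond k and distance_metric, the search call can also filter on metadata. The sketch below assumes the filter parameter available in recent deeplake versions, restricting results to a single file (the filepath is the one returned above):

filtered_results = vector_store.search(embedding_data=prompt,
                                       embedding_function=embedding_function,
                                       filter={'metadata': {'filepath': '/the-algorithm/trust_and_safety_models/README.md'}})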

Full Customization of Vector Search

Deep Lake also offers a variety of search options depending on where data is stored (local, cloud, Deep Lake storage, etc.) and where query execution should take place (client side or server side).

Deep Lake's Compute Engine can be used to rapidly execute a variety of different search logic. It is available with !pip install "deeplake[enterprise]" (make sure to restart your kernel after installation), and it is only available for data stored in or connected to Deep Lake.

Let's load a representative Vector Store that is already stored in the Deep Lake Tensor Database. If data is not being written, it is advisable to use read_only = True.

vector_store = VectorStore(
    path = "hub://activeloop/twitter-algorithm",
    read_only=True
)

The query should be constructed using the Tensor Query Language (TQL) syntax.

prompt = "What do trust and safety models do?"

embedding = embedding_function(prompt)[0]

# Format the embedding array or list as a string, so it can be passed in the REST API request.
embedding_string = ",".join([str(item) for item in embedding])

tql_query = f"select * from (select text, cosine_similarity(embedding, ARRAY[{embedding_string}]) as score) order by score desc limit 5"

Let's run the query, noting that the query execution happens in the Managed Tensor Database, and not on the client.

search_results = vector_store.search(query=tql_query)
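
Since the TQL query ends with limit 5, the size of the result set should match that clause (a quick check):

len(search_results['text'])

# Expected: 5, matching the "limit 5" clause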

If we examine the first returned text, it appears to contain the same text about trust and safety models that is relevant to the prompt.

search_results['text'][0]

Returns:

Trust and Safety Models
=======================

We decided to open source the training code of the following models:
- pNSFWMedia: Model to detect tweets with NSFW images. This includes adult and porn content.
- pNSFWText: Model to detect tweets with NSFW text, adult/sexual topics.
- pToxicity: Model to detect toxic tweets. Toxicity includes marginal content like insults and certain types of harassment. Toxic content does not violate Twitter's terms of service.
- pAbuse: Model to detect abusive content. This includes violations of Twitter's terms of service, including hate speech, targeted harassment and abusive behavior.

We have several more models and rules that we are not going to open source at this time because of the adversarial nature of this area. The team is considering open sourcing more models going forward and will keep the community posted accordingly.

We can also retrieve the corresponding filename from the metadata, which shows the top result came from the README.

print(search_results['metadata'][0])

# Returns {'filepath': '/Users/istranic/ActiveloopCode/the-algorithm/trust_and_safety_models/README.md', 'extension': '.md'}
