
LAB MANUAL OF GENERATIVE AI

Composed by Adnan Alam Khan

Lab 1: Document loaders in Generative AI
LangChain provides more than 80 types of document loaders.
Program 01: Document Loading
This file shows how to load different types of documents for AI processing.
Key Concepts:
1. Retrieval Augmented Generation (RAG): An AI system that can look up information from documents to answer questions.
2. Document Loaders: Tools to import different file types.
Code Breakdown:
1. Setup:
   o Imports the necessary libraries and sets up the OpenAI API key.
   o load_dotenv() loads environment variables (like API keys) from a .env file.
2. PDF Loading:
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
   o Loads a PDF file and splits it into pages.
   o Each page becomes a "Document" object with text content and metadata.
3. YouTube Audio Loading:
loader = GenericLoader(
    FileSystemBlobLoader(save_dir, glob="*.m4a"),
    OpenAIWhisperParser()
)
   o Uses OpenAI's Whisper model to transcribe audio files.
   o Can load from YouTube or local audio files.
4. Web Page Loading:
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")
docs = loader.load()
   o Fetches and loads content from a webpage.
5. Notion Document Loading:
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
o Loads documents exported from Notion (a note-taking app)
Results:
• Each loader creates Document objects containing the content and metadata
• For PDFs: Returns page content and page numbers
• For YouTube: Returns transcribed text
• For web pages: Returns HTML content
• For Notion: Returns Markdown content

Code:
#!/usr/bin/env python
# coding: utf-8
# # Document Loading
# ## Note to students.
# During periods of high load you may find the notebook unresponsive. It may appear to execute a cell,
update the completion number in brackets [#] at the left of the cell but you may find the cell has not executed.
This is particularly obvious on print statements when there is no output. If this happens, restart the kernel
using the command under the Kernel tab.
# ## Retrieval augmented generation
# In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset
as part of its execution.
# This is useful if we want to ask question about specific documents (e.g., our PDFs, a set of videos, etc).
# ![overview.jpeg](attachment:overview.jpeg)
#! pip install langchain
# In[ ]:
import os
import openai
import sys
sys.path.append('../..')
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
# In Python, _ traditionally indicates a value we don't plan to use later
# Explanation of .env file A typical .env file might contain:
#OPENAI_API_KEY=sk-...your-api-key...
#DATABASE_URL=postgres://user:pass@localhost:5432/db
#After running the command, you can access these anywhere in your code:
#import os
#api_key = os.getenv("OPENAI_API_KEY")
#openai.api_key = os.environ['OPENAI_API_KEY']

# ## PDFs
# Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course!
# These documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly.
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.
#! pip install pypdf
# In[ ]:

from langchain.document_loaders import PyPDFLoader


loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
# Each page is a `Document`.
# A `Document` contains text (`page_content`) and `metadata`.
# In[ ]:
len(pages)
# In[ ]:
page = pages[0]
# In[ ]:
print(page.page_content[0:500])
# In[ ]:

page.metadata
# ## YouTube
from langchain.document_loaders.generic import GenericLoader, FileSystemBlobLoader

from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
# Above terms will be explained after end of this code :
# ! pip install yt_dlp
# ! pip install pydub
# **Note**: This can take several minutes to complete. This has been modified relative to the lesson video
to fetch the video file locally.
# In[ ]:

url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="docs/youtube/"
loader = GenericLoader(
    #YoutubeAudioLoader([url], save_dir),  # fetch from YouTube
    FileSystemBlobLoader(save_dir, glob="*.m4a"),  # fetch locally
    OpenAIWhisperParser()
)
docs = loader.load()
# In[ ]:
docs[0].page_content[0:500]
# ## URLs
# In[ ]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")
# > Note: the URL sent to the WebBaseLoader differs from the one shown in the video because it was updated for 2024.
docs = loader.load()
print(docs[0].page_content[:500])
# ## Notion
# Follow steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion) for an example Notion site such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):
# * Duplicate the page into your own Notion space and export as `Markdown / CSV`.
# * Unzip it and save it as a folder that contains the markdown file for the Notion page.
# ![image.png](./img/image.png)
# In[ ]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
print(docs[0].page_content[0:200])
# In[ ]:
docs[0].metadata
# In[ ]:
#The LangChain document loading system:
# GenericLoader
Definition: A flexible document loader that combines a Blob Loader (which fetches raw data) with
a Parser (which processes the raw data into documents).
Purpose: Provides a unified way to load documents from different sources by pairing the right blob loader
with the right parser.
Example Use Case: Loading audio files from YouTube, parsing them into text using OpenAI's Whisper model.

FileSystemBlobLoader
Definition: A loader that fetches raw binary data (called "blobs") from files in a local directory.
Key Features:
Searches for files matching a glob pattern (e.g., *.m4a for audio files).
Returns raw data without parsing it (e.g., audio bytes, PDF bytes, etc.).
Purpose: Used when you already have files downloaded locally and want to process them.
Example Use Case: Loading locally saved .m4a audio files for transcription.

OpenAIWhisperParser
Definition: A parser that uses OpenAI's Whisper (speech-to-text model) to transcribe audio blobs into text
documents.
Key Features:
Converts audio (e.g., MP3, M4A) into text.
Handles long audio files by splitting them into chunks.
Purpose: Extract text content from audio/video files for further processing (e.g., summarization, Q&A).
Example Use Case: Transcribing a lecture recording into text for a chatbot to reference.

YoutubeAudioLoader
Definition: A blob loader that downloads audio from YouTube videos.
Key Features:
Takes YouTube URLs as input.
Downloads audio streams and saves them as local files (e.g., .m4a).
Purpose: Fetch audio content directly from YouTube links.
Example Use Case: Downloading a podcast episode from YouTube to analyze its content.

How They Work Together


YoutubeAudioLoader (or FileSystemBlobLoader) fetches raw audio data.
OpenAIWhisperParser converts the audio into text documents.
GenericLoader orchestrates the process by combining the two.
Example Code Flow
# Option 1: Download from YouTube
loader = GenericLoader(
    YoutubeAudioLoader([youtube_url], save_dir="audio/"),
    OpenAIWhisperParser()
)
# Option 2: Load local audio files
loader = GenericLoader(
    FileSystemBlobLoader("audio/", glob="*.m4a"),
    OpenAIWhisperParser()
)
# Result: List of transcribed text documents
docs = loader.load()

Key Terms Recap

Term                 | Role                     | Example Input        | Output
GenericLoader        | Coordinator              | Blob Loader + Parser | Document(s)
FileSystemBlobLoader | Local file fetcher       | *.m4a files          | Raw audio blobs
OpenAIWhisperParser  | Audio-to-text converter  | Audio blob           | Text document
YoutubeAudioLoader   | YouTube audio downloader | YouTube URL          | Audio file (.m4a)
This setup is commonly used in Retrieval-Augmented Generation (RAG) systems to process multimedia
content (e.g., videos, podcasts) into searchable text.
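As a recap, here is a minimal end-to-end sketch of that pipeline, reusing the loader classes and the docs/youtube/ directory from the code above (the splitter it feeds into is introduced in Lab 2; the chunk sizes are illustrative, not prescribed):
from langchain.document_loaders.generic import GenericLoader, FileSystemBlobLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Transcribe locally saved audio (docs/youtube/ as used above) into Documents
loader = GenericLoader(
    FileSystemBlobLoader("docs/youtube/", glob="*.m4a"),
    OpenAIWhisperParser()  # requires an OpenAI API key
)
docs = loader.load()

# Split the transcripts into chunks ready for embedding (covered in Lab 2)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)
print(len(chunks))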

Lab 2: Document loaders in Generative AI
Program 2: Data Preparation Toolkit Program.
This program demonstrates text splitting strategies for AI document processing, focusing on chunking
techniques used in Retrieval-Augmented Generation (RAG) pipelines. Below is a structured breakdown:

1. Core Purpose
Prepares raw text data (documents, web pages, PDFs, etc.) for AI models by:
• Splitting long texts into smaller, semantically meaningful chunks.
• Preserving context through overlap and hierarchical separators.
• Handling edge cases (e.g., code, markdown, token limits).

2. Key Components
A. Text Splitters
Two primary classes are used:
1. RecursiveCharacterTextSplitter
a. Splits text recursively using separators (e.g., paragraphs, sentences, words).
b. Ideal for natural language (prose, articles).
2. CharacterTextSplitter
a. Splits at exact character counts.
b. Simpler but may break mid-sentence.
B. Parameters
• chunk_size: Max characters/tokens per chunk.
• chunk_overlap: Shared text between chunks (preserves context).
• separators: Hierarchy for splitting (e.g., ["\n\n", "\n", " ", ""]).

3. Workflow & Code Explanation


A. Basic Splitting (Testing)
text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1) # No split (26 chars ≤ chunk_size=26)
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2) # Splits at 26 chars + overlap
• Output: Shows how overlap and chunk size interact.
• Why? Tests edge cases (exact size, no natural breaks).
B. Advanced Splitting
some_text = "When writing documents, writers use structure..."
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)
• Logic:
  o First tries to split at \n\n (paragraphs).
  o Falls back to \n (lines), then sentences (\. ), then words ( ).
• Output: Chunks ≤150 chars, respecting semantic boundaries.
C. Real-World Document Splitting
# PDF Example
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150
)
docs = text_splitter.split_documents(pages) # 77 chunks from 1 PDF
• Key Insight:
  o Splits PDF text at \n (newlines) with 150-char overlap.
  o Result: 77 chunks (~1000 chars each), optimized for LLM processing.
D. Token-Based Splitting
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)  # overlap set explicitly; the default (200) would exceed chunk_size and raise an error
text_splitter.split_text("foo bar bazzyfoo") # Splits by tokens (not chars)
• Use Case: Matches LLM context windows (e.g., GPT-4 uses tokens).
• Note: Tokens ≈ 4 chars (varies by language/model).
E. Structured Document Splitting (Markdown)
markdown_document = "# Title\n\n## Chapter 1\n\nHi this is Jim..."
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
md_header_splits = markdown_splitter.split_text(markdown_document)
• Output: Chunks retain header metadata (e.g., {"Header 1": "Title"}).
• Why? Preserves document hierarchy (critical for Notion/Confluence docs); see the sketch below.

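As a hedged follow-on sketch, the same splitter can be pointed at the Notion export from Lab 1; joining the loaded pages into one Markdown string first is one reasonable way to feed it in, not a fixed recipe:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Load the Markdown files exported from Notion (directory from Lab 1)
docs = NotionDirectoryLoader("docs/Notion_DB").load()
txt = " ".join(d.page_content for d in docs)

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
splits = markdown_splitter.split_text(txt)
print(splits[0])  # first chunk, with its header metadata attached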
4. Technical Insights
Why Recursive Splitting?
• Mimics human reading: Breaks text into paragraphs → sentences → words.
• Avoids mid-sentence cuts (better for embeddings/RAG).
Overlap Tradeoffs
• Pros: Maintains context across chunks.
• Cons: Increases compute/storage (duplicate text).
Separator Hierarchy
Order matters! Example priority:
1. \n\n (paragraphs)
2. \n (lines)
3. \. (sentences)
4. " " (words)

5. Practical Applications
1. RAG Pipelines: Chunked docs → embeddings → vector DB → retrieval.
2. Fine-Tuning: Prepares data for LLM training.
3. Preprocessing: Cleans messy inputs (PDFs, transcripts, web scrapes).

6. Output Summary
The script doesn't print results by default, but running it would generate:
• Chunked text (lists of strings meeting chunk_size).
• Metadata-enriched splits (for markdown/Notion).
• Tokenized segments (for LLM compatibility).

Key Takeaway
This program is a data preparation toolkit for AI workflows, ensuring text is split into digestible pieces while
preserving structure and context. The techniques shown are foundational for building production-grade RAG
systems.

The commands and the concept of chunk overlap:

1. Import Statements
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
• Purpose: Imports two text-splitting tools from LangChain:
  o RecursiveCharacterTextSplitter: Splits text hierarchically (paragraphs → sentences → words).
  o CharacterTextSplitter: Splits text at exact character counts.

2. Chunk Parameters
chunk_size = 26 # Max characters per chunk
chunk_overlap = 4 # Shared characters between adjacent chunks
• chunk_size: The target maximum length (in characters) for each text chunk.
  o Example: If set to 26, no chunk will exceed 26 characters.
• chunk_overlap: The number of characters shared between consecutive chunks.
  o Example: With chunk_overlap=4, the last 4 characters of one chunk will repeat at the start of the next.

3. Why Chunk Overlap?
• Purpose: Preserves context across chunks.
  o Without overlap, splitting text at arbitrary points might cut off meaningful ideas (e.g., mid-sentence).
  o Overlap ensures continuity (e.g., a sentence fragment in one chunk will appear fully in the next).
• Tradeoff: Increases storage/compute (due to duplicated text) but improves retrieval quality in RAG systems (see the sketch below).

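A minimal sketch of this interaction, assuming langchain is installed (the output shown matches the worked example in section 5 below):
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size=26 with a 4-character overlap, as in the parameters above
splitter = RecursiveCharacterTextSplitter(chunk_size=26, chunk_overlap=4)
print(splitter.split_text("abcdefghijklmnopqrstuvwxyzabcdefg"))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg'] -- 'wxyz' repeats across chunks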
4. Splitter Initialization
A. Recursive Splitter
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
• Behavior:
  1. Splits text hierarchically using default separators (["\n\n", "\n", " ", ""]).
  2. Prioritizes natural breaks (paragraphs → sentences → words).
  3. Enforces chunk_size and chunk_overlap after splitting.
• Use Case: Ideal for prose (articles, docs) where semantic structure matters.
B. Character Splitter
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
• Behavior:
  1. Splits text at exact character counts (no semantic awareness).
  2. Simply "cuts" text every chunk_size characters, then adds overlap.
• Use Case: Raw data where structure is irrelevant (e.g., code, logs).

5. Example Outputs
Input Text: "abcdefghijklmnopqrstuvwxyzabcdefg" (33 characters)
• Recursive Splitter:
  o Tries to split at natural breaks first. If none exist, behaves like CharacterTextSplitter.
  o Output (with chunk_size=26, overlap=4):
    ["abcdefghijklmnopqrstuvwxyz", "wxyzabcdefg"] # Last 4 chars overlap
• Character Splitter:
  o Direct split at character 26, with 4-character overlap:
    ["abcdefghijklmnopqrstuvwxyz", "wxyzabcdefg"]

6. Key Differences
Feature             | RecursiveCharacterTextSplitter    | CharacterTextSplitter
Splitting Logic     | Hierarchical (paragraphs → words) | Fixed character counts
Preserves Semantics | Yes (avoids mid-sentence cuts)    | No (blind to meaning)
Best For            | Natural language (docs, articles) | Unstructured text (code, logs)
Overlap Handling    | Applied after semantic splitting  | Applied after character cutting

When to Use Each
• Use RecursiveCharacterTextSplitter for:
  o Documents, articles, or any text where context matters.
  o Example: Splitting a PDF lecture into coherent passages for a RAG system.
• Use CharacterTextSplitter for:
  o Machine-generated text (logs, code) or when speed is critical.
  o Example: Preprocessing raw data before tokenization.

Practical Note
• Typical Values in Production:
  o chunk_size=500-1500 (balance between context and LLM input limits).
  o chunk_overlap=10-20% of chunk_size (e.g., 100 chars for chunk_size=1000).
• Adjust based on your use case: Larger overlaps improve context but increase costs.

Explanation of LAB 2.
1. Program Overview
Program_02_chat.py is a Python script demonstrating document splitting techniques for AI and NLP pipelines,
specifically designed for Retrieval-Augmented Generation (RAG) systems. It showcases how to preprocess text data
by breaking it into optimally sized chunks while preserving semantic structure.

2. Core Components
A. Text Splitters
The program compares two primary splitting methods:
1. RecursiveCharacterTextSplitter
1. Purpose: Splits text hierarchically (paragraphs → sentences → words).
2. Advantage: Preserves logical structure, avoids mid-sentence breaks.
3. Use Case: Natural language (articles, PDFs, Markdown).
2. CharacterTextSplitter
1. Purpose: Splits text at fixed character counts.
2. Advantage: Fast and simple.
3. Use Case: Unstructured text (logs, code).
B. Key Parameters
• chunk_size: Maximum characters per chunk (e.g., 26 for testing, 1000 for production).
• chunk_overlap: Shared characters between chunks (e.g., 4 or 150) to maintain context.
• separators: Hierarchy for splitting (e.g., ["\n\n", "\n", " ", ""]).

3. Code Walkthrough
A. Setup & Imports
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
• Imports the two splitting strategies from LangChain.

B. Basic Splitting Tests
chunk_size = 26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
Initializes splitters with small chunk_size and overlap for testing edge cases.
• Test Cases:
  o text1 = "abcdefghijklmnopqrstuvwxyz" → No split (exact chunk_size).
  o text2 = "abcdefghijklmnopqrstuvwxyzabcdefg" → Splits with 4-character overlap.


C. Advanced Splitting
some_text = "When writing documents, writers use structure..."
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)
• Behavior:
  o Splits at \n\n (paragraphs).
  o Falls back to \n (lines), then sentences (\. ), then words ( ).
• Output: Chunks ≤150 chars, respecting semantic boundaries.
D. Real-World Examples
PDF Splitting
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150
)
docs = text_splitter.split_documents(pages) # 77 chunks
• Splits a PDF lecture into ~1000-character chunks with 150-character overlap.
Markdown/Notion Splitting
markdown_document = "# Title\n\n## Chapter 1\n\nHi this is Jim..."
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
md_header_splits = markdown_splitter.split_text(markdown_document)
• Preserves header metadata (e.g., {"Header 1": "Title"}) in chunks.

4. Technical Insights
Why Splitting Matters
• LLM Constraints: Models have limited context windows (e.g., 4K–32K tokens).
• RAG Optimization: Smaller chunks improve retrieval accuracy.
• Context Preservation: Overlap prevents information loss at chunk boundaries.
Separator Hierarchy
Priority order for RecursiveCharacterTextSplitter:
1. \n\n (paragraphs) → 2. \n (lines) → 3. \. (sentences) → 4. " " (words).
Token vs. Character Splitting
• TokenTextSplitter: Used when working directly with LLMs (e.g., chunking by GPT-4 tokens); see the sketch below.
• CharacterTextSplitter: Simpler but less precise.

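A minimal sketch of token-based splitting, assuming the tiktoken package is installed. Exact chunk boundaries depend on the tokenizer, so the output comment is illustrative; chunk_size=1 is used here (rather than the 10 shown earlier) to make the token boundaries visible:
from langchain.text_splitter import TokenTextSplitter  # uses tiktoken under the hood

# chunk_size counts tokens, not characters
token_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
print(token_splitter.split_text("foo bar bazzyfoo"))
# e.g. ['foo', ' bar', ' b', 'azzy', 'foo'] -- note 'bazzyfoo' spans several tokens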
5. Practical Applications
1. RAG Pipelines:
   o Chunked docs → embeddings → vector DB → retrieval.
2. Fine-Tuning Data Prep:
   o Splits long texts for LLM training.
3. Document Preprocessing:
   o Cleans PDFs, Markdown, or web-scraped content.

6. Output Examples
Recursive Splitter Output
["When writing documents, writers use...", "...structure to group ideas."]
• Chunks are semantically coherent.
Markdown Splitter Output
[
  {
    "content": "Hi this is Jim...",
    "metadata": {"Header 1": "Title", "Header 2": "Chapter 1"}
  }
]
• Retains document structure.

Key Takeaways
• RecursiveCharacterTextSplitter is ideal for natural language.
• Overlap (10–20%) is critical for context continuity.
• Metadata-aware splitting (e.g., Markdown) enhances RAG performance.
This script is a foundational tool for AI workflows involving large documents.

Lab 3: Vector Storage in Generative AI
A professional analysis of Program_03_chat.py, which demonstrates vector storage and semantic search for Retrieval-Augmented Generation (RAG) systems:

1. Core Purpose
This program transforms text documents into searchable vectors and implements semantic search using:
• Embeddings: Numerical representations of text meaning
• Vector Database (ChromaDB): Stores vectors for efficient retrieval
• Similarity Search: Finds relevant text chunks for queries

2. Key Components
A. Document Processing Pipeline
1. Loading & Splitting (reuses code from Program_01/02)
   o Loads duplicate and non-duplicate PDFs (MachineLearning-Lecture01.pdf x2, -02.pdf, -03.pdf)
   o Splits into chunks (1500 chars with 150-char overlap)
2. Embedding Generation
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
   o Uses OpenAI's embedding model (text-embedding-ada-002 by default)
   o Converts text → 1536-dimensional vectors (checked in the sketch below)
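A quick sanity check of that dimensionality, assuming a valid OPENAI_API_KEY is set:
vec = embedding.embed_query("i like dogs")
print(len(vec))  # 1536 dimensions for text-embedding-ada-002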
3. Vector Storage
from langchain.vectorstores import Chroma
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory='docs/chroma/'
)
   o Stores vectors in ChromaDB (local persistent storage)
   o Auto-handles duplicate documents (but doesn't deduplicate)
B. Semantic Search
question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question, k=3)
• Finds the top-3 most semantically similar chunks to the query
• Uses cosine similarity between query and document vectors

3. Technical Workflow
1. Embedding Demonstration
import numpy as np
sentence1 = "i like dogs"
sentence2 = "i like canines"               # any semantically similar sentence works
sentence3 = "the weather is ugly outside"  # any unrelated sentence works
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)
np.dot(embedding1, embedding2) # ≈0.96 (similar)
np.dot(embedding1, embedding3) # ≈0.77 (dissimilar)
   o Shows how semantic similarity translates to vector math
2. Search Examples
   o Successful case:
question = "is there an email i can ask for help"
# Returns chunks containing email addresses
   o Failure cases:
question = "what did they say about matlab?"
# Returns duplicate chunks (from Lecture01.pdf x2)
question = "what did they say about regression in the third lecture?"
# Returns irrelevant chunks (from Lectures 1/2)

4. Key Outputs
1. Vector Database
• Contains 209 document chunks (printed via vectordb._collection.count())
• Persisted to docs/chroma/ for reuse
2. Search Results
• Returns document chunks with:
  o page_content: Relevant text
  o metadata: Source PDF and page number (see the inspection sketch below)
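A minimal inspection sketch, assuming the vectordb built above (the metadata values shown in comments are illustrative):
docs = vectordb.similarity_search("is there an email i can ask for help", k=3)
for d in docs:
    print(d.metadata)              # e.g. {'source': '...Lecture01.pdf', 'page': ...}
    print(d.page_content[:100])    # first 100 characters of the matching chunk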

5. Failure Modes Identified
1. Duplicate Content
   o Caused by identical PDFs in source data
   o Solution: Deduplication pre-processing
2. Lecture-Specific Queries
   o Search returns irrelevant lecture chunks
   o Solution: Metadata filtering (shown in the next lesson)
3. Over-Retrieval
   o Returns too many similar chunks
   o Solution: Diversity-aware search (MMR)

6. Professional Insights
• Embedding Choice Matters: OpenAI embeddings work well but require API calls. Alternatives:
  o Local models (e.g., all-MiniLM-L6-v2)
  o Task-specific fine-tuning
• ChromaDB Advantages:
  o Lightweight
  o Persistent storage
  o Native LangChain integration
• Production Considerations:
# Recommended settings for production:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200, # 20% overlap
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)

7. Practical Applications
1. RAG Systems
   o Foundation for document Q&A bots
2. Knowledge Management
   o Enterprise document search
3. Lecture Analysis
   o Course content retrieval (as shown)

Summary
This program demonstrates a complete RAG preprocessing pipeline:
PDFs → Text Chunks → Embeddings → Vector DB → Semantic Search
With clear examples of both successful implementations and failure modes to address in subsequent lessons.

Lab 4: Sophisticated document retrieval in Generative AI
This program demonstrates sophisticated document retrieval methods for Retrieval-Augmented Generation (RAG)
pipelines, addressing key challenges like result diversity, metadata filtering, and content compression. Below is a
structured breakdown:

1. Core Components & Workflow


A. VectorDB Initialization
vectordb = Chroma(
    persist_directory='docs/chroma/',
    embedding_function=OpenAIEmbeddings()
)
• Purpose: Loads the pre-existing vector database (from Program_03) with OpenAI embeddings.
• Key Output: vectordb._collection.count() shows the number of stored document chunks.
B. Similarity Search Basics
question = "Tell me about all-white mushrooms with large fruiting bodies"
smalldb.similarity_search(question, k=2)
• k Significance:
  o Controls the number of top results returned (k=2 → top 2 matches).
  o Higher k = more results but potentially lower precision.
C. Maximum Marginal Relevance (MMR)
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)
• Purpose: Balances relevance and diversity in results.
• fetch_k: Number of candidates to consider before selecting the k diverse final results.
• Use Case: Avoids duplicate/overlapping chunks (e.g., redundant lecture snippets).
D. Metadata Filtering
vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)
• Why? Ensures results come from specific sources (e.g., "Lecture 3 only").
• Metadata Fields Used:
  o source: PDF file path.
  o page: Page number.
E. Self-Query Retriever
retriever = SelfQueryRetriever.from_llm(
    llm=OpenAI(model='gpt-3.5-turbo-instruct'),
    vectorstore=vectordb,
    document_contents="CS229 lecture notes",  # required short description of the corpus; wording here is illustrative
    metadata_field_info=[...]  # see the sketch below
)
• Key Feature: Uses an LLM to infer metadata filters from natural language queries.
• Example Query: "what did they say about regression in the third lecture?"
• Auto-extracted Filter: {"source": "MachineLearning-Lecture03.pdf"}.
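A hedged sketch of what metadata_field_info might contain, using LangChain's AttributeInfo helper (the descriptions are illustrative, written for the lecture PDFs used earlier):
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture PDF the chunk comes from, e.g. docs/cs229_lectures/MachineLearning-Lecture03.pdf",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page number within the lecture PDF",
        type="integer",
    ),
]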
F. Contextual Compression
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)  # llm: the OpenAI model used above
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)
• Purpose: Extracts only the relevant parts of retrieved documents (usage sketch below).
  o Reduces noise in LLM prompts.
  o Cuts API costs by removing irrelevant text.

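A one-line usage sketch, reusing a query from Lab 3 (the query text is illustrative):
compressed_docs = compression_retriever.get_relevant_documents("what did they say about matlab?")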
G. Alternative Retrievers (TF-IDF & SVM)
from langchain.retrievers import SVMRetriever, TFIDFRetriever

svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
• TF-IDF: Traditional keyword-based retrieval.
• SVM: Uses support vector machines for semantic search.
• Use Case: Lightweight alternatives to vector databases.

2. Key Techniques & Their Significance

Technique              | Problem Solved                           | Implementation
Similarity Search      | Basic relevance ranking                  | vectordb.similarity_search(question, k=3)
MMR                    | Redundant/overlapping results            | max_marginal_relevance_search(k=2, fetch_k=3)
Metadata Filtering     | Irrelevant sources (e.g., wrong lecture) | filter={"source": "Lecture03.pdf"}
Self-Query Retriever   | Natural language metadata extraction     | SelfQueryRetriever.from_llm(...)
Contextual Compression | Noisy/long documents                     | LLMChainExtractor + CompressionRetriever

3. Parameter Deep Dive: k and fetch_k

A. k (Number of Results)
• Role: Controls how many chunks are returned.
• Tradeoffs:
  o Small k (e.g., 2-3): High precision, but may miss relevant info.
  o Large k (e.g., 5-10): Broad coverage, but lower relevance.
B. fetch_k (MMR Candidate Pool)
• Role: Number of initial candidates before MMR diversifies results.
• Example: fetch_k=3, k=2 → MMR picks 2 diverse results from the top 3.
• Rule of Thumb: Set fetch_k ≥ 2*k (see the sketch below).

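A minimal sketch applying that rule of thumb, assuming the vectordb loaded in Section 1 and a question string as defined earlier:
# fetch 10 candidates, then let MMR pick the 5 most diverse (fetch_k = 2*k)
docs = vectordb.max_marginal_relevance_search(question, k=5, fetch_k=10)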
4. Practical Output Examples


Similarity Search
[
"A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.",
"The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body."
]
MMR Search
[
"A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.",
"A. phalloides, a.k.a Death Cap, is one of the most poisonous..." # More diverse
]
Compressed Results
Document 1:
"Amanita phalloides has all-white varieties with large fruiting bodies."
------------------------------------------------------------
Document 2:
"Death Cap mushrooms are poisonous."

5. Professional Recommendations
1. Production Settings:
   o Start with k=5, fetch_k=10 for MMR.
   o Use SelfQueryRetriever for natural language queries with metadata.
2. Cost Optimization:
   o Combine compression with MMR to reduce LLM token usage.
3. Fallback Strategies:
   o Use TF-IDF/SVM retrievers as backups for edge cases.

Summary
This program extends basic RAG retrieval with advanced filtering, diversity control, and noise reduction, making it
production-ready. The k parameter is central to balancing precision/recall in search results.
