
LAB MANUAL OF GENERATIVE AI

Composed by Adnan Alam Khan

Lab 1: Document loaders in Generative AI
LangChain provides more than 80 types of document loaders.
Program 01: Document Loading
This file shows how to load different types of documents for AI processing.
Key Concepts:
1. Retrieval Augmented Generation (RAG): An AI system that can look up information from documents to answer questions.
2. Document Loaders: Tools to import different file types.
Code Breakdown:
1. Setup:
   o Imports the necessary libraries and sets up the OpenAI API key.
   o load_dotenv() loads environment variables (like API keys) from a .env file.
2. PDF Loading:
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
   o Loads a PDF file and splits it into pages.
   o Each page becomes a "Document" object with text content and metadata.
3. YouTube Audio Loading:
loader = GenericLoader(
    FileSystemBlobLoader(save_dir, glob="*.m4a"),
    OpenAIWhisperParser()
)
   o Uses OpenAI's Whisper model to transcribe audio files.
   o Can load from YouTube or local audio files.
4. Web Page Loading:
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")
docs = loader.load()
   o Fetches and loads content from a webpage.
5. Notion Document Loading:
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
o Loads documents exported from Notion (a note-taking app)
Results:
• Each loader creates Document objects containing the content and metadata
• For PDFs: Returns page content and page numbers
• For YouTube: Returns transcribed text
• For web pages: Returns HTML content
• For Notion: Returns Markdown content

Code:
#!/usr/bin/env python
# coding: utf-8
# # Document Loading
# ## Note to students.
# During periods of high load you may find the notebook unresponsive. It may appear to execute a cell,
update the completion number in brackets [#] at the left of the cell but you may find the cell has not executed.
This is particularly obvious on print statements when there is no output. If this happens, restart the kernel
using the command under the Kernel tab.
# ## Retrieval augmented generation
# In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset
as part of its execution.
# This is useful if we want to ask question about specific documents (e.g., our PDFs, a set of videos, etc).
# ![overview.jpeg](attachment:overview.jpeg)
#! pip install langchain
# In[ ]:
import os
import openai
import sys
sys.path.append('../..')
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
# In Python, _ traditionally indicates a value we don't plan to use later
# Explanation of .env file A typical .env file might contain:
#OPENAI_API_KEY=sk-...your-api-key...
#DATABASE_URL=postgres://user:pass@localhost:5432/db
#After running the command, you can access these anywhere in your code:
#import os
#api_key = os.getenv("OPENAI_API_KEY")
#openai.api_key = os.environ['OPENAI_API_KEY']

# ## PDFs
# Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course!
# These documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly.
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.
#! pip install pypdf
# In[ ]:

from langchain.document_loaders import PyPDFLoader


loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
# Each page is a `Document`.
# A `Document` contains text (`page_content`) and `metadata`.
# In[ ]:
len(pages)
# In[ ]:
page = pages[0]
# In[ ]:
print(page.page_content[0:500])
# In[ ]:

page.metadata
# ## YouTube
from langchain.document_loaders.generic import GenericLoader, FileSystemBlobLoader

from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
# Above terms will be explained after end of this code :
# ! pip install yt_dlp
# ! pip install pydub
# **Note**: This can take several minutes to complete. This has been modified relative to the lesson video
to fetch the video file locally.
# In[ ]:

url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="docs/youtube/"
loader = GenericLoader(
    #YoutubeAudioLoader([url], save_dir),  # fetch from YouTube
    FileSystemBlobLoader(save_dir, glob="*.m4a"),  # fetch locally
    OpenAIWhisperParser()
)
docs = loader.load()
# In[ ]:
docs[0].page_content[0:500]
# ## URLs
# In[ ]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")
# > Note: the URL sent to the WebBaseLoader differs from the one shown in the video because it was updated for 2024.
docs = loader.load()
print(docs[0].page_content[:500])
# ## Notion
# Follow steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion) for an example Notion site such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):
# * Duplicate the page into your own Notion space and export as `Markdown / CSV`.
# * Unzip it and save it as a folder that contains the markdown file for the Notion page.
# ![image.png](./img/image.png)
# In[ ]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
print(docs[0].page_content[0:200])
# In[ ]:
docs[0].metadata
# In[ ]:
#The LangChain document loading system:
# GenericLoader
Definition: A flexible document loader that combines a Blob Loader (which fetches raw data) with
a Parser (which processes the raw data into documents).
Purpose: Provides a unified way to load documents from different sources by pairing the right blob loader
with the right parser.
Example Use Case: Loading audio files from YouTube, parsing them into text using OpenAI's Whisper model.

FileSystemBlobLoader
Definition: A loader that fetches raw binary data (called "blobs") from files in a local directory.
Key Features:
Searches for files matching a glob pattern (e.g., *.m4a for audio files).
Returns raw data without parsing it (e.g., audio bytes, PDF bytes, etc.).
Purpose: Used when you already have files downloaded locally and want to process them.
Example Use Case: Loading locally saved .m4a audio files for transcription.

OpenAIWhisperParser
Definition: A parser that uses OpenAI's Whisper (speech-to-text model) to transcribe audio blobs into text
documents.
Key Features:
Converts audio (e.g., MP3, M4A) into text.
Handles long audio files by splitting them into chunks.
Purpose: Extract text content from audio/video files for further processing (e.g., summarization, Q&A).
Example Use Case: Transcribing a lecture recording into text for a chatbot to reference.

YoutubeAudioLoader
Definition: A blob loader that downloads audio from YouTube videos.
Key Features:
Takes YouTube URLs as input.
Downloads audio streams and saves them as local files (e.g., .m4a).
Purpose: Fetch audio content directly from YouTube links.
Example Use Case: Downloading a podcast episode from YouTube to analyze its content.

How They Work Together


YoutubeAudioLoader (or FileSystemBlobLoader) fetches raw audio data.
OpenAIWhisperParser converts the audio into text documents.
GenericLoader orchestrates the process by combining the two.
Example Code Flow
# Option 1: Download from YouTube
loader = GenericLoader(
    YoutubeAudioLoader([youtube_url], save_dir="audio/"),
    OpenAIWhisperParser()
)
# Option 2: Load local audio files
loader = GenericLoader(
    FileSystemBlobLoader("audio/", glob="*.m4a"),
    OpenAIWhisperParser()
)
# Result: List of transcribed text documents
docs = loader.load()

Key Terms Recap

Term                 | Role                     | Example Input        | Output
GenericLoader        | Coordinator              | Blob Loader + Parser | Document(s)
FileSystemBlobLoader | Local file fetcher       | *.m4a files          | Raw audio blobs
OpenAIWhisperParser  | Audio-to-text converter  | Audio blob           | Text document
YoutubeAudioLoader   | YouTube audio downloader | YouTube URL          | Audio file (.m4a)
This setup is commonly used in Retrieval-Augmented Generation (RAG) systems to process multimedia
content (e.g., videos, podcasts) into searchable text.
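As a recap, here is a minimal end-to-end sketch of that pipeline, reusing the loader classes and the docs/youtube/ directory from the code above (the splitter it feeds into is introduced in Lab 2; the chunk sizes are illustrative, not prescribed):
from langchain.document_loaders.generic import GenericLoader, FileSystemBlobLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Transcribe locally saved audio (docs/youtube/ as used above) into Documents
loader = GenericLoader(
    FileSystemBlobLoader("docs/youtube/", glob="*.m4a"),
    OpenAIWhisperParser()  # requires an OpenAI API key
)
docs = loader.load()

# Split the transcripts into chunks ready for embedding (covered in Lab 2)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)
print(len(chunks))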

Lab 2: Document loaders in Generative AI
Program 2: Data Preparation Toolkit Program.
This program demonstrates text splitting strategies for AI document processing, focusing on chunking
techniques used in Retrieval-Augmented Generation (RAG) pipelines. Below is a structured breakdown:

1. Core Purpose
Prepares raw text data (documents, web pages, PDFs, etc.) for AI models by:
• Splitting long texts into smaller, semantically meaningful chunks.
• Preserving context through overlap and hierarchical separators.
• Handling edge cases (e.g., code, markdown, token limits).

2. Key Components
A. Text Splitters
Two primary classes are used:
1. RecursiveCharacterTextSplitter
a. Splits text recursively using separators (e.g., paragraphs, sentences, words).
b. Ideal for natural language (prose, articles).
2. CharacterTextSplitter
a. Splits at exact character counts.
b. Simpler but may break mid-sentence.
B. Parameters
• chunk_size: Max characters/tokens per chunk.
• chunk_overlap: Shared text between chunks (preserves context).
• separators: Hierarchy for splitting (e.g., ["\n\n", "\n", " ", ""]).

3. Workflow & Code Explanation


A. Basic Splitting (Testing)
text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1) # No split (26 chars ≤ chunk_size=26)
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2) # Splits at 26 chars + overlap
• Output: Shows how overlap and chunk size interact.
• Why? Tests edge cases (exact size, no natural breaks).
B. Advanced Splitting
some_text = "When writing documents, writers use structure..."
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)
• Logic:
  o First tries to split at \n\n (paragraphs).
  o Falls back to \n (lines), then sentences (\. ), then words ( ).
• Output: Chunks ≤150 chars, respecting semantic boundaries.
C. Real-World Document Splitting
# PDF Example
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150
)
docs = text_splitter.split_documents(pages) # 77 chunks from 1 PDF
• Key Insight:
  o Splits PDF text at \n (newlines) with 150-char overlap.
  o Result: 77 chunks (~1000 chars each), optimized for LLM processing.
D. Token-Based Splitting
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)  # overlap set explicitly; the default (200) would exceed chunk_size and raise an error
text_splitter.split_text("foo bar bazzyfoo") # Splits by tokens (not chars)
• Use Case: Matches LLM context windows (e.g., GPT-4 uses tokens).
• Note: Tokens ≈ 4 chars (varies by language/model).
E. Structured Document Splitting (Markdown)
markdown_document = "# Title\n\n## Chapter 1\n\nHi this is Jim..."
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
md_header_splits = markdown_splitter.split_text(markdown_document)
• Output: Chunks retain header metadata (e.g., {"Header 1": "Title"}).
• Why? Preserves document hierarchy (critical for Notion/Confluence docs); see the sketch below.

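As a hedged follow-on sketch, the same splitter can be pointed at the Notion export from Lab 1; joining the loaded pages into one Markdown string first is one reasonable way to feed it in, not a fixed recipe:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Load the Markdown files exported from Notion (directory from Lab 1)
docs = NotionDirectoryLoader("docs/Notion_DB").load()
txt = " ".join(d.page_content for d in docs)

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
splits = markdown_splitter.split_text(txt)
print(splits[0])  # first chunk, with its header metadata attached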
4. Technical Insights
Why Recursive Splitting?
• Mimics human reading: Breaks text into paragraphs → sentences → words.
• Avoids mid-sentence cuts (better for embeddings/RAG).
Overlap Tradeoffs
• Pros: Maintains context across chunks.
• Cons: Increases compute/storage (duplicate text).
Separator Hierarchy
Order matters! Example priority:
1. \n\n (paragraphs)
2. \n (lines)
3. \. (sentences)
4. " " (words)

5. Practical Applications
1. RAG Pipelines: Chunked docs → embeddings → vector DB → retrieval.
2. Fine-Tuning: Prepares data for LLM training.
3. Preprocessing: Cleans messy inputs (PDFs, transcripts, web scrapes).

6. Output Summary
The script doesn't print results by default, but running it would generate:
• Chunked text (lists of strings meeting chunk_size).
• Metadata-enriched splits (for markdown/Notion).
• Tokenized segments (for LLM compatibility).

Key Takeaway
This program is a data preparation toolkit for AI workflows, ensuring text is split into digestible pieces while
preserving structure and context. The techniques shown are foundational for building production-grade RAG
systems.

The commands and the concept of chunk overlap:

1. Import Statements
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
• Purpose: Imports two text-splitting tools from LangChain:
  o RecursiveCharacterTextSplitter: Splits text hierarchically (paragraphs → sentences → words).
  o CharacterTextSplitter: Splits text at exact character counts.

2. Chunk Parameters
chunk_size = 26 # Max characters per chunk
chunk_overlap = 4 # Shared characters between adjacent chunks
• chunk_size: The target maximum length (in characters) for each text chunk.
  o Example: If set to 26, no chunk will exceed 26 characters.
• chunk_overlap: The number of characters shared between consecutive chunks.
  o Example: With chunk_overlap=4, the last 4 characters of one chunk will repeat at the start of the next.

3. Why Chunk Overlap?
• Purpose: Preserves context across chunks.
  o Without overlap, splitting text at arbitrary points might cut off meaningful ideas (e.g., mid-sentence).
  o Overlap ensures continuity (e.g., a sentence fragment in one chunk will appear fully in the next).
• Tradeoff: Increases storage/compute (due to duplicated text) but improves retrieval quality in RAG systems (see the sketch below).

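A minimal sketch of this interaction, assuming langchain is installed (the output shown matches the worked example in section 5 below):
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size=26 with a 4-character overlap, as in the parameters above
splitter = RecursiveCharacterTextSplitter(chunk_size=26, chunk_overlap=4)
print(splitter.split_text("abcdefghijklmnopqrstuvwxyzabcdefg"))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg'] -- 'wxyz' repeats across chunks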
4. Splitter Initialization
A. Recursive Splitter
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
• Behavior:
  1. Splits text hierarchically using default separators (["\n\n", "\n", " ", ""]).
  2. Prioritizes natural breaks (paragraphs → sentences → words).
  3. Enforces chunk_size and chunk_overlap after splitting.
• Use Case: Ideal for prose (articles, docs) where semantic structure matters.
B. Character Splitter
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
• Behavior:
  1. Splits text at exact character counts (no semantic awareness).
  2. Simply "cuts" text every chunk_size characters, then adds overlap.
• Use Case: Raw data where structure is irrelevant (e.g., code, logs).

5. Example Outputs
Input Text: "abcdefghijklmnopqrstuvwxyzabcdefg" (33 characters)
• Recursive Splitter:
  o Tries to split at natural breaks first. If none exist, behaves like CharacterTextSplitter.
  o Output (with chunk_size=26, overlap=4):
    ["abcdefghijklmnopqrstuvwxyz", "wxyzabcdefg"] # Last 4 chars overlap
• Character Splitter:
  o Direct split at character 26, with 4-character overlap:
    ["abcdefghijklmnopqrstuvwxyz", "wxyzabcdefg"]

6. Key Differences
Feature             | RecursiveCharacterTextSplitter    | CharacterTextSplitter
Splitting Logic     | Hierarchical (paragraphs → words) | Fixed character counts
Preserves Semantics | Yes (avoids mid-sentence cuts)    | No (blind to meaning)
Best For            | Natural language (docs, articles) | Unstructured text (code, logs)
Overlap Handling    | Applied after semantic splitting  | Applied after character cutting

When to Use Each
• Use RecursiveCharacterTextSplitter for:
  o Documents, articles, or any text where context matters.
  o Example: Splitting a PDF lecture into coherent passages for a RAG system.
• Use CharacterTextSplitter for:
  o Machine-generated text (logs, code) or when speed is critical.
  o Example: Preprocessing raw data before tokenization.

Practical Note
• Typical Values in Production:
  o chunk_size=500-1500 (balance between context and LLM input limits).
  o chunk_overlap=10-20% of chunk_size (e.g., 100 chars for chunk_size=1000).
• Adjust based on your use case: Larger overlaps improve context but increase costs.

Explanation of LAB 2.
1. Program Overview
Program_02_chat.py is a Python script demonstrating document splitting techniques for AI and NLP pipelines,
specifically designed for Retrieval-Augmented Generation (RAG) systems. It showcases how to preprocess text data
by breaking it into optimally sized chunks while preserving semantic structure.

2. Core Components
A. Text Splitters
The program compares two primary splitting methods:
1. RecursiveCharacterTextSplitter
1. Purpose: Splits text hierarchically (paragraphs → sentences → words).
2. Advantage: Preserves logical structure, avoids mid-sentence breaks.
3. Use Case: Natural language (articles, PDFs, Markdown).
2. CharacterTextSplitter
1. Purpose: Splits text at fixed character counts.
2. Advantage: Fast and simple.
3. Use Case: Unstructured text (logs, code).
B. Key Parameters
• chunk_size: Maximum characters per chunk (e.g., 26 for testing, 1000 for production).
• chunk_overlap: Shared characters between chunks (e.g., 4 or 150) to maintain context.
• separators: Hierarchy for splitting (e.g., ["\n\n", "\n", " ", ""]).

3. Code Walkthrough
A. Setup & Imports
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
• Imports the two splitting strategies from LangChain.

B. Basic Splitting Tests
chunk_size = 26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
Initializes splitters with small chunk_size and overlap for testing edge cases.
• Test Cases:
  o text1 = "abcdefghijklmnopqrstuvwxyz" → No split (exact chunk_size).
  o text2 = "abcdefghijklmnopqrstuvwxyzabcdefg" → Splits with 4-character overlap.


C. Advanced Splitting
some_text = "When writing documents, writers use structure..."
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)
• Behavior:
  o Splits at \n\n (paragraphs).
  o Falls back to \n (lines), then sentences (\. ), then words ( ).
• Output: Chunks ≤150 chars, respecting semantic boundaries.
D. Real-World Examples
PDF Splitting
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150
)
docs = text_splitter.split_documents(pages) # 77 chunks
• Splits a PDF lecture into ~1000-character chunks with 150-character overlap.
Markdown/Notion Splitting
markdown_document = "# Title\n\n## Chapter 1\n\nHi this is Jim..."
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
md_header_splits = markdown_splitter.split_text(markdown_document)
• Preserves header metadata (e.g., {"Header 1": "Title"}) in chunks.

4. Technical Insights
Why Splitting Matters
• LLM Constraints: Models have limited context windows (e.g., 4K–32K tokens).
• RAG Optimization: Smaller chunks improve retrieval accuracy.
• Context Preservation: Overlap prevents information loss at chunk boundaries.
Separator Hierarchy
Priority order for RecursiveCharacterTextSplitter:
1. \n\n (paragraphs) → 2. \n (lines) → 3. \. (sentences) → 4. " " (words).
Token vs. Character Splitting
• TokenTextSplitter: Used when working directly with LLMs (e.g., chunking by GPT-4 tokens); see the sketch below.
• CharacterTextSplitter: Simpler but less precise.

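A minimal sketch of token-based splitting, assuming the tiktoken package is installed. Exact chunk boundaries depend on the tokenizer, so the output comment is illustrative; chunk_size=1 is used here (rather than the 10 shown earlier) to make the token boundaries visible:
from langchain.text_splitter import TokenTextSplitter  # uses tiktoken under the hood

# chunk_size counts tokens, not characters
token_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
print(token_splitter.split_text("foo bar bazzyfoo"))
# e.g. ['foo', ' bar', ' b', 'azzy', 'foo'] -- note 'bazzyfoo' spans several tokens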
5. Practical Applications
1. RAG Pipelines:
   o Chunked docs → embeddings → vector DB → retrieval.
2. Fine-Tuning Data Prep:
   o Splits long texts for LLM training.
3. Document Preprocessing:
   o Cleans PDFs, Markdown, or web-scraped content.

6. Output Examples
Recursive Splitter Output
["When writing documents, writers use...", "...structure to group ideas."]
• Chunks are semantically coherent.
Markdown Splitter Output
[
  {
    "content": "Hi this is Jim...",
    "metadata": {"Header 1": "Title", "Header 2": "Chapter 1"}
  }
]
• Retains document structure.

Key Takeaways
• RecursiveCharacterTextSplitter is ideal for natural language.
• Overlap (10–20%) is critical for context continuity.
• Metadata-aware splitting (e.g., Markdown) enhances RAG performance.
This script is a foundational tool for AI workflows involving large documents.

Lab 3: Vector Storage in Generative AI
A professional analysis of Program_03_chat.py, which demonstrates vector storage and semantic search for Retrieval-Augmented Generation (RAG) systems:

1. Core Purpose
This program transforms text documents into searchable vectors and implements semantic search using:
• Embeddings: Numerical representations of text meaning
• Vector Database (ChromaDB): Stores vectors for efficient retrieval
• Similarity Search: Finds relevant text chunks for queries

2. Key Components
A. Document Processing Pipeline
1. Loading & Splitting (reuses code from Program_01/02)
   o Loads duplicate and non-duplicate PDFs (MachineLearning-Lecture01.pdf x2, -02.pdf, -03.pdf)
   o Splits into chunks (1500 chars with 150-char overlap)
2. Embedding Generation
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
   o Uses OpenAI's embedding model (text-embedding-ada-002 by default)
   o Converts text → 1536-dimensional vectors (checked in the sketch below)
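A quick sanity check of that dimensionality, assuming a valid OPENAI_API_KEY is set:
vec = embedding.embed_query("i like dogs")
print(len(vec))  # 1536 dimensions for text-embedding-ada-002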
3. Vector Storage
from langchain.vectorstores import Chroma
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory='docs/chroma/'
)
   o Stores vectors in ChromaDB (local persistent storage)
   o Auto-handles duplicate documents (but doesn't deduplicate)
B. Semantic Search
question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question, k=3)
• Finds the top-3 most semantically similar chunks to the query
• Uses cosine similarity between query and document vectors

3. Technical Workflow
1. Embedding Demonstration
import numpy as np
sentence1 = "i like dogs"
sentence2 = "i like canines"               # any semantically similar sentence works
sentence3 = "the weather is ugly outside"  # any unrelated sentence works
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)
np.dot(embedding1, embedding2) # ≈0.96 (similar)
np.dot(embedding1, embedding3) # ≈0.77 (dissimilar)
   o Shows how semantic similarity translates to vector math
2. Search Examples
   o Successful case:
question = "is there an email i can ask for help"
# Returns chunks containing email addresses
   o Failure cases:
question = "what did they say about matlab?"
# Returns duplicate chunks (from Lecture01.pdf x2)
question = "what did they say about regression in the third lecture?"
# Returns irrelevant chunks (from Lectures 1/2)

4. Key Outputs
1. Vector Database
• Contains 209 document chunks (printed via vectordb._collection.count())
• Persisted to docs/chroma/ for reuse
2. Search Results
• Returns document chunks with:
  o page_content: Relevant text
  o metadata: Source PDF and page number (see the inspection sketch below)
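A minimal inspection sketch, assuming the vectordb built above (the metadata values shown in comments are illustrative):
docs = vectordb.similarity_search("is there an email i can ask for help", k=3)
for d in docs:
    print(d.metadata)              # e.g. {'source': '...Lecture01.pdf', 'page': ...}
    print(d.page_content[:100])    # first 100 characters of the matching chunk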

5. Failure Modes Identified
1. Duplicate Content
   o Caused by identical PDFs in source data
   o Solution: Deduplication pre-processing
2. Lecture-Specific Queries
   o Search returns irrelevant lecture chunks
   o Solution: Metadata filtering (shown in the next lesson)
3. Over-Retrieval
   o Returns too many similar chunks
   o Solution: Diversity-aware search (MMR)

6. Professional Insights
• Embedding Choice Matters: OpenAI embeddings work well but require API calls. Alternatives:
  o Local models (e.g., all-MiniLM-L6-v2)
  o Task-specific fine-tuning
• ChromaDB Advantages:
  o Lightweight
  o Persistent storage
  o Native LangChain integration
• Production Considerations:
# Recommended settings for production:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200, # 20% overlap
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)

7. Practical Applications
1. RAG Systems
   o Foundation for document Q&A bots
2. Knowledge Management
   o Enterprise document search
3. Lecture Analysis
   o Course content retrieval (as shown)

Summary
This program demonstrates a complete RAG preprocessing pipeline:
PDFs → Text Chunks → Embeddings → Vector DB → Semantic Search
With clear examples of both successful implementations and failure modes to address in subsequent lessons.

Lab 4: Sophisticated document retrieval in Generative AI
This program demonstrates sophisticated document retrieval methods for Retrieval-Augmented Generation (RAG)
pipelines, addressing key challenges like result diversity, metadata filtering, and content compression. Below is a
structured breakdown:

1. Core Components & Workflow


A. VectorDB Initialization
vectordb = Chroma(
    persist_directory='docs/chroma/',
    embedding_function=OpenAIEmbeddings()
)
• Purpose: Loads the pre-existing vector database (from Program_03) with OpenAI embeddings.
• Key Output: vectordb._collection.count() shows the number of stored document chunks.
B. Similarity Search Basics
question = "Tell me about all-white mushrooms with large fruiting bodies"
smalldb.similarity_search(question, k=2)
• k Significance:
  o Controls the number of top results returned (k=2 → top 2 matches).
  o Higher k = more results but potentially lower precision.
C. Maximum Marginal Relevance (MMR)
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)
• Purpose: Balances relevance and diversity in results.
• fetch_k: Number of candidates to consider before selecting the k diverse final results.
• Use Case: Avoids duplicate/overlapping chunks (e.g., redundant lecture snippets).
D. Metadata Filtering
vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)
• Why? Ensures results come from specific sources (e.g., "Lecture 3 only").
• Metadata Fields Used:
  o source: PDF file path.
  o page: Page number.
E. Self-Query Retriever
retriever = SelfQueryRetriever.from_llm(
    llm=OpenAI(model='gpt-3.5-turbo-instruct'),
    vectorstore=vectordb,
    document_contents="CS229 lecture notes",  # required short description of the corpus; wording here is illustrative
    metadata_field_info=[...]  # see the sketch below
)
• Key Feature: Uses an LLM to infer metadata filters from natural language queries.
• Example Query: "what did they say about regression in the third lecture?"
• Auto-extracted Filter: {"source": "MachineLearning-Lecture03.pdf"}.
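A hedged sketch of what metadata_field_info might contain, using LangChain's AttributeInfo helper (the descriptions are illustrative, written for the lecture PDFs used earlier):
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture PDF the chunk comes from, e.g. docs/cs229_lectures/MachineLearning-Lecture03.pdf",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page number within the lecture PDF",
        type="integer",
    ),
]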
F. Contextual Compression
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)  # llm: the OpenAI model used above
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)
• Purpose: Extracts only the relevant parts of retrieved documents (usage sketch below).
  o Reduces noise in LLM prompts.
  o Cuts API costs by removing irrelevant text.

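A one-line usage sketch, reusing a query from Lab 3 (the query text is illustrative):
compressed_docs = compression_retriever.get_relevant_documents("what did they say about matlab?")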
G. Alternative Retrievers (TF-IDF & SVM)
from langchain.retrievers import SVMRetriever, TFIDFRetriever

svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
• TF-IDF: Traditional keyword-based retrieval.
• SVM: Uses support vector machines for semantic search.
• Use Case: Lightweight alternatives to vector databases.

2. Key Techniques & Their Significance

Technique              | Problem Solved                           | Implementation
Similarity Search      | Basic relevance ranking                  | vectordb.similarity_search(question, k=3)
MMR                    | Redundant/overlapping results            | max_marginal_relevance_search(k=2, fetch_k=3)
Metadata Filtering     | Irrelevant sources (e.g., wrong lecture) | filter={"source": "Lecture03.pdf"}
Self-Query Retriever   | Natural language metadata extraction     | SelfQueryRetriever.from_llm(...)
Contextual Compression | Noisy/long documents                     | LLMChainExtractor + CompressionRetriever

3. Parameter Deep Dive: k and fetch_k

A. k (Number of Results)
• Role: Controls how many chunks are returned.
• Tradeoffs:
  o Small k (e.g., 2-3): High precision, but may miss relevant info.
  o Large k (e.g., 5-10): Broad coverage, but lower relevance.
B. fetch_k (MMR Candidate Pool)
• Role: Number of initial candidates before MMR diversifies results.
• Example: fetch_k=3, k=2 → MMR picks 2 diverse results from the top 3.
• Rule of Thumb: Set fetch_k ≥ 2*k (see the sketch below).

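A minimal sketch applying that rule of thumb, assuming the vectordb loaded in Section 1 and a question string as defined earlier:
# fetch 10 candidates, then let MMR pick the 5 most diverse (fetch_k = 2*k)
docs = vectordb.max_marginal_relevance_search(question, k=5, fetch_k=10)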
4. Practical Output Examples


Similarity Search
[
"A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.",
"The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body."
]
MMR Search
[
"A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.",
"A. phalloides, a.k.a Death Cap, is one of the most poisonous..." # More diverse
]
Compressed Results
Document 1:
"Amanita phalloides has all-white varieties with large fruiting bodies."
------------------------------------------------------------
Document 2:
"Death Cap mushrooms are poisonous."

5. Professional Recommendations
1. Production Settings:
   o Start with k=5, fetch_k=10 for MMR.
   o Use SelfQueryRetriever for natural language queries with metadata.
2. Cost Optimization:
   o Combine compression with MMR to reduce LLM token usage.
3. Fallback Strategies:
   o Use TF-IDF/SVM retrievers as backups for edge cases.

Summary
This program extends basic RAG retrieval with advanced filtering, diversity control, and noise reduction, making it
production-ready. The k parameter is central to balancing precision/recall in search results.
