LAB MANUAL OF GENERATIVE AI April - 4
Lab 1: Document loaders in Generative AI
LangChain provides about 80 types of document loaders.
Program 01: Document Loading
This file shows how to load different types of
documents for AI processing.
Key Concepts:
1. Retrieval Augmented Generation (RAG): AI
system that can look up information from
documents to answer questions
2. Document Loaders: Tools to import different file
types
Code Breakdown:
1. Setup:
o Imports necessary libraries and sets up OpenAI
API key
o load_dotenv() loads environment variables (like
API keys) from a .env file
2. PDF Loading:
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
o Loads a PDF file and splits it into pages
o Each page becomes a "Document" object with text content and metadata
3. YouTube Audio Loading:
loader = GenericLoader(
FileSystemBlobLoader(save_dir, glob="*.m4a"),
OpenAIWhisperParser()
)
o Uses OpenAI's Whisper model to transcribe audio files
o Can load from YouTube or local audio files
4. Web Page Loading:
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")
docs = loader.load()
o Fetches and loads content from a webpage
5. Notion Document Loading:
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
o Loads documents exported from Notion (a note-taking app)
Results:
Each loader creates Document objects containing the content and metadata
For PDFs: Returns page content and page numbers
For YouTube: Returns transcribed text
For web pages: Returns the text extracted from the page's HTML
For Notion: Returns Markdown content
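The shape of a loader's output can be pictured with a minimal sketch. The `Document` dataclass and the `load_text_pages` helper below are hypothetical stand-ins written for illustration, not LangChain's own classes, and the form-feed page delimiter is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a loader's output: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_pages(text, source):
    """Split raw text on form feeds (assumed page delimiter) into Documents."""
    pages = text.split("\f")
    return [
        Document(page_content=page, metadata={"source": source, "page": number})
        for number, page in enumerate(pages)
    ]


docs = load_text_pages("Welcome to CS229.\fToday we cover logistics.", "lecture01.txt")
for doc in docs:
    print(doc.metadata, doc.page_content[:30])
```

Every loader in this lab follows the same idea: whatever the source, the result is a list of objects that carry the text and a dictionary describing where it came from.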
Code:
#!/usr/bin/env python
# coding: utf-8
# # Document Loading
# ## Note to students.
# During periods of high load you may find the notebook unresponsive. It may appear to execute a cell,
# update the completion number in brackets [#] at the left of the cell but you may find the cell has not executed.
# This is particularly obvious on print statements when there is no output. If this happens, restart the kernel
# using the command under the Kernel tab.
# ## Retrieval augmented generation
# In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset
# as part of its execution.
# This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc).
# 
#! pip install langchain
# In[ ]:
import os
import openai
import sys
sys.path.append('../..')
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
# In Python, _ traditionally indicates a value we don't plan to use later
# Explanation of the .env file: a typical .env file might contain:
#OPENAI_API_KEY=sk-...your-api-key...
#DATABASE_URL=postgres://user:pass@localhost:5432/db
#After running the command, you can access these anywhere in your code:
#import os
#api_key = os.getenv("OPENAI_API_KEY")
#openai.api_key = os.environ['OPENAI_API_KEY']
# ## PDFs
# Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf)
# from Andrew Ng's famous CS229 course! These documents are the result of automated
# transcription so words and sentences are sometimes split unexpectedly.
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.
#! pip install pypdf
# In[ ]:
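from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
page = pages[0] # first page; each page is one Document
page.metadata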
# ## YouTube
from langchain.document_loaders.generic import GenericLoader, FileSystemBlobLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
# The terms above are explained after the end of this code.
# ! pip install yt_dlp
# ! pip install pydub
# **Note**: This can take several minutes to complete. This has been modified relative to the lesson video
# to fetch the video file locally.
# In[ ]:
url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="docs/youtube/"
loader = GenericLoader(
    #YoutubeAudioLoader([url],save_dir), # fetch from youtube
    FileSystemBlobLoader(save_dir, glob="*.m4a"), #fetch locally
    OpenAIWhisperParser()
)
docs = loader.load()
# In[ ]:
docs[0].page_content[0:500]
# ## URLs
# In[ ]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")
# > Note: the URL sent to the WebBaseLoader differs from the one shown in the video because it was updated for 2024.
docs = loader.load()
print(docs[0].page_content[:500])
# ## Notion
# Follow the steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion)
# for an example Notion site such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):
# * Duplicate the page into your own Notion space and export as `Markdown / CSV`.
# * Unzip it and save it as a folder that contains the markdown file for the Notion page.
# 
# In[ ]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
print(docs[0].page_content[0:200])
# In[ ]:
docs[0].metadata
The LangChain document loading system:
GenericLoader
Definition: A flexible document loader that combines a Blob Loader (which fetches raw data) with
a Parser (which processes the raw data into documents).
Purpose: Provides a unified way to load documents from different sources by pairing the right blob loader
with the right parser.
Example Use Case: Loading audio files from YouTube, parsing them into text using OpenAI's Whisper model.
FileSystemBlobLoader
Definition: A loader that fetches raw binary data (called "blobs") from files in a local directory.
Key Features:
Searches for files matching a glob pattern (e.g., *.m4a for audio files).
Returns raw data without parsing it (e.g., audio bytes, PDF bytes, etc.).
Purpose: Used when you already have files downloaded locally and want to process them.
Example Use Case: Loading locally saved .m4a audio files for transcription.
OpenAIWhisperParser
Definition: A parser that uses OpenAI's Whisper (speech-to-text model) to transcribe audio blobs into text
documents.
Key Features:
Converts audio (e.g., MP3, M4A) into text.
Handles long audio files by splitting them into chunks.
Purpose: Extract text content from audio/video files for further processing (e.g., summarization, Q&A).
Example Use Case: Transcribing a lecture recording into text for a chatbot to reference.
YoutubeAudioLoader
Definition: A blob loader that downloads audio from YouTube videos.
Key Features:
Takes YouTube URLs as input.
Downloads audio streams and saves them as local files (e.g., .m4a).
Purpose: Fetch audio content directly from YouTube links.
Example Use Case: Downloading a podcast episode from YouTube to analyze its content.
Term                  Role                Example Input         Output
GenericLoader         Coordinator         Blob Loader + Parser  Document(s)
FileSystemBlobLoader  Local file fetcher  *.m4a files           Raw audio blobs
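The division of labour between a blob loader, a parser, and the coordinating loader can be sketched in plain Python. This is an illustration of the pattern only: `file_blobs`, `text_parser`, and `generic_load` are hypothetical names, and the parser decodes text files instead of transcribing audio so that the sketch runs without an API key.

```python
import tempfile
from pathlib import Path


def file_blobs(directory, pattern):
    """Blob loader: yield (path, raw bytes) for every file matching the glob."""
    for path in sorted(Path(directory).glob(pattern)):
        yield path, path.read_bytes()


def text_parser(path, blob):
    """Hypothetical parser: decodes UTF-8 text instead of transcribing audio."""
    return {"page_content": blob.decode("utf-8"), "metadata": {"source": path.name}}


def generic_load(blobs, parser):
    """Coordinator: run each blob through the parser and collect the documents."""
    return [parser(path, blob) for path, blob in blobs]


with tempfile.TemporaryDirectory() as save_dir:
    Path(save_dir, "a.txt").write_text("first file", encoding="utf-8")
    Path(save_dir, "b.txt").write_text("second file", encoding="utf-8")
    Path(save_dir, "skip.log").write_text("ignored", encoding="utf-8")
    docs = generic_load(file_blobs(save_dir, "*.txt"), text_parser)

print(docs)
```

Swapping the blob source or the parser changes what is loaded and how it is interpreted without touching the coordinator, which is the point of pairing the two.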
Lab 2: Document loaders in Generative AI
Program 2: Data Preparation Toolkit Program.
This program demonstrates text splitting strategies for AI document processing, focusing on chunking
techniques used in Retrieval-Augmented Generation (RAG) pipelines. Below is a structured breakdown:
1. Core Purpose
Prepares raw text data (documents, web pages, PDFs, etc.) for AI models by:
Splitting long texts into smaller, semantically meaningful chunks.
Preserving context through overlap and hierarchical separators.
Handling edge cases (e.g., code, markdown, token limits).
2. Key Components
A. Text Splitters
Two primary classes are used:
1. RecursiveCharacterTextSplitter
a. Splits text recursively using separators (e.g., paragraphs, sentences, words).
b. Ideal for natural language (prose, articles).
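2. CharacterTextSplitter
a. Splits on a single separator (default "\n\n"), then merges the pieces up to chunk_size.
b. Simpler, but it ignores sentence boundaries, and a piece that contains no separator stays whole even if it exceeds chunk_size.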
B. Parameters
chunk_size: Max characters/tokens per chunk.
chunk_overlap: Shared text between chunks (preserves context).
separators: Hierarchy for splitting (e.g., ["\n\n", "\n", " ", ""]).
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150
)
docs = text_splitter.split_documents(pages) # 77 chunks from 1 PDF
Key Insight:
Splits PDF text at \n (newlines) with 150-char overlap.
Result: 77 chunks (~1000 chars each), optimized for LLM processing.
D. Token-Based Splitting
text_splitter = TokenTextSplitter(chunk_size=10)
text_splitter.split_text("foo bar bazzyfoo") # Splits by tokens (not chars)
Use Case: Keeps chunks within LLM context windows, which are measured in tokens rather than characters (e.g., GPT-4).
Note: 1 token ≈ 4 characters of English text (varies by language/model).
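To make the difference between counting characters and counting tokens concrete, here is a minimal sketch. It assumes whitespace-separated words as tokens, which is a simplification: TokenTextSplitter relies on a model tokenizer, so its real output differs. The function name `split_by_tokens` is hypothetical.

```python
def split_by_tokens(text, chunk_size, chunk_overlap=0):
    """Group tokens into chunks of at most chunk_size tokens each."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()  # assumption: whitespace tokens, not model tokens
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


print(split_by_tokens("foo bar bazzyfoo", chunk_size=1))
print(split_by_tokens("one two three four five", chunk_size=3, chunk_overlap=1))
```

The limit applies to the number of tokens, so chunks can differ widely in character length ("foo" versus "bazzyfoo").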
E. Structured Document Splitting (Markdown)
markdown_document = "# Title\n\n## Chapter 1\n\nHi this is Jim..."
markdown_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
md_header_splits = markdown_splitter.split_text(markdown_document)
Output: Chunks retain header metadata (e.g., {"Header 1": "Title"}).
Why? Preserves document hierarchy (critical for Notion/Confluence docs).
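How header metadata ends up attached to each chunk can be sketched without LangChain. The function below is a simplified illustration, not the library's implementation: `split_markdown_by_headers` is a hypothetical name, chunks are plain dictionaries, and the second chapter in the sample text is invented for the example.

```python
def split_markdown_by_headers(markdown, headers_to_split_on):
    """Group body text under the headers in effect and keep them as metadata."""
    # Try longer prefixes first so "##" is never mistaken for "#".
    prefixes = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    stack = []  # (level, name, title) for the headers currently in effect
    body, chunks = [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {name: title for _, name, title in stack}
            chunks.append({"page_content": text, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        match = next(((p, n) for p, n in prefixes if line.startswith(p + " ")), None)
        if match is None:
            body.append(line)
            continue
        flush()  # close the chunk that belonged to the previous header
        prefix, name = match
        level = len(prefix)
        # A new header replaces any header at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, name, line[level:].strip()))
    flush()
    return chunks


sample = "# Title\n\n## Chapter 1\n\nHi this is Jim\n\n## Chapter 2\n\nHi this is Molly"
splits = split_markdown_by_headers(sample, [("#", "Header 1"), ("##", "Header 2")])
for split in splits:
    print(split)
```

Each chunk carries the full header path above it, so a retriever can later filter or display results by section.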
4. Technical Insights
Why Recursive Splitting?
Mimics human reading: Breaks text into paragraphs → sentences → words.
Avoids mid-sentence cuts (better for embeddings/RAG).
Overlap Tradeoffs
Pros: Maintains context across chunks.
Cons: Increases compute/storage (duplicate text).
Separator Hierarchy
Order matters! Example priority:
1. \n\n (paragraphs)
2. \n (lines)
3. "\. " (sentences: a period followed by a space)
4. " " (words)
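The priority order can be illustrated with a simplified recursive splitter. This is a sketch of the idea under stated assumptions, not LangChain's implementation: it treats separators as literal strings, drops the separator at chunk boundaries, and omits overlap. The name `recursive_split` is hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph here.\n\nSecond one is a bit longer than the first."
result = recursive_split(text, chunk_size=30)
print(result)
```

The first paragraph survives intact because it fits; only the second paragraph is broken further, and then at word boundaries rather than mid-word.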
5. Practical Applications
1. RAG Pipelines: Chunked docs → embeddings → vector DB → retrieval.
2. Fine-Tuning: Prepares data for LLM training.
3. Preprocessing: Cleans messy inputs (PDFs, transcripts, web scrapes).
6. Output Summary
The script doesn’t print results by default, but running it would generate:
Chunked text (lists of strings meeting chunk_size).
Metadata-enriched splits (for markdown/Notion).
Tokenized segments (for LLM compatibility).
Key Takeaway
This program is a data preparation toolkit for AI workflows, ensuring text is split into digestible pieces while
preserving structure and context. The techniques shown are foundational for building production-grade RAG
systems.
The commands and the concept of chunk overlap:
1. Import Statements
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
Purpose: Imports two text-splitting tools from LangChain:
o RecursiveCharacterTextSplitter: Splits text hierarchically (paragraphs → lines → words → characters with the default separators).
o CharacterTextSplitter: Splits text on a single separator (default "\n\n") and merges the pieces up to chunk_size.
2. Chunk Parameters
chunk_size = 26 # Max characters per chunk
chunk_overlap = 4 # Shared characters between adjacent chunks
chunk_size: The target maximum length (in characters) for each text chunk.
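o Example: If set to 26, RecursiveCharacterTextSplitter keeps every chunk within 26 characters; CharacterTextSplitter can exceed it when a piece contains no separator.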
chunk_overlap: The number of characters shared between consecutive chunks.
o Example: With chunk_overlap=4, the last 4 characters of one chunk will repeat at the start of the next.
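The interaction of the two parameters can be sketched with a fixed-stride window. This is a simplified illustration that assumes cuts at exact character positions; the function name `chunk_with_overlap` is hypothetical and the LangChain splitters additionally respect separators.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already covers the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyzabcdefg", chunk_size=26, chunk_overlap=4)
print(chunks)  # ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

The second chunk starts at position 22 (26 - 4), so "wxyz" appears at the end of the first chunk and again at the start of the second.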
4. Splitter Initialization
A. Recursive Splitter
r_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap
)
Behavior:
1. Splits text hierarchically using default separators (["\n\n", "\n", " ", ""]).
2. Prioritizes natural breaks (paragraphs → lines → words → single characters).
3. Merges the resulting pieces back up to chunk_size, repeating up to chunk_overlap characters between neighbouring chunks.
Use Case: Ideal for prose (articles, docs) where semantic structure matters.
B. Character Splitter
c_splitter = CharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap
)
Behavior:
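1. Splits text on one separator only (default "\n\n"); it has no fallback separators.
2. Merges the pieces up to chunk_size with chunk_overlap. Text without the separator is returned as a single chunk, even if it is longer than chunk_size.
Use Case: Text with one consistent delimiter (e.g., newline-separated logs).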
5. Example Outputs
Input Text: "abcdefghijklmnopqrstuvwxyzabcdefg" (33 characters)
Recursive Splitter:
o Tries to split at natural breaks first. If none exist, it falls back to the last separator (""), which splits between individual characters.
o Output (with chunk_size=26, overlap=4):
["abcdefghijklmnopqrstuvwxyz", "wxyzabcdefg"] # Last 4 chars overlap
Character Splitter:
o The text contains no "\n\n" separator, so it is returned as one 33-character chunk:
["abcdefghijklmnopqrstuvwxyzabcdefg"]
6. Key Differences
Feature           RecursiveCharacterTextSplitter         CharacterTextSplitter
Best For          Natural language (docs, articles)      Text with one consistent delimiter (e.g., logs)
Overlap Handling  Applied after hierarchical splitting   Applied after single-separator splitting
Practical Note
Typical Values in Production:
o chunk_size=500-1500 (balance between context and LLM input limits).
o chunk_overlap=10-20% of chunk_size (e.g., 100 chars for chunk_size=1000).
Adjust based on your use case: Larger overlaps improve context but increase costs.
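The cost side of this tradeoff can be estimated with simple arithmetic. The sketch below assumes fixed-stride chunking at exact character positions, so real splitters that cut at separators will produce somewhat different counts; `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(n_chars, chunk_size, chunk_overlap):
    """Number of windows needed to cover n_chars with a fixed stride."""
    if n_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return math.ceil((n_chars - chunk_overlap) / step)


# Illustrative document of 100,000 characters with chunk_size=1000.
for overlap in (0, 100, 200):
    count = estimate_chunk_count(100_000, 1000, overlap)
    print(f"overlap={overlap}: {count} chunks to embed and store")
```

Under these assumptions, going from no overlap to 20% overlap raises the chunk count from 100 to 125, which is the extra embedding and storage cost the note refers to.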
Explanation of LAB 2.
1. Program Overview
Program_02_chat.py is a Python script demonstrating document splitting techniques for AI and NLP pipelines,
specifically designed for Retrieval-Augmented Generation (RAG) systems. It showcases how to preprocess text data
by breaking it into optimally sized chunks while preserving semantic structure.
2. Core Components
A. Text Splitters
The program compares two primary splitting methods:
1. RecursiveCharacterTextSplitter
1. Purpose: Splits text hierarchically (paragraphs → sentences → words).
2. Advantage: Preserves logical structure, avoids mid-sentence breaks.
3. Use Case: Natural language (articles, PDFs, Markdown).
2. CharacterTextSplitter
1. Purpose: Splits text on a single separator (default "\n\n") and merges the pieces up to chunk_size.
2. Advantage: Fast and simple.
3. Use Case: Text with one consistent delimiter (e.g., newline-separated logs).
B. Key Parameters
chunk_size: Maximum characters per chunk (e.g., 26 for testing, 1000 for production).
chunk_overlap: Shared characters between chunks (e.g., 4 or 150) to maintain context.
Separators: Hierarchy for splitting (e.g., ["\n\n", "\n", " ", ""]).
3. Code Walkthrough
A. Setup & Imports
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
Imports the two splitting strategies from LangChain.
B. Basic Splitting Tests
chunk_size = 26
chunk_overlap = 4
r_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap
)
Initializes splitters with small chunk_size and overlap for testing edge cases.
Test Cases:
text1 = "abcdefghijklmnopqrstuvwxyz" → No split (exact chunk_size).
text2 = "abcdefghijklmnopqrstuvwxyzabcdefg" → r_splitter produces two chunks with a 4-character overlap; c_splitter returns the text whole because it contains no separator.
C. Advanced Splitting
some_text = "When writing documents, writers use structure..."
r_splitter = RecursiveCharacterTextSplitter(
chunk_size=150,
separators=["\n\n", "\n", r"\. ", " ", ""] # r"\. " is a regex; recent LangChain versions only treat it as one with is_separator_regex=True
)
r_splitter.split_text(some_text)
Behavior:
Splits at \n\n (paragraphs).
Falls back to \n (lines), then sentences (\. ), then words ( ).
Output: Chunks ≤150 chars, respecting semantic boundaries.
D. Real-World Examples
PDF Splitting
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
text_splitter = CharacterTextSplitter(
separator="\n",
chunk_size=1000,
chunk_overlap=150
)
docs = text_splitter.split_documents(pages) # 77 chunks
Splits a PDF lecture into ~1000-character chunks with 150-character overlap.
Markdown/Notion Splitting
markdown_document = "# Title\n\n## Chapter 1\n\nHi this is Jim..."
markdown_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
md_header_splits = markdown_splitter.split_text(markdown_document)
Preserves header metadata (e.g., {"Header 1": "Title"}) in chunks.
4. Technical Insights
Why Splitting Matters
LLM Constraints: Models have limited context windows (e.g., 4K–32K tokens).
RAG Optimization: Smaller chunks improve retrieval accuracy.
Context Preservation: Overlap prevents information loss at chunk boundaries.
Separator Hierarchy
Priority order for RecursiveCharacterTextSplitter:
1. \n\n (paragraphs) → 2. \n (lines) → 3. "\. " (sentences) → 4. " " (words).
Token vs. Character Splitting
TokenTextSplitter: Used when working directly with LLMs (e.g., chunking by GPT-4 tokens).
CharacterTextSplitter: Simpler but less precise.
5. Practical Applications
1. RAG Pipelines:
1. Chunked docs → embeddings → vector DB → retrieval.
2. Fine-Tuning Data Prep:
1. Splits long texts for LLM training.
3. Document Preprocessing:
1. Cleans PDFs, Markdown, or web-scraped content.
6. Output Examples
Recursive Splitter Output
["When writing documents, writers use...", "...structure to group ideas."]
Chunks are semantically coherent.
Markdown Splitter Output
[
{
"page_content": "Hi this is Jim...",
"metadata": {"Header 1": "Title", "Header 2": "Chapter 1"}
}
]
Retains document structure.
Key Takeaways
RecursiveCharacterTextSplitter is ideal for natural language.
Overlap (10–20%) is critical for context continuity.
Metadata-aware splitting (e.g., Markdown) enhances RAG performance.
This script is a foundational tool for AI workflows involving large documents.
Lab 3: Vector Storage in Generative AI
This section analyzes Program_03_chat.py, which demonstrates vector storage and semantic search for RAG
(Retrieval-Augmented Generation) systems:
1. Core Purpose
This program transforms text documents into searchable vectors and implements semantic search using:
Embeddings: Numerical representations of text meaning
Vector Database (ChromaDB): Stores vectors for efficient retrieval
Similarity Search: Finds relevant text chunks for queries
2. Key Components
A. Document Processing Pipeline
1. Loading & Splitting (Reuses code from Program_01/02)
1. Loads four PDFs, including one duplicate (MachineLearning-Lecture01.pdf x2, MachineLearning-Lecture02.pdf, MachineLearning-Lecture03.pdf)
2. Splits into chunks (1500 chars with 150-char overlap)
2. Embedding Generation
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
1. Uses OpenAI's embedding model (text-embedding-ada-002 by default)
2. Converts text → 1536-dimensional vectors
3. Vector Storage
from langchain.vectorstores import Chroma
vectordb = Chroma.from_documents(
documents=splits,
embedding=embedding,
persist_directory='docs/chroma/'
)
1. Stores vectors in ChromaDB (local persistent storage)
2. Stores duplicate documents as separate entries (Chroma does not deduplicate them automatically)
B. Semantic Search
question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question, k=3)
Finds top-3 most semantically similar chunks to the query
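Ranks chunks by the distance between the query vector and each document vector. Chroma uses L2 distance by default; for unit-length vectors such as OpenAI embeddings this gives the same ranking as cosine similarity.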
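The search step can be sketched as a top-k ranking by cosine similarity. The three-dimensional vectors and chunk texts below are invented for illustration (real embeddings have 1536 dimensions and come from the embedding model), and the in-memory list stands in for the vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_search(query_vector, store, k=3):
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy store: (chunk text, made-up embedding).
store = [
    ("email the TA at the course address", [0.9, 0.1, 0.0]),
    ("matlab is used in the homework", [0.1, 0.9, 0.1]),
    ("office hours are on friday", [0.7, 0.2, 0.3]),
    ("linear regression fits a line", [0.0, 0.2, 0.9]),
]
query_vector = [1.0, 0.0, 0.1]  # made-up embedding of the question
top = similarity_search(query_vector, store, k=2)
print(top)
```

A vector database performs the same ranking, but uses an index so it does not have to compare the query against every stored vector.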
3. Technical Workflow
1. Embedding Demonstration
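import numpy as np
sentence1 = "i like dogs"
embedding1 = embedding.embed_query(sentence1) # Vector for sentence1
# embedding2 and embedding3 come from two further sentences embedded the same way (not shown here)
np.dot(embedding1, embedding2) # ≈0.96 (similar)
np.dot(embedding1, embedding3) # ≈0.77 (dissimilar)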
1. Shows how semantic similarity translates to vector math
2. Search Examples
1. Successful case:
question = "is there an email i can ask for help"
# Returns chunks containing email addresses
2. Failure cases:
question = "what did they say about matlab?"
# Returns duplicate chunks (from Lecture01.pdf x2)
question = "what did they say about regression in the third lecture?"
# Returns irrelevant chunks (from Lectures 1/2)
4. Key Outputs
1. Vector Database
Contains 209 document chunks (printed via vectordb._collection.count())
Persisted to docs/chroma/ for reuse
2. Search Results
Returns document chunks with:
1. page_content: Relevant text
2. metadata: Source PDF and page number
6. Professional Insights
Embedding Choice Matters: OpenAI embeddings work well but require API calls. Alternatives:
Local models (e.g., all-MiniLM-L6-v2)
Task-specific fine-tuning
ChromaDB Advantages:
Lightweight
Persistent storage
Native LangChain integration
Production Considerations:
# Recommended settings for production:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200, # 20% overlap
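separators=["\n\n", "\n", r"(?<=\. )", " ", ""] # the lookbehind is a regex; recent LangChain versions only treat it as one with is_separator_regex=True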
)
7. Practical Applications
1. RAG Systems
1. Foundation for document Q&A bots
2. Knowledge Management
1. Enterprise document search
3. Lecture Analysis
1. Course content retrieval (as shown)
Summary
This program demonstrates a complete RAG preprocessing pipeline:
PDFs → Text Chunks → Embeddings → Vector DB → Semantic Search
It includes clear examples of both successful searches and failure modes that later lessons address.
Lab 4: Sophisticated document retrieval in Generative AI
This program demonstrates sophisticated document retrieval methods for Retrieval-Augmented Generation (RAG)
pipelines, addressing key challenges like result diversity, metadata filtering, and content compression. Below is a
structured breakdown:
G. Alternative Retrievers (TF-IDF & SVM)
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
TF-IDF: Traditional keyword-based retrieval.
SVM: Ranks chunks with a support vector machine fitted on their embeddings (semantic retrieval without a vector database).
Use Case: Lightweight alternatives to vector databases.
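What keyword-based retrieval means in practice can be sketched with a small TF-IDF ranker. This is a simplified illustration, not the TFIDFRetriever implementation: it assumes whitespace tokenization and a smoothed IDF formula, and the sample chunks and function names are invented.

```python
import math
from collections import Counter


def build_tfidf(texts):
    """Return one {term: weight} vector per text plus the shared IDF table."""
    tokenized = [text.lower().split() for text in texts]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(texts)
    # Smoothed IDF: terms found in few documents get larger weights.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def sparse_cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_retrieve(query, texts, k=1):
    """Rank texts by TF-IDF cosine similarity to the query; return the top k."""
    vectors, idf = build_tfidf(texts)
    counts = Counter(query.lower().split())
    query_vec = {term: n * idf[term] for term, n in counts.items() if term in idf}
    ranked = sorted(
        range(len(texts)),
        key=lambda i: sparse_cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return [texts[i] for i in ranked[:k]]


sample_chunks = [
    "matlab is used for the homework",
    "email the teaching assistants for help",
    "regression fits a line to data",
]
print(tfidf_retrieve("who can i email for help", sample_chunks))
```

No embedding model is involved, so a query only matches chunks that share its exact words; that is the main limitation compared with embedding-based search.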
Technique               Problem addressed              Tool
MMR                     Redundant/overlapping results  max_marginal_relevance_search(k=2, fetch_k=3)
Contextual Compression  Noisy/long documents           LLMChainExtractor + ContextualCompressionRetriever
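The roles of k and fetch_k in MMR can be sketched as a greedy selection: first fetch the fetch_k chunks most similar to the query, then pick k of them one by one, penalising similarity to chunks already chosen. This is a simplified illustration with invented vectors and texts; the `lambda_mult` weighting and the name `mmr_select` are assumptions of the sketch.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidates, k=2, fetch_k=3, lambda_mult=0.5):
    """Pick k diverse results from the fetch_k candidates closest to the query."""
    pool = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:fetch_k]
    selected = []
    while pool and len(selected) < k:

        def score(candidate):
            relevance = cosine(query_vec, candidate[1])
            # Redundancy: similarity to the closest already-selected chunk.
            redundancy = max((cosine(candidate[1], s[1]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return [text for text, _ in selected]


# Toy data: two identical chunks (as with the duplicated PDF) and two others.
candidates = [
    ("matlab chunk (copy 1)", [0.8, 0.6, 0.0]),
    ("matlab chunk (copy 2)", [0.8, 0.6, 0.0]),
    ("octave as an alternative", [0.7, -0.7, 0.1]),
    ("unrelated logistics", [0.1, 0.2, 0.97]),
]
query_vec = [1.0, 0.0, 0.0]
picked = mmr_select(query_vec, candidates, k=2, fetch_k=3)
print(picked)
```

Plain similarity search with k=2 would return both copies of the same chunk; the redundancy penalty makes the second pick a different, still relevant chunk.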
5. Professional Recommendations
1. Production Settings:
1. Start with k=5, fetch_k=10 for MMR.
2. Use SelfQueryRetriever for natural language queries with metadata.
2. Cost Optimization:
1. Combine compression with MMR to reduce LLM token usage.
3. Fallback Strategies:
1. Use TF-IDF/SVM retrievers as backups for edge cases.
Summary
This program extends basic RAG retrieval with advanced filtering, diversity control, and noise reduction, making it
production-ready. The k parameter is central to balancing precision/recall in search results.