Gen AI / ML and Python
Introduction to Programming & Python Basics
What is Programming?
Programming is the process of writing instructions (code) that a computer can
execute to perform specific tasks.
It allows you to automate calculations, process data, build applications, and
solve real-world problems.
Why Python for AI/ML?
Python is a popular, beginner-friendly programming language known for its
simple syntax and readability
It has a vast ecosystem of libraries for AI and machine learning (e.g., NumPy,
Pandas, scikit-learn, TensorFlow, PyTorch).
Python is widely used in industry and academia for data science, AI, and
automation due to its flexibility and strong community support
Installing Python
Python may not be pre-installed on your computer (or only an older version
may be present), but installation is straightforward:
Download Python from the official website: python.org
Follow the installation instructions for your operating system (Windows,
macOS, Linux).
During installation, ensure you check the box to "Add Python to PATH" for
easy access from the command line
Using IDEs: Colab and Jupyter
IDEs (Integrated Development Environments) make coding easier by
providing features like syntax highlighting, code completion, and error
checking.
Jupyter Notebook:
A web-based tool for writing and running Python code in “cells.” Great for
experimentation and data analysis.
Install via pip install notebook or use JupyterLab.
Google Colab:
Free, cloud-based Jupyter notebooks. No installation needed—just sign in
with your Google account at colab.research.google.com.
Excellent for beginners and for running code on any device
Running Your First Python Script
Open your IDE or a terminal/command prompt.
Type the following code and run it:
print("Hello, World!")
This will display:
Hello, World!
In Jupyter or Colab, enter the code in a cell and press Shift+Enter to run it
Python Syntax
Python uses indentation (spaces or tabs) to define code blocks.
Statements end with a newline, not a semicolon.
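For example, the indented lines below form the body of the if block:

x = 3
if x > 0:
    print("positive")  # indented: inside the if block
print("done")          # not indented: always runs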
Variables
Variables store data values. You do not need to declare their type.
name = "Alice"
age = 20
Data Types
String: Text, e.g., "Hello"
Integer: Whole numbers, e.g., 5
Float: Decimal numbers, e.g., 3.14
Boolean: True or False
greeting = "Hi"
year = 2025
pi = 3.1415
is_student = True
Input and Output
Output: Use print() to display information.
print("Welcome to Python!")
Input: Use input() to get user input (always returns a string).
user_name = input("Enter your name: ")
print("Hello,", user_name)
Simple Arithmetic Operations
a = 10
b = 3
print(a + b) # Addition: 13
print(a - b) # Subtraction: 7
print(a * b) # Multiplication: 30
print(a / b) # Division: 3.333...
print(a // b) # Integer division: 3
print(a % b) # Modulus (remainder): 1
print(a ** b) # Exponentiation: 1000
Practice Exercise
Write a script that:
Asks for two numbers from the user,
Adds them,
Prints the result.
num1 = input("Enter first number: ")
num2 = input("Enter second number: ")
total = int(num1) + int(num2)  # avoid the name `sum`, which shadows the built-in
print("The sum is:", total)
Control Flow & Data Structures
1. Conditional Statements ( if , elif , else )
Conditional statements control the flow of execution based on conditions.
Basic if Statement
Executes a block if the condition is True :
x = 10
if x > 5:
print("x is greater than 5")
if-else Statement
Executes one block if the condition is True , another if False :
number = int(input("Enter a number: "))
if number > 0:
print("Positive number")
else:
print("Not a positive number")
If the user enters 10, the output is Positive number.
If the user enters 0, the output is Not a positive number.
if-elif-else Statement
Checks multiple conditions in sequence:
score = 85
if score >= 90:
print("Grade: A")
elif score >= 80:
print("Grade: B")
else:
print("Grade: C")
Only the first True condition's block executes.
Nested if Statements
You can nest if statements inside each other for complex logic.
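For example:

age = 20
has_id = True
if age >= 18:
    if has_id:
        print("Entry allowed")
    else:
        print("ID required")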
2. Loops ( for , while )
Loops are used to execute a block of code repeatedly.
For Loop
Used for iterating over a sequence (list, tuple, string, etc.):
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
print(fruit)
Iterates over each element in the sequence
Use Cases:
Processing items in a list or tuple
Repeating actions a fixed number of times
While Loop
Executes as long as a condition is True :
i = 1
while i < 6:
print(i)
i += 1
Useful when the number of iterations is not known in advance
Use Cases:
Waiting for user input
Running until a specific event occurs (e.g., guessing games, event loops)
Loop Control Statements
break : Exit the loop immediately.
continue : Skip the current iteration and continue with the next.
else : Optional; runs if the loop completes normally (not via break ).
Example:
for i in range(5):
if i == 3:
break
print(i)
else:
print("Loop finished")
The else block will not execute if the loop is exited with break
Session 2: Lists, Tuples, Sets, Dictionaries
1. Lists
Definition: Ordered, mutable collection. Allows duplicates.
Syntax: my_list = [1, 2, 3]
Indexing: Zero-based ( my_list[0] is 1 )
Basic Operations:
Add: my_list.append(4)
Remove: my_list.remove(2)
Access: my_list[1] (returns 2 )
Slice: my_list[1:3] (returns [2, 3] )
Iterate:
for item in my_list:
print(item)
Use Cases: Storing ordered data, dynamic collections
2. Tuples
Definition: Ordered, immutable collection. Allows duplicates.
Syntax: my_tuple = (1, 2, 3)
Indexing: Zero-based ( my_tuple[0] is 1 )
Basic Operations:
Access: my_tuple[1]
Count: my_tuple.count(2)
Index: my_tuple.index(3)
Iterate:
for item in my_tuple:
print(item)
Use Cases: Fixed data, function returns, keys in dictionaries
3. Sets
Definition: Unordered, mutable collection of unique elements.
Syntax: my_set = {1, 2, 3}
Basic Operations:
Add: my_set.add(4)
Remove: my_set.remove(2)
Membership: 2 in my_set
Iterate:
for item in my_set:
print(item)
Use Cases: Removing duplicates, set operations (union, intersection)
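For example:

a = {1, 2, 3}
b = {3, 4, 5}
print(a | b)           # Union: {1, 2, 3, 4, 5}
print(a & b)           # Intersection: {3}
print(set([1, 1, 2]))  # Duplicates removed: {1, 2}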
4. Dictionaries
Definition: Mutable collection of key-value pairs (unordered historically;
insertion-ordered as of Python 3.7).
Syntax: my_dict = {'a': 1, 'b': 2}
Indexing: By key ( my_dict['a'] returns 1 )
Basic Operations:
Add/Update: my_dict['c'] = 3
Remove: my_dict.pop('b')
Access: my_dict['a']
Iterate:
for key, value in my_dict.items():
print(key, value)
Use Cases: Fast lookups, mapping relationships
Comparison Table
| Feature | List | Tuple | Set | Dictionary |
| --- | --- | --- | --- | --- |
| Mutable | Yes | No | Yes | Yes |
| Ordered | Yes | Yes | No | Yes (3.7+) |
| Duplicates | Yes | Yes | No | Keys: No, Values: Yes |
| Indexing | Integer | Integer | No | Key-based |
| Syntax | [1, 2, 3] | (1, 2, 3) | {1, 2, 3} | {'a': 1, 'b': 2} |
Indexing and Iteration
Lists/Tuples: Use integer indices and slicing.
my_list = [10, 20, 30]
print(my_list[1]) # 20
for i, val in enumerate(my_list):
print(i, val)
Sets: No indexing; iterate directly.
Dictionaries: Iterate over keys, values, or items.
for key in my_dict:
print(key, my_dict[key])
for key, value in my_dict.items():
print(key, value)
Advanced Iteration: Use range() , enumerate() , or custom logic for cyclic or
indexed iteration
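For example, range() with len() gives indexed access, and the modulo operator gives a simple cyclic pass:

my_list = [10, 20, 30]
for i in range(len(my_list)):
    print(i, my_list[i])              # indexed iteration
for i in range(6):
    print(my_list[i % len(my_list)])  # cyclic iteration: 10, 20, 30, 10, 20, 30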
Functions, Modules, and File Handling
1. Defining and Using Functions
Function Definition:
Use the def keyword, followed by the function name, parentheses (with
optional parameters), and a colon. The function body is indented.
def greet():
print("Hello, World!")
greet() # Output: Hello, World!
Purpose:
Functions help organize code, promote reuse, and improve readability
2. Parameters and Arguments
Parameters:
Variables listed inside the parentheses in the function definition.
Arguments:
Values passed to the function when it is called.
Types of Parameters:
Positional: Order matters.
Keyword: Specify by name, order doesn't matter.
Default: Provide a default value.
Variable-length: Use *args to gather extra positional arguments into a tuple
and **kwargs to gather extra keyword arguments into a dictionary.
def add(a, b=5):
return a + b
print(add(3)) # Output: 8
print(add(3, 7)) # Output: 10
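Variable-length parameters in action:

def summarize(*args, **kwargs):
    print("Positional:", args)  # args is a tuple
    print("Keyword:", kwargs)   # kwargs is a dictionary

summarize(1, 2, mode="fast")
# Positional: (1, 2)
# Keyword: {'mode': 'fast'}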
3. Return Values
Returning Values:
Use the return statement to send a result back to the caller. If no return is
specified, the function returns None by default.
def square(x):
return x * x
result = square(4) # result is 16
Multiple Return Values:
Python allows returning multiple values as a tuple.
def stats(numbers):
return min(numbers), max(numbers)
mn, mx = stats([1, 2, 3])
# mn = 1, mx = 3
Returning Lists, Dictionaries, or Functions:
Functions can return any object, including lists, dictionaries, or even other
functions.
4. Scope in Python
Local Scope:
Variables defined inside a function are local and accessible only within that
function.
def foo():
x = 10 # local to foo
print(x)
foo()
# print(x) # Error: x is not defined
Global Scope:
Variables defined outside any function are global and accessible throughout
the file.
x = 20
def bar():
print(x) # accesses global x
bar()
Nonlocal Scope:
Used in nested functions to refer to variables in the enclosing function.
def outer():
x = "outer"
def inner():
nonlocal x
x = "inner"
inner()
print(x) # Output: inner
outer()
Variable Shadowing:
If a variable with the same name exists in both local and global scope, the
local variable takes precedence inside the function
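For example:

x = 5          # global
def show():
    x = 99     # local x shadows the global x
    print(x)   # Output: 99
show()
print(x)       # Output: 5 (the global is unchanged)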
5. Importing Modules and Using Standard Libraries
Importing Modules:
Use the import statement to bring in external code (modules).
import math
print(math.sqrt(16)) # Output: 4.0
Import Specific Functions:
from math import pi
print(pi) # Output: 3.141592653589793
Standard Libraries:
Python comes with a rich set of standard modules, such as math , random ,
datetime , os , and sys .
Custom Modules:
You can create your own modules by saving functions in a .py file and
importing them
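A minimal sketch, assuming a file named mymodule.py sits in the same folder as your script:

# mymodule.py
def greet(name):
    return f"Hello, {name}!"

# main.py
import mymodule
print(mymodule.greet("Alice"))  # Output: Hello, Alice!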
Reading from and Writing to Files
Opening Files:
Use the open() function with a filename and mode ( 'r' , 'w' , 'a' , etc.).
f = open('data.txt', 'r') # Open for reading
Reading Files:
read() : Reads the entire file.
readline() : Reads one line at a time.
readlines() : Reads all lines into a list.
with open('data.txt', 'r') as f:
content = f.read()
Writing Files:
'w' : Write (overwrites existing file or creates new).
'a' : Append to the end of the file.
with open('output.txt', 'w') as f:
f.write("Hello, file!")
Best Practice:
Use the with statement to automatically close files.
| Mode | Description |
| --- | --- |
| 'r' | Read (default) |
| 'w' | Write (overwrite/create) |
| 'a' | Append (end of file) |
| 'r+' | Read and write |
| 'w+' | Write and read (overwrite) |
| 'a+' | Append and read |
String Operations
Common String Methods:
len(s) : Length of string
s.upper() , s.lower() : Change case
s.strip() : Remove whitespace
s.replace('a', 'b') : Replace substring
s.split(',') : Split into list
','.join(list) : Join list into string
s.find('sub') : Find substring index
s.isdigit() , s.isalpha() : Check content type
text = " Hello, World! "
print(text.strip().upper())
# Output: HELLO, WORLD!
Formatting Strings:
f-strings: f"Value: {x}"
str.format() : "Value: {}".format(x)
Simple File-Based Mini Project: Word Counter
Objective:
Read a text file, count the frequency of each word, and write the results to a new
file.
Steps:
1. Read the file content.
2. Clean and split the text into words.
3. Count word occurrences.
4. Write the results to an output file.
Sample Code:
def count_words(input_file, output_file):
with open(input_file, 'r') as f:
text = f.read().lower()
words = text.split()
word_count = {}
for word in words:
word = word.strip('.,!?";:')
word_count[word] = word_count.get(word, 0) + 1
with open(output_file, 'w') as f:
for word, count in sorted(word_count.items()):
f.write(f"{word}: {count}\n")
count_words('input.txt', 'word_count.txt')
This project demonstrates file reading/writing, string manipulation, and
dictionary usage.
Python for Data Science
What is NumPy?
NumPy stands for Numerical Python and is a foundational library for
numerical and scientific computing in Python.
It provides a powerful n-dimensional array object and useful functions for
performing mathematical operations efficiently.
Commonly used for: data analysis, scientific computing, and as the base for
other libraries like Pandas and SciPy.
Creating and Working with NumPy Arrays
a. Creating Arrays
From lists:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
Multi-dimensional arrays:
b = np.array([[1, 2, 3], [4, 5, 6]])
Arrays filled with zeros or ones:
zeros = np.zeros(3) # array([0., 0., 0.])
ones = np.ones(3) # array([1., 1., 1.])
Arrays with a range of numbers:
arr = np.arange(0, 10, 2) # array([0, 2, 4, 6, 8])
linspace = np.linspace(0, 1, 5) # array([0. , 0.25, 0.5 , 0.75, 1. ])
Specify data type:
arr = np.ones(3, dtype=np.int64)
b. Indexing and Slicing
Access elements by index (0-based):
print(a[0]) # 1
Slicing:
print(a[:3]) # array([1, 2, 3])
c. Basic Operations
Element-wise operations:
data = np.array([1, 2])
ones = np.ones(2, dtype=int)
print(data + ones) # array([2, 3])
print(data * data) # array([1, 4])
Aggregations:
a = np.array([1, 2, 3, 4])
print(a.sum()) # 10
b = np.array([[1, 1], [2, 2]])
print(b.sum(axis=0)) # array([3, 3]) (column sums)
print(b.sum(axis=1)) # array([2, 4]) (row sums)
Reshaping:
c = np.arange(12).reshape(3, 4)
print(c)
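# Output:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]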
What is Pandas?
Pandas is a Python library built on top of NumPy, designed for data
manipulation and analysis.
It introduces two main data structures:
Series: 1D labeled array
DataFrame: 2D labeled, tabular data structure (like an Excel spreadsheet)
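A quick Series example:

import pandas as pd
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])  # 20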
Creating and Working with Pandas DataFrames
a. Creating DataFrames
From a dictionary:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
From a NumPy array:
arr = np.array([[1, 2], [3, 4]])
df2 = pd.DataFrame(arr, columns=['A', 'B'])
b. Accessing Data
Select a column:
print(df['Name'])
Select rows by index:
print(df.iloc[0]) # first row
print(df.loc[0]) # by label/index
Slicing rows:
print(df[0:2]) # first two rows
c. Basic Data Manipulation
Add a new column:
df['Salary'] = [50000, 60000, 70000]
Filter rows:
adults = df[df['Age'] > 28]
Drop a column:
df = df.drop('Salary', axis=1)
Handle missing values:
df.isnull()
df.fillna(0)
df.dropna()
Aggregation:
df['Age'].mean()
df.groupby('Name').sum()
Sorting:
df.sort_values('Age')
Practice Exercise Examples
Create a NumPy array of numbers from 10 to 19.
Add two NumPy arrays element-wise.
Create a Pandas DataFrame from a list of dictionaries.
Select all rows in the DataFrame where Age > 30.
Calculate the sum and mean of a DataFrame column.
Replace missing values in a DataFrame with the column mean.
Sort the DataFrame by a specific column.
Handle the missing data {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan]}
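A possible solution for the last exercise, filling each NaN with its column's mean:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan]})
print(df.fillna(df.mean()))  # A's NaN becomes 1.5, B's NaNs become 5.0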
Introduction to Data Cleaning
Definition: Data cleaning is the process of fixing or removing incorrect,
corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset
Why it matters: Clean data leads to more accurate analysis and insights.
Messy data can cause errors, misleading results, or make analysis impossible.
Common Data Cleaning Steps:
Remove duplicate or irrelevant data (e.g., repeated rows, out-of-scope
entries).
Fix structural errors (e.g., typos, inconsistent capitalization, mixed formats).
Handle missing values (e.g., fill with mean/median, remove rows/columns).
Standardize data (e.g., consistent date formats, units, text casing).
Example:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'bob'],
'Score': [85, 90, 85, None, 90]
})
df = df.drop_duplicates()
df['Name'] = df['Name'].str.capitalize()
df['Score'] = df['Score'].fillna(df['Score'].mean())
print(df)
Data Filtering
Definition: Filtering means selecting rows that meet certain conditions,
helping you focus on relevant data
How to filter: Use boolean indexing in Pandas.
Example:
# Filter students with Score above 85
filtered_df = df[df['Score'] > 85]
print(filtered_df)
Combine multiple conditions using & (and) or | (or):
# Students named 'Bob' with Score above 85
filtered_df = df[(df['Name'] == 'Bob') & (df['Score'] > 85)]
print(filtered_df)
Data Sorting
Definition: Sorting arranges your data by the values in one or more columns,
making it easier to spot patterns or outliers.
How to sort: Use .sort_values() in Pandas.
Example:
# Sort by Score in descending order
sorted_df = df.sort_values('Score', ascending=False)
print(sorted_df)
Sort by multiple columns:
# Sort by Name (A-Z), then by Score (high to low)
sorted_df = df.sort_values(['Name', 'Score'], ascending=[True, False])
print(sorted_df)
Simple Data Analysis Project
Project Idea:
Analyze a small dataset (e.g., students and scores, product sales, or Titanic
dataset) using the above techniques.
Project Steps:
1. Load the data (from a CSV or dictionary).
2. Clean the data (remove duplicates, fix names, handle missing values).
3. Filter the data (e.g., select students with high scores, products with sales
above a threshold).
4. Sort the data (e.g., by score, by product price).
5. Draw simple conclusions (e.g., who has the highest score? How many
products sold more than 10 units?).
Example:
import pandas as pd
# 1. Load data
data = {
'Student': ['Alice', 'Bob', 'Alice', 'Charlie', 'David'],
'Score': [85, 90, 85, None, 75]
}
df = pd.DataFrame(data)
# 2. Clean data
df = df.drop_duplicates()
df['Score'] = df['Score'].fillna(df['Score'].mean())
# 3. Filter: Scores above 80
high_scores = df[df['Score'] > 80]
# 4. Sort: By Score descending
sorted_scores = high_scores.sort_values('Score', ascending=False)
# 5. Analyze: Highest scorer
top_student = sorted_scores.iloc[0]
print("Cleaned Data:\n", df)
print("High Scores:\n", high_scores)
print("Sorted High Scores:\n", sorted_scores)
print("Top Student:\n", top_student)
In-Class Activity
Give students a small CSV or dictionary-based dataset.
Ask them to:
Remove duplicates
Fill missing values
Filter for a specific condition (e.g., scores above a threshold)
Sort the results
Print the top result
Data Visualization & Project
Why Data Visualization?
Data visualization helps you understand data, spot trends, and communicate
insights effectively.
Python’s most popular libraries for visualization are Matplotlib and Seaborn
Introduction to Matplotlib
Matplotlib is a foundational plotting library in Python, offering flexibility to
create a wide variety of static, animated, and interactive plots
It’s often imported as import matplotlib.pyplot as plt .
Basic Line Plot Example
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y)
plt.title('Simple Line Plot')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.show()
You can customize colors, line styles, and add titles/labels easily
Other Basic Plots in Matplotlib
Bar Chart:
students = ["Alice", "Bob", "Charlie"]
scores = [85, 90, 78]
plt.bar(students, scores, color='skyblue')
plt.title("Student Scores")
plt.xlabel("Student")
plt.ylabel("Score")
plt.show()
Pie Chart:
labels = ["Python", "Java", "C++"]
sizes = [50, 30, 20]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Programming Language Popularity")
plt.show()
Scatter Plot:
x = [1, 2, 3, 4, 5]
y = [5, 7, 4, 6, 5]
plt.scatter(x, y)
plt.title("Scatter Plot Example")
plt.xlabel("X Value")
plt.ylabel("Y Value")
plt.show()
Introduction to Seaborn
Seaborn is built on top of Matplotlib and provides a higher-level, more user-
friendly interface for creating attractive statistical graphics
It works seamlessly with Pandas DataFrames and comes with better default
styles and color palettes
Getting Started with Seaborn
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme() # Apply Seaborn's default styling
Basic Seaborn Plots
Histogram:
import seaborn as sns
import matplotlib.pyplot as plt
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
sns.histplot(data)
plt.title("Histogram Example")
plt.show()
Scatter Plot:
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.title("Total Bill vs Tip")
plt.show()
Line Plot:
import seaborn as sns
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
sns.lineplot(x="timepoint", y="signal", data=fmri)
plt.title("FMRI Signal Over Time")
plt.show()
Matplotlib vs. Seaborn: When to Use Which?
Matplotlib: More control and customization, suitable for publication-quality
graphics and unique plot types
Seaborn: Simpler code for statistical plots, better default styles, and works
well for quick exploratory analysis
Hands-On Practice
Exercise Ideas:
Plot a bar chart of your favorite fruits and their quantities.
Visualize random numbers as a histogram using both Matplotlib and Seaborn (see the sketch below).
Use Seaborn to plot a scatter plot from the built-in "tips" dataset.
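A sketch for the histogram exercise:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = np.random.randn(1000)  # 1000 samples from a standard normal distribution

plt.hist(data, bins=30)  # Matplotlib version
plt.title("Matplotlib Histogram")
plt.show()

sns.histplot(data, bins=30)  # Seaborn version
plt.title("Seaborn Histogram")
plt.show()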
Introduction to Machine Learning
What is Machine Learning (ML)?
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables
computers to learn from data and make predictions or decisions without being
explicitly programmed for each task.
ML algorithms identify patterns in data, learn from past experiences, and
improve their performance over time with minimal human intervention
The process involves feeding large amounts of data to algorithms, which then
optimize their internal parameters to minimize errors and make accurate
predictions or classifications
Types of Machine Learning
Supervised Learning
Definition: The algorithm is trained on labeled data, meaning each input
comes with a known output.
Goal: Learn a mapping from inputs to outputs so it can predict the output for
new, unseen data.
Common Algorithms: Linear regression, logistic regression, decision trees,
support vector machines, neural networks.
Use Cases: Email spam detection, image classification, credit scoring, medical
diagnosis
Unsupervised Learning
Definition: The algorithm is given data without explicit labels and must find
patterns or groupings on its own.
Goal: Discover hidden structures or relationships in the data.
Common Algorithms: K-means clustering, hierarchical clustering, principal
component analysis (PCA), association rule learning.
Use Cases: Customer segmentation, anomaly detection, market basket
analysis, dimensionality reduction
Reinforcement Learning
Definition: An agent learns to make decisions by interacting with an
environment, receiving rewards or penalties for actions.
Goal: Maximize cumulative reward over time.
Use Cases: Robotics, game playing, recommendation systems, autonomous
vehicles
Machine Learning Workflow
A typical ML workflow consists of several key stages:
1. Problem Definition: Clearly define the business or research problem and
success criteria.
2. Data Collection: Gather relevant and high-quality data from various sources.
3. Data Preparation: Clean, preprocess, and transform data (handle missing
values, encode categories, normalize features).
4. Exploratory Data Analysis (EDA): Analyze data to understand distributions,
relationships, and potential issues.
5. Model Selection: Choose appropriate algorithms based on the problem
(classification, regression, clustering, etc.).
6. Model Training: Fit the model to the training data, adjusting parameters to
minimize errors.
7. Model Evaluation: Assess model performance using metrics (accuracy,
precision, recall, RMSE, etc.) on validation/test data.
8. Model Tuning: Optimize hyperparameters to improve performance.
9. Deployment: Integrate the trained model into production systems for real-
world use.
10. Monitoring and Maintenance: Continuously monitor model performance and
retrain as needed.
Real-World Applications of Machine Learning
Healthcare: Disease prediction, medical image analysis, drug discovery.
Finance: Fraud detection, credit scoring, algorithmic trading.
Retail: Recommendation systems, customer segmentation, inventory
management.
Transportation: Self-driving cars, route optimization, demand forecasting.
Natural Language Processing: Chatbots, sentiment analysis, language
translation.
Manufacturing: Predictive maintenance, quality control, supply chain
optimization.
Introduction to Scikit-learn and Building Your First ML Model
What is Scikit-learn?
Scikit-learn (sklearn) is a popular open-source Python library for machine
learning.
It provides simple and efficient tools for data mining and data analysis,
supporting both supervised and unsupervised learning.
Built on top of NumPy, SciPy, and Matplotlib, it offers a consistent interface for
a wide range of algorithms, including classification, regression, clustering, and
dimensionality reduction.
Scikit-learn is widely used in industry and academia due to its ease of use,
extensive documentation, and active community support.
Key Features
Ready-to-use algorithms for classification, regression, clustering, and more.
Tools for data preprocessing, model selection, and evaluation.
Built-in datasets for practice (e.g., Iris, Digits, California Housing).
Integration with other Python libraries for data science workflows.
Building Your First ML Model with Scikit-learn
Example: Classification with the Iris Dataset
Step-by-Step Process:
1. Import Libraries and Load Data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
2. Split Data into Training and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
3. Initialize and Train the Model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
4. Make Predictions
y_pred = model.predict(X_test)
5. Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Interpretation: The model predicts the species of iris flowers based on
features like petal and sepal length/width. The accuracy score indicates how
well the model performs on unseen data.
Example: Simple Regression
Using the California Housing dataset (for regression tasks). Note: the older
Boston Housing dataset was removed from scikit-learn in version 1.2, so we use
fetch_california_housing instead:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
reg = LinearRegression()
reg.fit(X_train, y_train)
# Predict and evaluate
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
Interpretation: The model predicts median house values from features like
median income, average rooms, and location. The mean squared error (MSE)
measures the average squared prediction error for regression tasks.
Key Takeaways:
Machine learning enables systems to learn from data and make predictions.
Supervised and unsupervised learning are the two main types, each with
distinct use cases.
The ML workflow involves data preparation, model training, evaluation, and
deployment.
Scikit-learn simplifies building, evaluating, and deploying ML models in
Python, making it accessible for beginners and professionals alike.
Supervised Learning in Practice
Linear Regression — Theory, Implementation, Evaluation Metrics
1. Theory of Linear Regression
Definition:
Linear regression models the relationship between one or more independent
variables (predictors) and a continuous dependent variable by fitting a linear
equation to observed data.
Mathematical Model:
For simple linear regression (one predictor), the model is:
$y = \beta_0 + \beta_1 x + \epsilon$
where:
$y$ = dependent variable
$x$ = independent variable
$\beta_0$ = intercept (value of $y$ when $x = 0$)
$\beta_1$ = slope (change in $y$ per unit change in $x$)
$\epsilon$ = error term (difference between observed and predicted values)
Goal:
Find $\beta_0$ and $\beta_1$ that minimize the sum of squared residuals
(differences between observed and predicted $y$); this is called the
Ordinary Least Squares (OLS) method.
Multiple Linear Regression:
Extends to multiple predictors:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$
2. Implementation of Linear Regression in Python
a) Manual Calculation (Using NumPy)
Calculate the means of $x$ and $y$.
Compute the slope ($\beta_1$) and intercept ($\beta_0$) using the formulas:
$\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$\beta_0 = \bar{y} - \beta_1 \bar{x}$
Predict values using:
$\hat{y} = \beta_0 + \beta_1 x$
Example:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
x_mean = np.mean(x)
y_mean = np.mean(y)
B1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
B0 = y_mean - B1 * x_mean
y_pred = B0 + B1 * x
print(f"Slope: {B1}, Intercept: {B0}")
print("Predicted values:", y_pred)
b) Using Scikit-learn
Import LinearRegression from sklearn.linear_model .
Fit the model on training data.
Predict and evaluate.
Example:
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [3], [4], [5]]) # 2D array for sklearn
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression()
model.fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)
y_pred = model.predict(X)
print("Predictions:", y_pred)
c) Visualization with Matplotlib and SciPy
Use scipy.stats.linregress to get slope, intercept, and statistical measures.
Plot scatter and regression line.
Example:
import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
def predict(x):
return slope * x + intercept
y_pred = list(map(predict, x))
plt.scatter(x, y)
plt.plot(x, y_pred, color='red')
plt.show()
3. Evaluation Metrics for Linear Regression
Mean Squared Error (MSE): Average squared difference between actual and
predicted values.
Root Mean Squared Error (RMSE): Square root of MSE; interpretable in
original units.
Mean Absolute Error (MAE): Average absolute difference.
R-squared ($R^2$): Proportion of variance in the dependent variable explained
by the model; ranges from 0 to 1.
Using Scikit-learn Metrics:
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"MSE: {mse}")
print(f"R-squared: {r2}")
Logistic Regression
Purpose:
Used for binary classification problems (output is categorical: 0 or 1).
Theory:
Instead of predicting continuous values, logistic regression predicts the
probability that an input belongs to a class using the logistic (sigmoid)
function:
$\sigma(z) = \frac{1}{1 + e^{-z}}$
where $z = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$.
Output:
A probability between 0 and 1, which is thresholded (commonly at 0.5) to
assign class labels.
Use Cases:
Spam detection, disease diagnosis, customer churn prediction.
Decision Trees
Definition:
A tree-like model of decisions that splits data based on feature values to
classify or predict outcomes.
How it Works:
The tree splits nodes by selecting the feature and threshold that best separate
classes (using criteria like Gini impurity or entropy).
Advantages:
Easy to interpret, handles both numerical and categorical data, non-linear
relationships.
Limitations:
Can overfit, sensitive to small data changes.
Hands-on with Scikit-learn: Logistic Regression and Decision
Trees
Logistic Regression Example (Iris Dataset)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# For binary classification, select two classes
X = X[y != 2]
y = y[y != 2]
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Decision Tree Example (Iris Dataset)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
# Using same train/test split as above
# Train decision tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
# Predict and evaluate
y_pred_tree = tree.predict(X_test)
print(classification_report(y_test, y_pred_tree))
Clustering (K-Means), Dimensionality Reduction (PCA), and
Hands-on Examples
K-Means Clustering: Theory and Algorithm
What is K-Means?
K-Means is an unsupervised learning algorithm used to partition data into K
distinct clusters based on feature similarity.
How It Works:
1. Initialization: Randomly select K centroids (cluster centers).
2. Expectation Step: Assign each data point to the nearest centroid based on
Euclidean distance.
3. Maximization Step: Update centroids by calculating the mean of all points
assigned to each cluster.
4. Repeat steps 2 and 3 until centroids stabilize (no change in assignments).
Objective:
Minimize the sum of squared errors (SSE) — the sum of squared distances
between points and their cluster centroids.
Challenges:
Choosing the right K (number of clusters).
Sensitivity to centroid initialization (can lead to different results).
Non-deterministic; often run multiple times with different initializations.
Elbow Method:
Plot SSE against different values of K to find the "elbow" point where adding
more clusters yields diminishing returns.
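A sketch of the elbow method on the Iris features:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = load_iris().data
sse = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)  # inertia_ is the SSE for this value of K

plt.plot(ks, sse, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('SSE (inertia)')
plt.title('Elbow Method')
plt.show()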
Dimensionality Reduction: Principal Component Analysis (PCA)
Purpose:
Reduce the number of features (dimensions) in a dataset while preserving as
much variance (information) as possible.
How PCA Works:
Computes new orthogonal axes (principal components) that capture
maximum variance.
First principal component captures the most variance, second is
orthogonal and captures the next most, and so forth.
Data is projected onto these components, reducing dimensionality.
Benefits:
Simplifies visualization (e.g., 2D or 3D plots).
Reduces noise and computational cost.
Helps avoid the “curse of dimensionality” in machine learning.
Hands-on Example: K-Means Clustering with PCA in Python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Load dataset
data = load_iris()
X = data.data
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Reduce dimensions to 2 for visualization
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=50, max_iter=500, random_state=42)
clusters = kmeans.fit_predict(X_pca)
# Plot clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clustering with PCA')
plt.show()
Model Evaluation Metrics
1. Accuracy
Proportion of correct predictions (both true positives and true negatives) over
total predictions.
Best for balanced datasets.
2. Precision
Measures how many predicted positives are actually positive.
$\text{Precision} = \frac{TP}{TP + FP}$
3. Recall (Sensitivity)
Measures how many actual positives were correctly identified.
$\text{Recall} = \frac{TP}{TP + FN}$
4. F1-Score
Harmonic mean of precision and recall, balancing both.
$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
5. Confusion Matrix
| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Shows counts of true/false positives and negatives.
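In scikit-learn (rows are actual classes, columns are predicted, with class 0 first):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(confusion_matrix(y_true, y_pred))
# [[2 0]
#  [1 3]]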
6. Cross-Validation
Technique to assess model generalization by splitting data into multiple
train/test folds.
Common method: k-fold cross-validation, where data is divided into k subsets;
each subset is used once as test data while the others train the model.
Helps avoid overfitting and provides robust performance estimates.
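A minimal k-fold example with scikit-learn:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)  # 5-fold CV
print(scores.mean())  # average accuracy across the 5 folds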
Introduction to Deep Learning and Neural Networks
What is Deep Learning?
A subset of machine learning that uses artificial neural networks with many
layers (deep architectures) to model complex patterns in data.
Excels at tasks like image recognition, natural language processing, and
speech recognition.
Learns hierarchical feature representations automatically.
Neural Network Basics
Neuron:
Basic computational unit that receives inputs, applies weights, adds bias, and
passes the result through an activation function.
Weights and Biases:
Parameters learned during training that determine the importance of inputs.
Activation Functions:
Introduce non-linearity; common types include:
Sigmoid: Outputs values between 0 and 1.
ReLU (Rectified Linear Unit): Outputs zero for negative inputs, linear for
positive.
Tanh: Outputs values between -1 and 1.
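These activation functions take only a few lines of NumPy (a sketch for intuition):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # squashes to (0, 1)

def relu(z):
    return np.maximum(0, z)      # zero for negatives, identity for positives

def tanh(z):
    return np.tanh(z)            # squashes to (-1, 1)

print(sigmoid(0), relu(-2.0), tanh(1.0))  # 0.5 0.0 0.7615...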
Layers:
Input layer: Receives raw data.
Hidden layers: Perform transformations.
Output layer: Produces final prediction.
Building a Simple Neural Network with TensorFlow/Keras: Digit
Recognition (MNIST)
Step-by-step example:
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
# Load dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Preprocess data
X_train = X_train.reshape(-1, 28*28).astype('float32') / 255
X_test = X_test.reshape(-1, 28*28).astype('float32') / 255
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
# Build model
model = models.Sequential([
layers.Dense(128, activation='relu', input_shape=(784,)),
layers.Dense(64, activation='relu'),
layers.Dense(10, activation='softmax')
])
# Compile model
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
# Train model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
# Evaluate model
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")
Explanation:
Flatten 28x28 images into 784-dimensional vectors.
Two hidden layers with ReLU activation.
Output layer with softmax for multi-class classification (digits 0-9).
Use categorical_crossentropy loss and adam optimizer.
Introduction to Generative AI
What is Generative AI?
Generative AI is a branch of artificial intelligence focused on creating new, original
content—such as text, images, music, or code—by learning patterns from large
datasets. Unlike traditional AI, which is designed to analyze data and make
predictions or decisions based on predefined rules, generative AI can produce
outputs that did not previously exist, mimicking creativity and innovation.
Generative AI vs. Traditional AI
| Aspect | Traditional AI | Generative AI |
| --- | --- | --- |
| Core Function | Analyzes data, makes predictions, classifies | Creates new content (text, images, code, music, etc.) |
| Approach | Rule-based, pattern recognition | Pattern creation, self-learning from data |
| Output | Predictions, classifications, recommendations | Original content (stories, images, code, etc.) |
| Examples | Spam filters, recommendation systems, chatbots | ChatGPT, DALL·E, GitHub Copilot, music generators |
| Data Type | Structured data | Structured & unstructured data |
| Creativity | Limited to defined tasks | Capable of producing creative, novel outputs |
Key Difference:
Traditional AI is reactive and task-oriented, excelling at analyzing and predicting
within set boundaries. Generative AI is proactive, capable of producing new,
creative content by learning from existing data.
Key Applications of Generative AI
Text Generation: Chatbots (ChatGPT), content writing, translation,
summarization.
Image Generation: Creating artwork (DALL·E, Midjourney), photo editing, style
transfer.
Code Generation: Writing and completing code (GitHub Copilot, OpenAI
Codex).
Audio & Music: Composing music, generating synthetic voices.
Video & Animation: Generating video content, deepfakes, animation.
Other Areas: Drug discovery, synthetic data creation, personalized
recommendations.
Summary
Generative AI represents a shift from AI systems that simply analyze or classify
data to those that can create entirely new content, opening up new possibilities in
creativity, productivity, and problem-solving across industries.
Hands-on—Using OpenAI API or HuggingFace Transformers to
Generate Text; Prompt Engineering Basics
1. Introduction to Text Generation Tools
OpenAI API: Provides access to powerful language models (like GPT-3/4) that
can generate human-like text.
HuggingFace Transformers: An open-source library with many pre-trained
generative models (e.g., GPT-2, GPT-3, T5, BERT).
2. Hands-on: Generating Text
Using OpenAI API (Python Example)
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)
response = client.responses.create(
model="o4-mini",
instructions="You are a concise assistant.",
input="Explain the difference between a list and a tuple in Python.",
)
print(response.output_text)
Replace "YOUR_API_KEY" with your actual OpenAI API key.
Using HuggingFace Transformers (Python Example)
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")
prompt = "Once upon a time in a distant galaxy,"
result = generator(prompt, max_length=50, num_return_sequences=1)
print(result[0]['generated_text'])
3. Prompt Engineering Basics
What is Prompt Engineering?
The art of crafting effective input prompts to guide generative AI models
toward desired outputs.
Tips for Good Prompts:
Be clear and specific: "Summarize this article in three sentences."
Provide context: "Act as a travel guide and recommend places to visit in
Paris."
Use examples: "Translate the following English sentence to French: 'Hello,
how are you?'"
Experiment and iterate: Try different phrasings to get the best results.
4. Practice Exercise
Try generating:
A poem about summer.
A Python function that calculates factorial.
A product description for a new smartphone.
Experiment with prompt variations and observe how outputs change.
Summary
Generative AI enables machines to create new content, setting it apart from
traditional, rule-based AI.
Text generation is accessible using tools like OpenAI API or HuggingFace
Transformers.
Prompt engineering is key to getting high-quality, relevant outputs from
generative models.
Generative AI for Images and Code
Image Generation Basics — DALL-E, Stable Diffusion, Using Web
APIs and Demo Tools
Overview of Image Generation Models
DALL-E:
Developed by OpenAI, DALL-E (and its successors like DALL-E 2 and DALL-E
3) are powerful text-to-image models that generate high-quality, detailed
images from natural language prompts. DALL-E 3 improves over previous
versions by better understanding and rewriting prompts internally to produce
more compelling and accurate images.
Stable Diffusion:
An open-source text-to-image diffusion model that generates photorealistic
images by iteratively denoising random noise guided by a text prompt. It is
notable for being accessible for local installation and customization, unlike
some proprietary models.
How These Models Work (Briefly)
Diffusion Models:
Both DALL-E 2/3 and Stable Diffusion use diffusion techniques that start with
random noise and progressively refine it into a coherent image matching the
prompt. The process involves learning to reverse a noising process, guided by
text embeddings.
CLIP Model (DALL-E):
DALL-E uses a CLIP model to map text and images into a shared semantic
space, enabling the generation of images that semantically match the input
text.
Using DALL-E via Web API
OpenAI API:
You can generate images by sending a text prompt to the OpenAI API
specifying the DALL-E model version (e.g., DALL-E 3). The API returns URLs to
generated images.
Example Workflow:
Define a detailed text prompt describing the desired image.
Call the .images.generate() method with parameters like model , prompt , n
(number of images), and size .
Retrieve the image URL from the response and display or download it, as in the sketch below.
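A minimal sketch of this workflow (the model name and size shown are illustrative; check the current API documentation for available options):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a fox in a snowy forest at dawn",
    n=1,                # number of images
    size="1024x1024",
)
print(response.data[0].url)  # URL of the generated image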
Tips for Better Results:
More detailed prompts yield higher quality images. DALL-E 3 internally
rewrites prompts to optimize generation.
Using Stable Diffusion Locally or via Colab
Local Setup:
Clone the Stable Diffusion repository.
Create a Conda environment with required dependencies.
Download the model weights (e.g., checkpoint v1.4).
Run commands like:
python scripts/txt2img.py --prompt "your prompt here" --ckpt sd-v1-4.ckpt
GPU usage is recommended for faster generation; CPU is possible but
slower (8-12 minutes per image).
Colab Notebooks:
Google Colab notebooks allow running Stable Diffusion without local setup,
with GPU acceleration available on Colab Pro.
Demo Tools and Platforms
Platforms like Midjourney, RunwayML, and Hugging Face Spaces provide web
interfaces to generate images using Stable Diffusion or DALL-E models.
These tools often allow prompt refinement, image upscaling, outpainting, and
blending modes for creative control.
Code Generation with Large Language Models (LLMs) — GitHub
Copilot, OpenAI Codex; Practical Exercises
What is Code Generation with LLMs?
LLMs like OpenAI Codex and GitHub Copilot are trained on vast amounts of
code and natural language, enabling them to generate code snippets,
complete functions, or even entire programs from natural language prompts or
partial code.
GitHub Copilot
An AI-powered code completion tool integrated into code editors (e.g., VS
Code).
Suggests code lines or blocks as you type, based on context.
Supports many languages and frameworks.
Helps accelerate development, reduce boilerplate, and learn new APIs.
OpenAI Codex
The underlying model powering Copilot.
Accessible via API for custom code generation tasks.
Can generate code from natural language prompts, translate between
languages, or explain code.
Practical Exercises
Exercise 1: Generate a function from a docstring prompt.
Prompt: "Write a Python function to check if a number is prime."
Expected output: Function code implementing prime check.
Exercise 2: Complete partial code snippets.
Provide a partial function and ask the model to complete it.
Exercise 3: Generate unit tests for existing functions.
Prompt the model to write test cases based on function definitions.
Exercise 4: Translate code from one language to another.
E.g., Python to JavaScript.
Exercise 5: Debugging assistance.
Provide buggy code and ask for corrections or explanations.
Best Practices
Always review generated code for correctness and security.
Use generated code as a starting point or assistant, not a final solution.
Combine with human expertise for best results.
Retrieval Augmented Generation & LLM Frameworks
The Limits of LLMs and the Need for RAG
Explanation:
LLMs (like GPT, Gemini) are trained on vast but static datasets. Their
knowledge is frozen at training time and may be outdated or incomplete.
LLMs can “hallucinate” (make up facts) and struggle with domain-specific
or up-to-date information.
Example:
Ask ChatGPT: “Who won the 2024 Olympics?” (It can’t answer accurately
if trained before 2024.)
Discussion:
Why is this a problem for real-world applications (e.g., customer support,
research, enterprise tools)?
What is RAG? (Retrieval Augmented Generation)
Definition:
RAG combines information retrieval (searching relevant documents/data)
with generative AI (LLMs) to produce grounded, accurate, and context-
aware outputs.
How it Works:
1. Retrieve: Search for relevant documents/passages from a knowledge
base (using keyword or semantic search, often with vector databases).
2. Augment: Provide the retrieved content as context to the LLM.
3. Generate: The LLM uses both its training and the fetched context to
answer the user’s query.
Diagram:
User Query → Retriever (search) → Relevant Docs → LLM (with docs as context) → Answer
Key Benefits:
Enhanced accuracy: Reduces hallucination by grounding answers in real
data.
Up-to-date information: Can access current knowledge without
retraining.
Domain adaptation: Easily apply LLMs to private or niche datasets.
Scalability: Add new knowledge without retraining the model.
RAG in Practice—Real-World Applications
Examples:
Enterprise chatbots answering questions from internal documentation.
Research assistants summarizing the latest scientific papers.
Customer support bots accessing product manuals and support tickets.
Demo:
Show a live RAG-powered chatbot (e.g., Bing Copilot, Gemini Advanced, or
a simple open-source demo).
How Retrieval Works (Under the Hood)
Retrieval Methods:
Keyword Search: Classic search (e.g., Elasticsearch).
Semantic Search: Uses embeddings/vectors to find similar meaning, not
just keywords.
Hybrid Search: Combines both, often with a re-ranker for best results.
Vector Databases:
Store documents as embeddings for fast, semantic retrieval (e.g.,
Pinecone, ChromaDB).
Multi-modal Retrieval:
Not just text—can retrieve images, audio, etc. using the same principles.
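A toy sketch of semantic search with cosine similarity (the embeddings here are made-up 2-D vectors; real systems get them from an embedding model):

import numpy as np

# Hypothetical document embeddings
docs = {
    "refund policy": np.array([0.9, 0.1]),
    "shipping times": np.array([0.2, 0.8]),
}
query = np.array([0.85, 0.2])  # embedding of "how do I get my money back?"

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank documents by similarity to the query
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # refund policy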
Intro to Langchain, LlamaIndex, Building a Simple Q&A
System
Introduction to Langchain & LlamaIndex
Langchain:
A Python framework for building LLM-powered applications, with tools for
chaining together retrieval, generation, and more.
LlamaIndex:
A toolkit for connecting LLMs to custom data sources. Makes it easy to
ingest, index, and query your own documents.
Why these tools?
They simplify RAG workflows and speed up prototyping.
Setting Up Your Environment
Install the libraries:
pip install langchain llama-index chromadb streamlit
Obtain API keys for your LLM provider (OpenAI, Gemini, etc.).
Building a Simple Q&A System with Langchain
Step 1: Prepare Your Data
Use a few sample text files, PDFs, or URLs as your knowledge base.
Step 2: Index the Data
Example (Langchain with ChromaDB):
from langchain.document_loaders import DirectoryLoader
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
loader = DirectoryLoader("my_docs/")  # loads every document in the folder
documents = loader.load()
db = Chroma.from_documents(documents, OpenAIEmbeddings())
Step 3: Connect to an LLM
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(api_key="YOUR_API_KEY")
Step 4: Create the RetrievalQA Chain
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=db.as_retriever()
)
Step 5: Ask Questions!
answer = qa.run("What is the main topic of document X?")
print(answer)
Building with LlamaIndex
Step 1: Ingest Data
from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
Step 2: Create an Index
from llama_index import GPTVectorStoreIndex
index = GPTVectorStoreIndex.from_documents(documents)
Step 3: Query the Index
response = index.query("What is this document about?")
print(response)
Optional:
Use LlamaIndex’s Data Connectors to pull in data from PDFs, SQL, APIs, and more.
Hands-On Activity—Build and Test Your Own Q&A Bot
Task:
Use Langchain or LlamaIndex to build a simple Q&A system over a small
document set (e.g., Wikipedia articles, class notes, or company docs).
Suggested Steps:
1. Load your documents.
2. Index them with embeddings.
3. Connect to an LLM.
4. Run queries and observe the answers.
Challenge:
Try modifying the retriever (e.g., switch from keyword to semantic search).
Add a Streamlit UI for interactive Q&A.
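A minimal Streamlit wrapper, assuming the qa chain from the Langchain steps above (save as app.py and run with streamlit run app.py):

import streamlit as st

st.title("Q&A Bot")
question = st.text_input("Ask a question about your documents:")
if question:
    answer = qa.run(question)  # assumes the `qa` RetrievalQA chain built earlier
    st.write(answer)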
Project Ideas:
1. AI Art Generator for Game Assets (Using Stable Diffusion)
Description: Build a simple app that generates game art assets (characters,
backgrounds, items) from text prompts using Stable Diffusion.
Skills: Prompt engineering, API integration or local model usage, image saving
and display.
Tools: Stable Diffusion (local install with Automatic1111 WebUI or via
DreamStudio API), Python for scripting.
Why: Great for learning image generation basics and creating usable assets
for your own game projects.
Reference: Many beginners start with Stable Diffusion v1.5 or SDXL models
and user-friendly UIs like Fooocus or InvokeAI
2. Text-to-Image Web App with DALL-E API
Description: Create a web interface where users enter a text prompt and get
AI-generated images back using OpenAI’s DALL-E API.
Skills: REST API calls, frontend/backend integration, handling image URLs.
Tools: OpenAI API, Flask/Django or Node.js backend, React or plain HTML/JS
frontend.
Why: Learn how to integrate powerful generative AI models into web apps and
handle asynchronous API responses.
Reference: DALL-E API usage and prompt design tips
3. AI-Powered Image Style Transfer or Editing Tool
Description: Build a tool that applies different artistic styles or edits images
using AI models (e.g., Stable Diffusion inpainting or style transfer).
Skills: Image processing, model inference, UI for uploading and editing
images.
Tools: Stable Diffusion inpainting models, Python, OpenCV, Gradio or Streamlit
for UI.
Why: Hands-on experience with advanced generative AI features beyond
simple text-to-image generation.
Reference: Stable Diffusion’s editing capabilities and UI options
4. Code Generation Assistant Using OpenAI Codex or GitHub
Copilot API
Description: Build a simple code assistant chatbot that generates code
snippets based on user prompts or completes partial code.
Skills: NLP prompt engineering, API integration, conversational UI design.
Tools: OpenAI Codex API, Flask or FastAPI backend, React or plain JS
frontend.
Why: Learn code generation with LLMs and practical API usage for developer
tools.
Reference: Code generation with LLMs and practical exercises (see the code
generation session above).
5. AI-Powered Writing Assistant with Text Generation
Description: Develop an app that generates creative writing, summaries, or
paraphrases using GPT models.
Skills: Text generation, prompt tuning, UI/UX design.
Tools: OpenAI GPT API, Streamlit or Flask, basic frontend.
Why: Explore generative AI for NLP and content creation, useful for blogs,
marketing, or education.
6. Interactive Image Generation Playground with Multiple Models
Description: Build a playground app where users can generate images using
different models (DALL-E, Stable Diffusion, Midjourney API if available) and
compare results.
Skills: Multi-API integration, UI design, user input handling.
Tools: APIs for each model, React or Vue frontend, Node.js or Python
backend.
Why: Understand differences between generative models and provide users
with flexible creative tools.
Reference: Comparison of DALL-E, Midjourney, Stable Diffusion capabilities
7. AI-Powered Meme Generator
Description: Combine image generation with text overlay to create humorous
or themed memes from prompts.
Skills: Image generation, text rendering on images, web app development.
Tools: Stable Diffusion or DALL-E API, Pillow (Python imaging), Flask or React.
Why: Fun project to practice image generation and simple graphics
manipulation.
8. Personalized Avatar Creator Using Generative AI
Description: Generate custom avatars based on user descriptions or style
preferences. Include options for hair, clothes, background.
Skills: Prompt engineering, conditional generation, UI/UX design.
Tools: Stable Diffusion with control models, web frontend, backend API
integration.
Why: Practical use case for social apps or games, combining AI with user
inputs.