
Gen AI / ML and Python

Introduction to Programming & Python Basics


What is Programming?
Programming is the process of writing instructions (code) that a computer can
execute to perform specific tasks.

It allows you to automate calculations, process data, build applications, and


solve real-world problems.

Why Python for AI/ML?


Python is a popular, beginner-friendly programming language known for its
simple syntax and readability

It has a vast ecosystem of libraries for AI and machine learning (e.g., NumPy,
Pandas, scikit-learn, TensorFlow, PyTorch).

Python is widely used in industry and academia for data science, AI, and
automation due to its flexibility and strong community support

Installing Python
Most computers do not come with Python pre-installed, but installation is
straightforward:

Download Python from the official website: python.org

Follow the installation instructions for your operating system (Windows, macOS, Linux).

During installation, ensure you check the box to "Add Python to PATH" for easy access from the command line.

Using IDEs: Colab and Jupyter


IDEs (Integrated Development Environments) make coding easier by providing features like syntax highlighting, code completion, and error checking.

Jupyter Notebook:

A web-based tool for writing and running Python code in “cells.” Great for
experimentation and data analysis.

Install via pip install notebook or use JupyterLab.

Google Colab:

Free, cloud-based Jupyter notebooks. No installation needed—just sign in with your Google account at colab.research.google.com.

Excellent for beginners and for running code on any device.

Running Your First Python Script


Open your IDE or a terminal/command prompt.

Type the following code and run it:

print("Hello, World!")

This will display:

Hello, World!

In Jupyter or Colab, enter the code in a cell and press Shift+Enter to run it

Python Syntax
Python uses indentation (spaces or tabs) to define code blocks.

Statements end with a newline, not a semicolon.

Variables
Variables store data values. You do not need to declare their type.



name = "Alice"
age = 20

Data Types
String: Text, e.g., "Hello"

Integer: Whole numbers, e.g., 5

Float: Decimal numbers, e.g., 3.14

Boolean: True or False

greeting = "Hi"
year = 2025
pi = 3.1415
is_student = True

Input and Output


Output: Use print() to display information.

print("Welcome to Python!")

Input: Use input() to get user input (always returns a string).

user_name = input("Enter your name: ")
print("Hello,", user_name)

Simple Arithmetic Operations

a = 10
b = 3
print(a + b)   # Addition: 13
print(a - b)   # Subtraction: 7
print(a * b)   # Multiplication: 30
print(a / b)   # Division: 3.333...
print(a // b)  # Integer division: 3
print(a % b)   # Modulus (remainder): 1
print(a ** b)  # Exponentiation: 1000

Practice Exercise
Write a script that:

Asks for two numbers from the user,

Adds them,

Prints the result.

num1 = input("Enter first number: ")
num2 = input("Enter second number: ")
total = int(num1) + int(num2)  # avoid naming it "sum", which shadows the built-in sum()
print("The sum is:", total)

Control Flow & Data Structures


1. Conditional Statements ( if , elif , else )
Conditional statements control the flow of execution based on conditions.

Basic if Statement
Executes a block if the condition is True :

x = 10
if x > 5:
    print("x is greater than 5")

if-else Statement
Executes one block if the condition is True , another if False :



number = int(input("Enter a number: "))
if number > 0:
    print("Positive number")
else:
    print("Not a positive number")

If the user enters 10, output is Positive number.

If the user enters 0, output is Not a positive number.

if-elif-else Statement
Checks multiple conditions in sequence:

score = 85
if score >= 90:
    print("Grade: A")
elif score >= 80:
    print("Grade: B")
else:
    print("Grade: C")

Only the first True condition's block executes.

Nested if Statements
You can nest if statements inside each other for complex logic.
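For example, a minimal nested check (the values are illustrative):

age = 20
has_id = True
if age >= 18:
    if has_id:
        print("Entry allowed")
    else:
        print("ID required")
else:
    print("Too young")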

2. Loops ( for , while )


Loops are used to execute a block of code repeatedly.

For Loop
Used for iterating over a sequence (list, tuple, string, etc.):

fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(fruit)

Iterates over each element in the sequence

Use Cases:

Processing items in a list or tuple

Repeating actions a fixed number of times

While Loop
Executes as long as a condition is True :

i = 1
while i < 6:
    print(i)
    i += 1

Useful when the number of iterations is not known in advance

Use Cases:

Waiting for user input

Running until a specific event occurs (e.g., guessing games, event loops)

Loop Control Statements


break : Exit the loop immediately.

continue : Skip the current iteration and continue with the next.

else : Optional; runs if the loop completes normally (not via break ).

Example:

for i in range(5):
    if i == 3:
        break
    print(i)
else:
    print("Loop finished")

The else block will not execute if the loop is exited with break.
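For comparison, a short continue example that skips a single iteration:

for i in range(5):
    if i == 2:
        continue  # skip 2 and move straight to the next iteration
    print(i)
# Prints 0, 1, 3, 4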

Session 2: Lists, Tuples, Sets, Dictionaries

1. Lists
Definition: Ordered, mutable collection. Allows duplicates.

Syntax: my_list = [1, 2, 3]

Indexing: Zero-based ( my_list[0] is 1 )

Basic Operations:

Add: my_list.append(4)

Remove: my_list.remove(2)

Access: my_list[1] (returns 2 )

Slice: my_list[1:3] (returns [2, 3] )

Iterate:

for item in my_list:
    print(item)

Use Cases: Storing ordered data, dynamic collections

2. Tuples
Definition: Ordered, immutable collection. Allows duplicates.

Syntax: my_tuple = (1, 2, 3)

Indexing: Zero-based ( my_tuple[0] is 1 )

Basic Operations:

Access: my_tuple[1]

Count: my_tuple.count(2)

Index: my_tuple.index(3)



Iterate:

for item in my_tuple:
    print(item)

Use Cases: Fixed data, function returns, keys in dictionaries
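Because tuples are immutable (and therefore hashable), they can serve as dictionary keys, for example:

coordinates = {(40.71, -74.01): "New York", (51.51, -0.13): "London"}
print(coordinates[(40.71, -74.01)])  # New York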

3. Sets
Definition: Unordered, mutable collection of unique elements.

Syntax: my_set = {1, 2, 3}

Basic Operations:

Add: my_set.add(4)

Remove: my_set.remove(2)

Membership: 2 in my_set

Iterate:

for item in my_set:
    print(item)

Use Cases: Removing duplicates, set operations (union, intersection)
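For example, the core set operations:

a = {1, 2, 3}
b = {3, 4, 5}
print(a | b)  # union: {1, 2, 3, 4, 5}
print(a & b)  # intersection: {3}
print(a - b)  # difference: {1, 2}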

4. Dictionaries
Definition: Unordered (ordered as of Python 3.7+), mutable collection of key-
value pairs.

Syntax: my_dict = {'a': 1, 'b': 2}

Indexing: By key ( my_dict['a'] returns 1 )

Basic Operations:

Add/Update: my_dict['c'] = 3

Remove: my_dict.pop('b')

Access: my_dict['a']



Iterate:

for key, value in my_dict.items():
    print(key, value)

Use Cases: Fast lookups, mapping relationships

Comparison Table

Feature    | List      | Tuple   | Set     | Dictionary
Mutable    | Yes       | No      | Yes     | Yes
Ordered    | Yes       | Yes     | No      | Yes (3.7+)
Duplicates | Yes       | Yes     | No      | Keys: No, Values: Yes
Indexing   | Integer   | Integer | No      | Key-based
Syntax     | [1, 2, 3] | (1,2,3) | {1,2,3} | {'a':1, 'b':2}

Indexing and Iteration


Lists/Tuples: Use integer indices and slicing.

my_list = [10, 20, 30]
print(my_list[1])  # 20
for i, val in enumerate(my_list):
    print(i, val)

Sets: No indexing; iterate directly.

Dictionaries: Iterate over keys, values, or items.

for key in my_dict:
    print(key, my_dict[key])

for key, value in my_dict.items():
    print(key, value)



Advanced Iteration: Use range(), enumerate(), or custom logic for cyclic or indexed iteration, as shown below.
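A short sketch of both patterns, using a small example list:

colors = ["red", "green", "blue"]

# Indexed iteration with enumerate()
for i, color in enumerate(colors):
    print(i, color)

# Cyclic iteration: wrap the index around with the modulo operator
for i in range(7):
    print(colors[i % len(colors)])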

Functions, Modules, and File Handling


1. Defining and Using Functions
Function Definition:
Use the def keyword, followed by the function name, parentheses (with
optional parameters), and a colon. The function body is indented.

def greet():
    print("Hello, World!")

greet()  # Output: Hello, World!

Purpose:
Functions help organize code, promote reuse, and improve readability

2. Parameters and Arguments


Parameters:

Variables listed inside the parentheses in the function definition.

Arguments:
Values passed to the function when it is called.

Types of Parameters:

Positional: Order matters.

Keyword: Specify by name, order doesn't matter.

Default: Provide a default value.

Variable-length: Use *args to collect extra positional arguments into a tuple, and **kwargs to collect extra keyword arguments into a dictionary (see the sketch after the example below).

def add(a, b=5):
    return a + b

print(add(3))     # Output: 8
print(add(3, 7))  # Output: 10
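A minimal sketch of variable-length parameters (the function name and example values are illustrative):

def describe(*args, **kwargs):
    print("Positional:", args)  # extra positional arguments, collected as a tuple
    print("Keyword:", kwargs)   # extra keyword arguments, collected as a dict

describe(1, 2, name="Alice")
# Positional: (1, 2)
# Keyword: {'name': 'Alice'}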

3. Return Values
Returning Values:
Use the return statement to send a result back to the caller. If no return is
specified, the function returns None by default.

def square(x):
    return x * x

result = square(4)  # result is 16

Multiple Return Values:


Python allows returning multiple values as a tuple.

def stats(numbers):
    return min(numbers), max(numbers)

mn, mx = stats([1, 2, 3])
# mn = 1, mx = 3

Returning Lists, Dictionaries, or Functions:

Functions can return any object, including lists, dictionaries, or even other functions.

4. Scope in Python
Local Scope:

Variables defined inside a function are local and accessible only within that
function.

def foo():
    x = 10  # local to foo
    print(x)

foo()
# print(x)  # Error: x is not defined

Global Scope:
Variables defined outside any function are global and accessible throughout
the file.

x = 20
def bar():
    print(x)  # accesses global x

bar()

Nonlocal Scope:
Used in nested functions to refer to variables in the enclosing function.

def outer():
    x = "outer"
    def inner():
        nonlocal x
        x = "inner"
    inner()
    print(x)  # Output: inner

outer()

Variable Shadowing:
If a variable with the same name exists in both local and global scope, the local variable takes precedence inside the function, as shown below.
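A minimal illustration (the variable and function names are arbitrary):

x = "global"

def show():
    x = "local"  # shadows the global x inside this function
    print(x)     # Output: local

show()
print(x)  # Output: global (unchanged)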

5. Importing Modules and Using Standard Libraries


Importing Modules:

Use the import statement to bring in external code (modules).

import math

print(math.sqrt(16))  # Output: 4.0

Import Specific Functions:

from math import pi

print(pi)  # Output: 3.141592653589793

Standard Libraries:

Python comes with a rich set of standard modules, such as math , random ,
datetime , os , and sys .

Custom Modules:
You can create your own modules by saving functions in a .py file and importing them, as sketched below.
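A minimal sketch, assuming a file named mymodule.py saved next to your script (both the file name and the function are placeholders):

# mymodule.py
def greet(name):
    return f"Hello, {name}!"

# main script, in the same directory
import mymodule
print(mymodule.greet("Alice"))  # Hello, Alice!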

Reading from and Writing to Files


Opening Files:
Use the open() function with a filename and mode ( 'r' , 'w' , 'a' , etc.).

f = open('data.txt', 'r') # Open for reading

Reading Files:

read() : Reads the entire file.

readline() : Reads one line at a time.

readlines() : Reads all lines into a list.

with open('data.txt', 'r') as f:
    content = f.read()
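File objects are also iterable line by line, which is memory-friendly for large files (reusing the data.txt example above):

with open('data.txt', 'r') as f:
    for line in f:
        print(line.strip())  # strip() removes the trailing newline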

Writing Files:

'w' : Write (overwrites existing file or creates new).

'a' : Append to the end of the file.



with open('output.txt', 'w') as f:
    f.write("Hello, file!")

Best Practice:
Use the with statement to automatically close files.

Mode | Description
'r'  | Read (default)
'w'  | Write (overwrite/create)
'a'  | Append (end of file)
'r+' | Read and write
'w+' | Write and read (overwrite)
'a+' | Append and read

String Operations
Common String Methods:

len(s) : Length of string

s.upper() , s.lower() : Change case

s.strip() : Remove whitespace

s.replace('a', 'b') : Replace substring

s.split(',') : Split into list

','.join(list) : Join list into string

s.find('sub') : Find substring index

s.isdigit() , s.isalpha() : Check content type


text = "  Hello, World!  "
print(text.strip().upper())
# Output: HELLO, WORLD!

Formatting Strings:

f-strings: f"Value: {x}"



str.format() : "Value: {}".format(x)
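A quick side-by-side of the two styles (the variable names are illustrative):

name = "Alice"
score = 91.5
print(f"{name} scored {score:.1f}")            # f-string: Alice scored 91.5
print("{} scored {:.1f}".format(name, score))  # str.format equivalent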

Simple File-Based Mini Project: Word Counter


Objective:
Read a text file, count the frequency of each word, and write the results to a new
file.
Steps:

1. Read the file content.

2. Clean and split the text into words.

3. Count word occurrences.

4. Write the results to an output file.

Sample Code:

def count_words(input_file, output_file):
    with open(input_file, 'r') as f:
        text = f.read().lower()
    words = text.split()
    word_count = {}
    for word in words:
        word = word.strip('.,!?";:')
        word_count[word] = word_count.get(word, 0) + 1
    with open(output_file, 'w') as f:
        for word, count in sorted(word_count.items()):
            f.write(f"{word}: {count}\n")

count_words('input.txt', 'word_count.txt')

This project demonstrates file reading/writing, string manipulation, and dictionary usage.

Python for Data Science



What is NumPy?
NumPy stands for Numerical Python and is a foundational library for
numerical and scientific computing in Python.

It provides a powerful n-dimensional array object and useful functions for performing mathematical operations efficiently.

Commonly used for: data analysis, scientific computing, and as the base for
other libraries like Pandas and SciPy.

Creating and Working with NumPy Arrays


a. Creating Arrays

From lists:

import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])

Multi-dimensional arrays:

b = np.array([[1, 2, 3], [4, 5, 6]])

Arrays filled with zeros or ones:

zeros = np.zeros(3)  # array([0., 0., 0.])
ones = np.ones(3)    # array([1., 1., 1.])

Arrays with a range of numbers:

arr = np.arange(0, 10, 2)        # array([0, 2, 4, 6, 8])
linspace = np.linspace(0, 1, 5)  # array([0.  , 0.25, 0.5 , 0.75, 1.  ])

Specify data type:

arr = np.ones(3, dtype=np.int64)



b. Indexing and Slicing

Access elements by index (0-based):

print(a[0]) # 1

Slicing:

print(a[:3]) # array([1, 2, 3])

c. Basic Operations

Element-wise operations:

data = np.array([1, 2])
ones = np.ones(2, dtype=int)
print(data + ones)  # array([2, 3])
print(data * data)  # array([1, 4])

Aggregations:

a = np.array([1, 2, 3, 4])
print(a.sum())        # 10

b = np.array([[1, 1], [2, 2]])
print(b.sum(axis=0))  # array([3, 3]) -- column sums (collapses the rows)
print(b.sum(axis=1))  # array([2, 4]) -- row sums (collapses the columns)

Reshaping:



c = np.arange(12).reshape(3, 4)
print(c)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

What is Pandas?
Pandas is a Python library built on top of NumPy, designed for data
manipulation and analysis.

It introduces two main data structures:

Series: 1D labeled array

DataFrame: 2D labeled, tabular data structure (like an Excel spreadsheet)
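For example, a quick look at a Series, the 1D labeled structure:

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])  # 20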

Creating and Working with Pandas DataFrames


a. Creating DataFrames

From a dictionary:

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

From a NumPy array:

arr = np.array([[1, 2], [3, 4]])
df2 = pd.DataFrame(arr, columns=['A', 'B'])

b. Accessing Data

Select a column:

print(df['Name'])

Select rows by index:



print(df.iloc[0]) # first row
print(df.loc[0]) # by label/index

Slicing rows:

print(df[0:2]) # first two rows

c. Basic Data Manipulation

Add a new column:

df['Salary'] = [50000, 60000, 70000]

Filter rows:

adults = df[df['Age'] > 28]

Drop a column:

df = df.drop('Salary', axis=1)

Handle missing values:

df.isnull()
df.fillna(0)
df.dropna()

Aggregation:

df['Age'].mean()
df.groupby('Name').sum()

Sorting:



df.sort_values('Age')

Practice Exercise Examples

Create a NumPy array of numbers from 10 to 19.

Add two NumPy arrays element-wise.

Create a Pandas DataFrame from a list of dictionaries.

Select all rows in the DataFrame where Age > 30.

Calculate the sum and mean of a DataFrame column.

Replace missing values in a DataFrame with the column mean.

Sort the DataFrame by a specific column.

Handle the missing data {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan]}

Introduction to Data Cleaning


Definition: Data cleaning is the process of fixing or removing incorrect,
corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset

Why it matters: Clean data leads to more accurate analysis and insights.
Messy data can cause errors, misleading results, or make analysis impossible.

Common Data Cleaning Steps:

Remove duplicate or irrelevant data (e.g., repeated rows, out-of-scope entries).

Fix structural errors (e.g., typos, inconsistent capitalization, mixed formats).

Handle missing values (e.g., fill with mean/median, remove rows/columns).

Standardize data (e.g., consistent date formats, units, text casing).

Example:

import pandas as pd



df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'bob'],
    'Score': [85, 90, 85, None, 90]
})
df = df.drop_duplicates()
df['Name'] = df['Name'].str.capitalize()
df['Score'] = df['Score'].fillna(df['Score'].mean())
print(df)

Data Filtering
Definition: Filtering means selecting rows that meet certain conditions,
helping you focus on relevant data

How to filter: Use boolean indexing in Pandas.

Example:

# Filter students with Score above 85
filtered_df = df[df['Score'] > 85]
print(filtered_df)

Combine multiple conditions using & (and) or | (or):

# Students named 'Bob' with Score above 85
filtered_df = df[(df['Name'] == 'Bob') & (df['Score'] > 85)]
print(filtered_df)

Data Sorting
Definition: Sorting arranges your data by the values in one or more columns,
making it easier to spot patterns or outliers.

How to sort: Use .sort_values() in Pandas.

Example:



# Sort by Score in descending order
sorted_df = df.sort_values('Score', ascending=False)
print(sorted_df)

Sort by multiple columns:

# Sort by Name (A-Z), then by Score (high to low)
sorted_df = df.sort_values(['Name', 'Score'], ascending=[True, False])
print(sorted_df)

Simple Data Analysis Project


Project Idea:
Analyze a small dataset (e.g., students and scores, product sales, or Titanic
dataset) using the above techniques.
Project Steps:

1. Load the data (from a CSV or dictionary).

2. Clean the data (remove duplicates, fix names, handle missing values).

3. Filter the data (e.g., select students with high scores, products with sales
above a threshold).

4. Sort the data (e.g., by score, by product price).

5. Draw simple conclusions (e.g., who has the highest score? How many
products sold more than 10 units?).

Example:

import pandas as pd

# 1. Load data
data = {
    'Student': ['Alice', 'Bob', 'Alice', 'Charlie', 'David'],
    'Score': [85, 90, 85, None, 75]
}
df = pd.DataFrame(data)

# 2. Clean data
df = df.drop_duplicates()
df['Score'] = df['Score'].fillna(df['Score'].mean())

# 3. Filter: Scores above 80
high_scores = df[df['Score'] > 80]

# 4. Sort: By Score descending
sorted_scores = high_scores.sort_values('Score', ascending=False)

# 5. Analyze: Highest scorer
top_student = sorted_scores.iloc[0]

print("Cleaned Data:\n", df)


print("High Scores:\n", high_scores)
print("Sorted High Scores:\n", sorted_scores)
print("Top Student:\n", top_student)

In-Class Activity
Give students a small CSV or dictionary-based dataset.

Ask them to:

Remove duplicates

Fill missing values

Filter for a specific condition (e.g., scores above a threshold)

Sort the results

Print the top result

Data Visualization & Project



Why Data Visualization?
Data visualization helps you understand data, spot trends, and communicate
insights effectively.

Python’s most popular libraries for visualization are Matplotlib and Seaborn

2. Introduction to Matplotlib
Matplotlib is a foundational plotting library in Python, offering flexibility to
create a wide variety of static, animated, and interactive plots

It’s often imported as import matplotlib.pyplot as plt .

Basic Line Plot Example

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

plt.plot(x, y)
plt.title('Simple Line Plot')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.show()

You can customize colors, line styles, and add titles/labels easily

Other Basic Plots in Matplotlib


Bar Chart:

students = ["Alice", "Bob", "Charlie"]
scores = [85, 90, 78]
plt.bar(students, scores, color='skyblue')
plt.title("Student Scores")
plt.xlabel("Student")
plt.ylabel("Score")
plt.show()

Pie Chart:

labels = ["Python", "Java", "C++"]
sizes = [50, 30, 20]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Programming Language Popularity")
plt.show()

Scatter Plot:

x = [1, 2, 3, 4, 5]
y = [5, 7, 4, 6, 5]
plt.scatter(x, y)
plt.title("Scatter Plot Example")
plt.xlabel("X Value")
plt.ylabel("Y Value")
plt.show()

Introduction to Seaborn
Seaborn is built on top of Matplotlib and provides a higher-level, more user-
friendly interface for creating attractive statistical graphics

It works seamlessly with Pandas DataFrames and comes with better default
styles and color palettes

Getting Started with Seaborn

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme()  # Apply Seaborn's default styling



Basic Seaborn Plots
Histogram:

import seaborn as sns
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
sns.histplot(data)
plt.title("Histogram Example")
plt.show()

Scatter Plot:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.title("Total Bill vs Tip")
plt.show()

Line Plot:

import seaborn as sns
import matplotlib.pyplot as plt

fmri = sns.load_dataset("fmri")
sns.lineplot(x="timepoint", y="signal", data=fmri)
plt.title("FMRI Signal Over Time")
plt.show()

4. Matplotlib vs. Seaborn: When to Use Which?



Matplotlib: More control and customization, suitable for publication-quality
graphics and unique plot types

Seaborn: Simpler code for statistical plots, better default styles, and works
well for quick exploratory analysis

Hands-On Practice
Exercise Ideas:

Plot a bar chart of your favorite fruits and their quantities.

Visualize random numbers as a histogram using both Matplotlib and Seaborn.

Use Seaborn to plot a scatter plot from the built-in "tips" dataset.

Introduction to Machine Learning


What is Machine Learning (ML)?
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables
computers to learn from data and make predictions or decisions without being
explicitly programmed for each task.

ML algorithms identify patterns in data, learn from past experiences, and


improve their performance over time with minimal human intervention

The process involves feeding large amounts of data to algorithms, which then
optimize their internal parameters to minimize errors and make accurate
predictions or classifications

Types of Machine Learning

Supervised Learning
Definition: The algorithm is trained on labeled data, meaning each input
comes with a known output.

Goal: Learn a mapping from inputs to outputs so it can predict the output for
new, unseen data.



Common Algorithms: Linear regression, logistic regression, decision trees,
support vector machines, neural networks.

Use Cases: Email spam detection, image classification, credit scoring, medical
diagnosis

Unsupervised Learning
Definition: The algorithm is given data without explicit labels and must find
patterns or groupings on its own.

Goal: Discover hidden structures or relationships in the data.

Common Algorithms: K-means clustering, hierarchical clustering, principal component analysis (PCA), association rule learning.

Use Cases: Customer segmentation, anomaly detection, market basket analysis, dimensionality reduction.

Reinforcement Learning
Definition: An agent learns to make decisions by interacting with an
environment, receiving rewards or penalties for actions.

Goal: Maximize cumulative reward over time.

Use Cases: Robotics, game playing, recommendation systems, autonomous vehicles.

Machine Learning Workflow


A typical ML workflow consists of several key stages:

1. Problem Definition: Clearly define the business or research problem and success criteria.

2. Data Collection: Gather relevant and high-quality data from various sources.

3. Data Preparation: Clean, preprocess, and transform data (handle missing values, encode categories, normalize features).

4. Exploratory Data Analysis (EDA): Analyze data to understand distributions, relationships, and potential issues.



5. Model Selection: Choose appropriate algorithms based on the problem (classification, regression, clustering, etc.).

6. Model Training: Fit the model to the training data, adjusting parameters to minimize errors.

7. Model Evaluation: Assess model performance using metrics (accuracy, precision, recall, RMSE, etc.) on validation/test data.

8. Model Tuning: Optimize hyperparameters to improve performance.

9. Deployment: Integrate the trained model into production systems for real-world use.

10. Monitoring and Maintenance: Continuously monitor model performance and retrain as needed.

Real-World Applications of Machine Learning


Healthcare: Disease prediction, medical image analysis, drug discovery.

Finance: Fraud detection, credit scoring, algorithmic trading.

Retail: Recommendation systems, customer segmentation, inventory management.

Transportation: Self-driving cars, route optimization, demand forecasting.

Natural Language Processing: Chatbots, sentiment analysis, language translation.

Manufacturing: Predictive maintenance, quality control, supply chain optimization.

Introduction to Scikit-learn and Building Your First ML Model

What is Scikit-learn?
Scikit-learn (sklearn) is a popular open-source Python library for machine
learning.

It provides simple and efficient tools for data mining and data analysis,
supporting both supervised and unsupervised learning.



Built on top of NumPy, SciPy, and Matplotlib, it offers a consistent interface for
a wide range of algorithms, including classification, regression, clustering, and
dimensionality reduction.

Scikit-learn is widely used in industry and academia due to its ease of use,
extensive documentation, and active community support.

Key Features
Ready-to-use algorithms for classification, regression, clustering, and more.

Tools for data preprocessing, model selection, and evaluation.

Built-in datasets for practice (e.g., Iris, Digits, California Housing).

Integration with other Python libraries for data science workflows.

Building Your First ML Model with Scikit-learn

Example: Classification with the Iris Dataset


Step-by-Step Process:

1. Import Libraries and Load Data

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

2. Split Data into Training and Test Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

3. Initialize and Train the Model



model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

4. Make Predictions

y_pred = model.predict(X_test)

5. Evaluate the Model

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Interpretation: The model predicts the species of iris flowers based on features like petal and sepal length/width. The accuracy score indicates how well the model performs on unseen data.

Example: Simple Regression

Using the California Housing dataset for a regression task (the Boston Housing dataset was removed from recent scikit-learn releases, so California Housing is used here instead):

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

Interpretation: The model predicts house prices based on features like average number of rooms, location, etc. The mean squared error (MSE) measures prediction accuracy for regression tasks.

Key Takeaways:
Machine learning enables systems to learn from data and make predictions.

Supervised and unsupervised learning are the two main types, each with
distinct use cases.

The ML workflow involves data preparation, model training, evaluation, and deployment.

Scikit-learn simplifies building, evaluating, and deploying ML models in Python, making it accessible for beginners and professionals alike.

Supervised Learning in Practice


Linear Regression — Theory, Implementation, Evaluation Metrics

Theory of Linear Regression


Definition:
Linear regression models the relationship between one or more independent
variables (predictors) and a continuous dependent variable by fitting a linear
equation to observed data.

Mathematical Model:
For simple linear regression (one predictor), the model is
$$y = \beta_0 + \beta_1 x + \epsilon$$
where:

$y$ = dependent variable

$x$ = independent variable

$\beta_0$ = intercept (value of $y$ when $x = 0$)

$\beta_1$ = slope (change in $y$ per unit change in $x$)

$\epsilon$ = error term (difference between observed and predicted values)

Goal:
Find $\beta_0$ and $\beta_1$ that minimize the sum of squared residuals (differences between observed and predicted $y$); this is called the Ordinary Least Squares (OLS) method.

Multiple Linear Regression:
Extends to multiple predictors:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

Implementation of Linear Regression in Python

a) Manual Calculation (Using NumPy)

Calculate the means of $x$ and $y$.

Compute the slope ($\beta_1$) and intercept ($\beta_0$) using the formulas:
$$\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$

Predict values using:
$$\hat{y} = \beta_0 + \beta_1 x$$

Example:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

x_mean = np.mean(x)
y_mean = np.mean(y)

B1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
B0 = y_mean - B1 * x_mean

y_pred = B0 + B1 * x
print(f"Slope: {B1}, Intercept: {B0}")
print("Predicted values:", y_pred)

b) Using Scikit-learn
Import LinearRegression from sklearn.linear_model .

Fit the model on training data.

Predict and evaluate.

Example:

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])  # 2D array for sklearn
y = np.array([2, 4, 5, 4, 5])

model = LinearRegression()
model.fit(X, y)

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

y_pred = model.predict(X)
print("Predictions:", y_pred)



c) Visualization with Matplotlib and SciPy
Use scipy.stats.linregress to get slope, intercept, and statistical measures.

Plot scatter and regression line.

Example:

import matplotlib.pyplot as plt
from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

def predict(x):
    return slope * x + intercept

y_pred = list(map(predict, x))

plt.scatter(x, y)
plt.plot(x, y_pred, color='red')
plt.show()

3. Evaluation Metrics for Linear Regression


Mean Squared Error (MSE): Average squared difference between actual and predicted values.

Root Mean Squared Error (RMSE): Square root of MSE; interpretable in original units.

Mean Absolute Error (MAE): Average absolute difference.

R-squared ($R^2$): Proportion of variance in the dependent variable explained by the model; ranges from 0 to 1.

Using Scikit-learn Metrics:



from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"MSE: {mse}")
print(f"R-squared: {r2}")

Logistic Regression
Purpose:
Used for binary classification problems (output is categorical: 0 or 1).

Theory:
Instead of predicting continuous values, logistic regression predicts the probability that an input belongs to a class using the logistic (sigmoid) function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$.

Output:
A probability between 0 and 1, which is thresholded (commonly at 0.5) to assign class labels.

Use Cases:
Spam detection, disease diagnosis, customer churn prediction.

Decision Trees
Definition:
A tree-like model of decisions that splits data based on feature values to
classify or predict outcomes.

How it Works:



The tree splits nodes by selecting the feature and threshold that best separate
classes (using criteria like Gini impurity or entropy).

Advantages:
Easy to interpret, handles both numerical and categorical data, non-linear
relationships.

Limitations:
Can overfit, sensitive to small data changes.

Hands-on with Scikit-learn: Logistic Regression and Decision Trees

Logistic Regression Example (Iris Dataset)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# For binary classification, select two classes
X = X[y != 2]
y = y[y != 2]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Decision Tree Example (Iris Dataset)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Using the same train/test split as above, train a decision tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Predict and evaluate
y_pred_tree = tree.predict(X_test)
print(classification_report(y_test, y_pred_tree))

Clustering (K-Means), Dimensionality Reduction (PCA), and Hands-on Examples

K-Means Clustering: Theory and Algorithm


What is K-Means?
K-Means is an unsupervised learning algorithm used to partition data into K
distinct clusters based on feature similarity.

How It Works:

1. Initialization: Randomly select K centroids (cluster centers).

2. Expectation Step: Assign each data point to the nearest centroid based on
Euclidean distance.

3. Maximization Step: Update centroids by calculating the mean of all points assigned to each cluster.

4. Repeat steps 2 and 3 until centroids stabilize (no change in assignments).

Objective:



Minimize the sum of squared errors (SSE) — the sum of squared distances
between points and their cluster centroids.

Challenges:

Choosing the right K (number of clusters).

Sensitivity to centroid initialization (can lead to different results).

Non-deterministic; often run multiple times with different initializations.

Elbow Method:
Plot SSE against different values of K to find the "elbow" point where adding
more clusters yields diminishing returns.
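A minimal sketch of the elbow method on the iris features (the range of K values tried here is arbitrary):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

X = load_iris().data
sse = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)  # SSE for this value of K

plt.plot(range(1, 10), sse, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('SSE (inertia)')
plt.title('Elbow Method')
plt.show()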

Dimensionality Reduction: Principal Component Analysis (PCA)


Purpose:
Reduce the number of features (dimensions) in a dataset while preserving as
much variance (information) as possible.

How PCA Works:

Computes new orthogonal axes (principal components) that capture maximum variance.

The first principal component captures the most variance, the second is orthogonal to it and captures the next most, and so forth.

Data is projected onto these components, reducing dimensionality.

Benefits:

Simplifies visualization (e.g., 2D or 3D plots).

Reduces noise and computational cost.

Helps avoid the “curse of dimensionality” in machine learning.

Hands-on Example: K-Means Clustering with PCA in Python

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load dataset
data = load_iris()
X = data.data

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce dimensions to 2 for visualization
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=50, max_iter=500, random_state=42)
clusters = kmeans.fit_predict(X_pca)

# Plot clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clustering with PCA')
plt.show()

Model Evaluation Metrics

1. Accuracy
Proportion of correct predictions (both true positives and true negatives) over
total predictions.

Best for balanced datasets.

2. Precision



Measures how many predicted positives are actually positive.
$$\text{Precision} = \frac{TP}{TP + FP}$$

3. Recall (Sensitivity)
Measures how many actual positives were correctly identified.
$$\text{Recall} = \frac{TP}{TP + FN}$$

4. F1-Score
Harmonic mean of precision and recall, balancing both.
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

5. Confusion Matrix

                | Predicted Positive  | Predicted Negative
Actual Positive | True Positive (TP)  | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)

Shows counts of true/false positives and negatives.

6. Cross-Validation
Technique to assess model generalization by splitting data into multiple
train/test folds.

Common method: k-fold cross-validation, where data is divided into k subsets; each subset is used once as test data while the others train the model.

Helps avoid overfitting and provides robust performance estimates.
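A minimal sketch using scikit-learn's cross_val_score with the iris classifier from earlier:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# 5-fold cross-validation: five train/test splits, five accuracy scores
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())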

Introduction to Deep Learning and Neural Networks


What is Deep Learning?



A subset of machine learning that uses artificial neural networks with many
layers (deep architectures) to model complex patterns in data.

Excels at tasks like image recognition, natural language processing, and speech recognition.

Learns hierarchical feature representations automatically.

Neural Network Basics


Neuron:
Basic computational unit that receives inputs, applies weights, adds bias, and
passes the result through an activation function.

Weights and Biases:
Parameters learned during training that determine the importance of inputs.

Activation Functions:
Introduce non-linearity; common types include:

Sigmoid: Outputs values between 0 and 1.

ReLU (Rectified Linear Unit): Outputs zero for negative inputs, linear for
positive.

Tanh: Outputs values between -1 and 1.
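A quick NumPy sketch of these three functions (the printed outputs are approximate):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # squashes values into (0, 1)

def relu(z):
    return np.maximum(0, z)      # zero for negatives, identity otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # [0.119 0.5   0.881]
print(relu(z))     # [0. 0. 2.]
print(np.tanh(z))  # [-0.964  0.     0.964]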

Layers:

Input layer: Receives raw data.

Hidden layers: Perform transformations.

Output layer: Produces final prediction.

Building a Simple Neural Network with TensorFlow/Keras: Digit Recognition (MNIST)

Step-by-step example:

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Preprocess data
X_train = X_train.reshape(-1, 28*28).astype('float32') / 255
X_test = X_test.reshape(-1, 28*28).astype('float32') / 255

y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Build model
model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# Compile model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate model
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")

Explanation:

Flatten 28x28 images into 784-dimensional vectors.

Two hidden layers with ReLU activation.



Output layer with softmax for multi-class classification (digits 0-9).

Use categorical_crossentropy loss and adam optimizer.

Introduction to Generative AI

What is Generative AI?


Generative AI is a branch of artificial intelligence focused on creating new, original
content—such as text, images, music, or code—by learning patterns from large
datasets. Unlike traditional AI, which is designed to analyze data and make
predictions or decisions based on predefined rules, generative AI can produce
outputs that did not previously exist, mimicking creativity and innovation.

Generative AI vs. Traditional AI


Aspect        | Traditional AI                                 | Generative AI
Core Function | Analyzes data, makes predictions, classifies   | Creates new content (text, images, code, music, etc.)
Approach      | Rule-based, pattern recognition                | Pattern creation, self-learning from data
Output        | Predictions, classifications, recommendations  | Original content (stories, images, code, etc.)
Examples      | Spam filters, recommendation systems, chatbots | ChatGPT, DALL·E, GitHub Copilot, music generators
Data Type     | Structured data                                | Structured & unstructured data
Creativity    | Limited to defined tasks                       | Capable of producing creative, novel outputs

Key Difference:
Traditional AI is reactive and task-oriented, excelling at analyzing and predicting
within set boundaries. Generative AI is proactive, capable of producing new,
creative content by learning from existing data

Key Applications of Generative AI



Text Generation: Chatbots (ChatGPT), content writing, translation,
summarization.

Image Generation: Creating artwork (DALL·E, Midjourney), photo editing, style transfer.

Code Generation: Writing and completing code (GitHub Copilot, OpenAI Codex).

Audio & Music: Composing music, generating synthetic voices.

Video & Animation: Generating video content, deepfakes, animation.

Other Areas: Drug discovery, synthetic data creation, personalized recommendations.

Summary
Generative AI represents a shift from AI systems that simply analyze or classify
data to those that can create entirely new content, opening up new possibilities in
creativity, productivity, and problem-solving across industries

Hands-on—Using OpenAI API or HuggingFace Transformers to Generate Text; Prompt Engineering Basics

1. Introduction to Text Generation Tools


OpenAI API: Provides access to powerful language models (like GPT-3/4) that
can generate human-like text.

HuggingFace Transformers: An open-source library with many pre-trained generative models (e.g., GPT-2, GPT-3, T5, BERT).

2. Hands-on: Generating Text

Using OpenAI API (Python Example)

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

response = client.responses.create(
    model="o4-mini",
    instructions="You are a concise assistant.",
    input="Explain the difference between a list and a tuple in Python.",
)

print(response.output_text)

Set the OPENAI_API_KEY environment variable to your actual OpenAI API key.

Using HuggingFace Transformers (Python Example)

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Once upon a time in a distant galaxy,"
result = generator(prompt, max_length=50, num_return_sequences=1)
print(result[0]['generated_text'])

3. Prompt Engineering Basics


What is Prompt Engineering?
The art of crafting effective input prompts to guide generative AI models
toward desired outputs.

Tips for Good Prompts:

Be clear and specific: "Summarize this article in three sentences."

Provide context: "Act as a travel guide and recommend places to visit in Paris."

Use examples: "Translate the following English sentence to French: 'Hello, how are you?'"



Experiment and iterate: Try different phrasings to get the best results.

4. Practice Exercise
Try generating:

A poem about summer.

A Python function that calculates factorial.

A product description for a new smartphone.

Experiment with prompt variations and observe how outputs change.

Summary
Generative AI enables machines to create new content, setting it apart from
traditional, rule-based AI.

Text generation is accessible using tools like OpenAI API or HuggingFace Transformers.

Prompt engineering is key to getting high-quality, relevant outputs from generative models.

Generative AI for Images and Code


Image Generation Basics — DALL-E, Stable Diffusion, Using Web APIs and Demo Tools

Overview of Image Generation Models


DALL-E:
Developed by OpenAI, DALL-E (and its successors like DALL-E 2 and DALL-E
3) are powerful text-to-image models that generate high-quality, detailed
images from natural language prompts. DALL-E 3 improves over previous
versions by better understanding and rewriting prompts internally to produce
more compelling and accurate images.

Stable Diffusion:



An open-source text-to-image diffusion model that generates photorealistic
images by iteratively denoising random noise guided by a text prompt. It is
notable for being accessible for local installation and customization, unlike
some proprietary models.

How These Models Work (Briefly)


Diffusion Models:
Both DALL-E 2/3 and Stable Diffusion use diffusion techniques that start with
random noise and progressively refine it into a coherent image matching the
prompt. The process involves learning to reverse a noising process, guided by
text embeddings.

CLIP Model (DALL-E):
DALL-E uses a CLIP model to map text and images into a shared semantic space, enabling the generation of images that semantically match the input text.

Using DALL-E via Web API


OpenAI API:
You can generate images by sending a text prompt to the OpenAI API
specifying the DALL-E model version (e.g., DALL-E 3). The API returns URLs to
generated images.

Example Workflow:

Define a detailed text prompt describing the desired image.

Call the .images.generate() method with parameters like model, prompt, n (number of images), and size.

Retrieve the image URL from the response and display or download it.

Tips for Better Results:


More detailed prompts yield higher quality images. DALL-E 3 internally
rewrites prompts to optimize generation.

Using Stable Diffusion Locally or via Colab



Local Setup:

Clone the Stable Diffusion repository.

Create a Conda environment with required dependencies.

Download the model weights (e.g., checkpoint v1.4).

Run commands like:

python scripts/txt2img.py --prompt "your prompt here" --ckpt sd-v1-4.ckpt

GPU usage is recommended for faster generation; CPU is possible but slower (8-12 minutes per image).

Colab Notebooks:
Google Colab notebooks allow running Stable Diffusion without local setup,
with GPU acceleration available on Colab Pro.

Demo Tools and Platforms


Platforms like Midjourney, RunwayML, and Hugging Face Spaces provide web
interfaces to generate images using Stable Diffusion or DALL-E models.

These tools often allow prompt refinement, image upscaling, outpainting, and
blending modes for creative control.

Code Generation with Large Language Models (LLMs) — GitHub Copilot, OpenAI Codex; Practical Exercises

What is Code Generation with LLMs?


LLMs like OpenAI Codex and GitHub Copilot are trained on vast amounts of
code and natural language, enabling them to generate code snippets,
complete functions, or even entire programs from natural language prompts or
partial code.

GitHub Copilot
An AI-powered code completion tool integrated into code editors (e.g., VS
Code).



Suggests code lines or blocks as you type, based on context.

Supports many languages and frameworks.

Helps accelerate development, reduce boilerplate, and learn new APIs.

OpenAI Codex
The underlying model powering Copilot.

Accessible via API for custom code generation tasks.

Can generate code from natural language prompts, translate between languages, or explain code.

Practical Exercises
Exercise 1: Generate a function from a docstring prompt.
Prompt: "Write a Python function to check if a number is prime."
Expected output: Function code implementing prime check.

Exercise 2: Complete partial code snippets.

Provide a partial function and ask the model to complete it.

Exercise 3: Generate unit tests for existing functions.
Prompt the model to write test cases based on function definitions.

Exercise 4: Translate code from one language to another.
E.g., Python to JavaScript.

Exercise 5: Debugging assistance.
Provide buggy code and ask for corrections or explanations.

Best Practices
Always review generated code for correctness and security.

Use generated code as a starting point or assistant, not a final solution.

Combine with human expertise for best results.



Retrieval Augmented Generation & LLM Frameworks
The Limits of LLMs and the Need for RAG
Explanation:

LLMs (like GPT, Gemini) are trained on vast but static datasets. Their
knowledge is frozen at training time and may be outdated or incomplete.

LLMs can “hallucinate” (make up facts) and struggle with domain-specific or up-to-date information.

Example:

Ask ChatGPT: “Who won the 2024 Olympics?” (It can’t answer accurately
if trained before 2024.)

Discussion:

Why is this a problem for real-world applications (e.g., customer support, research, enterprise tools)?

What is RAG? (Retrieval Augmented Generation)


Definition:

RAG combines information retrieval (searching relevant documents/data) with generative AI (LLMs) to produce grounded, accurate, and context-aware outputs.

How it Works:

1. Retrieve: Search for relevant documents/passages from a knowledge base (using keyword or semantic search, often with vector databases).

2. Augment: Provide the retrieved content as context to the LLM.

3. Generate: The LLM uses both its training and the fetched context to
answer the user’s query.

Diagram:

User Query → Retriever (search) → Relevant Docs → LLM (with docs as context) → Answer



Key Benefits:

Enhanced accuracy: Reduces hallucination by grounding answers in real data.

Up-to-date information: Can access current knowledge without retraining.

Domain adaptation: Easily apply LLMs to private or niche datasets.

Scalability: Add new knowledge without retraining the model.

RAG in Practice—Real-World Applications


Examples:

Enterprise chatbots answering questions from internal documentation.

Research assistants summarizing the latest scientific papers.

Customer support bots accessing product manuals and support tickets.

Demo:

Show a live RAG-powered chatbot (e.g., Bing Copilot, Gemini Advanced, or a simple open-source demo).

How Retrieval Works (Under the Hood)


Retrieval Methods:

Keyword Search: Classic search (e.g., Elasticsearch).

Semantic Search: Uses embeddings/vectors to find similar meaning, not just keywords.

Hybrid Search: Combines both, often with a re-ranker for best results.

Vector Databases:

Store documents as embeddings for fast, semantic retrieval (e.g., Pinecone, ChromaDB).

Multi-modal Retrieval:

Not just text—can retrieve images, audio, etc. using the same principles.



Intro to Langchain, LlamaIndex, Building a Simple Q&A System

Introduction to Langchain & LlamaIndex


Langchain:
A Python framework for building LLM-powered applications, with tools for chaining together retrieval, generation, and more.

LlamaIndex:
A toolkit for connecting LLMs to custom data sources. Makes it easy to ingest, index, and query your own documents.

Why these tools?
They simplify RAG workflows and speed up prototyping.

Setting Up Your Environment


Install the libraries:

pip install langchain llama-index chromadb streamlit

Obtain API keys for your LLM provider (OpenAI, Gemini, etc.).

Building a Simple Q&A System with Langchain


Step 1: Prepare Your Data

Use a few sample text files, PDFs, or URLs as your knowledge base.

Step 2: Index the Data

Example (Langchain with ChromaDB):

from langchain.document_loaders import TextLoader
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# TextLoader expects a path to a single text file ("notes.txt" is a placeholder)
loader = TextLoader("my_docs/notes.txt")
documents = loader.load()
db = Chroma.from_documents(documents, OpenAIEmbeddings())

Step 3: Connect to an LLM

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(api_key="YOUR_API_KEY")

Step 4: Create the RetrievalQA Chain

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=db.as_retriever()
)

Step 5: Ask Questions!

answer = qa.run("What is the main topic of document X?")
print(answer)

Building with LlamaIndex


Step 1: Ingest Data

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

Step 2: Create an Index

from llama_index import GPTVectorStoreIndex

index = GPTVectorStoreIndex.from_documents(documents)



Step 3: Query the Index

response = index.query("What is this document about?")
print(response)

Optional:

Use LlamaIndex’s Data Connectors to pull in data from PDFs, SQL, APIs, etc.

Hands-On Activity—Build and Test Your Own Q&A Bot


Task:

Use Langchain or LlamaIndex to build a simple Q&A system over a small document set (e.g., Wikipedia articles, class notes, or company docs).

Suggested Steps:

1. Load your documents.

2. Index them with embeddings.

3. Connect to an LLM.

4. Run queries and observe the answers.

Challenge:

Try modifying the retriever (e.g., switch from keyword to semantic search).

Add a Streamlit UI for interactive Q&A.

Project Ideas:
1. AI Art Generator for Game Assets (Using Stable Diffusion)
Description: Build a simple app that generates game art assets (characters,
backgrounds, items) from text prompts using Stable Diffusion.

Skills: Prompt engineering, API integration or local model usage, image saving
and display.



Tools: Stable Diffusion (local install with Automatic1111 WebUI or via
DreamStudio API), Python for scripting.

Why: Great for learning image generation basics and creating usable assets
for your own game projects.

Reference: Many beginners start with Stable Diffusion v1.5 or SDXL models
and user-friendly UIs like Fooocus or InvokeAI

2. Text-to-Image Web App with DALL-E API


Description: Create a web interface where users enter a text prompt and get
AI-generated images back using OpenAI’s DALL-E API.

Skills: REST API calls, frontend/backend integration, handling image URLs.

Tools: OpenAI API, Flask/Django or Node.js backend, React or plain HTML/JS frontend.

Why: Learn how to integrate powerful generative AI models into web apps and
handle asynchronous API responses.

Reference: DALL-E API usage and prompt design tips

3. AI-Powered Image Style Transfer or Editing Tool


Description: Build a tool that applies different artistic styles or edits images
using AI models (e.g., Stable Diffusion inpainting or style transfer).

Skills: Image processing, model inference, UI for uploading and editing images.

Tools: Stable Diffusion inpainting models, Python, OpenCV, Gradio or Streamlit for UI.

Why: Hands-on experience with advanced generative AI features beyond simple text-to-image generation.

Reference: Stable Diffusion’s editing capabilities and UI options

4. Code Generation Assistant Using OpenAI Codex or GitHub Copilot API



Description: Build a simple code assistant chatbot that generates code
snippets based on user prompts or completes partial code.

Skills: NLP prompt engineering, API integration, conversational UI design.

Tools: OpenAI Codex API, Flask or FastAPI backend, React or plain JS frontend.

Why: Learn code generation with LLMs and practical API usage for developer tools.

Reference: Code generation with LLMs and practical exercises [Session 2 content].

5. AI-Powered Writing Assistant with Text Generation


Description: Develop an app that generates creative writing, summaries, or
paraphrases using GPT models.

Skills: Text generation, prompt tuning, UI/UX design.

Tools: OpenAI GPT API, Streamlit or Flask, basic frontend.

Why: Explore generative AI for NLP and content creation, useful for blogs,
marketing, or education.

6. Interactive Image Generation Playground with Multiple Models


Description: Build a playground app where users can generate images using
different models (DALL-E, Stable Diffusion, Midjourney API if available) and
compare results.

Skills: Multi-API integration, UI design, user input handling.

Tools: APIs for each model, React or Vue frontend, Node.js or Python
backend.

Why: Understand differences between generative models and provide users with flexible creative tools.

Reference: Comparison of DALL-E, Midjourney, Stable Diffusion capabilities

7. AI-Powered Meme Generator



Description: Combine image generation with text overlay to create humorous
or themed memes from prompts.

Skills: Image generation, text rendering on images, web app development.

Tools: Stable Diffusion or DALL-E API, Pillow (Python imaging), Flask or React.

Why: Fun project to practice image generation and simple graphics manipulation.

8. Personalized Avatar Creator Using Generative AI


Description: Generate custom avatars based on user descriptions or style
preferences. Include options for hair, clothes, background.

Skills: Prompt engineering, conditional generation, UI/UX design.

Tools: Stable Diffusion with control models, web frontend, backend API
integration.

Why: Practical use case for social apps or games, combining AI with user
inputs.
