

AMITY UNIVERSITY
JHARKHAND

LAB MANUAL

Course Title: Machine Learning Using Python


Course Level: UG
Course Code: CSIT737
Program: MCA
Semester: III

Prepared By: Saurav Dubey

Faculty: Dr. Umang Gupta, Assistant Professor


General instructions to students

1. Students should be regular and come prepared for lab practice.


2. In case a student misses a class, it is his/her responsibility to complete the missed experiment(s).
3. Students should bring the observation book, lab journal and lab manual.
4. Prescribed textbooks and class notes can be kept ready for reference if required.
5. They should implement the given experiment individually.
6. Once the experiment(s) have been executed, students should show the program and results to the instructor and copy
them into their observation book.
7. Questions for lab tests and exam need not necessarily be limited to the questions in the manual but could
involve some variations and / or combinations of the questions.
8. All the students must maintain silence inside the lab.
9. All students must carry their ID card when entering the lab, and the college uniform is strictly mandatory;
otherwise, students will not be permitted to sit inside the lab.
10. No food or beverage items are allowed inside the lab.
11. Keep your bags outside the lab.
12. Do not use cell phones inside the lab. (If anybody is found using cell phone inside the lab, his/her mobile
phone will be seized by the responsible authority.)
13. Shut down the system and arrange your chair before leaving the lab.
14. While using the lab, sign the register with your name, enrollment number and branch to mark your
attendance.
15. Do not plug any device without permission.
16. Do not use the internet without permission.

Note: The above-mentioned instructions can be modified based on the context of the lab.

Credit Units

L (Lecture)   T   P/S (Practical/Studio)   Total Credit Units
3             -   2                        4

Lab/Practical/Studio Assessment

                          Continuous Assessment/Internal Assessment              End Term Examination
Components (Drop down)    Lab Record   Performance   Mid Term   Viva   Attendance   Practical   Viva
Weightage (%)             20           10            5          10     5            30          20

Course Objectives:

Machine learning is the science of getting computers to act without being explicitly programmed. This
course provides a broad introduction to machine learning, data mining, and statistical pattern recognition.
It will introduce you to a wide range of machine learning tools in Python. The focus is on the concepts,
methods, and applications of general predictive modeling and unsupervised learning, and how they are
implemented in the Python language environment. The goal is to understand how to use these tools to
solve real-world problems. After this course you will be able to carry out your own experiments with
publicly available algorithms or develop your own algorithms.
Pre-requisites:

Student Learning Outcomes:

• Understand the objectives and functions of machine learning
• Apply the appropriate type of machine learning, data modelling, and data engineering
• Analyze different machine learning algorithms, such as linear regression, ridge regression,
  Lasso, Bayesian regression, and regression with basis functions
• Create theoretical models of machine learning mechanisms for predictive analysis
• Apply different machine learning approaches, such as supervised learning, unsupervised
  learning, reinforcement learning and deep learning
• Become fluent with popular machine learning techniques
• Be aware of other available machine learning modules
• Explain and adopt appropriate machine learning algorithms
Pedagogy for Course Delivery:
The course will be delivered using classroom teaching, short practical experiments and lab experiments.
Apart from this, the instructor is free to adopt any methodology to make the class interactive.


List of Experiments
WEEK 1
1. Write a Python program to create a line chart, bar chart, and histogram using matplotlib.
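
A minimal sketch of one possible solution is given below; the x/y values and the random sample used for the histogram are made-up illustrative data.

import matplotlib.pyplot as plt
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

# Line chart
plt.plot(x, y, marker='o')
plt.title('Line Chart')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

# Bar chart
plt.bar(['A', 'B', 'C', 'D', 'E'], y, color='skyblue')
plt.title('Bar Chart')
plt.show()

# Histogram of 1000 samples drawn from a standard normal distribution
data = np.random.randn(1000)
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram')
plt.show()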


WEEK 2

1. Write a Python program to create an n * k matrix to represent a linear function that maps
k-dimensional vectors to n-dimensional vectors. Use NumPy to generate a 4x3 matrix with random
integers between 1 and 10.
import numpy as np

n = 4
k = 3
matrix = np.random.randint(1, 11, size=(n, k))
print(matrix)

import numpy as np

random_number = np.random.rand()
print(random_number)  # Output: A random float in the range [0, 1)

import numpy as np

# Define dimensions
n = 4  # Dimension of the output vector
k = 3  # Dimension of the input vector

# Create a random n x k matrix
A = np.random.rand(n, k)
print("Matrix A:\n", A)

# Create a random k-dimensional input vector
x = np.random.rand(k)
print("Vector x:", x)

# Perform the linear transformation
y = np.dot(A, x)
print("Vector y:", y)

import numpy as np

# Define dimensions
n = 4  # Dimension of the output vector
k = 3  # Dimension of the input vector

# Create a random n x k matrix using randint
A = np.random.randint(1, 10, size=(n, k))  # Random integers between 1 and 9 (inclusive)
print("Matrix A:\n", A)

# Create a random k-dimensional input vector using randint
x = np.random.randint(1, 10, size=k)  # Random integers between 1 and 9 (inclusive)
print("Vector x:", x)

# Perform the linear transformation
y = np.dot(A, x)
print("Vector y:", y)

WEEK 3

1. A psychologist is observing eating behaviour in 131 children aged 3 years old from Ranchi. He
presents each child with 20 new foods which they have never eaten before and records the number
of foods they try. The results are shown in the table below. Previous research with thousands of
children from across the country has shown that we expect 40% of young children to try 0 to 5 new
foods, 30% to try 6 to 10 new foods, 20% to try 11 to 15 new foods and 10% to try 16 to 20 new
foods.
Perform a chi-square test to see whether the children from Ranchi follow the same distribution as the
national research on Indian children, at a 5% significance level with 3 degrees of freedom (critical value 7.815).
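
A minimal sketch of how the test could be carried out with SciPy is given below. The observed counts are placeholder values, since the results table itself is not reproduced here; substitute the counts actually recorded by the psychologist.

from scipy.stats import chisquare

n_children = 131
observed = [50, 40, 25, 16]                      # placeholder counts for 0-5, 6-10, 11-15, 16-20 foods
expected = [0.40 * n_children, 0.30 * n_children,
            0.20 * n_children, 0.10 * n_children]

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print("Chi-square statistic:", chi2_stat)
print("p-value:", p_value)

critical_value = 7.815                           # chi-square critical value for alpha = 0.05, df = 3
if chi2_stat > critical_value:
    print("Reject H0: the Ranchi children do not follow the national distribution.")
else:
    print("Fail to reject H0: the Ranchi distribution is consistent with the national one.")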



WEEK 4: PANDAS ASSIGNMENT

Dataset: customer_churn-1.csv

1. Start off by importing the customer_churn.csv file in the jupyter notebook and store that in churn
DataFrame.
2. From the churn DataFrame, select only 3rd, 7th, 9th, and 20th columns and all the rows and store
that in a new DataFrame named newCols.
3. From the original DataFrame, select only the rows from the 200th index till the 1000th index (inclusive).
4. Now select the rows from the 20th index till the 200th index (exclusive), and the columns from the 2nd index till
the 15th index.
5. Display the top 100 records from the original DataFrame.
6. Display the last 10 records from the DataFrame.
7. Display the last record from the DataFrame.
8. Now from the churn DataFrame, try to sort the data by the tenure column according to the
descending order.
9. Fetch all the records that are satisfying the following condition: a. Tenure>50 and the gender as
‘Female’ b. Gender as ‘Male’ and SeniorCitizen as 0 c. TechSupport as ‘Yes’ and Churn as ‘No’ d.
Contract type as ‘Month-to-month’ and Churn as ‘Yes’
10. Use a for loop to calculate the number of customers that are getting the tech support and are male
senior citizens.
11. Write a Python program to manipulate and rescale the following data using pandas and
scikit-learn:
import pandas as pd

data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
print(df)

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

# Load the data into a DataFrame
churn = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Data/customer_churn-1.csv')
churn

# Selecting the 3rd, 7th, 9th, and 20th columns (Python is zero-indexed)
newCols = churn.iloc[:, [2, 6, 8, 19]]
newCols

# Rows from index 200 to 1000 (loc includes index 1000)
subset_df = churn.loc[200:1000]
subset_df

# Rows from index 20 to 199, columns from index 2 to 14
subset_df2 = churn.iloc[20:200, 2:15]
subset_df2

# Displaying the top 100 records
print(churn.head(100))

# Displaying the last 10 records
print(churn.tail(10))

# Displaying the last record
print(churn.iloc[-1])

# Sorting by 'tenure' column in descending order
sorted_churn = churn.sort_values(by='tenure', ascending=False)
sorted_churn

# Using multiple conditions to fetch data
condition_a = churn[(churn['tenure'] > 50) & (churn['gender'] == 'Female')]
print(condition_a)

condition_b = churn[(churn['gender'] == 'Male') & (churn['SeniorCitizen'] == 0)]
print(condition_b)

condition_c = churn[(churn['TechSupport'] == 'Yes') & (churn['Churn'] == 'No')]
print(condition_c)

condition_d = churn[(churn['Contract'] == 'Month-to-month') & (churn['Churn'] == 'Yes')]
print(condition_d)

# Calculating the number of male senior citizens getting tech support (for loop version)
count = 0
for index, row in churn.iterrows():
    if row['TechSupport'] == 'Yes' and row['gender'] == 'Male' and row['SeniorCitizen'] == 1:
        count += 1
print("Number of male senior citizens getting tech support:", count)

# Vectorized equivalent
count = churn[(churn['TechSupport'] == 'Yes') & (churn['gender'] == 'Male') &
              (churn['SeniorCitizen'] == 1)].shape[0]
print("Number of male senior citizens getting tech support:", count)

from sklearn.preprocessing import MinMaxScaler

# Create DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Initialize scaler
scaler = MinMaxScaler()

# Fit and transform data
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)
WEEK 5: SCRAPY

1. Write a Python script using Scrapy to scrape the titles and prices of books from the sample book
store website http://books.toscrape.com.

Instructions:
1. Setup Scrapy Project:
o Install Scrapy if you haven't already: pip install scrapy
o Create a new Scrapy project: scrapy startproject bookscraper
o Navigate to the project directory: cd bookscraper
o Generate a new spider: scrapy genspider books books.toscrape.com
2. Define the Spider:
o Open the books_spider.py file in the spiders directory.
o Modify the spider to scrape book titles and prices.
Sample Code:
# bookscraper/spiders/books_spider.py

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('div.product_price p.price_color::text').get(),
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Run the Spider:


o Execute the spider to scrape the data and store it in a JSON file: scrapy crawl
books -o books.json

Additional Questions:
1. Explain the purpose of each part of the spider code.
2. Modify the spider to also scrape the book's availability status.
3. How would you handle potential issues such as missing data or pagination errors?

Step-by-Step Guide:
1. Setup Scrapy Project:
o Open your terminal and run the following commands:

  pip install scrapy
  scrapy startproject bookscraper
  cd bookscraper
  scrapy genspider books books.toscrape.com
2. Define the Spider:
o Open the books_spider.py file located in bookscraper/spiders/ and replace its
content with the sample code provided above.
3. Run the Spider:
o In the terminal, navigate to the root of your Scrapy project (bookscraper) and run:

  scrapy crawl books -o books.json
4. Additional Modifications:
o To scrape the book's availability status, modify the yield statement in the parse
method as follows:
yield {
    'title': book.css('h3 a::attr(title)').get(),
    'price': book.css('div.product_price p.price_color::text').get(),
    'availability': book.css('p.instock.availability::text').get().strip(),
}
5. Handling Potential Issues:
o Missing Data: Use .get(default='N/A') to provide a default value if the data is
missing.
o Pagination Errors: Implement error handling using try-except blocks around the
pagination logic.

next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    next_page = response.urljoin(next_page)
    try:
        yield scrapy.Request(next_page, callback=self.parse)
    except Exception as e:
        self.logger.error(f"Failed to follow pagination link: {e}")
Final Notes:
• Make sure to explore the Scrapy documentation to understand more advanced features and
best practices: Scrapy Documentation
• Test your spider thoroughly to ensure it handles edge cases and errors gracefully.

WEEK 6 & 7: Linear Regression, Ridge Regression and Lasso Regression


1. Implement linear regression and enhance it using Lasso and Ridge regression on the
California housing dataset
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt  # import the matplotlib library
import seaborn as sns  # import the seaborn library

# Load the California housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.DataFrame(california.target, columns=["MEDV"])
X.head()
y.head()

# Display basic information about the dataset
print(X.info())
print(X.describe())

# Check for missing values
print(X.isnull().sum())

# Visualize the distribution of features
X.hist(figsize=(12, 10))
plt.show()

# Analyze relationships between features and target variable
plt.figure(figsize=(10, 6))
sns.pairplot(pd.concat([X, y], axis=1), hue='MEDV')
plt.show()

# Box plots to identify outliers
plt.figure(figsize=(10, 6))
X.boxplot()
plt.show()

# Explore correlations between features
correlation_matrix = X.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# # Function to drop highly correlated features
# def drop_highly_correlated_features(df, threshold):
#     # Create a correlation matrix that is the absolute value of the given correlation matrix
#     corr_matrix = df.corr().abs()
#
#     # Find index/column names of highly correlated features (above the threshold)
#     upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool_))
#     to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
#
#     # Drop features
#     df_reduced = df.drop(columns=to_drop)
#     return df_reduced, to_drop
#
# # Applying the function with a 0.8 threshold
# X_reduced, dropped_features = drop_highly_correlated_features(X, 0.8)

# Drop the 'AveBedrms' column from the DataFrame
X.drop('AveBedrms', axis=1, inplace=True)

# Combine Latitude and Longitude into a single feature
X['Location'] = X['Latitude'] + X['Longitude']
X_modified = X.drop(['Latitude', 'Longitude'], axis=1)
X_modified.columns

# Split the data into training and testing sets for both original and reduced datasets
X_train_orig, X_test_orig, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_red, X_test_red, _, _ = train_test_split(X_modified, y, test_size=0.2, random_state=42)

# Function to train and evaluate a model
def train_and_evaluate(X_train, X_test, y_train, y_test):
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, r2

# Train and evaluate the model using the original features
mse_original, r2_original = train_and_evaluate(X_train_orig, X_test_orig, y_train, y_test)
print("Original Features - MSE: {:.4f}, R²: {:.4f}".format(mse_original, r2_original))

# Train and evaluate the model using the reduced features
mse_reduced, r2_reduced = train_and_evaluate(X_train_red, X_test_red, y_train, y_test)
print("Reduced Features - MSE: {:.4f}, R²: {:.4f}".format(mse_reduced, r2_reduced))

# Function to train and evaluate a Lasso model
def train_and_evaluate_lasso(X_train, X_test, y_train, y_test, alpha=0.1):
    model = Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, r2

# Train and evaluate the Lasso model using the original features
mse_original_lasso, r2_original_lasso = train_and_evaluate_lasso(X_train_orig, X_test_orig, y_train, y_test)
print("Lasso with Original Features - MSE: {:.4f}, R²: {:.4f}".format(mse_original_lasso, r2_original_lasso))

# Train and evaluate the Lasso model using the reduced features
mse_reduced_lasso, r2_reduced_lasso = train_and_evaluate_lasso(X_train_red, X_test_red, y_train, y_test)
print("Lasso with Reduced Features - MSE: {:.4f}, R²: {:.4f}".format(mse_reduced_lasso, r2_reduced_lasso))

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

# Train a Lasso regression model
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_lasso = lasso_model.predict(X_test)

# Evaluate the Lasso model
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)
print("Lasso Regression - Mean Squared Error:", mse_lasso)
print("Lasso Regression - R-squared:", r2_lasso)

# Train a Ridge regression model
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_ridge = ridge_model.predict(X_test)

# Evaluate the Ridge model
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)
print("Ridge Regression - Mean Squared Error:", mse_ridge)
print("Ridge Regression - R-squared:", r2_ridge)

# Analyze feature importance in the linear regression model
coefficients = pd.DataFrame(model.coef_[0], index=X.columns, columns=['Coefficients'])
print(coefficients)

# Visualize feature importance
plt.figure(figsize=(10, 6))
coefficients.plot(kind='bar')
plt.title('Feature Importance in Linear Regression')
plt.xlabel('Features')
plt.ylabel('Coefficients')
plt.show()

# Analyze feature importance in the Lasso regression model
lasso_coefficients = pd.DataFrame(lasso_model.coef_, index=X.columns, columns=['Coefficients'])
print(lasso_coefficients)

# Visualize feature importance
plt.figure(figsize=(10, 6))
lasso_coefficients.plot(kind='bar')
plt.title('Feature Importance in Lasso Regression')
plt.xlabel('Features')
plt.ylabel('Coefficients')
plt.show()

# Analyze feature importance in the Ridge regression model
ridge_coefficients = pd.DataFrame(ridge_model.coef_[0], index=X.columns, columns=['Coefficients'])
print(ridge_coefficients)

# Visualize feature importance
plt.figure(figsize=(10, 6))
ridge_coefficients.plot(kind='bar')
plt.title('Feature Importance in Ridge Regression')
plt.xlabel('Features')
plt.ylabel('Coefficients')
plt.show()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test)

# Evaluation
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
print(f"Linear Regression - MSE: {mse_lr:.4f}, R²: {r2_lr:.4f}")

# Lasso Regression
lasso_model = Lasso(alpha=0.1)  # Alpha is the regularization strength
lasso_model.fit(X_train, y_train)

# Predictions
y_pred_lasso = lasso_model.predict(X_test)

# Evaluation
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)
print(f"Lasso Regression - MSE: {mse_lasso:.4f}, R²: {r2_lasso:.4f}")

# Ridge Regression
ridge_model = Ridge(alpha=1.0)  # Alpha is the regularization strength
ridge_model.fit(X_train, y_train)

# Predictions
y_pred_ridge = ridge_model.predict(X_test)

# Evaluation
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)
print(f"Ridge Regression - MSE: {mse_ridge:.4f}, R²: {r2_ridge:.4f}")

from sklearn.model_selection import GridSearchCV

# Setting up the range of alpha values to test for Lasso
lasso_params = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Setting up the GridSearchCV object for Lasso
lasso_grid = GridSearchCV(Lasso(), lasso_params, cv=5, scoring='neg_mean_squared_error')
lasso_grid.fit(X_train, y_train)

# Best alpha value
print("Best alpha for Lasso: ", lasso_grid.best_params_)

y_pred_lasso_best = lasso_grid.best_estimator_.predict(X_test)
mse_lasso_best = mean_squared_error(y_test, y_pred_lasso_best)
r2_lasso_best = r2_score(y_test, y_pred_lasso_best)
print(f"Optimized Lasso Regression - MSE: {mse_lasso_best:.4f}, R²: {r2_lasso_best:.4f}")

# Setting up the range of alpha values to test for Ridge
ridge_params = {'alpha': [0.1, 1, 10, 100, 1000]}

# Setting up the GridSearchCV object for Ridge
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='neg_mean_squared_error')
ridge_grid.fit(X_train, y_train)

# Best alpha value
print("Best alpha for Ridge: ", ridge_grid.best_params_)

# Evaluate the best model found by GridSearchCV
y_pred_ridge_best = ridge_grid.best_estimator_.predict(X_test)
mse_ridge_best = mean_squared_error(y_test, y_pred_ridge_best)
r2_ridge_best = r2_score(y_test, y_pred_ridge_best)
print(f"Optimized Ridge Regression - MSE: {mse_ridge_best:.4f}, R²: {r2_ridge_best:.4f}")

WEEK 8
Python Example: MLE for Bivariate Gaussian Distribution

We'll simulate a dataset representing two features, which could correspond to the sizes and weights in the
previous example, and perform MLE to estimate the parameters of the bivariate Gaussian distribution.

Step-by-step Explanation:

1. Generate Data: Create synthetic data for two features.
2. Compute Mean: Calculate the sample mean of each feature.
3. Compute Covariance Matrix: Manually calculate the covariance matrix.
4. MLE Estimation: Use the computed mean and covariance as the MLE estimates.
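
For a bivariate Gaussian these MLE estimates have a simple closed form: the sample mean mu_hat = (1/n) * sum_i x_i and the (divide-by-n) sample covariance Sigma_hat = (1/n) * sum_i (x_i - mu_hat)(x_i - mu_hat)^T, which is exactly what the calculate_mean and calculate_covariance functions below compute.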
import numpy as np

# Function to calculate the mean of each feature
def calculate_mean(data):
    n_samples = data.shape[0]
    sum_data = np.sum(data, axis=0)
    mean = sum_data / n_samples
    return mean

# Function to calculate the covariance matrix
def calculate_covariance(data, mean):
    n_samples = data.shape[0]
    deviations = data - mean
    covariance_matrix = np.dot(deviations.T, deviations) / n_samples
    return covariance_matrix

# Generate synthetic data
np.random.seed(0)
data = np.random.multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 1]], size=100)

# Step 1: Compute the mean
mean_est = calculate_mean(data)

# Step 2: Compute the covariance matrix
covariance_est = calculate_covariance(data, mean_est)

# Print the results
print("Estimated Mean:\n", mean_est)
print("Estimated Covariance Matrix:\n", covariance_est)

Creating graphs to visualize the Bivariate Gaussian Distribution

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Mean (mu) and Covariance (Sigma) of the distribution
mu = np.array([0, 0])                    # Example mean for two dimensions (x and y)
Sigma = np.array([[1, 0.5], [0.5, 1]])   # Example covariance matrix

# Generate random data
data = np.random.multivariate_normal(mu, Sigma, size=500)

# Extracting individual components
x, y = data.T

# Setting up the plot with matplotlib
plt.figure(figsize=(8, 6))

# Using seaborn to create a scatter plot
sns.scatterplot(x=x, y=y, color='blue')

# Adding titles and labels
plt.title('Bivariate Gaussian Distribution')
plt.xlabel('X')
plt.ylabel('Y')

# Setting up the plot
plt.figure(figsize=(8, 6))

# Creating a density plot with contour lines
sns.kdeplot(x=x, y=y, cmap="Reds", fill=True, thresh=0, levels=100)

# Adding scatter plot to show actual data points
sns.scatterplot(x=x, y=y, color='blue', s=50, edgecolor='w', linewidth=0.5)

# Titles and labels
plt.title('Bivariate Gaussian Distribution with Density Contours')
plt.xlabel('X')
plt.ylabel('Y')

# Show the plot
plt.grid(True)
plt.show()

Explanation of the Code

mu and Sigma: These are the parameters for the mean and covariance of the distribution. You can modify
these to see how they affect the distribution’s shape and orientation.

np.random.multivariate_normal: Generates random data points based on the specified mean and
covariance.

sns.scatterplot: Plots individual data points on a scatter plot.

sns.kdeplot: Adds a Kernel Density Estimate (KDE) plot that shows the distribution's density with contour
lines.

WEEK 9
Write a Python program to implement correlation, covariance, Mahalanobis distance, Minkowski distance,
other distance metrics, the Jaccard coefficient, handling of missing values, feature transformations, and the
geometrical interpretation of Euclidean distance.

Correlation and Covariance

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from scipy.spatial.distance import mahalanobis, minkowski, euclidean, cityblock, cosine
from scipy.spatial.distance import jaccard
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Correlation matrix
correlation_matrix = X.corr()
print("Correlation Matrix:\n", correlation_matrix)

# Covariance matrix
covariance_matrix = X.cov()
print("Covariance Matrix:\n", covariance_matrix)

Mahalanobis Distance

# Calculate Mahalanobis distance between the first and second sample
mean = np.mean(X, axis=0)
cov_matrix = np.cov(X.T)
inv_cov_matrix = np.linalg.inv(cov_matrix)
mahal_dist = mahalanobis(X.iloc[0], X.iloc[1], inv_cov_matrix)
print(f"Mahalanobis Distance between the first and second sample: {mahal_dist}")

Minkowski Distance

# Calculate Minkowski distance (p=3) between the first and second sample
minkowski_dist = minkowski(X.iloc[0], X.iloc[1], p=3)
print(f"Minkowski Distance (p=3) between the first and second sample: {minkowski_dist}")

Distance Metrics (Euclidean, Manhattan, Cosine)

# Calculate Euclidean, Manhattan, and Cosine distances between the first and second sample
euclidean_dist = euclidean(X.iloc[0], X.iloc[1])
manhattan_dist = cityblock(X.iloc[0], X.iloc[1])
cosine_dist = cosine(X.iloc[0], X.iloc[1])
print(f"Euclidean Distance between the first and second sample: {euclidean_dist}")
print(f"Manhattan Distance between the first and second sample: {manhattan_dist}")
print(f"Cosine Distance between the first and second sample: {cosine_dist}")

Jaccard Coefficient

# The Jaccard coefficient is usually used for binary data. We'll create a simple example.
# Example binary data
binary_data1 = np.array([0, 1, 1, 0, 1])
binary_data2 = np.array([1, 1, 0, 0, 1])

# Calculate Jaccard coefficient
jaccard_coeff = jaccard(binary_data1, binary_data2)
print(f"Jaccard Coefficient between binary data samples: {jaccard_coeff}")

Handling Missing Values

# Introduce missing values into the dataset
X_missing = X.copy()
X_missing.iloc[0, 0] = np.nan

# Handling missing values by imputing the mean
X_missing.fillna(X_missing.mean(), inplace=True)
print("Data after handling missing values:\n", X_missing.head())

Feature Transformations

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization (Min-Max scaling)
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# Standardization
standardizer = StandardScaler()
X_standardized = standardizer.fit_transform(X)

print("First 5 samples after Min-Max Scaling:\n", X_normalized[:5])
print("First 5 samples after Standardization:\n", X_standardized[:5])

Geometrical Interpretation of Euclidean Distance

import matplotlib.pyplot as plt

# Select the first two features for 2D visualization
X_2d = X.iloc[:, :2]

# Plot the data points
plt.scatter(X_2d.iloc[:, 0], X_2d.iloc[:, 1], c=y, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Plot the Euclidean distance between the first and second sample
point1 = X_2d.iloc[0]
point2 = X_2d.iloc[1]

plt.plot([point1[0], point2[0]], [point1[1], point2[1]], 'r-', linewidth=2)
plt.scatter(point1[0], point1[1], c='red', edgecolor='k', s=100)
plt.scatter(point2[0], point2[1], c='blue', edgecolor='k', s=100)
plt.title('Geometrical Interpretation of Euclidean Distance')
plt.show()


Extra Practice Questions


1. The objective of the proposed work is to predict the insurance charges of a person and, using the health
insurance policy and medical details, identify whether patients have any health issues or not. The level of
treatment in the crisis department varies drastically depending on the type of health insurance a person
has; based on this, we predict the insurance charges of a person.
(new_insurance_data.csv)
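
A minimal sketch of one possible approach is given below. The target column name 'charges' is an assumption (the actual columns of new_insurance_data.csv may differ), so adjust the names to match the real file.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv('new_insurance_data.csv')

# Separate the target from the features and one-hot encode categorical columns
y = df['charges']                                          # assumed target column
X = pd.get_dummies(df.drop('charges', axis=1), drop_first=True)

# Simple mean imputation for any missing values
X = X.fillna(X.mean(numeric_only=True))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))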


Ridge Regression Overview

Ridge Regression, also known as Tikhonov regularization, is a technique used to analyze multiple
regression data that suffer from multicollinearity. By adding a degree of bias to the regression
estimates, ridge regression reduces the standard errors.
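
Concretely, writing w for the coefficient vector and alpha for the regularization strength (the notation scikit-learn's Ridge uses), ridge regression minimizes a penalized least-squares objective of the form:

    minimize  ||y - Xw||² + alpha * ||w||²

Larger alpha values shrink the coefficients more strongly toward zero.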

Steps to Implement Ridge Regression


1. Load the Dataset
2. Preprocess the Data
3. Split the Data into Training and Testing Sets
4. Train the Ridge Regression Model
5. Evaluate the Model

Let's go through these steps in detail.

Step 1: Load the Dataset

First, we'll load the California Housing dataset. This dataset is available in the sklearn.datasets
module.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Load the California Housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target)

print(X.head())
print(y.head())

Step 2: Preprocess the Data


Before training the model, it's important to standardize the features. Standardization can improve
the performance of the model, especially for regularized models like Ridge Regression.

# Standardize the features


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 3: Split the Data into Training and Testing Sets

Next, we split the data into training and testing sets.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Step 4: Train the Ridge Regression Model


Now, we'll train the Ridge Regression model. We'll also tune the regularization parameter alpha
to find the best value.

# Train the Ridge Regression model


ridge = Ridge(alpha=1.0)

# You can change the alpha value to tune the regularization strength
ridge.fit(X_train, y_train)

Step 5: Evaluate the Model


Finally, we'll evaluate the model's performance using Mean Squared Error (MSE) and R-squared
(R²) metrics.

# Make predictions
y_pred_train = ridge.predict(X_train)
y_pred_test = ridge.predict(X_test)

# Evaluate the model
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

print(f"Training MSE: {mse_train}")
print(f"Testing MSE: {mse_test}")
print(f"Training R²: {r2_train}")
print(f"Testing R²: {r2_test}")

Lasso Regression Overview

Lasso Regression (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that
uses L1 regularization. The L1 regularization adds a penalty equal to the absolute value of the magnitude
of coefficients. This type of regression can shrink some coefficients to zero, effectively performing
variable selection.
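
In scikit-learn's formulation, Lasso minimizes:

    (1 / (2 * n_samples)) * ||y - Xw||² + alpha * ||w||₁

where the L1 term ||w||₁ (the sum of absolute coefficient values) is what drives some coefficients exactly to zero.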

Steps to Implement Lasso Regression


1. Load the Dataset
2. Preprocess the Data
3. Split the Data into Training and Testing Sets
4. Train the Lasso Regression Model
5. Evaluate the Model

Let's go through these steps in detail.


Step 1: Load the Dataset
First, we'll load the Diabetes dataset. This dataset is available in the sklearn.datasets module.
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

# Load the Diabetes dataset
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target)

print(X.head())
print(y.head())

Step 2: Preprocess the Data


Before training the model, it's important to standardize the features. Standardization can improve
the performance of the model, especially for regularized models like Lasso Regression.

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 3: Split the Data into Training and Testing Sets

Next, we split the data into training and testing sets.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Step 4: Train the Lasso Regression Model


Now, we'll train the Lasso Regression model. We'll also tune the regularization parameter alpha
to find the best value.

# Train the Lasso Regression model


lasso = Lasso(alpha=1.0)
# You can change the alpha value to tune the regularization strength
lasso.fit(X_train, y_train)

Step 5: Evaluate the Model


Finally, we'll evaluate the model's performance using Mean Squared Error (MSE) and R-squared
(R²) metrics.

# Make predictions
y_pred_train = lasso.predict(X_train)
y_pred_test = lasso.predict(X_test)

# Evaluate the model
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

print(f"Training MSE: {mse_train}")
print(f"Testing MSE: {mse_test}")
print(f"Training R²: {r2_train}")
print(f"Testing R²: {r2_test}")


LOGISTIC REGRESSION

Problem Statement: You work in XYZ Company. The company officials have collected some data on
health parameters based on diabetes and wish for you to create a model from it.

Dataset: diabetes.csv

Tasks to Be Performed:
• Load the dataset using pandas
• Extract the data from the Outcome column into a variable named Y
• Extract the data from every column except the Outcome column into a variable named X
• Divide the dataset into two parts for training and testing in an 80% and 20% proportion
• Create and train a Logistic Regression model on the training set
• Make predictions on the testing set using the trained model
• Check the performance by calculating the confusion matrix and accuracy score of the model (a minimal sketch follows below)
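
A minimal sketch of one possible solution is given below. It assumes diabetes.csv has an 'Outcome' target column, as in the well-known Pima Indians diabetes dataset; adjust the column name if the file differs.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Load the dataset using pandas
df = pd.read_csv('diabetes.csv')

# Extract the target and the features
Y = df['Outcome']                      # assumed target column name
X = df.drop('Outcome', axis=1)

# Divide the dataset into training (80%) and testing (20%) parts
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Create and train the Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

# Make predictions on the testing set
Y_pred = model.predict(X_test)

# Check the performance
print("Confusion Matrix:\n", confusion_matrix(Y_test, Y_pred))
print("Accuracy Score:", accuracy_score(Y_test, Y_pred))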


DECISION TREE OVERVIEW


Problem Statement

We aim to classify whether a tumor is malignant or benign based on features such as mean radius, mean
texture, mean perimeter, mean area, and mean smoothness. We'll use a Decision Tree classifier to model
this relationship and evaluate its performance.

Decision Tree Classifier Overview

A Decision Tree classifier splits the data at each node based on the feature that provides the best split
according to a certain criterion (e.g., Gini impurity or information gain). The process continues
recursively, creating a tree structure where each leaf node represents a class label.
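
As a small illustration of the Gini criterion mentioned above, the snippet below computes the impurity of a set of class labels; it is an illustrative helper, not part of scikit-learn's API.

import numpy as np

def gini_impurity(labels):
    # Gini impurity of a set of class labels: 1 - sum_k p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 binary split has the maximum impurity of 0.5
print(gini_impurity([0, 0, 0, 0]))   # 0.0
print(gini_impurity([0, 1, 0, 1]))   # 0.5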

Step 1: Load the Dataset

First, we'll load the Breast Cancer dataset. This dataset is available in the sklearn.datasets module.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Series(breast_cancer.target)

print(X.head())
print(y.head())

Step 2: Preprocess the Data

We will standardize the features to ensure they all have the same scale.
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 3: Split the Data into Training and Testing Sets

Next, we split the data into training and testing sets.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Step 4: Train the Decision Tree Classifier

Now, we'll train the Decision Tree classifier.
# Train the Decision Tree Classifier
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)

Step 5: Evaluate the Model


We'll evaluate the model's performance using accuracy, classification report, and confusion matrix.
# Make predictions
y_pred_train = decision_tree.predict(X_train)
y_pred_test = decision_tree.predict(X_test)

# Evaluate the model


train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)
classification_rep = classification_report(y_test, y_pred_test)
conf_matrix = confusion_matrix(y_test, y_pred_test)

print(f"Training Accuracy: {train_accuracy}")


print(f"Testing Accuracy: {test_accuracy}")
print("Classification Report:\n", classification_rep)
print("Confusion Matrix:\n", conf_matrix)

Step 6: Visualize the Decision Tree


Finally, we'll visualize the Decision Tree to understand the splits and decision rules.
# Visualize the Decision Tree
plt.figure(figsize=(20, 10))
plot_tree(decision_tree, feature_names=breast_cancer.feature_names,
class_names=breast_cancer.target_names, filled=True)
plt.title("Decision Tree for Breast Cancer Dataset")
plt.show()

KNN (K-NEAREST NEIGHBOUR)

Problem Statement

We aim to classify iris flowers into three species (setosa, versicolor, and virginica) based on four features:
sepal length, sepal width, petal length, and petal width. We'll use a K-Nearest Neighbors classifier to
model this relationship and evaluate its performance.
K-Nearest Neighbors Classifier Overview
The K-Nearest Neighbors algorithm classifies a data point by looking at the 'k' nearest data points in the
training set and assigning the class that is most common among them. It is a type of lazy learning where
the model is built only when a query is made.
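
To make the voting idea concrete, here is a from-scratch sketch for classifying a single query point; it is an illustration only, and the experiment below uses scikit-learn's KNeighborsClassifier.

import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, query, k=5):
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict_one(X_train, y_train, np.array([0.5, 0.5]), k=3))  # -> 0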

Step 1: Load the Dataset


First, we'll load the Iris dataset. This dataset is available in the sklearn.datasets module.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

print(X.head())
print(y.head())

Step 2: Preprocess the Data


We will standardize the features to ensure they all have the same scale.
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 3: Split the Data into Training and Testing Sets

Next, we split the data into training and testing sets.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Step 4: Train the KNN Classifier


Now, we'll train the KNN classifier. We'll choose k=5 for this example.
# Train the KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

Step 5: Evaluate the Model


We'll evaluate the model's performance using accuracy, classification report, and confusion matrix.
# Make predictions
y_pred_train = knn.predict(X_train)
y_pred_test = knn.predict(X_test)

# Evaluate the model
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)
classification_rep = classification_report(y_test, y_pred_test)
conf_matrix = confusion_matrix(y_test, y_pred_test)

print(f"Training Accuracy: {train_accuracy}")


print(f"Testing Accuracy: {test_accuracy}")
print("Classification Report:\n", classification_rep)
print("Confusion Matrix:\n", conf_matrix)

SVM (SUPPORT VECTOR MACHINE) CLASSIFIER

Problem Statement
We aim to classify different types of wine into three classes based on 13 chemical attributes such as
alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, and others. We'll use an
SVM classifier to model this relationship and evaluate its performance.
Support Vector Machine Overview
Support Vector Machines find the hyperplane that maximizes the margin between different classes. For
non-linearly separable data, SVM uses kernel tricks to transform the data into a higher-dimensional space
where a linear separator can be found. Common kernels include linear, polynomial, and radial basis
function (RBF).
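
As a small illustration, the RBF kernel used below can be written explicitly as K(x, z) = exp(-gamma * ||x - z||²). The snippet evaluates it for two toy vectors; gamma = 0.5 is just an example value, while SVC defaults to gamma='scale'.

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # Squared Euclidean distance scaled by gamma, then exponentiated
    return np.exp(-gamma * np.sum((x - z) ** 2))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([1.5, 1.0])))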

Step 1: Load the Dataset


First, we'll load the Wine dataset. This dataset is available in the sklearn.datasets module.

import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Wine dataset
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)

print(X.head())
print(y.head())

Step 2: Preprocess the Data
We will standardize the features to ensure they all have the same scale.

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 3: Split the Data into Training and Testing Sets

Next, we split the data into training and testing sets.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Step 4: Train the SVM Classifier


Now, we'll train the SVM classifier using an RBF kernel.
# Train the SVM Classifier
svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train, y_train)

Step 5: Evaluate the Model


We'll evaluate the model's performance using accuracy, classification report, and confusion matrix.
# Make predictions
y_pred_train = svm.predict(X_train)
y_pred_test = svm.predict(X_test)

# Evaluate the model
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)
classification_rep = classification_report(y_test, y_pred_test)
conf_matrix = confusion_matrix(y_test, y_pred_test)

print(f"Training Accuracy: {train_accuracy}")


print(f"Testing Accuracy: {test_accuracy}")
print("Classification Report:\n", classification_rep)
print("Confusion Matrix:\n", conf_matrix)

KMEANS CLUSTERING
Problem Statement

We aim to cluster iris flowers into three groups based on four features: sepal length, sepal width, petal
length, and petal width. We'll use the K-Means clustering algorithm to achieve this and evaluate the
results.

K-Means Clustering Overview

K-Means clustering aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean. The algorithm iteratively updates the centroids of the clusters and
assigns data points to the nearest cluster until convergence.
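
As a rough illustration of the two alternating steps (assignment and centroid update), here is a minimal single-iteration sketch; it assumes every cluster keeps at least one point, and the experiment below relies on scikit-learn's KMeans rather than this helper.

import numpy as np

def kmeans_step(X, centroids):
    # Assignment step: give each point the label of its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(distances, axis=1)
    # Update step: move each centroid to the mean of the points assigned to it
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids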
Step 1: Load the Dataset
First, we'll load the Iris dataset. This dataset is available in the sklearn.datasets module.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

print(X.head())
print(y.head())

Step 2: Preprocess the Data


We will standardize the features to ensure they all have the same scale.
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 3: Apply K-Means Clustering


We'll apply the K-Means algorithm with k=3 since we know there are three types of iris flowers.
# Apply K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_scaled)

# Get cluster labels
labels = kmeans.labels_
print(labels)

Step 4: Evaluate the Model


We'll evaluate the model using the silhouette score and visualize the clusters.
# Calculate the silhouette score
silhouette_avg = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {silhouette_avg}")

# Plot the clusters
plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=labels,
                palette='viridis', s=100, alpha=0.6, edgecolor='w')
plt.title('K-Means Clustering of Iris Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()


# Confusion matrix to compare with true labels


conf_matrix = confusion_matrix(y, labels)
print("Confusion Matrix:\n", conf_matrix)

Multi-layer Perceptron and Backpropagation


Problem Statement

We aim to classify handwritten digits (0-9) based on pixel values of 8x8 images. We'll use an MLP
classifier to model this relationship and evaluate its performance.

Multi-layer Perceptron Overview

A Multi-layer Perceptron consists of an input layer, one or more hidden layers, and an output layer. Each
neuron in a layer is connected to all neurons in the next layer. During training, the model adjusts the
weights using backpropagation, which involves calculating the gradient of the loss function with respect
to each weight and updating the weights accordingly.
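
In symbols, for each weight w the update performed after computing the gradient of the loss L has the form w ← w − η · ∂L/∂w, where η is the learning rate; the MLPClassifier used below applies the Adam variant of this gradient-descent update by default.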

Step 1: Load the Dataset

First, we'll load the digits dataset. This dataset is available in the sklearn.datasets module.
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Digits dataset
digits = load_digits()
X = pd.DataFrame(digits.data)
y = pd.Series(digits.target)

print(X.head())
print(y.head())

Step 2: Preprocess the Data

We will standardize the features to ensure they all have the same scale.
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 3: Split the Data into Training and Testing Sets

Next, we split the data into training and testing sets.


# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.2, random_state=42)

Step 4: Train the MLP Classifier


Now, we'll train the MLP classifier. We'll use one hidden layer with 100 neurons for this example.
# Train the MLP Classifier
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
mlp.fit(X_train, y_train)

Step 5: Evaluate the Model

We'll evaluate the model's performance using accuracy, classification report, and confusion matrix.

# Make predictions
y_pred_train = mlp.predict(X_train)
y_pred_test = mlp.predict(X_test)

# Evaluate the model
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)
classification_rep = classification_report(y_test, y_pred_test)
conf_matrix = confusion_matrix(y_test, y_pred_test)

print(f"Training Accuracy: {train_accuracy}")


print(f"Testing Accuracy: {test_accuracy}")
print("Classification Report:\n", classification_rep)
print("Confusion Matrix:\n", conf_matrix)

********

Textbooks:

• The Second Machine Age: Work, Progress and Prosperity in a Time of Brilliant Technologies by Erik
Brynjolfsson and Andrew McAfee. ISBN-10: 0393239357
• Getting started with Internet of Things, by Cuno Pfister, Shroff; First edition (17 May 2011), ISBN-10:
9350234130
• Big Data and The Internet of Things, by Robert Stackowiak, Art Licht, Springer Nature; 1st ed. edition
(12 May 2015), ISBN-10: 1484209877
