AMITY UNIVERSITY JHARKHAND
LAB MANUAL
Note: The above-mentioned instructions can be modified based on the context of the lab.

Credit Units:
L (Lecture) | T (Tutorial) | P/S (Practical/Studio) | Total Credit Units
3           | -            | 2                      | 4

Evaluation Components:
Components:    Lab Performance | Mid Term | Viva | Attendance | Record | Practical | Viva
Weightage (%): 20              | 10       | 5    | 10         | 5      | 30        | 20
Machine learning is the science of getting computers to act without being explicitly programmed. This
course provides a broad introduction to machine learning, data mining, and statistical pattern recognition.
It will introduce you to a wide range of machine learning tools in Python. The focus is on the concepts,
methods, and applications of general predictive modeling and unsupervised learning, and on how they are
implemented in the Python language environment. The goal is to understand how to use these tools to
solve real-world problems. After this course you will be able to carry out your own experiments with
publicly available algorithms or develop your own algorithms.
Pre-requisites:
List of Experiments
WEEK 1
1. Write a Python program to create a line chart, bar chart, and histogram using matplotlib.
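As a minimal sketch of one way to produce all three charts in a single figure (the data values below are made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 6)
y = np.array([2, 3, 5, 7, 11])

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].plot(x, y, marker='o')                 # line chart
axes[0].set_title('Line Chart')
axes[1].bar(x, y)                              # bar chart
axes[1].set_title('Bar Chart')
axes[2].hist(np.random.randn(1000), bins=30)   # histogram of random data
axes[2].set_title('Histogram')
plt.tight_layout()
plt.show()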
WEEK 2
1. Write a Python program to create an n x k matrix to represent a linear function that maps
k-dimensional vectors to n-dimensional vectors. Use NumPy to generate a 4x3 matrix with random
integers between 1 and 10.
import numpy as np

n = 4
k = 3
matrix = np.random.randint(1, 11, size=(n, k))
print(matrix)
import numpy as np

random_number = np.random.rand()
print(random_number)  # Output: a random float in the range [0, 1)
WEEK 3
1. A psychologist is observing eating behaviour in 131 children aged 3 years old from Ranchi. He
presents each child 20 new foods which they have never eaten before. He then records the number
of foods they try. The results are shown in the table below. Previous research with thousands of
children from across the country has shown that we expect 40 % of young children to try 0 to 5 new
foods, 30% to try 6 to 10 new foods, 20% to try 11 to 15 new foods and 10 % to try 16 to 20 new
foods.
Perform a chi square test to see if the children from Ranchi follow the same distribution that the
research on Indian children for significance level 5% 3 degrees of freedom (7.815).
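A minimal sketch of the test using scipy.stats.chisquare. The observed counts below are placeholders, since the original table is not reproduced here; substitute the recorded values for the 131 children.

import numpy as np
from scipy.stats import chisquare

observed = np.array([40, 45, 30, 16])                  # hypothetical counts per category (replace with table data)
expected = 131 * np.array([0.40, 0.30, 0.20, 0.10])    # expected counts under the national distribution

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"Chi-square statistic: {stat:.3f}, p-value: {p_value:.4f}")
# Reject H0 at the 5% level if the statistic exceeds 7.815 (critical value, df = 3)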
Dataset: customer_churn-1.csv
1. Start off by importing the customer_churn.csv file in the jupyter notebook and store that in churn
DataFrame.
2. From the churn DataFrame, select only the 3rd, 7th, 9th, and 20th columns (all rows) and store
the result in a new DataFrame named newCols.
3. From the original DataFrame, select only the rows from the 200th index till the 1000th
index (inclusive).
4. Now select the rows from the 20th index till the 200th index (exclusive), and the columns from
the 2nd index till the 15th index.
5. Display the top 100 records from the original DataFrame.
6. Display the last 10 records from the DataFrame.
7. Display the last record from the DataFrame.
8. Now from the churn DataFrame, sort the data by the tenure column in descending order.
9. Fetch all the records that satisfy the following conditions:
   a. Tenure > 50 and gender as 'Female'
   b. Gender as 'Male' and SeniorCitizen as 0
   c. TechSupport as 'Yes' and Churn as 'No'
   d. Contract type as 'Month-to-month' and Churn as 'Yes'
10. Use a for loop to calculate the number of customers who are getting tech support and are male
senior citizens.
11. Write a Python program to manipulate and rescale the following data using pandas and
scikit-learn (a sketch covering several of these tasks follows the code below):
import pandas as pd

data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
print(df)
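A hedged sketch of some of the tasks above. The column names (tenure, gender, SeniorCitizen, TechSupport) follow the problem statement; verify them against the actual CSV before running.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

churn = pd.read_csv('customer_churn.csv')                      # task 1
newCols = churn.iloc[:, [2, 6, 8, 19]]                         # task 2: 3rd, 7th, 9th, 20th columns (0-based positions)
rows_slice = churn.iloc[200:1001]                              # task 3: rows 200..1000 inclusive
block = churn.iloc[20:200, 2:15]                               # task 4
sorted_churn = churn.sort_values('tenure', ascending=False)    # task 8
cond_a = churn[(churn['tenure'] > 50) & (churn['gender'] == 'Female')]  # task 9a

count = 0
for _, row in churn.iterrows():                                # task 10
    if row['TechSupport'] == 'Yes' and row['gender'] == 'Male' and row['SeniorCitizen'] == 1:
        count += 1
print("Male senior citizens with tech support:", count)

# Task 11: rescale the toy DataFrame with scikit-learn
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df_scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(df_scaled)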
1. Write a Python script using Scrapy to scrape the titles and prices of books from the sample book
store website http://books.toscrape.com.
Instructions:
1. Setup Scrapy Project:
   o Install Scrapy if you haven't already: pip install scrapy
   o Create a new Scrapy project: scrapy startproject bookscraper
   o Navigate to the project directory: cd bookscraper
   o Generate a new spider: scrapy genspider books books.toscrape.com
2. Define the Spider:
o Open the books_spider.py file in the spiders directory.
o Modify the spider to scrape book titles and prices.
Sample Code:
# bookscraper/spiders/books_spider.py
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('div.product_price p.price_color::text').get(),
            }
Additional Questions:
1. Explain the purpose of each part of the spider code.
2. Modify the spider to also scrape the book's availability status.
3. How would you handle potential issues such as missing data or pagination errors?
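A hedged sketch addressing questions 2 and 3. The availability selector and next-page link below match books.toscrape.com's markup at the time of writing, but you should verify them in your browser's inspector.

    def parse(self, response):
        for book in response.css('article.product_pod'):
            avail = book.css('p.instock.availability::text').getall()
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                # .get(default=...) guards against missing fields (question 3)
                'price': book.css('div.product_price p.price_color::text').get(default='N/A'),
                # availability lives in <p class="instock availability"> (question 2)
                'availability': avail[-1].strip() if avail else None,
            }
        # Pagination (question 3): .get() returns None on the last page,
        # so the spider stops cleanly instead of raising an error
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)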
Step-by-Step Guide:
1. Setup Scrapy Project:
   o Open your terminal and run: pip install scrapy
# Drop features (tail of a helper that returns the reduced DataFrame
# and the list of dropped columns)
# df_reduced = df.drop(columns=to_drop)
# return df_reduced, to_drop

# Train and evaluate the model using the original features
mse_original, r2_original = train_and_evaluate(X_train_orig, X_test_orig, y_train, y_test)
print("Original Features - MSE: {:.4f}, R²: {:.4f}".format(mse_original, r2_original))

# Train and evaluate the model using the reduced features
mse_reduced, r2_reduced = train_and_evaluate(X_train_red, X_test_red, y_train, y_test)
print("Reduced Features - MSE: {:.4f}, R²: {:.4f}".format(mse_reduced, r2_reduced))

def train_and_evaluate_lasso(X_train, X_test, y_train, y_test, alpha=0.1):
    model = Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, r2

# Train and evaluate the Lasso model using the original features
mse_original_lasso, r2_original_lasso = train_and_evaluate_lasso(X_train_orig, X_test_orig, y_train, y_test)
print("Lasso with Original Features - MSE: {:.4f}, R²: {:.4f}".format(mse_original_lasso, r2_original_lasso))

# Train and evaluate the Lasso model using the reduced features
mse_reduced_lasso, r2_reduced_lasso = train_and_evaluate_lasso(X_train_red, X_test_red, y_train, y_test)
print("Lasso with Reduced Features - MSE: {:.4f}, R²: {:.4f}".format(mse_reduced_lasso, r2_reduced_lasso))

# Predictions with the fitted Lasso model
y_pred_lasso = lasso_model.predict(X_test)

# Predictions with the fitted Ridge model
y_pred_ridge = ridge_model.predict(X_test)
WEEK 8
Python Example: MLE for Bivariate Gaussian Distribution
We'll simulate a dataset representing two features, which could correspond to the sizes and weights in the
previous example, and perform MLE to estimate the parameters of the bivariate Gaussian distribution.
Step-by-step Explanation:
1. Generate Data: create synthetic data for two features.
2. Compute Mean: calculate the sample mean of each feature.
3. Compute Covariance Matrix: manually calculate the covariance matrix.
4. MLE Estimation: use the computed mean and covariance as the MLE estimates.
def compute_covariance(data, mean):
    n_samples = data.shape[0]
    deviations = data - mean
    covariance_matrix = np.dot(deviations.T, deviations) / n_samples
    return covariance_matrix
mu and Sigma: These are the parameters for the mean and covariance of the distribution. You can modify
these to see how they affect the distribution’s shape and orientation.
np.random.multivariate_normal: Generates random data points based on the specified mean and
covariance.
sns.kdeplot: Adds a Kernel Density Estimate (KDE) plot that shows the distribution's density with contour
lines.
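A minimal end-to-end sketch of the MLE steps above. The mu and Sigma values are made-up "true" parameters used only to simulate data; note the MLE covariance divides by n rather than n-1.

import numpy as np

mu = np.array([5.0, 2.0])                    # true mean (e.g., sizes and weights)
Sigma = np.array([[1.0, 0.6], [0.6, 0.5]])   # true covariance
data = np.random.multivariate_normal(mu, Sigma, size=1000)

mean_mle = data.mean(axis=0)                 # MLE of the mean
deviations = data - mean_mle
cov_mle = deviations.T @ deviations / data.shape[0]  # MLE covariance (divides by n)

print("Estimated mean:", mean_mle)
print("Estimated covariance:\n", cov_mle)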
Week 9
Write Python code to implement correlation, covariance, Mahalanobis distance, Minkowski distance,
general distance metrics, the Jaccard coefficient, handling of missing values, feature transformations,
and the geometrical interpretation of Euclidean distance.
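The sections below cover most of these topics; missing-value handling has no code section of its own, so here is a minimal pandas sketch (the toy DataFrame is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
print(df.isna().sum())            # count missing values per column
df_filled = df.fillna(df.mean())  # impute with the column mean
df_dropped = df.dropna()          # or drop incomplete rows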
Mahalanobis Distance
# Calculate the Mahalanobis distance between the first and second sample
mean = np.mean(X, axis=0)
cov_matrix = np.cov(X.T)
inv_cov_matrix = np.linalg.inv(cov_matrix)
mahal_dist = mahalanobis(X.iloc[0], X.iloc[1], inv_cov_matrix)
print(f"Mahalanobis Distance between the first and second sample: {mahal_dist}")
Minkowski Distance
# Calculate the Minkowski distance (p=3) between the first and second sample
minkowski_dist = minkowski(X.iloc[0], X.iloc[1], p=3)
print(f"Minkowski Distance (p=3) between the first and second sample: {minkowski_dist}")
Jaccard Coefficient
# The Jaccard coefficient is usually used for binary data; we'll create a simple example.
# Example binary data
binary_data1 = np.array([0, 1, 1, 0, 1])
binary_data2 = np.array([1, 1, 0, 0, 1])

# Calculate the Jaccard coefficient
# (note: scipy's jaccard() actually returns the Jaccard dissimilarity, i.e. 1 - coefficient)
jaccard_coeff = jaccard(binary_data1, binary_data2)
print(f"Jaccard Coefficient between binary data samples: {jaccard_coeff}")
Feature Transformations
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization (Min-Max scaling)
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# Standardization
standardizer = StandardScaler()
X_standardized = standardizer.fit_transform(X)

print("First 5 samples after Min-Max Scaling:\n", X_normalized[:5])
print("First 5 samples after Standardization:\n", X_standardized[:5])
# Plot the Euclidean distance between the first and second sample
point1 = X_2d.iloc[0]
point2 = X_2d.iloc[1]
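A sketch completing the fragment above. X_2d is assumed to be a two-column slice of X (for example, X.iloc[:, :2]); the plot shows the distance as the straight-line segment between the two points.

import numpy as np
import matplotlib.pyplot as plt

euclid = np.linalg.norm(point1 - point2)   # geometric (straight-line) distance
plt.scatter(X_2d.iloc[:, 0], X_2d.iloc[:, 1], alpha=0.3)
plt.plot([point1.iloc[0], point2.iloc[0]], [point1.iloc[1], point2.iloc[1]], 'r-')
plt.title(f"Euclidean distance = {euclid:.3f}")
plt.show()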
Ridge Regression, also known as Tikhonov regularization, is a technique used to analyze multiple
regression data that suffer from multicollinearity. By adding a degree of bias to the regression
estimates, ridge regression reduces the standard errors.
print(X.head())
print(y.head())

Step 3: Split the Data into Training and Testing Sets
Next, we split the data into training and testing sets.

# You can change the alpha value to tune the regularization strength
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Make predictions
y_pred_train = ridge.predict(X_train)
y_pred_test = ridge.predict(X_test)
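Since the dataset for this step is not shown here, a self-contained sketch on synthetic data with correlated features (the situation where ridge helps) might look like:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=200)   # near-duplicate feature (multicollinearity)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))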
Lasso Regression (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that
uses L1 regularization. The L1 regularization adds a penalty equal to the absolute value of the magnitude
of coefficients. This type of regression can shrink some coefficients to zero, effectively performing
variable selection.
print(X.head())
print(y.head())

Step 3: Split the Data into Training and Testing Sets
Next, we split the data into training and testing sets.

# Fit the Lasso model, then make predictions
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred_train = lasso.predict(X_train)
y_pred_test = lasso.predict(X_test)
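To see the variable-selection behaviour described above, a small synthetic sketch (the data is made up so that only two features are informative) shows most coefficients shrunk exactly to zero:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)  # only 2 informative features

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Coefficients:", np.round(lasso.coef_, 3))  # most should be exactly 0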
Correlation and Covariance
# Correlation matrix
correlation_matrix = X.corr()
print("Correlation Matrix:\n", correlation_matrix)
# Covariance matrix
covariance_matrix = X.cov()
print("Covariance Matrix:\n", covariance_matrix)
LOGISTIC REGRESSION
Problem Statement: You work in XYZ Company. The company officials have collected some data on
health parameters based on diabetes and wish for you to create a model from it.
Dataset: diabetes.csv
Tasks to Be Performed:
• Load the dataset using pandas
• Extract the data from the Outcome column into a variable named Y
• Extract the data from every column except the Outcome column into a variable named X
• Divide the dataset into two parts for training and testing in 80% and 20% proportion
• Create and train a Logistic Regression model on the training set
• Make predictions on the testing set using the trained model
• Check the performance by calculating the confusion matrix and accuracy score of the model (a sketch follows this list)
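A minimal sketch of the tasks above, assuming diabetes.csv has an 'Outcome' column (as in the common Pima diabetes dataset; adjust the name to match your file):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

df = pd.read_csv('diabetes.csv')
Y = df['Outcome']
X = df.drop(columns=['Outcome'])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

print(confusion_matrix(Y_test, Y_pred))
print("Accuracy:", accuracy_score(Y_test, Y_pred))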
We aim to classify whether a tumor is malignant or benign based on features such as mean radius, mean
texture, mean perimeter, mean area, and mean smoothness. We'll use a Decision Tree classifier to model
this relationship and evaluate its performance.
A Decision Tree classifier splits the data at each node based on the feature that provides the best split
according to a certain criterion (e.g., Gini impurity or information gain). The process continues
recursively, creating a tree structure where each leaf node represents a class label.
First, we'll load the Breast Cancer dataset. This dataset is available in the sklearn.datasets module.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
print(X.head())
print(y.head())

Step 2: Preprocess the Data
Step 3: Split the Data into Training and Testing Sets
Next, we split the data into training and testing sets.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
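The training step itself is not shown above; a self-contained sketch of the full Decision Tree workflow (Gini impurity is scikit-learn's default criterion, and max_depth=4 is an arbitrary illustrative choice):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))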
Problem Statement
We aim to classify different types of wine into three classes based on 13 chemical attributes such as
alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, and others. We'll use an
SVM classifier to model this relationship and evaluate its performance.
Support Vector Machine Overview
Support Vector Machines find the hyperplane that maximizes the margin between different classes. For
non-linearly separable data, SVM uses kernel tricks to transform the data into a higher-dimensional space
where a linear separator can be found. Common kernels include linear, polynomial, and radial basis
function (RBF).
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print(X.head())
print(y.head())
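A minimal sketch of the SVM workflow described above (the RBF kernel and C=1.0 are illustrative defaults, not tuned values):

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

X_scaled = StandardScaler().fit_transform(X)   # SVMs are sensitive to feature scale
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, svm.predict(X_test)))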
KMEANS CLUSTERING
Problem Statement
We aim to cluster iris flowers into three groups based on four features: sepal length, sepal width, petal
length, and petal width. We'll use the K-Means clustering algorithm to achieve this and evaluate the
results.
K-Means clustering aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean. The algorithm iteratively updates the centroids of the clusters and
assigns data points to the nearest cluster until convergence.
Step 1: Load the Dataset
First, we'll load the Iris dataset. This dataset is available in the sklearn.datasets module.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
print(X.head())
print(y.head())
print(labels)
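The clustering step that produces labels is not shown above; a self-contained sketch of the K-Means workflow (n_clusters=3 matches the three iris species, and the silhouette score evaluates cluster quality):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)

X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print("Cluster labels:", labels[:10])
print("Silhouette score:", silhouette_score(X_scaled, labels))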
We aim to classify handwritten digits (0-9) based on pixel values of 8x8 images. We'll use an MLP
classifier to model this relationship and evaluate its performance.
A Multi-layer Perceptron consists of an input layer, one or more hidden layers, and an output layer. Each
neuron in a layer is connected to all neurons in the next layer. During training, the model adjusts the
weights using backpropagation, which involves calculating the gradient of the loss function with respect
to each weight and updating the weights accordingly.
First, we'll load the digits dataset. This dataset is available in the sklearn.datasets module.
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print(X.head())
print(y.head())
We will standardize the features to ensure they all have the same scale.
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
We'll evaluate the model's performance using accuracy, classification report, and confusion matrix.
# Make predictions
y_pred_train = mlp.predict(X_train)
y_pred_test = mlp.predict(X_test)
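A self-contained sketch tying the MLP steps above together (the single hidden layer of 64 neurons and max_iter=500 are illustrative choices, not tuned hyperparameters):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, digits.target, test_size=0.2, random_state=42)

mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42)
mlp.fit(X_train, y_train)   # weights are updated by backpropagation
print("Test accuracy:", accuracy_score(y_test, mlp.predict(X_test)))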
********
Textbooks:
• The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies, by Erik
Brynjolfsson and Andrew McAfee. ISBN-10: 0393239357
• Getting Started with the Internet of Things, by Cuno Pfister, Shroff; First edition (17 May 2011).
ISBN-10: 9350234130
• Big Data and The Internet of Things, by Robert Stackowiak and Art Licht, Springer Nature; 1st edition
(12 May 2015). ISBN-10: 1484209877