Cp4252 Machine Learning Lab Manual

The document outlines a series of exercises focused on implementing various machine learning models, including linear regression, binary classification, and KNN classification, using different datasets. It provides detailed programming examples and aims to enhance understanding of model tuning, validation sets, and classification metrics. Additionally, it includes a mini project and emphasizes practical application through data manipulation and analysis.

Uploaded by

8610513455

TABLE OF CONTENTS

S.No  Date  Title  Page No  Remark

1. Implement linear regression with a real dataset. Experiment with different features in
   building a model. Tune the model’s hyperparameters.

2. Implement a binary classification model. Modify the classification threshold and
   determine how the modification influences the model. Experiment with different
   classification metrics to determine model effectiveness.

3. Classification with nearest neighbors. In this question, you will use scikit-learn’s KNN
   classifier to classify real vs. fake news headlines. The aim of this question is for you
   to read the scikit-learn API and get comfortable with training/validation splits.
   Use California Housing dataset.

4. Experiment with validation sets and test sets using the dataset. Split a training set
   into a smaller training set and validation set. Analyse deltas between training set and
   validation set results. Test whether the trained model is overfitting.

5. Implement the k-means algorithm using the
   https://archive.ics.uci.edu/ml/datasets/Codon+usage dataset

6. Implement the Naïve Bayes Classifier using the
   https://archive.ics.uci.edu/ml/datasets/Gait+Classification dataset

7. Mini Project

Ex. No. 1 Implement linear regression with a real dataset. Experiment with
different features in building a model. Tune the model’s hyperparameters.
Date:
Aim:

Program:

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients (slope b_1, intercept b_0)
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",
                marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \
          \nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
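
The exercise also asks for hyperparameter tuning, which the closed-form program above does not have. The following is a minimal sketch, not part of the original manual, that fits the same line by batch gradient descent so that the learning rate and the number of epochs become tunable. The function name fit_gradient_descent and its default values are assumptions made for illustration; np.polyfit is used only as a least-squares reference.

```python
import numpy as np

def fit_gradient_descent(x, y, learning_rate=0.01, epochs=5000):
    """Fit y ~ b_0 + b_1*x by batch gradient descent on the mean squared error."""
    b_0, b_1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        error = (b_0 + b_1 * x) - y
        # gradients of the mean squared error with respect to b_0 and b_1
        grad_b0 = 2.0 * error.sum() / n
        grad_b1 = 2.0 * (error * x).sum() / n
        b_0 -= learning_rate * grad_b0
        b_1 -= learning_rate * grad_b1
    return b_0, b_1

# same observations as in the program above
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)

b_0, b_1 = fit_gradient_descent(x, y, learning_rate=0.01, epochs=5000)
closed_b1, closed_b0 = np.polyfit(x, y, deg=1)  # returns slope first, then intercept

print("gradient descent:", b_0, b_1)
print("least squares   :", closed_b0, closed_b1)
```

Trying a few values of learning_rate and epochs and comparing the result with the least-squares line is one simple way to see the effect of each hyperparameter; a learning rate that is too large for the data makes the updates diverge instead of converge.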

Result:

Ex. No. 2 Implement a binary classification model. Modify the
classification threshold and determine how the modification influences the model.
Experiment with different classification metrics to determine model
effectiveness.
Date:

Aim:

Program:
# Binary Classification with Sonar Dataset: Baseline
from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

# load dataset
dataframe = read_csv("sonar.csv", header=None)
dataset = dataframe.values

# split into input (X) and output (Y) variables
X = dataset[:, 0:60].astype(float)
Y = dataset[:, 60]

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# baseline model
def create_baseline():
    # create model: one hidden layer of 60 units, sigmoid output for two classes
    model = Sequential()
    model.add(Dense(60, input_shape=(60,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

# evaluate the baseline model with stratified 10-fold cross-validation
# (the inputs are not standardized in this baseline)
estimator = KerasClassifier(model=create_baseline, epochs=100, batch_size=5,
                            verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
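
The program above reports accuracy only, at the default decision threshold. As a hedged sketch of the threshold and metric experiment named in the exercise title, the block below applies three thresholds to a set of predicted probabilities and reports accuracy, precision, and recall for each. The labels and probabilities are made-up illustrative values, not output of the sonar model, and metrics_at is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.55, 0.70, 0.85, 0.95])

def metrics_at(threshold):
    """Convert probabilities to class labels at `threshold` and score them."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "threshold": threshold,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    }

results = [metrics_at(t) for t in (0.3, 0.5, 0.7)]
for row in results:
    print(row)
```

On these values, raising the threshold trades recall for precision: fewer samples are labelled positive, so fewer of the predicted positives are wrong but more true positives are missed. With a trained model, the probabilities would come from the model's probability output instead of a hand-written array.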

Result:

Ex. No. 3 Classification with nearest neighbors. In this question, you will
use scikit-learn’s KNN classifier to classify real vs. fake news headlines.
The aim of this question is for you to read the scikit-learn API and get
comfortable with training/validation splits. Use California Housing dataset.
Date:

Aim:

Steps:

California Housing - Antonio Furioso


For this project I was asked to predict the median price of houses in California. I will
predict the price with several Machine Learning models (LinearRegression, KNN,
RandomForest, DecisionTree and SVR), compare them, and evaluate which one is the best.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
#importing dataset
ch = pd.read_csv('../input/california-housing/housing.csv')
ch.head()
Out[2]:

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY

This dataset was derived from the 1990 U.S. census, using one row per census block group. A
block group is the smallest geographical unit for which the U.S. Census Bureau publishes
sample data (a block group typically has a population of 600 to 3,000 people).
We have:

• Longitude: house block longitude
• Latitude: house block latitude
• House Age: median house age in block
• Tot Rooms: total number of rooms
• Tot Bedrms: total number of bedrooms
• Population: block population
• Households: house occupancy
• MedInc: median income in block
• Median House Value (target): median value of house block
• Ocean Proximity: indicates the position of the block group in relation to the sea

Total number of instances: 20640

In [3]:

ch.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

• All columns are numerical except for 'ocean_proximity'
• 'total_bedrooms' has some null values (20433 non-null out of 20640)

We are going to replace all null values with the column mean so that no rows are lost.

In [4]:
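#replacing null values with the column mean (numeric columns only,
#because 'ocean_proximity' is text and has no mean)
ch = ch.fillna(ch.mean(numeric_only=True))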
ch.isna().sum()
Out[4]:
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
You can choose either to delete the rows that contain a null value or to replace the
value. I replaced each null value with the column mean because I believe that
eliminating too many rows can affect the analysis and the forecast.
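
The two options can be compared on a small example. The sketch below uses a hypothetical four-row table with made-up values, not the real housing data, to show that dropping removes the incomplete row while mean imputation keeps it.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the housing table (values are illustrative)
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

dropped = df.dropna()                            # option 1: remove incomplete rows
filled = df.fillna(df.mean(numeric_only=True))   # option 2: mean imputation

print("rows:", len(df), "after dropna:", len(dropped), "after fillna:", len(filled))
print(filled["total_bedrooms"].tolist())
```

Passing numeric_only=True keeps the text column out of the mean calculation, so the same call works on a table that mixes numeric and text columns.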

In [5]:
ch.describe()
Out[5]:

           longitude       latitude  housing_median_age    total_rooms  total_bedrooms     population     households  median_income  median_house_value
count   20640.000000   20640.000000        20640.000000   20640.000000    20640.000000   20640.000000   20640.000000   20640.000000        20640.000000
mean     -119.569704      35.631861           28.639486    2635.763081      537.870553    1425.476744     499.539680       3.870671       206855.816909
std         2.003532       2.135952           12.585558    2181.615252      419.266592    1132.462122     382.329753       1.899822       115395.615874
min      -124.350000      32.540000            1.000000       2.000000        1.000000       3.000000       1.000000       0.499900        14999.000000
25%      -121.800000      33.930000           18.000000    1447.750000      297.000000     787.000000     280.000000       2.563400       119600.000000
50%      -118.490000      34.260000           29.000000    2127.000000      438.000000    1166.000000     409.000000       3.534800       179700.000000
75%      -118.010000      37.710000           37.000000    3148.000000      643.250000    1725.000000     605.000000       4.743250       264725.000000
max      -114.310000      41.950000           52.000000   39320.000000     6445.000000   35682.000000    6082.000000      15.000100       500001.000000


In [6]:
ch.hist(bins=50, figsize=(10, 10)) #plotting a histogram of every numeric column
Out[6]:
[Figure: 3 x 3 grid of histograms, one per numeric column]
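
The notebook excerpt stops before any model is trained. As a minimal sketch of the training/validation split with scikit-learn's KNN that the exercise asks for, the block below selects the number of neighbours on a validation set. It uses synthetic data as a stand-in for the housing features (an assumption, so that the example runs without housing.csv), and KNeighborsRegressor rather than the classifier because the notebook's target is a price.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and target (not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=500)

# hold out 20% of the rows as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

scores = {}
for k in (1, 3, 5, 9, 15):
    # scale the features first: KNN distances are sensitive to feature units
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    model.fit(X_train, y_train)
    scores[k] = model.score(X_val, y_val)   # R^2 on the validation set

best_k = max(scores, key=scores.get)
print(scores)
print("best k on validation data:", best_k)
```

The validation set is used only to choose k; a final performance estimate would need a separate test set that played no part in that choice.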

Result:

Ex. No. 4 Experiment with validation sets and test sets using the
dataset. Split a training set into a smaller training set and validation set.
Analyse deltas between training set and validation set results. Test whether the
trained model is overfitting.

Date:

Aim:

Steps:
Training Set
This is the actual dataset from which a model trains, i.e. the model sees and learns from this
data to predict the outcome or to make the right decisions. Most training data is
collected from several sources and then preprocessed and organized so that the model can
perform properly. The type of training data largely determines the ability of the model
to generalize, i.e. the better the quality and diversity of the training data, the better the
performance of the model. The training set is typically more than 60% of the total data
available for the project.
Program:
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data for the example: x holds the values 0-15,
# reshaped into a matrix of shape 8x2
x = np.arange(16).reshape((8, 2))

# y is the sequence 0-7, representing the target variable
y = range(8)

# Splitting the dataset in 80-20 fashion, i.e.
# training set is 80% of total data,
# testing set is 20% of total data
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

# Training set
print("Training set x: ", x_train)
print("Training set y: ", y_train)

Testing Set
This dataset is independent of the training set but has a similar probability distribution
of classes. It is used as a benchmark to evaluate the model, and only after the training of
the model is complete. A testing set is usually a well-organized dataset that covers the
kinds of scenarios the model is likely to face when used in the real world. The validation
and testing sets are often combined and used as a single testing set, which is not
considered good practice. If the accuracy of the model on the training data is substantially
greater than its accuracy on the testing data, the model is said to be overfitting. The
testing set is approximately 20-25% of the total data available for the project.

Program:
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data for the example: x holds the values 0-15,
# reshaped into a matrix of shape 8x2
x = np.arange(16).reshape((8, 2))

# y is the sequence 0-7, representing the target variable
y = range(8)

# Splitting the dataset in 80-20 fashion, i.e.
# training set is 80% of total data,
# testing set is 20% of total data
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

# Testing set
print("Testing set x: ", x_test)
print("Testing set y: ", y_test)
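
The two programs above produce only a training set and a testing set. The block below is a sketch of the remaining steps named in the exercise: splitting the training set again into a smaller training set and a validation set, then comparing training and validation accuracy to look for overfitting. The synthetic data from make_classification and the choice of a decision tree are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for "the dataset" (an assumption)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# First set the test set aside, then split the rest into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)   # 0.25 * 0.8 = 0.2 of all data

for depth in (2, None):   # shallow tree, then a tree with unrestricted depth
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    val_acc = clf.score(X_val, y_val)
    print(f"max_depth={depth}: train={train_acc:.3f} "
          f"val={val_acc:.3f} delta={train_acc - val_acc:.3f}")
```

A large positive delta between training and validation accuracy is the usual sign of overfitting. The test set is left untouched here so that it can give a final estimate once the model and its settings are fixed.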

Result:

Ex. No. 5 Implement the k-means algorithm using the
https://archive.ics.uci.edu/ml/datasets/Codon+usage dataset
Date:

Aim:

Steps:

Step-1: Choose the number of clusters, K.

Step-2: Select K random points as the initial centroids. (They need not be points from the
input dataset.)

Step-3: Assign each data point to its closest centroid, which forms the K clusters.

Step-4: Recompute the centroid of each cluster as the mean of the points assigned to it.

Step-5: Repeat step 3, i.e. reassign each data point to the new closest centroid.

Step-6: If any reassignment occurred, go to step-4; otherwise go to FINISH.

Step-7: The model is ready.

Program:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# synthetic data: 500 two-dimensional points around 3 centers
X, y = make_blobs(n_samples = 500, n_features = 2, centers = 3, random_state = 23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:,0], X[:,1])
plt.show()

k = 3
clusters = {}
np.random.seed(23)

# give every cluster a random initial center in [-2, 2) and an empty point list
for idx in range(k):
    center = 2*(2*np.random.random((X.shape[1],))-1)
    cluster = {
        'center' : center,
        'points' : []
    }
    clusters[idx] = cluster

print(clusters)

# plot the data together with the initial centers
plt.scatter(X[:,0], X[:,1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker = '*', c = 'red')
plt.show()
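
The program above only initialises and plots the centers. The assignment and update loop from the steps listed earlier can be sketched as below. This is a minimal NumPy version written for illustration, not the manual's own code: the function name kmeans is hypothetical, and it starts from k randomly chosen data points rather than random coordinates.

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans(X, k, n_iter=100, seed=23):
    """Plain NumPy k-means (Lloyd's algorithm); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when no assignment changed
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each center to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:   # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers, labels

# same synthetic data as in the program above
X, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)
centers, labels = kmeans(X, k=3)
print(centers)
print(np.bincount(labels, minlength=3))   # number of points per cluster
```

The same function could be applied to the numeric codon-frequency columns of the Codon usage dataset once they are loaded into an array; loading and cleaning that file is not shown here.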

Output:

Output:

Result:

Ex. No. 6 Implement the Naïve Bayes Classifier using the
https://archive.ics.uci.edu/ml/datasets/Gait+Classification dataset
Date:

Aim:

Gait Classification Data Set

Abstract: Gait is considered a biometric criterion. Therefore, we tried to classify people
with gait analysis with this gait data set.

Data Set Characteristics: Multivariate     Number of Instances: 48      Area: Computer
Attribute Characteristics: Real            Number of Attributes: 321    Date Donated: 2020-10-14
Associated Tasks: Classification           Missing Values?: N/A         Number of Web Hits: 38151

Data Set Information:

This data set was created by calculating the walking parameters of a total of 16 different
volunteers, 7 female and 9 male. The volunteers were between 20 and 34 years old, and
their weight ranged from 53 to 95. In order to calculate each walking parameter, people
were asked to walk the 30-meter long course for three rounds. The shared file contains
X and Y data. X represents gait data and Y represents person information for the
relevant sample.

Implementation of the Naïve Bayes algorithm:

Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user_data" dataset, which we have used in our other classification model. Therefore we
can easily compare the Naive Bayes model with the other models.

Steps to implement:
o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.

1) Data Pre-processing step:

In this step, we will pre-process/prepare the data so that we can use it efficiently in our code.
It is similar to what we did in data pre-processing. The code for this is given below:

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In the above code, we have loaded the dataset into our program using "dataset =
pd.read_csv('user_data.csv')". The loaded dataset is divided into a training set and a test
set, and then we have scaled the feature variables.

The output for the dataset is given as:

2) Fitting Naive Bayes to the Training Set:

After the pre-processing step, now we will fit the Naive Bayes model to the Training set.
Below is the code for it:

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

In the above code, we have used the GaussianNB classifier to fit it to the training dataset.
We can also use other classifiers as per our requirement.

Output:

Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)


3) Prediction of the test set result:

Now we will predict the test set result. For this, we will create a new predictor
variable y_pred, and will use the predict function to make the predictions.

# Predicting the Test set results
y_pred = classifier.predict(x_test)

Output:

The above output shows the prediction vector y_pred next to the real vector y_test. We
can see that some predictions differ from the real values; these are the incorrect
predictions.
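
Because user_data.csv is not included with this manual, the following sketch repeats the pre-processing, fitting, and prediction steps on synthetic two-feature data so that they can be run end to end. The data from make_classification is an assumption used only as a stand-in; the scikit-learn calls are the same ones used in the steps above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for user_data.csv (an assumption)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# scale with statistics learned from the training set only
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

classifier = GaussianNB()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# put predictions next to the true labels and count the disagreements
mismatches = int(np.sum(y_pred != y_test))
print(np.column_stack((y_pred, y_test))[:10])
print("incorrect predictions:", mismatches, "of", len(y_test))
```

For the Gait Classification data the same calls would apply, with X holding the 321 gait attributes and y the person labels.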

4) Creating Confusion Matrix:

Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix.
Below is the code for it:

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:

As we can see in the above confusion matrix output, there are 7 + 3 = 10 incorrect
predictions and 65 + 25 = 90 correct predictions, i.e. 90 of the 100 test samples are
classified correctly.
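
The arithmetic can be checked directly from the matrix. In this sketch only the four counts come from the text; their placement within the matrix (rows as true class, columns as predicted class) is an assumption, since the original output image is not reproduced here.

```python
import numpy as np

# Assumed layout: rows = true class, columns = predicted class.
# Only the counts 65, 25, 7 and 3 come from the text; the placement is hypothetical.
cm = np.array([[65, 3],
               [7, 25]])

correct = np.trace(cm)            # diagonal entries: predictions that match
incorrect = cm.sum() - correct    # off-diagonal entries: predictions that differ
accuracy = correct / cm.sum()
print(correct, incorrect, accuracy)
```

Whichever way the two off-diagonal counts are arranged, the accuracy is the same, because it depends only on the diagonal sum and the total.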

5) Visualizing the training set result:

Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code
for it:

# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(
    nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2,
             classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

In the above output we can see that the Naïve Bayes classifier has separated the data points
with a smooth boundary. The boundary is curved because we have used the GaussianNB
classifier in our code.

6) Visualizing the Test set result:


# Visualising the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(
    nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2,
             classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

Result:
