Cp4252 Machine Learning Lab Manual
Ex. No. 1 Implement a Linear Regression with a Real dataset. Experiment with
different features in building a model. Tune the model’s hyperparameters.
Date:
Aim:
Program:
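import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # regression coefficients (slope b_1, intercept b_0)
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plot the observed points and the fitted line
    plt.scatter(x, y, color="m", marker="o", s=30)
    plt.plot(x, b[0] + b[1] * x, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    # illustrative y values (assumption: the y data is missing from this manual)
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    plot_regression_line(x, y, b)

main()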
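The exercise also asks for experiments with different features and for hyperparameter tuning. A minimal sketch of one way to do this is shown below. It assumes synthetic data in place of a real dataset and introduces a ridge penalty `alpha` purely as an example hyperparameter, since plain least squares has none; the helper names `fit_ridge` and `mse` are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression; alpha=0 gives ordinary least squares."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                          # do not penalise the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def mse(X, y, coef):
    """Mean squared error of the fitted coefficients on (X, y)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((Xb @ coef - y) ** 2))

# Synthetic stand-in for a real dataset (assumption): y depends on features 0 and 1 only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out the last 50 rows for validation
X_train, X_val, y_train, y_val = X[:150], X[150:], y[:150], y[150:]

# Try every combination of feature subset and penalty strength
results = {}
for features in [(0,), (0, 1), (0, 1, 2)]:
    for alpha in [0.0, 1.0, 10.0]:
        cols = list(features)
        coef = fit_ridge(X_train[:, cols], y_train, alpha)
        results[(features, alpha)] = mse(X_val[:, cols], y_val, coef)

best = min(results, key=results.get)
print("Best (features, alpha):", best, "validation MSE:", round(results[best], 4))
```

Comparing the validation error across feature subsets shows which features the model actually needs; the same loop can be pointed at the columns of a real dataset.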
Result:
Ex. No. 2 Implement a binary classification model. Modify the
classification threshold and determine how the modification influences the model.
Experiment with different classification metrics to determine model
effectiveness.
Date:
Aim:
Program:
# Binary Classification with Sonar Dataset: Baseline
from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

# load dataset
dataframe = read_csv("sonar.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:, 0:60].astype(float)
Y = dataset[:, 60]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(60, input_shape=(60,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

# evaluate the baseline model with stratified 10-fold cross-validation
estimator = KerasClassifier(model=create_baseline, epochs=100, batch_size=5,
                            verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
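The exercise also asks how the classification threshold influences the model and which metrics describe its effectiveness. The sketch below illustrates the idea with NumPy only. The labels and probabilities are hypothetical illustrative values, not Sonar results, and `metrics_at_threshold` is a helper name introduced here.

```python
import numpy as np

def metrics_at_threshold(y_true, y_prob, threshold):
    """Return (accuracy, precision, recall) when predicting 1 for y_prob >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.35, 0.45, 0.60, 0.40, 0.55, 0.80, 0.90])

for threshold in (0.3, 0.5, 0.7):
    acc, prec, rec = metrics_at_threshold(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f} accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Lowering the threshold raises recall at the cost of precision, and raising it does the opposite. With the Keras model above, `y_prob` would be the sigmoid outputs of the trained network.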
Result:
Ex. No. 3 Classification with nearest neighbors. In this question, you will
use scikit-learn’s KNN classifier to classify real vs. fake news headlines.
The aim of this question is for you to read the scikit-learn API and get
comfortable with training/validation splits. Use the California Housing dataset.
Date:
Aim:
Steps:
In [1]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
#importing dataset
ch = pd.read_csv('../input/california-housing/housing.csv')
ch.head()
Out[2]:
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0
1    -122.22     37.86                21.0       7099.0          1106.0
2    -122.24     37.85                52.0       1467.0           190.0
3    -122.25     37.85                52.0       1274.0           235.0
4    -122.25     37.85                52.0       1627.0           280.0

   population  households  median_income  median_house_value ocean_proximity
0       322.0       126.0         8.3252            452600.0        NEAR BAY
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2       496.0       177.0         7.2574            352100.0        NEAR BAY
3       558.0       219.0         5.6431            341300.0        NEAR BAY
4       565.0       259.0         3.8462            342200.0        NEAR BAY
This dataset was derived from the 1990 U.S. census, using one row per census block group. A
block group is the smallest geographical unit for which the U.S. Census Bureau publishes
sample data (a block group typically has a population of 600 to 3,000 people).
We have:
In [3]:
ch.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
We are going to replace all null values with the column mean for a better prediction.
In [4]:
# replace null values with the column mean (numeric columns only)
ch = ch.fillna(ch.mean(numeric_only=True))
ch.isna().sum()
Out[4]:
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
You can choose either to delete the rows that contain a null value or to replace the
value. Here I replaced each null value with the column mean, because I believe that
eliminating too many rows can affect our analysis and forecast.
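The two options can be compared on a small example. The frame below is a tiny hypothetical stand-in with one missing value, not rows taken from housing.csv.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with one missing value (not rows from housing.csv)
df = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 200.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
})

dropped = df.dropna()                            # option 1: discard incomplete rows
filled = df.fillna(df.mean(numeric_only=True))   # option 2: impute the column mean

print(len(dropped))                       # 3 rows remain
print(filled["total_bedrooms"].iloc[1])   # 200.0, the mean of 100, 300 and 200
```

Dropping loses the whole row, including its valid columns, while imputing keeps every row at the cost of a slightly distorted distribution.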
In [5]:
ch.describe()
Out[5]:
        longitude   latitude  housing_median_age  total_rooms  total_bedrooms  \
mean  -119.569704  35.631861           28.639486  2635.763081      537.870553
std      2.003532   2.135952           12.585558  2181.615252      419.266592
min   -124.350000  32.540000            1.000000     2.000000        1.000000
25%   -121.800000  33.930000           18.000000  1447.750000      297.000000
50%   -118.490000  34.260000           29.000000  2127.000000      438.000000
75%   -118.010000  37.710000           37.000000  3148.000000      643.250000

       population  households  median_income  median_house_value
mean  1425.476744  499.539680       3.870671       206855.816909
std   1132.462122  382.329753       1.899822       115395.615874
min      3.000000    1.000000       0.499900        14999.000000
25%    787.000000  280.000000       2.563400       119600.000000
50%   1166.000000  409.000000       3.534800       179700.000000
75%   1725.000000  605.000000       4.743250       264725.000000
In [6]:
ch.hist(bins=50, figsize=(10, 10))  # plot a histogram of each numeric column
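The exercise asks for scikit-learn's KNN classifier together with a training/validation split. A minimal sketch of that workflow follows. It assumes synthetic classification data from `make_classification` as a stand-in, because the notebook above stops before any model is trained; the candidate values of k are also illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); replace with the real features and labels
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

# Hold out 30% of the rows as a validation set, keeping the class balance
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Try several values of k and record the validation accuracy of each
scores = {}
for k in (1, 3, 5, 7, 9, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = knn.score(X_val, y_val)

best_k = max(scores, key=scores.get)
print("Validation accuracy per k:", scores)
print("Best k:", best_k)
```

Choosing k on the validation set rather than the training set matters here, because k=1 always scores perfectly on the data it was fitted to.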
Result:
Ex. No. 4 Experiment with validation sets and test sets using the
dataset. Split a training set into a smaller training set and validation set.
Analyse deltas between training set and validation set results. Test whether the
trained model is overfitting.
Date:
Aim:
Steps:
Training Set
This is the dataset from which a model trains, i.e., the model sees and learns from this
data to predict the outcome or to make the right decisions. Training data is usually
collected from several sources and then preprocessed and organized so that the model
performs properly. The training data largely determines the ability of the model
to generalize, i.e., the better the quality and diversity of the training data, the better the
performance of the model. The training set is typically more than 60% of the total data
available for the project.
Program:
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data (assumption: the manual does not define x and y)
x = np.arange(16).reshape((8, 2))
y = np.arange(8)

# Splitting dataset in 80-20 fashion, i.e.
# Testing set is 20% of total data
# Training set is 80% of total data
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

# Training set
print("Training set x: ", x_train)
print("Training set y: ", y_train)
Testing Set
This dataset is independent of the training set but has a similar
probability distribution of classes. It is used as a benchmark to evaluate the model,
and only after training is complete. A testing set is usually a well-organized
dataset covering the kinds of scenarios the model will probably face when
used in the real world. Often the validation set and testing set are combined and used as a
single testing set, which is not considered good practice. If the accuracy of the model on
the training data is substantially greater than on the testing data, the model is said to be
overfitting. The testing set is approximately 20-25% of the total data available for the project.
Program:
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data (assumption: the manual does not define x and y)
x = np.arange(16).reshape((8, 2))
y = np.arange(8)

# Splitting dataset in 80-20 fashion, i.e.
# Training set is 80% of total data
# Testing set is 20% of total data
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

# Testing set
print("Testing set x: ", x_test)
print("Testing set y: ", y_test)
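The exercise also asks for a validation set carved out of the training set and for the delta between training and validation results. The sketch below shows one way to do this. It assumes synthetic data and a decision tree as the model, neither of which comes from this manual; any estimator with `fit` and `score` would serve.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=42)

# Split off 20% as the test set, then 25% of the remainder as validation (60/20/20 overall)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Compare a shallow tree with an unrestricted one; None lets the tree memorise the training set
for depth in (2, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    print(f"max_depth={depth}: train={train_acc:.3f} val={val_acc:.3f} "
          f"delta={train_acc - val_acc:.3f}")
```

A large positive delta between training and validation accuracy is the signal of overfitting. The test set stays untouched until the model and its settings are final.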
Result:
Ex. No. 5 Implement the k-means algorithm using
https://archive.ics.uci.edu/ml/datasets/Codon+usage dataset
Date:
Aim:
Steps:
Step-1: Select the number K of clusters.
Step-2: Select K random points as centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest
centroid of each cluster.
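The steps above can be sketched with NumPy alone. This is a minimal illustration, assuming three synthetic 2-D blobs in place of the Codon usage data; the centroid update uses the cluster mean, and the function name `kmeans` and its parameters are introduced here for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step-2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step-3: assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step-5: repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Three synthetic blobs (assumption: stand-in for the real dataset)
rng = np.random.default_rng(23)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

centroids, labels = kmeans(X, k=3)
print(centroids.round(2))
```

Because the starting centroids are random, different seeds can converge to different clusterings; running the function with several seeds and keeping the tightest result is a common precaution.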
Program:
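import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Illustrative 2-D data (assumption: the manual does not show how X is loaded)
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

# plot the raw data
fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:,0],X[:,1])
plt.show()

# initialize k random cluster centers
k=3
clusters = {}
np.random.seed(23)
for idx in range(k):
    center = 2*(2*np.random.random((X.shape[1],))-1)
    cluster = {
        'center' : center,
        'points' : []
    }
    clusters[idx] = cluster
clusters

# plot the data together with the initial centers
plt.scatter(X[:,0],X[:,1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()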
Output:
Output:
Result:
Ex. No. 6 Implement the Naïve Bayes Classifier using
https://archive.ics.uci.edu/ml/datasets/Gait+Classification dataset
Date:
Aim:
Associated Tasks: Classification    Missing Values?: N/A    Number of Web Hits: 38151
This data set was created by calculating the walking parameters of 16
volunteers, 7 female and 9 male. The volunteers were between 20 and 34
years old, and their weight ranged from 53 to 95. To calculate each walking
parameter, people were asked to walk a 30-meter course for three rounds. The shared
file contains X and Y data. X represents the gait data and Y represents the person
information for the relevant sample.
Now we will implement the Naive Bayes algorithm using Python. For this, we will use the
"user_data" dataset, which we have used in our other classification models, so that we
can easily compare the Naive Bayes model with the other models.
Steps to implement:
o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.
In this step, we will pre-process/prepare the data so that we can use it efficiently in our code.
It is similar to what we did in data pre-processing. The code for this is given below:
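A minimal sketch of such a pre-processing step follows. It assumes a table with Age, EstimatedSalary and Purchased columns and generates synthetic values in place of user_data.csv; the 25% test split and the labelling rule are also assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the file (assumption: Age, EstimatedSalary, Purchased columns).
# With the real file this would be: dataset = pd.read_csv('user_data.csv')
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "Age": rng.integers(18, 60, size=400),
    "EstimatedSalary": rng.integers(15000, 150000, size=400),
})
# Hypothetical labelling rule, used only to give the stand-in data a target column
dataset["Purchased"] = (dataset["Age"] + dataset["EstimatedSalary"] / 5000 > 55).astype(int)

# Separate the feature matrix from the target vector
x = dataset[["Age", "EstimatedSalary"]].values.astype(float)
y = dataset["Purchased"].values

# Hold out 25% of the rows as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Scale the features using statistics from the training set only
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.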
In the above code, we have loaded the dataset into our program using dataset =
pd.read_csv('user_data.csv'). The loaded dataset is divided into a training set and a test set, and
then the feature variables are scaled.
2) Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the Training set.
Below is the code for it:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
In the above code, we have used the GaussianNB classifier to fit it to the training dataset.
We can also use other classifiers as per our requirement.
Output:
Now we will predict the test set result. For this, we will create a new predictor
variable y_pred, and will use the predict function to make the predictions.
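The prediction step can be sketched as follows. The example is self-contained, so it assumes two synthetic Gaussian classes in place of the scaled user data; only `GaussianNB`, `fit` and `predict` correspond to the steps described in this section.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two synthetic Gaussian classes (assumption: stand-in for the scaled user data)
rng = np.random.default_rng(0)
x_train = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
                     rng.normal(1.0, 1.0, size=(100, 2))])
y_train = np.array([0] * 100 + [1] * 100)
x_test = np.vstack([rng.normal(-1.0, 1.0, size=(20, 2)),
                    rng.normal(1.0, 1.0, size=(20, 2))])
y_test = np.array([0] * 20 + [1] * 20)

# Fit the classifier on the training set
classifier = GaussianNB()
classifier.fit(x_train, y_train)

# Predict the test set and compare with the true labels
y_pred = classifier.predict(x_test)
print("Predicted:", y_pred)
print("Actual:   ", y_test)
print("Mismatches:", int(np.sum(y_pred != y_test)))
```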
Output:
The above output shows the prediction vector y_pred and the real vector y_test. We
can see that some predictions differ from the real values; these are the incorrect
predictions.
Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix.
Below is the code for it:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions,
and 65+25=90 correct predictions.
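The accuracy follows directly from these counts. In the short worked example below, the position of 7 and 3 off the diagonal is an assumption, since the matrix itself is not reproduced here; the accuracy does not depend on it.

```python
import numpy as np

# Counts quoted above; the placement of 7 and 3 off the diagonal is an assumption
cm = np.array([[65, 3],
               [7, 25]])

correct = np.trace(cm)            # diagonal: 65 + 25 = 90 correct predictions
incorrect = cm.sum() - correct    # off-diagonal: 7 + 3 = 10 incorrect predictions
accuracy = correct / cm.sum()     # 90 / 100
print(correct, incorrect, accuracy)   # 90 10 0.9
```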
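Next we will visualize the training set result using the Naïve Bayes classifier. Below is the code
for it:
# Reconstructed setup (assumption: the first lines are missing from the manual); nm is numpy, mtp is matplotlib.pyplot
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()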
Output:
In the above output we can see that the Naïve Bayes classifier has separated the data points
with a fine boundary. The boundary is a Gaussian curve because we have used the GaussianNB
classifier in our code.
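# Visualising the Test set results
# Reconstructed setup (assumption: the first lines are missing from the manual); nm is numpy, mtp is matplotlib.pyplot
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Naive Bayes (test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()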
Output:
Result: