Jntuk R20 ML MANUAL

GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING FOR WOMEN

(Approved by AICTE New Delhi & Affiliated to JNTUK, Kakinada)


Kommadi, MadhuraWada, Visakhapatnam- 530048

Department of Information Technology

LABORATORY MANUAL

FOR

MACHINE LEARNING LAB

Regulation: 20 Year & Semester: III-I I


Subject Code: R2032427 NBA Course Code: IT-3107
Department of Computer Science and Engineering

Institute Vision:

To emerge as an acclaimed center of learning that provides value-based technical education for the holistic
development of students

Institute Mission:

• Undertake activities that provide value-based knowledge in Science, Engineering, and technology

• Provide opportunities for learning through industry-institute interaction on the state-of-the-art technologies.

• Create a collaborative environment for research, innovation, and entrepreneurship.

• Promote activities that bring in a sense of social responsibility.

Department Vision:

To evolve into a Centre of learning that imparts quality education in Computer Science and
Engineering to produce highly competent professionals.

Department Mission:

• Impart computing and technical skills with an emphasis on professional competency and human values.

• Enrich the learning aptitude to face the dynamic environment of the Computer Industry.

• Enhance the analytical and problem-solving capability through contests and technical seminars.

JNTUK R20 SYLLABUS


Experiment-1:
Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis based on a given set of
training data samples. Read the training data from a .CSV file.
Experiment-2:
For a given set of training data examples stored in a .CSV file, implement and demonstrate the Candidate-
Elimination algorithm to output a description of the set of all hypotheses consistent with the training examples.
Experiment-3:
Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set
for building the decision tree and apply this knowledge to classify a new sample.
Experiment-4:
Exercises to solve the real-world problems using the following machine learning methods: a) Linear Regression b)
Logistic Regression c) Binary Classifier
Experiment-5:
Develop a program for Bias, Variance, Remove duplicates, Cross Validation
Experiment-6:
Write a program to implement Categorical Encoding, One-hot Encoding
Experiment-7:
Build an Artificial Neural Network by implementing the Back propagation algorithm and test the same using
appropriate data sets.
Experiment-8:
Write a program to implement k-Nearest Neighbor algorithm to classify the iris data set. Print both correct and
wrong predictions.
Experiment-9:
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select
appropriate data set for your experiment and draw graphs.
Experiment-10:
Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform this
task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and recall for
your data set.
Experiment-11:
Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for clustering using k-Means
algorithm. Compare the results of these two algorithms and comment on the quality of clustering. You can add
Java/Python ML library classes/API in the program.
Experiment-12:
Exploratory Data Analysis for Classification using Pandas or Matplotlib.
Experiment-13:
Write a Python program to construct a Bayesian network considering medical data. Use this model to demonstrate
the diagnosis of heart patients using standard Heart Disease Data Set
Experiment-14:
Write a program to Implement Support Vector Machines.
Experiment-15:
Write a program to Implement Principal Component Analysis.

LAB RUBRICS
Internals | Category (Points) | Descriptor columns 1 to 4

Day to Day Performance
Attendance (1): Attended and completed on the same day | Attended and partially completed on the same day | Attended but completed in the extra lab | Not attended but completed in the extra lab
Understanding of the Experiment (2): Complete understanding of the experiment with learning objectives | Partial understanding of the experiment with learning objectives | Most of the experiment misunderstood | Complete misunderstanding of the experiment
Implementation with result analysis (5): Complete implementation with result analysis and interpretation | Complete implementation with result analysis only | Complete implementation with result analysis and interpretation in extra lab | Complete implementation with result analysis only in extra lab
Observation submission (2): Submission of the observation on time | Submission of the observation almost on time | Submission of the observation immediately after the extra lab | Submission of the observation after the extra lab

Record
Comprehensiveness & Legible (3): Write all the elements of the experiments which can be easily readable | Write all the elements of the experiments with poor handwriting | Some elements are missing but presented clearly | Some elements are missing and poor handwriting
Timely Submission (2): Submission of the record on time | Submission of the record almost on time | Submission of the record immediately after the extra lab | Submission of the record after the extra lab

Internals | Category (Points) | Descriptor columns 1 to 4

Internals
Aim of the experiment (1): Complete understanding of the learning objectives and outcomes | Complete understanding of the learning objectives only | Partial understanding of the learning objectives | Misunderstanding of the learning objectives
Write up (3): Write all the elements of the experiments which can be easily readable | Write all the elements of the experiments with poor handwriting | Some elements are missing but presented clearly | Some elements are missing and poor handwriting
Implementation & result analysis (4): Complete implementation with result analysis and interpretation | Complete implementation with result analysis only | Partial implementation with result analysis only | Partial implementation only
Viva-Voce (2): Experiment and subject knowledge with good oral presentation | Experiment and subject knowledge with poor oral presentation | Partial experiment knowledge with poor oral presentation | Partial subject knowledge with poor oral presentation

INDEX
SNO NAME OF THE EXPERIMENT
MACHINE LEARNING LAB
1. Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis based on a given set of training data samples. Read the training data from a .CSV file.
2. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Candidate-Elimination algorithm to output a description of the set of all hypotheses consistent with the training examples.
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new sample.
4. Exercises to solve the real-world problems using the following machine learning methods: a) Linear Regression b) Logistic Regression c) Binary Classifier
5. Develop a program for Bias, Variance, Remove duplicates, Cross Validation.
6. Write a program to implement Categorical Encoding, One-hot Encoding.
7. Build an Artificial Neural Network by implementing the Back propagation algorithm and test the same using appropriate data sets.
8. Write a program to implement k-Nearest Neighbor algorithm to classify the iris data set. Print both correct and wrong predictions.
9. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select appropriate data set for your experiment and draw graphs.
10. Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and recall for your data set.
11. Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for clustering using k-Means algorithm. Compare the results of these two algorithms and comment on the quality of clustering. You can add Java/Python ML library classes/API in the program.
12. Exploratory Data Analysis for Classification using Pandas or Matplotlib.
13. Write a Python program to construct a Bayesian network considering medical data. Use this model to demonstrate the diagnosis of heart patients using standard Heart Disease Data Set.
14. Write a program to Implement Support Vector Machines.
15. Write a program to Implement Principal Component Analysis.


Experiment-1:

AIM: Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis
based on a given set of training data samples. Read the training data from a .CSV file.

Theory:
Find-S Algorithm:
Load Data set
Initialize h to the most specific hypothesis in H
For each positive training instance x
    For each attribute constraint ai in h
        If the constraint ai in h is satisfied by x then do nothing
        Else replace ai in h by the next more general constraint that is satisfied by x
Output hypothesis h

Write a Python program that takes input from a CSV file (e.g., EnjoySport.csv) and prints all the instances,
then only the positive instances that influence Find-S, the specific hypothesis after considering
each positive instance, and the final hypothesis.

Example:
Input:
'Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same',True
'Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same',True
'Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change',False
'Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change',True

Output: The Maximally specific hypothesis for the training instance is


['Sunny', 'Warm', '?', 'Strong', '?', '?']
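
Before reading from a CSV file, the procedure can be tried on in-memory data. The following is a minimal sketch, not the manual's reference solution: the function name find_s and the row layout (attribute values followed by a boolean label) are assumptions made for illustration.

```python
def find_s(examples):
    """Return the maximally specific hypothesis for (attributes..., label) rows."""
    hypothesis = None
    for *attributes, label in examples:
        if not label:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(attributes)  # first positive example is copied as-is
            continue
        # Generalize every attribute that disagrees with the new positive example.
        hypothesis = [h if h == a else '?' for h, a in zip(hypothesis, attributes)]
    return hypothesis


# The four EnjoySport examples listed above (label assumed to be a boolean).
training = [
    ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same', True),
    ('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same', True),
    ('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change', False),
    ('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change', True),
]
print(find_s(training))
```

With the four examples above, the sketch prints the same hypothesis as the expected output: the third, fifth, and sixth attributes are generalized to '?' because the positive examples disagree on them.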

Sample Code:
import csv

# Read every training example; strip stray spaces and quotes around values.
with open('trainingdata.csv') as csv_file:
    rows = [[v.strip(" '") for v in row] for row in csv.reader(csv_file) if row]

print("\nThe given training examples are:")
for row in rows:
    print(row)

# The last column is the label; accept "Yes" or "True" (case-insensitive).
positives = [row for row in rows if row[-1].upper() in ("YES", "TRUE")]

print("\nThe positive examples are:")
for row in positives:
    print(row)

# Start from the most specific hypothesis ('%' means "no value accepted yet").
num_attributes = len(positives[0]) - 1
hypo = ['%'] * num_attributes
print("\nThe steps of the Find-S algorithm are:\n", hypo)

for example in positives:
    for k in range(num_attributes):
        if hypo[k] == '%':
            hypo[k] = example[k]   # first positive example: copy its value
        elif hypo[k] != example[k]:
            hypo[k] = '?'          # conflicting values: generalize
    print(hypo)

print("\nThe maximally specific Find-S hypothesis for the given training examples is:")
print(hypo)

Output:
The maximally specific Find-s hypothesis for the given training examples is :
['Sunny', 'Warm', '?', 'Strong', '?', '?']

Experiment-2:

Aim: For a given set of training data examples stored in a .CSV file, implement and demonstrate the
Candidate-Elimination algorithm to output a description of the set of all hypotheses consistent with
the training examples.

Theory:
Candidate-Elimination Algorithm:
Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
    If d is a positive example
        Remove from G any hypothesis inconsistent with d
        For each hypothesis s in S that is not consistent with d
            Remove s from S
            Add to S all minimal generalizations h of s such that
                h is consistent with d, and some member of G is more general than h
            Remove from S any hypothesis that is more general than another hypothesis in S
    If d is a negative example
        Remove from S any hypothesis inconsistent with d
        For each hypothesis g in G that is not consistent with d
            Remove g from G
            Add to G all minimal specializations h of g such that
                h is consistent with d, and some member of S is more specific than h
            Remove from G any hypothesis that is less general than another hypothesis in G
Output the hypotheses S and G.

Write a Python program to take input from a CSV file (e.g., EnjoySport.csv) and apply the candidate-
elimination algorithm to find S and G.
Example:
Input:
'Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same',True
'Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same',True
'Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change',False
'Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change',True
Output:
Final Specific h: [['Sunny', 'Warm', '?', 'Strong', '?', '?']]
Final General h: [['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]
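
The algorithm relies on two tests: whether a hypothesis is consistent with an example, and whether one hypothesis is more general than another. The sketch below checks the example output against the example input using these two tests. The helper names covers, consistent, and more_general_or_equal are assumptions introduced for illustration, and the sketch only handles hypotheses made of specific values and '?'.

```python
def covers(hypothesis, instance):
    """True if every constraint is '?' or equals the instance value."""
    return all(h == '?' or h == v for h, v in zip(hypothesis, instance))


def consistent(hypothesis, instance, label):
    """Consistent means: covers the instance exactly when the label is positive."""
    return covers(hypothesis, instance) == label


def more_general_or_equal(h1, h2):
    """True if h1 covers every instance that h2 covers."""
    return all(a == '?' or a == b for a, b in zip(h1, h2))


examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
S = ['Sunny', 'Warm', '?', 'Strong', '?', '?']
G = [['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]

# Every boundary hypothesis should be consistent with all training examples.
for h in [S] + G:
    print(h, all(consistent(h, x, y) for x, y in examples))

# Every member of G should be at least as general as S.
print(all(more_general_or_equal(g, S) for g in G))
```

Every hypothesis lying between S and G in the generality ordering is also consistent with the data, which is why the two boundary sets are enough to describe the whole version space.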

Sample Code:

import numpy as np
import pandas as pd

# Load the training data (read_csv treats the first row as the header).
data = pd.read_csv('trainingdata.csv')
print(data)

# Separate the attribute columns from the target (last) column.
concepts = np.array(data.iloc[:, 0:-1])
print(concepts)
target = np.array(data.iloc[:, -1])
print(target)


def is_positive(label):
    # Accept "Yes"/"No" as well as True/False labels.
    return str(label).strip().lower() in ("yes", "true")


def learn(concepts, target):
    '''
    learn() implements a simplified form of the Candidate-Elimination algorithm.
    Arguments:
        concepts - array with one row of attribute values per example
        target   - array with the corresponding output values
    '''
    # Initialise S with the first instance (assumed to be a positive example).
    # .copy() creates a new array instead of pointing to the same memory.
    specific_h = concepts[0].copy()
    n = len(specific_h)
    # Initialise G with one fully general hypothesis per attribute.
    general_h = [["?" for _ in range(n)] for _ in range(n)]
    print("\nInitialization of specific_h and general_h")
    print(specific_h)
    print(general_h)

    # The learning iterations
    for i, h in enumerate(concepts):
        if is_positive(target[i]):
            # Positive example: generalise S and relax G where values differ.
            for x in range(n):
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        else:
            # Negative example: change values only in G.
            for x in range(n):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'

        print("\nSteps of Candidate Elimination Algorithm", i + 1)
        print(specific_h)
        print(general_h)

    # Remove the rows of G that are still fully general (unchanged).
    general_h = [g for g in general_h if g != ["?"] * n]
    # Return final values
    return specific_h, general_h


s_final, g_final = learn(concepts, target)
print("\nFinal Specific_h:", s_final, sep="\n")
print("\nFinal General_h:", g_final, sep="\n")

Output:
Final Specific_h:
['Sunny' 'Warm' '?' 'Strong' '?' '?']

Final General_h:
[['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]

Experiment-3:

AIM: Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an
appropriate data set for building the decision tree and apply this knowledge to classify a new sample.

Theory:
ID3 Algorithm:
Create a Root node for the tree
If all Examples are positive, Return the single-node tree Root, with label = +
If all Examples are negative, Return the single-node tree Root, with label = -
If Attributes is empty, Return the single-node tree Root, with label = most common value of Target_attribute in Examples
Otherwise Begin
    A ← the attribute from Attributes that best classifies Examples (the one with the highest information gain)
    The decision attribute for Root ← A
    For each possible value, vi, of A,
        Add a new tree branch below Root, corresponding to the test A = vi
        Let Examples_vi be the subset of Examples that have value vi for A
        If Examples_vi is empty
            Then below this new branch add a leaf node with label = most common value of Target_attribute in Examples
            Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes – {A})
End
Return Root

Write a Python program to import an appropriate dataset, split it into training and test sets, apply ID3
algorithm to build a decision tree fitting the training set. Find out its accuracy on the training set as
well as the test set.
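
ID3 chooses the attribute that best classifies the examples by comparing information gain, which is the entropy of the labels minus the weighted entropy after splitting on the attribute. The following is a small sketch of that computation. The toy rows and the function names are assumptions made for illustration and are not part of the prescribed data set.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute_index):
    """Entropy of the labels minus the weighted entropy after the split."""
    labels = [row[-1] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute_index], []).append(row[-1])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: (Outlook, Windy, Play)
rows = [
    ('Sunny', 'No', 'No'),
    ('Sunny', 'Yes', 'No'),
    ('Rainy', 'No', 'Yes'),
    ('Rainy', 'Yes', 'Yes'),
]
print(information_gain(rows, 0))  # Outlook separates the two classes completely
print(information_gain(rows, 1))  # Windy says nothing about the class
```

On this toy data ID3 would place Outlook at the root, since its gain equals the full entropy of the labels while the gain of Windy is zero.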

Sample Code:

import csv
import math

import numpy as np


def read_data(filename):
    # The first row holds the attribute names; the other rows are examples.
    with open(filename, 'r') as csvfile:
        datareader = csv.reader(csvfile, delimiter=',')
        metadata = next(datareader)
        traindata = [row for row in datareader if row]
    return metadata, traindata


class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""

    def __str__(self):
        return self.attribute


def subtables(data, col, delete):
    # Split data into one sub-table per distinct value found in column col.
    items = np.unique(data[:, col])
    tables = {}
    for item in items:
        subset = data[data[:, col] == item]
        if delete:
            subset = np.delete(subset, col, 1)
        tables[item] = subset
    return items, tables


def entropy(S):
    # Shannon entropy (base 2) of the labels in S.
    _, counts = np.unique(S, return_counts=True)
    probs = counts / S.size
    return -sum(p * math.log(p, 2) for p in probs)


def gain_ratio(data, col):
    # Note: this sample ranks attributes by gain ratio, that is,
    # information gain divided by the intrinsic value of the split.
    items, tables = subtables(data, col, delete=False)
    total_size = data.shape[0]
    gain = entropy(data[:, -1])
    iv = 0.0
    for item in items:
        ratio = tables[item].shape[0] / total_size
        gain -= ratio * entropy(tables[item][:, -1])
        iv -= ratio * math.log(ratio, 2)
    # An attribute with a single value carries no information.
    return gain / iv if iv > 0 else 0.0


def create_node(data, metadata):
    labels, counts = np.unique(data[:, -1], return_counts=True)
    # Leaf: all examples agree, or no attributes remain (use the majority label).
    if labels.shape[0] == 1 or data.shape[1] == 1:
        node = Node("")
        node.answer = labels[np.argmax(counts)]
        return node

    gains = [gain_ratio(data, col) for col in range(data.shape[1] - 1)]
    split = int(np.argmax(gains))

    node = Node(metadata[split])
    metadata = np.delete(metadata, split, 0)
    items, tables = subtables(data, split, delete=True)
    for item in items:
        child = create_node(tables[item], metadata)
        node.children.append((item, child))
    return node


def empty(size):
    return " " * size


def print_tree(node, level):
    if node.answer != "":
        print(empty(level), node.answer)
        return
    print(empty(level), node.attribute)
    for value, n in node.children:
        print(empty(level + 1), value)
        print_tree(n, level + 2)


metadata, traindata = read_data("tennisdata.csv")
data = np.array(traindata)
node = create_node(data, metadata)
print_tree(node, 0)

Output:
for input [1, 0, 0, 0], we obtain Yes

Experiment-4:
AIM: Exercises to solve the real-world problems using the following machine learning methods: a)
Linear Regression b) Logistic Regression c) Binary Classifier

Theory:
While there are many types of regression analysis, at their core they all examine the influence of one
or more independent variables on a dependent variable. Linear regression is usually among the first
topics people pick up while learning predictive modelling. In this technique, the dependent
variable is continuous, the independent variable(s) can be continuous or discrete, and the
regression line is linear. Linear regression establishes a relationship between a dependent variable (Y)
and one or more independent variables (X) using a best-fit straight line (also known as the regression
line). It is represented by the equation $Y = \beta_0 + \beta_1 x_1 + \dots + \beta_r x_r$. In other words, linear regression
predicts the target variable as a linear (or weighted) combination of the input variables. This equation can
be used to predict the value of the dependent variable given the independent variable(s).
Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm.
Based on a given set of independent variables, it is used to estimate a discrete value (0 or 1, yes/no,
true/false). It measures the relationship between the categorical dependent variable and one
or more independent variables by estimating the probability of occurrence of an event using the
logistic function, i.e., $Y = 1 / (1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_r x_r)})$.
Since logistic regression can be seen as a binary classifier, it is enough for students to implement linear
and logistic regression with appropriate data sets for this experiment.
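
The linear regression part of the experiment can be sketched with an ordinary least-squares fit. The example below is a minimal sketch under stated assumptions: the data are synthetic and noise-free (generated from y = 2 + 3x purely for illustration), and np.linalg.lstsq is used as one possible way to estimate the coefficients. The small sigmoid helper shows the logistic function that turns a linear score into a probability.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones, so beta[0] plays the role of the intercept.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)


def sigmoid(z):
    """Logistic function: maps any real score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


print(beta)          # estimated intercept and slope
print(X @ beta)      # fitted values for the training inputs
print(sigmoid(0.0))  # a score of zero corresponds to probability 0.5
```

On real data the fit will not be exact, and the residuals between X @ beta and y give a first indication of how well a straight line describes the relationship.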

Sample Code:

# Importing libraries
import warnings

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# to compare our model's accuracy with the sklearn model
from sklearn.linear_model import LogisticRegression

warnings.filterwarnings("ignore")


# Logistic Regression trained with batch gradient descent
class LogitRegression:
    def __init__(self, learning_rate, iterations):
        self.learning_rate = learning_rate
        self.iterations = iterations

    # Function for model training
    def fit(self, X, Y):
        # no_of_training_examples, no_of_features
        self.m, self.n = X.shape
        # weight initialization
        self.W = np.zeros(self.n)
        self.b = 0
        self.X = X
        self.Y = Y
        # gradient descent learning
        for i in range(self.iterations):
            self.update_weights()
        return self

    # Helper function to update weights in gradient descent
    def update_weights(self):
        A = 1 / (1 + np.exp(-(self.X.dot(self.W) + self.b)))
        # calculate gradients
        tmp = np.reshape(A - self.Y.T, self.m)
        dW = np.dot(self.X.T, tmp) / self.m
        db = np.sum(tmp) / self.m
        # update weights
        self.W = self.W - self.learning_rate * dW
        self.b = self.b - self.learning_rate * db
        return self

    # Hypothesis function h(x)
    def predict(self, X):
        Z = 1 / (1 + np.exp(-(X.dot(self.W) + self.b)))
        return np.where(Z > 0.5, 1, 0)


# Driver code
def main():
    # Importing dataset; the last column is assumed to hold the 0/1 label
    # and all other columns are assumed to be the features.
    df = pd.read_csv("diabetes.csv")
    X = df.iloc[:, :-1].values
    Y = df.iloc[:, -1].values

    # Splitting dataset into train and test set
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=1/3, random_state=0)

    # Model training
    model = LogitRegression(learning_rate=0.01, iterations=1000)
    model.fit(X_train, Y_train)
    model1 = LogisticRegression()
    model1.fit(X_train, Y_train)

    # Prediction on test set
    Y_pred = model.predict(X_test)
    Y_pred1 = model1.predict(X_test)

    # measure performance: percentage of correctly classified test examples
    print("Accuracy on test set by our model : ",
          np.mean(Y_pred == Y_test) * 100)
    print("Accuracy on test set by sklearn model : ",
          np.mean(Y_pred1 == Y_test) * 100)


if __name__ == "__main__":
    main()


Output:
Accuracy on test set by our model : 58.333333333333336
Accuracy on test set by sklearn model : 61.111111111111114

Experiment-5:

AIM: Develop a program for Bias, Variance, Remove duplicates, Cross Validation

Theory:
The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause
an algorithm to miss the relevant relations between features and target outputs (underfitting). The
variance is an error from sensitivity to small fluctuations in the training set. High variance may result
from an algorithm modeling the random noise in the training data (overfitting). The bias–variance
tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both
accurately captures the regularities in its training data and generalizes well to unseen data.
Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods
may be able to represent their training set well but are at risk of overfitting to noisy or
unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models
that may fail to capture important regularities (i.e. underfit) in the data. The bias–variance
decomposition is a way of analyzing a learning algorithm's expected generalization error with respect
to a particular model. The following diagram illustrates the bias–variance tradeoff.

Preparing a dataset before designing a machine learning model is an important task for the data
scientist. When you gather a dataset for modelling a machine learning model, you may find some
instances repeated several times. It is very important for you to remove duplicates from the dataset to
maintain accuracy and to avoid misleading statistics.
Cross-validation is a technique for evaluating a machine learning model and testing its performance.
CV is commonly used in applied ML tasks. It can be used to estimate the test error associated with a
given statistical learning method in order to evaluate its performance, or to select the appropriate level
of flexibility.
In this experiment, students need to take a learning model and an appropriate data set, remove
duplicates in the data set, fit a model, measure bias and variance components of the error rate, and
fine-tune the parameters using cross validation. They may use built-in APIs if needed.
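
The duplicate-removal and cross-validation steps can be sketched with NumPy alone. The following is a minimal sketch under stated assumptions: the data are synthetic, the model is a straight-line fit, and the helper name k_fold_mse is introduced for illustration. In the lab, pandas drop_duplicates and the scikit-learn cross-validation utilities may be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical regression data: y is roughly 1.5 * x plus a little noise
x = rng.uniform(-1, 1, size=40)
y = 1.5 * x + rng.normal(scale=0.1, size=40)
data = np.column_stack([x, y])
data = np.vstack([data, data[:5]])  # inject 5 duplicate rows on purpose

# Remove duplicates: np.unique with axis=0 keeps one copy of each row.
clean = np.unique(data, axis=0)


def k_fold_mse(data, k=5, seed=0):
    """Mean squared error of a straight-line fit, averaged over k folds."""
    order = np.random.default_rng(seed).permutation(len(data))
    folds = np.array_split(order, k)
    errors = []
    for i in range(k):
        test = data[folds[i]]
        train = data[np.concatenate([folds[j] for j in range(k) if j != i])]
        slope, intercept = np.polyfit(train[:, 0], train[:, 1], deg=1)
        pred = slope * test[:, 0] + intercept
        errors.append(np.mean((pred - test[:, 1]) ** 2))
    return float(np.mean(errors))


print(len(data), len(clean))  # row counts before and after removing duplicates
print(k_fold_mse(clean))      # cross-validated estimate of the test error
```

Duplicates are removed before splitting so that the same row cannot appear in both a training fold and the matching test fold, which would make the error estimate look better than it is.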

Sample Code:

from mlxtend.evaluate import bias_variance_decomp
from mlxtend.data import iris_data, boston_housing_data
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.model_selection import train_test_split


def report(model, X_train, y_train, X_test, y_test, loss):
    # Decompose the expected loss of the model into bias and variance.
    avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model, X_train, y_train, X_test, y_test,
        loss=loss,
        random_seed=123)
    print('Average expected loss: %.3f' % avg_expected_loss)
    print('Average bias: %.3f' % avg_bias)
    print('Average variance: %.3f' % avg_var)


# Classification: decision tree versus bagged decision trees (0-1 loss)
X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)

tree = DecisionTreeClassifier(random_state=123)
report(tree, X_train, y_train, X_test, y_test, loss='0-1_loss')

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn releases (base_estimator / estimator).
bag = BaggingClassifier(tree,
                        n_estimators=100,
                        random_state=123)
report(bag, X_train, y_train, X_test, y_test, loss='0-1_loss')

# Regression: decision tree versus bagged decision trees (mean squared error)
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

tree = DecisionTreeRegressor(random_state=123)
report(tree, X_train, y_train, X_test, y_test, loss='mse')

bag = BaggingRegressor(tree,
                       n_estimators=100,
                       random_state=123)
report(bag, X_train, y_train, X_test, y_test, loss='mse')

Output:
Average expected loss: 18.620
Average bias: 15.461
Average variance: 3.159

Experiment-6:

AIM: Write a program to implement Categorical Encoding, One-hot Encoding

Theory:
In data science, data preparation is a mandatory step before modelling, and encoding categorical data
is one of its crucial tasks. Most real-life data sets contain categorical string values, whereas most
machine learning models work with numerical values only, because they perform mathematical
operations on their inputs. In short, most models require numbers (integer or float) as data, not
strings. Encoding categorical data is the process of converting categorical values into a numeric
format so that the converted data can be provided to the models to produce and improve the
predictions.

In this experiment, students need to implement popular categorical encoding techniques like:
• One-Hot Encoding
• Label Encoding (or Ordinal Encoding)
• Binary Encoding
• Base-N Encoding
• Hash Encoding
• Target Encoding
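
Two of the listed techniques, label encoding and binary encoding, can be sketched in plain Python without any library. This is an illustrative sketch only: the variable names are assumptions, and numbering the categories in sorted order is one possible convention among several.

```python
colors = ["red", "green", "yellow", "red", "blue"]

# Label (ordinal) encoding: the sorted categories are numbered 0, 1, 2, ...
categories = sorted(set(colors))
label_of = {c: i for i, c in enumerate(categories)}
labels = [label_of[c] for c in colors]

# Binary encoding: write each label in base 2 using a fixed number of bits.
width = max(1, (len(categories) - 1).bit_length())
binary = [[int(bit) for bit in format(label, f"0{width}b")] for label in labels]

print(categories)
print(labels)
print(binary)
```

Binary encoding needs only two columns for these four categories, whereas one-hot encoding would need four, which is the main reason it is preferred when a feature has many distinct values.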

Sample Code:

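import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative data (an assumption; the manual does not show how bridge_df is built)
bridge_df = pd.DataFrame({'Bridge_Types': ['Arch', 'Beam', 'Truss', 'Cantilever']})
# label encoded values of Bridge_Types
bridge_df['Bridge_Types_Cat'] = LabelEncoder().fit_transform(bridge_df['Bridge_Types'])

# creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')
# passing Bridge_Types_Cat column (label encoded values of Bridge_Types)
enc_df = pd.DataFrame(enc.fit_transform(bridge_df[['Bridge_Types_Cat']]).toarray())
# merge with main df bridge_df on key values
bridge_df = bridge_df.join(enc_df)
print(bridge_df)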

import numpy as np

### Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]

### Universal list of colors
total_colors = ["red", "green", "blue", "black", "yellow"]

### map each color to an integer
mapping = {}
for x in range(len(total_colors)):
    mapping[total_colors[x]] = x

one_hot_encode = []

for c in colors:
    arr = list(np.zeros(len(total_colors), dtype=int))
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)

print(one_hot_encode)

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

### Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]

### integer mapping using LabelEncoder
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(colors)
print(integer_encoded)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

### One hot encoding
# The sparse=False argument was renamed in recent scikit-learn releases,
# so the default sparse result is converted to a dense array instead.
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(integer_encoded).toarray()
print(onehot_encoded)

Output:
[2 1 3 2 0]
[[0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]]

Experiment-7:
AIM: Build an Artificial Neural Network by implementing the Back propagation algorithm and test
the same using appropriate data sets.

Theory:
The Back propagation algorithm is a supervised learning method for multilayer feedforward networks
from the field of Artificial Neural Networks.
Feed-forward neural networks are inspired by the information processing of biological neural cells,
called neurons. A neuron accepts input signals via its dendrites, which pass the electrical signal down
to the cell body. The axon carries the signal out to synapses, which are the connections of a cell’s
axon to other cells’ dendrites.
The principle of the back propagation approach is to model a given function by modifying internal
weightings of input signals to produce an expected output signal. The system is trained using a
supervised learning method, where the error between the system’s output and a known expected
output is presented to the system and used to modify its internal state.
Technically, the back propagation algorithm is a method for training the weights in a multilayer feed-
forward neural network. As such, it requires a network structure to be defined of one or more layers
where one layer is fully connected to the next layer. Back propagation can be used for both
classification and regression problems.

Back propagation Algorithm:


1. Load data set
2. Assign all network inputs and outputs
3. Initialize all weights with small random numbers, typically between -1 and 1
4. repeat
       for every pattern in the training set
           Present the pattern to the network
           Propagate the input forward through the network, computing the output
           Propagate the errors backward through the network, updating the weights
       Calculate the Error Function
   while ((number of iterations < specified maximum) AND (Error Function > specified threshold))
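
The repeat loop and its two stopping conditions can be sketched for the smallest possible network, a single sigmoid neuron. This is only an illustration under stated assumptions: the OR training set, the learning rate, the iteration limit and the error threshold are all made-up values, and a real multilayer network would also propagate the error through hidden layers.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def predict(weights, bias, xs):
    return sigmoid(sum(w * x for w, x in zip(weights, xs)) + bias)

def mean_squared_error(weights, bias, patterns):
    return sum((t - predict(weights, bias, xs)) ** 2 for xs, t in patterns) / len(patterns)

# Toy training set (an assumption): the logical OR function
patterns = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(2)]  # small random weights in [-1, 1]
bias = random.uniform(-1, 1)
learning_rate = 0.5      # hypothetical value
max_iterations = 5000    # hypothetical iteration limit
target_error = 0.01      # hypothetical error threshold

initial_error = mean_squared_error(weights, bias, patterns)
error, iteration = initial_error, 0
while iteration < max_iterations and error > target_error:
    for xs, target in patterns:
        out = predict(weights, bias, xs)                 # forward pass
        delta = (target - out) * out * (1 - out)         # error times sigmoid derivative
        weights = [w + learning_rate * delta * x for w, x in zip(weights, xs)]
        bias += learning_rate * delta                    # weight update
    error = mean_squared_error(weights, bias, patterns)  # error function after the pass
    iteration += 1

print(iteration, round(initial_error, 4), round(error, 4))
```

The loop stops as soon as either condition fails, so training ends early when the error threshold is reached and is capped otherwise.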

Students need to build a neural network by implementing the back propagation algorithm and test
the same using an appropriate data set.

Sample Code:

import numpy as np

# X = (hours sleeping, hours studying), y = score on test
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)

# scale units
X = X / np.amax(X, axis=0)  # divide each column by its maximum
y = y / 100                 # max test score is 100

class Neural_Network(object):
    def __init__(self):
        # Parameters
        self.inputSize = 2
        self.outputSize = 1
        self.hiddenSize = 3
        # Weights
        self.W1 = np.random.randn(self.inputSize, self.hiddenSize)   # (2x3) weight matrix from input to hidden layer
        self.W2 = np.random.randn(self.hiddenSize, self.outputSize)  # (3x1) weight matrix from hidden to output layer

    def forward(self, X):
        # forward propagation through our network
        self.z = np.dot(X, self.W1)         # dot product of X (input) and first set of 2x3 weights
        self.z2 = self.sigmoid(self.z)      # activation function
        self.z3 = np.dot(self.z2, self.W2)  # dot product of hidden layer (z2) and second set of 3x1 weights
        o = self.sigmoid(self.z3)           # final activation function
        return o

    def sigmoid(self, s):
        return 1 / (1 + np.exp(-s))  # activation function

    def sigmoidPrime(self, s):
        # derivative of sigmoid, where s is already a sigmoid output
        return s * (1 - s)

    def backward(self, X, y, o):
        # backward propagate through the network
        self.o_error = y - o  # error in output
        self.o_delta = self.o_error * self.sigmoidPrime(o)  # applying derivative of sigmoid to the output error
        self.z2_error = self.o_delta.dot(self.W2.T)  # z2 error: how much our hidden layer weights contributed to output error
        self.z2_delta = self.z2_error * self.sigmoidPrime(self.z2)  # applying derivative of sigmoid to z2 error
        self.W1 += X.T.dot(self.z2_delta)       # adjusting first set (input --> hidden) weights
        self.W2 += self.z2.T.dot(self.o_delta)  # adjusting second set (hidden --> output) weights

    def train(self, X, y):
        o = self.forward(X)
        self.backward(X, y, o)

NN = Neural_Network()
# Train repeatedly. The count of 1000 passes is an assumption: the listing is
# incomplete at this point and a single call to train() would not reach the loss shown below.
for i in range(1000):
    NN.train(X, y)

print("\nInput: \n" + str(X))
print("\nActual Output: \n" + str(y))
print("\nPredicted Output: \n" + str(NN.forward(X)))
print("\nLoss: \n" + str(np.mean(np.square(y - NN.forward(X)))))  # mean squared loss

Output:

Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]

Actual Output:
[[0.92]
[0.86]
[0.89]]

Predicted Output:
[[0.90907296]
[0.85841616]
[0.90140598]]

Loss:
8.400178305772788e-05

Experiment-8:

AIM: Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set. Print
both correct and wrong predictions.

Theory:
The k-nearest neighbours algorithm, also known as k-NN, is a non-parametric, supervised learning
classifier, which uses proximity to make classifications or predictions about an individual data point.
While it can be used for either regression or classification problems, it is typically used as a
classification algorithm, working off the assumption that similar points can be found near one another.
For classification problems, a class label is assigned on the basis of a majority vote, i.e. the label
that is most frequently represented among the k nearest neighbours of a given data point is used.
Regression problems use a similar concept, but in this case the average of the values of the k nearest
neighbours is taken to make a prediction.
The k-NN algorithm also belongs to the family of “lazy learning” models: it only stores the training
dataset rather than undergoing a training stage, so all the computation occurs when a classification
or prediction is being made. Since it relies heavily on memory to store all its training data, it is
also referred to as an instance-based or memory-based learning method.

K-NN algorithm
Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points arr[].
2. for i = 0 to m-1: Calculate Euclidean distance d(arr[i], p).
3. Sort the training data points in ascending order based on distance values.
4. Get top k rows from the sorted array.
5. Get the most frequent class of these rows.
6. Return the predicted class.
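
The six steps above can be sketched directly with the standard library. The function name `knn_predict` and the toy points are assumptions made for this illustration; the scikit-learn classifier used in the sample code is the practical choice for the Iris data.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, p, k):
    # Step 2: distance from every stored sample to the unknown point p
    distances = [(math.dist(point, p), label)
                 for point, label in zip(train_points, train_labels)]
    # Steps 3-4: sort ascending by distance and keep the k nearest
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 5-6: majority vote among the k labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Step 1: toy training samples (an assumption, chosen so the groups are well separated)
arr = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(arr, labels, (2.0, 2.0), k=3))
print(knn_predict(arr, labels, (6.0, 6.5), k=3))
```

All the work happens inside `knn_predict`, which matches the description of k-NN as a lazy learner: nothing is computed until a query arrives.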

Students need to implement the k-NN algorithm and test the same to classify the iris data set. Print
both correct and wrong predictions.

Sample Code:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import numpy as np

dataset = load_iris()
#print(dataset)
X_train, X_test, y_train, y_test = train_test_split(dataset["data"], dataset["target"], random_state=0)
kn = KNeighborsClassifier(n_neighbors=1)
kn.fit(X_train, y_train)

for i in range(len(X_test)):
    x = X_test[i]
    x_new = np.array([x])
    prediction = kn.predict(x_new)
    print("TARGET=", y_test[i], dataset["target_names"][y_test[i]],
          "PREDICTED=", prediction, dataset["target_names"][prediction])
print(kn.score(X_test, y_test))

Output:
TARGET= 2 virginica PREDICTED= [2] ['virginica']
TARGET= 1 versicolor PREDICTED= [1] ['versicolor']
TARGET= 0 setosa PREDICTED= [0] ['setosa']
TARGET= 1 versicolor PREDICTED= [2] ['virginica']
0.9736842105263158

Experiment-9:
AIM: Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.

Theory:
Model-based methods, such as neural networks and the mixture of Gaussians, use the data to build a
parameterized model. After training, the model is used for predictions and the data are generally
discarded. In contrast, "memory-based" methods are non-parametric approaches that explicitly retain
the training data, and use it each time a prediction needs to be made.
The disadvantage of global methods (like ordinary regression) is that sometimes no single set of
parameter values can provide a sufficiently good approximation. An alternative to global function
approximation is locally weighted regression (LWR). The basic idea behind LWR is that instead of
building a global model for the whole function space, a local model is created for each point of
interest from the data neighbouring the query point. For this purpose, each data point is assigned a
weighting factor which expresses its influence on the prediction. In general, data points close to
the current query point receive a higher weight than data points that are far away. LWR is also
called lazy learning because the processing of the training data is deferred until a query point
needs to be answered. This makes it easy to add new training points, and LWR can approximate a
function accurately wherever enough neighbouring data is available.
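
To make the weighting idea concrete, the following sketch fits one weighted straight line around a single query point. It is an illustration under assumptions: a Gaussian kernel with a hypothetical bandwidth `tau` is used here, which is a different weighting function from the tricube weights in the sample code, and `lwr_predict` is a made-up helper name.

```python
import math

def lwr_predict(xs, ys, query, tau):
    """Fit a weighted straight line around `query` and evaluate it there."""
    # Gaussian kernel: nearby points get weights close to 1, distant ones close to 0
    w = [math.exp(-((x - query) ** 2) / (2 * tau ** 2)) for x in xs]
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, xs))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    # Solve the 2x2 weighted normal equations for intercept b0 and slope b1
    det = sw * swxx - swx * swx
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    return b0 + b1 * query

xs = [i * 0.1 for i in range(63)]   # grid from 0.0 to 6.2
ys = [math.sin(x) for x in xs]      # noise-free sine, to keep the check clean
pred = lwr_predict(xs, ys, query=1.0, tau=0.2)
print(round(pred, 3), round(math.sin(1.0), 3))
```

A separate line is fitted for every query, so the overall curve can follow a sine wave even though each local model is linear.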

Students need to implement the non-parametric LWR algorithm on an appropriate data set and draw
graphs.

Sample Code:

from math import ceil
import math
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt

def lowess(x, y, f, iterations):
    n = len(x)
    r = int(ceil(f * n))
    # bandwidth per point: distance to its r-th nearest neighbour
    h = [np.sort(np.abs(x - x[i]))[r] for i in range(n)]
    w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)
    w = (1 - w ** 3) ** 3  # tricube weights
    yest = np.zeros(n)
    delta = np.ones(n)
    for iteration in range(iterations):
        for i in range(n):
            weights = delta * w[:, i]
            b = np.array([np.sum(weights * y), np.sum(weights * y * x)])
            A = np.array([[np.sum(weights), np.sum(weights * x)],
                          [np.sum(weights * x), np.sum(weights * x * x)]])
            beta = linalg.solve(A, b)
            yest[i] = beta[0] + beta[1] * x[i]

        # robustness step: down-weight points with large residuals
        residuals = y - yest
        s = np.median(np.abs(residuals))
        delta = np.clip(residuals / (6.0 * s), -1, 1)
        delta = (1 - delta ** 2) ** 2

    return yest

n = 100
x = np.linspace(0, 2 * math.pi, n)
y = np.sin(x) + 0.3 * np.random.randn(n)
f = 0.25
iterations = 3
yest = lowess(x, y, f, iterations)

plt.plot(x, y, "r.")
plt.plot(x, yest, "b-")
plt.show()

Output:
[Figure: noisy sample points (red dots) and the fitted LOWESS curve (blue line)]

Experiment-10:

AIM: Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program. Calculate the
accuracy, precision, and recall for your data set.

Theory:

Naive Bayes is a simple and powerful algorithm for predictive modeling. The model comprises two
types of probabilities that can be calculated directly from the training data: (i) the probability of each
class and (ii) the conditional probability of each x value given each class. Once calculated, the
probability model can be used to make predictions for new data using Bayes' theorem. Naive Bayes is
called naive because it assumes that the input features/variables are conditionally independent given
the class. This is a strong assumption and unrealistic for real data; however, the technique has proven
very effective on a large range of complex problems. The idea behind naive Bayes classification is to
classify the data by choosing the class Ci that maximizes P(O | Ci)P(Ci), which by Bayes' theorem is
proportional to the posterior probability P(Ci | O) (where O is the object or tuple in a dataset and
“i” is an index of the class).

LEARN_NAIVE_BAYES_TEXT (Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible target
values. This function learns the probability terms P(wk|vj), describing the probability that a randomly
drawn word from a document in class vj will be the English word wk. It also learns the class prior
probabilities P(vj).
1. collect all words, punctuation, and other tokens that occur in Examples
   • Vocabulary ← the set of all distinct words and other tokens occurring in any text document
     from Examples
2. calculate the required P(vj) and P(wk|vj) probability terms
   • For each target value vj in V do
       • docsj ← the subset of documents from Examples for which the target value is vj
       • P(vj) ← |docsj| / |Examples|
       • Textj ← a single document created by concatenating all members of docsj
       • n ← total number of distinct word positions in Textj
       • for each word wk in Vocabulary
           • nk ← number of times word wk occurs in Textj
           • P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
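
The learning procedure above translates almost line by line into code. The sketch below is a minimal, hedged version: the four-document corpus is made up, tokenisation is a plain whitespace split, and the `classify` helper (not part of the listing above) simply picks the class with the highest log-probability score.

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (text, label). Returns priors, word probabilities, vocabulary."""
    vocabulary = {w for text, _ in examples for w in text.split()}
    labels = {label for _, label in examples}
    priors, word_probs = {}, {}
    for vj in labels:
        docs_j = [text for text, label in examples if label == vj]
        priors[vj] = len(docs_j) / len(examples)
        counts = Counter(w for text in docs_j for w in text.split())
        n = sum(counts.values())  # total word positions in Text_j
        # Laplace-smoothed estimate (n_k + 1) / (n + |Vocabulary|)
        word_probs[vj] = {w: (counts[w] + 1) / (n + len(vocabulary))
                          for w in vocabulary}
    return priors, word_probs, vocabulary

def classify(text, priors, word_probs, vocabulary):
    scores = {}
    for vj in priors:
        # Summing logs avoids underflow; words outside the vocabulary are skipped
        scores[vj] = math.log(priors[vj]) + sum(
            math.log(word_probs[vj][w]) for w in text.split() if w in vocabulary)
    return max(scores, key=scores.get)

# Tiny made-up corpus, for illustration only
examples = [("good great film", "pos"), ("great acting", "pos"),
            ("bad boring film", "neg"), ("boring plot", "neg")]
priors, word_probs, vocab = learn_naive_bayes_text(examples)
print(classify("great film", priors, word_probs, vocab))
print(classify("boring bad plot", priors, word_probs, vocab))
```

The "+ 1" in the numerator keeps a word that never appears in a class from forcing that class's score to zero.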

Students need to implement the naïve Bayesian classification algorithm and apply it to a dataset of
text documents to classify them into various categories. They also need to calculate the accuracy,
precision, and recall for the data set.

Sample Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

msg = pd.read_csv('document.csv', names=['message', 'label'])
print("Total Instances of Dataset: ", msg.shape[0])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})

X = msg.message
y = msg.labelnum
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)

# get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())
print(df[0:5])

clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)

# pair each test document with its predicted label
for doc, p in zip(Xtest, pred):
    p = 'pos' if p == 1 else 'neg'
    print("%s -> %s" % (doc, p))

print('Accuracy Metrics: \n')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))

Output:
Accuracy Metrics:
Accuracy: 0.6
Recall: 0.5
Precision: 1.0
Confusion Matrix:
[[1 0]
[2 2]]

Experiment-11:
AIM: Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for clustering
using k-Means algorithm. Compare the results of these two algorithms and comment on the quality of
clustering. You can add Java/Python ML library classes/API in the program.

Theory:
Clustering is an unsupervised learning technique that groups data of a similar nature. It aims to find
structure (intrinsic grouping) in a collection of unlabelled data. A cluster is therefore a collection of
objects which are similar to one another and dissimilar to the objects belonging to other
clusters. Two representative clustering algorithms are the K-means algorithm and the
expectation maximization (EM) algorithm. The K-means algorithm assigns each object to the nearest
cluster centre by Euclidean distance, while EM fits a probabilistic mixture model and assigns objects
by membership probability.

K-means clustering:
Input: The number of clusters k and a database containing n objects.
Output: A set of k clusters that minimizes the squared-error criterion.
1. arbitrarily choose k objects as the initial cluster centres;
2. repeat
   a. (re)assign each object to the cluster to which the object is the most similar, based on the
      mean value of the objects in the cluster;
   b. update the cluster means, i.e. calculate the mean value of the objects for each cluster;
   until no change.
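
A minimal sketch of this loop, using only the standard library, is given below. The six 2-D points and the choice of the first and fourth point as initial centres are assumptions for the illustration; `kmeans` is a hypothetical helper, not the scikit-learn class used in the sample code.

```python
import math

def kmeans(points, centres, max_iter=100):
    """Plain k-means; `centres` holds the k initial cluster centres."""
    for _ in range(max_iter):
        # Step 2a: assign every point to its nearest centre
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            clusters[nearest].append(p)
        # Step 2b: recompute each centre as the mean of its cluster
        new_centres = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centre
            for cluster, centre in zip(clusters, centres)]
        if new_centres == centres:   # "until no change"
            break
        centres = new_centres
    return centres, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centres, clusters = kmeans(points, centres=[points[0], points[3]])
print(centres)
print(clusters)
```

An empty cluster keeps its previous centre here; other policies, such as re-seeding the centre, are equally possible.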

EM clustering:
Input: Cluster number k, a database, stopping tolerance.
Output: A set of k clusters with weights that maximize the log-likelihood function.
1. Expectation step: For each database record x, compute the membership probability of x in
   each cluster h = 1,…, k.
2. Maximization step: Update the mixture model parameters (the probability weights and, for a
   Gaussian mixture, the means and covariances).
3. Stopping criteria: If the stopping criteria are satisfied, stop; else set the iteration counter
   j = j + 1 and go to (1).
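
The three steps can be illustrated on a one-dimensional mixture of two Gaussians. This is a sketch under assumptions: the ten data values, the starting parameters and the small variance floor are made up, and `em_1d` is a hypothetical helper rather than the scikit-learn `GaussianMixture` class used in the sample code.

```python
import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_1d(data, means, variances, weights, tol=1e-6, max_iter=200):
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: membership probability of each record in each cluster
        resp = []
        for x in data:
            p = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weights, means and variances from the memberships
        for h in range(len(means)):
            nh = sum(r[h] for r in resp)
            weights[h] = nh / len(data)
            means[h] = sum(r[h] * x for r, x in zip(resp, data)) / nh
            spread = sum(r[h] * (x - means[h]) ** 2 for r, x in zip(resp, data)) / nh
            variances[h] = max(spread, 1e-6)  # floor keeps the variance positive
        # Stopping criterion: change in log-likelihood below the tolerance
        ll = sum(math.log(sum(w * gaussian(x, m, v)
                              for w, m, v in zip(weights, means, variances)))
                 for x in data)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return means, variances, weights

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

Unlike k-means, every record contributes to every cluster in proportion to its membership probability, which is what "soft" assignment means.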

Students need to implement the k-Means clustering algorithm and the EM algorithm to cluster a Heart
Disease Data Set, compare the results of these two algorithms and comment on the quality of
clustering.

Sample Code:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
import sklearn.metrics as sm
import pandas as pd
import numpy as np

# The Iris data set is used in this sample in place of the Heart Disease Data Set
iris = datasets.load_iris()

X = pd.DataFrame(iris.data)
X.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']

y = pd.DataFrame(iris.target)
y.columns = ['Targets']

model = KMeans(n_clusters=3)
model.fit(X)

plt.figure(figsize=(14, 7))

colormap = np.array(['red', 'lime', 'black'])

# Plot the Original Classifications
plt.subplot(1, 3, 1)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y.Targets], s=40)
plt.title('Real Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

# Plot the Models Classifications
plt.subplot(1, 3, 2)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[model.labels_], s=40)
plt.title('K Mean Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
# Cluster ids are arbitrary, so these raw scores depend on how the ids
# happen to line up with the class labels.
print('The accuracy score of K-Mean: ', sm.accuracy_score(y, model.labels_))
print('The Confusion matrixof K-Mean: ', sm.confusion_matrix(y, model.labels_))

scaler = preprocessing.StandardScaler()
scaler.fit(X)
xsa = scaler.transform(X)
xs = pd.DataFrame(xsa, columns=X.columns)
#xs.sample(5)

gmm = GaussianMixture(n_components=3)
gmm.fit(xs)

y_gmm = gmm.predict(xs)
#y_cluster_gmm

plt.subplot(1, 3, 3)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y_gmm], s=40)
plt.title('GMM Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

print('The accuracy score of EM: ', sm.accuracy_score(y, y_gmm))
print('The Confusion matrix of EM: ', sm.confusion_matrix(y, y_gmm))
plt.show()

Output:
The accuracy score of K-Mean: 0.24
The Confusion matrixof K-Mean: [[ 0 50 0]
[48 0 2]
[14 0 36]]
The accuracy score of EM: 0.0
The Confusion matrix of EM: [[ 0 50 0]
[ 5 0 45]
[50 0 0]]
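
The accuracy scores of 0.24 and 0.0 above are low because cluster ids are arbitrary: a cluster can match a class closely and still carry a different number. As a hedged illustration, the sketch below takes the two confusion matrices printed above and tries every relabelling of the clusters; `best_aligned_accuracy` is a hypothetical helper, and the brute-force search is only sensible for a small number of clusters.

```python
from itertools import permutations

def best_aligned_accuracy(confusion):
    """Try every cluster-to-class relabelling and return the best accuracy."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    best = max(sum(confusion[i][perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return best / total

# Confusion matrices copied from the output above (rows: true class, columns: cluster id)
kmeans_cm = [[0, 50, 0], [48, 0, 2], [14, 0, 36]]
em_cm = [[0, 50, 0], [5, 0, 45], [50, 0, 0]]

print(round(best_aligned_accuracy(kmeans_cm), 4))
print(round(best_aligned_accuracy(em_cm), 4))
```

After alignment, K-means matches 134 of the 150 samples and EM matches 145, so on this run EM grouped the data more closely to the true species than K-means did.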

Experiment-12:
AIM: Exploratory Data Analysis for Classification using Pandas or Matplotlib.

Theory:
Exploratory Data Analysis (EDA) is one of the first steps in any machine learning project. It is a
technique for analyzing data with the help of summary statistics and visual techniques. With it, we
can obtain a statistical summary of the data, deal with duplicate values and outliers, and see trends
or patterns present in the dataset. Some of the well-known aspects of EDA are:
1. Getting a quick statistical summary of the dataset
2. Checking Missing Values
3. Checking Duplicates
4. Handling Correlations
5. Data Visualization
6. Handling Outliers
7. Handling NaNs
8. Feature selection
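
Several of these checks are single calls in Pandas. The sketch below runs them on a small made-up table (the column names and values are assumptions for the illustration) that contains one missing value and one duplicated row.

```python
import numpy as np
import pandas as pd

# Small made-up table with one missing value and one duplicated row
df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 6.3, np.nan],
    "width":  [3.5, 3.0, 3.3, 3.3, 2.8],
    "label":  ["a", "a", "b", "b", "b"],
})

print(df.describe())                     # quick statistical summary
missing = df.isnull().sum()              # missing values per column
n_duplicates = df.duplicated().sum()     # fully duplicated rows
corr = df[["length", "width"]].corr()    # pairwise correlations

clean = df.drop_duplicates().dropna()    # drop duplicates, then rows with NaN
print(missing, n_duplicates, corr, clean.shape, sep="\n")
```

Whether to drop or fill missing values depends on the data set; dropping is used here only because it is the shortest option to show.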

In this experiment, students need to explore and demonstrate various built-in methods for EDA
available in Pandas and Matplotlib.

Sample Code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Reading the CSV file
df = pd.read_csv("Iris.csv")

print(df.describe())

sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',
                hue='Species', data=df)

# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()


Output:
[Figure: scatter plot of SepalLengthCm against SepalWidthCm, coloured by Species]

From the above plot, we can infer that:
• Species Setosa has smaller sepal lengths but larger sepal widths.
• Species Versicolor lies between the other two species in terms of sepal length and width.
• Species Virginica has larger sepal lengths but smaller sepal widths.

Experiment-13:

Aim: Write a Python program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease Data Set

Theory:
A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable.
A Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional
probability distributions.
• The directed acyclic graph represents a set of random variables as nodes, with edges for the
  dependencies between them.
• The conditional probability distribution of a node (random variable) is defined for every
  possible outcome of the preceding causal node(s).
For illustration, consider the following example. Suppose we attempt to turn on our computer, but the
computer does not start (observation/evidence). We would like to know which of the possible causes
of computer failure is more likely. In this simplified illustration, we assume only two possible causes
of this misfortune: electricity failure and computer malfunction. The corresponding directed acyclic
graph is depicted in the figure below.
[Figure: directed acyclic graph for the computer-failure example]

The goal is to calculate the posterior conditional probability distribution of each of the possible
unobserved causes given the observed evidence, i.e. P [Cause | Evidence].
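
The posterior P [Cause | Evidence] for the computer example can be worked out by enumerating the joint distribution. All numbers in the sketch below are hypothetical: the two prior probabilities are invented, and the evidence node is assumed to be deterministic (the computer fails to start whenever at least one cause is present).

```python
from itertools import product

# Hypothetical numbers chosen for illustration only
p_elec_fail = 0.1      # prior P(electricity failure)
p_malfunction = 0.2    # prior P(computer malfunction)

def p_no_start(elec_fail, malfunction):
    """CPD of the evidence node: the computer fails to start if either cause is present."""
    return 1.0 if (elec_fail or malfunction) else 0.0

# Joint probability of every cause combination together with the evidence
joint = {}
for e, m in product([True, False], repeat=2):
    prior = (p_elec_fail if e else 1 - p_elec_fail) * (p_malfunction if m else 1 - p_malfunction)
    joint[(e, m)] = prior * p_no_start(e, m)

evidence = sum(joint.values())  # P(computer does not start)
post_elec = sum(p for (e, m), p in joint.items() if e) / evidence
post_malf = sum(p for (e, m), p in joint.items() if m) / evidence
print(round(post_elec, 4), round(post_malf, 4))
```

With these assumed priors the malfunction is twice as likely as the electricity failure once the evidence is observed; the two posteriors add up to more than 1 because both causes can be present at the same time.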

Students need to write a Python program to construct a Bayesian network for Cleveland's Heart
Disease Data Set using pgmpy API and demonstrate the diagnosis of heart patients using the
constructed Bayesian network.
Sample Code:

import sys
import numpy as np
import pandas as pd
from urllib.request import urlopen
from pgmpy.models import BayesianModel  # renamed BayesianNetwork in later pgmpy releases
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

# Note: despite the variable name, this file is the Hungarian subset of the UCI heart-disease data
Cleveland_data_URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data'
np.set_printoptions(threshold=sys.maxsize)  # see a whole array when we output it
names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
         'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal',
         'heartdisease']
# '?' marks a missing value in this file
heartDisease = pd.read_csv(urlopen(Cleveland_data_URL), names=names, na_values='?')

model = BayesianModel([('age', 'trestbps'), ('age', 'fbs'), ('sex', 'trestbps'),
                       ('exang', 'trestbps'), ('trestbps', 'heartdisease'),
                       ('fbs', 'heartdisease'), ('heartdisease', 'restecg'),
                       ('heartdisease', 'thalach'), ('heartdisease', 'chol')])

# Learning CPDs using Maximum Likelihood Estimators
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)

# Doing exact inference using Variable Elimination
HeartDisease_infer = VariableElimination(model)

# Computing the probability of heart disease given age
q = HeartDisease_infer.query(variables=['heartdisease'], evidence={'age': 28})
print(q)  # older pgmpy releases return a dict instead; use print(q['heartdisease']) there

Output:

Experiment-14:
Aim: Write a program to Implement Support Vector Machines.

Theory:
The support vector machine (SVM) is a widely used supervised learning algorithm. It is popular because
it often achieves good accuracy with modest computation. SVM can be used for both regression and
classification tasks, but it is most widely used for classification.
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional
space (N = the number of features) that distinctly classifies the data points. To separate the two
classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to
find the hyperplane that has the maximum margin, i.e. the maximum distance to the nearest data points
of both classes. Maximizing the margin provides some robustness so that future data points can
be classified with more confidence.
Support vectors are the data points closest to the hyperplane; they determine the position and
orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier.
Deleting a support vector can change the position of the hyperplane. These are the points that help
us build our SVM.
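
One common way to find such a hyperplane is sub-gradient descent on the regularised hinge loss. The sketch below is a simplified illustration, not the solver used by scikit-learn: the six toy points, the regularisation strength `lam`, the learning rate and the number of epochs are all assumed values, and `train_linear_svm` is a hypothetical helper.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.01, epochs=200):
    """Sub-gradient descent on the regularised hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # point is inside the margin or misclassified: move the hyperplane away from it
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # point is safely classified: only the regulariser shrinks w
                w = [wi - lr * lam * wi for wi in w]
    return w, b

# Toy, linearly separable data (an assumption)
points = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)

predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1 for x in points]
margin_width = 2 / sum(wi * wi for wi in w) ** 0.5   # distance between the two margin lines
print(predictions, round(margin_width, 3))
```

Only points with a margin below 1 change the weights, which mirrors the statement that the support vectors alone determine the hyperplane.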

Students need to implement support vector machine in Python for breast cancer dataset, and
demonstrate the use of SVM in predicting if the cancer diagnosis is benign or malignant based on
several observations/features.

Sample Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline   (Jupyter-only magic; uncomment when running in a notebook)
sns.set_style('whitegrid')

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.svm import LinearSVC

cancer = load_breast_cancer()
col_names = list(cancer.feature_names)
col_names.append('target')
df = pd.DataFrame(np.c_[cancer.data, cancer.target], columns=col_names)

X = df.drop('target', axis=1)
y = df.target

print(f"'X' shape: {X.shape}")  # prints 'X' shape: (569, 30)
print(f"'y' shape: {y.shape}")  # prints 'y' shape: (569,)

# scaling pipeline (defined here but not applied to the data below)
pipeline = Pipeline([
    ('min_max_scaler', MinMaxScaler()),
    ('std_scaler', StandardScaler())
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print(" ")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print(" ")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print(" ")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print(" ")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

model = LinearSVC(loss='hinge', dual=True)
model.fit(X_train, y_train)
print_score(model, X_train, y_train, X_test, y_test, train=True)
print_score(model, X_train, y_train, X_test, y_test, train=False)

Sample Output:
Train Result:
================================================
Accuracy Score: 91.96%

CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.983471    0.891697  0.919598    0.937584      0.926054
recall       0.798658    0.991968  0.919598    0.895313      0.919598
f1-score     0.881481    0.939163  0.919598    0.910322      0.917569
support    149.000000  249.000000  0.919598  398.000000    398.000000

Confusion Matrix:
 [[119  30]
 [  2 247]]

Test Result:
================================================
Accuracy Score: 95.32%

CLASSIFICATION REPORT:
                 0.0         1.0  accuracy   macro avg  weighted avg
precision   1.000000    0.931034  0.953216    0.965517      0.956443
recall      0.873016    1.000000  0.953216    0.936508      0.953216
f1-score    0.932203    0.964286  0.953216    0.948245      0.952466
support    63.000000  108.000000  0.953216  171.000000    171.000000

Confusion Matrix:
 [[ 55   8]
 [  0 108]]

Experiment-15:
Aim: Write a program to implement Principal Component Analysis.

Theory:
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation
to convert a set of observations of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. The number of principal components is less than
or equal to the smaller of the number of original variables and the number of observations. This
transformation is defined in such a way that the first principal component has the largest possible
variance (that is, accounts for as much of the variability in the data as possible), and each succeeding
component in turn has the highest variance possible under the constraint that it is orthogonal to
preceding components.
PCA algorithm:
Step 1: Calculate the mean of each variable and centre the data
Step 2: Calculate the covariance matrix
Step 3: Compute the eigenvalues of the covariance matrix
Step 4: Compute the corresponding eigenvectors
Step 5: Compute the first principal components by projecting the data onto the leading eigenvectors
Step 6: Interpret the geometrical meaning of the first principal components
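
The first five steps can be sketched with NumPy as follows. The data is synthetic (two strongly correlated columns generated with a fixed seed, an assumption made for the demonstration), so nearly all of the variance should fall on the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, correlated 2-D data (an assumption made for the demonstration)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

# Step 1: subtract the mean of every column
centred = data - data.mean(axis=0)
# Step 2: covariance matrix of the centred data
cov = np.cov(centred, rowvar=False)
# Steps 3-4: eigenvalues and eigenvectors (eigh is intended for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# Step 5: project the data onto the first principal component
first_pc = centred @ eigenvectors[:, 0]

explained = eigenvalues / eigenvalues.sum()
print(explained.round(4))
```

Geometrically (step 6), the first eigenvector is the direction along which the centred data is most spread out, and the variance of the projected data equals the largest eigenvalue.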

Students need to implement the PCA algorithm in Python for the breast_cancer dataset, fit an SVM on
the transformed dataset, and print the confusion matrix, precision and recall metrics.

Sample Code:

# X, y and print_score() are reused from Experiment-14
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

pca = PCA(n_components=3)
scaler = StandardScaler()

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

param_grid = {'C': [0.01, 0.1, 0.5, 1, 10, 100],
              'gamma': [1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001],
              'kernel': ['rbf', 'poly', 'linear']}

# the iid argument was removed from GridSearchCV in recent scikit-learn releases
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=1, cv=5)
grid.fit(X_train, y_train)
best_params = grid.best_params_
print(f"Best params: {best_params}")

svm_clf = SVC(**best_params)
svm_clf.fit(X_train, y_train)

print_score(svm_clf, X_train, y_train, X_test, y_test, train=True)
print_score(svm_clf, X_train, y_train, X_test, y_test, train=False)

Sample Output:
Fitting 5 folds for each of 126 candidates, totalling 630 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best params: {'C': 0.5, 'gamma': 0.1, 'kernel': 'rbf'}
Train Result:
================================================
Accuracy Score: 95.98%

CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.992593    0.942966  0.959799    0.967779      0.961545
recall       0.899329    0.995984  0.959799    0.947656      0.959799
f1-score     0.943662    0.968750  0.959799    0.956206      0.959358
support    149.000000  249.000000  0.959799  398.000000    398.000000

Confusion Matrix:
 [[134  15]
 [  1 248]]

Test Result:
================================================
Accuracy Score: 93.57%

CLASSIFICATION REPORT:
                 0.0         1.0  accuracy   macro avg  weighted avg
precision   0.948276    0.929204  0.935673    0.938740      0.936230
recall      0.873016    0.972222  0.935673    0.922619      0.935673
f1-score    0.909091    0.950226  0.935673    0.929659      0.935071
support    63.000000  108.000000  0.935673  171.000000    171.000000

Confusion Matrix:
 [[ 55   8]
 [  3 105]]
