Shri Madhwa Vadiraja Institute of Technology & Management: Vishwothama Nagar, Bantakal - 574 115, Udupi Dist
LABORATORY MANUAL
for
MACHINE LEARNING LABORATORY
[As per Choice Based Credit System (CBCS) scheme]
(Effective from the academic year 2017 -2018)
Course: B. E.
Semester: VII
Subject code: 17CSL76
Introduction
Machine learning is a subset of artificial intelligence in the field of computer science that
often uses statistical techniques to give computers the ability to "learn" (i.e.,
progressively improve performance on a specific task) with data, without being explicitly
programmed. In the past decade, machine learning has given us self-driving cars,
practical speech recognition, effective web search, and a vastly improved understanding
of the human genome.
In classification, inputs are divided into two or more classes, and the learner must
produce a model that assigns unseen inputs to one or more (multi-label classification) of
these classes. This is typically tackled in a supervised manner. Spam filtering is an
example of classification, where the inputs are email (or other) messages, and the
classes are "spam" and "not spam".
In regression, also a supervised problem, the outputs are continuous rather than
discrete. In clustering, a set of inputs is to be divided into groups. Unlike in classification,
the groups are not known beforehand, making this typically an unsupervised task.
Density estimation finds the distribution of inputs in some space. Dimensionality
reduction simplifies inputs by mapping them into a lower dimensional space. Topic
modeling is a related problem, where a program is given a list of human language
documents and is tasked with finding out which documents cover similar topics.
4. Deep learning
Falling hardware prices and the development of GPUs for personal use over the last few years
have contributed to the rise of deep learning, which uses multiple hidden layers in an
artificial neural network. This approach tries to model the way the human brain processes
light and sound into vision and hearing. Some successful applications of deep learning are
computer vision and speech recognition.
7. Clustering
Cluster analysis is the assignment of a set of observations into subsets (called clusters) so
that observations within the same cluster are similar according to some predesignated
criterion or criteria, while observations drawn from different clusters are dissimilar.
Different clustering techniques make different assumptions on the structure of the data,
often defined by some similarity metric and evaluated for example by internal
compactness (similarity between members of the same cluster) and separation between
different clusters. Other methods are based on estimated density and graph connectivity.
Clustering is a method of unsupervised learning, and a common technique for statistical
data analysis.
8. Bayesian networks
A Bayesian network, belief network or directed acyclic graphical model is a probabilistic
graphical model that represents a set of random variables and their conditional
independencies via a directed acyclic graph (DAG). For example, a Bayesian network could
represent the probabilistic relationships between diseases and symptoms. Given
symptoms, the network can be used to compute the probabilities of the presence of
various diseases. Efficient algorithms exist that perform inference and learning.
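In terms of probabilities, the DAG encodes the joint distribution of the variables as a product of local conditional distributions, one factor per node given its parents (the standard Bayesian-network factorization):
P(X1, X2, ..., Xn) = P(X1/parents(X1)) × P(X2/parents(X2)) × ... × P(Xn/parents(Xn))
For the diseases-and-symptoms example, each symptom node therefore only needs a conditional probability table given the diseases that point to it, rather than a table over all variables jointly.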
9. Reinforcement learning
Reinforcement learning is concerned with how an agent ought to take actions in an
environment so as to maximize some notion of long-term reward. Reinforcement learning
algorithms attempt to find a policy that maps states of the world to the actions the agent
ought to take in those states. Reinforcement learning differs from the supervised learning
problem in that correct input/output pairs are never presented, nor sub-optimal actions
explicitly corrected.
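As a small, self-contained illustration of this idea, the sketch below runs tabular Q-learning on a hypothetical five-state chain in which only reaching the right-most state gives a reward; all names and values are illustrative, and tabular Q-learning is only one simple example of a reinforcement learning algorithm.
import random

# Hypothetical 1-D chain of 5 states; reaching the right-most state earns a reward of 1
n_states, actions = 5, [-1, +1]          # actions: move left (-1) or right (+1)
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount factor, exploration rate

for episode in range(200):
    s = 0
    while s < n_states - 1:
        # epsilon-greedy action selection: mostly exploit, sometimes explore
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# The learned greedy policy should prefer moving right (+1) in every state
print([max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)])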
1. Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training data from
a .CSV file.
2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set
of all hypotheses consistent with the training examples.
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to
classify a new sample.
4. Build an Artificial Neural Network by implementing the Back-propagation algorithm
and test the same using appropriate data sets.
5. Write a program to implement the naïve Bayesian classifier for a sample training data
set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test
data sets.
6. Assuming a set of documents that need to be classified, use the naïve Bayesian
classifier model to perform this task. Built-in Java classes/API can be used to write the
program. Calculate the accuracy, precision, and recall for your data set.
7. Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set. You can use Java/Python ML library classes/API.
8. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set
for clustering using k-Means algorithm. Compare the results of these two algorithms
and comment on the quality of clustering. You can add Java/Python ML library
classes/API in the program.
9. Write a program to implement k-Nearest Neighbor algorithm to classify the iris data
set. Print both correct and wrong predictions. Java/Python ML library classes can be
used for this problem.
10. Implement the non-parametric Locally Weighted Regression algorithm to fit data
points. Select appropriate data set for your experiment and draw graphs.
1. Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training
data from a .CSV file.
import csv

# Possible values for each attribute of the EnjoySport training data
attributes = [['Sunny', 'Rainy'],
              ['Warm', 'Cold'],
              ['Normal', 'High'],
              ['Strong', 'Weak'],
              ['Warm', 'Cool'],
              ['Same', 'Change']]
total_attributes = len(attributes)

# Read the training examples from the CSV file (file name assumed)
a = []
with open('enjoysport.csv') as csvfile:
    for row in csv.reader(csvfile):
        a.append(row)
        print(row)

# Start with the most specific hypothesis and generalize it
# using only the positive ('Yes') examples, as FIND-S requires
hypothesis = ['0'] * total_attributes
for i in range(len(a)):
    if a[i][total_attributes] == 'Yes':
        for j in range(total_attributes):
            if hypothesis[j] == '0' or hypothesis[j] == a[i][j]:
                hypothesis[j] = a[i][j]
            else:
                hypothesis[j] = '?'
        print("Hypothesis after training example", i + 1, ":", hypothesis)

print("The maximally specific hypothesis is:", hypothesis)
Output:
2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the
set of all hypotheses consistent with the training examples.
import csv

# Read the training examples from the CSV file (file name assumed)
with open('enjoysport.csv') as csvfile:
    examples = [tuple(row) for row in csv.reader(csvfile)]

def getdomains(examples):
    # Set of values observed in each column of the training data
    d = [set() for _ in examples[0]]
    for r in examples:
        for c, v in enumerate(r):
            d[c].add(v)
    return [list(sorted(x)) for x in d]

def g0(n):
    # Most general hypothesis
    return ('?',) * n

def s0(n):
    # Most specific hypothesis
    return ('0',) * n

def more_general(h1, h2):
    # True if hypothesis h1 is more general than (or equal to) h2
    mgparts = []
    for x, y in zip(h1, h2):
        mg = x == '?' or (x != '0' and (x == y or y == '0'))
        mgparts.append(mg)
    return all(mgparts)

def consistent(hypothesis, example):
    # A hypothesis covers an example if it is more general than the example
    return more_general(hypothesis, example)

def min_generalizations(s, e):
    # Minimally generalize s so that it covers the positive example e
    s_new = list(s)
    for i in range(len(s)):
        if not consistent(s[i:i+1], e[i:i+1]):
            if s[i] != '0':
                s_new[i] = '?'
            else:
                s_new[i] = e[i]
    return [tuple(s_new)]

def generalize_S(e, G, S):
    S_prev = list(S)
    for s in S_prev:
        if s not in S:
            continue
        if not consistent(s, e):
            S.remove(s)
            Splus = min_generalizations(s, e)
            # Keep only generalizations that have a counterpart in G
            S.update([h for h in Splus if any(more_general(g, h) for g in G)])
            # Remove hypotheses less specific than any other hypothesis in S
            S.difference_update([h for h in S
                                 if any(more_general(h, h1) for h1 in S if h != h1)])
    return S

def min_specializations(h, domains, e):
    # Minimally specialize h so that it no longer covers the negative example e
    results = []
    for i in range(len(h)):
        if h[i] == '?':
            for val in domains[i]:
                if e[i] != val:
                    h_new = h[:i] + (val,) + h[i+1:]
                    results.append(h_new)
        elif h[i] != '0':
            h_new = h[:i] + ('0',) + h[i+1:]
            results.append(h_new)
    return results

def specialize_G(e, domains, G, S):
    G_prev = list(G)
    for g in G_prev:
        if g not in G:
            continue
        if consistent(g, e):
            G.remove(g)
            Gminus = min_specializations(g, domains, e)
            # Keep only specializations that still cover some hypothesis in S
            G.update([h for h in Gminus if any(more_general(h, s) for s in S)])
            # Remove hypotheses less general than any other hypothesis in G
            G.difference_update([h for h in G
                                 if any(more_general(g1, h) for g1 in G if h != g1)])
    return G

def candidate_elimination(examples):
    domains = getdomains(examples)[:-1]
    G = set([g0(len(domains))])
    S = set([s0(len(domains))])
    i = 0
    print("\nInitially")
    print("G[{0}]:".format(i), G)
    print("S[{0}]:".format(i), S)
    for r in examples:
        i = i + 1
        e, label = r[:-1], r[-1]       # attributes and class label of the example
        if label == 'Yes':             # positive example (label value assumed)
            G = {g for g in G if consistent(g, e)}
            S = generalize_S(e, G, S)
        else:                          # negative example
            S = {s for s in S if not consistent(s, e)}
            G = specialize_G(e, domains, G, S)
        print("G[{0}]:".format(i), G)
        print("S[{0}]:".format(i), S)
    return

candidate_elimination(examples)
---------------------------------------------------------------------------------------------------------------------
Output:
Initially
G[0]: {('?', '?', '?', '?', '?', '?')}
S[0]: {('0', '0', '0', '0', '0', '0')}
3. Write a program to demonstrate the working of the decision tree based ID3
algorithm. Use an appropriate data set for building the decision tree and apply this
knowledge to classify a new sample.
#code
import pandas as pd
import numpy as np

# The PlayTennis data is assumed to be stored in 'PlayTennis.csv' without a header row
dataset = pd.read_csv('PlayTennis.csv',
                      names=['outlook', 'temperature', 'humidity', 'wind', 'playtennis'])

def entropy(target_col):
    # Entropy of a column of class labels
    elements, counts = np.unique(target_col, return_counts=True)
    entropy = np.sum([(-counts[i]/np.sum(counts))*np.log2(counts[i]/np.sum(counts))
                      for i in range(len(elements))])
    return entropy

def InfoGain(data, split_attribute_name, target_name="playtennis"):
    # Information gain obtained by splitting 'data' on 'split_attribute_name'
    total_entropy = entropy(data[target_name])
    vals, counts = np.unique(data[split_attribute_name], return_counts=True)
    Weighted_Entropy = np.sum(
        [(counts[i]/np.sum(counts)) *
         entropy(data.where(data[split_attribute_name] == vals[i]).dropna()[target_name])
         for i in range(len(vals))])
    Information_Gain = total_entropy - Weighted_Entropy
    return Information_Gain

def ID3(data, originaldata, features, target_attribute_name="playtennis",
        parent_node_class=None):
    # Stopping criteria
    if len(np.unique(data[target_attribute_name])) <= 1:
        # All remaining examples have the same class: return that class
        return np.unique(data[target_attribute_name])[0]
    elif len(data) == 0:
        # No examples left: return the majority class of the original data
        return np.unique(originaldata[target_attribute_name])[
            np.argmax(np.unique(originaldata[target_attribute_name], return_counts=True)[1])]
    elif len(features) == 0:
        # No features left to split on: return the parent node's majority class
        return parent_node_class
    else:
        # Majority class of the current node, passed down to the recursive calls
        parent_node_class = np.unique(data[target_attribute_name])[
            np.argmax(np.unique(data[target_attribute_name], return_counts=True)[1])]
        # Select the feature with the highest information gain
        item_values = [InfoGain(data, feature, target_attribute_name) for feature in features]
        best_feature_index = np.argmax(item_values)
        best_feature = features[best_feature_index]
        tree = {best_feature: {}}
        features = [i for i in features if i != best_feature]
        # Grow a branch for each value of the best feature
        for value in np.unique(data[best_feature]):
            sub_data = data.where(data[best_feature] == value).dropna()
            subtree = ID3(sub_data, dataset, features, target_attribute_name, parent_node_class)
            tree[best_feature][value] = subtree
        return tree

tree = ID3(dataset, dataset, dataset.columns[:-1])
print('Decision tree:\n', tree)
Output:
4. Build an Artificial Neural Network by implementing the Back-propagation algorithm
and test the same using appropriate data sets.
#code
import numpy as np

# Input features (e.g. hours slept, hours studied) and expected output (test score)
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X/np.amax(X, axis=0)    # normalise input features column-wise
y = y/100                   # scale the output to the 0-1 range

#Sigmoid Function and its derivative
def sigmoid(x):
    return 1/(1 + np.exp(-x))
def derivatives_sigmoid(x):
    return x * (1 - x)

#Variable initialization
epoch = 5000      # number of training iterations
lr = 0.1          # learning rate
inputlayer_neurons = 2
hiddenlayer_neurons = 3
output_neurons = 1
wh=np.random.uniform(size=(inputlayer_neurons,hiddenlayer_neurons))
bh=np.random.uniform(size=(1,hiddenlayer_neurons))
wout=np.random.uniform(size=(hiddenlayer_neurons,output_neurons))
bout=np.random.uniform(size=(1,output_neurons))

for i in range(epoch):
    #Forward Propagation
    hinp1=np.dot(X,wh)
    hinp=hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1=np.dot(hlayer_act,wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)
    #Backpropagation
    EO = y-output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)
    d_hiddenlayer = EH * hiddengrad
    #Update the weights
    wout += hlayer_act.T.dot(d_output) * lr
    wh += X.T.dot(d_hiddenlayer) * lr

print("Input:\n" + str(X))
print("Actual Output:\n" + str(y))
print("Predicted Output:\n", output)
Output:
Input:
[[0.66666667 1.00000000]
 [0.33333333 0.55555556]
 [1.00000000 0.66666667]]
Actual Output:
[[0.92]
 [0.86]
 [0.89]]
Predicted Output:
[[0.89592234]
 [0.88076582]
 [0.89211838]]
5. Write a program to implement the naïve Bayesian classifier for a sample training
data set stored as a .CSV file. Compute the accuracy of the classifier, considering
a few test data sets.
Bayes' Theorem
• Let D be a training set of tuples and their associated class labels. Each tuple is represented by an n-dimensional
attribute vector, X = (x1, x2, x3, ..., xn), depicting the n measurements made on the tuple from n attributes,
respectively, A1, A2, A3, ..., An.
• Suppose that there are m classes, C1, C2, C3, ..., Cm. Given a tuple X, the classifier will predict that X belongs to the
class having the highest posterior probability, conditioned on X. That is, the naive Bayesian classifier predicts that
tuple X belongs to the class Ci if and only if
P(Ci/X) > P(Cj/X) for 1 ≤ j ≤ m, j ≠ i
Thus, we maximize P(Ci/X). The class Ci for which P(Ci/X) is maximized is called the maximum posteriori hypothesis. By
Bayes' theorem we have
P(Ci/X) = P(X/Ci) P(Ci) / P(X)
i. As P(X) is constant for all classes, only P(X/Ci) P(Ci) needs to be maximized.
ii. Given data sets with many attributes, it would be extremely computationally expensive to compute
P(X/Ci). To reduce computation in evaluating P(X/Ci), the naive assumption of class-conditional
independence is made.
This presumes that the attributes' values are conditionally independent of one another, given the class label of the
tuple. Thus,
P(X/Ci) = P(x1/Ci) × P(x2/Ci) × ... × P(xn/Ci)
To compute P(X/Ci), we consider each attribute Ak in turn. If Ak is continuous-valued, it is typically assumed to have a
Gaussian distribution with mean µ and standard deviation σ, defined by
g(x, µ, σ) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²)), so that P(xk/Ci) = g(xk, µCi, σCi)
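As a brief worked illustration with hypothetical numbers (not taken from the data set used below): suppose P(C1) = 0.6 and P(C2) = 0.4, and for a test tuple X the products of the class-conditional probabilities are P(X/C1) = 0.02 and P(X/C2) = 0.05. Then P(X/C1)P(C1) = 0.012 and P(X/C2)P(C2) = 0.020, so the classifier assigns X to class C2. The program below estimates each P(xk/Ci) with the Gaussian formula above, using the per-class mean and standard deviation computed from the training data.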
import csv
import random
import math

random.seed(0)

# 1. Data Handling:
# 1.1 Loading the data from the CSV file of the Pima Indians diabetes dataset.
def loadcsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = []
    for row in lines:
        inlist = []
        for i in range(len(row)):
            inlist.append(float(row[i]))
        dataset.append(inlist)
    return dataset

# 1.2 Splitting the dataset randomly into training and test sets with the given ratio.
def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

# 2. Summarize the data.
# The naive Bayes model is comprised of a summary of the data in the training dataset.
# It involves the mean and the standard deviation for each attribute, by class value.

# Function to categorize the dataset in terms of classes. The function assumes that the
# last attribute (-1) is the class value. The function returns a map of class values to
# lists of data instances.
def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

# The mean is the middle or central tendency of the data, and we will use it as the
# centre of our Gaussian distribution when calculating probabilities.
def mean(numbers):
    return sum(numbers)/float(len(numbers))

# The standard deviation describes the variation or spread of the data, and we will use it
# to characterize the expected spread of each attribute in our Gaussian distribution when
# calculating probabilities.
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers])/float(len(numbers) - 1)
    return math.sqrt(variance)

# Summarize Dataset: compute (mean, stdev) for every attribute
def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]            # drop the summary of the class attribute
    return summaries

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

# Make Prediction: Gaussian probability density for one attribute value
def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x - mean, 2)/(2*math.pow(stdev, 2))))
    return (1/(math.sqrt(2*math.pi)*stdev))*exponent

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

# Prediction : look for the largest probability and return the associated class
def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

# Make Predictions for every instance of the test set
def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

# Computing Accuracy
def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct/float(len(testSet)))*100.0

# Main Function
def main():
    filename = 'T:\\ML\\datasheet\\PI_Diabetes.csv'
    splitRatio = 0.67
    dataset = loadcsv(filename)
    print("\n The Data Set Splitting into Training and Testing \n")
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print('Split {0} rows into train = {1} and test = {2} rows'.format(
        len(dataset), len(trainingSet), len(testSet)))
    # prepare model
    summaries = summarizeByClass(trainingSet)
    print("\nModel Summaries:\n", summaries)
    # test model
    predictions = getPredictions(summaries, testSet)
    print("\nPredictions:\n", predictions)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: {0}%'.format(accuracy))

main()
Output:
Model Summaries:
{1.0: [(4.701754385964913, 3.749344627974186), (142.9298245614035, 31.184947150709903),
(68.81871345029239, 23.193226713717014), (22.239766081871345, 17.934713233516998),
(110.09356725146199, 146.07110482316023), (35.18128654970761, 8.026522255094289),
(0.5614912280701757, 0.3747628641345956), (36.801169590643276, 11.347256669262784)],
0.0: [(3.3556851311953353, 3.006137199069943), (110.64139941690962, 26.28181125254248),
(68.9067055393586, 18.44741337335469), (20.06122448979592, 14.986320690982343),
(69.11953352769679, 97.04270626661162), (30.255976676384837, 8.076274770888682),
(0.43274635568513126, 0.3115760396097594), (31.600583090379008, 11.842321751005294)]}
Predictions:
[0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0,
0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0,
0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0,
0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0,
1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0,
0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0,
0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0,
1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0,
1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0,
1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
Accuracy: 74.40944881889764%
6. Assuming a set of documents that need to be classified, use the naïve Bayesian
classifier model to perform this task. Built-in Java classes/API can be used to write
the program. Calculate the accuracy, precision, and recall for your data set.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# Load the labelled documents (file name assumed); map pos -> 1, neg -> 0
txt = pd.read_csv('document.csv', names=['text', 'label'])
txt['labelnum'] = txt.label.map({'pos': 1, 'neg': 0})
print(txt)
X = txt.text
Y = txt.labelnum
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)
print(xtrain)
print(xtest)
# Convert the documents into a bag-of-words matrix of token counts
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)
xtest_dtm = count_vect.transform(xtest)
df = pd.DataFrame(xtrain_dtm.toarray(), columns=count_vect.get_feature_names())
print(df.columns)
print(df)
# Train a multinomial naive Bayes classifier and report its metrics
clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)
print('Accuracy:', metrics.accuracy_score(ytest, predicted))
print('Precision:', metrics.precision_score(ytest, predicted))
print('Recall:', metrics.recall_score(ytest, predicted))
print(metrics.confusion_matrix(ytest, predicted))
Output:
Index(['about', 'am', 'an', 'and', 'awesome', 'bad', 'beers', 'best', 'boss', 'can', 'dance',
'deal', 'do', 'enemy', 'feel', 'fun', 'good', 'have', 'horrible', 'house', 'is', 'juice', 'like',
'locality', 'love', 'my', 'not', 'of', 'place', 'restaurant', 'sandwich', 'sick', 'stay', 'taste',
'that', 'the', 'these', 'this', 'tired', 'to', 'today', 'tomorrow', 'very', 'view', 'we', 'went',
'what', 'will', 'with', 'work'], dtype='object')
[[2 0]
[1 2]]
7. Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set. You can use Java/Python ML library classes/API.
Install the pgmpy (Probabilistic Graphical Models) Python package using the following command
in the Anaconda prompt:
pip install pgmpy
#code
import numpy as np
import pandas as pd
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination
attributes = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
              'exang', 'oldpeak', 'slope', 'ca', 'thal', 'heartdisease']
heartDisease = pd.read_csv('heart.csv', names=attributes).replace('?', np.nan)  # file name assumed
print(heartDisease.head())
print(heartDisease.dtypes)
# Network structure (edge list assumed); CPDs learnt with Maximum Likelihood Estimators
model = BayesianModel([('age', 'trestbps'), ('age', 'fbs'), ('sex', 'trestbps'), ('exang', 'trestbps'),
                       ('trestbps', 'heartdisease'), ('fbs', 'heartdisease'),
                       ('heartdisease', 'restecg'), ('heartdisease', 'thalach'), ('heartdisease', 'chol')])
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)
HeartDisease_infer = VariableElimination(model)
# Query the probability of heart disease for some evidence (evidence values illustrative)
q = HeartDisease_infer.query(variables=['heartdisease'], evidence={'age': 28})
print(q['heartdisease'])
q = HeartDisease_infer.query(variables=['heartdisease'], evidence={'chol': 100})
print(q['heartdisease'])
Output:
8. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same
data set for clustering using k-Means algorithm. Compare the results of these two
algorithms and comment on the quality of clustering. You can add Java/Python ML
library classes/API in the program.
K-Means Algorithm
1. Decide the number of clusters, k.
2. Pick k seeds as centroids of the k clusters. The seeds may be randomly picked.
3. Compute the distance of each object in the data set from each of the centroids using a
distance function (for example, Euclidean distance).
4. Allocate each object to the cluster it is nearest to, based on the distances computed.
5. Re-compute the centroids of the clusters by computing the means of the attribute
values of the objects in each cluster.
6. Repeat steps 3, 4 and 5 until the same points are assigned to each cluster in consecutive
rounds, i.e. until the stopping criterion (unchanged cluster membership) has been met.
7. [Optional] One may decide to stop at this stage, or to split a cluster or combine two
clusters and continue. (A minimal from-scratch sketch of these steps is given below; the lab
program itself uses library implementations.)
EM algorithm
These are the two basic steps of the EM algorithm.
E Step (Expectation or Estimation step):
• Initialize the parameters (i.e. µk, ∑k and πk) with some starting values, for example
randomly or from a K-Means run.
• Then, for those given parameter values, estimate the values of the latent variables (i.e. γk).
M Step (Maximization step):
• Update the values of the parameters (i.e. µk, ∑k and πk) using the maximum-likelihood
(ML) method, based on the estimated latent variables.
• Repeat the E and M steps until the parameters (or the log-likelihood) converge.
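The following is a minimal from-scratch sketch of K-Means steps 1 to 6 in NumPy, using a tiny hypothetical data set; it is an illustration only, since the lab program below uses the scikit-learn KMeans and GaussianMixture (EM) implementations.
import numpy as np

def kmeans(X, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points of the data set as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iterations):
        # Steps 3-4: assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two obvious groups
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)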
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn import preprocessing
import sklearn.metrics as sm
import pandas as pd
import numpy as np
#%matplotlib inline

# Load the Iris data set (file and column names assumed) and map species to numeric targets
iris_dataset = pd.read_csv('Iris.csv')
iris_dataset['Targets'] = iris_dataset.Species.map({'Iris-setosa': 0, 'Iris-versicolor': 1,
                                                    'Iris-virginica': 2})
X = iris_dataset[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
Y = iris_dataset[['Targets']]

# K-Means clustering with 3 clusters
model = KMeans(n_clusters=3)
model.fit(X)

# EM clustering (Gaussian mixture model) on standardised data
scaler = preprocessing.StandardScaler()
scaler.fit(X)
xs = scaler.transform(X)
gmm = GaussianMixture(n_components=3)
gmm.fit(xs)
Y_gmm = gmm.predict(xs)

# Create a colormap and plot the real classes and both clusterings
colormap = np.array(['red', 'lime', 'black'])
plt.figure(figsize=(14, 7))
plt.subplot(2, 2, 1)
plt.scatter(X.PetalLengthCm, X.PetalWidthCm, c=colormap[Y.Targets], s=40)
plt.title('Real Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.subplot(2, 2, 2)
plt.scatter(X.PetalLengthCm, X.PetalWidthCm, c=colormap[model.labels_], s=40)
plt.title('K-Means Clustering')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.subplot(2, 2, 3)
plt.scatter(X.PetalLengthCm, X.PetalWidthCm, c=colormap[Y_gmm], s=40)
plt.title('EM (GMM) Clustering')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()
Output:
9. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data
set. Print both correct and wrong predictions. Java/Python ML library classes can be used
for this problem.
K-Nearest-Neighbour Algorithm:
1. Load the data
2. Initialize the value of k
3. To get the predicted class, iterate from 1 to the total number of training data points:
1. Calculate the distance between the test data and each row of the training data. Here we
use Euclidean distance as our distance metric since it is the most popular
method. The other metrics that can be used are Chebyshev, cosine, etc.
2. Sort the calculated distances in ascending order based on distance values
3. Get top k rows from the sorted array
4. Get the most frequent class of these rows, i.e. get the labels of the selected K
entries
5. Return the predicted class:
If regression, return the mean of the K labels
If classification, return the mode of the K labels
(A minimal from-scratch sketch of these steps is given below, followed by the sklearn-based lab program.)
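The steps above can be expressed directly as a short from-scratch sketch (a minimal illustration using NumPy and hypothetical data; the lab program itself uses the scikit-learn KNeighborsClassifier).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Step 1: Euclidean distance from the test point to every training point
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Steps 2-3: sort by distance and take the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: return the most frequent class label among those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny hypothetical data set (not the iris measurements)
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))   # expected class: 0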
Confusion matrix:
Note,
Class 1: Positive
Class 2: Negative
• Positive (P): Observation is positive (for example: is an apple).
• Negative (N): Observation is not positive (for example: is not an apple).
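From the confusion matrix, with TP and TN denoting the correctly classified positive and negative observations and FP and FN denoting the misclassified ones, the metrics used in these exercises are computed as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)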
#code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import pandas as pd

dataset = pd.read_csv("iris.csv")
X = dataset.iloc[:, :-1].values    # attribute columns (the class label is assumed to be last)
y = dataset.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)
classifier = KNeighborsClassifier(n_neighbors=8, p=3, metric='euclidean')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Print correct and wrong predictions along with the evaluation metrics
for actual, predicted in zip(y_test, y_pred):
    print('Correct' if actual == predicted else 'Wrong', '-> actual:', actual, 'predicted:', predicted)
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix\n', cm)
print('Accuracy Metrics')
print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
Output:
[[13 0 0]
[ 0 15 1]
[ 0 0 9]]
Accuracy Metrics
10. Implement the non-parametric Locally Weighted Regression algorithm to fit data
points. Select an appropriate data set for your experiment and draw graphs.
• Regression is a technique from statistics that is used to predict values of a desired target
quantity when the target quantity is continuous.
• In regression, we seek to identify (or estimate) a continuous variable y associated with a
given input vector x.
• y is called the dependent variable.
• x is called the independent variable.
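In locally weighted regression, each query point x gets its own weighted least-squares fit: training points near x receive large weights and distant points receive weights near zero. With a Gaussian kernel of bandwidth k, the weights and local coefficients used in the program below are
w_i = exp(-(x - x_i)^2 / (2k^2))
β(x) = (X^T W X)^(-1) X^T W y, and ŷ(x) = x·β(x)
where W is the diagonal matrix of the weights w_i, X is the design matrix with a leading column of ones, and y is the vector of target values.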
#code
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Weight matrix: training points close to the query point get weights near 1
def kernel(point, xmat, k):
    m, n = np.shape(xmat)
    weights = np.mat(np.eye(m))
    for j in range(m):
        diff = point - xmat[j]
        weights[j, j] = np.exp(diff*diff.T/(-2.0*k**2))
    return weights

# Locally weighted regression coefficients for a single query point
def localWeight(point, xmat, ymat, k):
    wei = kernel(point, xmat, k)
    W = (xmat.T*(wei*xmat)).I*(xmat.T*(wei*ymat.T))
    return W

def localWeightRegression(xmat, ymat, k):
    m, n = np.shape(xmat)
    ypred = np.zeros(m)
    for i in range(m):
        ypred[i] = xmat[i]*localWeight(xmat[i], xmat, ymat, k)
    return ypred

def graphPlot(X, ypred):
    sortindex = X[:, 1].argsort(0)
    xsort = X[sortindex][:, 0]
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(bill, tip, color='green')
    ax.plot(xsort[:, 1], ypred[sortindex], color='red', linewidth=5)
    plt.xlabel('Total bill')
    plt.ylabel('Tip')
    plt.show()

# Load the restaurant tips data set (file and column names assumed)
data = pd.read_csv('Tips.csv')
bill = np.array(data.total_bill)
tip = np.array(data.tip)
mbill = np.mat(bill)
mtip = np.mat(tip)
m = np.shape(mbill)[1]
one = np.mat(np.ones(m))
X = np.hstack((one.T, mbill.T))    # design matrix with a column of ones for the intercept

# Fit with bandwidth k = 3 and plot the locally weighted regression curve
ypred = localWeightRegression(X, mtip, 3)
graphPlot(X, ypred)
Output: