Updated ML LAB Manual-2020-21
Address: 181/1, 182/1, Sonnenahalli, Hoodi, K.R.Puram, Whitefield, Bangalore, Karnataka - 560 048
Phone No: (080) - 42229748 Email: [email protected] Website: www.gopalancolleges.com/gcem
Girish, Asst. Professor, Dept. of CSE, GCEM
Dr. J. Somasekar, Head of the Department, Dept. of CSE, GCEM
Dr. N. Sengottaiyan, Principal, GCEM
TABLE OF CONTENTS
1. SYLLABUS
2. STUDY EXPERIMENT
3. COURSE OBJECTIVE AND COURSE OUTCOME
4. COMPUTER LAB - DO'S AND DON'TS
5. LIST OF EXPERIMENTS
Prerequisites
• Programming experience in Python
• Knowledge of basic Machine Learning Algorithms
• Knowledge of common statistical methods and data analysis best practices.
Software Requirements
1. Python version 3.5 and above
2. Machine Learning packages
Scikit-Learn
Numpy - matrices and linear algebra
Scipy - many numerical routines
Matplotlib - creating plots of data
Pandas - facilitates structured/tabular data manipulation and visualisations
Pomegranate - fast and flexible probabilistic models
3. An Integrated Development Environment (IDE) for Python Programming
Anaconda
A Python distribution bundles the Python interpreter together with a list of Python packages and tools such as editors. Anaconda is one of several Python distributions: a distribution of Python and R for data science, developed by Continuum Analytics (now Anaconda, Inc.). Anaconda ships with more than 100 packages and is used for scientific computing, data science, statistical analysis, and machine learning.
Operating System
Windows/Linux
The Anaconda Python distribution is available for both Windows and Linux.
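As a quick sanity check after installation, a short script along these lines (a minimal sketch; it only assumes the packages listed above are installed) confirms that the interpreter and the required packages are available:

import sys
# Print the interpreter version and confirm that each required package imports
import numpy, scipy, matplotlib, pandas, sklearn
print("Python :", sys.version.split()[0])  # expect 3.5 or above
print("NumPy  :", numpy.__version__)
print("SciPy  :", scipy.__version__)
print("Pandas :", pandas.__version__)
print("sklearn:", sklearn.__version__)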
COURSE OBJECTIVES
This course will enable students to:
1. Make use of data sets in implementing the machine learning algorithms.
2. Implement the machine learning concepts and algorithms in any suitable language of choice.
COURSE OUTCOMES
After studying this course, the students will be able to:
1. Understand the implementation procedures for the machine learning algorithms.
2. Design Java/Python programs for various learning algorithms.
3. Apply appropriate data sets to the machine learning algorithms.
4. Identify and apply machine learning algorithms to solve real world problems.
DO’S
1. Know the location of the fire extinguisher and the first aid box and how to use them in
case of emergency.
2. Read and understand how to carry out an activity thoroughly before coming to the
laboratory.
3. Report fires or accidents to your lecturer/ laboratory technician immediately.
4. Report any broken plugs or exposed electrical wires to your lecturer/ laboratory
technician immediately.
DON’Ts
LIST OF EXPERIMENTS
Program 1
Implement and demonstrate the FIND- S Algorithm for finding the most specific hypothesis
based on a given set of training data samples. Read the training data from a .CSV file.
Objective: To find the most specific hypothesis in a set of hypotheses that is consistent with the positive training examples.
Dataset: Tennis data set: This data set contains the set of example days on which playing tennis is possible or not, based on the attributes Sky, AirTemp, Humidity, Wind, Water and Forecast.
ML algorithm: Supervised Learning - FIND-S algorithm
Description: The FIND-S algorithm is probably one of the simplest machine learning algorithms.
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x:
   For each attribute constraint ai in h:
      If the constraint ai is satisfied by x, then do nothing.
      Else replace ai in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.
Program
import csv

# Load the training data; the file name matches the data set shown below
a = []
with open('enjoysport.csv', 'r') as csvfile:
    for row in csv.reader(csvfile):
        a.append(row)
print(a)
print("\n The total number of training instances are : ", len(a))

num_attribute = len(a[0]) - 1
print("\n The initial hypothesis is : ")
hypothesis = ['0'] * num_attribute
print(hypothesis)

# FIND-S: generalize the hypothesis over every positive training instance
for row in a:
    if row[num_attribute].lower() == 'yes':
        for j in range(num_attribute):
            if hypothesis[j] == '0':
                hypothesis[j] = row[j]
            elif hypothesis[j] != row[j]:
                hypothesis[j] = '?'

print("\n The Maximally specific hypothesis for the training instance is ")
print(hypothesis)
Data Set:
sunny warm normal strong warm same yes
sunny warm high strong warm same yes
rainy cold high strong warm change no
sunny warm high strong cool change yes
Output:
Conclusion:
Thus the FIND-S algorithm for finding the most specific hypothesis based on a given set of training data samples was implemented and demonstrated successfully.
Program 2
For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set of all
hypotheses consistent with the training examples.
Objective: To output a description of the set of all hypotheses consistent with the given training examples (the version space).
Dataset: Tennis data set: This data set contains the set of example days on which playing tennis is possible or not, based on the attributes Sky, AirTemp, Humidity, Wind, Water and Forecast. The data set used here has 4 instances.
ML algorithm: Supervised Learning - Candidate-Elimination algorithm
Description: The Candidate-Elimination algorithm computes the version space: the set of all hypotheses consistent with the training examples. For each training example d:
• If d is a positive example
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
• Remove s from S
• Add to S all minimal generalizations h of s such that
• h is consistent with d, and some member of G is more general than h
• Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
• Remove g from G
• Add to G all minimal specializations h of g such that
• h is consistent with d, and some member of S is more specific than h
• Remove from G any hypothesis that is less general than another hypothesis in G
Program
import numpy as np
import pandas as pd
data = pd.read_csv('enjoysport.csv')
concepts = np.array(data.iloc[:,0:-1])
print(concepts)
target = np.array(data.iloc[:,-1])
print(target)
def learn(concepts, target):
    specific_h = concepts[0].copy()
    print("initialization of specific_h and general_h")
    print(specific_h)
    general_h = [["?" for i in range(len(specific_h))] for i in range(len(specific_h))]
    print(general_h)
    for i, h in enumerate(concepts):
        print("For Loop Starts")
        if target[i] == "yes":
            print("If instance is Positive ")
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        if target[i] == "no":
            print("If instance is Negative ")
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'
        print("steps of Candidate Elimination Algorithm", i + 1)
        print(specific_h)
        print(general_h)
        print("\n")
    # discard rows of general_h that remained fully general
    indices = [i for i, val in enumerate(general_h) if val == ['?', '?', '?', '?', '?', '?']]
    for i in indices:
        general_h.remove(['?', '?', '?', '?', '?', '?'])
    return specific_h, general_h

s_final, g_final = learn(concepts, target)
print("Final Specific_h:")
print(s_final)
print("Final General_h:")
print(g_final)
Data Set:
sunny warm normal strong warm same yes
sunny warm high strong warm same yes
rainy cold high strong warm change no
sunny warm high strong cool change yes
Output:
[['sunny' 'warm' 'normal' 'strong' 'warm' 'same']
['sunny' 'warm' 'high' 'strong' 'warm' 'same']
['rainy' 'cold' 'high' 'strong' 'warm' 'change']
['sunny' 'warm' 'high' 'strong' 'cool' 'change']]
['yes' 'yes' 'no' 'yes']
initialization of specific_h and general_h
['sunny' 'warm' 'normal' 'strong' 'warm' 'same']
[['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?',
'?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?']]
For Loop Starts
If instance is Positive
steps of Candidate Elimination Algorithm 1
['sunny' 'warm' 'normal' 'strong' 'warm' 'same']
[['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?',
'?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?']]
Final Specific_h:
['sunny' 'warm' '?' 'strong' '?' '?']
Final General_h:
[['sunny', '?', '?', '?', '?', '?'], ['?', 'warm', '?', '?', '?', '?']]
Conclusion:
Thus the Candidate-Elimination algorithm was implemented and the set of all hypotheses consistent with the training examples was output successfully.
Program 3
Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use
an appropriate data set for building the decision tree and apply this knowledge to classify a
new sample.
Objective: To demonstrate the working of the decision tree based ID3 algorithm.
Dataset: Tennis data set: This data set contains example days on which playing tennis is possible or not, based on the attributes Outlook, Temperature, Humidity and Wind.
ML algorithm: Supervised Learning - Decision Tree (ID3) algorithm
Description Decision tree builds classification or regression models in the form of a
tree structure. It breaks down a dataset into smaller and smaller subsets
while at the same time an associated decision tree is incrementally
developed. The final result is a tree with decision nodes and leaf nodes.
ID3 Algorithm
ID3(Examples, Target_attribute, Attributes)
Examples are the training examples. Target_attribute is the attribute whose value is to be predicted by the tree. Attributes is a list of other attributes that may be tested by the learned decision tree. Returns a decision tree that correctly classifies the given Examples.
• Create a Root node for the tree
• If all Examples are positive, Return the single-node tree Root, with label = +
• If all Examples are negative, Return the single-node tree Root, with label = -
• If Attributes is empty, Return the single-node tree Root, with label = most common value of Target_attribute in Examples
• Otherwise Begin
• A ← the attribute from Attributes that best* classifies Examples (*best = the attribute with the highest information gain)
• The decision attribute for Root ← A
• For each possible value, vi, of A,
• Add a new tree branch below Root, corresponding to the test A = vi
• Let Examples_vi be the subset of Examples that have value vi for A
• If Examples_vi is empty
• Then below this new branch add a leaf node with label = most common value of Target_attribute in Examples
• Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A})
• End
• Return Root
Program:
import math
import csv

def load_csv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    headers = dataset.pop(0)
    return dataset, headers

class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""

def subtables(data, col, delete):
    dic = {}
    coldata = [row[col] for row in data]
    attr = list(set(coldata))
    counts = [0] * len(attr)
    r = len(data)
    c = len(data[0])
    for x in range(len(attr)):
        for y in range(r):
            if data[y][col] == attr[x]:
                counts[x] += 1
    # build one subtable per attribute value, optionally deleting the split column
    for x in range(len(attr)):
        dic[attr[x]] = [[0 for i in range(c)] for j in range(counts[x])]
        pos = 0
        for y in range(r):
            if data[y][col] == attr[x]:
                if delete:
                    del data[y][col]
                dic[attr[x]][pos] = data[y]
                pos += 1
    return attr, dic
def entropy(S):
    attr = list(set(S))
    if len(attr) == 1:
        return 0
    counts = [0, 0]
    for i in range(2):
        counts[i] = sum([1 for x in S if attr[i] == x]) / (len(S) * 1.0)
    sums = 0
    for cnt in counts:
        sums += -1 * cnt * math.log(cnt, 2)
    return sums
def compute_gain(data, col):
    attr, dic = subtables(data, col, delete=False)
    total_size = len(data)
    entropies = [0] * len(attr)
    ratio = [0] * len(attr)
    # gain = entropy of the whole set minus the weighted entropy of each subset
    total_entropy = entropy([row[-1] for row in data])
    for x in range(len(attr)):
        ratio[x] = len(dic[attr[x]]) / (total_size * 1.0)
        entropies[x] = entropy([row[-1] for row in dic[attr[x]]])
        total_entropy -= ratio[x] * entropies[x]
    return total_entropy
def build_tree(data, features):
    lastcol = [row[-1] for row in data]
    if (len(set(lastcol))) == 1:
        node = Node("")
        node.answer = lastcol[0]
        return node
    n = len(data[0]) - 1
    gains = [0] * n
    for col in range(n):
        gains[col] = compute_gain(data, col)
    split = gains.index(max(gains))
    node = Node(features[split])
    fea = features[:split] + features[split+1:]
    attr, dic = subtables(data, split, delete=True)
    for x in range(len(attr)):
        child = build_tree(dic[attr[x]], fea)
        node.children.append((attr[x], child))
    return node
def print_tree(node, level):
    if node.answer != "":
        print(" " * level, node.answer)
        return
    print(" " * level, node.attribute)
    for value, n in node.children:
        print(" " * (level + 1), value)
        print_tree(n, level + 2)

def classify(node, x_test, features):
    if node.answer != "":
        print(node.answer)
        return
    pos = features.index(node.attribute)
    for value, n in node.children:
        if x_test[pos] == value:
            classify(n, x_test, features)
'''Main program'''
dataset, features = load_csv("id3.csv")
node1 = build_tree(dataset, features)
print("The decision tree for the dataset using ID3 algorithm is")
print_tree(node1, 0)
testdata, features = load_csv("id3_test.csv")
for xtest in testdata:
    print("The test instance:", xtest)
    print("The label for test instance:", end=" ")
    classify(node1, xtest, features)
Training Dataset:
Test Dataset:
INFORMATION GAIN:
Entropy(S) = Σi − pi log2 pi
Gain(S, A) = Entropy(S) − Σ v ∈ Values(A) (|Sv| / |S|) Entropy(Sv)
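As a quick worked check of the entropy formula (assuming the standard 14-instance play-tennis data with 9 positive and 5 negative examples), the following snippet reproduces the well-known value of about 0.94:

import math
# Entropy of a 14-instance target attribute with 9 "yes" and 5 "no" labels
p_yes, p_no = 9 / 14, 5 / 14
entropy_S = -p_yes * math.log2(p_yes) - p_no * math.log2(p_no)
print(round(entropy_S, 3))  # 0.94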
OUTPUT
The decision tree for the dataset using ID3 algorithm is
Outlook
rain
Wind
strong
no
weak
yes
sunny
Humidity
high
no
normal
yes
overcast
yes
The test instance: ['rain', 'cool', 'normal', 'strong']
The label for test instance: no
The test instance: ['sunny', 'mild', 'normal', 'strong']
The label for test instance: yes
Conclusion:
Thus the working of the decision tree based ID3 algorithm was demonstrated successfully.
Program 4
Build an Artificial Neural Network by implementing the Backpropagation algorithm and
test the same using appropriate data sets.
Objective: To build an artificial neural network using the backpropagation algorithm.
Dataset: Data stored as a list having two features (number of hours slept and number of hours studied), with the test score as the class label.
ML algorithm: Supervised Learning - Backpropagation algorithm
Description: The neural network using backpropagation will model a single hidden layer with three neurons, two inputs and one output. The network will predict the score of an exam based on the number of hours studied and the number of hours slept the day before. The test score is the output.
BACKPROPAGATION Algorithm
• Create a feed-forward network with n_in inputs, n_hidden hidden units, and n_out output units.
• Initialize all network weights to small random numbers (typically between -1 and 1).
• Until the termination condition is met, Do
• For each (x, t) in training examples, Do
Propagate the input forward through the network:
1. Input the instance x to the network and compute the output o_u of every unit u in the network.
Propagate the errors backward through the network:
2. For each network output unit k, calculate its error term δk ← ok(1 − ok)(tk − ok)
3. For each hidden unit h, calculate its error term δh ← oh(1 − oh) Σk (wkh δk)
4. Update each network weight wji ← wji + Δwji, where Δwji = η δj xji
Training Examples:
Example   Sleep   Study   Expected % in Exams
1         2       9       92
2         1       5       86
3         3       6       89
Program:
import numpy as np
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X/np.amax(X,axis=0) # maximum of X array longitudinally
y = y/100
#Sigmoid Function
def sigmoid (x):
return 1/(1 + np.exp(-x))
#Derivative of Sigmoid Function
def derivatives_sigmoid(x):
return x * (1 - x)
#Variable initialization
epoch=7000 #Setting training iterations
lr=0.1 #Setting learning rate
inputlayer_neurons = 2 #number of features in data set
hiddenlayer_neurons = 3 #number of hidden layers neurons
output_neurons = 1 #number of neurons at output layer
#weight and bias initialization
wh=np.random.uniform(size=(inputlayer_neurons,hiddenlayer_neurons))
bh=np.random.uniform(size=(1,hiddenlayer_neurons))
wout=np.random.uniform(size=(hiddenlayer_neurons,output_neurons))
bout=np.random.uniform(size=(1,output_neurons))
#draws a random range of numbers uniformly of dim x*y
for i in range(epoch):
    # Forward Propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)
    # Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)  # how much hidden layer weights contributed to error
    d_hiddenlayer = EH * hiddengrad
    wout += hlayer_act.T.dot(d_output) * lr  # dot product of next-layer error and current-layer output
    # bout += np.sum(d_output, axis=0, keepdims=True) * lr
    wh += X.T.dot(d_hiddenlayer) * lr
    # bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * lr

print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)
OUTPUT:
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.89533147]
[0.88083584]
[0.89416396]]
Conclusion:
Thus the Backpropagation algorithm was implemented and tested using an appropriate data set successfully.
Program 5
Write a program to implement the naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
Bayes theorem provides a way to calculate the probability of a hypothesis h given the observed data D:

P(h|D) = P(D|h) P(h) / P(D)

Where,
P(h|D) is the probability of hypothesis h given the data D. This is called the posterior probability.
P(D|h) is the probability of data D given that the hypothesis h was true.
P(h) is the probability of hypothesis h being true. This is called the prior probability of h.
P(D) is the probability of the data. This is called the prior probability of D.
After calculating the posterior probability for a number of different hypotheses h, we are interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis. Using Bayes theorem, the MAP hypothesis is

hMAP = argmax h ∈ H P(D|h) P(h)

(The P(D) term can be dropped because it is a constant independent of h.)
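As a small illustration of the MAP rule (a sketch only; the prior and likelihood numbers below are made up for demonstration and are not part of the lab data):

# Two hypothetical hypotheses with assumed priors P(h) and likelihoods P(D|h)
priors = {'h1': 0.6, 'h2': 0.4}
likelihoods = {'h1': 0.2, 'h2': 0.5}

# MAP picks the hypothesis maximizing P(D|h) * P(h); P(D) is a common constant
posteriors = {h: likelihoods[h] * priors[h] for h in priors}
h_map = max(posteriors, key=posteriors.get)
print(h_map)  # 'h2', since 0.5 * 0.4 = 0.20 > 0.2 * 0.6 = 0.12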
Program
import csv
import random
import math
def loadCsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        # converting strings into numbers for processing
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
def splitDataset(dataset, splitRatio):
    # the split ratio decides how many rows go into the training set
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        # generate indices for the dataset list randomly to pick elements for training data
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]
def separateByClass(dataset):
    separated = {}
    # creates a dictionary of classes 1 and 0 where the values are the instances belonging to each class
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated
def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        # summaries is a dict of tuples (mean, std) for each class value
        summaries[classValue] = summarize(instances)
    return summaries
def calculateProbability(x, mean, stdev):
    # Gaussian probability density for attribute value x
    exponent = math.exp(-(pow(x - mean, 2) / (2 * pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean_, stdev_ = classSummaries[i]
            probabilities[classValue] *= calculateProbability(inputVector[i], mean_, stdev_)
    return probabilities

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

def getPredictions(summaries, testSet):
    return [predict(summaries, testSet[i]) for i in range(len(testSet))]

def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    filename = 'naivedata.csv'
    splitRatio = 0.67
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
    summaries = summarizeByClass(trainingSet)
    predictions = getPredictions(summaries, testSet)
    print('Accuracy of the classifier is : {0}%'.format(getAccuracy(testSet, predictions)))

main()
OUTPUT:
Conclusion:
Thus the naïve Bayesian classifier was implemented and its accuracy was computed successfully.
Program 6
Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program.
Calculate the accuracy, precision, and recall for your data set.
Objective: To implement a binary classification model that classifies a set of documents and calculates the accuracy, precision and recall for the dataset.
Dataset: Contains text as sentences labelled positive and negative. The dataset contains a total of 10 instances.
ML algorithm: Supervised Learning - Naïve Bayes algorithm
Packages: scikit-learn (sklearn), pandas
Description: The Naïve Bayes classifier is a probabilistic classifier based on Bayes theorem. The algorithm builds a model assuming that the attributes in the dataset are independent of each other.
LEARN_NAIVE_BAYES_TEXT (Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms P(wk|vj), describing the probability that a randomly drawn word from a document in class vj will be the English word wk. It also learns the class prior probabilities P(vj).
1. Collect all words, punctuation, and other tokens that occur in Examples
• Vocabulary ← the set of all distinct words and other tokens occurring in any text document from Examples
2. Calculate the required P(vj) and P(wk|vj) probability terms. For each target value vj in V do
• docsj ← the subset of documents from Examples for which the target value is vj
• P(vj) ← |docsj| / |Examples|
• Textj ← a single document created by concatenating all members of docsj
• n ← total number of word positions in Textj
• For each word wk in Vocabulary
• nk ← number of times word wk occurs in Textj
• P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
CLASSIFY_NAIVE_BAYES_TEXT (Doc)
Return the estimated target value for the document Doc. ai denotes the word found in the
ith position within Doc.
• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return vNB, where
  vNB = argmax vj ∈ V  P(vj) Π i ∈ positions P(ai|vj)
Dataset/Examples:
Program
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

msg = pd.read_csv('naivetext6.csv', names=['message', 'label'])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum

# split into train and test sets and map the text to word-count features
xtrain, xtest, ytrain, ytest = train_test_split(X, y)
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)
xtest_dtm = count_vect.transform(xtest)
print(count_vect.get_feature_names())

# train the multinomial naive Bayes classifier and evaluate it
clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)
print('Confusion matrix')
print(metrics.confusion_matrix(ytest, predicted))
print('Accuracy:', metrics.accuracy_score(ytest, predicted))
print('Precision:', metrics.precision_score(ytest, predicted))
print('Recall:', metrics.recall_score(ytest, predicted))
OUTPUT:
['about', 'an', 'awesome', 'bad', 'beers', 'boss', 'dance', 'do', 'enemy', 'feel', 'fun', 'good', 'great',
'have', 'he', 'holiday', 'horrible', 'house', 'is', 'juice', 'like', 'locality', 'love', 'my', 'not', 'of', 'place',
'restaurant', 'sandwich', 'stay', 'sworn', 'taste', 'that', 'the', 'these', 'this', 'to', 'today', 'tomorrow',
'very', 'view', 'we', 'went', 'what', 'will']
Confusion matrix
[[1 2]
[1 1]]
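Reading this matrix with scikit-learn's convention (rows are actual classes, columns are predicted classes, ordered 0 = negative then 1 = positive) gives TN = 1, FP = 2, FN = 1, TP = 1. From these counts: accuracy = (1 + 1)/5 = 0.4, precision = TP/(TP + FP) = 1/3 ≈ 0.33, and recall = TP/(TP + FN) = 1/2 = 0.5.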
Conclusion:
Thus the naïve Bayesian classifier model was used to classify the documents, and the accuracy, precision and recall for the data set were calculated successfully.
Program 7
Write a program to construct a Bayesian network considering medical data. Use this model
to demonstrate the diagnosis of heart patients using standard Heart Disease Data Set. You
can use Java/Python ML library classes/API.
import bayespy as bp
import numpy as np
import csv
from colorama import init
from colorama import Fore, Back, Style
init()
ageEnum = {'SuperSeniorCitizen':0, 'SeniorCitizen':1, 'MiddleAged':2, 'Youth':3, 'Teen':4}
genderEnum = {'Male':0, 'Female':1}
familyHistoryEnum = {'Yes':0, 'No':1}
dietEnum = {'High':0, 'Medium':1, 'Low':2}
lifeStyleEnum = {'Athlete':0, 'Active':1, 'Moderate':2, 'Sedetary':3}
cholesterolEnum = {'High':0, 'BorderLine':1, 'Normal':2}
heartDiseaseEnum = {'Yes':0, 'No':1}
with open('heart_disease_data.csv') as csvfile:
    lines = csv.reader(csvfile)
    dataset = list(lines)
    data = []
    for x in dataset:
        data.append([ageEnum[x[0]], genderEnum[x[1]], familyHistoryEnum[x[2]],
                     dietEnum[x[3]], lifeStyleEnum[x[4]], cholesterolEnum[x[5]],
                     heartDiseaseEnum[x[6]]])
data = np.array(data)
N = len(data)
p_age = bp.nodes.Dirichlet(1.0*np.ones(5))
age = bp.nodes.Categorical(p_age, plates=(N,))
age.observe(data[:,0])
p_gender = bp.nodes.Dirichlet(1.0*np.ones(2))
gender = bp.nodes.Categorical(p_gender, plates=(N,))
gender.observe(data[:,1])
p_familyhistory = bp.nodes.Dirichlet(1.0*np.ones(2))
familyhistory = bp.nodes.Categorical(p_familyhistory, plates=(N,))
familyhistory.observe(data[:,2])
p_diet = bp.nodes.Dirichlet(1.0*np.ones(3))
diet = bp.nodes.Categorical(p_diet, plates=(N,))
diet.observe(data[:,3])
p_lifestyle = bp.nodes.Dirichlet(1.0*np.ones(4))
lifestyle = bp.nodes.Categorical(p_lifestyle, plates=(N,))
lifestyle.observe(data[:,4])
p_cholesterol = bp.nodes.Dirichlet(1.0*np.ones(3))
cholesterol = bp.nodes.Categorical(p_cholesterol, plates=(N,))
cholesterol.observe(data[:,5])
p_heartdisease = bp.nodes.Dirichlet(np.ones(2), plates=(5, 2, 2, 3, 4, 3))
heartdisease = bp.nodes.MultiMixture([age, gender, familyhistory, diet, lifestyle, cholesterol],
bp.nodes.Categorical, p_heartdisease)
heartdisease.observe(data[:,6])
p_heartdisease.update()
m = 0
while m == 0:
    print("\n")
    res = bp.nodes.MultiMixture([int(input('Enter Age: ' + str(ageEnum))),
                                 int(input('Enter Gender: ' + str(genderEnum))),
                                 int(input('Enter FamilyHistory: ' + str(familyHistoryEnum))),
                                 int(input('Enter dietEnum: ' + str(dietEnum))),
                                 int(input('Enter LifeStyle: ' + str(lifeStyleEnum))),
                                 int(input('Enter Cholesterol: ' + str(cholesterolEnum)))],
                                bp.nodes.Categorical, p_heartdisease).get_moments()[0][heartDiseaseEnum['Yes']]
    print("Probability(HeartDisease) = " + str(res))
    m = int(input("Enter for Continue:0, Exit :1 "))
INPUT:
SuperSeniorCitizen,Male,Yes,Medium,Sedetary,High,Yes
SuperSeniorCitizen,Female,Yes,Medium,Sedetary,High,Yes
SeniorCitizen,Male,No,High,Moderate,BorderLine,Yes
Teen,Male,Yes,Medium,Sedetary,Normal,No
Youth,Female,Yes,High,Athlete,Normal,No
MiddleAged,Male,Yes,Medium,Active,High,Yes
Teen,Male,Yes,High,Moderate,High,Yes
SuperSeniorCitizen,Male,Yes,Medium,Sedetary,High,Yes
Youth,Female,Yes,High,Athlete,Normal,No
SeniorCitizen,Female,No,High,Athlete,Normal,Yes
Teen,Female,No,Medium,Moderate,High,Yes
Teen,Male,Yes,Medium,Sedetary,Normal,No
MiddleAged,Female,No,High,Athlete,High,No
MiddleAged,Male,Yes,Medium,Active,High,Yes
Youth,Female,Yes,High,Athlete,BorderLine,No
SuperSeniorCitizen,Male,Yes,High,Athlete,Normal,Yes
SeniorCitizen,Female,No,Medium,Moderate,BorderLine,Yes
Youth,Female,Yes,Medium,Athlete,BorderLine,No
Teen,Male,Yes,Medium,Sedetary,Normal,No
OUTPUT:
Conclusion:
Thus a Bayesian network was constructed from the medical data and used to demonstrate the diagnosis of heart patients successfully.
Program 8
Apply EM algorithm to cluster a set of data stored in a .csv file. Use the same data set for
clustering k-means algorithm. Compare the results of these two algorithms and comment on
the quality of clustering. You can add Java/Python ML library classes/API in the program.
Objective: To group a set of unlabelled data into similar classes/clusters, label them, and compare the quality of the two clustering algorithms.
Dataset: Delivery fleet driver data set: a .csv file with the features "Driver_ID", "distance_feature" and "speeding_feature", having more than 20 instances.
ML algorithm: EM algorithm, K-Means algorithm - unsupervised clustering
Packages: scikit-learn (sklearn), pandas
Description: The EM algorithm (soft clustering) can be used for variables whose values are never directly observed, provided the general probability distribution governing these variables is known. The EM algorithm can also be used to train Bayesian belief networks as well as radial basis function networks.
K-Means (hard clustering) finds groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
Algorithm
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
Program
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# load the driver data; the file name is illustrative, the column names
# follow the dataset description above
data = pd.read_csv('driverdata.csv')
f1 = data['distance_feature'].values
f2 = data['speeding_feature'].values
X = np.array(list(zip(f1, f2)))

plt.plot()
plt.xlim([0, 100])
plt.ylim([0, 50])
plt.title('Dataset')
plt.ylabel('speeding_feature')
plt.xlabel('Distance_Feature')
plt.scatter(f1, f2)
plt.show()

colors = ['b', 'g', 'r']
markers = ['o', 'v', 's']

# KMeans algorithm, K = 3
kmeans_model = KMeans(n_clusters=3).fit(X)
plt.plot()
for i, l in enumerate(kmeans_model.labels_):
    plt.plot(f1[i], f2[i], color=colors[l], marker=markers[l], ls='None')
plt.xlim([0, 100])
plt.ylim([0, 50])
plt.title('KMeans clustering')
plt.show()

# EM algorithm (Gaussian mixture model fitted by expectation-maximization)
gmm = GaussianMixture(n_components=3).fit(X)
em_labels = gmm.predict(X)
plt.plot()
for i, l in enumerate(em_labels):
    plt.plot(f1[i], f2[i], color=colors[l], marker=markers[l], ls='None')
plt.xlim([0, 100])
plt.ylim([0, 50])
plt.title('EM clustering')
plt.show()
INPUT/DATASET
OUTPUT
Conclusion:
Thus the EM algorithm was applied to cluster a set of data stored in a .csv file, the same data set was clustered with the k-means algorithm, and the results were compared successfully.
Program 9
Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set.
Print both correct and wrong predictions. Java/Python ML library classes can be used for this
problem.
Objective: To implement the k-Nearest Neighbour algorithm to classify the iris data set, printing both correct and wrong predictions.
Dataset: IRIS data set with the features "petal_length", "petal_width", "sepal_length" and "sepal_width", having 150 instances.
ML algorithm: Supervised Learning - Lazy learning (k-Nearest Neighbour) algorithm
Packages: scikit-learn (sklearn), pandas
Description: When the training instances are received, no model is built; the instances/examples are simply stored in memory. When a test instance is given, the algorithm attempts to find the closest (most neighbouring) instances in the instance space.
Training algorithm:
• For each training example (x, f(x)), add the example to the list training examples
Classification algorithm:
• Given a query instance xq to be classified,
• Let x1 . . . xk denote the k instances from training examples that are nearest to xq
• Return
  f̂(xq) ← Σ (i = 1 to k) f(xi) / k
• where f(xi) is used to compute the mean value of the k nearest training examples.
Steps
1. Load the data
2. Initialize the value of k
3. For getting the predicted class, iterate from 1 to total number of training
data points
1. Calculate the distance between test data and each row of training
data. Here we will use Euclidean distance as our distance metric since
it’s the most popular method. The other metrics that can be used are
Chebyshev, cosine, etc.
2. Sort the calculated distances in ascending order based on distance
values
3. Get top k rows from the sorted array
4. Get the most frequent class of these rows, i.e., the labels of the selected K entries
5. Return the predicted class
If regression, return the mean of the K labels
If classification, return the mode of the K labels
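The library program in this experiment relies on scikit-learn, so the steps above do not appear explicitly in it. As a rough from-scratch sketch of those steps (names such as knn_predict are illustrative, not part of the lab program):

import math
from collections import Counter

def euclidean(p, q):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_X, train_y, test_point, k=3):
    # steps 1-2: distance from the test point to every training row
    distances = sorted((euclidean(row, test_point), label)
                       for row, label in zip(train_X, train_y))
    # steps 3-4: take the labels of the top k nearest rows
    top_k = [label for _, label in distances[:k]]
    # step 5: return the most frequent class (mode) of the k labels
    return Counter(top_k).most_common(1)[0][0]

# illustrative call with made-up points
print(knn_predict([[1, 1], [1, 2], [6, 6]], ['a', 'a', 'b'], [1.5, 1.5]))  # 'a'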
Confusion matrix:
Note,
• Class 1 : Positive
• Class 2 : Negative
Data Set:
Iris Plants Dataset: Dataset contains 150 instances (50 in each of three
classes) Number of Attributes: 4 numeric, predictive attributes and the Class
Program
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

iris = datasets.load_iris()
x = iris.data
y = iris.target
print('sepal-length sepal-width petal-length petal-width')
print(x)

# hold out 30% of the instances for testing (45 of 150, matching the output below)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
# k = 5 neighbours is an assumed, typical choice
classifier = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
y_pred = classifier.predict(x_test)

print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))
print('Accuracy Metrics')
print(classification_report(y_test, y_pred))
OUTPUT
sepal-length sepal-width petal-length petal-width
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3. 1.4 0.1]
[4.3 3. 1.1 0.1]
[5.8 4. 1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1. 0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5. 3. 1.6 0.2]
[5. 3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.2 4.1 1.5 0.1]
[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.1 1.5 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
Confusion Matrix
[[15 0 0]
[ 0 14 0]
[ 0 0 16]]
Accuracy Metrics
precision recall f1-score support
Conclusion:
Thus the k-Nearest Neighbour algorithm was implemented to classify the iris data set successfully.
Program 10
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs
Objective: To implement the Locally Weighted Regression algorithm to fit the given data.
Dataset: The dataset contains billing information based on the attributes total_bill, tip, sex, smoker, day, time, size.
ML algorithm: Locally Weighted Regression algorithm - instance based learning
Description: Regression means approximating a real-valued target function. Given a new query instance xq, the general approach is to construct an approximation function f that fits the training examples in the neighbourhood surrounding xq. This approximation is then used to estimate the target value f(xq).
Algorithm
1. Read the given data sample to X and the curve (linear or non-linear) to Y.
2. Set the value for the smoothening parameter or free parameter, say τ.
3. Set the bias/point of interest x0, which is a subset of X.
4. Determine the weight matrix using
   w(x, x0) = exp(−(x − x0)^2 / (2τ^2))
5. Determine the value of the model parameter β using
   β = (X^T W X)^(-1) X^T W y
6. Prediction = x0 * β
Program:
import numpy as np1
import pandas as pd
import matplotlib.pyplot as plt

def kernel(point, xmat, k):
    # diagonal matrix of weights, decaying with distance from the query point
    m, n = np1.shape(xmat)
    weights = np1.mat(np1.eye(m))
    for j in range(m):
        diff = point - xmat[j]
        weights[j, j] = np1.exp(diff * diff.T / (-2.0 * k ** 2))
    return weights

def localWeight(point, xmat, ymat, k):
    wei = kernel(point, xmat, k)
    W = (xmat.T * (wei * xmat)).I * (xmat.T * (wei * ymat.T))
    return W

def localWeightRegression(xmat, ymat, k):
    m, n = np1.shape(xmat)
    ypred = np1.zeros(m)
    for i in range(m):
        ypred[i] = xmat[i] * localWeight(xmat[i], xmat, ymat, k)
    return ypred

# load the billing data ('tips.csv' is the assumed file name) and
# build the design matrix X = [1, total_bill]
data = pd.read_csv('tips.csv')
bill = np1.array(data.total_bill)
tip = np1.array(data.tip)
mbill = np1.mat(bill)
mtip = np1.mat(tip)
m = np1.shape(mbill)[1]
one = np1.mat(np1.ones(m))
X = np1.hstack((one.T, mbill.T))

# set k (the smoothening parameter) here
ypred = localWeightRegression(X, mtip, 2)
SortIndex = X[:, 1].argsort(0)
xsort = X[SortIndex][:, 0]

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(bill, tip, color='green')
ax.plot(xsort[:, 1], ypred[SortIndex], color='red', linewidth=5)
plt.xlabel('Total bill')
plt.ylabel('Tip')
plt.show()
INPUT DATASET :
OUTPUT
Conclusion:
Thus the non-parametric Locally Weighted Regression algorithm was implemented to fit data points successfully.
VIVA Questions