
SHRI MADHWA VADIRAJA INSTITUTE OF

TECHNOLOGY & MANAGEMENT


Vishwothama Nagar, Bantakal – 574 115, Udupi Dist.
(A unit of Shri Sode Vadiraja Matt Education Trust)

LABORATORY MANUAL
for
MACHINE LEARNING LABORATORY
[As per Choice Based Credit System (CBCS) scheme]
(Effective from the academic year 2017 -2018)

Course: B. E.
Semester: VII
Subject code: 17CSL76

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


TABLE OF CONTENTS

Sl. No Topic Page No.

1 Introduction 1-4

2 Syllabus 5

3 Laboratory Program 1 6-8

4 Laboratory Program 2 9-13

5 Laboratory Program 3 14-16

6 Laboratory Program 4 17-19

7 Laboratory Program 5 20-26

8 Laboratory Program 6 27-32

9 Laboratory Program 7 33-35

10 Laboratory Program 8 36-40

11 Laboratory Program 9 41-44

12 Laboratory Program 10 45-48

13 Appendix I 49-50

14 References 51

Introduction

Machine learning is a subset of artificial intelligence in the field of computer science that
often uses statistical techniques to give computers the ability to "learn" (i.e.,
progressively improve performance on a specific task) with data, without being explicitly
programmed. In the past decade, machine learning has given us self-driving cars,
practical speech recognition, effective web search, and a vastly improved understanding
of the human genome.

Machine learning tasks


Machine learning tasks are typically classified into several broad categories, depending on
the nature of the learning "signal" or "feedback" available to the learning system:
1. Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to
outputs. As special cases, the input signal can be only partially available, or restricted to
special feedback:
2. Semi-supervised learning: the computer is given only an incomplete training signal: a
training set with some (often many) of the target outputs missing.
3. Active learning: the computer can only obtain training labels for a limited set of instances
(based on a budget), and also has to optimize its choice of objects to acquire labels for.
When used interactively, these can be presented to the user for labeling.
4. Reinforcement learning: training data (in form of rewards and punishments) is given only
as feedback to the program's actions in a dynamic environment, such as driving a
vehicle or playing a game against an opponent.
5. Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own
to find structure in its input. Unsupervised learning can be a goal in itself (discovering
hidden patterns in data) or a means towards an end (feature learning).

Machine Learning Applications


In classification, inputs are divided into two or more classes, and the learner must
produce a model that assigns unseen inputs to one or more (multi-label classification) of
these classes. This is typically tackled in a supervised manner. Spam filtering is an
example of classification, where the inputs are email (or other) messages, and the
classes are "spam" and "not spam".
In regression, also a supervised problem, the outputs are continuous rather than
discrete. In clustering, a set of inputs is to be divided into groups. Unlike in classification,
the groups are not known beforehand, making this typically an unsupervised task.
Density estimation finds the distribution of inputs in some space. Dimensionality
reduction simplifies inputs by mapping them into a lower dimensional space. Topic
modeling is a related problem, where a program is given a list of human language
documents and is tasked with finding out which documents cover similar topics.

Machine learning Approaches


1. Decision tree learning
Decision tree learning uses a decision tree as a predictive model, which maps observations
about an item to conclusions about the item's target value.

2. Association rule learning


Association rule learning is a method for discovering interesting relations between
variables in large databases.

3. Artificial neural networks


An artificial neural network (ANN), usually called a "neural network" (NN), is a learning
algorithm that is vaguely inspired by biological neural networks. Computations
are structured in terms of an interconnected group of artificial neurons, processing
information using a connectionist approach to computation. Modern neural networks are
non-linear statistical data modeling tools. They are usually used to model complex
relationships between inputs and outputs, to find patterns in data, or to capture the
statistical structure in an unknown joint probability distribution between observed
variables.

4. Deep learning
Falling hardware prices and the development of GPUs for personal use in the last few years
have contributed to the development of the concept of deep learning which consists of
multiple hidden layers in an artificial neural network. This approach tries to model the way


the human brain processes light and sound into vision and hearing. Some successful
applications of deep learning are computer vision and speech recognition.

5. Inductive logic programming


Inductive logic programming (ILP) is an approach to rule learning using logic programming
as a uniform representation for input examples, background knowledge, and hypotheses.
Given an encoding of the known background knowledge and a set of examples
represented as a logical database of facts, an ILP system will derive a hypothesized logic
program that entails all positive and no negative examples. Inductive programming is a
related field that considers any kind of programming languages for representing
hypotheses (and not only logic programming), such as functional programs.

6. Support vector machines


Support vector machines (SVMs) are a set of related supervised learning methods used for
classification and regression. Given a set of training examples, each marked as belonging to
one of two categories, an SVM training algorithm builds a model that predicts whether a
new example falls into one category or the other.
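
To make this concrete, here is a minimal sketch (an assumed toy example, not one of the lab programs below): four hand-made 2-D points from two categories are used to train a scikit-learn SVM, which then predicts the category of a new point.

from sklearn import svm

# Hypothetical toy data: two points per category
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

clf = svm.SVC(kernel='linear')    # linear support vector classifier
clf.fit(X, y)                     # build the model from the labelled examples
print(clf.predict([[2.5, 2.5]]))  # the new point falls on the side of category 1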

7. Clustering
Cluster analysis is the assignment of a set of observations into subsets (called clusters) so
that observations within the same cluster are similar according to some predesignated
criterion or criteria, while observations drawn from different clusters are dissimilar.
Different clustering techniques make different assumptions on the structure of the data,
often defined by some similarity metric and evaluated for example by internal
compactness (similarity between members of the same cluster) and separation between
different clusters. Other methods are based on estimated density and graph connectivity.
Clustering is a method of unsupervised learning, and a common technique for statistical
data analysis.

8. Bayesian networks
A Bayesian network, belief network or directed acyclic graphical model is a probabilistic
graphical model that represents a set of random variables and their conditional
independencies via a directed acyclic graph (DAG). For example, a Bayesian network could
represent the probabilistic relationships between diseases and symptoms. Given
symptoms, the network can be used to compute the probabilities of the presence of
various diseases. Efficient algorithms exist that perform inference and learning.


9. Reinforcement learning
Reinforcement learning is concerned with how an agent ought to take actions in an
environment so as to maximize some notion of long-term reward. Reinforcement learning
algorithms attempt to find a policy that maps states of the world to the actions the agent
ought to take in those states. Reinforcement learning differs from the supervised learning
problem in that correct input/output pairs are never presented, nor sub-optimal actions
explicitly corrected.

10. Similarity and metric learning


In this problem, the learning machine is given pairs of examples that are considered similar
and pairs of less similar objects. It then needs to learn a similarity function (or a distance
metric function) that can predict if new objects are similar. It is sometimes used in
Recommendation systems.

11. Genetic algorithms


A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection
and uses methods such as mutation and crossover to generate new genotypes in the hope
of finding good solutions to a given problem. In machine learning, genetic algorithms found
some uses in the 1980s and 1990s. Conversely, machine learning techniques have been
used to improve the performance of genetic and evolutionary algorithms.

12. Rule-based machine learning


Rule-based machine learning is a general term for any machine learning method that
identifies, learns, or evolves "rules" to store, manipulate, or apply knowledge. The defining
characteristic of a rule-based machine learner is the identification and utilization of a set of
relational rules that collectively represent the knowledge captured by the system. This is in
contrast to other machine learners that commonly identify a singular model that can be
universally applied to any instance in order to make a prediction. Rule-based machine
learning approaches include learning classifier systems, association rule learning, and
artificial immune systems.

13. Feature selection approach


Feature selection is the process of selecting an optimal subset of relevant features for use
in model construction. It is assumed the data contains some features that are either
redundant or irrelevant and can thus be removed to reduce calculation cost without
incurring much loss of information. Common optimality criteria include accuracy, similarity,
and information measures.
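
As a small illustration (an assumed example, not one of the lab programs), scikit-learn's SelectKBest can keep the two most informative of the four iris features according to the chi-squared score:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
X_reduced = SelectKBest(chi2, k=2).fit_transform(X, y)   # keep the 2 best-scoring features
print(X.shape, '->', X_reduced.shape)                    # (150, 4) -> (150, 2)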


MACHINE LEARNING LABORATORY


Academic year: 2020-21
Semester: VII            Exam Hours: 03
Subject Code: 17CSL76    IA Marks: 40
Credits: 2               Exam Marks: 100

1. Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training data from
a .CSV file.
2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set
of all hypotheses consistent with the training examples.
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to
classify a new sample.
4. Build an Artificial Neural Network by implementing the Back-propagation algorithm
and test the same using appropriate data sets.
5. Write a program to implement the naïve Bayesian classifier for a sample training data
set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test
data sets.
6. Assuming a set of documents that need to be classified, use the naïve Bayesian
Classifier model to perform this task. Built-in Java classes/API can be used to write the
program. Calculate the accuracy, precision, and recall for your data set.
7. Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set. You can use Java/Python ML library classes/API.
8. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set
for clustering using k-Means algorithm. Compare the results of these two algorithms
and comment on the quality of clustering. You can add Java/Python ML library
classes/API in the program.
9. Write a program to implement k-Nearest Neighbor algorithm to classify the iris data
set. Print both correct and wrong predictions. Java/Python ML library classes can be
used for this problem.
10. Implement the non-parametric Locally Weighted Regression algorithm to fit data
points. Select appropriate data set for your experiment and draw graphs.


1. Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training
data from a .CSV file.

import csv

attributes = [['Sunny', 'Cloudy', 'Rainy'],
              ['Warm', 'Cold'],
              ['Normal', 'High'],
              ['Strong', 'Weak'],
              ['Warm', 'Cool'],
              ['Same', 'Change']]

total_attributes = len(attributes)

print("\nTotal number of attributes is: ", total_attributes)
print("The most specific hypothesis: ['0','0','0','0','0','0']")
print("The most general hypothesis: ['?','?','?','?','?','?']")

a = []
print("\nThe Given Training Data Set is:")

# Read the training examples from the CSV file
with open('EnjoySport.csv', 'r') as cfile:
    for row in csv.reader(cfile):
        a.append(row)
        print(row)

print("\nTotal number of records is: ", len(a))

print("\nThe initial hypothesis is: ")
hypothesis = ['0'] * total_attributes
print(hypothesis)

# Compare the hypothesis with each training example of the given data set
for i in range(0, len(a)):
    if a[i][total_attributes] == 'Yes':          # generalize only on positive examples
        for j in range(0, total_attributes):
            if hypothesis[j] == '0' or hypothesis[j] == a[i][j]:
                hypothesis[j] = a[i][j]
            else:
                hypothesis[j] = '?'
    print("\nHypothesis for Training Example No {} is: \n".format(i+1), hypothesis)

print("\nThe Maximally Specific Hypothesis for the given Training Examples:")
print(hypothesis)

Department of CSE, SMVITM, Bantakal 7


Machine Learning Laboratory 17CSL76

Input File: EnjoySport.CSV

Sunny Warm Normal Strong Warm Same Yes


Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes

Output:

Total number of attributes is: 6


The most specific hypothesis: ['0','0','0','0','0','0']
The most general hypothesis: ['?','?','?','?','?','?']

The Given Training Data Set is:


['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same', 'Yes']
['Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same', 'Yes']
['Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change', 'No']
['Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change', 'Yes']

Total number of records is: 4

The initial hypothesis is:


['0', '0', '0', '0', '0', '0']

Hypothesis for Training Example No 1 is:


['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same']

Hypothesis for Training Example No 2 is:


['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']

Hypothesis for Training Example No 3 is:


['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']

Hypothesis for Training Example No 4 is:


['Sunny', 'Warm', '?', 'Strong', '?', '?']

The Maximally Specific Hypothesis for the given Training Examples:


['Sunny', 'Warm', '?', 'Strong', '?', '?']


2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the
set of all hypotheses consistent with the training examples.

import csv

with open('T:\\ML\\datasheet\\EnjoySport.csv') as cfile:
    examples = [tuple(row) for row in csv.reader(cfile)]

# Obtain the domain of attribute values defined in the instances X
def getdomains(examples):
    # set() returns an unordered collection of items with no duplicates
    d = [set() for i in examples[0]]
    for r in examples:
        for c, v in enumerate(r):
            d[c].add(v)
    return [list(sorted(x)) for x in d]

def g0(n):
    return ('?',) * n

def s0(n):
    return ('0',) * n

def more_general(h, e):
    mgparts = []
    for x, y in zip(h, e):
        mg = x == '?' or (x != '0' and (x == y or y == '0'))
        mgparts.append(mg)
    return all(mgparts)

def consistent(hypothesis, example):
    return more_general(hypothesis, example)

def min_generalizations(s, e):
    s_new = list(s)
    for i in range(len(s)):
        if not consistent(s[i:i+1], e[i:i+1]):
            if s[i] != '0':
                s_new[i] = '?'
            else:
                s_new[i] = e[i]
    return [tuple(s_new)]

def generalize_S(e, G, S):
    S_prev = list(S)
    for s in S_prev:
        if s not in S:
            continue
        if not consistent(s, e):
            S.remove(s)
            Splus = min_generalizations(s, e)
            S.update([h for h in Splus if any([more_general(g, h) for g in G])])
            S.difference_update([h for h in S
                                 if any([more_general(h, h1) for h1 in S if h != h1])])
    return S

def min_specializations(h, domains, e):
    results = []
    for i in range(len(h)):
        if h[i] == '?':
            for val in domains[i]:
                if e[i] != val:
                    h_new = h[:i] + (val,) + h[i+1:]
                    results.append(h_new)
        elif h[i] != '0':
            h_new = h[:i] + ('0',) + h[i+1:]
            results.append(h_new)
    return results

def specialize_G(e, domains, G, S):
    G_prev = list(G)
    for g in G_prev:
        if g not in G:
            continue
        if consistent(g, e):
            G.remove(g)
            Gminus = min_specializations(g, domains, e)
            G.update([h for h in Gminus if any([more_general(h, s) for s in S])])
            G.difference_update([h for h in G
                                 if any([more_general(g1, h) for g1 in G if h != g1])])
    return G

def candidate_elimination(examples):
    domains = getdomains(examples)[:-1]
    G = set([g0(len(domains))])
    S = set([s0(len(domains))])
    i = 0
    print("\nInitially")
    print("G[{0}]:".format(i), G)
    print("S[{0}]:".format(i), S)
    for r in examples:
        i = i + 1
        e, t = r[:-1], r[-1]            # split each row into attributes and decision
        if t == 'Yes':                  # positive example
            G = {g for g in G if consistent(g, e)}
            S = generalize_S(e, G, S)
        else:                           # negative example
            S = {s for s in S if not consistent(s, e)}
            G = specialize_G(e, domains, G, S)
        print("For Training example {0}".format(i))
        print("G[{0}]:".format(i), G)
        print("S[{0}]:".format(i), S)
    return

candidate_elimination(examples)

---------------------------------------------------------------------------------------------------------------------

Input File: EnjoySport.CSV

Sunny Warm Normal Strong Warm Same Yes


Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes

Output:

Initially
G[0]: {('?', '?', '?', '?', '?', '?')}
S[0]: {('0', '0', '0', '0', '0', '0')}

For Training example 1


G[1]: {('?', '?', '?', '?', '?', '?')}
S[1]: {('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')}

For Training example 2


G[2]: {('?', '?', '?', '?', '?', '?')}
S[2]: {('Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same')}

For Training example 3


G[3]: {('?', 'Warm', '?', '?', '?', '?'), ('?', '?', '?', '?', '?', 'Same'), ('Sunny', '?', '?', '?', '?', '?')}
S[3]: {('Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same')}

For Training example 4


G[4]: {('?', 'Warm', '?', '?', '?', '?'), ('Sunny', '?', '?', '?', '?', '?')}
S[4]: {('Sunny', 'Warm', '?', 'Strong', '?', '?')}


3. Write a program to demonstrate the working of the decision tree based ID3
algorithm. Use an appropriate data set for building the decision tree and apply this
knowledge to classify a new sample.


#code

import pandas as pd
import numpy as np

dataset = pd.read_csv('PlayTennis.csv',
                      names=['outlook', 'temperature', 'humidity', 'wind', 'playtennis'])

def entropy(target_col):
    elements, counts = np.unique(target_col, return_counts=True)
    entropy = np.sum([(-counts[i]/np.sum(counts)) * np.log2(counts[i]/np.sum(counts))
                      for i in range(len(elements))])
    return entropy

def InfoGain(data, split_attribute_name, target_name="playtennis"):
    total_entropy = entropy(data[target_name])
    vals, counts = np.unique(data[split_attribute_name], return_counts=True)
    Weighted_Entropy = np.sum([(counts[i]/np.sum(counts)) *
                               entropy(data.where(data[split_attribute_name] == vals[i]).dropna()[target_name])
                               for i in range(len(vals))])
    Information_Gain = total_entropy - Weighted_Entropy
    return Information_Gain

def ID3(data, originaldata, features, target_attribute_name="playtennis", parent_node_class=None):
    # If all remaining target values are identical, return that class label
    if len(np.unique(data[target_attribute_name])) <= 1:
        return np.unique(data[target_attribute_name])[0]
    # If the subset is empty, return the majority class of the original dataset
    elif len(data) == 0:
        return np.unique(originaldata[target_attribute_name])[
            np.argmax(np.unique(originaldata[target_attribute_name], return_counts=True)[1])]
    # If no features are left to split on, return the parent node's majority class
    elif len(features) == 0:
        return parent_node_class
    else:
        parent_node_class = np.unique(data[target_attribute_name])[
            np.argmax(np.unique(data[target_attribute_name], return_counts=True)[1])]
        # Information gain of each remaining feature; split on the best one
        item_values = [InfoGain(data, feature, target_attribute_name) for feature in features]
        best_feature_index = np.argmax(item_values)
        best_feature = features[best_feature_index]
        tree = {best_feature: {}}
        features = [i for i in features if i != best_feature]
        for value in np.unique(data[best_feature]):
            sub_data = data.where(data[best_feature] == value).dropna()
            subtree = ID3(sub_data, dataset, features, target_attribute_name, parent_node_class)
            tree[best_feature][value] = subtree
        return tree

tree = ID3(dataset, dataset, dataset.columns[:-1])
print(' \nDisplay Tree\n', tree)

Output:

The Resultant Decision Tree is:

{'Outlook': {'Overcast': 'Yes',


'Rain': {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}}}}
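
The listing above stops after printing the tree. To classify a new sample, as the exercise statement asks, the nested dictionary returned by ID3 can be walked from the root attribute down to a leaf. The helper below is a sketch added for illustration (it is not part of the listing above); the sample shown is hypothetical, and its keys and values must match the column names and attribute spellings of the CSV file actually used.

def classify(sample, tree):
    if not isinstance(tree, dict):                     # a leaf node holds the class label
        return tree
    attribute = next(iter(tree))                       # attribute tested at this node
    branch = tree[attribute].get(sample.get(attribute))
    if branch is None:                                 # attribute value never seen in training
        return None
    return classify(sample, branch)

new_sample = {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High', 'Wind': 'Strong'}
print('Predicted class for the new sample:', classify(new_sample, tree))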


4. Build an Artificial Neural Network by implementing the Backpropagation algorithm
and test the same using appropriate data sets.

#code

import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X/np.amax(X, axis=0)   # normalise each input feature by its column maximum
y = y/100                  # scale the outputs into the range 0-1

# Sigmoid activation function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# Derivative of the sigmoid (x is already the sigmoid output)
def derivatives_sigmoid(x):
    return x * (1 - x)

# Variable initialization
epoch = 7000               # number of training iterations
lr = 0.1                   # learning rate
inputlayer_neurons = 2     # number of features in the data set
hiddenlayer_neurons = 3    # number of hidden layer neurons
output_neurons = 1         # number of neurons at the output layer

# Weight and bias initialization: uniform random values of dimension x*y
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    # Forward propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)

    # Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)   # how much the hidden layer contributed to the error
    d_hiddenlayer = EH * hiddengrad

    wout += hlayer_act.T.dot(d_output) * lr        # dot product of next-layer error and current-layer output
    bout += np.sum(d_output, axis=0, keepdims=True) * lr
    wh += X.T.dot(d_hiddenlayer) * lr
    # bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * lr

print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)

Input:
[[0.66666667 1.00000000]
 [0.33333333 0.55555556]
 [1.00000000 0.66666667]]
Actual Output:
[[0.92]
 [0.86]
 [0.89]]
Predicted Output:
[[0.89592234]
 [0.88076582]
 [0.89211838]]


5. Write a program to implement the naïve Bayesian classifier for a sample training
data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few
test data sets.

Bayes’ Theorem

Naïve Bayes Classification

The naive Bayes classifier, or simple Bayesian classifier, works as follows:

• Let D be a training set of tuples and their associated class labels. Each tuple is represented by an
n-dimensional attribute vector X = (x1, x2, ..., xn), depicting n measurements made on the tuple from
the n attributes A1, A2, ..., An respectively.

• Suppose that there are m classes C1, C2, ..., Cm. Given a tuple X, the classifier predicts that X belongs
to the class having the highest posterior probability conditioned on X. That is, the naive Bayesian
classifier predicts that tuple X belongs to class Ci if and only if

    P(Ci | X) > P(Cj | X)   for 1 ≤ j ≤ m, j ≠ i

Thus we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called the maximum
posteriori hypothesis. By Bayes’ theorem,

    P(Ci | X) = P(X | Ci) P(Ci) / P(X)

i.  As P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized.
ii. Given data sets with many attributes, it would be extremely computationally expensive to compute
    P(X | Ci). To reduce computation in evaluating P(X | Ci), the naive assumption of class-conditional
    independence is made. This presumes that the attributes’ values are conditionally independent of one
    another, given the class label of the tuple. Thus,

    P(X | Ci) = Π (k = 1 to n) P(xk | Ci)

If an attribute Ak is continuous-valued, P(xk | Ci) is typically assumed to follow a Gaussian
distribution with mean μ and standard deviation σ, defined by

    g(x, μ, σ) = (1 / (√(2π) · σ)) · exp( −(x − μ)² / (2σ²) )

so that P(xk | Ci) = g(xk, μCi, σCi), where μCi and σCi are the mean and standard deviation of
attribute Ak for the tuples of class Ci in the training set.

# Read and handle the data
import csv
import random
import math

random.seed(0)

# 1. Data Handling
# 1.1 Load the data from the CSV file of the Pima Indians diabetes dataset
def loadcsv(filename):
    reader = csv.reader(open(filename, "r"))
    dataset = []
    for row in reader:
        inlist = []
        for i in range(len(row)):
            # convert the attributes from strings to floating point numbers
            inlist.append(float(row[i]))
        dataset.append(inlist)
    return dataset

# 1.2 Split the dataset into training and testing sets.
# The naive Bayes model is a summary of the training data (the mean and the standard
# deviation of each attribute, by class value); this summary is then used when making predictions.
def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))   # pick a random row for the training set
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

# 2. Summarize the data
# 2.1 Separate the data by class.
# The function assumes that the last attribute (-1) is the class value and
# returns a map of class values to lists of data instances.
def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

# 2.2 Calculate the mean.
# The mean is the central tendency of the data; it is used as the middle of the
# Gaussian distribution when calculating probabilities.
def mean(numbers):
    return sum(numbers)/float(len(numbers))

# 2.3 Calculate the standard deviation.
# The standard deviation describes the spread of the data; it characterizes the expected
# spread of each attribute in the Gaussian distribution when calculating probabilities.
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg, 2) for x in numbers]) / float(len(numbers)-1)
    return math.sqrt(variance)

# 2.4 Summarize the dataset: (mean, stdev) pair for every attribute
def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]   # drop the summary of the class column
    return summaries

# 2.5 Summarize the attributes by class
def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

# 3. Make predictions
# 3.1 Gaussian probability density function
def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean, 2) / (2*math.pow(stdev, 2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

# 3.2 Calculate the class probabilities for an input vector
def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

# 3.3 Prediction: look for the largest probability and return the associated class
def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

# 3.4 Predict every instance of the test set
def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

# 4. Compute the accuracy
def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

# 5. Main function
def main():
    filename = 'T:\\ML\\datasheet\\PI_Diabetes.csv'
    splitRatio = 0.67
    dataset = loadcsv(filename)
    print("\n The length of the Data Set : ", len(dataset))
    print("\n The Data Set Splitting into Training and Testing \n")
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print('\n Number of Rows in Training Set:{0} rows '.format(len(trainingSet)))
    print('\n Number of Rows in Testing Set:{0} rows '.format(len(testSet)))
    # prepare the model
    summaries = summarizeByClass(trainingSet)
    print("\n Model Summaries:\n", summaries)
    # test the model
    predictions = getPredictions(summaries, testSet)
    print("\nPredictions:\n", predictions)
    accuracy = getAccuracy(testSet, predictions)
    print('\n Accuracy: {0}%'.format(accuracy))

main()

Output:

The length of the Data Set : 768


The Data Set Splitting into Training and Testing

Number of Rows in Training Set:514 rows

Number of Rows in Testing Set:254 rows

Model Summaries:
{1.0: [(4.701754385964913, 3.749344627974186), (142.9298245614035, 31.1849471507099
03), (68.81871345029239, 23.193226713717014), (22.239766081871345, 17.934713233516
998), (110.09356725146199, 146.07110482316023), (35.18128654970761, 8.026522255094
289), (0.5614912280701757, 0.3747628641345956), (36.801169590643276, 11.3472566692
62784)], 0.0: [(3.3556851311953353, 3.006137199069943), (110.64139941690962, 26.28181
125254248), (68.9067055393586, 18.44741337335469), (20.06122448979592, 14.98632069
0982343), (69.11953352769679, 97.04270626661162), (30.255976676384837, 8.076274770
888682), (0.43274635568513126, 0.3115760396097594), (31.600583090379008, 11.842321
751005294)]}

Predictions:
[0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0,
0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0,
0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0,
0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0,
1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0,
0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0,
0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0,
1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0,
1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0,
1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

Accuracy: 74.40944881889764%


6. Assuming a set of documents that need to be classified, use the naïve Bayesian
Classifier model to perform this task. Built-in Java classes/API can be used to write
the program. Calculate the accuracy, precision, and recall for your data set.


# Document classification using a naive Bayes classifier

import pandas as pd

txt = pd.read_csv('Text.csv', names=['text', 'label'])   # tabular form data
print(txt)
print('\nTotal instances in the dataset is: ', txt.shape[0])

txt['labelnum'] = txt.label.map({'pos': 1, 'neg': 0})
X = txt.text
Y = txt.labelnum

# Split the dataset into training and testing data
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, random_state=0)
print('\nDataset is split into Training and Testing samples')
print('Total training instances :', xtrain.shape[0])
print(xtrain)
print('Total testing instances :', xtest.shape[0])
print(xtest)

# CountVectorizer performs feature extraction; its output is a sparse document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)   # sparse matrix
xtest_dtm = count_vect.transform(xtest)
print('\nTotal features extracted using CountVectorizer:', xtrain_dtm.shape[1])

print('\nFeatures for training instances are:')
df = pd.DataFrame(xtrain_dtm.toarray(), columns=count_vect.get_feature_names())
print(df.columns)
print('\nDocument term matrix is:\n ')
print(df)

# Train a multinomial naive Bayes (NB) classifier on the training data
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)

print('\n---Classification results of testing samples are given below----')
for doc, p in zip(xtest, predicted):
    pred = 'pos' if p == 1 else 'neg'
    print('%s -> %s ' % (doc, pred))

# Print the accuracy metrics
from sklearn import metrics
print('\nAccuracy of the classifier is: ', metrics.accuracy_score(ytest, predicted))
print('Recall of the classifier is: ', metrics.recall_score(ytest, predicted))
print('Precision of the classifier is: ', metrics.precision_score(ytest, predicted))
print('Confusion matrix is: ')
print(metrics.confusion_matrix(ytest, predicted))

Output:


Dataset is split into Training and Testing samples

Total training instances: 13

4 What an awesome view


2 I feel very good about these beers
16 We will have good fun tomorrow
17 I went to my enemy's house today
9 My boss is horrible
7 I can't deal with this
13 I am sick and tired of this place
11 I do not like the taste of this juice
3 This is my best work work work
0 I love this sandwich
5 I do not like this restaurant
15 That is a bad locality to stay
12 I love to dance
Name: text, dtype: object

Total testing instances : 5


1 This is an amazing place


6 I am tired of this stuff


8 He is my sworn enemy
10 This is an awesome place
14 What a great holiday
Name: text, dtype: object

Total features extracted using CountVectorizer: 50

Features for training instances are:

Index(['about', 'am', 'an', 'and', 'awesome', 'bad', 'beers', 'best', 'boss', 'can', 'dance', 'deal', 'd

o', 'enemy', 'feel', 'fun', 'good', 'have', 'horrible', 'house', 'is', 'juice', 'like', 'locality', 'love', 'my',

'not', 'of', 'place', 'restaurant', 'sandwich', 'sick', 'stay', 'taste', 'that', 'the', 'these', 'this', 'tired',

'to', 'today', 'tomorrow', 'very', 'view', 'we', 'went', 'what', 'will', 'with', 'work'], dtype='object')

Document term matrix is:


---Classification results of testing samples are given below----

This is an amazing place -> neg

I am tired of this stuff -> neg

He is my sworn enemy -> neg

This is an awesome place -> pos

What a great holiday -> pos

Accuracy of the classifier is: 0.8

Recall of the classifier is: 0.666666666667

Precision of the classifier is: 1.0

Confusion matrix is:

[[2 0]

[1 2]]


7. Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using the standard Heart Disease
Data Set. You can use Java/Python ML library classes/API.

Install the pgmpy (Probabilistic Graphical Models) Python package using the following
command in the Anaconda prompt:

pip install pgmpy

#code
import numpy as np
import pandas as pd
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination

# Attribute names of the Cleveland heart disease dataset
attributes = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
              'exang', 'oldpeak', 'slope', 'ca', 'thal', 'heartdisease']

# Read the Cleveland heart disease data
heartDisease = pd.read_csv('HeartDisease.csv', names=attributes)
heartDisease = heartDisease.replace('?', np.nan)

# Display the data
print('Few examples from the dataset are given below')
print(heartDisease.head())
print('\nAttributes and datatypes')
print(heartDisease.dtypes)

# Model the Bayesian network structure
model = BayesianModel([('age', 'trestbps'), ('age', 'fbs'), ('sex', 'trestbps'),
                       ('exang', 'trestbps'), ('trestbps', 'heartdisease'),
                       ('fbs', 'heartdisease'), ('heartdisease', 'restecg'),
                       ('heartdisease', 'thalach'), ('heartdisease', 'chol')])

# Learning CPDs using Maximum Likelihood Estimators
print('\nLearning Conditional Probability Distributions using Maximum Likelihood Estimators...')
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)

# Inferencing with the Bayesian network
print('\nInferencing with Bayesian Network:')
HeartDisease_infer = VariableElimination(model)

print('\nComputing the probability of Heart disease given age = 28')
q = HeartDisease_infer.query(variables=['heartdisease'], evidence={'age': 28})
print(q['heartdisease'])

print('\nComputing the probability of Heart disease given chol = 100')
q = HeartDisease_infer.query(variables=['heartdisease'], evidence={'chol': 100})
print(q['heartdisease'])

Output:

Learning Conditional Probability Distributions using Maximum Likelihood Estimators...


8. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same
data set for clustering using k-Means algorithm. Compare the results of these two
algorithms and comment on the quality of clustering. You can add Java/Python ML
library classes/API in the program.

K-Means Algorithm

1. Load the data set.

2. Cluster the data into k groups, where k is predefined.

3. Select k points at random as the initial cluster centers.

4. Assign each object to its closest cluster center according to the Euclidean distance function.

5. Calculate the centroid, or mean, of all objects in each cluster.

6. Repeat steps 4 and 5 until the same points are assigned to each cluster in consecutive rounds.
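
A minimal from-scratch sketch of this loop on hypothetical 2-D points is shown below; the lab program further down uses scikit-learn's KMeans instead.

import numpy as np

np.random.seed(0)
points = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])   # two made-up blobs
k = 2
centroids = points[np.random.choice(len(points), k, replace=False)]        # step 3: random centers

for _ in range(100):
    # step 4: assign every point to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # step 5: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    # step 6: stop when the centroids (and hence the assignments) no longer change
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print('Final centroids:\n', centroids)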

EM algorithm

The EM algorithm consists of two basic steps: the E step (Expectation or Estimation step) and the
M step (Maximization step).

Estimation step:

• Initialize μk, Σk and πk with some random values, or with the results of k-means clustering or of
hierarchical clustering.

• For those given parameter values, estimate the values of the latent variables (i.e. the
responsibilities γk).

Maximization step:

• Update the values of the parameters (i.e. μk, Σk and πk) using maximum-likelihood estimates
computed from the responsibilities.

The overall clustering procedure:

1. Select the number of seeds. Let this number be k.

2. Pick k seeds as centroids of the k clusters. The seeds may be picked at random.

3. Compute the Euclidean/Manhattan distance of each object in the dataset from each of the
centroids.

4. Allocate each object to the cluster it is nearest to, based on the distances computed in the
previous step.

5. Compute the centroids of the clusters by computing the means of the attribute values of the
objects in each cluster.

6. Check whether the stopping criterion has been met (for example: cluster membership is
unchanged). If so, go to step 7; if not, go to step 3.

7. [Optional] One may decide to stop at this stage, or to split a cluster or combine two clusters
heuristically until a stopping criterion is met.
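
The sketch below illustrates the E and M steps above for a one-dimensional mixture of two Gaussians, using the notation μk, σk, πk and γ; the data and starting values are made up for the illustration, and the lab program itself relies on scikit-learn's GaussianMixture.

import numpy as np

np.random.seed(1)
x = np.concatenate([np.random.normal(0, 1, 100), np.random.normal(6, 1, 100)])  # made-up 1-D data

mu = np.array([1.0, 5.0])      # initial means
sigma = np.array([1.0, 1.0])   # initial standard deviations
pi = np.array([0.5, 0.5])      # initial mixing coefficients

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

for _ in range(50):
    # E step: responsibilities gamma[i, k] proportional to pi_k * N(x_i | mu_k, sigma_k)
    weighted = np.stack([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    gamma = weighted / weighted.sum(axis=1, keepdims=True)
    # M step: re-estimate mu_k, sigma_k and pi_k from the responsibilities
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)

print('means:', mu, '\nstd devs:', sigma, '\nmixing weights:', pi)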

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.mixture import GaussianMixture
import sklearn.metrics as sm
import pandas as pd
import numpy as np
# %matplotlib inline

# Load the iris dataset
iris_dataset = pd.read_csv('Iris.csv')
iris_dataset['Targets'] = iris_dataset.Class.map({'Iris-setosa': 0,
                                                  'Iris-versicolor': 1,
                                                  'Iris-virginica': 2})
X = iris_dataset[['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']]
Y = iris_dataset[['Targets']]

# Build the K-Means model
model = KMeans(n_clusters=3, random_state=0)
model.fit(X)
print('Model Labels:\n', model.labels_)

# Build a Gaussian mixture model (GMM) for the EM algorithm
scaler = preprocessing.StandardScaler()
scaler.fit(X)
xs = scaler.transform(X)
gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(xs)
Y_gmm = gmm.predict(xs)
print('GMM Labels:\n', Y_gmm)

# Visualize the clustering results; set the size of the plot
plt.figure(figsize=(14, 14))

# Create a colormap
colormap = np.array(['red', 'lime', 'black'])

# Plot the original classification using the petal features
plt.subplot(2, 2, 1)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[Y.Targets], s=40)
plt.title('Real Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

# Plot the K-Means model classifications
plt.subplot(2, 2, 2)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[model.labels_], s=40)
plt.title('K Means Clustering')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

# Plot the GMM model classification
plt.subplot(2, 2, 3)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[Y_gmm], s=40)
plt.title('GMM Based Clustering')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

# Calculate performance metrics for K-Means and GMM
print('Evaluation of K-Means with ground truth classification of Iris Dataset')
print('Rand Index:%f ' % sm.adjusted_rand_score(Y.Targets, model.labels_))
print('Homogenity Score:%f ' % sm.homogeneity_score(Y.Targets, model.labels_))
print('Completeness Score:%f ' % sm.completeness_score(Y.Targets, model.labels_))
print('V-Measure:%f ' % sm.v_measure_score(Y.Targets, model.labels_))
print('Evaluation of GMM with ground truth classification of Iris Dataset')
print('Rand Index:%f ' % sm.adjusted_rand_score(Y.Targets, Y_gmm))
print('Homogenity Score:%f ' % sm.homogeneity_score(Y.Targets, Y_gmm))
print('Completeness Score:%f ' % sm.completeness_score(Y.Targets, Y_gmm))
print('V-Measure:%f ' % sm.v_measure_score(Y.Targets, Y_gmm))

Output:

[150 rows x 6 columns]


Model Labels:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2 2 0]
GMM Labels:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1
 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

Evaluation of K-Means with ground truth classification of Iris Dataset


Rand Index: 0.730238
Homogenity Score: 0.751485
Completeness Score: 0.764986
V-Measure: 0.758176
Evaluation of GMM with ground truth classification of Iris Dataset
Rand Index: 0.903874
Homogenity Score: 0.898326
Completeness Score: 0.901065
V-Measure: 0.899694


9. Write a program to implement the k-Nearest Neighbour algorithm to classify the iris
data set. Print both correct and wrong predictions. Java/Python ML library classes can be
used for this problem.

K-Nearest-Neighbour Algorithm:
1. Load the data.
2. Initialize the value of k.
3. To obtain the predicted class, iterate from 1 to the total number of training data points:
   a. Calculate the distance between the test data and each row of the training data. Here we use
      the Euclidean distance as the distance metric, since it is the most popular method; other
      metrics that can be used are Chebyshev, cosine, etc.
   b. Sort the calculated distances in ascending order of distance value.
   c. Take the top k rows from the sorted array.
   d. Find the most frequent class of these rows, i.e. the labels of the selected k entries.
   e. Return the predicted class:
      for regression, return the mean of the k labels;
      for classification, return the mode of the k labels.
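
A from-scratch sketch of these steps on a few hypothetical 2-D points is given below; the lab program that follows uses scikit-learn's KNeighborsClassifier instead.

import numpy as np
from collections import Counter

train_X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]])   # made-up training points
train_y = np.array(['A', 'A', 'B', 'B'])                               # their class labels
test_point = np.array([4.9, 5.0])
k = 3

distances = np.linalg.norm(train_X - test_point, axis=1)    # step a: Euclidean distance to each row
nearest_labels = train_y[np.argsort(distances)[:k]]         # steps b, c: labels of the k closest rows
prediction = Counter(nearest_labels).most_common(1)[0][0]   # steps d, e: majority vote (mode)
print('Predicted class:', prediction)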

Confusion matrix:
Note,
Class 1: Positive
Class 2: Negative
• Positive (P): the observation is positive (for example: is an apple).
• Negative (N): the observation is not positive (for example: is not an apple).
• True Positive (TP): the observation is positive and is predicted to be positive.
• False Negative (FN): the observation is positive but is predicted negative (also known as a
  "Type II error").
• True Negative (TN): the observation is negative and is predicted to be negative.
• False Positive (FP): the observation is negative but is predicted positive (also known as a
  "Type I error").

For the worked numbers below, assume a test set of 165 instances with TP = 100, TN = 50,
FP = 10 and FN = 5.

Accuracy: Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91
Misclassification Rate: Overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09
(equivalent to 1 minus Accuracy; also known as the "Error Rate")
True Positive Rate: When it's actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95
(also known as "Sensitivity" or "Recall")
False Positive Rate: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
True Negative Rate: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83
(equivalent to 1 minus the False Positive Rate; also known as "Specificity")
Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
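
The same worked numbers (the assumed TP = 100, TN = 50, FP = 10, FN = 5, total = 165 from above) can be reproduced with a few lines of Python:

TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                 # 165

accuracy = (TP + TN) / total              # 0.91
error_rate = (FP + FN) / total            # 0.09
recall = TP / (TP + FN)                   # 0.95  (true positive rate / sensitivity)
false_positive_rate = FP / (FP + TN)      # 0.17
specificity = TN / (TN + FP)              # 0.83  (true negative rate)
precision = TP / (TP + FP)                # 0.91
prevalence = (TP + FN) / total            # 0.64

print(accuracy, error_rate, recall, false_positive_rate, specificity, precision, prevalence)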

#code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import pandas as pd

dataset = pd.read_csv("iris.csv")
X = dataset.iloc[:, :-1]   # the four measurement columns as features
y = dataset.iloc[:, -1]    # the class label in the last column
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)

classifier = KNeighborsClassifier(n_neighbors=8, p=3, metric='euclidean')
classifier.fit(X_train, y_train)

# predict the test results
y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix is as follows\n', cm)
print('Accuracy Metrics')
print(classification_report(y_test, y_pred))
print(" correct prediction", accuracy_score(y_test, y_pred))
print(" wrong prediction", (1 - accuracy_score(y_test, y_pred)))

Output:


Confusion matrix is as follows

[[13 0 0]

[ 0 15 1]

[ 0 0 9]]

Accuracy Metrics

precision recall f1-score support

Iris-setosa 1.00 1.00 1.00 13

Iris-versicolor 1.00 0.94 0.97 16

Iris-virginica 0.90 1.00 0.95 9

avg / total 0.98 0.97 0.97 38

correct prediction 0.9736842105263158

wrong prediction 0.02631578947368418


10. Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select an appropriate data set for your experiment and draw graphs.

• Regression is a technique from statistics that is used to predict values of a desired target
quantity when the target quantity is continuous.
• In regression, we seek to identify (or estimate) a continuous variable y associated with a
given input vector x.
• y is called the dependent variable.
• x is called the independent variable.

Loess/Lowess Regression: Loess regression is a nonparametric technique that uses locally
weighted regression to fit a smooth curve through points in a scatter plot.

Lowess Algorithm: Locally weighted regression is a very powerful non-parametric model used
in statistical learning. Given a dataset X, y, we attempt to find the model parameters β(x) that
minimize a residual sum of weighted squared errors. The weights are given by a kernel
function (k or w), which can be chosen arbitrarily.

Locally Weighted Regression Algorithm:

1. Read the given data sample into X and the curve (linear or non-linear) into Y.
2. Set the value of the smoothing (free) parameter τ.
3. Set the point of interest x0, which is a subset of X.
4. Determine the diagonal weight matrix W using the Gaussian kernel:

       w(i) = exp( −(x(i) − x0)² / (2τ²) )

5. Determine the value of the model parameter β using the weighted normal equation:

       β = (XᵀWX)⁻¹ XᵀWy

6. Prediction = x0 · β


#code

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

def kernel(point, xmat, k):
    m, n = np.shape(xmat)
    weights = np.mat(np.eye(m))    # eye - identity matrix
    for j in range(m):
        diff = point - X[j]
        weights[j, j] = np.exp(diff*diff.T/(-2.0*k**2))
    return weights

def localWeight(point, xmat, ymat, k):
    wei = kernel(point, xmat, k)
    W = (X.T*(wei*X)).I*(X.T*(wei*ymat.T))
    return W

def localWeightRegression(xmat, ymat, k):
    m, n = np.shape(xmat)
    ypred = np.zeros(m)
    for i in range(m):
        ypred[i] = xmat[i]*localWeight(xmat[i], xmat, ymat, k)
    return ypred

def graphPlot(X, ypred):
    sortindex = X[:, 1].argsort(0)   # argsort - indices that sort the bill amounts
    xsort = X[sortindex][:, 0]
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(bill, tip, color='green')
    ax.plot(xsort[:, 1], ypred[sortindex], color='red', linewidth=5)
    plt.xlabel('Total bill')
    plt.ylabel('Tip')
    plt.show()

# load the data points
data = pd.read_csv('Tips.csv')
bill = np.array(data.total_bill)   # we use only the bill amount and tip columns
tip = np.array(data.tip)
mbill = np.mat(bill)               # .mat converts the 1-D array into a 2-D matrix
mtip = np.mat(tip)
m = np.shape(mbill)[1]
one = np.mat(np.ones(m))
X = np.hstack((one.T, mbill.T))    # 244 rows, 2 cols: a column of ones and the bill amounts

# Prediction with k=3
print('\nypred for k=3')
ypred = localWeightRegression(X, mtip, 3)
graphPlot(X, ypred)

# Prediction with k=9
print('\nypred for k=9')
ypred = localWeightRegression(X, mtip, 9)
graphPlot(X, ypred)

Output: scatter plots of total bill versus tip, with the fitted locally weighted regression curve drawn for k = 3 and for k = 9.