My ML Lab Manual


SJC Institute of Technology

Department of Computer Science & Engineering

MACHINE LEARNING LAB MANUAL

Designed and Compiled by: Prof. Harshavardhana Doddamani


MACHINE LEARNING LABORATORY
[As per Choice Based Credit System (CBCS) scheme]
(Effective from the academic year 2016-2017)
SEMESTER – VII
Subject Code: 15CSL76                        IA Marks: 20
Number of Lecture Hours/Week: 01I + 02P      Exam Marks: 80
Total Number of Lecture Hours: 40            Exam Hours: 03
CREDITS – 02
Course objectives: This course will enable students to
1. Make use of Data sets in implementing the machine learning algorithms
2. Implement the machine learning concepts and algorithms in any suitable language of choice.
Description (If any):
1. The programs can be implemented in either JAVA or Python.
2. For Problems 1 to 6 and 10, programs are to be developed without using the built-in
classes or APIs of Java/Python.
3. Data sets can be taken from standard repositories (https://archive.ics.uci.edu/ml/datasets.html)
or constructed by the students.
Lab Experiments:
1. Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training data from a
.CSV file.
2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set of
all hypotheses consistent with the training examples.
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to
classify a new sample.
4. Build an Artificial Neural Network by implementing the Backpropagation algorithm
and test the same using appropriate data sets.
5. Write a program to implement the naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data
sets.
6. Assuming a set of documents that need to be classified, use the naïve Bayesian
Classifier model to perform this task. Built-in Java classes/API can be used to write the
program. Calculate the accuracy, precision, and recall for your data set.
7. Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease Data
Set. You can use Java/Python ML library classes/API.
8. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set
for clustering using k-Means algorithm. Compare the results of these two algorithms and
comment on the quality of clustering. You can add Java/Python ML library classes/API in
the program.
9. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data
set. Print both correct and wrong predictions. Java/Python ML library classes can be used
for this problem.
10. Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select appropriate data set for your experiment and draw graphs.
Study Experiment / Project:
NIL
Course outcomes: The students should be able to:
1. Understand the implementation procedures for the machine learning algorithms.
2. Design Java/Python programs for various Learning algorithms.
3. Apply appropriate data sets to the Machine Learning algorithms.
4. Identify and apply Machine Learning algorithms to solve real world problems.
Conduction of Practical Examination:
All laboratory experiments are to be included for practical examination.
Students are allowed to pick one experiment from the lot.
Strictly follow the instructions as printed on the cover page of the answer script.
Marks distribution: Procedure + Conduction + Viva: 20 + 50 + 10 (80)
Change of experiment is allowed only once, and the marks allotted to the procedure part will be
made zero.

Problem 1: Implement and demonstrate the FIND-S algorithm for finding the
most specific hypothesis based on a given set of training data samples. Read the
training data from a .CSV file.

Algorithm:
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   • For each attribute constraint a_i in h
       If the constraint a_i in h is satisfied by x, then do nothing
       Else replace a_i in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
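
The steps above can also be sketched as a small function that starts from the most specific hypothesis ('0' in every position) instead of copying the first training example. This is a minimal sketch, not the manual's program: the function name find_s and the in-memory training_data list are assumptions made for illustration, and the rows mirror the four examples printed in the Output section of this problem.

```python
def find_s(examples, positive_label="Yes"):
    """Return the maximally specific hypothesis for the positive examples.

    Each example is a list of attribute values followed by the class label.
    '0' means "no value is acceptable"; '?' means "any value is acceptable".
    """
    num_attributes = len(examples[0]) - 1
    hypothesis = ['0'] * num_attributes  # step 1: most specific hypothesis

    for example in examples:
        *values, label = example
        if label != positive_label:
            continue  # FIND-S ignores negative examples
        for j, value in enumerate(values):
            if hypothesis[j] == '0':
                hypothesis[j] = value   # first positive example: adopt its value
            elif hypothesis[j] != value:
                hypothesis[j] = '?'     # conflicting values: generalize to "any"
    return hypothesis


# Hypothetical in-memory copy of the training data (normally read from the .CSV file)
training_data = [
    ['sunny', 'warm', 'normal', 'strong', 'warm', 'same', 'Yes'],
    ['sunny', 'warm', 'high', 'strong', 'warm', 'same', 'Yes'],
    ['rainy', 'cold', 'high', 'strong', 'warm', 'change', 'No'],
    ['sunny', 'warm', 'high', 'strong', 'cool', 'change', 'Yes'],
]

result = find_s(training_data)
print(result)
```

Starting from '0' rather than from the first row means the function still behaves sensibly when the first training example happens to be negative.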

Illustration:
[Figures not reproduced: FIND-S trace for Step 1, Step 2, and Iteration 4 / Step 3.]

Source Code of the Program:


import csv

# Possible values of each attribute (listed for reference)
attributes = [['Sunny', 'Rainy'],
              ['Warm', 'Cold'],
              ['Normal', 'High'],
              ['Strong', 'Weak'],
              ['Warm', 'Cool'],
              ['Same', 'Change']]

num_attributes = len(attributes)

print("\n The most general hypothesis : ['?', '?', '?', '?', '?', '?']\n")
print("\n The most specific hypothesis : ['0', '0', '0', '0', '0', '0']\n")

a = []
print("\n The Given Training Data Set \n")

# Read the training examples; the last column holds the class label (Yes/No)
with open('C:\\Users\\hd\\Desktop\\Data\\tennis.csv', 'r') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        a.append(row)
        print(row)

print("\n The initial value of hypothesis: ")
hypothesis = ['0'] * num_attributes
print(hypothesis)

# Initialize the hypothesis from the first training example
# (this assumes the first example in the file is a positive one)
for j in range(0, num_attributes):
    hypothesis[j] = a[0][j]

# Compare the hypothesis with every training example of the data set
print("\n Find S : Finding a Maximally Specific Hypothesis\n")

for i in range(0, len(a)):
    # Only positive examples can generalize the hypothesis
    if a[i][num_attributes] == 'Yes':
        for j in range(0, num_attributes):
            if a[i][j] != hypothesis[j]:
                hypothesis[j] = '?'
            else:
                hypothesis[j] = a[i][j]
    print(" For Training Example No : {0} the hypothesis is ".format(i), hypothesis)

print("\n The Maximally Specific Hypothesis for a given Training Examples :\n")
print(hypothesis)
Output:

The most general hypothesis : ['?','?','?','?','?','?']


The most specific hypothesis : ['0','0','0','0','0','0']
The Given Training Data Set
['sunny', 'warm', 'normal', 'strong', 'warm', 'same', 'Yes']
['sunny', 'warm', 'high', 'strong', 'warm', 'same', 'Yes']
['rainy', 'cold', 'high', 'strong', 'warm', 'change', 'No']
['sunny', 'warm', 'high', 'strong', 'cool', 'change', 'Yes']
The initial value of hypothesis:
['0', '0', '0', '0', '0', '0']
Find S: Finding a Maximally Specific Hypothesis
For Training Example No :0 the hypothesis is ['sunny', 'warm', 'normal', 'strong', 'warm', 'same']
For Training Example No :1 the hypothesis is ['sunny', 'warm', '?', 'strong', 'warm', 'same']
For Training Example No :2 the hypothesis is ['sunny', 'warm', '?', 'strong', 'warm', 'same']
For Training Example No :3 the hypothesis is ['sunny', 'warm', '?', 'strong', '?', '?']
The Maximally Specific Hypothesis for a given Training Examples :
['sunny', 'warm', '?', 'strong', '?', '?']

Problem 2: For a given set of training data examples stored in a .CSV file,
implement and demonstrate the Candidate-Elimination algorithm to output a
description of the set of all hypotheses consistent with the training examples.
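
The idea behind Candidate-Elimination can be sketched as follows. This is a minimal illustration under stated assumptions, not the manual's program: the helper names (covers, more_general, candidate_elimination), the in-memory examples list, and the two-valued domains are all introduced here for illustration, and hypotheses are restricted to conjunctions of attribute values with '?' (any value) and '0' (no value).

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))


def more_general(h1, h2):
    """True if h1 is more general than or equal to h2."""
    return all(a == '?' or b == '0' or a == b for a, b in zip(h1, h2))


def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple(['0'] * n)]   # specific boundary
    G = [tuple(['?'] * n)]   # general boundary
    for x, label in examples:
        if label == 'Yes':
            # Drop general hypotheses that reject this positive example
            G = [g for g in G if covers(g, x)]
            new_S = []
            for s in S:
                if covers(s, x):
                    new_S.append(s)
                    continue
                # Minimal generalization of s that covers x
                h = tuple(xc if sc == '0' else (sc if sc == xc else '?')
                          for sc, xc in zip(s, x))
                if any(more_general(g, h) for g in G):
                    new_S.append(h)
            S = new_S
        else:
            # Drop specific hypotheses that accept this negative example
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # Minimal specializations of g that exclude x
                for i in range(n):
                    if g[i] != '?':
                        continue
                    for value in domains[i]:
                        if value != x[i]:
                            h = g[:i] + (value,) + g[i + 1:]
                            if any(more_general(h, s) for s in S):
                                new_G.append(h)
            # Keep only the maximally general members
            G = [g for g in new_G
                 if not any(g2 != g and more_general(g2, g) for g2 in new_G)]
    return S, G


# Hypothetical attribute domains and training data (normally read from the .CSV file)
domains = [['sunny', 'rainy'], ['warm', 'cold'], ['normal', 'high'],
           ['strong', 'weak'], ['warm', 'cool'], ['same', 'change']]
examples = [
    (('sunny', 'warm', 'normal', 'strong', 'warm', 'same'), 'Yes'),
    (('sunny', 'warm', 'high', 'strong', 'warm', 'same'), 'Yes'),
    (('rainy', 'cold', 'high', 'strong', 'warm', 'change'), 'No'),
    (('sunny', 'warm', 'high', 'strong', 'cool', 'change'), 'Yes'),
]

S, G = candidate_elimination(examples, domains)
print("S:", S)
print("G:", G)
```

The version space is the set of all hypotheses lying between the S and G boundaries; the sketch returns only the two boundaries.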
[Figures not reproduced: Trace – 1, Trace – 2, Trace – 3, and the Final Version Space.]

Source Code:
OUTPUT:
Problem 3: Write a program to demonstrate the working of the decision tree
based ID3 algorithm. Use an appropriate data set for building the decision tree
and apply this knowledge to classify a new sample.

Algorithm:

Illustration: To illustrate the operation of ID3, consider the learning task represented by the
examples below. Compute the information gain of each attribute and identify which attribute is the best.
Day   Outlook    Temperature   Humidity   Wind     Play Tennis
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes
D6    Rain       Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Weak     Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rain       Mild          Normal     Strong   Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rain       Mild          High       Strong   No

Which attribute to test at the root?
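
One way to answer this question is to compute the information gain of each attribute by hand. The sketch below does so in plain Python on an in-memory copy of the 14 examples; the names DATA, ATTRIBUTES, entropy and information_gain are assumptions made for illustration, and D9's temperature is taken as Cool, as in the standard PlayTennis data.

```python
import math
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for days D1 to D14
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the whole set minus the weighted entropy after the split."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"Gain(S, {name}) = {gain:.3f}")
print("Best attribute for the root:", best_attribute)
```

Outlook has the highest gain (about 0.247, against roughly 0.152 for Humidity, 0.048 for Wind and 0.029 for Temperature), so it is chosen as the root of the tree.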

[Figures not reproduced: the partial decision tree after the first step, and after the second and third steps.]

Source Code:
import math
from collections import Counter

import pandas as pd

# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0
# keeps the first column as the index, which from_csv did by default.
df_tennis = pd.read_csv('C:\\Users\\HD\\Desktop\\Data\\PlayTennis.csv', index_col=0)
df_tennis

def entropy(probs):
    # Calculate the entropy of the given probabilities
    return sum([-prob * math.log(prob, 2) for prob in probs])

def entropy_of_list(a_list):
    # Entropy of a list (pandas Series) of discrete values (Yes/No)
    cnt = Counter(x for x in a_list)
    print("No and Yes Classes:", a_list.name, cnt)
    num_instances = len(a_list) * 1.0
    probs = [x / num_instances for x in cnt.values()]
    return entropy(probs)  # Call entropy

# The initial entropy of the Yes/No attribute for our dataset.
#print(df_tennis['PlayTennis'])
total_entropy = entropy_of_list(df_tennis['PlayTennis'])
print("Entropy of given PlayTennis Data Set:", total_entropy)

Output :
No and Yes Classes : PlayTennis Counter({'Yes': 9, 'No': 5})
Entropy of given PlayTennis Data Set : 0.9402859586706309

Information Gain of Attributes


def information_gain(df, split_attribute_name, target_attribute_name, trace=0):
    '''
    Takes a DataFrame of attributes, and quantifies the entropy of a target
    attribute after performing a split along the values of another attribute.
    '''
    print("Information Gain Calculation of ", split_attribute_name)
    # Split data by the possible values of the attribute:
    df_split = df.groupby(split_attribute_name)
    #print(df_split.groups)
    for name, group in df_split:
        print(name)
        print(group)
    # Calculate the entropy of the target attribute, as well as the
    # proportion of observations in each data split
    nobs = len(df.index) * 1.0
    #print("NOBS",nobs)
    df_agg_ent = df_split.agg({target_attribute_name: [entropy_of_list,
                               lambda x: len(x) / nobs]})[target_attribute_name]
    #print("DFAGGENT",df_agg_ent)
    df_agg_ent.columns = ['Entropy', 'PropObservations']
    #if trace: # helps understand what the function is doing:
    #    print(df_agg_ent)
    # Calculate information gain:
    new_entropy = sum(df_agg_ent['Entropy'] * df_agg_ent['PropObservations'])
    old_entropy = entropy_of_list(df[target_attribute_name])
    return old_entropy - new_entropy


print('Info-gain for Outlook is :' +
      str(information_gain(df_tennis, 'Outlook', 'PlayTennis')), "\n")
print('\n Info-gain for Humidity is: ' +
      str(information_gain(df_tennis, 'Humidity', 'PlayTennis')), "\n")
print('\n Info-gain for Wind is:' +
      str(information_gain(df_tennis, 'Wind', 'PlayTennis')), "\n")
print('\n Info-gain for Temperature is:' +
      str(information_gain(df_tennis, 'Temperature', 'PlayTennis')), "\n")

ID3 Algorithm
def id3(df, target_attribute_name, attribute_names, default_class=None):
    ## Tally target attribute:
    from collections import Counter
    cnt = Counter(x for x in df[target_attribute_name])  # counts of the Yes/No classes
    ## First check: Is this split of the dataset homogeneous?
    if len(cnt) == 1:
        return next(iter(cnt))
    ## Second check: Is this split of the dataset empty, or are no attributes left?
    # If yes, return a default value
    elif df.empty or (not attribute_names):
        return default_class
    ## Otherwise: This dataset is ready to be divided up!
    else:
        # Default value for the next recursive call of this function:
        # the most common value of the target attribute in this dataset
        default_class = cnt.most_common(1)[0][0]
        # Choose the best attribute to split on:
        gainz = [information_gain(df, attr, target_attribute_name)
                 for attr in attribute_names]
        index_of_max = gainz.index(max(gainz))
        best_attr = attribute_names[index_of_max]
        # Create an empty tree, to be populated in a moment
        tree = {best_attr: {}}
        remaining_attribute_names = [i for i in attribute_names if i != best_attr]
        # Split the dataset. On each split, recursively call this algorithm
        # and populate the empty tree with the subtrees that the
        # recursive calls return.
        for attr_val, data_subset in df.groupby(best_attr):
            subtree = id3(data_subset,
                          target_attribute_name,
                          remaining_attribute_names,
                          default_class)
            tree[best_attr][attr_val] = subtree
        return tree

Predicting Attributes:
# Get Predictor Names (all but 'class')
attribute_names = list(df_tennis.columns)
print("List of Attributes:", attribute_names)
attribute_names.remove('PlayTennis') #Remove the class attribute
print("Predicting Attributes:", attribute_names)

Tree Construction:
# Run Algorithm:
from pprint import pprint
tree = id3(df_tennis,'PlayTennis',attribute_names)
print("\n\nThe Resultant Decision Tree is :\n")
pprint(tree)

Classification Accuracy:
def classify(instance, tree, default=None):
    attribute = next(iter(tree))  # root attribute of this (sub)tree
    if instance[attribute] in tree[attribute].keys():
        result = tree[attribute][instance[attribute]]
        if isinstance(result, dict):  # this is a tree, delve deeper
            return classify(instance, result, default)
        else:
            return result  # this is a label
    else:
        return default

# The classify function allows for a default argument: when the tree has no
# answer for a particular combination of attribute values, 'No' is used as
# the default guess.
df_tennis['predicted'] = df_tennis.apply(classify, axis=1, args=(tree, 'No'))

print('Accuracy is:' +
      str(sum(df_tennis['PlayTennis'] == df_tennis['predicted']) /
          (1.0 * len(df_tennis.index))))
df_tennis[['PlayTennis', 'predicted']]
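
The problem statement also asks for a new sample to be classified, whereas the listing above only measures accuracy on the training data. A minimal sketch of that last step follows. It assumes the tree is the nested dictionary that ID3 is expected to learn from the PlayTennis data (written here as a literal so that the block runs on its own), and the names learned_tree, classify_sample and new_sample are hypothetical.

```python
# Assumed result of ID3 on the PlayTennis data, written as a literal
learned_tree = {
    'Outlook': {
        'Overcast': 'Yes',
        'Rain': {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
        'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
    }
}


def classify_sample(instance, tree, default=None):
    """Walk the tree from the root until a class label is reached."""
    attribute = next(iter(tree))        # attribute tested at this node
    branches = tree[attribute]
    value = instance.get(attribute)
    if value not in branches:
        return default                  # unseen attribute value: fall back
    result = branches[value]
    if isinstance(result, dict):        # internal node: keep descending
        return classify_sample(instance, result, default)
    return result                       # leaf: class label


# A new, previously unseen day (hypothetical sample)
new_sample = {'Outlook': 'Sunny', 'Temperature': 'Cool',
              'Humidity': 'High', 'Wind': 'Strong'}

prediction = classify_sample(new_sample, learned_tree, default='No')
print("Prediction for the new sample:", prediction)
```

For this sample the walk goes Outlook = Sunny, then Humidity = High, so the prediction is No.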
