
Assignment-1

Of
Machine Learning
On
Decision Tree

Submitted To: Submitted By:


Dr. Kuldeep Kumar Mohit Kumar Goel
Assistant Professor Roll no.-19804003
PhD.- ECE
Decision Tree: A decision tree is one of the most popular and widely used tools for classification and prediction. It is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of that test, and each leaf node (terminal node) holds a class label.

Algorithm:

Create a root node for the tree

If all examples are positive, Return the single-node tree Root, with label = +.

If all examples are negative, Return the single-node tree Root, with label = -.

If the set of predicting attributes is empty, then return the single-node tree Root, with label = most
common value of the target attribute in the examples.

Else

–A = the attribute that best classifies the examples (i.e. the one with the highest information gain).

–Decision Tree attribute for Root = A.

–For each possible value, vi, of A,

•Add a new tree branch below Root, corresponding to the test A = vi.

•Let Examples(vi) be the subset of examples that have the value vi for A

•If Examples(vi) is empty

–Then below this new branch add a leaf node with label = most common target
value in the examples

•Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute,
Attributes – {A})

End

Return Root

ID3 algorithm
I have taken a database of responses from 62 people (attached with the mail). The questionnaire consists of
seven questions about the facilities an employer should provide before a candidate accepts a job offer.

The questions are:

1. Employer must give a good salary package?
2. Employer must give enough leaves in a year?
3. Food facility should be good in the college campus?
4. Employer gives enough chances to get promoted?
5. Salary must be credited on time to the employee's account?
6. Employer must provide medical insurance to its employees?
7. Employer must provide residence facility to the employee?
In the program, we divide the database into two parts: training data and test data. The training data is
selected randomly from the database, and the tree is then built from it using the ID3 algorithm. Entropy and
information gain are used to decide how the data is split at each node of the tree.

Entropy(S) = Σ [ − p(i) · log2 p(i) ]          (sum over the class values i, where p(i) is the proportion of examples in S with class i)

Gain(S, A) = Entropy(S) − Σ [ (|Sv| / |S|) · Entropy(Sv) ]          (sum over the values v of attribute A, where Sv is the subset of S with A = v)

The attribute with the highest information gain becomes the root node of the (sub)tree.
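As a minimal illustration of these two formulas (not part of the assignment code; the vectors and variable names below are chosen only for this sketch), the entropy of a binary class vector and the information gain of one binary attribute could be computed in MATLAB as:

% Minimal sketch of entropy and information gain for binary data.
% 'labels' and 'attr' are illustrative vectors, not the assignment dataset.
labels = [1 1 1 0 0];      % class values (1 = yes, 0 = no)
attr   = [1 1 0 0 0];      % one candidate splitting attribute

binEnt = @(p) -p.*log2(max(p, eps)) - (1-p).*log2(max(1-p, eps));  % binary entropy

H_S  = binEnt(mean(labels));                 % Entropy(S)
w1   = mean(attr == 1);                      % |S(v=1)| / |S|
w0   = mean(attr == 0);                      % |S(v=0)| / |S|
H_S1 = binEnt(mean(labels(attr == 1)));      % Entropy(S(v=1))
H_S0 = binEnt(mean(labels(attr == 0)));      % Entropy(S(v=0))

gain = H_S - w1*H_S1 - w0*H_S0;              % Gain(S, attr)

With these example vectors the gain works out to about 0.42 bits, which is the kind of value the ID3 function below compares across the attributes.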

In the program, a decision tree is built with the ID3 algorithm, and its performance is compared with a
baseline classifier that uses only the prior probability of true and false in the training data (i.e. it simply
predicts the more common of the two class values).

The overall program consists of four parts (functions):

Part 1: Main function: decisiontree: Here we provide the path of the dataset, the size of the training set,
and the number of trials.
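For example, if the dataset is saved as a text file, the function might be invoked from the MATLAB command window like this (the file name and path below are placeholders, not the actual file used in the assignment):

decisiontree('C:\data\job_survey.txt', 20, 1);

This would train on 20 randomly chosen responses, test on the remaining 42, and run a single trial.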

function[] = decisiontree(mohit_path, size_of_traingset, numberOfTrials)


% DECISIONTREE Create a decision tree by following the ID3 algorithm
% args:
% mohit_path - the fully specified path to input file
% size_of_traingset - integer specifying number of examples from input
%                     used to train the dataset
% numberOfTrials - integer specifying how many times a decision tree
%                  will be built from a randomly selected subset
%                  of the training examples

fid = fopen(mohit_path, 'rt');


dataInput = textscan(fid, '%s');
fclose(fid);
i = 1;
% First store the attributes into a cell array
while (~strcmp(dataInput{1}{i}, 'CLASS'));
i = i + 1;
end
attributes = cell(1,i);
for j=1:i;
attributes{j} = dataInput{1}{j};
end
numAttributes = i;
numInstances = (length(dataInput{1}) - numAttributes) / numAttributes;
% Then store the data into matrix
data = zeros(numInstances, numAttributes);
i = i + 1;
for j=1:numInstances
for k=1:numAttributes
data(j, k) = strcmp(dataInput{1}{i}, 'yes');
i = i + 1;
end
end
% Here is where the trials start
for i=1:numberOfTrials;
fprintf('TRIAL NUMBER: %d\n\n', i);
rows = sort(randsample(numInstances, size_of_traingset));
% Initialize two new matrices, training set and test set
trainingSet = zeros(size_of_traingset, numAttributes);
testingSetSize = (numInstances - size_of_traingset);
testingSet = zeros(testingSetSize, numAttributes);
% Loop through data matrix, copying relevant rows to each matrix
training_index = 1;
testing_index = 1;
for data_index=1:numInstances;
if (rows(training_index) == data_index);
trainingSet(training_index, :) = data(data_index, :);
if (training_index < size_of_traingset);
training_index = training_index + 1;
end
else
testingSet(testing_index, :) = data(data_index, :);
if (testing_index < testingSetSize);
testing_index = testing_index + 1;
end
end
end

for ii=1:numAttributes;
fprintf('%s\t', attributes{ii});
end
fprintf('\n');
for ii=1:size_of_traingset;
for jj=1:numAttributes;
if (trainingSet(ii, jj));
fprintf('%s\t', 'yes');
else
fprintf('%s\t', 'no');
end
end
fprintf('\n');
end
% Estimate the expected prior classification (the majority class in the
% training set)
if (sum(trainingSet(:, numAttributes)) >= size_of_traingset / 2);
expectedPrior = 'yes';
else
expectedPrior = 'no';
end

% Construct a decision tree on the training set using the ID3 algorithm
activeAttributes = ones(1, length(attributes) - 1);
new_attributes = attributes(1:length(attributes)-1);
tree = ID3(trainingSet, attributes, activeAttributes);

% Print out the tree


fprintf('DECISION TREE STRUCTURE:\n');
PrintTree(tree, 'root');
ID3_Classifications = zeros(testingSetSize,2);
ExpectedPrior_Classifications = zeros(testingSetSize,2);
ID3_numCorrect = 0; ExpectedPrior_numCorrect = 0;
for k=1:testingSetSize; %over the testing set
% Call a recursive function to follow the tree nodes and classify
ID3_Classifications(k,:) = ...
ClassifyByTree(tree, new_attributes, testingSet(k,:));

ExpectedPrior_Classifications(k, 2) = testingSet(k,numAttributes);
if (strcmp(expectedPrior, 'yes'));
ExpectedPrior_Classifications(k, 1) = 1;
else
ExpectedPrior_Classifications(k, 1) = 0;
end

if (ID3_Classifications(k,1) == ID3_Classifications(k, 2)); %correct
ID3_numCorrect = ID3_numCorrect + 1;
end
if (ExpectedPrior_Classifications(k,1) == ExpectedPrior_Classifications(k,2));
ExpectedPrior_numCorrect = ExpectedPrior_numCorrect + 1;
end
end
% Calculate the proportions correct and print out
if (testingSetSize);
ID3_Percentage = round(100 * ID3_numCorrect / testingSetSize);
ExpectedPrior_Percentage = round(100 * ExpectedPrior_numCorrect / testingSetSize);
else
ID3_Percentage = 0;
ExpectedPrior_Percentage = 0;
end
ID3_Percentages(i) = ID3_Percentage;
ExpectedPrior_Percentages(i) = ExpectedPrior_Percentage;

fprintf('\tPercent of test cases correctly classified by an ID3 decision tree = %d\n', ID3_Percentage);
fprintf('\tPercent of test cases correctly classified by using prior probabilities from the training set = %d\n\n', ExpectedPrior_Percentage);
end

meanID3 = round(mean(ID3_Percentages));
meanPrior = round(mean(ExpectedPrior_Percentages));

% Print out remaining details


fprintf('example file used = %s\n', mohit_path);
fprintf('number of trials = %d\n', numberOfTrials);
fprintf('training set size for each trial = %d\n', size_of_traingset);
fprintf('testing set size for each trial = %d\n', testingSetSize);
fprintf('mean performance (percentage correct) of decision tree over all trials = %d\n', meanID3);
fprintf('mean performance (percentage correct) of prior probability from training set = %d\n\n', meanPrior);
end

Part 2: ID3: This function computes the entropy and information gain at each node and recursively builds the tree.
function [tree] = ID3(examples, attributes, activeAttributes)
if (isempty(examples));
error('Must provide examples');
end

% Constants
numberAttributes = length(activeAttributes);
numberExamples = length(examples(:,1));
% Create the tree node
tree = struct('value', 'null', 'left', 'null', 'right', 'null');

% If last value of all rows in examples is 1, return tree labeled 'yes'


lastColumnSum = sum(examples(:, numberAttributes + 1));
if (lastColumnSum == numberExamples);
tree.value = 'yes';
return
end
% If last value of all rows in examples is 0, return tree labeled 'no'
if (lastColumnSum == 0);
tree.value = 'no';
return
end
% If activeAttributes is empty, then return tree with label as most common
% value
if (sum(activeAttributes) == 0);
if (lastColumnSum >= numberExamples / 2);
tree.value = 'yes';
else
tree.value = 'no';
end
return
end

% Find the current entropy


p1 = lastColumnSum / numberExamples;
if (p1 == 0);
p1_eq = 0;
else
p1_eq = -1*p1*log2(p1);
end
p0 = (numberExamples - lastColumnSum) / numberExamples;
if (p0 == 0);
p0_eq = 0;
else
p0_eq = -1*p0*log2(p0);
end
currentEntropy = p1_eq + p0_eq;
% Find the attribute that maximizes information gain
gains = -1*ones(1,numberAttributes); %-1 if inactive, gains for all else
% Loop through attributes updating gains, making sure they are still active
for i=1:numberAttributes;
if (activeAttributes(i)) % this one is still active, update its gain
s0 = 0; s0_and_yes = 0;
s1 = 0; s1_and_yes = 0;
for j=1:numberExamples;
if (examples(j,i)); % this instance has splitting attr. yes
s1 = s1 + 1;
if (examples(j, numberAttributes + 1)); %target attr is yes
s1_and_yes = s1_and_yes + 1;
end
else
s0 = s0 + 1;
if (examples(j, numberAttributes + 1)); %target attr is yes
s0_and_yes = s0_and_yes + 1;
end
end
end
% Entropy for S(v=1)
if (~s1);
p1 = 0;
else
p1 = (s1_and_yes / s1);
end
if (p1 == 0);
p1_eq = 0;
else
p1_eq = -1*(p1)*log2(p1);
end
if (~s1);
p0 = 0;
else
p0 = ((s1 - s1_and_yes) / s1);
end
if (p0 == 0);
p0_eq = 0;
else
p0_eq = -1*(p0)*log2(p0);
end
entropy_s1 = p1_eq + p0_eq;

% Entropy for S(v=0)


if (~s0);
p1 = 0;
else
p1 = (s0_and_yes / s0);
end
if (p1 == 0);
p1_eq = 0;
else
p1_eq = -1*(p1)*log2(p1);
end
if (~s0);
p0 = 0;
else
p0 = ((s0 - s0_and_yes) / s0);
end
if (p0 == 0);
p0_eq = 0;
else
p0_eq = -1*(p0)*log2(p0);
end
entropy_s0 = p1_eq + p0_eq;

gains(i) = currentEntropy - ((s1/numberExamples)*entropy_s1) - ...
((s0/numberExamples)*entropy_s0);
end
end

% Pick the attribute that maximizes gains


[~, bestAttribute] = max(gains);
% Set tree.value to bestAttribute's relevant string
tree.value = attributes{bestAttribute};
% Remove splitting attribute from activeAttributes
activeAttributes(bestAttribute) = 0;

% Initialize and create the new example matrices


examples_0 = []; examples_0_index = 1;
examples_1 = []; examples_1_index = 1;
for i=1:numberExamples;
if (examples(i, bestAttribute)); % this instance has it as 1/yes
examples_1(examples_1_index, :) = examples(i, :); % copy over
examples_1_index = examples_1_index + 1;
else
examples_0(examples_0_index, :) = examples(i, :);
examples_0_index = examples_0_index + 1;
end
end

% For both values of the splitting attribute


% For value = no or 0, corresponds to left branch
% If examples_0 is empty, add leaf node to the left with relevant label
if (isempty(examples_0));
leaf = struct('value', 'null', 'left', 'null', 'right', 'null');
if (lastColumnSum >= numberExamples / 2); % for matrix examples
leaf.value = 'yes';
else
leaf.value = 'no';
end
tree.left = leaf;
else
% Here is where we recurse
tree.left = ID3(examples_0, attributes, activeAttributes);
end
% For value = yes or 1, corresponds to right branch
% If examples_1 is empty, add leaf node to the right with relevant label
if (isempty(examples_1));
leaf = struct('value', 'null', 'left', 'null', 'right', 'null');
if (lastColumnSum >= numberExamples / 2); % for matrix examples
leaf.value = 'yes';
else
leaf.value = 'no';
end
tree.right = leaf;
else
% Here is where we recurse
tree.right = ID3(examples_1, attributes, activeAttributes);
end

% Now we can return tree


return
end

Part 3: ClassifyByTree: Given the learned tree, this function classifies a single test instance by following the
branches that match the instance's attribute values until it reaches a leaf.

function [classifications] = ClassifyByTree(tree, attributes, instance)


% Store the actual classification
actual = instance(1, length(instance));

% Recursion with 3 cases

% Case 1: Current node is labeled 'yes'


% So trivially return the classification as 1
if (strcmp(tree.value, 'yes'));
classifications = [1, actual];
return
end

% Case 2: Current node is labeled 'no'


% So trivially return the classification as 0
if (strcmp(tree.value, 'no'));
classifications = [0, actual];
return
end

% Case 3: Current node is labeled an attribute


% Follow correct branch by looking up index in attributes, and recur
index = find(ismember(attributes,tree.value)==1);
if (instance(1, index)); % attribute is yes for this instance
% Recur down the right side
classifications = ClassifyByTree(tree.right, attributes, instance);
else
% Recur down the left side
classifications = ClassifyByTree(tree.left, attributes, instance);
end

return
end
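As a small usage sketch (reusing the variables already created in the main function above), a single test instance could be classified and checked like this:

% Sketch: classify the first row of the testing set with the learned tree.
% result(1) is the predicted class (1 = yes, 0 = no), result(2) the actual class.
result = ClassifyByTree(tree, new_attributes, testingSet(1, :));
if (result(1) == result(2));
    fprintf('first test instance classified correctly\n');
end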

Part 4: PrintTree: This function prints the tree in the command window.
function [] = PrintTree(tree, parent)
% Print current node
if (strcmp(tree.value, 'yes'));
fprintf('parent: %s\tyes\n', parent);
return
elseif (strcmp(tree.value, 'no'));
fprintf('parent: %s\tno\n', parent);
return
else
% Current node an attribute splitter
fprintf('parent: %s\tattribute: %s\tnoChild:%s\tyesChild:%s\n', ...
parent, tree.value, tree.left.value, tree.right.value);
end

% Recur the left subtree


PrintTree(tree.left, tree.value);

% Recur the right subtree


PrintTree(tree.right, tree.value);

end
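Each call prints one line per node. For illustration only (the actual attribute names and children depend on the tree learned from the randomly chosen training set), an internal node would be printed in the form:

parent: root    attribute: Goodsalary    noChild:enoughleave    yesChild:yes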
Dataset:

S.No.  Goodsalary  enoughleave  GoodFoodfacility  Timelypromotion  Ontimesalary  Medicalinsurance  ProvideHome  CLASS
1 yes yes yes yes yes yes yes yes
2 yes yes yes yes yes yes no yes
3 yes yes yes yes yes no yes yes
4 yes yes yes yes yes no no yes
5 yes yes yes yes no yes yes yes
6 yes yes yes yes no yes no yes
7 yes yes yes yes no no yes yes
8 yes yes yes yes no no no yes
9 no no yes no yes no no no
10 no no no yes no no yes no
11 no no no no yes no no no
12 no no no no no no no no
13 no no no yes no no no no
14 no no no no yes yes yes no
15 no no no yes yes yes yes no
16 no no yes no no no yes no
17 no no yes yes yes no yes no
18 no no yes yes yes yes yes no
19 no no yes yes yes yes yes no
20 no yes yes yes yes yes yes yes
21 no yes yes yes yes no yes yes
22 no yes yes yes no yes yes yes
23 no yes yes yes yes yes yes yes
24 yes yes yes no no yes yes no
25 yes yes yes no no yes no no
26 yes yes yes no no no yes no
27 yes yes yes no no no no no
28 yes yes yes no no no no no
29 yes yes no yes yes yes yes yes
30 yes yes no yes no no yes yes
31 yes yes no yes yes yes yes yes
32 yes yes no yes no yes yes yes
33 yes no no yes yes no yes yes
34 yes no no yes no yes yes yes
35 yes no no yes yes yes yes yes
36 yes no no yes no yes no no
37 yes no no yes yes no no no
38 yes no no yes yes yes no no
39 yes no no yes no no no no
40 yes yes no yes yes yes no yes
41 yes yes no yes yes yes no yes
42 yes yes no yes no yes no yes
43 yes yes no yes no yes yes yes
44 no yes no no yes yes yes no
45 no yes no no yes yes no no
46 no yes no no yes no yes no
47 no yes no no yes no no no
48 no yes no no no yes yes no
49 no yes no no no yes no no
50 no yes no no no no yes no
51 no yes no yes yes yes yes no
52 no yes yes yes yes no yes yes
53 no yes yes yes no yes yes yes
54 no yes yes yes yes yes yes yes
55 no yes yes yes no no yes yes
56 no yes yes no yes yes yes yes
57 no yes yes no yes no yes yes
58 no yes yes no yes yes yes yes
59 no yes yes no yes no yes yes
60 no yes yes no no yes yes no
61 no yes yes no no no yes no
62 no yes yes no no yes yes no
Result of the tree when the size of the training set is 20 and the number of trials is 1.

From the program we observe that:

(i) Since the training data is selected randomly, the performance may differ from one trial to the next.
(ii) The ID3 decision tree performs better than the baseline based on the prior probability of true and false.
(iii) Depending on which training set is selected, a smaller number of training samples (e.g. 25 instead of 30) can sometimes give a better result.
