
IE4903 DATA MINING LECTURE HOMEWORK 4

Fato LB-1535459

Q1)

Since the Gini index, entropy and misclassification error all measure impurity, their values increase as impurity increases. Thus, they attain their maximum when the records are equally distributed among all classes, which carries the least class information: if there are n classes and T records at node t, then p(i|t) = (T/n)/T = 1/n for every class i.

Maximum Gini index = 1 − Σᵢ p(i|t)² = 1 − n·(1/n)² = 1 − 1/n

Maximum entropy = −Σᵢ p(i|t)·log₂ p(i|t) = −n·(1/n)·log₂(1/n) = log₂ n

Maximum misclassification error = 1 − maxᵢ p(i|t) = 1 − 1/n

For example, for n = 2 equally distributed classes these maxima are 1/2, 1 and 1/2 respectively.
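These maxima can be verified numerically. A minimal MATLAB sketch (not part of the original homework; the three anonymous function names are illustrative only):

gini  = @(p) 1 - sum(p.^2);                % Gini index of a probability vector p
ent   = @(p) -sum(p(p>0) .* log2(p(p>0))); % entropy, with 0*log2(0) taken as 0
miscl = @(p) 1 - max(p);                   % misclassification error

% The uniform distribution over n = 4 classes attains the maxima
% 1 - 1/n = 0.75, log2(n) = 2 and 1 - 1/n = 0.75:
[gini([.25 .25 .25 .25]), ent([.25 .25 .25 .25]), miscl([.25 .25 .25 .25])]
% Any skewed distribution scores strictly lower on all three measures:
[gini([.7 .1 .1 .1]), ent([.7 .1 .1 .1]), miscl([.7 .1 .1 .1])]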

Q3) Code for the question:


% Read the data and randomly split it into a training set (1600 rows)
% and a validation set (400 rows).
matris = xlsread('Associations.xls');
v = randperm(2000);
tmatris = matris;
vmatris = matris;
for i = 1:1600
    tmatris(i,:) = matris(v(i),:);
end
tmatris(1601:2000,:) = [];
for i = 1601:2000
    vmatris(i-1600,:) = matris(v(i),:);
end
vmatris(401:2000,:) = [];

% Separate the training rows by class label (column 11) and estimate the
% class priors: pbir = P(class = 1), psifir = P(class = 0).
j = 0;                   % number of class-1 training rows
k = 0;                   % number of class-0 training rows
birtmatris = tmatris;    % will hold the class-1 rows
sifirtmatris = tmatris;  % will hold the class-0 rows
for i = 1:1600
    if tmatris(i,11) == 1
        j = j + 1;
        birtmatris(j,:) = tmatris(i,:);
    else
        k = k + 1;
        sifirtmatris(k,:) = tmatris(i,:);
    end
end
birtmatris(j+1:1600,:) = [];
sifirtmatris(k+1:1600,:) = [];
pbir = j / 1600;
psifir = k / 1600;
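For reference (not in the original homework), the same split-by-class bookkeeping can be written with logical indexing, assuming the labels in column 11 are exactly 0 or 1:

% Equivalent vectorized form of the loop above.
birtmatris = tmatris(tmatris(:,11) == 1, :);   % class-1 training rows
sifirtmatris = tmatris(tmatris(:,11) == 0, :); % class-0 training rows
pbir = size(birtmatris,1) / 1600;              % prior P(class = 1)
psifir = 1 - pbir;                             % prior P(class = 0)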

% Estimate the conditional probabilities of each of the 10 binary
% attributes given the class:
%   birgbir(t)     = P(attribute t = 1 | class = 1)
%   sifirgbir(t)   = P(attribute t = 0 | class = 1)
%   birgsifir(t)   = P(attribute t = 1 | class = 0)
%   sifirgsifir(t) = P(attribute t = 0 | class = 0)
birgbir = zeros(1,10);
sifirgbir = zeros(1,10);
for t = 1:10
    h = 0;
    for i = 1:j
        if birtmatris(i,t) == 1
            h = h + 1;
        end
    end
    birgbir(t) = h / j;
    sifirgbir(t) = 1 - birgbir(t);
end
birgsifir = zeros(1,10);
sifirgsifir = zeros(1,10);
for t = 1:10
    h = 0;
    for i = 1:k
        if sifirtmatris(i,t) == 1
            h = h + 1;
        end
    end
    birgsifir(t) = h / k;
    sifirgsifir(t) = 1 - birgsifir(t);
end

% Classify each validation record with the naive Bayes rule: Bir and
% Birprime accumulate prior times the product of conditional
% probabilities for class 1 and class 0, and must be reset per record.
Label = zeros(1,400);
for i = 1:400
    Bir = pbir;
    Birprime = psifir;
    for t = 1:10
        if vmatris(i,t) == 1
            Bir = Bir * birgbir(t);
            Birprime = Birprime * birgsifir(t);
        else
            Bir = Bir * sifirgbir(t);
            Birprime = Birprime * sifirgsifir(t);
        end
    end
    if Bir > Birprime
        Label(i) = 1;
    else
        Label(i) = 0;
    end
end

% Validation confusion matrix:
% eV = [true 0 & predicted 0, true 0 & predicted 1;
%       true 1 & predicted 0, true 1 & predicted 1]
eV = [0 0; 0 0];
for i = 1:400
    if Label(i) == vmatris(i,11) + 1       % predicted 1, true 0
        eV(1,2) = eV(1,2) + 1;
    elseif Label(i) == vmatris(i,11) - 1   % predicted 0, true 1
        eV(2,1) = eV(2,1) + 1;
    elseif Label(i) == 0                   % predicted 0, true 0
        eV(1,1) = eV(1,1) + 1;
    elseif Label(i) == 1                   % predicted 1, true 1
        eV(2,2) = eV(2,2) + 1;
    end
end
ErrorV = (eV(1,2) + eV(2,1)) / 400;

% Classify the training records the same way and build the training
% confusion matrix (Bir and Birprime again reset for every record).
Label = zeros(1,1600);
for i = 1:1600
    Bir = pbir;
    Birprime = psifir;
    for t = 1:10
        if tmatris(i,t) == 1
            Bir = Bir * birgbir(t);
            Birprime = Birprime * birgsifir(t);
        else
            Bir = Bir * sifirgbir(t);
            Birprime = Birprime * sifirgsifir(t);
        end
    end
    if Bir > Birprime
        Label(i) = 1;
    else
        Label(i) = 0;
    end
end
eT = [0 0; 0 0];
for i = 1:1600
    if Label(i) == tmatris(i,11) + 1       % predicted 1, true 0
        eT(1,2) = eT(1,2) + 1;
    elseif Label(i) == tmatris(i,11) - 1   % predicted 0, true 1
        eT(2,1) = eT(2,1) + 1;
    elseif Label(i) == 0                   % predicted 0, true 0
        eT(1,1) = eT(1,1) + 1;
    elseif Label(i) == 1                   % predicted 1, true 1
        eT(2,2) = eT(2,2) + 1;
    end
end
ErrorT = (eT(1,2) + eT(2,1)) / 1600;

Since the training and validation sets are split randomly, the error matrices change at each run. One example output can be seen below:

eT matrix:
1397   28
   1  174
ErrorT = 0.13

eV matrix:
341   17
  1   41
ErrorV = 0.15
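Because each posterior is a product of a prior and ten conditional probabilities, the products can become very small. A common remedy, shown here as a minimal sketch rather than part of the assignment, is to compare log-probabilities instead; it assumes the variables from the code above and that no estimated probability is exactly zero:

% Log-space naive Bayes scoring of the validation set, equivalent to the
% per-record loop above whenever no probability estimate is 0.
X = vmatris(:,1:10);                         % 400 x 10 binary attributes
logp1 = log(pbir)   + X*log(birgbir')   + (1-X)*log(sifirgbir');
logp0 = log(psifir) + X*log(birgsifir') + (1-X)*log(sifirgsifir');
Label = (logp1 > logp0)';                    % 1 x 400 predicted labels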

Q4) Functions used for this question:

1. Function for splitting a node at the minimum-Gini split:
function [purity,splitattr,splitindex,splitamount,newmatrix] = splitnode(datamatrix,startindex,endindex)
% Splits the node holding rows startindex..endindex of datamatrix at the
% attribute/value pair with minimum weighted Gini. Class labels are in
% column 7.
splitattr = 0;
splitindex = 0;
splitamount = 0;
newmatrix = datamatrix;
datasayisi = endindex - startindex + 1;   % number of records in the node

% Count the class-1 (birler) and class-0 (sifirlar) records in the node.
birler = 0;
sifirlar = 0;
for i = startindex:endindex
    if datamatrix(i,7) == 1
        birler = birler + 1;
    else
        sifirlar = sifirlar + 1;
    end
end

% purity = 1 marks a pure leaf; no further split is attempted.
purity = 0;
if birler == datasayisi
    purity = 1;
elseif sifirlar == datasayisi
    purity = 1;
end

if purity == 0
    % Candidate splits: for each of the 6 attributes, sort the node's
    % rows by that attribute and evaluate the Gini of every boundary
    % between two consecutive records with different class labels.
    bestsplit(1:103,1:6) = 1;    % weighted Gini of each candidate split
    splitplace(1:103,1:6) = 0;   % attribute value at each candidate split
    for i = 1:6
        datamatrix(startindex:endindex,:) = sortrows(datamatrix(startindex:endindex,:), i);
        for j = startindex:endindex-1
            if datamatrix(j,7) == datamatrix(j+1,7)
                bestsplit(j,i) = 1;   % same label: never the best split
            else
                bestsplit(j,i) = ginihesapla(datamatrix(:,7), startindex, endindex, j);
                splitplace(j,i) = datamatrix(j,i);
            end
        end
    end

    % Pick the attribute and row index with the overall minimum Gini.
    minginis(1:2,1:6) = 0;
    for i = 1:6
        [minginis(1,i), minginis(2,i)] = min(bestsplit(:,i));
    end
    [~, splitattr] = min(minginis(1,:));
    splitindex = minginis(2,splitattr);
    splitamount = splitplace(splitindex,splitattr);

    % Return the node's rows sorted by the chosen split attribute, so
    % rows startindex..splitindex and splitindex+1..endindex become the
    % two children.
    newmatrix(startindex:endindex,:) = sortrows(datamatrix(startindex:endindex,:), splitattr);
end
end

2. Function calculating the Gini value used by the previous function:


function [gini] = ginihesapla(vektor, startpt, endpt, ayrimindex)
% Weighted Gini of splitting the label vector vektor(startpt:endpt)
% after position ayrimindex. Labels are 0/1, so sums count the ones.
toplambiryukari = sum(vektor(startpt:ayrimindex));   % ones in upper part
toplam1 = ayrimindex - startpt + 1;                  % size of upper part
toplamsifiryukari = toplam1 - toplambiryukari;       % zeros in upper part
toplambirasagi = sum(vektor(ayrimindex+1:endpt));    % ones in lower part
toplam2 = endpt - ayrimindex;                        % size of lower part
toplamsifirasagi = toplam2 - toplambirasagi;         % zeros in lower part
toplam = toplam1 + toplam2;

gini1 = 1 - (toplambiryukari/toplam1)^2 - (toplamsifiryukari/toplam1)^2;
gini2 = 1 - (toplambirasagi/toplam2)^2 - (toplamsifirasagi/toplam2)^2;
gini = (toplam1/toplam)*gini1 + (toplam2/toplam)*gini2;  % size-weighted average
end
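A small illustrative check (the label vector is made up): with eight labels sorted by some attribute, splitting after the third position separates the classes perfectly, while splitting after the fourth leaves a mixed upper part.

labels = [0 0 0 1 1 1 1 1]';         % hypothetical sorted class labels
g3 = ginihesapla(labels, 1, 8, 3)    % perfect split: returns 0
g4 = ginihesapla(labels, 1, 8, 4)    % upper part [0 0 0 1] is mixed: returns 0.1875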

3. Main script constructing the decision tree on the training set:


% The omitted preceding lines read the 130-record data set into matrix,
% create the random permutation v, initialize vmatrix and fill tmatrix
% with rows v(1:104), analogously to Q3. The validation set is rows
% v(105:130):
for i = 105:130
    vmatrix(i-104,:) = matrix(v(i),:);
end
vmatrix(27:130,:) = [];

% treematrix: one row per node with columns
% [node number, startindex, endindex, purity, split attribute,
%  split index, split value].
treematrix(:,7) = 0;
startindx = 1;
endindx = 104;

% Node 1: the root, covering all 104 training records.
[purity1,splitattr1,splitindex1,splitamount1,newtmatrix1] = splitnode(tmatrix,startindx,endindx);
treematrix(1,:) = [1 startindx endindx purity1 splitattr1 splitindex1 splitamount1];

% Nodes 2 and 3: the two children of the root.
[purity2,splitattr2,splitindex2,splitamount2,newtmatrix2] = splitnode(newtmatrix1,startindx,splitindex1);
treematrix(2,:) = [2 startindx splitindex1 purity2 splitattr2 splitindex2 splitamount2];
[purity3,splitattr3,splitindex3,splitamount3,newtmatrix3] = splitnode(newtmatrix1,(splitindex1+1),endindx);
treematrix(3,:) = [3 (splitindex1+1) endindx purity3 splitattr3 splitindex3 splitamount3];

% Nodes 4 and 5: children of node 2, split only if node 2 is impure.
if purity2 == 0
    [purity4,splitattr4,splitindex4,splitamount4,newtmatrix4] = splitnode(newtmatrix2,startindx,splitindex2);
    treematrix(4,:) = [4 startindx splitindex2 purity4 splitattr4 splitindex4 splitamount4];
    [purity5,splitattr5,splitindex5,splitamount5,newtmatrix5] = splitnode(newtmatrix2,(splitindex2+1),splitindex1);
    treematrix(5,:) = [5 (splitindex2+1) splitindex1 purity5 splitattr5 splitindex5 splitamount5];
end

% Nodes 6 and 7: children of node 3, split only if node 3 is impure.
if purity3 == 0
    [purity6,splitattr6,splitindex6,splitamount6,newtmatrix6] = splitnode(newtmatrix3,(splitindex1+1),splitindex3);
    treematrix(6,:) = [6 (splitindex1+1) splitindex3 purity6 splitattr6 splitindex6 splitamount6];
    [purity7,splitattr7,splitindex7,splitamount7,newtmatrix7] = splitnode(newtmatrix3,(splitindex3+1),endindx);
    treematrix(7,:) = [7 (splitindex3+1) endindx purity7 splitattr7 splitindex7 splitamount7];
end

The output is a matrix having one row per node, with columns for node number, startindex, endindex, purity, attribute type, attribute index and attribute value used to split the node, respectively:

node  start  end  purity  attr  index  value
  1     1    104    0      1     50    12.77
  2     1     50    1      0      0     0
  3    51    104    0      5     55    87
  4     1     50    1      0      0     0
  5    51     53    1      0      0     0
  6    51      0    1      0      0     0
  7    56    104    1      0      0     0
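To classify a new record with this tree, it is routed from the root by comparing its value on each node's split attribute against the stored split value. A minimal sketch, under two assumptions the homework does not state explicitly: records with values at or below the split value follow the left child, and the children of node n are nodes 2n and 2n+1 (which matches the seven hand-built nodes above). The helper majorityclass is hypothetical.

% Route one record x (1 x 6 attribute vector) through the 7-node tree.
node = 1;
while treematrix(node,4) == 0 && 2*node <= 7   % node impure and children exist
    attr = treematrix(node,5);                 % attribute the node splits on
    if x(attr) <= treematrix(node,7)           % compare with the split value
        node = 2*node;                         % left child
    else
        node = 2*node + 1;                     % right child
    end
end
% label = majorityclass(treematrix(node,2), treematrix(node,3));  % hypothetical helper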
