Slides for Textbook - Chapter 7: Classification and Prediction (Data Mining: Concepts and Techniques)
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary
Classification:
- predicts categorical class labels
- classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Prediction:
- models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications:
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
Our example
We used clustering based on temporal patterns of visits and spending to come up with the following labels:
- Loyal-bigSpender
- Loyal-moderateSpender
- SemiLoyal-bigSpender
- SemiLoyal-moderateSpender
- Other
Our classification
We will try to predict those labels from what the customers bought:
- Use spending in 36 important categories to predict/classify the label.
- We use the word classify instead of predict, because prediction typically is for continuous attributes (in our book, anyway).
- Classification is prediction of categories or labels.
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, decision trees, or mathematical formulae.
Model usage: classifying future or unknown objects
- Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model.
- The accuracy rate is the percentage of test set samples that are correctly classified by the model.
- The test set is independent of the training set; otherwise over-fitting will occur.
Training data:

RANK            YEARS   TENURED
Assistant Prof  3       no
Assistant Prof  7       yes
Professor       2       yes
Associate Prof  7       yes
Assistant Prof  6       no
Associate Prof  3       no
[Figure: the training data is fed to a classification algorithm, which produces the Classifier (Model).]
Test data:

RANK            YEARS   TENURED
Assistant Prof  2       no
Associate Prof  7       no
Professor       6       yes
Assistant Prof  7       yes
[Figure: the classifier built from the training data is applied to the test data to predict Tenured?]
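To make the accuracy estimate concrete, here is a minimal Python sketch that scores a stand-in classifier on the test data above; the rule inside toy_model is purely illustrative and is not a model derived in these slides:

```python
# Stand-in classifier, used only to make the accuracy computation concrete.
def toy_model(rank, years):
    # Hypothetical rule: tenure for professors or anyone with more than 6 years.
    return "yes" if rank == "Professor" or years > 6 else "no"

# Test set from the table above: (rank, years, known label)
test_set = [
    ("Assistant Prof", 2, "no"),
    ("Associate Prof", 7, "no"),
    ("Professor",      6, "yes"),
    ("Assistant Prof", 7, "yes"),
]

correct = sum(1 for rank, years, label in test_set
              if toy_model(rank, years) == label)
print(f"accuracy = {correct / len(test_set):.0%}")   # 3 of 4 test samples correct -> 75%
```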
Supervised learning (classification):
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
- New data is classified based on the training set.
Unsupervised learning (clustering):
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Data preparation:
- Data cleaning: preprocess data in order to reduce noise and handle missing values.
- Relevance analysis (feature selection): remove irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize data (see the sketch below).
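As a small illustration of the data-transformation step, here is a sketch of min-max normalization of one numeric attribute to [0, 1]; the attribute values are made up for illustration:

```python
# Min-max normalization: rescale a numeric attribute to the range [0, 1].
values = [12.0, 45.0, 7.5, 30.0, 22.5]                 # illustrative attribute values
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print([round(v, 3) for v in normalized])               # 7.5 -> 0.0, 45.0 -> 1.0
```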
Criteria for evaluating classification methods:
- Predictive accuracy
- Speed and scalability: time to construct the model; time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Goodness of rules: decision tree size; compactness of classification rules
Decision tree:
- A flow-chart-like tree structure
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases:
- Tree construction: at start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes
- Tree pruning: identify and remove branches that reflect noise or outliers
Use of a decision tree: classifying an unknown sample by testing the attribute values of the sample against the decision tree (see the sketch below)
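To make this concrete, here is a minimal Python sketch of one possible representation (nested dicts with "attribute" and "branches" keys) and of the classify-by-walking-the-tree step; the tree encoded here is the buys_computer tree used later in these slides:

```python
# A decision tree as nested dicts: an internal node tests one attribute,
# each branch is an attribute value, and each leaf is a class label.
tree = {
    "attribute": "age",
    "branches": {
        "<=30":   {"attribute": "student",
                   "branches": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"attribute": "credit_rating",
                   "branches": {"excellent": "no", "fair": "yes"}},
    },
}

def classify(node, sample):
    """Test the sample's attribute values against the tree until a leaf is reached."""
    while isinstance(node, dict):
        value = sample[node["attribute"]]
        node = node["branches"][value]
    return node  # leaf: the predicted class label

print(classify(tree, {"age": "<=30", "student": "yes"}))             # -> yes
print(classify(tree, {"age": ">40", "credit_rating": "excellent"}))  # -> no
```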
Training Dataset
age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no
We have four attributes used to predict/classify whether the customer bought a computer or not.
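For use in the sketches that follow, here is the training table encoded as Python tuples; the Counter call just confirms the class distribution (9 yes, 5 no) used in the information-gain computation later:

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer) for the 14 training tuples above
training_data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]

# Class distribution of buys_computer
print(Counter(row[-1] for row in training_data))   # 9 "yes" and 5 "no"
```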
[Figure: the resulting decision tree for buys_computer. The root tests age; the <=30 branch tests student (no -> no, yes -> yes); the 31..40 branch is a leaf labeled yes; the >40 branch tests credit_rating (excellent -> no, fair -> yes).]
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner (see the sketch below).
- At start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
- There are no samples left.
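Here is a minimal Python sketch of this greedy skeleton, assuming categorical attributes and rows stored as dicts; select_attribute is a placeholder for the attribute selection measure (information gain or gini index) discussed on the next slides:

```python
from collections import Counter

def majority_class(rows, target):
    """Majority voting over the class label column."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def build_tree(rows, attributes, target, select_attribute):
    """Top-down, recursive, divide-and-conquer construction over categorical attributes."""
    labels = {row[target] for row in rows}
    if len(labels) == 1:                     # all samples at this node share one class
        return labels.pop()
    if not attributes:                       # no remaining attributes: majority voting
        return majority_class(rows, target)
    best = select_attribute(rows, attributes, target)   # heuristic, e.g. information gain
    node = {"attribute": best, "branches": {}}
    for value in sorted({row[best] for row in rows}):   # partition on the chosen attribute
        subset = [row for row in rows if row[best] == value]
        remaining = [a for a in attributes if a != best]
        # Only values that occur at this node are expanded, so subsets are never empty;
        # with a fixed value domain, an empty subset would become a majority-vote leaf.
        node["branches"][value] = build_tree(subset, remaining, target, select_attribute)
    return node
```

With an information-gain selector plugged in as select_attribute, the buys_computer table above yields the tree sketched earlier: age at the root, student on the <=30 branch, a yes leaf for 31..40, and credit_rating on the >40 branch.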
Information gain (ID3/C4.5):
- All attributes are assumed to be categorical.
- Can be modified for continuous-valued attributes.
Gini index (IBM IntelligentMiner):
- All attributes are assumed continuous-valued.
- Assume there exist several possible split values for each attribute.
- May need other tools, such as clustering, to get the possible split values.
- Can be modified for categorical attributes.
Select the attribute with the highest information gain. Assume there are two classes, P and N.
Let the set of examples S contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

    I(p, n) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}
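A direct transcription of this measure in Python; the only extra detail is the usual convention that a term with zero count contributes 0:

```python
import math

def info(p, n):
    """I(p, n): information needed to decide whether an arbitrary example of S is in P or N."""
    total = p + n
    # The convention 0 * log2(0) = 0 is handled by skipping zero counts.
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

print(f"{info(9, 5):.3f}")   # -> 0.940, the value used in the buys_computer example below
```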
Assume that, using attribute A, a set S will be partitioned into sets {S1, S2, ..., Sv}.
If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

    E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} I(p_i, n_i)
The encoding information that would be gained by branching on A is

    Gain(A) = I(p, n) - E(A)
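A minimal sketch of E(A) and Gain(A), where a candidate attribute A is summarized by the (pi, ni) counts of the subsets it induces (the info helper repeats the formula above so the block stands alone):

```python
import math

def info(p, n):
    """I(p, n) from the previous slide."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

def expected_info(partitions, p, n):
    """E(A): weighted information over the subsets {S1, ..., Sv} induced by attribute A."""
    return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)

def gain(partitions, p, n):
    """Gain(A) = I(p, n) - E(A)."""
    return info(p, n) - expected_info(partitions, p, n)

# Partition induced by age in the worked example that follows: (pi, ni) per age value
print(round(gain([(2, 3), (4, 0), (3, 2)], p=9, n=5), 3))
# -> 0.247 (the slide reports 0.246 because it rounds I and E to three decimals first)
```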
Class P: buys_computer = "yes"
Class N: buys_computer = "no"
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31..40   4    0    0
>40      3    2    0.971

E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

Hence Gain(age) = I(p, n) - E(age) = 0.246

Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
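The same figures can be reproduced with a short sketch; the (yes, no) counts per attribute value below are read directly off the training table, and the last digit can differ slightly from the slide values because the slide rounds I and E before subtracting:

```python
import math

def info(p, n):
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

def gain(partitions, p=9, n=5):
    """Gain(A) = I(p, n) - E(A) for the buys_computer training set (9 yes, 5 no)."""
    e = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)
    return info(p, n) - e

# (yes, no) counts per attribute value, read off the training table
partitions = {
    "age":           [(2, 3), (4, 0), (3, 2)],   # <=30, 31..40, >40
    "income":        [(2, 2), (4, 2), (3, 1)],   # high, medium, low
    "student":       [(3, 4), (6, 1)],           # no, yes
    "credit_rating": [(6, 2), (3, 3)],           # fair, excellent
}

for attr, parts in partitions.items():
    print(f"Gain({attr}) = {gain(parts):.3f}")
# Gain(age) = 0.247, Gain(income) = 0.029, Gain(student) = 0.152, Gain(credit_rating) = 0.048
# age has the highest gain, so it is chosen for the first split.
```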
We repeat the analysis for each value of age.
Activity: find the gain for the remaining attributes within the age <= 30 and the age > 40 partitions (see the sketch below). There is no need for such analysis for the age 31..40 partition, since all of its samples already belong to the same class.
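As a starting point for the activity, this sketch restricts the table to the age <= 30 partition (5 tuples: 2 yes, 3 no) and computes the gain of each remaining attribute inside it; the age > 40 partition can be handled the same way:

```python
import math
from collections import Counter, defaultdict

def info_from_counts(counts):
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c)

# (income, student, credit_rating, buys_computer) for the age <= 30 tuples of the table
subset = [
    ("high",   "no",  "fair",      "no"),
    ("high",   "no",  "excellent", "no"),
    ("medium", "no",  "fair",      "no"),
    ("low",    "yes", "fair",      "yes"),
    ("medium", "yes", "excellent", "yes"),
]
attributes = {"income": 0, "student": 1, "credit_rating": 2}

class_info = info_from_counts(list(Counter(row[-1] for row in subset).values()))
for name, idx in attributes.items():
    groups = defaultdict(Counter)
    for row in subset:
        groups[row[idx]][row[-1]] += 1            # class counts per attribute value
    e = sum(sum(c.values()) / len(subset) * info_from_counts(list(c.values()))
            for c in groups.values())
    print(f"Gain({name}) = {class_info - e:.3f}")
# student separates the two classes perfectly here, so it has the highest gain (0.971)
```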
If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in T.

If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

    gini_{split}(T) = \frac{N_1}{N} gini(T_1) + \frac{N_2}{N} gini(T_2)

The attribute that provides the smallest gini_split(T) is chosen to split the node (all possible splitting points need to be enumerated for each attribute).
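A small Python sketch of both definitions, using class counts in place of relative frequencies; the example at the end evaluates the student split of the buys_computer training set (9 yes / 5 no overall):

```python
def gini(class_counts):
    """gini(T) = 1 - sum_j p_j^2, with p_j the relative frequency of class j in T."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(counts1, counts2):
    """Weighted gini index of a binary split of T into subsets T1 and T2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# buys_computer training set: 9 "yes" and 5 "no" overall; the student attribute
# splits it into (6 yes, 1 no) and (3 yes, 4 no).
print(round(gini([9, 5]), 3))                  # impurity before the split -> 0.459
print(round(gini_split([6, 1], [3, 4]), 3))    # impurity after the student split -> 0.367
```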
Extracting classification rules from trees:
- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
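A minimal sketch of this path-to-rule extraction, reusing the nested-dict tree representation from the earlier sketch (one rule per root-to-leaf path, one conjunct per attribute test on the path):

```python
def extract_rules(node, conjuncts=()):
    """Yield one IF-THEN rule per path from the root of the tree to a leaf."""
    if not isinstance(node, dict):                       # leaf: emit the accumulated rule
        condition = " AND ".join(f'{a} = "{v}"' for a, v in conjuncts) or "TRUE"
        yield f'IF {condition} THEN buys_computer = "{node}"'
        return
    for value, child in node["branches"].items():
        yield from extract_rules(child, conjuncts + ((node["attribute"], value),))

tree = {
    "attribute": "age",
    "branches": {
        "<=30":   {"attribute": "student",
                   "branches": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"attribute": "credit_rating",
                   "branches": {"excellent": "no", "fair": "yes"}},
    },
}

for rule in extract_rules(tree):
    print(rule)     # prints the five rules listed above
```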