DM Final
Components of a data mining system architecture:
Database, data warehouse, or other information repository
Database or data warehouse server
Knowledge base
Data mining engine
Pattern evaluation module
User interface
Knowledge base - This knowledge includes concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
Data mining engine - This is essential to the data mining system and ideally consists of a set of functional modules for characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
Pattern evaluation module - This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns.
User interface - This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task and providing information to help focus the search based on the intermediate data mining results. In addition, the component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
Deductive database - A deductive database can be said to be a database system that can perform data or information retrieval, including deductive query answering; it involves an inference mechanism in addition to the stored data.
Data warehouse - Suppose AllElectronics is a successful international company with branches all over the world. Each branch has its own set of databases. The president of AllElectronics has asked you to provide an analysis of the company's sales per item type per branch for the third quarter. This is a difficult task because the relevant data are spread out over several databases located at numerous sites. If AllElectronics had a data warehouse, this task would be easy.
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema. It is constructed by data cleaning, data integration, data transformation, data loading, and periodic data refreshing. A data warehouse is usually modeled by a multidimensional data structure called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum(sales_amount).
On transactional data we can mine frequent itemsets, that is, sets of items that are frequently sold together.
Other kinds of data - Many other kinds of data have versatile forms and structures, e.g. time-related or sequence data (historical records, stock exchange data), data streams (video surveillance, sensor data), spatial data (maps), engineering design data (building designs, integrated circuits), and hypertext and multimedia data (text, image, video, audio).
Q. Which technologies are used in data mining?
Data mining has incorporated many techniques from other domains such as statistics, machine learning, database and data warehouse systems, information retrieval, visualization, high-performance computing, and pattern recognition.
[Figure: data mining adopts techniques from statistics, database systems, data warehouses, information retrieval, visualization, high-performance computing, machine learning, and applications.]
As a highly application-driven discipline, data mining has seen great success in many applications.
(iv) Invisible data mining: mining is often built invisibly into everyday applications such as web search.
Chapter 3: Data Preprocessing
Real-world data are dirty: they tend to be incomplete, noisy, and inconsistent, largely due to their typically huge size and their origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. Data need to be preprocessed in order to help improve the quality of the data and, consequently, the quality of the data mining results.
There are several data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse, and may introduce redundancies. Data reduction can reduce the data size by, for example, aggregating, eliminating redundant features, or clustering. Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measures. These techniques are not mutually exclusive; they may work together.
Data preprocessing: an overview.
There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability. Data are considered of high quality if they satisfy the requirements of the intended use; accuracy, completeness, and consistency are the three basic elements.
Major tasks involved in data preprocessing are data cleaning, data integration, data reduction, and data transformation.
Data cleaning routines work to clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Data reduction strategies include dimensionality reduction and numerosity reduction.
Normalization, data discretization, and concept hierarchy generation are forms of data transformation.
Handling missing values: (1) ignore the tuple, (2) fill in the missing value manually, (3) use a global constant to fill in the missing value.
Smoothing by bin means / bin boundaries - the sorted data are partitioned into equal-frequency bins, e.g.
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
In smoothing by bin means each value in a bin is replaced by the bin mean; in smoothing by bin boundaries each value is replaced by the nearest bin boundary (the middle value goes to whichever boundary is closer).
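A small sketch of equal-frequency binning and smoothing by bin means on the price values above (the helper function names are mine):

```python
# Sorted data values from the example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

def equal_frequency_bins(values, n_bins):
    """Partition sorted values into bins of (roughly) equal size."""
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

bins = equal_frequency_bins(prices, 3)
print(bins)                       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_bin_means(bins))  # [[9.0, ...], [22.0, ...], [29.0, ...]]
```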
Smoothing can also be done by regression.
Regression - a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other.
Example (covariance): suppose two stocks are observed at five time points, with prices A = (6, 5, 4, 3, 2) and B = (20, 10, 14, 5, 5). Then mean(A) = 4, mean(B) = 10.80, and
Cov(A, B) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5)/5 - (4 × 10.80) = 50.2 - 43.2 = 7.
Since the covariance is positive, we can conclude that the prices of the two stocks rise together.
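The same covariance computation as a quick check, in plain Python:

```python
# Stock prices of the two series at five time points (from the example above).
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]
n = len(a)

mean_a = sum(a) / n                      # 4.0
mean_b = sum(b) / n                      # 10.8
# Population covariance: Cov(A, B) = E(A*B) - E(A)*E(B).
cov_ab = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b
print(cov_ab)                            # 7.0 -> positive, prices rise together
```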
Dimensionality reduction reduces the number of random variables or attributes under consideration. These methods include wavelet transforms and principal components analysis (PCA). In a wavelet transform the data vector is transformed and truncated so that only the strongest coefficients are kept; the resultant reduced vector therefore contains mostly zeros.
PCA: (i) the input data are normalized; (ii) PCA computes k orthonormal vectors (principal components) that serve as a basis for the normalized data. The covariance matrix has entries
Cov(Xi, Xj) = (1 / (N - 1)) Σ_{k=1..N} (x_ik - mean(Xi)) (x_jk - mean(Xj)).
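A minimal PCA sketch with NumPy, following the steps above (normalize, form the covariance matrix, keep the leading orthonormal eigenvectors); the toy matrix X is made up for illustration:

```python
import numpy as np

# Hypothetical data: 6 tuples, 3 attributes (illustrative values only).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

# Step 1: normalize the input data (zero mean per attribute).
Xc = X - X.mean(axis=0)

# Step 2: covariance matrix, Cov(Xi, Xj) = sum_k (x_ik - mean_i)(x_jk - mean_j) / (N - 1).
C = np.cov(Xc, rowvar=False)

# Step 3: the orthonormal eigenvectors of C are the principal components;
# keep the k components with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
k = 2
components = eigvecs[:, order[:k]]

# Project the data onto the k principal components (dimensionality reduction).
X_reduced = Xc @ components
print(X_reduced.shape)   # (6, 2)
```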
Equal width - [Figure: an equal-width histogram of product prices, with buckets of width 10; the height of each bar shows how many products fall in that price range.]
Min-max normalization: v' = ((v - min_A) / (max_A - min_A)) × (new_max_A - new_min_A) + new_min_A.
Z-score normalization: v' = (v - mean_A) / std_A. This method is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
Example: suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to
(73,600 - 54,000) / 16,000 = 1.225
i.e. 1.225 standard deviations above the mean.
The z-score measures how many standard deviations a value lies below or above the mean; most values fall between -3 and +3 standard deviations. The technique rescales the feature to have a mean of 0 and a standard deviation of 1: the mean of the feature is subtracted from every value and the result is divided by the standard deviation.
The standard deviation can be replaced by the mean absolute deviation of A, s_A. The mean absolute deviation is more robust to outliers than the standard deviation, because the deviations from the mean are not squared, so the effect of outliers is reduced.
Normalization by decimal scaling moves the decimal point of the values. Example: suppose the recorded values of A range from -986 to 917; the maximum absolute value is 986, so each value is divided by 1,000 (j = 3). Thus -986 is normalized to -0.986 and 917 to 0.917.
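The three normalization methods above written out as code, applied to the income and decimal-scaling figures from the examples (the function names and the min-max bounds are mine):

```python
def min_max(v, mn, mx, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    return (v - mn) / (mx - mn) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Z-score normalization: standard deviations above/below the mean."""
    return (v - mean) / std

def decimal_scaling(v, j):
    """Decimal scaling: divide by 10^j, where j makes max(|v'|) < 1."""
    return v / (10 ** j)

# Income example: mean $54,000, standard deviation $16,000.
print(z_score(73_600, 54_000, 16_000))                    # 1.225
# Min-max with assumed bounds min=$12,000, max=$98,000 (illustrative only).
print(round(min_max(73_600, 12_000, 98_000), 3))
# Decimal scaling example: values range from -986 to 917, so j = 3.
print(decimal_scaling(-986, 3), decimal_scaling(917, 3))  # -0.986 0.917
```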
Sorted data for price (12 data points): 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Equal-frequency (equal-depth) partitioning into 3 bins:
Bin 1: 5, 10, 11, 13
Bin 2: 15, 35, 50, 55
Bin 3: 72, 92, 204, 215
Equal-width partitioning: width W = (max - min) / number of bins; the bin boundaries are min + W, min + 2W, ..., min + m·W for m bins.
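The two partitioning schemes applied to the 12 price values above; a small sketch, with the bookkeeping written by me:

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
n_bins = 3

# Equal-frequency (equal-depth): same number of values per bin.
size = len(prices) // n_bins
equal_freq = [prices[i * size:(i + 1) * size] for i in range(n_bins)]
print(equal_freq)   # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]

# Equal-width: width W = (max - min) / n_bins; boundaries at min + W, min + 2W, ...
lo, hi = min(prices), max(prices)
width = (hi - lo) / n_bins          # (215 - 5) / 3 = 70
equal_width = [[] for _ in range(n_bins)]
for v in prices:
    idx = min(int((v - lo) // width), n_bins - 1)   # clamp the max value into the last bin
    equal_width[idx].append(v)
print(equal_width)  # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
```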
[Worked example, table garbled: the correlation/covariance of two numeric attributes is computed from a table of paired observations, with columns for the two values, their deviations from the respective means, and the products of those deviations; the column totals are substituted into the correlation coefficient formula r = Σ(a_i - mean(A))(b_i - mean(B)) / ((n - 1) σ_A σ_B).]
Q. Perform the chi-square test of correlation for the survey where 256 people shared their birth month; the expected distribution is that births are evenly spread over the months.
Chi-square formula: χ² = Σ (observed - expected)² / expected.
Expected count per month = 256 / 12 = 21.333.
Observed counts: Jan 29, Feb 24, Mar 22, ..., Jun 18, ..., Dec 23.
Each month contributes (observed - 21.333)² / 21.333; the terms are summed to give χ², which is compared with the critical value for 12 - 1 = 11 degrees of freedom.
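A sketch of the chi-square computation. The observed counts below are placeholders (only a few of the survey's monthly counts are legible above), so take the procedure from this, not the final statistic:

```python
# Expected: 256 births spread evenly over 12 months.
total = 256
expected = total / 12            # 21.333...

# Placeholder observed counts (hypothetical values summing to 256; the legible
# survey values include Jan 29, Feb 24, Mar 22, Jun 18, Dec 23).
observed = [29, 24, 22, 19, 21, 18, 20, 23, 18, 20, 19, 23]
assert sum(observed) == total

# Chi-square statistic: sum over categories of (observed - expected)^2 / expected.
chi_square = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi_square, 3))

# Compare against the critical value for 12 - 1 = 11 degrees of freedom
# to decide whether the monthly distribution deviates from uniform.
```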
(Textbook) Metadata are the data that define data warehouse objects. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for timestamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes. Operational metadata include data lineage, the currency of the data, and monitoring information.
Frequent patterns are patterns that appear frequently in a data set.
Item / Count: the items A, B, C, D are all frequent because their support counts meet the minimum support (e.g., D has count 4).
2-itemset / Count: all 2-itemsets are frequent except AD. For example, A has count 3 and its immediate supersets include AC, also with count 3.
Since A's count is not greater than that of its immediate superset AC, A is not closed; and since some of A's immediate supersets are themselves frequent, A is not maximal.
Closed and maximal frequent itemsets: an itemset is closed if no immediate superset has the same support count, and maximal if no immediate superset is frequent.
as4oeation
Abniok algonithm war hegunt itemsels to genorate
uuled. Tt Ye baled 6n the that aubset
4afeguent
the comcest that
tomset mutt aLs0 be a
Jraquent îtomset
Worked example (Apriori): starting from the transaction table (TID, items), the candidate 1-itemsets are counted and the infrequent ones removed; candidate 2-itemsets and then 3-itemsets (e.g. {2,3}, {1,2,5}, {2,3,5}) are generated by joining the surviving sets, and any candidate with an infrequent subset is pruned before counting.
Suppose the minimum confidence value is 60%. For each frequent itemset L, all nonempty subsets are generated; for every subset s, the rule s → (L - s) is kept if its confidence, support(L) / support(s), is at least 60%, and rejected otherwise.
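A compact Apriori sketch of the candidate-generation and pruning steps described above. The transaction list below is the nine-transaction table that appears later in these notes (FP-growth section), used here purely for illustration, and min_sup = 2 is an assumption:

```python
from itertools import combinations

# Transaction table (same as the FP-growth example later in these notes).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2   # assumed minimum support count

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
frequent = {1: {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}}

k = 2
while frequent[k - 1]:
    # Candidate generation: join (k-1)-itemsets, then prune candidates that
    # have an infrequent (k-1)-subset (the Apriori property).
    candidates = {a | b for a in frequent[k - 1] for b in frequent[k - 1] if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[k - 1] for s in combinations(c, k - 1))}
    frequent[k] = {c for c in candidates if support(c) >= min_sup}
    k += 1

for level, sets in frequent.items():
    for s in sorted(sets, key=sorted):
        print(level, sorted(s), support(s))
```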
Example 2 - The objective is to use the transaction data to find affinities between products, i.e., which products sell together. The support level will be set at 33% and the confidence level at 50%.
Transactions:
1: Bread, Butter
2: Milk, Butter, Egg, Ketchup
3: Bread, Butter, Ketchup
4: Milk, Bread, Butter
5: Bread, Butter, Cookies
6: Bread, Butter, Cookies
7: Milk, Cookies, Bread
8: Bread, Butter
9: Bread, Butter
10: ...
11: ...
12: Milk, ... (the remaining rows involve Bread, Butter, Cookies, and Ketchup)
Rules have the form X → Y. There are 12 transactions in total, so the 33% support threshold corresponds to 12 × 0.33 ≈ 4 transactions; an itemset is frequent if its count is 4 or greater.
1-itemset frequencies: Bread, Butter (10), Milk, and Cookies each meet the threshold.
2-itemsets and their frequencies: Milk-Butter, Milk-Cookies, Bread-Butter, Butter-Cookies, Milk-Bread, and Bread-Cookies are counted; the pairs whose counts reach 4, such as Bread-Butter, are the frequent 2-itemsets.
From these, the candidate 3-itemsets are Milk-Bread-Butter and Bread-Butter-Cookies.
3-itemsets: the frequencies of Milk-Bread-Cookies, Bread-Butter-Cookies, Milk-Bread-Butter, and Milk-Butter-Cookies are counted, and those meeting the threshold are the frequent 3-itemsets. Association rules are then generated from the frequent itemsets and kept as valid only if their confidence is at least 50% (Rule 5, Rule 6, ...).
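A sketch of the rule-generation step for a market-basket list like the one above. The transactions below are a stand-in (the full 12-row table is only partly legible), while the 33% support and 50% confidence thresholds come from the example:

```python
from itertools import combinations

# Stand-in transactions (illustrative; not the exact rows of the table above).
transactions = [
    {"Bread", "Butter"}, {"Milk", "Butter", "Egg", "Ketchup"},
    {"Bread", "Butter", "Ketchup"}, {"Milk", "Bread", "Butter"},
    {"Bread", "Butter", "Cookies"}, {"Bread", "Butter", "Cookies"},
    {"Milk", "Cookies", "Bread"}, {"Bread", "Butter"},
]
n = len(transactions)
min_support = 0.33
min_confidence = 0.50

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Generate rules X -> Y from every 2- and 3-itemset that meets the support threshold.
items = {i for t in transactions for i in t}
for size in (2, 3):
    for itemset in map(frozenset, combinations(sorted(items), size)):
        if support(itemset) < min_support:
            continue
        for r in range(1, size):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                rhs = itemset - lhs
                confidence = support(itemset) / support(lhs)
                if confidence >= min_confidence:
                    print(f"{sorted(lhs)} -> {sorted(rhs)}  "
                          f"support={support(itemset):.2f} confidence={confidence:.2f}")
```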
FP-growth example. Transaction database:
TID   Items
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3
Scan the database once to find the frequent 1-itemsets and their counts (I2:7, I1:6, I3:6, I4:2, I5:2) and sort them in descending order of frequency, giving the list L. Then, for each transaction, the items are reordered according to L before being inserted into the FP-tree.
Build the FP-tree one transaction at a time: draw the path for the first transaction, then for each next transaction share the existing prefix nodes (incrementing their counts) or create new nodes, and connect the nodes of the same item with dotted node-links.
Mining the tree produces, for each item: its conditional pattern base, its conditional FP-tree, and the frequent patterns generated. To reach a particular item, say I5, take the set of prefix paths leading to its nodes; the prefix items along these paths, with the counts of the I5 nodes, form I5's conditional pattern base, and the items whose count in that base is at least the minimum support make up the conditional FP-tree.
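The first pass of FP-growth (count supports, keep the frequent items, reorder each transaction by descending support before tree insertion) as a short sketch; min_sup = 2 is assumed:

```python
from collections import Counter

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
min_sup = 2

# Pass 1: support counts of single items, kept only if frequent.
counts = Counter(item for t in transactions for item in t)
freq = {item: c for item, c in counts.items() if c >= min_sup}
order = sorted(freq, key=lambda i: (-freq[i], i))       # L-order: I2, I1, I3, I4, I5
print(order, [freq[i] for i in order])

# Pass 2 (preparation): drop infrequent items and reorder each transaction
# according to L-order; these ordered lists are what get inserted into the FP-tree.
ordered = [sorted((i for i in t if i in freq), key=order.index) for t in transactions]
for t in ordered:
    print(t)
```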
Classification—A Two-Step
Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction
The Training Data are fed to the Classification Algorithms, which produce the Classifier (the learned model).
Process (2): Using the Model in Prediction
The Classifier is applied to the Testing Data and then to Unseen Data, e.g. (Jeff, Professor, 4) → Tenured?
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Chapter 8. Classification: Basic
Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy:
Ensemble Methods
Summary
Decision Tree Induction: An Example
Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).
age     income  student credit_rating buys_computer
<=30    high    no      fair          no
<=30    high    no      excellent     no
31…40   high    no      fair          yes
>40     medium  no      fair          yes
>40     low     yes     fair          yes
>40     low     yes     excellent     no
31…40   low     yes     excellent     yes
<=30    medium  no      fair          no
<=30    low     yes     fair          yes
>40     medium  yes     fair          yes
<=30    medium  yes     excellent     yes
31…40   medium  no      excellent     yes
31…40   high    yes     fair          yes
>40     medium  no      excellent     no
Resulting tree: the root tests age?; the <=30 branch tests student? (no → no, yes → yes), the 31..40 branch predicts yes, and the >40 branch tests credit_rating? (excellent → no, fair → yes).
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-
conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
Example (buys_computer data): class P: buys_computer = 'yes' has 9 tuples and class N: buys_computer = 'no' has 5 tuples, so Info(D) = I(9,5) = 0.940. Splitting on age gives Info_age(D) = 0.694, hence Gain(age) = 0.246; similarly Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Age has the highest information gain and is chosen as the splitting attribute.
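The gain figures above can be reproduced directly from the 14-tuple table; a small sketch that also computes the gain ratio used by C4.5 (discussed below):

```python
from collections import Counter
from math import log2

# (age, income, student, credit_rating, buys_computer) tuples from the table.
data = [
    ("<=30", "high", "no", "fair", "no"),    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),     (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(rows):
    """Expected information (entropy) of the class label in the given rows."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain_and_ratio(attr):
    idx = attrs[attr]
    partitions = {}
    for r in data:
        partitions.setdefault(r[idx], []).append(r)
    info_a = sum(len(p) / len(data) * info(p) for p in partitions.values())
    split_info = -sum(len(p) / len(data) * log2(len(p) / len(data)) for p in partitions.values())
    gain = info(data) - info_a
    return gain, gain / split_info

for a in attrs:
    g, gr = gain_and_ratio(a)
    print(f"{a:14s} gain={g:.3f} gain_ratio={gr:.3f}")
# Gain(age)=0.246 is the largest, so age is chosen as the split attribute;
# GainRatio(income) = 0.029 / 1.557 = 0.019.
```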
Computing Information-Gain for
Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
(ai+ai+1)/2 is the midpoint between the values of ai and ai+1
The point with the minimum expected information
requirement for A is selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
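A sketch of the split-point search for a continuous attribute: sort the values, evaluate the midpoint of every adjacent pair, and keep the one with the lowest expected information. The sample values and labels below are made up:

```python
from collections import Counter
from math import log2

# Hypothetical (value, class) pairs for a continuous attribute A.
samples = [(21, "no"), (25, "no"), (30, "yes"), (35, "yes"),
           (42, "yes"), (48, "yes"), (55, "no")]

def info(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

samples.sort()
best = None
for (a, _), (b, _) in zip(samples, samples[1:]):
    split = (a + b) / 2                              # midpoint of adjacent values
    d1 = [c for v, c in samples if v <= split]       # A <= split-point
    d2 = [c for v, c in samples if v > split]        # A >  split-point
    expected = (len(d1) * info(d1) + len(d2) * info(d2)) / len(samples)
    if best is None or expected < best[1]:
        best = (split, expected)

print(best)   # (split-point with minimum expected information, its Info_A(D))
```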
Gain Ratio for Attribute Selection
(C4.5)
Information gain measure is biased towards attributes with a
large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
GainRatio(A) = Gain(A)/SplitInfo(A)
Ex. SplitInfo_income(D) = 1.557, so GainRatio(income) = 0.029 / 1.557 = 0.019.
The attribute with the maximum gain ratio is selected as the splitting attribute.
Overfitting: an induced tree may overfit the training data, with too many branches, some reflecting anomalies due to noise or outliers, giving poor accuracy for unseen samples.
Scalability Framework for RainForest
Rainforest: Training Set and Its AVC Sets
Presentation of Classification Results
Prediction Based on Bayes’ Theorem
Given training data X, posteriori probability of a hypothesis H, P(H|X), follows the Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
Classification Is to Derive the Maximum
Posteriori
Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
This can be derived from Bayes' theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
P(X|Ci) = Π_{k=1}^{n} P(x_k|Ci) = P(x_1|Ci) × P(x_2|Ci) × ... × P(x_n|Ci)
This greatly reduces the computation cost: only counts the class distribution
If A_k is categorical, P(x_k|Ci) is the # of tuples in Ci having value x_k for A_k divided by |Ci,D| (# of tuples of Ci in D)
If A_k is continuous-valued, P(x_k|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) · exp(-(x - μ)² / (2σ²))
and P(x_k|Ci) = g(x_k, μ_Ci, σ_Ci)
Naïve Bayes Classifier: Training Dataset
Class: C1: buys_computer = 'yes', C2: buys_computer = 'no'
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
age     income  student credit_rating buys_computer
<=30    high    no      fair          no
<=30    high    no      excellent     no
31…40   high    no      fair          yes
>40     medium  no      fair          yes
>40     low     yes     fair          yes
>40     low     yes     excellent     no
31…40   low     yes     excellent     yes
<=30    medium  no      fair          no
<=30    low     yes     fair          yes
>40     medium  yes     fair          yes
<=30    medium  yes     excellent     yes
31…40   medium  no      excellent     yes
31…40   high    yes     fair          yes
>40     medium  no      excellent     no
Naïve Bayes Classifier: An Example
Using the training data above, P(Ci) and the conditional probabilities P(x_k|Ci) are estimated for X = (age <= 30, income = medium, student = yes, credit_rating = fair), and P(X|Ci)P(Ci) is computed for each class; the class with the larger product is predicted.
Avoiding the zero-probability problem: if an attribute value never occurs with some class, use the Laplacian correction (add 1 to each count); the "corrected" probability estimates are close to their "uncorrected" counterparts.
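The example's arithmetic can be checked with a short script over the same 14-tuple table; a sketch (no Laplacian correction applied here):

```python
from collections import Counter

# Same training table as above: (age, income, student, credit_rating) -> class.
data = [
    ("<=30", "high", "no", "fair", "no"),    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),     (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")          # tuple to classify

class_counts = Counter(r[-1] for r in data)
scores = {}
for ci, n_ci in class_counts.items():
    # P(Ci): prior probability of the class.
    score = n_ci / len(data)
    # Multiply by P(x_k | Ci) for each attribute value (conditional independence).
    for k, xk in enumerate(X):
        matches = sum(1 for r in data if r[-1] == ci and r[k] == xk)
        score *= matches / n_ci
    scores[ci] = score

print(scores)                                   # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))              # predicted class: 'yes'
```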
Naïve Bayes Classifier: Comments
Advantages
Easy to implement
Disadvantages
Assumption: class conditional independence, therefore loss
of accuracy
Practically, dependencies exist among variables
Dependencies among these cannot be modeled by a Naïve Bayes classifier
How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
Chapter 8. Classification: Basic
Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy:
Ensemble Methods
Summary
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
Rules are mutually exclusive and exhaustive
Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
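The extracted rules can be encoded and applied directly; a minimal sketch (the function and argument names are mine):

```python
def classify(age, student, credit_rating):
    """Apply the IF-THEN rules extracted from the buys_computer tree."""
    if age == "young" and student == "no":
        return "no"
    if age == "young" and student == "yes":
        return "yes"
    if age == "mid-age":
        return "yes"
    if age == "old" and credit_rating == "excellent":
        return "no"
    if age == "old" and credit_rating == "fair":
        return "yes"
    return None   # the rules above are exhaustive for valid inputs

print(classify("young", "yes", "fair"))    # yes
print(classify("old", "no", "excellent"))  # no
```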
Rule Induction: Sequential Covering
Method
Sequential covering algorithm: Extracts rules directly from training
data
Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
Steps:
Rules are learned one at a time
Each time a rule is learned, the tuples covered by the rules are
removed
Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
Comp. w. decision-tree induction: learning a set of rules
simultaneously
Sequential Covering Algorithm
[Figure: the positive examples are covered step by step; Rule 1, Rule 2, and Rule 3 each cover part of the positive examples.]
Rule Generation
To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break
[Figure: a rule is grown from general to specific, e.g. A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5, so that it covers positive examples and excludes negative ones.]
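The foil-gain test in the pseudocode above can be written out; a sketch using the standard FOIL_Gain formula, where pos/neg are the positive/negative tuples covered by the current rule and pos'/neg' the counts after adding the candidate predicate (the example counts are hypothetical):

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))."""
    if pos == 0 or pos_new == 0:
        return 0.0
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

# Example: the current rule covers 100 positive and 400 negative tuples;
# adding a candidate predicate would leave 90 positive and 50 negative covered.
print(round(foil_gain(100, 400, 90, 50), 3))   # positive gain -> predicate worth adding
```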
How to Learn-One-Rule?
Start with the most general rule possible: condition = empty
Adding new attributes by adopting a greedy depth-first strategy
Picks the one that most improves the rule quality
Precision and Recall, and F-
measures
Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
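The measures written out as code, from confusion-matrix counts (TP, FP, FN); the example counts are hypothetical:

```python
def precision(tp, fp):
    """Fraction of tuples labeled positive that really are positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positive tuples that were labeled positive."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn, beta=1.0):
    """F_beta combines precision and recall; beta=1 gives the harmonic mean (F1)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Illustrative counts from a hypothetical confusion matrix.
tp, fp, fn = 90, 140, 210
print(round(precision(tp, fp), 3), round(recall(tp, fn), 3), round(f_measure(tp, fp, fn), 3))
```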
Classifier Evaluation Metrics: Example
Holdout & Cross-Validation
Methods
Holdout method
Given data is randomly partitioned into two independent sets: a training set (e.g., 2/3 of the data) used for model construction and a test set (the remaining 1/3) used for accuracy estimation
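A sketch of both procedures in plain Python (a random 2/3 - 1/3 holdout split and k-fold partitioning); the dataset is a placeholder list of tuple indices:

```python
import random

data = list(range(30))            # placeholder: indices of 30 tuples
random.seed(42)
random.shuffle(data)

# Holdout: 2/3 for training, the remaining 1/3 held out for accuracy estimation.
cut = (2 * len(data)) // 3
train, test = data[:cut], data[cut:]
print(len(train), len(test))      # 20 10

# k-fold cross-validation: each tuple appears in the test fold exactly once.
k = 5
folds = [data[i::k] for i in range(k)]
for i in range(k):
    test_fold = folds[i]
    train_folds = [x for j, f in enumerate(folds) if j != i for x in f]
    print(f"fold {i}: train={len(train_folds)} test={len(test_fold)}")
```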