DM Final

Data mining is the concept of uncovering interesting data patterns hidden in large datasets.

Topics covered: (i) the kinds of data on which mining can be performed, (ii) the kinds of patterns that can be found and how to specify which patterns to mine, (iii) data mining primitives, (iv) issues regarding how to integrate a data mining system with a database or data warehouse, (v) research issues in building data mining tools.

Motivation
Data mining has attracted a great deal of attention in the information industry due to the wide availability of huge amounts of data and the need for turning such data into useful information and knowledge. This knowledge can be used for market analysis, fraud detection, customer retention, etc.
Evolution of Information Technology
Data collection -> database creation -> data management (data storage and retrieval) -> advanced data analysis (data warehousing and data mining).

With the evolution of database system technology came data warehouses: repositories of multiple heterogeneous data sources organized under a unified schema at a single site. Data warehousing involves data cleaning, data integration, and on-line analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation. OLAP tools support multidimensional analysis and decision making, and aid data analysis tasks such as data classification, clustering, and the characterization of data changes over time.

Effective analysis of data from the World Wide Web, video surveillance, and telecommunications is a difficult task because such data flows in and out in petabytes.
Examples: Teradata; cloud-based tools such as Google Cloud, Azure SQL Database, Amazon Redshift, and Snowflake.
1.2 What is Data Mining?
Data mining refers to extracting or "mining" knowledge from large amounts of data stored in databases, data warehouses, or other repositories.

The knowledge discovery process consists of the following steps:
1) Data cleaning [to remove noise and inconsistent data]
2) Data integration [where multiple data sources may be combined]
3) Data selection [where data relevant to the analysis task are retrieved from the database]
4) Data transformation [where data are transformed into forms appropriate for mining]
5) Data mining [an essential process where intelligent methods are applied to extract data patterns]
6) Pattern evaluation [to identify the truly interesting patterns based on some interestingness measures]
7) Knowledge presentation [where visualization and knowledge representation techniques are used to present the mined knowledge]

Steps 1-4 are different forms of data preprocessing. Interesting patterns are presented to the users and stored as knowledge in a knowledge base.
Architecture of a typical data mining system (top to bottom): user interface; pattern evaluation; data mining engine; database or data warehouse server (performing data cleaning, integration, and selection); database, data warehouse, or other information repository; and a knowledge base consulted by the mining and evaluation modules.

Knowledge base - This knowledge includes concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
Data mining engine - This is essential to the data mining system and consists of a set of functional modules for characterization, association and correlation, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
Pattern evaluation module - This component typically employs interestingness measures and interacts with the data mining modules to focus the search toward interesting patterns.
User interface - This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, based on intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
A database system that can perform data or information retrieval, including deductive query answering, can be said to be a deductive database; a deductive database involves an inference mechanism over stored facts and rules.
Technologies contributing to data mining: i) database technology, ii) data warehouse technology, iii) statistics, iv) machine learning, v) pattern recognition, vi) neural networks, vii) high-performance computing, viii) data visualization, ix) information retrieval, x) spatial/temporal data analysis.

Data mining algorithms should be scalable: the running time should grow only in proportion to the size of the given data, given the available resources such as main memory.
Applications: marketing, fraud detection, and risk management in banks. Domino's uses mining to enhance its customer experience. McDonald's uses big data mining to study ordering patterns and wait times of its customers. Netflix gains insights into customer behavior, e.g., when customers make a move or a series becomes popular. Amazon uses mining to evaluate items and interests. Other uses: customer segmentation, customer profiling, and market segmentation.
1.3 What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target application. The most basic forms of data for mining applications are:
i) database data
ii) data warehouse data
iii) transactional data
Data mining can also be applied to data streams, ordered or sequence data, graph or networked data, spatial data, text data, multimedia data, and the WWW.

(a) Database data - A database system (DBMS) consists of a collection of interrelated data and a set of software programs to manage and access the data. The software programs provide mechanisms for defining database structures and data storage, and for ensuring consistency and security of the stored information.
A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns) and usually stores a large set of tuples (rows). Each tuple in a relational table represents an object described by a set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is often constructed to represent a database (ER diagram).
Example: a relational database for AllElectronics. Relational tables include customer, item, employee, and branch.
Relational data can be accessed by database queries written in a relational query language such as SQL. A given query is transformed into a set of relational operations, and with that, mining on relational databases can be done.

(b) Data warehouse - Suppose that AllElectronics is a successful international company with branches all over the world. Each branch has its own set of data. Now the president of AllElectronics has asked you to provide an analysis of the company's sales per item type per branch for the third quarter. This is a difficult task because the relevant data are spread over several databases located at numerous sites. If AllElectronics had a data warehouse, the task would be easy.

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema. It is constructed by data cleaning, data integration, transformation, loading, and periodic data refreshing. A data warehouse is usually modeled by a multidimensional data structure called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum(sales_amount).

A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
Example: a 3-D data cube for AllElectronics. The dimensions are address, time, and item.

Data warehouse systems provide inherent support for OLAP. OLAP allows the presentation of data at different levels of abstraction; its operations allow users to view the data at different degrees of summarization. (Example from book.)


(c) Transactional data - Each record in a transactional database captures a transaction, such as a customer's purchase or a user's clicks on a web page. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.
Example: nested market basket data. Market basket analysis would enable bundling groups of items together for promotion.
On transactional data we can mine frequent itemsets, that is, sets of items that are frequently sold together.
(d) Other kinds of data - Many other kinds of data have versatile forms and structures. Examples: time-related or sequence data (historical records, stock exchange data), data streams (video surveillance, sensor data), spatial data (maps), engineering design data (building designs, integrated circuits), and hypertext and multimedia data (text, image, video, audio).

1.4 What Kinds of Patterns Can Be Mined?
From the basic forms of data we can mine many varieties of patterns. There are a number of data mining functionalities: characterization and discrimination; the mining of frequent patterns, associations, and correlations; classification and regression; clustering analysis; and outlier analysis. These functionalities are used to specify the kinds of patterns to be found in data mining tasks.

Tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize properties of the data in a target dataset.
Predictive mining tasks perform induction on current data in order to make predictions.
Interesting and useful patterns represent knowledge. (AllElectronics example from book.)

Data can be associated with classes or concepts. The output of data characterization can be presented in various forms, e.g., pie charts, bar charts, multidimensional data cubes, and multidimensional tables including crosstabs.
1.4.2 Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that occur frequently in data. The main kinds of frequent patterns are frequent itemsets, frequent subsequences (sequential patterns), and frequent substructures.
Example: single-dimensional association rule vs. multidimensional association rule.

1.4.3 Classification and Regression for Predictive Analysis
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. The model is derived based on the training data (data objects whose class labels are known). The model is then used to predict the class label of objects for which the class label is unknown.
The derived model may be represented in various forms, such as classification rules (IF-THEN rules), decision trees, mathematical formulae, or neural networks. (Example from textbook.)
1.4.4 Cluster Analysis
Unlike classification and prediction, which analyze class-labeled data, clustering analyzes data objects without consulting known class labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity: objects of the same cluster will be similar to one another.
Outlier analysis - A dataset may contain objects that do not comply with the general behavior or model of the data; such objects are outliers.
1.4.5 Evolution Analysis
Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
1.4.6 Are All Patterns Interesting?
A data mining system has the potential to generate thousands or even millions of patterns or rules. Questions that arise:
(i) What makes a pattern interesting?
(ii) Can a data mining system generate all of the interesting patterns?
(iii) Can a data mining system generate only the interesting patterns?

A pattern is interesting if it is (a) easily understood by humans, (b) valid on new or test data with some degree of certainty, (c) potentially useful, and (d) novel. A pattern is also interesting if it validates a hypothesis. An interesting pattern represents knowledge.

Several objective measures exist for pattern interestingness, based on the structure of discovered patterns. An objective measure for association rules of the form X => Y is rule support, representing the percentage of transactions from a transaction database that satisfy the rule. Support is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y.

Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. Confidence is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y:

support(X => Y) = P(X ∪ Y)
confidence(X => Y) = P(Y | X) = support(X ∪ Y) / support(X)

In general, thresholds are also used as measures: rules that do not satisfy a confidence threshold of, say, 50% can be considered uninteresting.

(ii) Can a mining system generate all of the interesting patterns? This refers to the completeness of a data mining algorithm; user-provided constraints and interestingness measures should be used to focus the search.
(iii) Can the system generate only the interesting patterns? This is an optimization problem in data mining; ideally, only the interesting rules would be mined.
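As a minimal illustration of the two measures, the following sketch counts support and confidence over made-up transaction data (plain Python; the item names are hypothetical):

```python
# Sketch: computing support and confidence for an association rule X => Y.
# The transactions below are made-up example data.
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter", "cookies"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y, transactions):
    """P(Y | X) = support(X u Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

print(support({"bread", "butter"}, transactions))       # 0.6
print(confidence({"bread"}, {"butter"}, transactions))  # 0.75
```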

1.5 What Technologies Are Used for Data Mining?
Data mining has incorporated many techniques from other domains: statistics, machine learning, database systems and data warehouses, information retrieval, data visualization, high-performance computing, pattern recognition, and application-specific techniques.
1.5.1 Statistics
Statistics studies the collection, analysis, interpretation, and presentation of data. A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions.
i) We can use statistics to model noise and missing data values.
ii) Statistics is useful for mining patterns and verifying mining results.
iii) In a classification or prediction task, statistical hypothesis tests can be applied to verify the model.
1.5.2 Machine Learning
Machine learning investigates how computers can learn (or improve their performance) based on data, e.g., learning from a set of example data. ML can be categorized into these categories:

i) Supervised learning is basically a synonym for classification. The supervision in the learning comes from the labeled examples in the training dataset.
Example: a typical ML problem is to recognize handwritten postal codes on mail after learning from a set of examples. In the postal code recognition problem, handwritten postal code images and their corresponding machine-readable translations are used as training examples, which supervise the learning of the classification model.

ii) Unsupervised learning is essentially a synonym for clustering. The learning process is unsupervised since the input examples are not class-labeled. We may use clustering to discover classes within the data.
Example: an unsupervised learning method can take a set of handwritten digit images, but since the data have no class labels, the clusters found may not correspond to the semantic meaning of the digits.

iii) Semi-supervised learning is a class of ML techniques that make use of both labeled and unlabeled data when learning a model. Labeled examples are used to learn class models, and unlabeled examples are used to refine the boundaries between classes. For a two-class problem, suppose the classes are positive examples and negative examples. If we do not consider the unlabeled examples, the dashed line is the decision boundary that best partitions the classes; using the unlabeled examples, we can refine the decision boundary to the solid line. We can also detect that some labeled examples are likely outliers, even though they are labeled.

Many similarities exist between ML and DM. For classification and clustering tasks, ML research often focuses on the accuracy of the model, whereas DM research places strong emphasis on the efficiency and scalability of mining methods.

1.5.3 Database Systems and Data Warehouses
Database systems are well known for their high scalability in processing very large datasets. Recent database systems have built systematic data analysis capabilities on database data using data warehousing and data mining facilities. A data warehouse integrates data originating from multiple sources and promotes multidimensional data mining and OLAP.
1.5.4 Information Retrieval
Information retrieval (IR) is the science of searching for documents, or for information in documents. The documents can be text or multimedia, and may reside on the Web. The differences between IR and database systems are that (i) the data under search are unstructured, and (ii) the queries are formed mainly by keywords, which do not have complex structures.
The typical approaches in IR adopt probabilistic models.
Example: a text document can be regarded as a bag of words, that is, a multiset of words appearing in the document. The document's language model is the probability density function that generates the bag of words in the document. The similarity between two documents can be measured by the similarity between their corresponding language models.
A probability distribution over the vocabulary is called a topic model; a text document may involve one or multiple topics. By integrating IR and DM, we can find the major topics in a collection of documents.
Applications: digital libraries, digital governments, and healthcare information systems. Their effective search and analysis have raised many challenging issues.

1.6 Which Kinds of Applications Are Targeted?
As a highly application-driven discipline, data mining has seen great success in many applications, e.g., business intelligence and web search engines.

1.6.1 Business Intelligence - It is critical for businesses to acquire a better understanding of the commercial context of their organization, such as their customers, the market, supply and resources, and competitors.
Business intelligence (BI) provides historical, current, and predictive views of business operations. Without data mining it would be impossible to do effective market analysis.
i) OLAP and data analysis in BI rely on data warehousing and multidimensional data mining.
ii) Classification and prediction techniques are the core of predictive analytics in BI.
iii) In customer relationship management, with characterization mining techniques we can better understand each customer's status.
1.6.2 Web Search Engines
A web search engine is a specialized computer server that searches for information on the web. The search results are often returned as a list (sometimes called hits) of web pages and other types of files. Various data mining techniques are used in all aspects of web search engines:
- Crawling: e.g., deciding which pages should be crawled and the crawling frequency.
- Indexing: e.g., selecting the pages to be indexed and deciding to which extent the index should be constructed.
- Searching: e.g., deciding how pages should be ranked, which advertisements should be added, and how the search engine can be made "context aware".

First, search engines have to handle a huge amount of data; thousands and thousands of computers are used to mine such huge amounts of data. Second, search engines often deal with online data: a search engine may not be able to afford constructing a model over the entire dataset exclusively online, so it may construct a query classifier in advance and apply it to incoming search queries. Search engines also often deal with context-aware query recommendation, web search, and analysis.
1.7 Major Issues in Data Mining
No doubt data mining is still expanding with great strength. The major issues in data mining research can be partitioned into five groups:
1) Mining methodology: (a) mining various and new kinds of knowledge, (b) mining knowledge in multidimensional space, (c) combining data mining with many interdisciplinary efforts, (d) boosting the power of discovery in a networked environment, (e) handling uncertainty, noise, or incompleteness of data, (f) pattern evaluation and pattern- or constraint-guided mining.
2) User interaction: how a user may interact with a DM system, how to incorporate a user's background knowledge in mining, and how to visualize and comprehend data mining results. This includes (a) interactive mining, (b) incorporation of background knowledge, (c) presentation and visualization of data mining results.
3) Efficiency and scalability: (a) efficiency and scalability of data mining algorithms, (b) parallel, distributed, and incremental mining algorithms.
4) Diversity of database types: (a) handling complex types of data, (b) mining dynamic, networked, and global data repositories.
5) Data mining and society: (a) social impacts of data mining, (b) privacy-preserving data mining, (c) invisible data mining.
Chapter 3: Data Preprocessing
Real-world data are often missing, noisy, and inconsistent due to their typically huge size and their origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results, so data need to be preprocessed in order to help improve the quality of the data and, consequently, of the mining results.

There are several preprocessing techniques:
- Data cleaning can be applied to remove noise and correct inconsistencies in data.
- Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
- Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering.
- Data transformation (e.g., normalization) may be applied, where data are scaled to fall within a smaller range such as 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measures.
These techniques are not mutually exclusive; they may work together.
3.1 Data Preprocessing: An Overview
Many factors comprise data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability. The three core elements of data quality are accuracy, completeness, and consistency.

3.1.2 The major tasks of data preprocessing are data cleaning, data integration, data reduction, and data transformation.
Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data reduction obtains a representation that is much smaller in volume, yet produces the same analytical results. Data reduction strategies include dimensionality reduction and numerosity reduction.
Normalization, data discretization, and concept hierarchy generation are forms of data transformation.

3.2.1 Missing Values
1) Ignore the tuple.
2) Fill in the missing value manually.
3) Use a global constant to fill in the missing value.
4) Use a measure of central tendency (e.g., the mean or median) to fill in the missing value.
5) Use the attribute mean or median for all samples belonging to the same class as the given tuple.
6) Use the most probable value to fill in the missing value.

3.2.2 Noisy Data
Noise is a random error or variance in a measured variable. We have seen basic statistical description techniques (boxplots and scatter plots) and methods of data visualization that can be used to identify outliers, which may represent noise. Given a numeric attribute such as price, how can we "smooth" out the data? Smoothing techniques:

Binning - Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. The values are sorted and partitioned into equal-frequency bins of size n (n = 1, 2, 3, ...).
i) In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
ii) In smoothing by bin medians, each bin value is replaced by the bin median.
iii) In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
In general, the larger the width, the greater the effect of the smoothing. Binning is also used as a discretization technique.
Sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34.
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15    Bin 2: 21, 21, 24    Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9     Bin 2: 22, 22, 22    Bin 3: 29, 29, 29
Smoothing by bin boundaries (the middle value goes to the nearer boundary):
Bin 1: 4, 4, 15    Bin 2: 21, 21, 24    Bin 3: 25, 25, 34
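A small sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, using the price data above (plain Python):

```python
# Sketch: equal-frequency binning with two smoothing variants.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
k = 3                                   # values per bin (equal frequency)
bins = [prices[i:i + k] for i in range(0, len(prices), k)]

# Smoothing by bin means: replace every value with its bin's mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: replace with the closer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```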
Smoothing can also be done by the regression technique.
Regression - a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other. Multiple linear regression is an extension in which more than two attributes are involved and the data are fit to a multidimensional surface.

Outlier analysis - Outliers may be detected by clustering: values that fall outside of the set of clusters may be considered outliers.
Many data smoothing methods are also used for data discretization, a form of data reduction. Binning techniques reduce the number of distinct values per attribute, which acts as a form of data reduction for logic-based data mining methods such as decision tree induction.
3.3 Data Integration
Data mining often requires data integration: the merging of data from multiple data stores. Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set. Issues to consider:

3.3.1 Entity Identification Problem - Schema integration and object matching can be tricky: how can equivalent real-world entities from multiple data sources be matched up? For example, customer_id in one database and cust_number in another may refer to the same attribute. Metadata can be used to help avoid errors in schema integration. ETL (Extraction/Transformation/Loading) tools allow users to specify transformation rules, and data scrubbing and data migration tools can help detect discrepancies and replace erroneous entries during integration.

3.3.2 Redundancy and Correlation Analysis - An attribute may be redundant if it can be "derived" from another attribute or set of attributes. Some redundancies can be detected by correlation analysis: given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. For nominal data, we use the chi-square (χ²) test. For numeric attributes, we use the correlation coefficient and covariance, both of which assess how one attribute's values vary from those of another.
χ² Correlation Test for Nominal Data
Suppose A has c distinct values a1, ..., ac and B has r distinct values b1, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj. Each and every possible (Ai, Bj) joint event has its own cell in the table. The χ² value (Pearson χ² statistic) is computed as

χ² = Σi Σj (oij − eij)² / eij

where oij is the observed frequency (actual count) of the joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as

eij = count(A = ai) × count(B = bj) / n

where n is the number of data tuples, count(A = ai) is the number of tuples having value ai for A, and count(B = bj) is the number of tuples having value bj for B. The summation is computed over all of the r × c cells.

The χ² statistic tests the hypothesis that A and B are independent, that is, that there is no correlation between them. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom.
Testing for independence:
- If the calculated value > table (critical) value, the independence hypothesis is rejected: A and B are statistically correlated.
- If the calculated value <= table value, the hypothesis is accepted: A and B are independent.
Correlation Coefficient for Numeric Data
For numeric attributes, we can evaluate the correlation between two attributes A and B by computing the correlation coefficient (Pearson's product moment coefficient):

r(A,B) = Σi (ai − Ā)(bi − B̄) / (n σA σB) = (Σi ai·bi − n·Ā·B̄) / (n σA σB)

where n is the number of tuples, ai and bi are the respective values of A and B in tuple i, Ā and B̄ are the respective mean values of A and B, σA and σB are the respective standard deviations of A and B, and Σ ai·bi is the sum of the AB cross-product.
- If r(A,B) > 0, A and B are positively correlated, meaning that the values of A increase as the values of B increase; the higher the value, the stronger the correlation.
- If r(A,B) = 0, A and B are independent; there is no correlation.
- If r(A,B) < 0, A and B are negatively correlated: the values of one attribute increase as the values of the other decrease.
Covariance of Numeric Data
In probability theory and statistics, correlation and covariance are two similar measures for assessing how much two attributes change together. Consider two numeric attributes A and B and a set of n observations {(a1, b1), ..., (an, bn)}. The mean values of A and B, Ā and B̄, are also known as the expected values E(A) and E(B). The covariance between A and B is defined as

Cov(A, B) = E((A − Ā)(B − B̄)) = Σi (ai − Ā)(bi − B̄) / n

and r(A,B) = Cov(A, B) / (σA σB), where σA and σB are the standard deviations of A and B.

For two attributes A and B that tend to change together, if A is larger than Ā, then B is likely to be larger than B̄; therefore the covariance of A and B is positive. If one of the attributes tends to be above its expected value when the other attribute is below its expected value, then the covariance of A and B is negative.
If A and B are independent, then E(A·B) = E(A)·E(B). Therefore the covariance is
Cov(A, B) = E(A·B) − Ā·B̄ = E(A)·E(B) − Ā·B̄ = 0.
Example: the table presents a simplified example of stock prices observed at five time points for AllElectronics (A) and HighTech (B); if the stocks are affected by the same industry trends, do their prices rise or fall together?

Time point   AllElectronics   HighTech
t1           6                20
t2           5                10
t3           4                14
t4           3                5
t5           2                5

E(A) = (6 + 5 + 4 + 3 + 2)/5 = 20/5 = 4
E(B) = (20 + 10 + 14 + 5 + 5)/5 = 54/5 = 10.80
Cov(A, B) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5)/5 − 4 × 10.80 = 50.2 − 43.2 = 7

Since the covariance is positive, we can say that the stock prices for both companies rise together.
Variance is a special case of covariance, where the two attributes are identical (the covariance of an attribute with itself).
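The same computation with NumPy (a minimal sketch; note that np.cov uses the sample divisor n − 1 by default, so bias=True is passed to match the population formula above):

```python
import numpy as np

# Stock prices at five time points (from the example above).
A = np.array([6, 5, 4, 3, 2])     # AllElectronics
B = np.array([20, 10, 14, 5, 5])  # HighTech

cov = np.cov(A, B, bias=True)[0, 1]  # population covariance, divisor n
r = np.corrcoef(A, B)[0, 1]          # Pearson correlation coefficient

print(cov)  # 7.0
print(r)    # ~0.87 -> positively correlated
```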

3.3.3 Tuple Duplication - In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level. The use of denormalized tables is another source of data redundancy. Inconsistencies often arise between various duplicates due to inaccurate data entry, or updating some but not all data occurrences.

3.3.4 Data Value Conflict Detection and Resolution - Data integration also involves the detection and resolution of data value conflicts.
3.4 Data Reduction
Complex data analysis and mining on huge amounts of data can take a very long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation of the dataset that is much smaller in volume, yet closely maintains the integrity of the original data. Mining on the reduced dataset should be more efficient yet produce the same analytical results.

3.4.1 Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression.

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. These methods include wavelet transforms and principal components analysis, which transform the original data onto a smaller space. Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored instead of the actual data. Regression and log-linear models are examples of parametric methods. Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.

In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. There are several lossless algorithms for string compression.
3.4.2 Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector X' of wavelet coefficients. The two vectors are of the same length. When applied to data reduction, we consider each tuple as an n-dimensional data vector, that is, n measurements made on the tuple from n database attributes.

Example (Haar transform): repeatedly take pairwise averages and detail (difference) coefficients.

Resolution   Averages                   Detail coefficients
8            [2, 2, 0, 2, 3, 5, 4, 4]
4            [2, 1, 4, 4]               [0, -1, -1, 0]
2            [1.5, 4]                   [0.5, 0]
1            [2.75]                     [-1.25]

Resulting reduced vector: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0].
The input data are assumed to be normalized; the transform computes projections onto orthonormal vectors.
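A minimal sketch of this one-dimensional Haar decomposition (plain Python; assumes the input length is a power of two):

```python
# Sketch: 1-D Haar wavelet transform by repeated averaging/differencing.
def haar(data):
    data = list(data)
    out = []
    while len(data) > 1:
        avgs = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        dets = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        out = dets + out      # finer-level details go toward the end
        data = avgs
    return data + out         # overall average followed by details

print(haar([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```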

3.4.3 Principal Components Analysis
Principal components analysis (PCA, also called the Karhunen-Loeve or K-L method) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k <= n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.

For two features X1, X2 measured over several examples, the covariance matrix is

C = [ Cov(X1, X1)  Cov(X1, X2) ]
    [ Cov(X2, X1)  Cov(X2, X2) ]

Steps to solve:
(1) Find the mean of each feature.
(2) Find the covariance matrix.
(3) Find the eigenvalues of the covariance matrix.
(4) Find the eigenvectors.
(5) Compute the first principal component of all the examples: project (data point − mean) onto the principal eigenvector.
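These steps in NumPy (a minimal sketch on made-up 2-feature data; np.linalg.eigh is used because a covariance matrix is symmetric):

```python
import numpy as np

# Made-up data: 4 examples (rows) with 2 features each.
X = np.array([[4.0, 11.0], [8.0, 4.0], [13.0, 5.0], [7.0, 14.0]])

mean = X.mean(axis=0)            # step 1: feature means
C = np.cov((X - mean).T)         # step 2: covariance matrix
vals, vecs = np.linalg.eigh(C)   # steps 3-4: eigenvalues/eigenvectors

pc1 = vecs[:, np.argmax(vals)]   # eigenvector of the largest eigenvalue
scores = (X - mean) @ pc1        # step 5: first principal component
print(scores)                    # 1-D representation of each example
```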
3.4.4 Attribute Subset Selection
Attribute subset selection reduces the data size by removing irrelevant or redundant attributes. Basic heuristic methods include:
1) Stepwise forward selection
2) Stepwise backward elimination
3) Combination of forward selection and backward elimination
4) Decision tree induction

3.4.5 Regression and Log-Linear Models: Parametric Data Reduction
Regression and log-linear models can be used to approximate the given data. In simple linear regression, the data are modeled to fit a straight line.
Example: a random variable y (called a response variable) can be modeled as a linear function of another random variable x (called a predictor variable) with the equation

y = wx + b

where the variance of y is assumed to be constant. x and y are numeric database attributes. The coefficients w and b (called regression coefficients) specify the slope of the line and the y-intercept, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual line separating the data and the estimate of the line.
Multiple linear regression is an extension of simple linear regression, which allows a response variable y to be modeled as a linear function of two or more predictor variables:

y = b0 + b1·x1 + b2·x2 + ...

(With parametric methods, nonlinear functions can often be transformed into a linear formula.)
Log-linear models approximate discrete multidimensional probability distributions; they can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes. Regression and log-linear models can both be used on sparse data, and both can handle skewed data.
The idea: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data.
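A minimal least-squares sketch for the simple linear model y = wx + b, using the closed-form solution w = Cov(x, y)/Var(x), b = ȳ − w·x̄ (made-up data):

```python
import numpy as np

# Made-up (x, y) observations roughly following a line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Closed-form least squares: w = Cov(x, y)/Var(x), b = mean(y) - w*mean(x).
w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - w * x.mean()

print(w, b)   # slope ~1.95, intercept ~0.15
# Data reduction: store only (w, b) and approximate y as w*x + b.
```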

Nonparametric methods do not assume models. Major families: histograms, clustering, and sampling.
Hitograrts twe bining to abbroxmate data dtibutont and ane a
A RHOgvam lose an
atttbute A, arhtions the data dt lbutio
eA Pnto dlafoirt Aubsetd, releNhed to a
to hed the eental tenden ey Ko each ok the
D oach bucket vobeSerts otly bukets. --value/yoquney
Aingle attrlbute
pafH ,the buckets are cald pengleton
nebresents erntiaubus anges fo'eAvenbuckoks. gomatmel it
atilbute.
Ar-83. Subpose A elecéroies have prte ltst of tomunony sold
toms.
I,,S,5,5 5,5 ,8 80,10, l0 ,10, 12. 14,14, 14,1S, IS.
1S,15,1S, 15,1&,18,18 ,8, )% j8,8, 1%. 20
,20, 20, 20
Q0, 20 , 9020,21,
21, 21,21, 21,2S, 2s,
30,30, 30,
25,25.25, 28 , , , 28

Eaual Widtk-fn an enual- oideh litogvam the width s eaek


bucket vonge is nllorm.
Equat reguency- In equal jnequeney ARltogvam, the buckets ae
eas bucketu eonstant( each buckt.
Cheated 4o that the requniy of eack
the qame nSmber ay eomtgu data
Htatoghand ae
yative at attroinating both bande and
ghy heed and aon dita.
uttidmendional hiltgrams Can catne dependanelet betw oen-attrih
buelets ae
Singeton
3.4.7 Clustering
Clustering partitions the data set into clusters based on similarity, and stores cluster representations (e.g., centroid and diameter) instead of the actual data. We can also have hierarchical clustering, where clusters themselves may be represented by sub-clusters. The diameter of a cluster is the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, defined as the average distance of each cluster object from the cluster centroid.
3.4.8 Sampling
Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample.
[Figure: equal-width histogram of the price list, with bucket ranges 0-5, 5-10, 10-15, 15-20, 20-25, and 25-30 on the x-axis and the number of products in each price range on the y-axis.]
Suppose that a large data set D contains N tuples. Ways to sample D for data reduction (with sample size s):
1) Simple random sample without replacement (SRSWOR) of size s: drawn from D such that the probability of drawing any tuple in D is 1/N; all tuples are equally likely to be sampled.
2) Simple random sample with replacement (SRSWR): each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
3) Cluster sample: the tuples in D are grouped into M mutually disjoint "clusters"; then an SRS of s clusters can be obtained, where s < M. For example, each page of tuples can be considered a cluster.
4) Stratified sample: if D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum.
An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample s, i.e., sublinear to the size of the data.
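The two simple random sampling variants in Python (a minimal sketch using the standard random module; the dataset is made up):

```python
import random

D = list(range(100))   # pretend dataset of N = 100 tuples
s = 10                 # desired sample size

# SRSWOR: without replacement, every tuple can appear at most once.
srswor = random.sample(D, s)

# SRSWR: with replacement, the same tuple may be drawn again.
srswr = random.choices(D, k=s)

print(srswor)
print(srswr)
```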
3.4.9 Data Cube Aggregation
Suppose the data consist of AllElectronics sales per quarter; of interest, however, is the annual sales. The data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter. The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.

Data cubes store multidimensional aggregated information. Each cell holds an aggregate data value corresponding to the data point in multidimensional space. Concept hierarchies may exist for each attribute in the cube, allowing the analysis of data at multiple abstraction levels. For example, a hierarchy for branch could allow branches to be grouped into regions based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining.

The cube created at the lowest abstraction level is referred to as the base cuboid (e.g., sales per customer). A cube at the highest level of abstraction is the apex cuboid (e.g., total sales over all years). Data cubes created for varying levels of abstraction are often referred to as cuboids; each higher abstraction level further reduces the resulting data size. When replying to a data mining request, the smallest available cuboid relevant to the given task should be used.
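A tiny sketch of the quarter-to-year aggregation with pandas (the sales figures are made up):

```python
import pandas as pd

# Made-up quarterly sales for two years.
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 312, 419, 380, 623],
})

# Aggregate away the quarter dimension: total sales per year.
annual = sales.groupby("year")["amount"].sum()
print(annual)   # 2022 -> 1568, 2023 -> 1734
```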
3.5 Data Transformation and Data Discretization
The data are transformed or consolidated so that the resulting mining process may be more efficient and the patterns found may be easier to understand.

3.5.1 Data Transformation Strategies
Data transformation strategies include:
i) Discretization: raw values of numeric attributes (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20) or conceptual labels (e.g., youth, adult, senior). The labels can be recursively organized into concept hierarchies.
ii) Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts like city or country.

Data discretization techniques can be categorized based on (a) whether class information is used (supervised vs. unsupervised discretization) and (b) which direction the process proceeds (top-down vs. bottom-up).
If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting. Bottom-up discretization or merging starts by considering all of the continuous values as potential split points, removes some by merging neighborhood values to form intervals, and then recursively applies this process to the resulting intervals.
3.5.2 Data Transformation by Normalization
The measurement unit used can affect the data analysis. Changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. Expressing an attribute in smaller units leads to a larger range for that attribute. To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range such as [-1, 1] or [0.0, 1.0].

Normalizing the data attempts to give all attributes an equal weight. Normalization is useful for classification and clustering. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).
There are many methods for data normalization, e.g., min-max normalization, z-score normalization, and normalization by decimal scaling. Let A be a numeric attribute with n observed values v1, v2, ..., vn.
Min-Max Normalization - performs a linear transformation on the original data. Suppose min_A and max_A are the minimum and maximum values of attribute A. Min-max normalization maps a value vi of A to vi' in the range [new_min_A, new_max_A] by computing

vi' = (vi − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.
Example: suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000. We would like to map income to the range [0.0, 1.0]. A value of $73,600 is transformed by min-max normalization to

(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0 = 0.716

Z-Score Normalization - (zero-mean normalization) The values for an attribute A are normalized based on the mean (average) and standard deviation of A. A value vi of A is normalized to vi' by computing

vi' = (vi − Ā) / σA

where Ā and σA are the mean and standard deviation, respectively, of attribute A. This method is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
Example: suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000. With z-score normalization, a value of $73,600 for income is transformed to

(73,600 − 54,000) / 16,000 = 1.225

That is, it measures how many standard deviations the value lies above or below the mean; typical values range from −3 to +3 standard deviations. The technique rescales the feature to have a mean of 0 and a standard deviation of 1: the mean is subtracted from every feature value and the result is divided by the standard deviation.
A variation of z-score normalization replaces the standard deviation by the mean absolute deviation of A:

sA = (1/n) (|v1 − Ā| + |v2 − Ā| + ... + |vn − Ā|)

The mean absolute deviation sA is more robust to outliers than the standard deviation, because the deviations from the mean are not squared.

Normalization by Decimal Scaling - normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value vi of A is normalized to

vi' = vi / 10^j

where j is the smallest integer such that max(|vi'|) < 1.
Example: suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we divide each value by 1,000 (i.e., j = 3), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.

Note that normalization can change the original data quite a bit, especially for z-score normalization and decimal scaling. It is also necessary to save the normalization parameters (such as the mean and standard deviation) so that future data can be normalized in a uniform manner.
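The three normalization methods on one small made-up attribute (a minimal NumPy sketch):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # observed values of A

# Min-max normalization to [0.0, 1.0].
minmax = (v - v.min()) / (v.max() - v.min())
print(minmax)      # [0.    0.125 0.25  0.5   1.   ]

# Z-score normalization (population standard deviation).
zscore = (v - v.mean()) / v.std()
print(zscore)      # approximately [-1.06 -0.71 -0.35  0.35  1.77]

# Decimal scaling on the range example above: values in [-986, 917];
# j = 3 is the smallest integer with max(|v / 10**j|) < 1.
a = np.array([-986.0, 917.0])
print(a / 10**3)   # [-0.986  0.917]
```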
3.5.3 Discretization by Binning
Attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median. These techniques can be applied recursively to the resulting partitions to generate concept hierarchies. Binning does not use class information and is therefore an unsupervised discretization technique.
3.5.4 Discretization by Histogram Analysis
Like binning, histogram analysis is an unsupervised discretization technique because it does not use class information. A histogram partitions the values of an attribute A into disjoint ranges called buckets or bins. Various partitioning rules can be used to define histograms. The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy, with the procedure terminating once a prespecified number of concept levels has been reached.

3.5.6 Concept Hierarchy Generation for Nominal Data
Nominal attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. Examples include geographic_location and job_category. Methods:
1) Specification of a partial ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < province_or_state < country.
2) Specification of a portion of a hierarchy by explicit data grouping.
3) Specification of a set of attributes, but not of their partial ordering.
4) Concept hierarchy generation based on the number of distinct values per attribute.
Assignment 2
Q1. Given 12 sales price records: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215. Partition them into three bins by each of the following methods.

(a) Equal-frequency (equal-depth) partitioning - 12 data points, 3 bins of 4 values each:
Bin 1: 5, 10, 11, 13
Bin 2: 15, 35, 50, 55
Bin 3: 72, 92, 204, 215

(b) Equal-width partitioning - width = (max − min) / no. of bins = (215 − 5)/3 = 70, so the bins are [5, 75), [75, 145), and [145, 215]:
Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2: 92
Bin 3: 204, 215

(c) Clustering (k-means style) - consider 3 centroids and take the means:
K1 = {5, 10, 11, 13, 15}, mean = 10.8
K2 = {35, 50, 55, 72, 92}, mean = 60.8
K3 = {204, 215}
These are the final clusters; smoothing by the cluster mean, median, or boundaries can then be done as for bins.

Q2. Use the methods below to normalize the following group of data: 200, 300, 400, 600, 1000.
(a) Min-max normalization, setting min = 0 and max = 1:
0, 0.125, 0.25, 0.5, 1
(b) Z-score normalization, with mean = 500 and standard deviation ≈ 283:
-1.06, -0.71, -0.35, 0.35, 1.77
(c) Z-score normalization using the mean absolute deviation, sA = (300 + 200 + 100 + 100 + 500)/5 = 240:
-1.25, -0.833, -0.417, 0.417, 2.083
(d) Normalization by decimal scaling (divide by 1,000):
0.2, 0.3, 0.4, 0.6, 1

Q3. Compute the Pearson correlation coefficient of the given data (two numeric attributes over several tuples). Tabulating (ai − Ā), (bi − B̄), and their cross-products per the formula r = Σ(ai − Ā)(bi − B̄) / (n σA σB) and summing gives

r = 5.091 / 6.022 ≈ 0.845

Since r is positive and close to 1, the two attributes are strongly positively correlated.
Q4. Perform the chi-square test for correlation on a survey where 256 people shared their birth month, against the expected distribution of months being uniformly distributed.

Chi-square formula: χ² = Σ (Observed − Expected)² / Expected
Expected count per month = 256 / 12 = 21.333 (uniformly distributed).

Month   Observed (O)   Expected (E)   O − E     (O − E)²
Jan     29             21.333          7.667    58.783
Feb     24             21.333          2.667    7.113
Mar     22             21.333          0.667    0.445
Apr     19             21.333         -2.333    5.443
May     21             21.333         -0.333    0.111
Jun     18             21.333         -3.333    11.109
Jul     19             21.333         -2.333    5.443
Aug     20             21.333         -1.333    1.777
Sep     23             21.333          1.667    2.779
Oct     18             21.333         -3.333    11.109
Nov     20             21.333         -1.333    1.777
Dec     23             21.333          1.667    2.779

χ² = Σ (O − E)² / E = 108.67 / 21.333 ≈ 5.09
With 12 − 1 = 11 degrees of freedom, the table (critical) value at the 0.05 significance level is 19.68. Since 5.09 < 19.68, we accept the hypothesis: the observed birth months are consistent with a uniform distribution.
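Checking this with SciPy (a minimal sketch; scipy.stats.chisquare performs exactly this one-way goodness-of-fit test):

```python
from scipy import stats

# Observed birth months (Jan..Dec), 256 respondents in total.
observed = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]
expected = [256 / 12] * 12          # uniform distribution

chi2, p = stats.chisquare(observed, f_exp=expected)
critical = stats.chi2.ppf(0.95, df=11)

print(round(chi2, 2))               # ~5.09
print(round(critical, 2))           # ~19.68
print(chi2 < critical)              # True -> consistent with uniform
```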
Chapter 4: Data Warehousing and Online Analytical Processing
Data warehouses generalize and consolidate data in multidimensional space. The construction of data warehouses involves data cleaning, data integration, and data transformation, and can be viewed as an important preprocessing step for data mining.

Data warehouses provide online analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining. Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations. Therefore, data warehousing and OLAP form an essential step in the knowledge discovery process.

Chapter outline:
Sec. 4.1 - Accepted definitions of data warehouse and data cube.
Sec. 4.2 - OLAP operations: roll-up, drill-down, slice, dice.
Sec. 4.3 - Data warehouse design and usage; multidimensional data mining.
Sec. 4.4 - Data cube computation, OLAP data indexing, OLAP query processing.
Sec. 4.5 - Data generalization by attribute-oriented induction. This method uses concept hierarchies to generalize data to multiple levels of abstraction.
4.1 Data Warehouse: Basic Concepts
4.1.1 What is a Data Warehouse?
A data warehouse refers to a data repository that is maintained separately from an organization's operational databases.
According to William H. Inmon: "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process."
Subject-oriented: organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on day-to-day operations and transaction processing, a data warehouse focuses on the modeling and analysis of data for decision makers.
The four keywords: subject-oriented, integrated, time-variant, nonvolatile.
4.1.2 Differences Between Operational Database Systems and Data Warehouses
OLTP vs. OLAP (see textbook table).

4.1.3 Why Have a Separate Data Warehouse?
Operational databases store raw data, such as transactions, whereas decision support requires consolidated (aggregated, summarized) data from heterogeneous sources, as well as cleaned data. A separate warehouse also preserves the high performance of both systems.
4.1.4 Data Warehousing: A Multitiered Architecture
Data warehouses often adopt a three-tier architecture:
- Top tier: front-end tools for querying, reporting, analysis, and data mining.
- Middle tier: an OLAP server.
- Bottom tier: the data warehouse server, with data marts, a metadata repository, administration, and monitoring; data are fed in by extract, clean, transform, load, and refresh functions from operational databases and external sources.
4.1.5 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse. (Textbook.)

Metadata repository - Metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for timestamping any extracted data. Operational metadata are related to the operational environment of the data warehouse; business metadata describe the data from a business perspective.

4.2 Data Warehouse Modeling: Data Cube and OLAP
Data warehouses and OLAP tools are based on a multidimensional data model. The model views data in the form of a data cube. Common n-dimensional schemas are the star schema, the snowflake schema, and the fact constellation. A data cube allows data to be modeled and viewed in multiple dimensions; each materialized view of the data at a level of summarization is called a cuboid.
Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
6.1 Basic concepts.
6.2 Frequent itemset mining methods (6.2.1, 6.2.2, 6.2.3, 6.2.4).
6.3 Which patterns are interesting: pattern evaluation methods (6.3.1, 6.3.2, 6.3.3).
6.1 Frequent patterns are patterns that appear frequently in a dataset.

6.1.1 Market Basket Analysis
Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional datasets. This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. These associations can help retailers develop marketing strategies by gaining insights into which items are frequently purchased together.

Itemset - a set of items.
Frequent itemset - an itemset whose support is greater than some specified minimum support.
Closed frequent itemset - an itemset is closed if none of its proper supersets has the same support as the itemset (and the itemset is frequent).
Maximal frequent itemset - an itemset is maximal frequent if none of its immediate supersets is frequent.

Downward closure property of frequent patterns: every subset of a frequent itemset must also be frequent. E.g., if {Milk, Bread, Butter} is a frequent itemset, then the itemsets {Milk}, {Bread}, {Butter}, {Milk, Bread}, {Milk, Butter}, and {Bread, Butter} are frequent. If the length of a frequent itemset is k, then all of its 2^k − 1 nonempty subsets are also frequent.
Example: transactions T1: {A, C, D}, T2: {A, B, C, D}, T3: {A, B, C}, T4, T5, with minimum support count 3; find the frequent, closed, and maximal itemsets.
All items A, B, C, D are frequent because their support counts reach the threshold. All 2-itemsets are frequent except AD, whose count is 2, e.g., AB(3), AC(3), AD(2).
A is not closed when its count is not greater than that of an immediate superset (e.g., if support(A) = support(AC), then A is not closed).
A is not maximal because its immediate supersets (e.g., AB, AC) are themselves frequent. The closed and maximal frequent itemsets are read off accordingly.
6.2.1 The Apriori Algorithm
The Apriori algorithm uses frequent itemsets to generate association rules. It is based on the concept that a subset of a frequent itemset must also be a frequent itemset.

Example: from a small transaction database over the items {1, 2, 3, 5}, candidate itemsets are generated and counted level by level (C1 -> L1 -> C2 -> L2 -> C3 -> L3), pruning every candidate that has an infrequent subset; the surviving frequent 3-itemset here is {1, 3, 5}.

Rule generation: suppose the minimum confidence value is 60%. Each frequent itemset I is used to generate all of its nonempty subsets, and for each subset s we output the rule s => (I − s) if its confidence passes the threshold:
Rule 1: 1 ∧ 3 => 5, confidence = support(1,3,5) / support(1,3) -> selected if >= 60%
Rule 2: 1 ∧ 5 => 3, confidence = support(1,3,5) / support(1,5)
Rule 3: 3 ∧ 5 => 1, confidence = support(1,3,5) / support(3,5)
Rule 4: 1 => 3 ∧ 5, confidence = support(1,3,5) / support(1)
Rule 5: 3 => 1 ∧ 5, confidence = support(1,3,5) / support(3) -> e.g., 40%, rejected
Rule 6: 5 => 1 ∧ 3, confidence = support(1,3,5) / support(5)
Example 2: The objective is to use the transaction data to find affinities between products, i.e., which products sell together. The support level is set at 33% and the confidence level at 50%. There are 12 transactions over the items Milk, Bread, Butter, Cookies, Egg, and Ketchup, so an itemset is frequent if it appears in at least 4 transactions (33% of 12).

1-itemsets: count each item and keep those whose frequency crosses the threshold; Milk, Bread, Butter, and Cookies are frequent, while Egg and Ketchup are not.
2-itemsets: count pairs of the surviving items; the frequent 2-itemsets include {Milk, Butter}, {Milk, Bread}, and {Bread, Butter}.
3-itemsets: the only frequent 3-itemset is {Milk, Bread, Butter}; combinations involving Cookies fall below the threshold.
Since only three frequent items remain, no 4-itemset can be built.

Rule generation: the nonempty proper subsets of {Milk, Bread, Butter} are {Milk}, {Bread}, {Butter}, {Milk, Bread}, {Milk, Butter}, and {Bread, Butter}. For each subset s, the association rule s => ({Milk, Bread, Butter} − s) is valid if support(Milk, Bread, Butter) / support(s) >= the minimum confidence.
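A compact Apriori sketch in plain Python (the transactions are made up; itertools.combinations generates candidates, and the downward-closure property is enforced by extending only surviving frequent itemsets):

```python
from itertools import combinations

# Made-up market basket data.
transactions = [
    {"milk", "bread", "butter"}, {"bread", "butter", "ketchup"},
    {"milk", "bread", "butter"}, {"bread", "butter", "cookies"},
    {"milk", "cookies"},         {"milk", "bread", "butter", "cookies"},
]
min_support = 3   # minimum support count

def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    frequent, k, current = {}, 1, [frozenset([i]) for i in items]
    while current:
        # Count candidates and keep those meeting the support threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Build (k+1)-candidates whose k-subsets all survived (Apriori pruning).
        units = sorted({i for c in survivors for i in c})
        k += 1
        current = [frozenset(c) for c in combinations(units, k)
                   if all(frozenset(s) in survivors
                          for s in combinations(c, k - 1))]
    return frequent

for itemset, count in apriori(transactions, min_support).items():
    print(sorted(itemset), count)
# e.g. ['bread', 'butter', 'milk'] appears 3 times
```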

6.2.3 Improving the Efficiency of Apriori
Hash-based technique - A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1, by hashing itemsets into buckets and discarding candidates whose bucket counts fall below the support threshold.
Transaction reduction - A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration.
Partitioning - The partitioning technique requires just two database scans to mine the frequent itemsets. In Phase I, the algorithm divides the transactions of D into n nonoverlapping partitions and finds the frequent itemsets local to each partition; it then combines all local frequent itemsets to form the candidate itemsets. In Phase II, it finds the global frequent itemsets among the candidates.
Sampling - The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D. Since we are searching in S, we may miss some of the global frequent itemsets; to reduce this possibility, we use a lower support threshold than the minimum support to find the frequent itemsets local to S (denoted L^S).
Dynamic itemset counting - A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points. This leads to fewer database scans than with Apriori.
6.2.4 A Pattern-Growth Approach for Mining Frequent Itemsets
The Apriori candidate generate-and-test method significantly reduces the size of the candidate sets, but it can suffer from two nontrivial costs: it may need to generate a huge number of candidate sets, and it may need to repeatedly scan the whole database and check a large set of candidates by pattern matching.
Can we design a method that mines the complete set of frequent itemsets without such a costly candidate generation process? FP-growth is such a method, which adopts a divide-and-conquer strategy.
First, it compresses the database representing frequent items into a frequent pattern tree (FP-tree), which retains the itemset association information.
It then divides the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item or "pattern fragment", and mines each database separately. For each pattern fragment, only its associated datasets need to be examined.

Example: transaction database D

TID    List of items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

First scan: count the item frequencies. The support counts are I2: 7, I1: 6, I3: 6, I4: 2, I5: 2. Now, for each transaction, reorder the items according to descending frequency (I2, I1, I3, I4, I5) and build the FP-tree:
- Draw the first transaction as the path null -> I2:1 -> I1:1 -> I5:1.
- Next transaction, T200 (I2, I4): the path I2 is already drawn, so increment I2's count to 2 and draw I4 with count 1 as a new child.
- T300 (I2, I3): increment I2's count to 3 and draw an I3:1 node under it.
- T400 (I2, I1, I4): increment I2 to 4 and I1 to 2, then draw an I4:1 node.
- Continue the same way for T500-T900, creating a new branch whenever a prefix does not yet exist, and connect the same items across branches with node-links from the header table (item ID, support count, node-link).

Mining the FP-tree, starting from the least frequent item:

Item  Conditional pattern base              Conditional FP-tree        Frequent patterns generated
I5    {{I2, I1: 1}, {I2, I1, I3: 1}}        <I2: 2, I1: 2>             {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4    {{I2, I1: 1}, {I2: 1}}                <I2: 2>                    {I2, I4: 2}
I3    {{I2, I1: 2}, {I2: 2}, {I1: 2}}       <I2: 4, I1: 2>, <I1: 2>    {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1    {{I2: 4}}                             <I2: 4>                    {I2, I1: 4}

To reach a particular item, e.g., I5, we take the set of its prefix paths; the nodes common to these paths, with their counts, form the conditional pattern base. For I2 there is no prefix node present, so it contributes no row. Once the frequent patterns are found, association rules can be generated from them as before.

The FP-growth method transforms the problem of finding long frequent patterns into searching for shorter ones recursively in much smaller conditional databases. It is efficient and scalable, and is about an order of magnitude faster than the Apriori algorithm.
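A minimal sketch of the FP-tree construction step only, using the nine transactions above (plain Python; the recursive mining of conditional trees is omitted for brevity):

```python
# Sketch: building an FP-tree (construction step of FP-growth only).
from collections import Counter

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
min_support = 2

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

counts = Counter(i for t in transactions for i in t)
order = {i: c for i, c in counts.items() if c >= min_support}

root = Node(None)
for t in transactions:
    # Keep frequent items only, sorted by descending support count.
    items = sorted((i for i in t if i in order),
                   key=lambda i: (-order[i], i))
    node = root
    for item in items:   # shared prefixes share tree nodes
        node = node.children.setdefault(item, Node(item))
        node.count += 1

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)   # e.g. I2:7 with I1:4, I3:2, I4:1 beneath it, plus an I1:2 branch
```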
Chapter 8. Classification: Basic
Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
Supervised vs. Unsupervised Learning

 Supervised learning (classification)


 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data are unknown
 Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
Prediction Problems: Classification vs.
Numeric Prediction
 Classification
 predicts categorical class labels (discrete or nominal)

 classifies data (constructs a model) based on the training


set and the values (class labels) in a classifying attribute
and uses it in classifying new data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts
unknown or missing values
 Typical applications
 Credit/loan approval:

 Medical diagnosis: if a tumor is cancerous or benign

 Fraud detection: if a transaction is fraudulent

 Web page categorization: which category it is

Classification—A Two-Step
Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as

determined by the class label attribute


 The set of tuples used for model construction is training set

 The model is represented as classification rules, decision trees, or

mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model

 The known label of test sample is compared with the classified

result from the model


 Accuracy rate is the percentage of test set samples that are

correctly classified by the model


 Test set is independent of training set (otherwise overfitting)

 If the accuracy is acceptable, use the model to classify new data

 Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction

Classification algorithms are applied to the training data to produce the classifier (model).

Training data:
NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned classifier (model):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Process (2): Using the Model in Prediction

The classifier is applied to testing data (to estimate accuracy) and then to unseen data.

Testing data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) -> Tenured? yes
Decision Tree Induction: An Example

 Training data set: Buys_computer
 The data set follows an example of Quinlan's ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

 Resulting tree:
age?
|-- <=30: student?
|     |-- no: no
|     `-- yes: yes
|-- 31..40: yes
`-- >40: credit rating?
      |-- excellent: no
      `-- fair: yes
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-

conquer manner
 At start, all the training examples are at the root

 Attributes are categorical (if continuous-valued, they are

discretized in advance)
 Examples are partitioned recursively based on selected

attributes
 Test attributes are selected on the basis of a heuristic or

statistical measure (e.g., information gain)


 Conditions for stopping partitioning
 All samples for a given node belong to the same class

 There are no remaining attributes for further partitioning –

majority voting is employed for classifying the leaf


 There are no samples left
Brief Review of Entropy

 Entropy measures the impurity (uncertainty) of a data set: for m classes with probabilities pi,

   H = − Σ (i = 1..m) pi log2(pi)

 For m = 2, the binary entropy H(p, 1 − p) is 0 when p is 0 or 1, and reaches its maximum of 1 bit at p = 0.5
Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:

   Info(D) = − Σ (i = 1..m) pi log2(pi)

 Information needed (after using A to split D into v partitions) to classify D:

   Info_A(D) = Σ (j = 1..v) (|Dj| / |D|) × Info(Dj)

 Information gained by branching on attribute A:

   Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain

 Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

   Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

 For the attribute age:

   age      pi   ni   I(pi, ni)
   <=30     2    3    0.971
   31…40    4    0    0
   >40      3    2    0.971

   Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

 Here (5/14) I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

   Gain(age) = Info(D) − Info_age(D) = 0.246

 Similarly,
   Gain(income) = 0.029
   Gain(student) = 0.151
   Gain(credit_rating) = 0.048

 So age is selected as the splitting attribute.
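A short sketch computing these numbers (plain Python with math.log2; the pairs encode the age column and class label of the 14 training tuples above):

```python
from collections import Counter
from math import log2

# (age, buys_computer) pairs from the 14-tuple training set above.
rows = [("<=30","no"), ("<=30","no"), ("31..40","yes"), (">40","yes"),
        (">40","yes"), (">40","no"), ("31..40","yes"), ("<=30","no"),
        ("<=30","yes"), (">40","yes"), ("<=30","yes"), ("31..40","yes"),
        ("31..40","yes"), (">40","no")]

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

labels = [c for _, c in rows]
info_d = info(labels)                                  # 0.940

ages = {a for a, _ in rows}
info_age = sum(
    len(part) / len(rows) * info(part)
    for part in ([c for a, c in rows if a == v] for v in ages)
)                                                      # 0.694
print(round(info_d - info_age, 3))                     # Gain(age) = 0.246
```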
Computing Information-Gain for
Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the value A in increasing order
 Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
 (a_i + a_i+1)/2 is the midpoint between the values of a_i and a_i+1
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
Gain Ratio for Attribute Selection
(C4.5)
 Information gain measure is biased towards attributes with a
large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
   SplitInfo_A(D) = − Σ (j = 1..v) (|Dj| / |D|) × log2(|Dj| / |D|)
 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex.

 gain_ratio(income) = 0.029/1.557 = 0.019


 The attribute with the maximum gain ratio is selected as the
splitting attribute
Gini Index (CART, IBM IntelligentMiner)

 If a data set D contains examples from n classes, the gini index, gini(D), is defined as

   gini(D) = 1 − Σ (j = 1..n) pj²

 where pj is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

   gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

 Reduction in impurity:

   Δgini(A) = gini(D) − gini_A(D)

 The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Computation of Gini Index
- Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
      gini(D) = 1 - (9/14)² - (5/14)² = 0.459
- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
      gini_{income ∈ {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443
- Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450; thus, split on {low, medium} (and {high}) since it has the lowest Gini index
- All attributes are assumed continuous-valued
- May need other tools, e.g., clustering, to get the possible split values
- Can be modified for categorical attributes
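As an arithmetic check, a short continuation of the earlier sketch (reuses data and Counter from above):

def gini(labels):
    # gini(D) = 1 - sum_j p_j^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, predicate, class_idx=4):
    # Binary split: D1 = tuples satisfying the predicate, D2 = the rest
    d1 = [r[class_idx] for r in rows if predicate(r)]
    d2 = [r[class_idx] for r in rows if not predicate(r)]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(round(gini([r[4] for r in data]), 3))                             # 0.459
print(round(gini_split(data, lambda r: r[1] in {"low", "medium"}), 3))  # 0.443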
Comparing Attribute Selection Measures
- The three measures, in general, return good results, but:
  - Information gain: biased towards multivalued attributes
  - Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  - Gini index: biased towards multivalued attributes; has difficulty when the # of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions
Other Attribute Selection Measures
- CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
- C-SEP: performs better than info. gain and gini index in certain cases
- G-statistic: has a close approximation to the χ2 distribution
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  - The best tree is the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
- Multivariate splits (partition based on multiple variable combinations)
  - CART: finds multivariate splits based on a linear comb. of attrs.
- Which attribute selection measure is the best?
  - Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree, yielding a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"
Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why is decision tree induction popular?
  - relatively faster learning speed (than other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods
- RainForest (VLDB'98 - Gehrke, Ramakrishnan & Ganti)
  - Builds an AVC-list (attribute, value, class label)
Scalability Framework for RainForest
- Separates the scalability aspects from the criteria that determine the quality of the tree
- Builds an AVC-list: AVC (Attribute, Value, Class_label)
- AVC-set (of an attribute X)
  - Projection of the training dataset onto attribute X and the class label, where counts of the individual class labels are aggregated
- AVC-group (of a node n)
  - Set of AVC-sets of all predictor attributes at the node n
RainForest: Training Set and Its AVC-Sets
- Training examples: the 14-tuple buys_computer dataset shown earlier
- AVC-sets (counts of buys_computer = yes / no per attribute value):

  AVC-set on age
  age      yes   no
  <=30     2     3
  31..40   4     0
  >40      3     2

  AVC-set on income
  income   yes   no
  high     2     2
  medium   4     2
  low      3     1

  AVC-set on student
  student  yes   no
  yes      6     1
  no       3     4

  AVC-set on credit_rating
  credit_rating   yes   no
  fair            6     2
  excellent       3     3
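An AVC-set is just a grouped count, computable in one scan over the data. A sketch reusing data and Counter from the earlier snippets:

from collections import defaultdict

def avc_set(rows, attr, class_idx=4):
    # Project rows onto (attribute value, class label) and aggregate counts
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[attr]][r[class_idx]] += 1
    return dict(counts)

print(avc_set(data, 0))
# {'<=30': Counter({'no': 3, 'yes': 2}), '31…40': Counter({'yes': 4}),
#  '>40': Counter({'yes': 3, 'no': 2})}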
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
- Uses a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
- Each subset is used to create a tree, resulting in several trees
- These trees are examined and used to construct a new tree T'
  - It turns out that T' is very close to the tree that would be generated using the whole data set together
- Adv: requires only two scans of the DB; an incremental algorithm
Presentation of Classification Results
(figure: screenshot of a classification-results visualization)

SGI/MineSet 3.0
(figure: SGI/MineSet 3.0 screenshot)

Interactive Visual Mining by Perception-Based Classification (PBC)
(figure: PBC screenshot)
Chapter 8. Classification: Basic Concepts
- Classification: Basic Concepts
- Decision Tree Induction
- Bayes Classification Methods
- Rule-Based Classification
- Model Evaluation and Selection
- Techniques to Improve Classification Accuracy: Ensemble Methods
- Summary
Bayesian Classification: Why?
- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayes' Theorem: Basics
- Total probability theorem:
      P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)
- Bayes' theorem:
      P(H | X) = P(X | H) P(H) / P(X)
- Let X be a data sample ("evidence"): class label is unknown
- Let H be a hypothesis that X belongs to class C
- Classification is to determine P(H|X) (i.e., the posterior probability): the probability that the hypothesis holds given the observed data sample X
- P(H) (prior probability): the initial probability
  - E.g., X will buy a computer, regardless of age, income, …
- P(X): probability that the sample data is observed
- P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  - E.g., given that X will buy a computer, the prob. that X is 31..40 with medium income
Prediction Based on Bayes' Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
      P(H | X) = P(X | H) P(H) / P(X)
- Informally, this can be viewed as: posterior = likelihood × prior / evidence
- Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
- Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost
Classification Is to Derive the Maximum Posteriori
- Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
- Suppose there are m classes C1, C2, …, Cm
- Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
- This can be derived from Bayes' theorem:
      P(C_i | X) = P(X | C_i) P(C_i) / P(X)
- Since P(X) is constant for all classes, only
      P(X | C_i) P(C_i)
  needs to be maximized
Naïve Bayes Classifier
- A simplifying assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
      P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × … × P(x_n | C_i)
- This greatly reduces the computation cost: only the class distributions are counted
- If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k for A_k divided by |C_{i,D}| (# of tuples of C_i in D)
- If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
      g(x, μ, σ) = (1 / (√(2π) σ)) e^{-(x - μ)² / (2σ²)}
  and P(x_k|C_i) = g(x_k, μ_{C_i}, σ_{C_i})
Naïve Bayes Classifier: Training Dataset
- Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
- Training data: the 14-tuple buys_computer dataset shown earlier
- Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Naïve Bayes Classifier: An Example
- P(Ci):
      P(buys_computer = "yes") = 9/14 = 0.643
      P(buys_computer = "no") = 5/14 = 0.357
- Compute P(X|Ci) for each class:
      P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
      P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
      P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
      P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
      P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
      P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
      P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
      P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
      P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
      P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
      P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
      P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
- Therefore, X belongs to class buys_computer = "yes"
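The whole calculation fits in a few lines. An illustrative sketch (categorical attributes only, no smoothing; see the Laplacian correction below) that reuses data and Counter from the earlier snippets:

def nb_classify(rows, x, class_idx=4):
    scores = {}
    labels = [r[class_idx] for r in rows]
    for c, n_c in Counter(labels).items():
        score = n_c / len(rows)                 # prior P(Ci)
        cls_rows = [r for r in rows if r[class_idx] == c]
        for k, v in enumerate(x):
            # P(x_k | Ci), under the class-conditional independence assumption
            score *= sum(r[k] == v for r in cls_rows) / n_c
        scores[c] = score
    return max(scores, key=scores.get), scores

x = ("<=30", "medium", "yes", "fair")
print(nb_classify(data, x))   # ('yes', {'no': ~0.007, 'yes': ~0.028})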
Avoiding the Zero-Probability Problem
- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
      P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i)
- Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
- Use the Laplacian correction (or Laplacian estimator)
  - Adding 1 to each case:
      Prob(income = low) = 1/1003
      Prob(income = medium) = 991/1003
      Prob(income = high) = 11/1003
  - The "corrected" prob. estimates are close to their "uncorrected" counterparts
Naïve Bayes Classifier: Comments
- Advantages
  - Easy to implement
  - Good results obtained in most of the cases
- Disadvantages
  - Assumption: class conditional independence, therefore loss of accuracy
  - Practically, dependencies exist among variables
    - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - Dependencies among these cannot be modeled by the naïve Bayes classifier
- How to deal with these dependencies? Bayesian belief networks (Chapter 9)
Chapter 8. Classification: Basic Concepts
- Classification: Basic Concepts
- Decision Tree Induction
- Bayes Classification Methods
- Rule-Based Classification
- Model Evaluation and Selection
- Techniques to Improve Classification Accuracy: Ensemble Methods
- Summary
Using IF-THEN Rules for Classification
- Represent the knowledge in the form of IF-THEN rules
      R: IF age = youth AND student = yes THEN buys_computer = yes
- Rule antecedent/precondition vs. rule consequent
- Assessment of a rule: coverage and accuracy
  - n_covers = # of tuples covered by R
  - n_correct = # of tuples correctly classified by R
      coverage(R) = n_covers / |D|   /* D: training data set */
      accuracy(R) = n_correct / n_covers
  (a worked check follows this list)
- If more than one rule is triggered, conflict resolution is needed
  - Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
  - Class-based ordering: decreasing order of prevalence or misclassification cost per class
  - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
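A quick check of coverage and accuracy on the same 14-tuple dataset (hypothetical helper; the rule is the example R above, with "youth" read as age <= 30):

def rule_metrics(rows, antecedent, consequent, class_idx=4):
    covered = [r for r in rows if antecedent(r)]                  # tuples R covers
    correct = [r for r in covered if r[class_idx] == consequent]  # covered and right
    return len(covered) / len(rows), len(correct) / len(covered)

# R: IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
cov, acc = rule_metrics(data, lambda r: r[0] == "<=30" and r[2] == "yes", "yes")
print(cov, acc)   # coverage 2/14 ≈ 0.143, accuracy 2/2 = 1.0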
Rule Extraction from a Decision Tree
- Rules are easier to understand than large trees
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
- Rules are mutually exclusive and exhaustive

(tree: age? splits into <=30 -> student? (no -> no, yes -> yes); 31..40 -> yes; >40 -> credit_rating? (excellent -> no, fair -> yes))

- Example: rule extraction from our buys_computer decision tree
      IF age = young AND student = no THEN buys_computer = no
      IF age = young AND student = yes THEN buys_computer = yes
      IF age = mid-age THEN buys_computer = yes
      IF age = old AND credit_rating = excellent THEN buys_computer = no
      IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Induction: Sequential Covering Method
- Sequential covering algorithm: extracts rules directly from training data
- Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
- Rules are learned sequentially; each rule for a given class Ci should cover many tuples of Ci but none (or few) of the tuples of other classes
- Steps:
  - Rules are learned one at a time
  - Each time a rule is learned, the tuples covered by the rule are removed
  - Repeat the process on the remaining tuples until a termination condition holds, e.g., no more training examples, or the quality of a returned rule falls below a user-specified threshold
- Contrast with decision-tree induction, which learns a set of rules simultaneously
Sequential Covering Algorithm
while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

(figure: successive rules 1, 2, 3 each covering a different subset of the positive examples)
Rule Generation
- To generate a rule:
while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to current rule
    else break

(figure: a rule specialized step by step: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5, closing in on the positive examples and excluding the negative ones)
How to Learn-One-Rule?
- Start with the most general rule possible: condition = empty
- Add new attribute tests by adopting a greedy depth-first strategy
  - Pick the one that most improves the rule quality
- Rule-quality measures: consider both coverage and accuracy
  - Foil-gain (in FOIL & RIPPER): assesses the info gain of extending the condition:
      FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) - log2(pos / (pos + neg)))
  - favors rules that have high accuracy and cover many positive tuples
- Rule pruning based on an independent set of test tuples:
      FOIL_Prune(R) = (pos - neg) / (pos + neg)
  where pos/neg are the # of positive/negative tuples covered by R; if FOIL_Prune is higher for the pruned version of R, prune R
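FOIL_Gain in code, with pos/neg the counts covered before extending the rule and pos'/neg' the counts after. This reuses the math import from the earlier snippets; the example numbers are derived from the 14-tuple dataset (an empty rule for class "yes" covers 9 positives / 5 negatives; adding student = yes leaves 6 / 1):

def foil_gain(pos, neg, pos2, neg2):
    # pos/neg: positives/negatives covered by the current rule R
    # pos2/neg2: positives/negatives covered after adding predicate p to R
    if pos2 == 0:
        return float("-inf")   # the extension covers no positive tuples
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

print(round(foil_gain(9, 5, 6, 1), 2))   # ≈ 2.49, a worthwhile extension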
Chapter 8. Classification: Basic Concepts
- Classification: Basic Concepts
- Decision Tree Induction
- Bayes Classification Methods
- Rule-Based Classification
- Model Evaluation and Selection
- Techniques to Improve Classification Accuracy: Ensemble Methods
- Summary
Model Evaluation and Selection
- Evaluation metrics: how can we measure accuracy? Other metrics to consider?
- Use a validation test set of class-labeled tuples instead of the training set when assessing accuracy
- Methods for estimating a classifier's accuracy:
  - Holdout method, random subsampling
  - Cross-validation
  - Bootstrap
- Comparing classifiers:
  - Confidence intervals
  - Cost-benefit analysis and ROC curves
Classifier Evaluation Metrics: Confusion Matrix
Confusion matrix:

  Actual class \ Predicted class   C1                     ¬C1
  C1                               True Positives (TP)    False Negatives (FN)
  ¬C1                              False Positives (FP)   True Negatives (TN)

Example of a confusion matrix:

  Actual class \ Predicted class   buys_computer = yes   buys_computer = no   Total
  buys_computer = yes              6954                  46                   7000
  buys_computer = no               412                   2588                 3000
  Total                            7366                  2634                 10000

- Given m classes, an entry CM_{i,j} in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
- May have extra rows/columns to provide totals
Accuracy, Error Rate, Sensitivity and Specificity

  A \ P   C    ¬C
  C       TP   FN   P
  ¬C      FP   TN   N
          P'   N'   All

- Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
      Accuracy = (TP + TN) / All
- Error rate: 1 - accuracy, or
      Error rate = (FP + FN) / All
- Class imbalance problem:
  - One class may be rare, e.g., fraud, or HIV-positive
  - Significant majority of the negative class and minority of the positive class
  - Sensitivity: true positive recognition rate
      Sensitivity = TP / P
  - Specificity: true negative recognition rate
      Specificity = TN / N
Precision and Recall, and F-measures
- Precision: exactness; what % of tuples that the classifier labeled as positive are actually positive?
      Precision = TP / (TP + FP)
- Recall: completeness; what % of positive tuples did the classifier label as positive?
      Recall = TP / (TP + FN)
- A perfect score is 1.0
- Inverse relationship between precision & recall
- F measure (F1 or F-score): harmonic mean of precision and recall:
      F1 = 2 × precision × recall / (precision + recall)
- Fβ: weighted measure of precision and recall:
      Fβ = (1 + β²) × precision × recall / (β² × precision + recall)
  - assigns β times as much weight to recall as to precision
Classifier Evaluation Metrics: Example

  Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
  cancer = yes                     90             210           300     30.00 (sensitivity)
  cancer = no                      140            9560          9700    98.56 (specificity)
  Total                            230            9770          10000   96.50 (accuracy)

- Precision = 90/230 = 39.13%;  Recall = 90/300 = 30.00%
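Recomputing these metrics from the raw confusion-matrix entries (note that (TP + TN) / All = 9650/10000, i.e., 96.50%):

TP, FN, FP, TN = 90, 210, 140, 9560
P, N, All = TP + FN, FP + TN, TP + FN + FP + TN
print(f"sensitivity = {TP / P:.2%}")            # 30.00%
print(f"specificity = {TN / N:.2%}")            # 98.56%
print(f"accuracy    = {(TP + TN) / All:.2%}")   # 96.50%
print(f"precision   = {TP / (TP + FP):.2%}")    # 39.13%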
Holdout & Cross-Validation Methods
- Holdout method
  - Given data is randomly partitioned into two independent sets
    - Training set (e.g., 2/3) for model construction
    - Test set (e.g., 1/3) for accuracy estimation
  - Random subsampling: a variation of holdout
    - Repeat holdout k times; accuracy = avg. of the accuracies obtained
- Cross-validation (k-fold, where k = 10 is most popular)
  - Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
  - At the i-th iteration, use Di as the test set and the others as the training set
  - Leave-one-out: k folds where k = # of tuples, for small-sized data
  - *Stratified cross-validation*: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
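A minimal sketch of plain k-fold cross-validation (no stratification), assuming caller-supplied train_fn/score_fn hooks:

import random

def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # random partition of the tuples
    return [idx[i::k] for i in range(k)]    # k mutually exclusive, near-equal folds

def cross_validate(rows, k, train_fn, score_fn):
    folds = k_fold_indices(len(rows), k)
    accs = []
    for i in range(k):                      # fold i is the test set this round
        test = [rows[j] for j in folds[i]]
        train = [rows[j] for f in folds[:i] + folds[i + 1:] for j in f]
        accs.append(score_fn(train_fn(train), test))
    return sum(accs) / k                    # average accuracy over the k folds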