Data Mining: Practical Machine Learning Tools and Techniques
Rules: classification and association
Decision trees
Datasets
  Weather, contact lens, CPU performance, labor negotiation data, soybean classification
Fielded applications
  Ranking web pages, loan applications, screening images, load forecasting, machine fault diagnosis, market basket analysis
Potentially valuable resource
Raw data is useless: need techniques to automatically extract information from it
Information is crucial
Given: embryos described by 60 features
Problem: selection of embryos that will survive
Data: historical records of embryos and outcome
Given: cows described by 700 features
Problem: selection of cows that should be culled
Data: historical records and farmers' decisions
Data mining
Extracting information from data
Needed: programs that detect patterns and regularities in the data
Strong patterns lead to good predictions
  Problem 1: most patterns are not interesting
  Problem 2: patterns may be inexact (or spurious)
  Problem 3: data may be garbled or missing
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)
Algorithms for acquiring structural descriptions from examples
Structural descriptions represent patterns explicitly
  Can be used to predict outcome in new situation
  Can be used to understand and explain how prediction is derived (may be even more important)
Structural descriptions
Operational definition:
Things learn when they change their behavior in a way that makes them perform better in the future.
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
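The rule set above reads naturally as a decision list: the rules are tried in order and the first one that fires decides the outcome. A minimal Python sketch of that reading follows. It assumes each instance is a dictionary with the keys "outlook", "humidity" and "windy"; this representation and the name classify are illustrative choices, not part of the slides.

```python
# Each rule is a (condition, outcome) pair, listed in the same order as above.
RULES = [
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high", "no"),
    (lambda x: x["outlook"] == "rainy" and x["windy"], "no"),
    (lambda x: x["outlook"] == "overcast", "yes"),
    (lambda x: x["humidity"] == "normal", "yes"),
]


def classify(instance):
    """Return the play decision of the first rule whose condition holds."""
    for condition, play in RULES:
        if condition(instance):
            return play
    return "yes"  # "none of the above" default rule


if __name__ == "__main__":
    day = {"outlook": "rainy", "humidity": "normal", "windy": True}
    print(classify(day))
```

Order matters in this reading: a rainy, windy day with normal humidity is caught by the second rule before the fourth rule is ever tried.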
Ross Quinlan
Machine learning researcher from 1970's
University of Sydney, Australia
1986 "Induction of decision trees", ML Journal
1993 C4.5: Programs for Machine Learning. Morgan Kaufmann
199? Started
Classification rule:
predicts value of a given attribute (the classification of an example)
If outlook = sunny and humidity = high then play = no
Association rule:
predicts value of arbitrary attribute (or combination)
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
...
Soybean classification
Attribute                                Number of values   Sample value
Environment   Time of occurrence         7                  July
              Precipitation              3                  Above normal
Seed          Condition                  2                  Normal
              Mold growth                2                  Absent
Fruit         Condition of fruit pods    4                  Normal
              Fruit spots                5                  ?
Leaf          Condition                  2                  Abnormal
              Leaf spot size             3                  ?
Stem          Condition                  2                  Abnormal
              Stem lodging               2                  Yes
Root          Condition                  3                  Normal
Diagnosis                                19                 Diaporthe stem canker
But in this domain, "leaf condition is normal" implies "leaf malformation is absent"!
Fielded applications
The result of learning (or the learning method itself) is deployed in practical applications
  Processing loan applications
  Screening images for oil slicks
  Electricity supply forecasting
  Diagnosis of machine faults
  Marketing and sales
  Separating crude oil and natural gas
  Reducing banding in rotogravure printing
  Finding appropriate technicians for telephone faults
  Scientific applications: biology, astronomy, chemistry
  Automatic selection of TV programs
  Monitoring intensive care patients
(American
Given: questionnaire with financial and personal information
Question: should money be lent?
Simple statistical method covers 90% of cases
Borderline cases referred to loan officers
But: 50% of accepted borderline cases defaulted!
Solution: reject all borderline cases?
age, years with current employer, years at current address, years with the bank, other credit cards possessed, ...
human experts only 50%
Screening images
Given: radar satellite images of coastal waters
Problem: detect oil slicks in those images
Oil slicks appear as dark regions with changing size and shape
Not easy: lookalike dark regions can be caused by weather conditions (e.g. high wind)
Expensive process requiring highly trained personnel
size of region
shape, area
intensity
sharpness and jaggedness of boundaries
proximity of other regions
info about background
Constraints:
  Few training examples: oil slicks are rare!
  Unbalanced data: most dark regions aren't slicks
  Regions from same image form a batch
  Requirement: adjustable false-alarm rate
Load forecasting
Electricity supply companies need forecast of future demand for power
Forecasts of min/max load for each hour lead to significant savings
Given: manually constructed load model that assumes "normal" climatic conditions
Problem: adjust for weather conditions
Static model consists of:
  base load for the year
  load periodicity over the year
  effect of holidays
temperature, humidity, wind speed and cloud cover readings, plus difference between actual load and predicted load
Average difference among three "most similar" days added to static model
Linear regression coefficients form attribute weights in similarity function
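The correction step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the fielded system: the weighted squared distance, the function names, the non-empty history and the non-negative weights are all assumptions; the slides only say that regression coefficients serve as attribute weights and that the average difference of the three most similar days is added to the static model.

```python
def squared_distance(a, b, weights):
    """Weighted squared distance between two tuples of weather readings."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def corrected_forecast(static_prediction, today, history, weights, k=3):
    """Add the mean load difference of the k most similar past days.

    history is a non-empty list of (weather_readings, actual_minus_predicted)
    pairs; weights are assumed non-negative attribute weights.
    """
    nearest = sorted(
        history, key=lambda day: squared_distance(today, day[0], weights)
    )[:k]
    correction = sum(difference for _, difference in nearest) / len(nearest)
    return static_prediction + correction
```

Ranking by squared distance gives the same ordering as ranking by the distance itself, so the square root is omitted.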
Diagnosis: classical domain of expert systems
Given: Fourier analysis of vibrations measured at various points of a device's mounting
Question: which fault is present?
Preventative maintenance of electromechanical motors and generators
Information very noisy
So far: diagnosis by expert/hand-crafted rules
Available: 600 faults with expert's diagnosis
~300 unsatisfactory, rest used for training
Attributes augmented by intermediate concepts that embodied causal domain knowledge
Expert not satisfied with initial rules because they did not relate to his domain knowledge
Further background knowledge resulted in more complex rules that were satisfactory
Learned rules outperformed hand-crafted ones
Companies precisely record massive amounts of marketing and sales data
Applications:
  Customer loyalty: identifying customers that are likely to defect by detecting changes in their behavior (e.g. banks/phone companies)
  Special offers: identifying profitable customers (e.g. reliable owners of credit cards that need extra money during the holiday season)
Association techniques find groups of items that tend to occur together in a transaction (used to analyze checkout data)
Focusing promotional mailouts (targeted campaigns are cheaper than mass-marketed ones)
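To make "items that tend to occur together in a transaction" concrete, here is a small Python sketch that counts how often each pair of items shares a basket. It is plain brute-force pair counting, offered only as an illustration; the function name frequent_pairs, the min_support threshold and the item names in the test data are assumptions, and the slides do not name a specific association algorithm.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        # De-duplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}
```

Counting every pair grows quadratically with basket size, so this only suits small examples.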
Statistics: testing hypotheses
Machine learning: finding the right hypothesis
Decision trees (C4.5 and CART)
Nearest-neighbor methods
Most ML algorithms employ statistical techniques
Statisticians
Sir Ronald Aylmer Fisher
Born: 17 Feb 1890, London, England
Died: 29 July 1962, Adelaide, Australia
Numerous distinguished contributions to developing the theory and application of statistics for making quantitative a vast field of biology
Leo Breiman
Developed decision trees
1984 Classification and Regression Trees. Wadsworth.
Generalization as search
Inductive learning: find a concept description that fits the data
Example: rule sets as description language
  Enormous, but finite, search space
Simple solution:
  enumerate the concept space
  eliminate descriptions that do not fit examples
  surviving descriptions contain target concept
4 x 4 x 3 x 3 x 2 = 288 possible combinations
With 14 rules: 2.7 x 10^34 possible rule sets
More than one description may survive
No description may survive
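The count of 288 and the enumerate-and-eliminate idea can be checked with a short Python sketch. It assumes the nominal weather attributes take the values listed in ATTRIBUTES (the temperature values "hot" and "mild" are assumptions), that each attribute in a rule is either tested against one value or left out, and that a rule "fits" when it never fires on an example of the other class. All names here are illustrative.

```python
from itertools import product

ANY = None  # marks an attribute that the rule does not test

ATTRIBUTES = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": [True, False],
}
CLASSES = ["yes", "no"]


def all_rules():
    """Enumerate every single rule: (3+1) x (3+1) x (2+1) x (2+1) x 2."""
    choices = [values + [ANY] for values in ATTRIBUTES.values()]
    for conditions in product(*choices):
        for play in CLASSES:
            yield dict(zip(ATTRIBUTES, conditions)), play


def consistent(rule, examples):
    """True if the rule never fires on an example with a different class."""
    conditions, play = rule
    for instance, label in examples:
        fires = all(v is ANY or instance[a] == v for a, v in conditions.items())
        if fires and label != play:
            return False
    return True


if __name__ == "__main__":
    print(len(list(all_rules())))   # number of single rules
    print(float(288 ** 14))         # ordered sets of 14 rules
```

Filtering all_rules() with consistent on a handful of labeled days leaves many survivors, which matches the point above that more than one description may fit the data.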
Another view of generalization as search: hill-climbing in description space according to pre-specified matching criterion
Most practical algorithms use heuristic search that cannot guarantee to find the optimum solution
Bias
Language bias: the concept description language
Search bias: the order in which the space is searched
Overfitting-avoidance bias: the way that overfitting to the particular training data is avoided
Language bias
Important question:
Universal language can express arbitrary subsets of examples
If language includes logical or ("disjunction"), it is universal
Example: rule sets
Domain knowledge can be used to exclude some concept descriptions a priori from the search
Search bias
Search heuristic
  "Greedy" search: performing the best single step
  "Beam search": keeping several alternatives
  ...
Direction of search
  General-to-specific: e.g. specializing a rule by adding conditions
  Specific-to-general: e.g. generalizing an individual instance into a rule
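The difference between greedy and beam search can be sketched generically in Python. The function below is an illustration, not any particular learner: refine (which returns the refinements of a description, such as a rule with one more condition) and score (higher is better) are hypothetical callables supplied by the caller. With beam_width=1 the procedure is greedy search; a larger width keeps several alternatives alive.

```python
def beam_search(start, refine, score, beam_width=1, steps=3):
    """Search a description space, keeping the beam_width best candidates.

    refine(description) returns a list of refined descriptions;
    score(description) returns a number, higher meaning better.
    """
    beam = [start]
    best = start
    for _ in range(steps):
        candidates = [c for description in beam for c in refine(description)]
        if not candidates:
            break  # nothing left to refine
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
        if score(beam[0]) > score(best):
            best = beam[0]
    return best
```

Neither setting guarantees the optimum; a wider beam only lowers the chance that an early, locally best step hides a better description.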
Overfitting-avoidance bias
E.g. balancing simplicity and number of errors
E.g. pruning (simplifying a description)
  Pre-pruning: stops at a simple description before search proceeds to an overly complex one
  Post-pruning: generates a complex description first and simplifies it afterwards
Ethical issues arise in practical applications
Anonymizing data is difficult
  85% of Americans can be identified from just zip code, birth date and sex
Data mining often used to discriminate
  E.g. loan applications: using some information (e.g. sex, religion, race) is unethical
  E.g. same information ok in medical application
  E.g. area code may correlate with race
Important questions:
  Who is permitted access to the data?
  For what purpose was the data collected?
  What kind of conclusions can be legitimately drawn from it?
Caveats must be attached to results
Purely statistical arguments are never sufficient!
Are resources put to good use?