
Data Mining: Practical Machine Learning Tools and Techniques

1) Data mining involves extracting implicit, previously unknown, and potentially useful information from data. It uses machine learning techniques to detect patterns in large datasets. 2) Machine learning algorithms are used to acquire structural descriptions like decision trees and rules from examples in order to predict outcomes for new data or understand how predictions are made. 3) Data mining applications include weather prediction, contact lens recommendation, credit scoring, and soybean disease diagnosis using large datasets with many attributes.

Uploaded by

German Toledo
Copyright
© Attribution Non-Commercial (BY-NC)

Data Mining

Practical Machine Learning Tools and Techniques


Slides for Chapter 1 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

What's it all about?

Data vs information
Data mining and machine learning
Structural descriptions
    Rules: classification and association
    Decision trees
Datasets
    Weather, contact lens, CPU performance, labor negotiation data, soybean classification
Fielded applications
    Ranking web pages, loan applications, screening images, load forecasting, machine fault diagnosis, market basket analysis
Generalization as search
Data mining and ethics

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)

Data vs. information

Society produces huge amounts of data
    Sources: business, science, medicine, economics, geography, environment, sports, ...
Potentially valuable resource
Raw data is useless: need techniques to automatically extract information from it
    Data: recorded facts
    Information: patterns underlying the data

Information is crucial

Example 1: in vitro fertilization
    Given: embryos described by 60 features
    Problem: selection of embryos that will survive
    Data: historical records of embryos and outcome
Example 2: cow culling
    Given: cows described by 700 features
    Problem: selection of cows that should be culled
    Data: historical records and farmers' decisions

Data mining

Extracting implicit, previously unknown, potentially useful information from data
Needed: programs that detect patterns and regularities in the data
Strong patterns lead to good predictions
    Problem 1: most patterns are not interesting
    Problem 2: patterns may be inexact (or spurious)
    Problem 3: data may be garbled or missing

Machine learning techniques

Algorithms for acquiring structural descriptions from examples
Structural descriptions represent patterns explicitly
    Can be used to predict outcome in new situation
    Can be used to understand and explain how prediction is derived (may be even more important)
Methods originate from artificial intelligence, statistics, and research on databases

Structural descriptions

Example: if-then rules

    If tear production rate = reduced then recommendation = none
    Otherwise, if age = young and astigmatic = no then recommendation = soft

    Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
    Young           Myope                   No           Reduced               None
    Young           Hypermetrope            No           Normal                Soft
    Pre-presbyopic  Hypermetrope            No           Reduced               None
    Presbyopic      Myope                   Yes          Normal                Hard

Can machines really learn?

Definitions of "learning" from dictionary:
    To get knowledge of by study, experience, or being taught (difficult to measure)
    To become aware by information or from observation (difficult to measure)
    To commit to memory (trivial for computers)
    To be informed of, ascertain; to receive instruction (trivial for computers)
Operational definition:
    Things learn when they change their behavior in a way that makes them perform better in the future.
    Does a slipper learn?
Does learning imply intention?

The weather problem

Conditions for playing a certain game

    Outlook   Temperature  Humidity  Windy  Play
    Sunny     Hot          High      False  No
    Sunny     Hot          High      True   No
    Overcast  Hot          High      False  Yes
    Rainy     Mild         Normal    False  Yes

    If outlook = sunny and humidity = high then play = no
    If outlook = rainy and windy = true then play = no
    If outlook = overcast then play = yes
    If humidity = normal then play = yes
    If none of the above then play = yes
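
The weather rules are meant to be read in order, with the first matching rule deciding the outcome. A minimal Python sketch of that reading follows; the dictionary keys, the `ROWS` list and the function name `play` are illustrative choices, not part of the slides, and only the four rows shown in the table are encoded.

```python
# The four example rows from the weather table (encoding is an assumption).
ROWS = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": True, "play": "no"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False, "play": "yes"},
    {"outlook": "rainy", "temperature": "mild", "humidity": "normal", "windy": False, "play": "yes"},
]


def play(row):
    """Apply the rules top to bottom; the first rule that matches decides."""
    if row["outlook"] == "sunny" and row["humidity"] == "high":
        return "no"
    if row["outlook"] == "rainy" and row["windy"]:
        return "no"
    if row["outlook"] == "overcast":
        return "yes"
    if row["humidity"] == "normal":
        return "yes"
    return "yes"  # none of the above


predictions = [play(row) for row in ROWS]
print(predictions)
```

Because the rules form an ordered list, reordering them can change the prediction for a row that matches more than one rule.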

Ross Quinlan

Machine learning researcher from 1970's
University of Sydney, Australia
1986 "Induction of decision trees" ML Journal
1993 C4.5: Programs for machine learning. Morgan Kaufmann
199? Started

Classification vs. association rules

Classification rule:
    predicts value of a given attribute (the classification of an example)

    If outlook = sunny and humidity = high then play = no

Association rule:
    predicts value of arbitrary attribute (or combination)

    If temperature = cool then humidity = normal
    If humidity = normal and windy = false then play = yes
    If outlook = sunny and play = no then humidity = high
    If windy = false and play = no then outlook = sunny and humidity = high
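
One way to see the difference is to represent every rule as a pair of attribute-value dictionaries: a classification rule always has the class attribute on the right-hand side, while an association rule may have any attribute or combination there. The sketch below checks rules of that shape against the four weather rows shown earlier; the representation and the name `evaluate` are assumptions made for illustration.

```python
# Four weather rows as shown on the earlier slide (encoding is an assumption).
ROWS = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": True, "play": "no"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False, "play": "yes"},
    {"outlook": "rainy", "temperature": "mild", "humidity": "normal", "windy": False, "play": "yes"},
]


def evaluate(antecedent, consequent, rows):
    """Return (rows matching the 'if' part, of those: rows also matching the 'then' part)."""
    covered = [r for r in rows if all(r[k] == v for k, v in antecedent.items())]
    correct = [r for r in covered if all(r[k] == v for k, v in consequent.items())]
    return len(covered), len(correct)


# Classification rule: the right-hand side is always the class attribute "play".
classification = ({"outlook": "sunny", "humidity": "high"}, {"play": "no"})

# Association rules: the right-hand side can be any attribute or combination.
association_1 = ({"outlook": "sunny", "play": "no"}, {"humidity": "high"})
association_2 = ({"windy": False, "play": "no"}, {"outlook": "sunny", "humidity": "high"})

for antecedent, consequent in (classification, association_1, association_2):
    print(antecedent, "->", consequent, evaluate(antecedent, consequent, ROWS))
```

With only four rows the counts are tiny; the point is the rule representation, not the statistics.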

Weather data with mixed attributes

Some attributes have numeric values

    Outlook   Temperature  Humidity  Windy  Play
    Sunny     85           85        False  No
    Sunny     80           90        True   No
    Overcast  83           86        False  Yes
    Rainy     75           80        False  Yes

    If outlook = sunny and humidity > 83 then play = no
    If outlook = rainy and windy = true then play = no
    If outlook = overcast then play = yes
    If humidity < 85 then play = yes
    If none of the above then play = yes
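
With numeric attributes the rule conditions become threshold tests instead of equality tests. A short sketch, assuming the rows are encoded as tuples `(outlook, temperature, humidity, windy, play)`; the tuple layout and the function name `play_numeric` are illustrative only.

```python
# (outlook, temperature, humidity, windy, play) for the four rows in the table.
NUMERIC_ROWS = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
]


def play_numeric(outlook, temperature, humidity, windy):
    """Ordered rules with numeric thresholds on humidity; temperature is unused by the rules."""
    if outlook == "sunny" and humidity > 83:
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity < 85:
        return "yes"
    return "yes"  # none of the above


numeric_predictions = [play_numeric(*row[:4]) for row in NUMERIC_ROWS]
print(numeric_predictions)
```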

The contact lenses data

    Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
    Young           Myope                   No           Reduced               None
    Young           Myope                   No           Normal                Soft
    Young           Myope                   Yes          Reduced               None
    Young           Myope                   Yes          Normal                Hard
    Young           Hypermetrope            No           Reduced               None
    Young           Hypermetrope            No           Normal                Soft
    Young           Hypermetrope            Yes          Reduced               None
    Young           Hypermetrope            Yes          Normal                Hard
    Pre-presbyopic  Myope                   No           Reduced               None
    Pre-presbyopic  Myope                   No           Normal                Soft
    Pre-presbyopic  Myope                   Yes          Reduced               None
    Pre-presbyopic  Myope                   Yes          Normal                Hard
    Pre-presbyopic  Hypermetrope            No           Reduced               None
    Pre-presbyopic  Hypermetrope            No           Normal                Soft
    Pre-presbyopic  Hypermetrope            Yes          Reduced               None
    Pre-presbyopic  Hypermetrope            Yes          Normal                None
    Presbyopic      Myope                   No           Reduced               None
    Presbyopic      Myope                   No           Normal                None
    Presbyopic      Myope                   Yes          Reduced               None
    Presbyopic      Myope                   Yes          Normal                Hard
    Presbyopic      Hypermetrope            No           Reduced               None
    Presbyopic      Hypermetrope            No           Normal                Soft
    Presbyopic      Hypermetrope            Yes          Reduced               None
    Presbyopic      Hypermetrope            Yes          Normal                None

A complete and correct rule set

    If tear production rate = reduced then recommendation = none
    If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
    If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
    If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
    If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
    If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
    If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
    If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
    If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
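
"Complete" can be read as: every one of the 24 examples is covered by at least one rule. "Correct" can be read as: every rule that fires on an example recommends the lens type listed in the table. The following sketch checks both properties; the dictionary-based rule encoding and all identifiers are assumptions for illustration, and the labels are transcribed from the contact lenses table.

```python
from itertools import product

AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATIC = ["no", "yes"]
TEAR_RATES = ["reduced", "normal"]

# Recommended lenses in table order (tear rate varies fastest, age slowest).
LABELS = [
    "none", "soft", "none", "hard", "none", "soft", "none", "hard",
    "none", "soft", "none", "hard", "none", "soft", "none", "none",
    "none", "none", "none", "hard", "none", "soft", "none", "none",
]

# Each rule is (conditions, recommendation); a missing key means "not tested".
RULES = [
    ({"tear": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]

KEYS = ("age", "prescription", "astigmatic", "tear")
EXAMPLES = [dict(zip(KEYS, combo))
            for combo in product(AGES, PRESCRIPTIONS, ASTIGMATIC, TEAR_RATES)]


def fired(example):
    """Set of recommendations made by all rules whose conditions match the example."""
    return {rec for cond, rec in RULES
            if all(example[k] == v for k, v in cond.items())}


complete = all(fired(e) for e in EXAMPLES)
correct = all(fired(e) <= {label} for e, label in zip(EXAMPLES, LABELS))
print("complete:", complete, "correct:", correct)
```

Unlike the ordered weather rules, these rules are checked independently of one another, so no rule relies on an earlier one having failed.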

A decision tree for this problem

[Figure: decision tree for the contact lens data]

Classifying iris flowers

         Sepal length  Sepal width  Petal length  Petal width  Type
    1    5.1           3.5          1.4           0.2          Iris setosa
    2    4.9           3.0          1.4           0.2          Iris setosa
    51   7.0           3.2          4.7           1.4          Iris versicolor
    52   6.4           3.2          4.5           1.5          Iris versicolor
    101  6.3           3.3          6.0           2.5          Iris virginica
    102  5.8           2.7          5.1           1.9          Iris virginica

    If petal length < 2.45 then Iris setosa
    If sepal width < 2.10 then Iris versicolor
    ...

Predicting CPU performance

Example: 209 different computer configurations

         Cycle time (ns)  Main memory (Kb)  Cache (Kb)  Channels       Performance
         MYCT             MMIN    MMAX      CACH        CHMIN  CHMAX   PRP
    1    125              256     6000      256         16     128     198
    2    29               8000    32000     32          8      32      269
    208  480              512     8000      32          0      0       67
    209  480              1000    4000      0           0      0       45

Linear regression function

    PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
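
The regression function is a weighted sum of the six attributes plus a constant. A small sketch that evaluates it for rows from the table follows; the function name `predict_prp` and the row tuples are illustrative, and the coefficients are those of the linear regression function on the slide.

```python
def predict_prp(myct, mmin, mmax, cach, chmin, chmax):
    """Evaluate the linear regression function for one computer configuration."""
    return (-55.9
            + 0.0489 * myct
            + 0.0153 * mmin
            + 0.0056 * mmax
            + 0.6410 * cach
            - 0.2700 * chmin
            + 1.480 * chmax)


# (MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX, actual PRP) for rows 1 and 209.
TABLE_ROWS = [
    (125, 256, 6000, 256, 16, 128, 198),
    (480, 1000, 4000, 0, 0, 0, 45),
]

for *attributes, actual in TABLE_ROWS:
    print(f"predicted {predict_prp(*attributes):.1f}, actual {actual}")
```

A single linear function fitted to all 209 configurations can be far off for an individual row, which is visible when comparing the predicted and actual values here.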

Data from labor negotiations

    Attribute                        Type                         1     2     3     40
    Duration                         (Number of years)            1     2     3     2
    Wage increase first year         Percentage                   2%    4%    4.3%  4.5
    Wage increase second year        Percentage                   ?     5%    4.4%  4.0
    Wage increase third year         Percentage                   ?     ?     ?     ?
    Cost of living adjustment        {none,tcf,tc}                none  tcf   ?     none
    Working hours per week           (Number of hours)            28    35    38    40
    Pension                          {none,ret-allw,empl-cntr}    none  ?     ?     ?
    Standby pay                      Percentage                   ?     13%   ?     ?
    Shift-work supplement            Percentage                   ?     5%    4%    4
    Education allowance              {yes,no}                     yes   ?     ?     ?
    Statutory holidays               (Number of days)             11    15    12    12
    Vacation                         {below-avg,avg,gen}          avg   gen   gen   avg
    Long-term disability assistance  {yes,no}                     no    ?     ?     yes
    Dental plan contribution         {none,half,full}             none  ?     full  full
    Bereavement assistance           {yes,no}                     no    ?     ?     yes
    Health plan contribution         {none,half,full}             none  ?     full  half
    Acceptability of contract        {good,bad}                   bad   good  good  good

Decision trees for the labor data

[Figure: decision trees for the labor negotiations data]

Soybean classification

                 Attribute                Number of values  Sample value
    Environment  Time of occurrence       7                 July
                 Precipitation            3                 Above normal
    Seed         Condition                2                 Normal
                 Mold growth              2                 Absent
    Fruit        Condition of fruit pods  4                 Normal
                 Fruit spots              5                 ?
    Leaf         Condition                2                 Abnormal
                 Leaf spot size           3                 ?
    Stem         Condition                2                 Abnormal
                 Stem lodging             2                 Yes
    Root         Condition                3                 Normal
    Diagnosis                             19                Diaporthe stem canker

The role of domain knowledge

    If leaf condition is normal
    and stem condition is abnormal
    and stem cankers is below soil line
    and canker lesion color is brown
    then diagnosis is rhizoctonia root rot

    If leaf malformation is absent
    and stem condition is abnormal
    and stem cankers is below soil line
    and canker lesion color is brown
    then diagnosis is rhizoctonia root rot

But in this domain, "leaf condition is normal" implies "leaf malformation is absent"!

Fielded applications

The result of learning (or the learning method itself) is deployed in practical applications
    Processing loan applications
    Screening images for oil slicks
    Electricity supply forecasting
    Diagnosis of machine faults
    Marketing and sales
    Separating crude oil and natural gas
    Reducing banding in rotogravure printing
    Finding appropriate technicians for telephone faults
    Scientific applications: biology, astronomy, chemistry
    Automatic selection of TV programs
    Monitoring intensive care patients

Processing loan applications (American Express)

Given: questionnaire with financial and personal information
Question: should money be lent?
Simple statistical method covers 90% of cases
Borderline cases referred to loan officers
But: 50% of accepted borderline cases defaulted!
Solution: reject all borderline cases?
    No! Borderline cases are most active customers

Enter machine learning

1000 training examples of borderline cases
20 attributes:
    age
    years with current employer
    years at current address
    years with the bank
    other credit cards possessed, ...
Learned rules: correct on 70% of cases
    human experts only 50%
Rules could be used to explain decisions to customers

Screening images

Given: radar satellite images of coastal waters
Problem: detect oil slicks in those images
Oil slicks appear as dark regions with changing size and shape
Not easy: lookalike dark regions can be caused by weather conditions (e.g. high wind)
Expensive process requiring highly trained personnel

Enter machine learning

Extract dark regions from normalized image
Attributes:
    size of region
    shape, area
    intensity
    sharpness and jaggedness of boundaries
    proximity of other regions
    info about background
Constraints:
    Few training examples (oil slicks are rare!)
    Unbalanced data: most dark regions aren't slicks
    Regions from same image form a batch
    Requirement: adjustable false-alarm rate

Load forecasting

Electricity supply companies need forecast of future demand for power
Forecasts of min/max load for each hour yield significant savings
Given: manually constructed load model that assumes "normal" climatic conditions
Problem: adjust for weather conditions
Static model consists of:
    base load for the year
    load periodicity over the year
    effect of holidays

Enter machine learning

Prediction corrected using "most similar" days
Attributes:
    temperature
    humidity
    wind speed
    cloud cover readings
    plus difference between actual load and predicted load
Average difference among three "most similar" days added to static model
Linear regression coefficients form attribute weights in similarity function
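
The correction step can be sketched as a weighted nearest-neighbor lookup: find the three past days whose weather is most similar to today's, average their differences between actual and predicted load, and add that average to the static model's prediction. Everything concrete below is an assumption made for illustration: the weighted Euclidean distance, the weight values (which the slide says come from linear regression coefficients), the field names, and the sample history.

```python
import math

# Hypothetical attribute weights; on the slide these are derived from
# linear regression coefficients. Assumed non-negative here.
WEIGHTS = {"temperature": 1.0, "humidity": 0.5, "wind_speed": 0.3, "cloud_cover": 0.2}


def distance(a, b):
    """Weighted Euclidean distance between two days' weather readings."""
    return math.sqrt(sum(w * (a[k] - b[k]) ** 2 for k, w in WEIGHTS.items()))


def corrected_load(static_prediction, today, history, k=3):
    """Add the mean (actual - predicted) of the k most similar past days.

    Assumes history is non-empty.
    """
    nearest = sorted(history, key=lambda day: distance(today, day))[:k]
    correction = sum(day["actual"] - day["predicted"] for day in nearest) / len(nearest)
    return static_prediction + correction


# Made-up history: weather readings plus actual and statically predicted load.
HISTORY = [
    {"temperature": 20, "humidity": 50, "wind_speed": 5, "cloud_cover": 2, "actual": 105, "predicted": 100},
    {"temperature": 21, "humidity": 52, "wind_speed": 6, "cloud_cover": 3, "actual": 98, "predicted": 100},
    {"temperature": 19, "humidity": 49, "wind_speed": 4, "cloud_cover": 2, "actual": 103, "predicted": 100},
    {"temperature": 35, "humidity": 90, "wind_speed": 20, "cloud_cover": 8, "actual": 130, "predicted": 100},
]
TODAY = {"temperature": 20, "humidity": 50, "wind_speed": 5, "cloud_cover": 2}

print(corrected_load(100.0, TODAY, HISTORY))
```

In this made-up history the hot, windy day is the least similar to today, so its large error does not influence the correction.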

Diagnosis of machine faults

Diagnosis: classical domain of expert systems
Given: Fourier analysis of vibrations measured at various points of a device's mounting
Question: which fault is present?
Preventative maintenance of electromechanical motors and generators
Information very noisy
So far: diagnosis by expert/hand-crafted rules

Enter machine learning

Available: 600 faults with expert's diagnosis
~300 unsatisfactory, rest used for training
Attributes augmented by intermediate concepts that embodied causal domain knowledge
Expert not satisfied with initial rules because they did not relate to his domain knowledge
Further background knowledge resulted in more complex rules that were satisfactory
Learned rules outperformed hand-crafted ones

Marketing and sales I

Companies precisely record massive amounts of marketing and sales data
Applications:
    Customer loyalty: identifying customers that are likely to defect by detecting changes in their behavior (e.g. banks/phone companies)
    Special offers: identifying profitable customers (e.g. reliable owners of credit cards that need extra money during the holiday season)

Marketing and sales II

Market basket analysis
    Association techniques find groups of items that tend to occur together in a transaction (used to analyze checkout data)
Historical analysis of purchasing patterns
Identifying prospective customers
    Focusing promotional mailouts (targeted campaigns are cheaper than mass-marketed ones)
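
As a rough illustration of finding items that tend to occur together, the sketch below counts how often each pair of items appears in the same transaction and keeps the pairs seen at least twice. The transactions, the threshold of 2 and the restriction to pairs are invented for the example; real association techniques handle larger item groups and much larger data.

```python
from collections import Counter
from itertools import combinations

# Made-up checkout data: each transaction is the set of items bought together.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in TRANSACTIONS:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

MIN_COUNT = 2  # hypothetical threshold for "tend to occur together"
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= MIN_COUNT}
print(frequent_pairs)
```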

Machine learning and statistics

Historical difference (grossly oversimplified):
    Statistics: testing hypotheses
    Machine learning: finding the right hypothesis
But: huge overlap
    Decision trees (C4.5 and CART)
    Nearest-neighbor methods
Today: perspectives have converged
    Most ML algorithms employ statistical techniques

Statisticians

Sir Ronald Aylmer Fisher
    Born: 17 Feb 1890 London, England
    Died: 29 July 1962 Adelaide, Australia
    Numerous distinguished contributions to developing the theory and application of statistics for making quantitative a vast field of biology

Leo Breiman
    Developed decision trees
    1984 Classification and Regression Trees. Wadsworth.

Generalization as search

Inductive learning: find a concept description that fits the data
Example: rule sets as description language
    Enormous, but finite, search space
Simple solution:
    enumerate the concept space
    eliminate descriptions that do not fit examples
    surviving descriptions contain target concept

Enumerating the concept space

Search space for weather problem
    4 x 4 x 3 x 3 x 2 = 288 possible combinations
    With 14 rules: 2.7 x 10^34 possible rule sets
Other practical problems:
    More than one description may survive
    No description may survive
        Language is unable to describe target concept
        or data contains noise
Another view of generalization as search: hill-climbing in description space according to pre-specified matching criterion
    Most practical algorithms use heuristic search that cannot guarantee to find the optimum solution
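
The counts for the weather problem can be reproduced by brute force. In the sketch below each rule either tests an attribute for one of its values or leaves it out (`None`), which gives 4, 4, 3 and 3 choices for the four attributes and 2 choices for the conclusion. The attribute value lists, the tuple encoding and the four example rows are assumptions taken from the weather slides; the "eliminate" step keeps single rules that never contradict an example they cover.

```python
from itertools import product

# None means "this attribute is not tested by the rule".
OUTLOOK = ["sunny", "overcast", "rainy", None]
TEMPERATURE = ["hot", "mild", "cool", None]
HUMIDITY = ["high", "normal", None]
WINDY = [True, False, None]
PLAY = ["yes", "no"]

# Enumerate every possible single rule: (outlook, temperature, humidity, windy, play).
rules = list(product(OUTLOOK, TEMPERATURE, HUMIDITY, WINDY, PLAY))
n_rules = len(rules)         # 4 * 4 * 3 * 3 * 2
n_rule_sets = n_rules ** 14  # one free choice of rule for each of 14 positions

# The four example rows shown on the weather slide, in the same tuple layout.
EXAMPLES = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
]


def covers(rule, example):
    """A rule covers an example if every tested attribute matches."""
    return all(r is None or r == e for r, e in zip(rule[:4], example[:4]))


def fits(rule):
    """A rule fits the data if it is right about every example it covers."""
    return all(rule[4] == ex[4] for ex in EXAMPLES if covers(rule, ex))


survivors = [rule for rule in rules if fits(rule)]
print(n_rules, f"{n_rule_sets:.1e}", len(survivors))
```

Enumeration is feasible for 288 single rules but not for the space of rule sets, which is why practical algorithms search heuristically.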

Bias

Important decisions in learning systems:
    Concept description language
    Order in which the space is searched
    Way that overfitting to the particular training data is avoided
These form the "bias" of the search:
    Language bias
    Search bias
    Overfitting-avoidance bias

Language bias

Important question:
    is language universal or does it restrict what can be learned?
Universal language can express arbitrary subsets of examples
If language includes logical or ("disjunction"), it is universal
Example: rule sets
Domain knowledge can be used to exclude some concept descriptions a priori from the search

Search bias

Search heuristic
    "Greedy" search: performing the best single step
    "Beam search": keeping several alternatives
    ...
Direction of search
    General-to-specific
        E.g. specializing a rule by adding conditions
    Specific-to-general
        E.g. generalizing an individual instance into a rule

Overfitting-avoidance bias

Can be seen as a form of search bias
Modified evaluation criterion
    E.g. balancing simplicity and number of errors
Modified search strategy
    E.g. pruning (simplifying a description)
        Pre-pruning: stops at a simple description before search proceeds to an overly complex one
        Post-pruning: generates a complex description first and simplifies it afterwards

Data mining and ethics I

Ethical issues arise in practical applications
Anonymizing data is difficult
    85% of Americans can be identified from just zip code, birth date and sex
Data mining often used to discriminate
    E.g. loan applications: using some information (e.g. sex, religion, race) is unethical
Ethical situation depends on application
    E.g. same information ok in medical application
Attributes may contain problematic information
    E.g. area code may correlate with race

Data mining and ethics II

Important questions:
    Who is permitted access to the data?
    For what purpose was the data collected?
    What kind of conclusions can be legitimately drawn from it?
Caveats must be attached to results
Purely statistical arguments are never sufficient!
Are resources put to good use?