Data Mining Tutorial

This is a tutorial on Data Mining for beginners.

Uploaded by

Peter Okoh

Data Mining Tutorial

What is it?
Large datasets
Fast methods
Not significance testing
Topics
  Trees (recursive splitting)
  Nearest Neighbor
  Neural Networks
  Clustering
  Association Analysis

Trees
A divisive method (splits)
Start with root node: all in one group
Get splitting rules
Response often binary
Result is a tree
Example: Loan Defaults
Example: Framingham Heart Study

Recursive Splitting
[Figure: the (X1 = Debt To Income Ratio, X2 = Age) plane recursively split into rectangles, with points marked Default / No default. Pr{default} labels shown: 0.007, 0.006, 0.003, 0.012, 0.0001.]

Some Actual Data

Framingham Heart Study
First Stage Coronary Heart Disease
P{CHD} = Function of:
  Age - no drug yet!
  Cholesterol
  Systolic BP

Example of a tree
[Figure: tree diagram. Root node: all 1615 patients. Split # 1: Age. A further split on Systolic BP ends in a terminal node.]

How to make splits?

Which variable to use?
Where to split?
  Cholesterol > ____
  Systolic BP > _____
Goal: Pure leaves or terminal nodes
Ideal split: Everyone with BP>x has problems, nobody with BP<x has problems

Where to Split?
First review Chi-square tests
Contingency tables
[Two 2x2 tables of BP (Low, High) by Heart Disease (No, Yes), one labeled DEPENDENT and one labeled INDEPENDENT. Surviving values: 100 100, 75 75, 25 25.]

χ² Test Statistic
Expect 100(150/200) = 75 in upper left if independent (etc., e.g. 100(50/200) = 25)

Observed counts (expected counts in parentheses):

             Heart Disease
             No         Yes        Total
Low BP       95 (75)     5 (25)    100
High BP      55 (75)    45 (25)    100
Total        150         50        200

$\chi^2 = \sum_{\text{all cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$

2(400/75) + 2(400/25) = 42.67

Compare to Tables: Significant!

WHERE IS HIGH BP CUTOFF???

Measuring Worth of a Split

P-value is the probability of a Chi-square as great as that observed if independence is true. (Pr{χ² > 42.67} is 6.4E-11)
P-values all too small.
Logworth = -log10(p-value) = 10.19
Best Chi-square <=> max logworth.
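The arithmetic on the last two slides can be checked with a short sketch. It assumes the 2x2 blood pressure table shown earlier (95, 5, 55, 45) and uses the fact that, for 1 degree of freedom, the upper-tail chi-square probability equals erfc(sqrt(x/2)). The function names are illustrative, not from the slides.

```python
import math

# Observed counts from the slide: rows = Low BP, High BP; columns = No, Yes
observed = [[95, 5], [55, 45]]

def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            stat += (obs - expected) ** 2 / expected
    return stat

def p_value_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

chi2 = chi_square(observed)
p = p_value_1df(chi2)
logworth = -math.log10(p)
print(round(chi2, 2), p, round(logworth, 2))
```

The p-value helper is only valid for a 2x2 table (1 degree of freedom); larger tables would need a general chi-square tail function.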

Logworth for Age Splits

Age 47 maximizes logworth

How to make splits?

Which variable to use?
Where to split?
  Cholesterol > ____
  Systolic BP > _____
Idea: Pick BP cutoff to minimize p-value for χ²
What does significance mean now?

Multiple testing
50 different BPs in data, 49 ways to split
Sunday football highlights always look good!
If he shoots enough baskets, even a 95% free throw shooter will miss.
Jury trial analogy
Tried 49 splits, each has 5% chance of declaring significance even if there's no relationship.

Multiple testing
α = Pr{ falsely reject hypothesis 1 }
α = Pr{ falsely reject hypothesis 2 }
Pr{ falsely reject one or the other } ≤ 2α
Desired: 0.05 probability or less
Solution: use α = 0.05/2
Or compare 2(p-value) to 0.05

Multiple testing
50 different BPs in data, m=49 ways to split
Multiply p-value by 49
Bonferroni: original idea
Kass: apply to data mining (trees)
Stop splitting if minimum p-value is large.
For m splits, logworth becomes -log10(m*p-value)
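As a rough illustration of the adjustment, the sketch below applies -log10(m*p-value) to the p-value quoted earlier (6.4E-11) with m = 49. Capping the adjusted p-value at 1 is an assumption added here so it stays a probability; the slides do not mention it.

```python
import math

def adjusted_logworth(p_value, m):
    """Logworth after multiplying the p-value by the number of candidate splits m."""
    adjusted_p = min(1.0, m * p_value)  # assumption: cap at 1 so it remains a probability
    return -math.log10(adjusted_p)

raw = -math.log10(6.4e-11)             # unadjusted logworth
adj = adjusted_logworth(6.4e-11, 49)   # logworth after the m = 49 adjustment
print(round(raw, 2), round(adj, 2))
```

A split that looked only marginally significant (say p = 0.05) has adjusted logworth 0 after the same correction, which is the signal to stop splitting.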

Other Split Evaluations

Gini Diversity Index
{A A A A B A B B C B}
Pick 2, Pr{different} = 1 - Pr{AA} - Pr{BB} - Pr{CC}
1 - [10+6+0]/45 = 29/45 = 0.64

{A A B C B A A B C C}
1 - [6+3+3]/45 = 33/45 = 0.73 (MORE DIVERSE, LESS PURE)
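The pair-counting version of the Gini index used above can be sketched as follows. The function name is illustrative; the two sets are the ones on the slide.

```python
from collections import Counter
from math import comb

def gini_pairs(items):
    """Probability that two items drawn without replacement belong to different classes."""
    counts = Counter(items)
    same_class_pairs = sum(comb(c, 2) for c in counts.values())
    return 1 - same_class_pairs / comb(len(items), 2)

g1 = gini_pairs("AAAABABBCB")   # 5 A, 4 B, 1 C
g2 = gini_pairs("AABCBAABCC")   # 4 A, 3 B, 3 C
print(round(g1, 2), round(g2, 2))
```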

Shannon Entropy
Larger => more diverse (less pure)
$-\sum_i p_i \log_2(p_i)$
{0.5, 0.4, 0.1}: 1.36
{0.4, 0.2, 0.3}: 1.51 (more diverse) (as printed; these proportions sum to 0.9, not 1)
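A minimal sketch of the entropy calculation, checked against the first set of proportions on the slide. Skipping zero proportions is a convention assumed here (0 log 0 is taken as 0).

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; zero proportions contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h = shannon_entropy([0.5, 0.4, 0.1])
print(round(h, 2))
```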

Goals
Split if diversity in parent node > summed diversities in child nodes
Observations should be
  Homogeneous (not diverse) within leaves
  Different between leaves
Leaves should be diverse (different from one another)

Framingham tree used Gini for splits

Cross validation
Traditional stats: small dataset, need all observations to estimate parameters of interest.
Data mining: loads of data, can afford holdout sample
Variation: n-fold cross validation
  Randomly divide data into n sets
  Estimate on n-1, validate on 1
  Repeat n times, using each set as holdout.
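The n-fold procedure above might be sketched like this, working on observation indices only. The function name, the fixed seed, and the round-robin way of forming the n sets are choices made for the illustration.

```python
import random

def n_fold_splits(n_obs, n_folds, seed=0):
    """Yield (fit_indices, holdout_indices), using each fold once as the holdout."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)                      # randomly divide the data
    folds = [indices[k::n_folds] for k in range(n_folds)]     # n roughly equal sets
    for k in range(n_folds):
        holdout = folds[k]
        fit = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield fit, holdout

splits = list(n_fold_splits(10, 5))
for fit, holdout in splits:
    print(sorted(holdout), "held out;", len(fit), "used to estimate")
```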

Pruning
Grow bushy tree on the fit data
Classify holdout data
Likely farthest out branches do not improve, possibly hurt fit on holdout data
Prune non-helpful branches.
What is helpful? What is good discriminator criterion?

Goals
Want diversity in parent node > summed diversities in child nodes
Goal is to reduce diversity within leaves
Goal is to maximize differences between leaves
Use same evaluation criteria as for splits
Costs (profits) may enter the picture for splitting or evaluation.

Accounting for Costs

Pardon me (sir, ma'am), can you spare some change?
Say "sir" to male: +$2.00
Say "ma'am" to female: +$5.00
Say "sir" to female: -$1.00 (balm for slapped face)
Say "ma'am" to male: -$10.00 (nose splint)

Including Probabilities
Leaf has Pr(M)=.7, Pr(F)=.3.

                   You say:
True Gender        M            F
M                  0.7 (2)      0.7 (-10)
F                  0.3 (-1)     0.3 (5)

Expected profit is 2(0.7) - 1(0.3) = $1.10 if I say "sir"
Expected profit is -7 + 1.5 = -$5.50 (a loss) if I say "ma'am"
Weight leaf profits by leaf size (# obsns.) and sum
Prune (and split) to maximize profits.
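The expected-profit calculation for a single leaf can be sketched as below, using the payoffs and leaf probabilities from the slides. The dictionary layout and function name are illustrative.

```python
# Payoff for (what you say, true gender), taken from the slide
profit = {
    ("sir", "M"): 2.0,
    ("sir", "F"): -1.0,
    ("maam", "M"): -10.0,
    ("maam", "F"): 5.0,
}
leaf_probs = {"M": 0.7, "F": 0.3}

def expected_profit(say, probs):
    """Probability-weighted profit of one decision within a leaf."""
    return sum(probs[g] * profit[(say, g)] for g in probs)

best = max(("sir", "maam"), key=lambda s: expected_profit(s, leaf_probs))
print(best, expected_profit("sir", leaf_probs), expected_profit("maam", leaf_probs))
```

Weighting each leaf's best expected profit by its number of observations and summing gives the tree-level figure that pruning would try to maximize.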

Additional Ideas
Forests: Draw samples with replacement (bootstrap) and grow multiple trees.
Random Forests: Randomly sample the features (predictors) and build multiple trees.
Classify new point in each tree, then average the probabilities or take a plurality vote from the trees.
Bagging: Bootstrap aggregation
Boosting: Similar, iteratively reweights points that were misclassified to produce sequence of more accurate trees.
Lift Chart
  - Go from leaf of most to least response.
  - Lift is cumulative proportion responding.

Regression Trees
Continuous response (not just class)
Predicted response constant in regions
[Figure: the (X1, X2) plane split into five rectangular cells with predictions 80, 50, 130, 20 and 100.]
Predict P_i in cell i.
Y_ij = jth response in cell i.
Split to minimize $\sum_i \sum_j (Y_{ij} - P_i)^2$
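A one-variable sketch of the split criterion: every midpoint between adjacent distinct X values is tried, each side is predicted by its mean (the constant that minimizes squared error), and the cut with the smallest total is kept. The data in the usage line are made up for the illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean, i.e. error when predicting the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (cut point, total squared error) for the best single split on x."""
    pairs = sorted(zip(x, y))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no cut point between equal x values
        cut = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [yy for _, yy in pairs[:k]]
        right = [yy for _, yy in pairs[k:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (cut, total)
    return best

# Hypothetical data: responses jump from 5 to 20 between x = 3 and x = 10
print(best_split([1, 2, 3, 10, 11, 12], [5, 5, 5, 20, 20, 20]))
```

Applying the same search recursively inside each resulting cell produces the piecewise-constant regions described above.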

Logistic Regression
Trees seem to be main tool.
Logistic: another classifier
Older tried & true method
Predict probability of response from input variables (Features)
Linear regression gives infinite range of predictions
0 < probability < 1, so not linear regression.

Logistic idea: Map p in (0,1) to L in whole real line
Use $L = \ln(p/(1-p))$
Model L as linear in temperature
Predicted L = a + b(temperature)
Given temperature X, compute a+bX, then $p = e^{L}/(1+e^{L})$
$p(i) = e^{a+bX_i}/(1+e^{a+bX_i})$
Write p(i) if response, 1-p(i) if not
Multiply all n of these together, find a,b to maximize

Example: Ignition
Flame exposure time = X
Ignited Y=1, did not ignite Y=0
  Y=0, X = 3, 5, 9, 10, 13, 16
  Y=1, X = 7, 11, 12, 14, 15, 17, 25, 30

Q = (1-p)(1-p)p(1-p)(1-p)pp(1-p)pp(1-p)ppp
p's all different: p = f(exposure)
Find a,b to maximize Q(a,b)

Generate Q for array of (a,b) values

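DATA LIKELIHOOD;
  ARRAY Y(14) Y1-Y14;
  ARRAY X(14) X1-X14;
  /* Read the 14 (exposure time, ignited) pairs from the data line */
  DO I=1 TO 14; INPUT X(I) Y(I) @@; END;
  /* Evaluate the likelihood Q on a grid of (A,B) values */
  DO A = -3 TO -2 BY .025;
    DO B = 0.2 TO 0.3 BY .0025;
      Q=1;
      DO I=1 TO 14;
        L=A+B*X(I);
        P=EXP(L)/(1+EXP(L));
        IF Y(I)=1 THEN Q=Q*P; ELSE Q=Q*(1-P);
      END;
      /* Floor Q at 0.0006, presumably so a plot shows only the region near the peak */
      IF Q<0.0006 THEN Q=0.0006;
      OUTPUT;
    END;
  END;
CARDS;
3 0 5 0 7 1 9 0 10 0 11 1 12 1 13 0 14 1 15 1 16 0 17 1 25 1 30 1
;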
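For readers without SAS, the same grid search can be sketched in Python. It uses the 14 observations from the CARDS lines and the same grid of (a, b) values; unlike the SAS step, it does not floor Q, and the variable names are illustrative.

```python
import math

# The 14 (exposure time, ignited) pairs from the CARDS lines
x = [3, 5, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 25, 30]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1]

def likelihood(a, b):
    """Q(a,b): product of p(i) for ignitions and 1-p(i) for non-ignitions."""
    q = 1.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(a + b * xi)))  # same as exp(L)/(1+exp(L))
        q *= p if yi == 1 else (1.0 - p)
    return q

# Same grid as the SAS step: a from -3 to -2 by .025, b from 0.2 to 0.3 by .0025
grid = [(-3 + 0.025 * i, 0.2 + 0.0025 * j) for i in range(41) for j in range(41)]
a_best, b_best = max(grid, key=lambda ab: likelihood(*ab))
print(a_best, b_best, likelihood(a_best, b_best))
```

A grid search only locates the maximum to within the grid spacing; a procedure such as PROC LOGISTIC refines the estimates numerically.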

Likelihood function (Q)

[Figure: surface plot of Q over (a, b), with the peak near a = -2.6, b = 0.23.]

IGNITION DATA
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -2.5879           1.8469            1.9633       0.1612
TIME         1     0.2346           0.1502            2.4388       0.1184

Association of Predicted Probabilities and Observed Responses

Percent Concordant   79.2     Somers' D   0.583
Percent Discordant   20.8     Gamma       0.583
Percent Tied          0.0     Tau-a       0.308
Pairs                  48     c           0.792

4 right, 1 wrong
5 right, 4 wrong

Example: Framingham
X=age Y=1 if heart trouble, 0 otherwise

Framingham
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -5.4639           0.5563           96.4711       <.0001
age          1     0.0630           0.0110           32.6152       <.0001

Example: Shuttle Missions

O-rings failed in Challenger disaster
Low temperature
Prior flights: erosion and blowby in O-rings
Feature: Temperature at liftoff
Target: problem (1) - erosion or blowby vs. no problem (0)

Neural Networks
Very flexible functions
Hidden Layers
Multilayer Perceptron
[Figure: network diagram with labeled inputs and output.]
Logistic function of logistic functions of data

Arrows represent linear combinations of basis functions, e.g. logistics

Example:
Y = a + b1 p1 + b2 p2 + b3 p3
Y = 4 + p1 + 2 p2 - 4 p3

Should always use holdout sample
Perturb coefficients to optimize fit (fit data)
  Nonlinear search algorithms
Eliminate unnecessary arrows using holdout data.
Other basis sets
  Radial Basis Functions
  Just normal densities (bell shaped) with adjustable means and variances.

Terms
Train: estimate coefficients
Bias: intercept a in Neural Nets
Weights: coefficients b
Radial Basis Function: Normal density
Score: Predict (usually Y from new Xs)
Activation Function: transformation to target
Supervised Learning: Training data has response.

Hidden Layer
L1 = -1.87 - .27*Age [?] 0.20*SBP22
H11 = exp(L1)/(1+exp(L1))

L2 = -20.76 - 21.38*H11
Pr{first_chd} = exp(L2)/(1+exp(L2))
Activation Function
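The "linear combination of logistic functions" structure can be sketched as a tiny forward pass. The output coefficients (4, 1, 2, -4) come from the earlier example Y = 4 + p1 + 2 p2 - 4 p3; the hidden-unit biases and weights below are hypothetical, chosen only to make the sketch runnable.

```python
import math

def logistic(z):
    """Logistic basis function, mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_units, output_bias, output_weights):
    """One hidden layer of logistic units followed by a linear combination."""
    hidden = []
    for bias, weights in hidden_units:
        z = bias + sum(w * v for w, v in zip(weights, inputs))  # linear combination of inputs
        hidden.append(logistic(z))                              # p1, p2, p3
    return output_bias + sum(b * p for b, p in zip(output_weights, hidden))

# Hypothetical (bias, weights) for three hidden units on two inputs
hidden_units = [(0.0, [1.0, -1.0]), (0.5, [0.2, 0.0]), (-1.0, [0.0, 0.3])]
y_hat = forward([1.0, 2.0], hidden_units, 4.0, [1.0, 2.0, -4.0])
print(y_hat)
```

Training would adjust the biases and weights to improve fit; here they are fixed so that only the scoring step is shown.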

Demo (optional)
Compare several methods using SAS Enterprise Miner
  Decision Tree
  Nearest Neighbor
  Neural Network

Unsupervised Learning
We have the features (predictors)
We do NOT have the response even on a training data set (UNsupervised)
Clustering
  Agglomerative
    Start with each point separated
  Divisive
    Start with all points in one cluster then split

EM PROC FASTCLUS
Step 1: find seeds as separated as possible
Step 2: cluster points to nearest seed
  Drift: As points are added, change seed (centroid) to average of each coordinate
  Alternatively: Make full pass, then recompute seed and iterate.
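The two steps above, in the "full pass then recompute" variant, might be sketched as follows. This is a generic k-means-style illustration, not PROC FASTCLUS itself; in particular, picking each new seed as the point farthest from the seeds chosen so far is an assumed way of getting well-separated seeds, and the fixed number of passes is arbitrary.

```python
import math

def farthest_point_seeds(points, k):
    """Step 1 (assumed heuristic): start at the first point, then keep adding the
    point whose nearest existing seed is farthest away."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def cluster(points, k, passes=10):
    """Step 2: assign every point to its nearest seed, recompute seeds, repeat."""
    seeds = farthest_point_seeds(points, k)
    labels = []
    for _ in range(passes):
        labels = [min(range(k), key=lambda c: math.dist(p, seeds[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # seed becomes the average of each coordinate
                seeds[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, seeds

# Hypothetical data: two well-separated groups of three points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, seeds = cluster(points, 2)
print(labels, seeds)
```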

Clusters as Created

As Clustered

Cubic Clustering Criterion (to decide # of Clusters)

Divide random scatter of (X,Y) points into 4 quadrants
Pooled within cluster variation much less than overall variation
Large variance reduction
Big R-square despite no real clusters
CCC compares random scatter R-square to what you got to decide #clusters
3 clusters for macaroni data.

Association Analysis
Market basket analysis
What they're doing when they scan your VIP card at the grocery
People who buy diapers tend to also buy _________ (beer?)
Just a matter of accounting but with new terminology (of course)
Examples from SAS Appl. DM Techniques, by Sue Walsh:

Terminology
Baskets: ABC ACD BCD ADE BCE

Rule        Support        Confidence
X=>Y        Pr{X and Y}    Pr{Y|X}
A=>D        2/5            2/3
C=>A        2/5            2/4
B&C=>D      1/5            1/3
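The support and confidence figures in the table can be reproduced with a short sketch over the five baskets. Representing each basket and each side of a rule as a string of item letters is a convenience assumed here.

```python
from fractions import Fraction

baskets = ["ABC", "ACD", "BCD", "ADE", "BCE"]

def support(items):
    """Fraction of baskets containing every item in `items`: Pr{X and Y}."""
    hits = sum(1 for basket in baskets if set(items) <= set(basket))
    return Fraction(hits, len(baskets))

def confidence(lhs, rhs):
    """Pr{rhs | lhs}: support of the whole rule divided by support of its left side."""
    return support(lhs + rhs) / support(lhs)

for lhs, rhs in [("A", "D"), ("C", "A"), ("BC", "D")]:
    print(lhs, "=>", rhs, support(lhs + rhs), confidence(lhs, rhs))
```

Fractions are reduced automatically, so the confidence of C=>A prints as 1/2 rather than 2/4.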

Don't be Fooled!
Lift = Confidence / Expected Confidence if Independent

                   Saving
Checking           No       Yes      Total
No                  500     1000      1500
Yes                3500     5000      8500
Total              4000     6000     10000

SVG=>CHKG
Expect 8500/10000 = 85% if independent
Observed Confidence is 5000/6000 = 83%
Lift = 83/85 < 1.
Savings account holders actually LESS likely than others to have checking account !!!
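The lift for the rule SVG=>CHKG follows directly from the counts in the table. The variable names are illustrative.

```python
# Counts from the checking/saving table
n_total = 10000
n_checking = 8500   # customers with a checking account
n_savings = 6000    # customers with a savings account
n_both = 5000       # customers with both

confidence = n_both / n_savings             # Pr{checking | savings}
expected_confidence = n_checking / n_total  # Pr{checking} if the two were independent
lift = confidence / expected_confidence
print(round(confidence, 3), expected_confidence, round(lift, 3))
```

A lift below 1 means the rule does worse than ignoring savings status altogether, despite its high confidence.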

Summary
Data mining: a set of fast stat methods for large data sets
Some new ideas, many old or extensions of old
Some methods:
  Decision Trees
  Nearest Neighbor
  Neural Nets
  Clustering
  Association
