Data Mining
Bob Stine
Dept of Statistics, Wharton School
University of Pennsylvania
Wharton
Department of Statistics
Testable claims
Science requires making claims that are testable
Claims of prediction provide such a test
Empiricism run wild, lack of theory or hypotheses
Post hoc inference
Need to leverage technology
Tukey
Cost of theory relative to cost of computing
Diagnostic
Have I missed something?
Deep connections
Multidimensional scaling, likelihood, modern regression
Plan
Data mining with regression, logistic regression
Illustrate key ideas in familiar context
Alternative methods
Trees, networks, ensemble methods
Boosting and bagging
Software
Cannot learn stat without doing statistics!
Modern computing provides:
New ways to look at old things, like regression
New approaches to data analysis
Packages:
JMP from SAS
Will be front-end to SAS Enterprise Miner
Available on Newberry systems
My Background
Effects of modeling on forecast accuracy
Bootstrap resampling
Predictive models in credit, health
Alternative regression methods
Combining traditional data and text
Political science and voting behavior
t-shirts
Research Questions
Question to guide analysis
Ideal data?
Who's most at risk of a disease?
What's going to happen to financial markets?
Are any of these people dishonest?
Will this person vote if I get them to register?
Whom will this registered voter choose?
Whom would those who didn't vote choose?
Background of survey
Two waves, every two years
Questions
Categorical responses
Did you vote? For whom?
Numerical responses
How much do you like this candidate?
What makes these interesting
Get out the vote, phone banks
Role of participation in election
Would those who didn't vote change things?
90/10 rule
Data Browsing
Load directly from SPSS .sav file
25383-0001-Data.sav
Almost square: 2,323 cases x 1,963 variables
Virtually all categorical: role of scaling
Feeling thermometers, moderators (N5)
No algorithm as good (yet) as the modeler who knows how to build predictive features
Browsing ANES
Interactive graphics: plot linking and brushing
Participation, political interest (A1-A10)
Prevalence of missing data. Problem for categorical data?
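One way to gauge the prevalence of missing data is to tabulate the fraction of missing responses per variable. The sketch below is only an illustration under stated assumptions: the records and response labels are invented (only the variable codes echo the A1-A10 range above), and `missing_fraction` and `recode_missing` are hypothetical helper names, not part of JMP or the ANES files.

```python
# Hypothetical toy records; None marks a missing response.
# The values are invented for illustration, not taken from ANES.
records = [
    {"A1": "very interested", "A2": "yes", "A3": None},
    {"A1": None, "A2": "no", "A3": None},
    {"A1": "somewhat interested", "A2": "yes", "A3": "often"},
    {"A1": "not interested", "A2": None, "A3": None},
]


def missing_fraction(rows):
    """Return {variable: fraction of rows whose value is None}."""
    variables = sorted({name for row in rows for name in row})
    n = len(rows)
    return {
        name: sum(row.get(name) is None for row in rows) / n
        for name in variables
    }


def recode_missing(rows, label="(missing)"):
    """Treat 'missing' as one more category (one option for categorical data)."""
    return [
        {name: (label if value is None else value) for name, value in row.items()}
        for row in rows
    ]


fractions = missing_fraction(records)
recoded = recode_missing(records)
```

For a categorical variable, recoding missing values as an extra level keeps every case in the analysis; whether that is appropriate depends on why the response is missing.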
Browsing ANES
Bivariate relationships: scatterplot
Special scatterplot if mix types
Consistency of responses (B1/D1)
Choice and opinion of war in Iraq (Q14/A14f)
Choice and rating of candidate (Q14/D2xx)
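When both variables are categorical, a cross-tabulation carries the same information as the special scatterplot. The following is a minimal sketch, assuming made-up paired responses (a vote choice and an opinion); the labels and the helper names `crosstab` and `row_proportions` are hypothetical.

```python
from collections import Counter

# Hypothetical (choice, opinion) pairs; invented for illustration only.
pairs = [
    ("candidate A", "approve"),
    ("candidate A", "approve"),
    ("candidate A", "disapprove"),
    ("candidate B", "disapprove"),
    ("candidate B", "disapprove"),
    ("candidate B", "approve"),
    ("candidate B", "disapprove"),
    ("candidate A", "approve"),
]


def crosstab(observations):
    """Count each (row level, column level) combination."""
    counts = Counter(observations)
    rows = sorted({r for r, _ in observations})
    cols = sorted({c for _, c in observations})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


def row_proportions(table):
    """Convert counts to proportions within each row."""
    return [[cell / sum(row) for cell in row] for row in table]


row_levels, col_levels, table = crosstab(pairs)
proportions = row_proportions(table)
```

Comparing the row proportions shows whether opinion differs by choice; rows that look alike suggest little association.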
Models
What is a model? Why are they needed?
$E(Y \mid X) = \beta_0 + \beta_1 X$
Assumptions: independence, equal variance, normal
Statistical significance
Confidence interval, test, p-value
Interpretation?
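The simple regression model above can be fit by least squares in a few lines. This is a sketch under stated assumptions: the pre- and post-election ratings are invented numbers, and `simple_regression` is a hypothetical helper, not a JMP or SAS routine.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x with the slope's standard error."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # estimated slope
    b0 = ybar - b1 * xbar               # estimated intercept
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - 2)   # residual variance
    se_b1 = math.sqrt(s2 / sxx)         # standard error of the slope
    return {"b0": b0, "b1": b1, "se_b1": se_b1, "t": b1 / se_b1}


# Hypothetical pre-election (x) and post-election (y) ratings.
pre = [10, 30, 50, 70, 90]
post = [20, 35, 45, 70, 80]
fit = simple_regression(pre, post)
```

The t-statistic compares the slope with its standard error. A confidence interval or p-value would also need a t quantile with n - 2 degrees of freedom, which this sketch does not compute.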
Model Diagnostics
Multiple Regression
What other factors affect association between pre-election rating and post rating?
Trial and error by adding to multiple regression?
Use of t-statistics and p-values to decide
Which do we keep, which do we exclude?
How many variables did you try?
What made you try those?
What about other correlated variables?
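The question of how many variables were tried matters because a search over many candidates will turn up some that look significant by chance alone. The simulation below is a rough sketch, assuming pure-noise data: the response and all predictors are independent random numbers, each predictor is tested one at a time in a simple regression (not a full multiple regression), and the sizes, seed, and function names are arbitrary choices for illustration.

```python
import math
import random


def slope_t_statistic(x, y):
    """t-statistic for the slope in a simple regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b1 / math.sqrt(sse / (n - 2) / sxx)


def count_spurious(n_cases=100, n_predictors=200, cutoff=2.0, seed=1):
    """Count noise predictors whose |t| exceeds the cutoff."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_cases)]
    hits = 0
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_cases)]
        if abs(slope_t_statistic(x, y)) > cutoff:
            hits += 1
    return hits


hits = count_spurious()
# Bonferroni: divide the usual 0.05 level by the number of predictors tried.
bonferroni_alpha = 0.05 / 200
```

With a cutoff of |t| > 2, roughly 5% of unrelated predictors pass, so trying 200 of them typically yields several "significant" variables even though none is related to the response. A Bonferroni-style threshold is one simple guard against this.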
Possible Model
Looking at Fit
Surface profile
Take-Aways
Skim syllabus, bibliography
Think about what you did for models
Peek at the codebook for ANES
Z drive on Newberry computers
Come with questions
Assignment
Next Time
Picking the variables for a regression
What about those missing values?