
Data Mining: Bob Stine Dept of Statistics, Wharton School University of Pennsylvania


Data Mining Introduction

Bob Stine Dept of Statistics, Wharton School University of Pennsylvania

Wharton
Department of Statistics

An insult? Predictive modeling

What is data mining?


Large, wide data sets, often unstructured
Automatic, complex models
Networks, trees, ensembles: black boxes
Leverage modern computing, recent results in learning theory

Emphasize prediction rather than explanation


Association and prediction rather than cause and effect

Testable claims

Science requires making claims that are testable
Claims of prediction provide such a test

Data mining poorly matched to soc science? Response


Honest
A better match to what most of us do in common practice

Data Mining in Soc Sci

Empiricism run wild, lack of theory or hypotheses
Post hoc inference
Need to leverage technology
Tukey: cost of theory relative to cost of computing

Diagnostic
Have I missed something?

Deep connections
Multidimensional scaling, likelihood, modern regression


Week 1 Week 2 Syllabus



Plan

Data mining with regression, logistic regression
Illustrate key ideas in familiar context
Alternative methods
Trees, networks, ensemble methods
Boosting and bagging

Hands-on: Lab sessions on Thursdays Annotated bibliography July 4



Cannot learn stat without doing statistics! Modern computing provides Packages
JMP from SAS
Will be front-end to SAS Enterprise Miner Available on Newberry systems

Software

New ways to look at old things, like regression New approaches to data analysis

R Others: Stata, SPSS, Weka,



Time series analysis

My Background

Effects of modeling on forecast accuracy
Bootstrap resampling
Predictive models in credit, health
Alternative regression methods
Combining traditional data and text
Political science and voting behavior

Model selection in general Recent

Long time friend of Summer Program



t-shirts

What question do you want to answer?


And do you have the right data to do it?

Research Questions
Question to guide analysis Ideal data?

Questions from science, business Social science questions



Who's most at risk of a disease?
What's going to happen to financial markets?
Are any of these people dishonest?
Will this person vote if I get them to register?
Whom will this registered voter choose?
Whom would those who didn't vote choose?

Background of survey
Two waves, every two years
Questions
Categorical responses

2008 ANES Survey

What makes these interesting
Get out the vote, phone banks

Did you vote? For whom? Numerical responses How much do you like this candidate


Is the ANES ideal data?
Missing data, self-reported, interviewer effects...

Role of participation in election
Would those who didn't vote change things?

90/10 rule

Spirit of EDA, exploratory data analysis


Know your data Know your tools

Data Browsing



ANES data table



Variable creation

Probably several tools, using each for what it does best

Load directly from SPSS sav file 25383-0001-Data.sav
Almost square: 2,323 cases x 1,963 variables
Virtually all categorical: role of scaling
Feeling thermometers, moderators (N5)

No algorithm as good (yet) as the modeler who knows how to build predictive features

Marginal distributions Interesting variables

Browsing ANES

Interactive graphics: Plot linking and brushing Participation, political interest (A1-A10)
prevalence of missing data. Problem for categorical data?

Feeling thermometer (B1 group)


numbers or categories? Missing a problem?

JMP treatment of numerical/ categorical

Spending bundle and scaling (P1 group)


Likert scales, ordinal-interval-ratio measurement

Intention to vote (A6, Q1 in first wave)


Repeats prior question, reliability of data

Choice in election (C6,Q14 in second wave)



Importance of sampling weights (65.5% in sample, 52.9% in election)
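
The gap between a raw sample share and a population share is what sampling weights are meant to close. The sketch below shows only the mechanics of a weighted proportion. The two groups, their sizes, and their weights are invented toy numbers, not ANES values, and the helper name `weighted_share` is hypothetical.

```python
def weighted_share(votes, weights):
    """Weighted proportion of 1s: sum(w * v) / sum(w)."""
    return sum(w * v for v, w in zip(votes, weights)) / sum(weights)

# Toy sample (assumed, not ANES data): group A is over-represented and
# gets a small weight; group B is under-represented and gets a large one.
group_a_votes = [1] * 50 + [0] * 10      # 60 respondents, 50 choose the candidate
group_b_votes = [1] * 15 + [0] * 25      # 40 respondents, 15 choose the candidate
votes = group_a_votes + group_b_votes
weights = [0.5] * 60 + [1.75] * 40       # hypothetical sampling weights

unweighted = sum(votes) / len(votes)
weighted = weighted_share(votes, weights)
print(f"unweighted share: {unweighted:.4f}")
print(f"weighted share:   {weighted:.4f}")
```

In this toy case the unweighted share overstates support because the over-sampled group favors the candidate; the weighted share pulls the estimate back toward the assumed population mix.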


Bivariate relationships

Browsing ANES
Special scatterplot if mix types

Contingency tables, scatterplots Asymmetry of roles: explanatory vs response

Consistency of responses: scatterplot
Choice and opinion of war in Iraq
Choice and rating of candidate

FT rating of Dem candidate pre/post election

B1/D1

War and voter choice: table, mosaic plot

Q14/A14f

Feelings and voter choice: logistic regression



Q14/D2xx


Models in statistics Example: SRM Inference



Models

"All models are wrong, some are useful" (Box)

What is a model? Why are they needed?
$E(Y \mid X) = \beta_0 + \beta_1 X$
plus independence, equal variance, normality
Statistical significance
Confidence interval, test, p-value

Interpretation?
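
As a rough illustration of the simple regression model and its inference, the sketch below fits a line by least squares and computes the slope's standard error, t-statistic, and an approximate 95% confidence interval. The data are simulated under the model's own assumptions (independent, equal-variance, normal errors); the true intercept 10 and slope 0.8 are arbitrary choices, and `fit_simple_regression` is a hypothetical helper, not a JMP function.

```python
import math
import random

def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x.
    Returns the intercept, the slope, and the slope's standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - 2)   # residual variance, n - 2 df
    se_b1 = math.sqrt(s2 / sxx)
    return b0, b1, se_b1

# Simulated data (assumed values): intercept 10, slope 0.8, normal errors with SD 15
rng = random.Random(1)
x = [rng.uniform(0, 100) for _ in range(500)]
y = [10 + 0.8 * xi + rng.gauss(0, 15) for xi in x]

b0, b1, se_b1 = fit_simple_regression(x, y)
t_stat = b1 / se_b1
# Normal approximation to the t interval, reasonable for n = 500
ci_95 = (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}, se(b1) = {se_b1:.4f}, t = {t_stat:.1f}")
print(f"approximate 95% CI for slope: ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

The interval and t-statistic are only as trustworthy as the assumptions behind them, which is why the diagnostics matter.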

Residual diagnostics Calibration

Model Diagnostics

Is the model correct on average
Check by smoothing Y on X or Y on the fitted values

Interactive tool for spline in JMP
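
One crude way to check calibration without an interactive spline is to bin the cases by fitted value and compare the average response to the average fit within each bin. The sketch below does this for a straight line fitted to data whose true mean is curved. The simulated data and the helper names `fit_line` and `calibration_table` are assumptions made for illustration, and binning stands in for the smoother.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def calibration_table(fitted, y, n_bins=5):
    """Sort cases by fitted value, cut into equal-size bins, and return
    (mean fitted, mean observed) per bin. Calibrated models have the two close."""
    pairs = sorted(zip(fitted, y))
    size = len(pairs) // n_bins
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        mean_fit = sum(f for f, _ in chunk) / len(chunk)
        mean_obs = sum(v for _, v in chunk) / len(chunk)
        rows.append((mean_fit, mean_obs))
    return rows

# Simulated data (assumed): the true mean is quadratic, but we fit a line
rng = random.Random(7)
x = [rng.uniform(0, 10) for _ in range(1000)]
y = [xi ** 2 + rng.gauss(0, 2) for xi in x]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
table = calibration_table(fitted, y)
for mean_fit, mean_obs in table:
    print(f"mean fitted {mean_fit:7.2f}   mean observed {mean_obs:7.2f}")
```

The line is right on average over the whole sample, since least-squares residuals sum to zero, yet it under-predicts in the lowest and highest bins and over-predicts in the middle. That pattern is what a smooth of Y on the fitted values would show as a bend away from the diagonal.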


Does one explanatory variable give a complete description of the response?
Media
Emotional interest in outcome
Attitude to Iraq war, economy,

Multiple Regression

What other factors affect association between pre-election rating and post rating?

How do these factors contribute to model


Additive as another explanatory variable Affecting other factors (interaction)

How should we decide which?



Trial and error by adding to multiple regression? Use of t-statistics and p-values to decide

Multiple Regression Model


Underlying model has assumptions
Key assumption is the larger equation
$E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$
Xs are known and additive
Same assumptions for the unexplained variation

Grow to a multiple regression model

Judging the other variables

Which do we keep, which do we exclude? How many variables did you try? What made you try those? What about other correlated variables?

Use of t-statistics, F-statistics in this setting
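
The question "How many variables did you try?" matters because t-statistics are calibrated for a single, pre-specified test. The simulation below is a minimal sketch of that point: the response is pure noise, 100 candidate predictors are pure noise, and each is tried in its own simple regression. The sample size, the number of candidates, and the helper `slope_t_stat` are arbitrary assumptions for illustration.

```python
import math
import random

def slope_t_stat(x, y):
    """t-statistic for the slope in a least-squares fit of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b1 / se_b1

rng = random.Random(2024)
n_cases, n_candidates = 200, 100

# The response is unrelated to every candidate predictor by construction
y = [rng.gauss(0, 1) for _ in range(n_cases)]
t_stats = []
for _ in range(n_candidates):
    x = [rng.gauss(0, 1) for _ in range(n_cases)]
    t_stats.append(slope_t_stat(x, y))

n_significant = sum(abs(t) > 1.96 for t in t_stats)
largest = max(abs(t) for t in t_stats)
print(f"{n_significant} of {n_candidates} noise predictors have |t| > 1.96")
print(f"largest |t| among them: {largest:.2f}")
```

On average about 5 of 100 such predictors clear the usual threshold by chance alone, so picking the best of many candidates and then quoting its t-statistic overstates the evidence.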



Grow simple regression into a multiple regression model that includes interactions
Add Happy/Care, care who wins
Interaction
What does all of this tell you?

Possible Model


Profile of Model
Alternative way to look at a model


Visual presentation of effects vs tabular
What does the interaction do?
Bars indicate confidence

Error bars indicate confidence


Surface profile

Looking at Fit

How would it look were there no interaction?
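
A small numerical sketch can stand in for the surface when thinking about this question. With an interaction term, the slope in one variable depends on the level of the other; without it, the profiles are parallel. The coefficients below are made-up values chosen for illustration, not estimates from the ANES fit, and `x2` is treated as a hypothetical 0/1 indicator such as "cares who wins".

```python
def predict(x1, x2, b0, b1, b2, b3):
    """Fitted value from a model with an interaction: b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def slope_in_x1(x2, b1, b3):
    """Change in the fitted value per unit of x1, holding x2 fixed."""
    return b1 + b3 * x2

# Hypothetical coefficients (assumed for illustration only)
b0, b1, b2, b3 = 20.0, 0.6, 5.0, 0.2

with_interaction = [slope_in_x1(x2, b1, b3) for x2 in (0, 1)]
no_interaction = [slope_in_x1(x2, b1, 0.0) for x2 in (0, 1)]
print("slopes in x1 with interaction:   ", with_interaction)
print("slopes in x1 without interaction:", no_interaction)

# The gap between the two groups grows with x1 only when b3 is nonzero
for x1 in (0, 50, 100):
    gap = predict(x1, 1, b0, b1, b2, b3) - predict(x1, 0, b0, b1, b2, b3)
    print(f"x1 = {x1:3d}: difference between groups = {gap:.1f}")
```

Setting `b3` to zero makes the two slopes equal, so the fitted surface becomes a flat plane and the difference between groups is the same at every value of `x1`.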


Role of data mining in social science research


Diagnostic Better way to do what we are doing Right on average

Take-Aways

Calibration of a model Classical statistics, inference Interactive graphical visualization



Exploring data (plot linking, brushing)
Exploring models (profiling, surfaces)



Skim syllabus, bibliography Think about what you did for models Peek at the codebook for ANES
Z drive on Newberry computers Come with questions

Assignment

Think about your own modeling and data


Picking the variables for a regression What about those missing values?

Next Time
