
SHERVIN SHAHROKHI TEHRANI

Predictive Analytics With SAS


Lecture One: Introduction

Prof. Ryan Webb

1
About Me…
• Shervin Shahrokhi Tehrani (call me Shervin)
❑ Ph.D. in Mathematics
❑ Ph.D. in Marketing
Research interests: Quantitative Marketing (theory, empirical, and
experimental): advertising, retailing, heuristic decision processes
Email: [email protected]
• Phone: 214-714-8448
• Office: JSOM 13.219
• Office hours: By appointment ☺

2
This Course…

4
Let’s Start our Journey…

5
My Personal Data Challenge

https://www.instagram.com/foodiehappie/

https://www.youtube.com/channel/UCGL3FaWbVVve3GLUmSDer3A

6
Foodie Happie Product: Food & Happiness

7
Learn About Persian & International Foods

8
Wellness & Happiness & Learning Foods Across the Countries

9
How can I increase my followers if I can see my followers' data?

10
Why Predictive Analytics?

“What’s ubiquitous and cheap? Data.


And what is scarce? The analytic ability to utilize that
data.”
• Hal Varian (Google’s Chief Economist)

11
What is Predictive Analytics?
Advertising & Sales

A firm's advertising data across 200 markets:


• X-axis shows advertising budget in thousands of dollars
• Y-axis shows sales in thousands of units

Can we predict Sales using data on TV, Radio, and Newspaper spending? How?

12
What is Predictive Analytics?
Advertising & Sales

Can we predict Sales using data on TV, Radio, and Newspaper spending? How?

We need to use a model: Sales ≈ f(TVAd, RadioAd, NewspaperAd)

13
What is Predictive Analytics?
Advertising & Sales

Can we predict Sales using data on TV, Radio, and Newspaper spending? How?

We need to use a model: Sales ≈ f(TVAd, RadioAd, NewspaperAd)

Why approximately?
14
What is Predictive Analytics?
Advertising & Sales

Can we predict Sales using data on TV, Radio, and Newspaper spending? How?

We need to use a model: Sales ≈ f(TVAd, RadioAd, NewspaperAd)


What about price, salesforce abilities, errors
in measurement, …?
15
What is Predictive Analytics?
Advertising & Sales

Can we predict Sales using data on TV, Radio, and Newspaper spending? How?

We need to use a model: Sales = f(TVAd, RadioAd, NewspaperAd)+ϵ


There are some random factors (price, salesforce abilities, errors in
measurement, …) that we don't observe: the Unobserved Error Term
16
What is Predictive Analytics?
Regression of Sales on Each Advertising Medium

Let's assume f is a linear function of advertising budget:

Sales = β0 + β1·TVAd + ϵ    Sales = β0 + β1·RadioAd + ϵ    Sales = β0 + β1·NewspaperAd + ϵ

Normally distributed error term: ϵ ∼ N(0, σ²)

17
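As a concrete illustration of one of these simple models, here is a minimal sketch — in Python rather than SAS, with an invented DGP whose coefficients (7 and 0.05) are illustrative, not taken from the lecture's dataset — that simulates Sales = β0 + β1·TVAd + ϵ and recovers the coefficients with closed-form OLS:

```python
import random

random.seed(0)

# Hypothetical DGP: sales = 7 + 0.05 * tv_ad + noise
# (these coefficients are made up for illustration)
tv_ad = [random.uniform(0, 300) for _ in range(200)]
sales = [7 + 0.05 * x + random.gauss(0, 1) for x in tv_ad]

# Closed-form OLS for simple regression: b1 = cov(x, y) / var(x)
n = len(tv_ad)
mx = sum(tv_ad) / n
my = sum(sales) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(tv_ad, sales)) / \
     sum((x - mx) ** 2 for x in tv_ad)
b0 = my - b1 * mx

print(round(b0, 2), round(b1, 3))  # estimates land near 7 and 0.05
```

With 200 markets and a small error term, the estimates sit close to the true coefficients — the ϵ term is exactly why they are close rather than exact.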
Is it the right (the best) way to model Sales and Ad?
• How confident are we about our model?
• Can we confidently predict sales if we change our advertising
strategy?
• Shouldn't we consider all Ads in one model?

• Is it better to spend on either TVAd, RadioAd, or NewspaperAd?


Or should we allocate an $X budget across (XTVAd, XRadioAd, XNewspaperAd)?

• How do we provide an optimal allocation of our advertising budget to

maximize our sales next year? What is the confidence interval?
• …

18
Issues & Data-Driven Solutions
• HR Analytics:
I. Who are the most productive salespeople and employees?
II. Which managers have the highest retention rates? What do they do?
III. Which training program is best for an employee?
IV. Why do people leave?
V. What is the cost of turnover?
VI. Why do people join the organization?
• Firm-Product Analysis:
I. What are our most/least profitable products?
II. What are our production costs & how can we lower them?
III. What is our quality level & how can we improve that (Fed Ex)?
IV. What is our cycle time & how can we lower it?
V. What are the sources of product innovation?
VI. What impacts the demand of our product?

19
Issues & Data-Driven Solutions
• Financial Services
I. Who to give a loan?
II. What interest rate to offer?
III. How much amount of loan to offer?
IV. Who is likely to default?
V. How accurate are the financial forecasts?
VI. Where to invest and how much risk to take?
• Customer related
I. Who are the most/least profitable customers?
II. Who are the most/least satisfied customers?
III. Which is the fastest/slowest customer segment?
IV. What type of ads bring most customers?
V. What is our customer experience like & how can we improve it?
VI. What is the cost of customer acquisition?
VII. What are the reasons for losing customers?
VIII. What are the costs of customer transactions?

20
Data Science Philosophy

Observables (Data): Y = f(X) + ϵ


Y := Dependent Variable, Response, Target
X := Independent Variables, Inputs, Predictors X = (X1, X2, ⋯, Xp)

Note we observe a random sample of observations:

        ⎡ y1 ⎤        ⎡ x11 x12 ⋯ x1p ⎤   ← Observation 1
    Y = ⎢ y2 ⎥    X = ⎢ x21 x22 ⋯ x2p ⎥
        ⎢ ⋮  ⎥        ⎢  ⋮   ⋮  ⋱  ⋮  ⎥
        ⎣ yn ⎦        ⎣ xn1 xn2 ⋯ xnp ⎦   ← Observation n
21
Data Science Philosophy
Y = f(X) + ϵ
Unobservables (Model): What is f()? Do we know it?

f() := the systematic relationship between X and Y (not random)

We don't know this relationship. What should we do?
We must make assumptions about f() and estimate it.

ϵ := random error, independent of X, with mean 0 (random),
i.e. we assume it carries no information

22


Data Generating Process (DGP) vs. Models
DGP means whatever mechanism is at work in the real world giving rise
to the sample:
• The distribution of predictors and responses, and how they relate,
i.e. precisely the true functional forms of f() and ϵ in the real world.

Model:
• A set of DGPs used to approximate the unknown DGP.
• We can make assumptions about which DGPs are in the set of models
we consider, but in general there are many possibilities.
• So we have to let the data tell us which one to pick.
• We do this by estimating f().

23
A Simulated Example with One Predictor

[Figure: two panels of Income vs. Years of Education. Left ("A Data Sample") shows the observed data; right ("Data Generating Process") shows the true f() and the error ϵ.]

24
A Simulated Example with Two Predictors

[Figure: 3D plot of Income as a function of Years of Education and Seniority, with the error ϵ shown relative to the true surface.]

25
Why do we need f()? Prediction
1) Prediction: We observe a new sample of X but not Y.

Ŷ = f̂(X), since ϵ averages to zero, i.e. E[ϵ] = 0

Ŷ := prediction for Y;  f̂() := estimate of f()

Generally, in prediction we don't care about the exact form of f̂(),
only the accuracy of the prediction,
e.g. how close Ŷ is to Y.
26
Prediction Accuracy
Suppose X and f̂() are fixed:

E[(Y − Ŷ)²] = E[(f(X) + ϵ − f̂(X))²]
            = [f(X) − f̂(X)]² + Var(ϵ)
              (Reducible)      (Irreducible)

Prediction accuracy depends on Var(ϵ):
- may be due to unmeasured variables
- may be due to natural (unmeasurable) variation

The methods in this course will focus on minimizing the reducible error.

Minimizing the irreducible error requires new sources of data; it thus
provides an upper bound on the accuracy of prediction given a dataset.
27
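The decomposition can be checked numerically. This is a hedged sketch with an invented true f, a deliberately slightly-off fixed estimate f̂, and an assumed Var(ϵ) — none of these numbers come from the lecture:

```python
import random

random.seed(1)

# Illustrative only: a known true f and a fixed, slightly-off estimate
f = lambda x: 3 + 2 * x
f_hat = lambda x: 4 + 1.9 * x
sigma2 = 4.0                            # Var(eps), the irreducible part

x0 = 5.0
reducible = (f(x0) - f_hat(x0)) ** 2    # [f(x0) - f_hat(x0)]^2

# Monte Carlo estimate of E[(Y - Yhat)^2] at x0
draws = [(f(x0) + random.gauss(0, sigma2 ** 0.5) - f_hat(x0)) ** 2
         for _ in range(200_000)]
mse = sum(draws) / len(draws)

print(round(mse, 2), round(reducible + sigma2, 2))  # the two agree closely
```

The simulated expected squared error matches reducible + Var(ϵ), and no choice of f̂ can push it below Var(ϵ).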
Why do we need f()? Inference
2) Inference: We want to understand how Y is related to X
e.g. how does Y change when X changes?

- which predictors are related to the response?


- what is this relationship? (positive? negative? dependent on other predictors?)
- what is the form of this relationship? linear? non-linear?

Examples
- What factors change demand? Price? Advertising? Past purchase behavior?
- What factors lead people to default on a loan?
- Why did (or didn't) people vote for Trump or Clinton?

28
How does a new advertising budget predict sales?

How much will sales increase if the TV budget is increased to $200,000? $400,000?


We can't just average sales at $400,000 because we don't observe it!
What if the newspaper budget is increased to $200,000? Does it even matter?
29
Do all advertising types increase sales?

• Why should we advertise?


• Should we advertise on all mediums?
• If yes, which medium is more effective for sales?

30
How to estimate f()?
Suppose we observe n data points:

{(x1, y1), (x2, y2), …, (xn, yn)}, where xi = (xi1, xi2, …, xip)

Generally, there are two estimation methods:

1) Parametric — we assume a specific functional form for f().
   Advantage: interpretability. Disadvantage: inflexibility of fit.

2) Non-parametric — no functional form is assumed.
   Advantage: flexibility of fit. Disadvantage: lack of interpretability.

31
How to estimate f() − Parametric
1) Assume a functional form,
e.g. f() is linear in X:
f(X) = β0 + β1X1 + β2X2 + … + βpXp

2) Choose a procedure to fit/train the model,
e.g. Ordinary Least Squares to find the best coefficients:

Y ≈ β̂0 + β̂1X1 + β̂2X2 + … + β̂pXp

32
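A sketch of step 2 using Ordinary Least Squares — written in Python/NumPy rather than SAS, with three invented "ad channel" predictors and illustrative coefficients rather than the lecture's data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Hypothetical DGP over three ad channels (coefficients are illustrative)
X = rng.uniform(0, 100, size=(n, 3))           # TVAd, RadioAd, NewspaperAd
beta_true = np.array([2.0, 0.05, 0.10, 0.00])  # intercept + three slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 1, n)

# OLS: add an intercept column and solve the least-squares problem
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(np.round(beta_hat, 2))  # estimates land near the true coefficients
```

The fitted coefficients β̂ are what make the model easy to interpret: each one is the estimated effect of one predictor holding the others fixed.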
How to estimate f() − Parametric
The Parametric Approach Trade-off

f(X) = β0 + β1X1 + β2 X2 + … + βp Xp

Y ≈ β0̂ + β1̂ X1 + β2̂ X2 + … + βp̂ Xp

Advantages:
1. Very easy to interpret the effects
2. We need few data points (at least p + 1 observations)

Disadvantages:
1. Is our functional form (e.g. a linear model) flexible enough
to approximate the true function f()?

33
How to estimate f() − Parametric
Flexibility vs Overfitting
We can increase the flexibility of a parametric model, but this generally
requires fitting more parameters.
e.g. Polynomial Regression with degree-d polynomials:

Y = β0 + (β11X1 + β12X1² + β13X1³ + … + β1dX1^d)
       + (β21X2 + β22X2² + β23X2³ + … + β2dX2^d)
       + …
       + (βp1Xp + βp2Xp² + βp3Xp³ + … + βpdXp^d) + ϵ

34
Flexibility vs Overfitting
We can increase the flexibility of a parametric model, but this generally
requires fitting more parameters.
e.g. Polynomial Regression with degree-d polynomials:
Y = β0 + (β11X1 + β12X1² + β13X1³ + … + β1dX1^d)
       + (β21X2 + β22X2² + β23X2³ + … + β2dX2^d)
       + …
       + (βp1Xp + βp2Xp² + βp3Xp³ + … + βpdXp^d) + ϵ
But we should be careful about overfitting: our model follows the
error-term noise too closely by fitting most observations perfectly!
What if we use our model to predict outside of our sample, which will have
different random error terms?
35
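To see this flexibility at work, here is a small sketch on a hypothetical sine-shaped DGP (not from the lecture): training MSE can only fall as the polynomial degree d grows, which is exactly what makes overfitting tempting.

```python
import numpy as np

rng = np.random.default_rng(7)

# Nonlinear true f with noise (an illustrative DGP, not the lecture's)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

def train_mse(degree):
    """Fit a degree-d polynomial by least squares; return the training MSE."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Training MSE only falls as flexibility (the degree) grows
mses = {d: train_mse(d) for d in (1, 3, 9)}
print(mses)
```

A shrinking training MSE says nothing about prediction on new data — that distinction is the train/test split introduced later in the deck.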
Flexibility vs Overfitting
Linear Parametric Model vs. True Model vs. Spline Model

income ≈ β0 + β1·education + β2·seniority

[Figure: three 3D surfaces of Income against Years of Education and Seniority — the linear parametric fit, the true model, and a flexible spline fit.]
36
Flexibility vs Overfitting

[Figure: methods plotted by interpretability (vertical, high → low) against flexibility (horizontal, low → high): Subset Selection and Lasso; Least Squares; Generalized Additive Models; Trees; Bagging and Boosting; Support Vector Machines.]

37
Non-parametric (Nearest Neighbors)
So let's relax this definition and use nearby data points in some
"neighborhood" 𝒩(x), which has to be defined:

f̂(x) = Mean(Y | X ∈ 𝒩(x)) = (1 / |𝒩(x)|) Σ_{xi ∈ 𝒩(x)} yi

38
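The neighborhood average can be sketched directly. This is a minimal k-nearest-neighbors regression in Python on an invented linear DGP (the slope 2 and sample size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Training sample from a hypothetical DGP: y = 2x + noise
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)

def knn_predict(x0, k=10):
    """f_hat(x0): average y over the k nearest training points to x0."""
    nearest = np.argsort(np.abs(x - x0))[:k]   # the neighborhood N(x0)
    return float(np.mean(y[nearest]))

print(round(knn_predict(5.0), 2))  # should land near 2 * 5 = 10
```

Nothing about the form of f() was assumed — the prediction is just the local mean, which is why the method needs dense data around every query point.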
Non-parametric Approach Trade-off

f̂(x) = Mean(Y | X ∈ 𝒩(x)) = (1 / |𝒩(x)|) Σ_{xi ∈ 𝒩(x)} yi

Advantages:
1. There is no assumption about functional form of f().

Disadvantages:
1. Very hard to interpret the estimation results
2. We need lots of data points to get a good estimate in each neighborhood
3. Only works well when p is small and n is large (curse of dimensionality)

39
The Curse of Dimensionality
But these non-parametric methods can really struggle when p is large (for a fixed n)

40
Supervised vs. Unsupervised
Supervised: for each predictor(s) xi there is a yi; the aim is to accurately predict yi.
Unsupervised: for each predictor(s) xi there is no yi; the aim is to understand the relationships between the xi.

[Figure: left — supervised examples, Income vs. Years of Education; right — unsupervised examples, X2 vs. X1 with three groups (o, +, ∆).]

What if group membership (o, +, ∆) is unknown?

41
Regression vs. Classification
Quantitative:
variables take on numerical values. (e.g. age, prices, income)
-called “regression” problems

Qualitative (Categorical):
variables fall into multiple discrete classes/categories
(e.g. gender, brands, cancer diagnosis, purchase)

-called “classification” problems

The format of the Response variable will determine the method used.
All methods can handle Predictor Variables of both types.
42
Assessing Model Accuracy

43
Why is assessing accuracy important?

We will learn about many different methods/models for data.


Each has its benefits and drawbacks:
• flexible vs interpretable
• accuracy vs robustness

Selecting the right method is the most important thing an analyst does.

44
How do we measure quality of fit?

For quantitative data (regression), we use the Mean Squared Error (MSE) of a predictor:

MSE = (1/n) Σ_{i=1}^{n} (yi − ŷi)² = (1/n) Σ_{i=1}^{n} (yi − f̂(xi))²

If all yi are "close" to the predictions f̂(xi), then the MSE will be small.

f̂() will be close to f() on average if the MSE is smaller.
45
Training Data vs. Test Data
We usually divide our data into two subsets:

• The training data (In-sample) will be used to train/fit our models to


determine the best DGP on our training data.

• The test data (holdout-sample) will be used to measure accuracy


of predictions of our models on our test data.

46
How do we measure quality of fit?
Mean Squared Error of a predictor (training data) — fit to the observed data points:

MSE_train = (1/n) Σ_{i=1}^{n} (yi − f̂(xi))²

Note that we find the estimate f̂() by minimizing MSE_train.

Mean Squared Error of a predictor (test data) — we care about new observations:

MSE_test = Avg[(y0 − f̂(x0))²] = (1/N0) Σ_{i=1}^{N0} (y0i − f̂(x0i))²
47
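The train/test procedure can be sketched end to end. This example uses an invented quadratic DGP (all numbers are assumptions for illustration), fits polynomials of increasing degree on the training half only, and evaluates both MSEs:

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical data: quadratic truth plus noise
x = rng.uniform(-3, 3, 300)
y = 1 + x ** 2 + rng.normal(0, 1, 300)

# Split into training (in-sample) and test (holdout) sets
train, test = np.arange(200), np.arange(200, 300)

def mse(degree):
    """Fit on the training data only; report (train MSE, test MSE)."""
    c = np.polyfit(x[train], y[train], degree)
    err = lambda idx: float(np.mean((np.polyval(c, x[idx]) - y[idx]) ** 2))
    return err(train), err(test)

for d in (1, 2, 10):
    tr, te = mse(d)
    print(d, round(tr, 2), round(te, 2))
```

Training MSE keeps falling as the degree grows, while test MSE is what actually measures predictive accuracy — here the degree matching the true quadratic wins on the holdout set.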
Smallest MSE_train ≠ smallest MSE_test

Many estimation methods specifically try to minimize MSE_train,

therefore typically MSE_train < MSE_test

48
A Simulated Example 1
True f() is the non-linear black curve.

[Figure: left — data with three fits: Linear Regression, a (Smooth) Spline, and a (Rough) Spline, the rough spline picking up variation due to random chance; right — MSE_train and MSE_test against flexibility ("degrees of freedom"), with MSE_test U-shaped and bounded below by Var(ϵ).]

Overfitting: a less flexible model would have yielded a smaller MSE_test.
49
A Simulated Example 2
True f() is the linear black curve
True f () MSEtest

2.5
12
Linear Regression

2.0
10
lower MSEtest!

Mean Squared Error


(Smooth) Spline

1.5
8

(Rough) Spline
Y

Var (ϵ)

1.0
picking up variation
6

due to random

0.5
4

chance

MSEtrain
2

0.0
0 20 40 60 80 100 2 5 10 20
“degrees of freedom”
X Flexibility

50
A Simulated Example 3

[Figure: same layout — True f(), Linear Regression, (Smooth) Spline, and (Rough) Spline on the left, the rough spline picking up variation due to random chance; MSE_train and MSE_test against flexibility on the right, with MSE_test bounded below by Var(ϵ).]
51
Bias-Variance Tradeoff

The "U-shape" in MSE_test is due to two competing properties.

Define Bias(f̂(x0)) = E[f̂(x0)] − f(x0). Then:

E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ϵ)

We want to choose a method with low variance and low bias. The error can never be lower than Var(ϵ).

Var(f̂(x0)) := a measure of how much f̂ changes when estimated on different training sets.

Bias(f̂(x0)) := the error introduced in f̂ by approximating a (complicated) DGP with a (simple) f().

52
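The decomposition can be verified by simulation. This sketch assumes an illustrative sine-shaped true f, a linear fit, and a chosen noise level; it refits on many training sets and checks that expected test error at x0 equals variance plus squared bias plus Var(ϵ):

```python
import numpy as np

rng = np.random.default_rng(5)

f = lambda x: np.sin(x)       # assumed true f (illustration only)
sigma = 0.5                   # noise sd, so Var(eps) = 0.25
x0 = 1.0                      # evaluation point

# Repeatedly draw training sets, fit a degree-1 polynomial, predict at x0
preds = []
for _ in range(2000):
    x = rng.uniform(0, 3, 40)
    y = f(x) + rng.normal(0, sigma, x.size)
    c = np.polyfit(x, y, 1)
    preds.append(np.polyval(c, x0))
preds = np.array(preds)

var = float(np.var(preds))                       # Var(f_hat(x0))
bias2 = float((np.mean(preds) - f(x0)) ** 2)     # Bias(f_hat(x0))^2

# Expected test error at x0, estimated directly from fresh y0 draws
y0 = f(x0) + rng.normal(0, sigma, preds.size)
err = float(np.mean((y0 - preds) ** 2))

print(round(err, 3), round(var + bias2 + sigma ** 2, 3))  # nearly equal
```

Here the linear model carries a visible bias (it approximates a curved f), and no method can push the total below Var(ϵ) = 0.25.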
Bias-Variance Tradeoff - Three Examples

[Figure: three panels — Example 1, Example 2, Example 3 — plotting MSE, Bias, and Var against flexibility.]

53
What about Classification?
So far we have focused on regression, but these concepts transfer to the classification setting.

{(x1, y1), …, (xn, yn)} where y1, …, yn ∈ C are qualitative

Examples (to compute, we need to convert labels into numbers):
Email: C = {spam, not} → {1, 0}

Consumer Choice: C = {Brand A, Brand B, Brand C} → {1, 2, 3}

We want to:
- predict an outcome ŷ0 given some new observed data x0

- understand the roles and relationships between different predictors X1, …, Xp

54
Classification Accuracy
Training error rate: (1/n) Σ_{i=1}^{n} I(yi ≠ ŷi),
where the indicator I(yi ≠ ŷi) = 1 if yi ≠ ŷi and 0 otherwise,
i.e. the fraction of incorrect classifications.

Test error rate (over new observations): Avg[I(y0 ≠ ŷ0)]

Again, we want to minimize the test error rate.


55
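The error rate is a direct count. A minimal sketch with made-up spam labels (the predictions are hypothetical, not from any fitted classifier):

```python
# Training error rate: fraction of misclassified labels
y_true = ["spam", "not", "not", "spam", "not", "spam", "not", "not"]
y_pred = ["spam", "not", "spam", "spam", "not", "not", "not", "not"]

# Sum the indicator I(y_i != y_hat_i) over all observations
errors = sum(1 for yi, yhat in zip(y_true, y_pred) if yi != yhat)
error_rate = errors / len(y_true)
print(error_rate)  # 2 of 8 misclassified -> 0.25
```

The test error rate is the same quantity computed on held-out observations the classifier never saw.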
