Lecture 1
About Me…
• Shervin Shahrokhi Tehrani (Call me Shervin)
❑ Ph.D. in Mathematics
❑ Ph.D. in Marketing
Research Interest: Quantitative Marketing (Theory, Empirical & Experimental): Advertising, Retailing, Heuristic decision processes
Email: [email protected]
• Phone: 214-714-8448
• Office: JSOM 13.219
• Office hours: By appointment ☺
This Course…
Let’s Start our Journey…
My Personal Data Challenge
h"ps://www.instagram.com/foodiehappie/
h"ps://www.youtube.com/channel/UCGL3FaWbVVve3GLUmSDer3A
6
Foodie Happie Product: Food & Happiness
Learn About Persian & International Foods
Wellness & Happiness & Learning About Foods Across Countries
How can I increase my followers using the data I have about them?
Why Predictive Analytics?
What is Predictive Analytics?
Advertising & Sales
Can we predict Sales using data on TV, Radio, and Newspaper spending? How?
We can write, approximately, Sales ≈ f(TV, Radio, Newspaper). Why approximately?
Is this the right (best) way to model Sales and Advertising?
• How confident are we in our model?
• Can we confidently predict Sales if we change the advertising strategy?
• Shouldn't we consider all ad channels in one model?
Issues & Data-Driven Solutions
• HR Analytics:
I. Who are the most productive salespeople and employees?
II. Which managers have the highest retention rates? What do they do?
III. Which training program is best for an employee?
IV. Why do people leave?
V. What is the cost of turnover?
VI. Why do people join the organization?
• Firm-Product Analysis:
I. What are our most/least profitable products?
II. What are our production costs & how can we lower them?
III. What is our quality level & how can we improve it (FedEx)?
IV. What is our cycle time & how can we lower it?
V. What are the sources of product innovation?
VI. What impacts demand for our product?
Issues & Data-Driven Solutions
• Financial Services
I. Who to give a loan?
II. What interest rate to offer?
III. How large a loan to offer?
IV. Who is likely to default?
V. How accurate are the financial forecasts?
VI. Where to invest and how much risk to take?
• Customer related
I. Who are the most/least profitable customers?
II. Who are the most/least satisfied customers?
III. What is the fastest/slowest-growing customer segment?
IV. What type of ads brings in the most customers?
V. What is our customer experience like & how can we improve it?
VI. What is the cost of customer acquisition?
VII. What are the reasons for losing customers?
VIII. What are the costs of customer transactions?
Data Science Philosophy
Model:
• A set of DGPs to approximate the unknown DGP.
• We can make assumptions about which DGPs are in the set of models we consider, but in general, there are many possibilities.
• So we have to let the data tell us which one to pick.
• We do this by estimating f().
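To make this concrete, here is a minimal simulation sketch in Python (all numbers hypothetical, in the spirit of the Income vs. Years of Education example on the next slide): one DGP generates the data, and estimating f() means trying to recover it from the (x, y) pairs alone.

import numpy as np

rng = np.random.default_rng(0)

# A hypothetical true DGP: Income as a smooth function of Years of Education
def f_true(x):
    return 20 + 3 * (x - 10) + 0.15 * (x - 10) ** 2

n = 100
x = rng.uniform(10, 22, size=n)      # years of education
eps = rng.normal(0, 5, size=n)       # random error around the true f()
y = f_true(x) + eps                  # the sample we actually observe

# We only ever see the (x, y) pairs; "estimating f()" means trying to
# recover f_true from them without knowing it in advance.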
A Simulated Example with One Predictor
[Figure, two panels: left, "A Data Sample" of Income vs. Years of Education; right, the "Data Generating Process" showing the true f() with the errors ϵ scattered around it.]
A Simulated Example with Two Predictors
[Figure: the true f() as a surface, with Income as a function of Years of Education and Seniority, and errors ϵ scattered around the surface.]
Why do we need f()? Prediction
1) Prediction: We observe a new sample of X but not Y
Ŷ = f̂(X), since ϵ averages to zero, i.e. E[ϵ] = 0
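A rough sketch of this step (illustrative numbers in the style of the Advertising example; sklearn's LinearRegression stands in for whatever f̂ we have fitted):

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data: (TV, Radio, Newspaper) spending and Sales
X_train = np.array([[230.1, 37.8, 69.2],
                    [ 44.5, 39.3, 45.1],
                    [ 17.2, 45.9, 69.3],
                    [151.5, 41.3, 58.5]])
y_train = np.array([22.1, 10.4, 9.3, 18.5])

f_hat = LinearRegression().fit(X_train, y_train)  # our estimate f_hat of f

# We observe a new X but not Y; since E[eps] = 0, we predict Y_hat = f_hat(X)
X_new = np.array([[100.0, 20.0, 30.0]])
print(f_hat.predict(X_new))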
Prediction Accuracy
Suppose X and f̂() are fixed. Then:
E[(Y − Ŷ)²] = E[(f(X) + ϵ − f̂(X))²] = E[(f(X) − f̂(X))²] + Var(ϵ)
The first term, E[(f(X) − f̂(X))²], is the reducible error; the second, Var(ϵ), is the irreducible error.
Prediction accuracy depends on Var(ϵ):
- may be due to unmeasured variables
- may be due to natural (unmeasurable) variation
Examples
- What factors change demand? Price? Advertising? Past purchase behavior?
- What factors lead people to default on a loan?
- Why did people vote, or not vote, for Trump or Clinton?
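A quick simulation (hypothetical setup) shows why Var(ϵ) is a hard floor: even predicting with the true f() itself leaves the irreducible error behind.

import numpy as np

rng = np.random.default_rng(1)
f_true = lambda x: 3 + 2 * x                 # hypothetical true f()
x = rng.uniform(0, 10, size=100_000)
eps = rng.normal(0, 2, size=x.size)          # Var(eps) = 4
y = f_true(x) + eps

# Even the *perfect* predictor f() itself cannot beat the irreducible error:
mse = np.mean((y - f_true(x)) ** 2)
print(mse)                                   # close to 4 = Var(eps)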
How does a new advertising budget predict Sales?
How to estimate f()?
Estimation Methods for f()
Suppose we observe n data points:
{(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, where xᵢ = (xᵢ₁, xᵢ₂, …, xᵢₚ)
We then assume a specific functional form for f().
How to estimate f() − Parametric
1) Assume a functional form, e.g. f() is linear in X:
f(X) = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ
The Parametric Approach Trade-off
f(X) = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ
Advantages:
1. Very easy to interpret the effects.
2. We need few data points (at least p + 1 observations).
Disadvantages:
1. Is our functional form (e.g. a linear model) flexible enough to approximate the true function f()?
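As a minimal sketch of the parametric approach (hypothetical data and coefficients), we can estimate the β's by least squares:

import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.uniform(0, 100, size=(n, p))             # hypothetical spending on 3 channels
beta_true = np.array([5.0, 0.05, 0.10, 0.02])    # intercept plus one slope per channel
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 1.0, size=n)

# Least-squares estimate of (beta_0, beta_1, ..., beta_p)
X1 = np.column_stack([np.ones(n), X])            # prepend an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)                                  # recovers beta_true closely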
Flexibility vs Overfitting
We can increase the flexibility of a parametric model, but this generally requires fitting more parameters.
e.g. Polynomial Regression, where d is the degree of the polynomial:
Y = β₀ + (β₁₁X₁ + β₁₂X₁² + β₁₃X₁³ + … + β₁dX₁ᵈ)
  + (β₂₁X₂ + β₂₂X₂² + β₂₃X₂³ + … + β₂dX₂ᵈ)
  + …
  + (βₚ₁Xₚ + βₚ₂Xₚ² + βₚ₃Xₚ³ + … + βₚdXₚᵈ) + ϵ
But we should be careful about overfitting: the model may follow the noise in the error term too closely, fitting most observations perfectly!
What if we then use our model to predict outside our sample, where the random error terms will be different?
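A small sketch of this danger (hypothetical one-predictor data): as the polynomial degree d grows, the training fit keeps improving while out-of-sample predictions get worse.

import numpy as np

rng = np.random.default_rng(3)
f_true = lambda x: np.sin(x)                     # hypothetical non-linear true f()
x_tr = rng.uniform(0, 6, 30);   y_tr = f_true(x_tr) + rng.normal(0, 0.3, 30)
x_te = rng.uniform(0, 6, 1000); y_te = f_true(x_te) + rng.normal(0, 0.3, 1000)

for d in (1, 3, 12):                             # increasing flexibility
    coefs = np.polyfit(x_tr, y_tr, deg=d)        # fit a degree-d polynomial
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(d, round(mse_tr, 3), round(mse_te, 3))
# Training MSE keeps falling as d grows; test MSE eventually rises: overfitting.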
Flexibility vs Overfitting
[Figure, three panels: Income as a function of Years of Education and Seniority under a Linear Parametric Model, the True Model, and an (overfit) Spline Model.]
Flexibility vs Overfitting
[Figure: Interpretability (high to low) plotted against Flexibility (low to high). Subset Selection and Lasso are the most interpretable and least flexible, Least Squares sits in between, and Bagging/Boosting are the most flexible but least interpretable.]
Non-parametric (Nearest Neighbors)
So let's relax this definition and use nearby data points in some "neighborhood" 𝒩(x), which has to be defined:
f̂(x) = Mean(Y | X ∈ 𝒩(x)) = (1/∣𝒩(x)∣) Σ_{xᵢ ∈ 𝒩(x)} yᵢ
Non-parametric Approach Trade-off
f̂(x) = Mean(Y | X ∈ 𝒩(x)) = (1/∣𝒩(x)∣) Σ_{xᵢ ∈ 𝒩(x)} yᵢ
Advantages:
1. There is no assumption about the functional form of f().
Disadvantages:
1. Very hard to interpret the estimation results.
2. We need lots of data points to get a good estimate in each neighborhood.
3. Only works well when p is small and n is large (curse of dimensionality).
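A minimal nearest-neighbors sketch (hypothetical data, with the k nearest training points playing the role of 𝒩(x)):

import numpy as np

rng = np.random.default_rng(4)
x_train = rng.uniform(0, 10, 200)
y_train = np.log1p(x_train) + rng.normal(0, 0.2, 200)   # hypothetical DGP

def knn_predict(x0, k=10):
    # N(x0): the k training points nearest to x0
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()                           # average their y's

print(knn_predict(5.0))                                  # close to log(1 + 5) ~ 1.79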
The Curse of Dimensionality
But these non-parametric methods can perform very poorly when p is large (for a fixed n).
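A quick illustration of why (hypothetical uniform data): as p grows, even the nearest training point is far away, so "local" averaging stops being local.

import numpy as np

rng = np.random.default_rng(5)
n = 1000
for p in (1, 2, 10, 50):
    X = rng.uniform(0, 1, size=(n, p))       # n points in the p-dimensional unit cube
    x0 = np.full(p, 0.5)                     # a query point at the center
    dists = np.linalg.norm(X - x0, axis=1)
    print(p, round(dists.min(), 3))          # even the nearest point drifts away as p grows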
Supervised vs. Unsupervised
Supervised: for each observation of predictors xᵢ there is a response yᵢ. The aim is to accurately predict yᵢ.
Unsupervised: for each observation of predictors xᵢ there is no yᵢ. The aim is to understand the relationships among the xᵢ.
[Figure, two panels: left, a supervised example (Income vs. Years of Education); right, an unsupervised example of groups of points in the (X1, X2) plane.]
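A compact sketch of the contrast (hypothetical data; LinearRegression and KMeans are just stand-in methods): regression needs the yᵢ, clustering uses only the xᵢ.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.normal(0, 1, size=(150, 2))

# Supervised: a response y exists for every x_i, and we try to predict it
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 150)
reg = LinearRegression().fit(X, y)

# Unsupervised: no y at all; we look for structure among the x_i themselves
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)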
Regression vs. Classification
Quantitative:
variables take on numerical values (e.g. age, prices, income)
- called "regression" problems
Qualitative (Categorical):
variables fall into multiple discrete classes/categories (e.g. gender, brands, cancer diagnosis, purchase)
- called "classification" problems
The format of the Response variable determines the method used.
All methods can handle Predictor Variables of both types.
Assessing Model Accuracy
Why is assessing accuracy important?
Selecting the right method is the most important thing an analyst does.
How do we measure quality of fit?
For quantitative data (regression), we use the Mean Squared Error (MSE) of a predictor:
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = (1/n) Σᵢ₌₁ⁿ (yᵢ − f̂(xᵢ))²
The smaller the MSE, the closer f̂() is to f() on average.
Training Data vs. Test Data
We usually divide our data into two subsets: training data, used to fit the model, and test data, held out to evaluate its predictions.
How do we measure quality of fit?
Mean Squared Error of a predictor on the training data (the fit to the observed data points):
MSE_train = (1/n) Σᵢ₌₁ⁿ (yᵢ − f̂(xᵢ))²
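A short sketch (hypothetical data) of computing both the training and test versions of this quantity after a simple split:

import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 6, 200)
y = np.sin(x) + rng.normal(0, 0.3, 200)            # hypothetical DGP

train, test = np.arange(150), np.arange(150, 200)  # a simple train/test split
coefs = np.polyfit(x[train], y[train], deg=9)      # a fairly flexible fit

mse = lambda idx: np.mean((y[idx] - np.polyval(coefs, x[idx])) ** 2)
print(mse(train), mse(test))                       # MSE_test is typically the larger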
A Simulated Example 1
True f() is the non-linear black curve.
[Figure, two panels: left, the data with three fits: Linear Regression, a (Smooth) Spline, and a (Rough) Spline; right, MSE_train and MSE_test against flexibility ("degrees of freedom"). The rough spline picks up random chance: its MSE_train keeps falling while its MSE_test rises.]
A Simulated Example 2
[Figure, two panels: the true f() is nearly linear, so Linear Regression achieves the lower MSE_test, while the (Rough) Spline picks up variation due to random chance. MSE_test never drops below Var(ϵ).]
A Simulated Example 3
[Figure, two panels: left, the true f() is highly non-linear; Linear Regression fits poorly while the (Smooth) and (Rough) Splines track it; right, MSE_train and MSE_test against flexibility ("degrees of freedom"), with the rough spline again picking up variation due to random chance. MSE_test is bounded below by Var(ϵ).]
Bias-Variance Tradeoff
Let's define Bias(f̂(x₀)) = E[f̂(x₀)] − f(x₀). Then, for a new observation (x₀, y₀):
E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ϵ)
We want to choose a method with low variance and low bias; the expected test MSE can never be lower than Var(ϵ).
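A Monte Carlo sketch of the decomposition (hypothetical setup): refit the same estimator on many fresh training samples and measure the variance and bias of f̂(x₀) directly.

import numpy as np

rng = np.random.default_rng(8)
f_true = lambda x: np.sin(x)                 # hypothetical true f()
x0, sigma, n, reps = 2.0, 0.3, 50, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 6, n)                 # a fresh training sample each repetition
    y = f_true(x) + rng.normal(0, sigma, n)
    coefs = np.polyfit(x, y, deg=3)
    preds[r] = np.polyval(coefs, x0)         # f_hat(x0) for this training set

variance = preds.var()
bias_sq = (preds.mean() - f_true(x0)) ** 2
print(variance + bias_sq + sigma ** 2)       # approximates the expected test MSE at x0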
Bias-Variance Tradeoff - Three Examples
[Figure, three panels (one per simulated example): squared Bias falls and Var rises as flexibility increases; the test MSE is their sum plus Var(ϵ) and is typically U-shaped.]
What about Classification?
So far, we have focused on regression, but these concepts transfer to the classification setting.
Example: Email: C = {spam, not spam}. To compute the error, we need to convert the labels into numbers (e.g. 1 and 0).
Training Error rate: (1/n) Σᵢ₌₁ⁿ I(yᵢ ≠ ŷᵢ), i.e. the fraction of incorrect classifications.
The test error applies the same measure to new observations.
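A tiny sketch of the computation (hypothetical labels):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])    # 1 = spam, 0 = not spam (hypothetical labels)
y_pred = np.array([1, 0, 0, 1, 0, 1])

error_rate = np.mean(y_true != y_pred)   # fraction of incorrect classifications
print(error_rate)                        # 1/6 here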