Topic 6 Understanding Causality and Regression (Updated)
• We can loosen a constraint but that typically requires other scarce resources.
• A decision on which objects to control or change (i.e. managerial intervention)
typically precedes any decision on how to control or change them.
• Understanding causality is crucial to making effective interventions.
[Figure annotations: "Intervention = running an advertising campaign"; "Objective is to increase this".]
Causal modelling
• Consider a decision on the purchase of new equipment:
◦ Quality has two levels: high or low
• Diagram legend: an arrow means a causal relationship; "+" suggests a positive relationship.
• How about the Number of Customers visiting the shop and Volume?
• Also how about the effects of advertising and media hype on sales?
Causal model for assessing market value
• Putting all the elements together gives our causal model for situational assessment.
• Note that there is no (business) goal or objective in terms of optimisation or decision making.
• Rather, it assesses how causal factors affect the market volume.
Causal modelling for Interventions
• Example 2: instead of doing a situational assessment, you are now
asked to decide how much to spend on advertising for these
products.
◦ You need to set an objective, e.g. high market share (the proportion of sales
through your retailers to the total number sold).
◦ So the decision variable is “Advertise”.
◦ Simplify the intervention decision to two options: (1) run an advertising campaign or (2) do not run one.
◦ Further simplify by assuming that you know the price at the time you set “Advertise”.
Influence diagram
• A rectangle usually represents a strategic option (i.e. a decision point, choice variable, or value directly controlled by a strategic, decision-making agent).
• A hexagon represents an objective (e.g. profit, value, market share). Decisions are made to optimise the objective.
• A circle represents a probabilistic variable: chance variables, uncertain quantities, environmental factors and other elements outside the direct control of strategic agents.
[Figure: influence diagram for the advertising decision. Annotations:]
• [+] More Advertising leads to a greater certainty of a larger number of unit sales.
• [+] High Sales lead to a greater certainty of a high Market Share.
• [-] Higher Price leads to a greater certainty of a smaller number of unit sales.
• The link from Price to Advertise is only an informational link: you know the price when deciding on Advertise.
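To make the diagram's structure concrete, the sketch below encodes it as plain Python data. The node names and signs are read off the slide annotations; the dictionary layout and the helper functions `causal_parents` and `net_sign` are illustrative assumptions, not part of the lecture material.

```python
# Node kinds follow the slide's legend: decision (rectangle),
# chance (circle), objective (hexagon).
NODES = {
    "Advertise": "decision",
    "Price": "chance",
    "Sales": "chance",
    "Market Share": "objective",
}

# (source, target, sign). A sign of None marks the informational link.
EDGES = [
    ("Advertise", "Sales", "+"),
    ("Price", "Sales", "-"),
    ("Sales", "Market Share", "+"),
    ("Price", "Advertise", None),
]


def causal_parents(node):
    """Return {parent: sign} for the causal (signed) links into a node."""
    return {src: sign for src, dst, sign in EDGES
            if dst == node and sign is not None}


def net_sign(path):
    """Multiply the signs along a causal path given as a list of node names."""
    signs = {(src, dst): sign for src, dst, sign in EDGES if sign is not None}
    result = 1
    for src, dst in zip(path, path[1:]):
        result *= 1 if signs[(src, dst)] == "+" else -1
    return "+" if result > 0 else "-"


print(causal_parents("Sales"))
print(net_sign(["Advertise", "Sales", "Market Share"]))
print(net_sign(["Price", "Sales", "Market Share"]))
```

Multiplying the signs along a path gives only the direction of an indirect effect (here, advertising raises market share and price lowers it), not its size.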
From causal diagrams to mathematical
equations
• The simplest form of empirical model is a regression model, as below.
• Stage 1
[Figure: scatter plot of Life Ladder (y-axis) against Log GDP per capita (x-axis). Annotation: we can add a trendline in Excel.]
A linear relationship:
Happiness score is the dependent variable
Log GDP per capita is the independent variable
Excel Trendline Tool
• Right-click on the data series and choose Add Trendline from the pop-up menu.
• Check the boxes Display Equation on chart and Display R-squared value on chart.
Simple linear regression using least squares
• Simple linear regression model
$$Y = \beta_0 + \beta_1 X + \varepsilon \tag{8.1}$$
• We estimate the parameters (βs) from the sample data
$$\hat{Y} = b_0 + b_1 X \tag{8.2}$$
• Once estimated, we can
◦ Assess/explain if X is an important factor explaining Y,
◦ “predict” the value of Y given a specific value of X
◦ $\hat{Y}_i = b_0 + b_1 X_i$
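As a rough illustration of estimating the coefficients and then predicting, the sketch below fits a line using the standard closed-form least-squares estimates. The data values are made up for illustration (they are not the happiness dataset), and the function names `fit_simple_ols` and `predict` are hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted value of Y for a given value of X."""
    return b0 + b1 * x_new


# Made-up illustrative data, not taken from the lecture.
log_gdp = [7.0, 8.0, 9.0, 10.0, 11.0]
ladder = [4.0, 4.5, 5.5, 6.0, 7.0]

b0, b1 = fit_simple_ols(log_gdp, ladder)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")       # b0 = -1.350, b1 = 0.750
print(f"prediction at X = 9.5: {predict(b0, b1, 9.5):.3f}")
```

Excel's trendline and its Regression tool should report the same intercept and slope for the same data.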
Least squares regression
• Residuals are the observed errors associated with estimating the value of the dependent variable using the regression line.
$$e_i = Y_i - \hat{Y}_i \tag{8.3}$$
• The best-fitting line minimizes the sum of
squares of the residuals.
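For reference, the minimisation can be written out explicitly. This is the standard textbook derivation rather than something shown on the slide; $\bar{X}$ and $\bar{Y}$ denote the sample means and $n$ the number of observations.

$$SSE(b_0, b_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - b_0 - b_1 X_i\right)^2$$

Setting the partial derivatives of $SSE$ with respect to $b_0$ and $b_1$ to zero gives

$$b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$$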
Simple Linear Regression with Excel
• Using the Analysis ToolPak:
◦ Data > Data Analysis > Regression
Results: Regression Statistics (metrics)
• Multiple R:
◦ sample correlation coefficient
◦ varies from 0 to 1
• R Square:
◦ coefficient of determination
◦ varies from 0 (no fit) to 1 (perfect fit)
• Adjusted R Square:
◦ R Square adjusted for sample size and number of X variables.
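These three metrics are related by the standard definitions below, which are not shown on the slide. Here $n$ is the sample size, $k$ the number of X variables, $SSE = \sum (Y_i - \hat{Y}_i)^2$ and $SST = \sum (Y_i - \bar{Y})^2$.

$$R^2 = 1 - \frac{SSE}{SST}, \qquad \text{Multiple } R = \sqrt{R^2}, \qquad R^2_{adj} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}$$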
Standard residual = residual / standard deviation
[Figure: residual plot of Residuals (y-axis) against Log GDP per capita (x-axis).]
What are the drawbacks of simple linear regression models?
• Consider the case of happiness:
Multiple linear regression
• Consider the case study of happiness (at national level)
• What are possible “causes” and/or “factors” that can explain
variations in the Happiness level across countries?
Multiple linear regression
• A linear regression model with more than one independent variable is
called a multiple linear regression model.
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon \tag{8.10}$$
◦ Y is the dependent variable, Xi are the independent (explanatory) variables;
◦ βi are the regression coefficients for the independent variables, ε the error term.
$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k \tag{8.11}$$
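A minimal sketch of how the estimates $b_0, \dots, b_k$ can be obtained, assuming the usual least-squares approach of solving the normal equations $(X'X)\,b = X'Y$. It uses only the Python standard library so that it is self-contained; in practice a numerical library or Excel's Regression tool would do this step. The data are made up and noise-free, and the function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    No check is made for a singular matrix (e.g. perfectly collinear X's).
    """
    n = len(b)
    # Augmented matrix [a | b]; copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple_ols(xs, y):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in xs]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from Y = 1 + 2*X1 + 3*X2 (no error term).
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]

coefficients = fit_multiple_ols(xs, y)
print([round(c, 6) for c in coefficients])
```

Because the illustrative data contain no error term, the fit recovers the generating coefficients; with real data the estimates would only approximate the underlying βs.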
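A causal graph
[Figure: causal graph with arrows into Happiness. Positive (+) links from Wealth, Freedom, Social support, Health and Generosity; a negative (-) link from Perception of corruption; and a link from Others.]
Note: we are ignoring here possible causal and statistical relationships among the independent variables.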
[Figure: scatter plots of Happiness (Cantril Ladder) against each explanatory variable. Panel captions:]
• “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
• The time series of healthy life expectancy at birth
• “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
• The national average of GWP responses to the question, “Have you donated money to a charity in the past month?” (residual from regressing these responses on GDP per capita)
• The average of binary answers to two GWP questions
[Figure: Excel regression output. Annotations: the F-test has a p-value of 0; every X variable has a two-tailed t-test, with its p-value reported in the P-value column.]
ANOVA for Multiple Regression
• ANOVA tests for significance of the entire model. That is, it computes
an F-statistic testing the hypotheses:
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$
$$H_1: \text{at least one } \beta_j \text{ is not } 0$$
P-value = 0.000
Reject H0
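For reference, the F-statistic in the ANOVA table is computed as below. This is the standard definition rather than something shown on the slide: $SSR$ is the regression sum of squares, $SSE$ the residual sum of squares, $n$ the number of observations and $k$ the number of independent variables.

$$F = \frac{SSR / k}{SSE / (n - k - 1)} = \frac{MSR}{MSE}$$

Under $H_0$ this statistic follows an F distribution with $k$ and $n - k - 1$ degrees of freedom. A p-value below the chosen significance level leads to rejecting $H_0$.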
Interpreting the coefficients
• $\alpha_2 = 0.175$, p-value = 0.074
If wealth (log of GDP per capita) increases by 1 unit, holding all the other independent variables constant, the happiness score is estimated to increase by 0.175, significant at the 10% level.
Equivalently, a 1% increase in GDP per capita increases the happiness score by approximately 0.175/100 (0.00175), significant at the 10% level.
• $\alpha_3 = 3.55$, p-value = 0.000
If social support increases by 1 unit, holding all the other independent variables constant, the happiness score is estimated to increase by 3.55, significant at the 1% level.
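The "divide by 100" reading of a coefficient on a logged variable is an approximation. The short check below assumes the log is the natural logarithm (the slide does not say which base is used) and compares the exact and approximate effects of a 1% rise in GDP per capita.

```python
import math

b_wealth = 0.175  # coefficient on log GDP per capita, taken from the slide

# A 1% rise in GDP per capita raises ln(GDP) by ln(1.01), slightly under 0.01.
delta_log = math.log(1.01)
exact_effect = b_wealth * delta_log
approx_effect = b_wealth / 100

print(f"change in ln(GDP): {delta_log:.5f}")      # 0.00995
print(f"exact effect:      {exact_effect:.5f}")   # 0.00174
print(f"approximation:     {approx_effect:.5f}")  # 0.00175
```

The two figures agree to about four decimal places, so the shortcut is adequate for small percentage changes. If the data used a base-10 log, the effect would be smaller by a factor of ln(10).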
Should I include a new Xi variable?
• Some argue that a good regression model should include only
significant independent variables.
◦ But it is not always clear exactly what will happen when we add or remove variables from a model: variables that are (or are not) significant in one model may (or may not) be significant in another.
◦ We should not drop all insignificant variables at one time.
◦ We should take a more structured approach.
Should I include a new Xi variable?
• Using adjusted R-square
◦ Adding an independent variable to a regression model never decreases, and usually increases, the value of R-square.
◦ Adjusted R-square reflects both the number of Xi variables and sample size.
◦ Adjusted R-square may either increase or decrease when an Xi variable is added
or dropped.
◦ An increase in adjusted R-square indicates the model has improved.
◦ But some prefer models that are simpler (i.e. having fewer Xi variables) when there are only minor differences in the adjusted R-square scores.
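A small numerical sketch of why adjusted R-square can fall when a variable is added. It uses the standard adjusted R-square formula; the R-square values, sample size and variable counts are hypothetical, chosen only to illustrate the effect.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical case: a fourth variable lifts R-square only from 0.700 to 0.702.
n = 30
smaller_model = adjusted_r2(0.700, n, k=3)
larger_model = adjusted_r2(0.702, n, k=4)

print(f"3 variables: adjusted R-square = {smaller_model:.4f}")  # 0.6654
print(f"4 variables: adjusted R-square = {larger_model:.4f}")   # 0.6543
```

Here the tiny gain in R-square does not offset the penalty for the extra variable, so the adjusted value drops and the smaller model would be preferred.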
Systematic Model Building Approach
1. Consider causal graphs
2. Descriptive analysis and checking for outliers in both Y and X variables
3. Correlation matrix of all available variables
4. Construct a model with all available independent variables and examine the value of
coefficients and p-values for each coefficient.
5. If a variable's p-value is greater than 10%, consider removing it and running step 4 again. You should check the adjusted R-square again.
6. Once the majority (or all) of the X variables are statistically significant and the signs of the coefficients are consistent with expectations, you are closer to a good model.
7. Check all assumptions (next week's topic)
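Steps 4 and 5 can be sketched as a loop that removes one variable at a time. This is a simplified illustration, not a prescribed procedure: the regression itself is passed in as a caller-supplied function `fit` (hypothetical), and `fake_fit` merely returns fixed, made-up p-values so that the example runs on its own. In a real analysis the p-values change after every refit, and each removal should also be judged against the causal graph and the adjusted R-square.

```python
def backward_eliminate(variables, fit, threshold=0.10):
    """Drop the least significant variable, one at a time (steps 4 and 5).

    `fit` takes a list of variable names and returns a dict mapping each
    name to the p-value of its coefficient in the refitted model.
    """
    kept = list(variables)
    while kept:
        p_values = fit(kept)
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] <= threshold:
            break  # every remaining variable is significant at the threshold
        kept.remove(worst)
    return kept


def fake_fit(names):
    """Stand-in for a real regression run: fixed, made-up p-values."""
    made_up = {"wealth": 0.07, "support": 0.001,
               "generosity": 0.45, "freedom": 0.20}
    return {name: made_up[name] for name in names}


selected = backward_eliminate(
    ["wealth", "support", "generosity", "freedom"], fake_fit)
print(selected)  # ['wealth', 'support']
```

Removing one variable per iteration follows the earlier advice not to drop all insignificant variables at once.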