
Topic 6 Linear Regression

Vincent Hoang (2022), Lecture 9, 10


Camm et al. (2016), Chapter 7
Outline
• Understanding causality
• Linear regression
Recall Topic 4: measures of association
• Correlation: two variables are said to have a strong statistical
relationship with one another if they appear to move together.
◦ Positive or negative relationships (direction of the relationship)
◦ Strong, moderate or weak relationships (strength of the relationship)
• If cor(x, y) is positive or negative (regardless of the strength of the
correlation), can we conclude that …
◦ x causes y? or
◦ y causes x? or
◦ something else causes both? or
◦ anything else?
Dependence or correlation?
• Dependence:
◦ Variables are dependent on each other if the value of one variable gives
information about the distribution of the other.
◦ What are the key statistics of a distribution, for example the normal distribution?

• Is a statistical correlation always meaningful, especially for prediction purposes (i.e. predictive analytics)?
• Remember that “correlation does not imply causation”
Causality
• Causality describes a relationship between two
(or more) things (phenomena, events, variables,
etc.) in which a change in one causes a change
in another.
• In this diagram, A causes B under certain conditions.
◦ So, if we observe an effect, we can infer that there is a cause prior to the effect.
◦ If there is a cause, the effect will not necessarily come about.
◦ But if a cause and all the other required conditions are present, it is very likely that the cause will produce its effect(s).
Causal thinking & business decision making
• Two related scenarios
1 Situational assessment
◦ Consider any business situation (i.e. business problem that needs to be
solved)
◦ We would like to assess that situation, then we often ask “how did that
happen?”
◦ Often used in Root Cause Analysis
2 Interventions
Advanced analytics & root cause analysis
• The machine learning model can be trained to
◦ analyse the equipment’s data output under regular “healthy” operating conditions,
◦ detect “anomalies” (i.e. any pattern of deviation from “healthy” conditions),
◦ predict the “behavioural” pattern of the anomaly, and
◦ send an alert if the predicted values exceed the “normal” threshold.

• Applications: early detection of safety issues and machine failures, more efficient electricity consumption, predicting quality deviations, adjusting processes to prevent material waste, etc.
Source: https://medium.datadriveninvestor.com/root-cause-analysis-in-the-age-of-industry-4-0-9516af5fb1d0
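A minimal Python sketch of the threshold-alert logic described above; the sensor values, the "normal" band, and the ±3 standard deviation rule are illustrative assumptions rather than part of the source material:

```python
import numpy as np

# Hypothetical sensor readings under "healthy" conditions and new observations.
rng = np.random.default_rng(0)
healthy = rng.normal(loc=50.0, scale=2.0, size=500)   # training data
new_readings = np.array([50.1, 49.7, 57.3, 61.2])     # incoming data

# Learn a simple "normal" band from the healthy data (mean +/- 3 standard deviations).
mean, std = healthy.mean(), healthy.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Flag anomalies and "send" an alert when a reading falls outside the band.
for value in new_readings:
    if not (lower <= value <= upper):
        print(f"ALERT: reading {value:.1f} outside normal range [{lower:.1f}, {upper:.1f}]")
```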
Causality & interventions
• Important business decisions involve the use of limited (scarce) resources.
• The trade-off in the form of a resource-allocation decision:
◦ Should resources (time, equipment, land, …) be devoted to project A or project B?

• We can loosen a constraint but that typically requires other scarce resources.
• A decision on which objects to control or change (i.e. the managerial intervention)
typically precedes any decision on how to control or change them.
• Understanding causality is crucial to making effective interventions.
[Diagram: Intervention = running an advertising campaign; Objective = the quantity to be increased.]
Causal modelling
• Consider a decision on the purchase of new equipment. (In the diagram, an arrow (→) indicates a causal relationship and a “+” suggests a positive relationship.)
◦ Quality has two levels: high or low.
◦ High-quality equipment can perform more tasks, hence increasing production productivity, but its parts are more expensive.
◦ Maintenance cost: the greater the quality of the equipment, the more expensive the parts, hence the higher the maintenance cost.
Foundations for causal graphs
• Causal graphs are directed acyclic graphs (DAGs). They have
◦ A set of vertices (or nodes) representing variables in the model
◦ A set of edges (or links) representing the connections between variables
◦ Directed paths between nodes: an arrow shows the direction from a cause to its effect
◦ There are no cycles in a DAG.
Feedback loops & time dimensions
• Consider a relationship between joy and
physical exercise.
◦ Is there any causal relationship between them?
◦ If yes, which variable is cause and which is
effect?

• We can convert cycles into directed acyclic graphs in which we have a time dimension.
◦ At period 0: joy is a cause leading to more
exercise
◦ At period 1: feedback from exercise (period 0) to
joy (period 1)
Structures in causal graphs
• There are three building blocks:
◦ Chain: one variable (X) causes another (Y), which causes another (Z).
◦ Fork: one variable (X) causes two other variables (Y & Z); X is a common cause of both Y and Z.
◦ Collider: two variables (X, Y) cause a third (Z); Z is a common effect of both X and Y.
Chain
• Example: X learning efforts, Y employability, Z chance of getting a
job.
◦ Y depends on X for its value (hence X and Y are dependent)
◦ Z depends on Y for its value (hence Y and Z are dependent)
◦ Z depends on Y, which depends on X;
◦ hence X and Z are also dependent: the dependence of X and Z arises because Y is able to change.
◦ What if we hold Y constant (fixed)? Then changes in X are no longer linked to changes in Z.
Therefore, statistically we say that X and Z are conditionally independent given Y (illustrated in the sketch below).
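A small simulation sketch of the chain X → Y → Z; the linear relationships and coefficients are made up purely to illustrate conditional independence:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)                 # X: e.g. learning effort
y = 2 * x + rng.normal(size=n)         # Y caused by X (employability)
z = 1.5 * y + rng.normal(size=n)       # Z caused by Y (chance of getting a job)

print("corr(X, Z) overall:", round(np.corrcoef(x, z)[0, 1], 3))

# Condition on Y by looking only at observations where Y is near a fixed value.
mask = np.abs(y - 1.0) < 0.05
print("corr(X, Z) given Y near 1:", round(np.corrcoef(x[mask], z[mask])[0, 1], 3))
```

The overall correlation is strong, while the correlation within the narrow Y band is close to zero, matching the conditional-independence claim above.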
Fork
• Example: X is temperature, Y sales of ice cream and Z sales of fan.
◦ Y depends on X for its value (X and Y are dependent)
◦ Z depends on X for its value (X and Z are dependent)
◦ We can still say that, statistically, Y and Z are dependent because changes in Y reflect changes in X, which lead to changes in Z.
◦ If you calculate correlation values, what would you expect?
◦ Again, correlation does not imply causation.
◦ It is easy to see that if X is held fixed, changes in Y are no longer linked to changes in Z.
Collider
• X is competence (at work), Y is networking, Z is promotion (at work).
• Both X and Y are causes of Z.
• X and Z (similarly Y and Z) are dependent.
• X and Y are independent: they neither cause the other nor have a common cause.
◦ However, statistically we can see that if we hold Z fixed and X changes, then Y must also change in a certain way. Why?
◦ Hence we say X and Y are conditionally dependent given Z.
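A similar simulation sketch for the collider; the variables and the promotion rule are invented for illustration, but they show why conditioning on Z induces dependence between X and Y:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
competence = rng.normal(size=n)              # X
networking = rng.normal(size=n)              # Y, generated independently of X
# Z (promotion) depends on both X and Y plus noise: a collider.
promoted = competence + networking + rng.normal(scale=0.5, size=n) > 1.0

print("corr(X, Y) overall:", round(np.corrcoef(competence, networking)[0, 1], 3))

# Condition on the collider: look only at people who were promoted.
print("corr(X, Y) among the promoted:",
      round(np.corrcoef(competence[promoted], networking[promoted])[0, 1], 3))
```

Overall the correlation is near zero, but among the promoted it turns clearly negative: high competence with low networking (or vice versa) is enough to clear the promotion threshold.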
Observed associations
• We can observe associations between two variables in the data.
• However, these associations can arise through two mechanisms:
◦ Causal associations
◦ Non-causal associations
• So again, correlation (association) does not imply causation.
Draw assumptions before
making conclusions!
• Consider 3 variables, how
many possible causal
models?
• Statistical association does
not imply causation.
• Hence, it is better to use
knowledge to draw
assumptions (causal graphs)
prior to making conclusions
regarding causality.
Causal modelling for market volume
• Suppose you are asked to make an assessment of the size of the
market for laptop computers.
• The following variables are relevant:
◦ Price: average price per unit
◦ Advertising: the amount of money spent on advertising products
◦ Number of Customers visiting the shop
◦ Media Hype: whether independent media sources report on or display related
products
◦ Market Volume: the total amount of goods sold for your product category
Price & Volume
• The causal relationship between Price & Volume?

• How about the Number of Customers visiting the shop and Volume?

• Any relationship between Price and Number of Customers?


Advertising & Volume
• Do you expect that higher advertising expenditure will lead to higher sales
(market volume)?

• But how about the impact of advertising and the number of customers on sales?
• Also, how about the effects of advertising and media hype on sales?
Causal model for assessing market volume
• Now we put all elements
together, this is our causal
model for situational
assessment.
• Note that there is no
(business) goal / objective
in terms of optimisation or
decision making.
• Rather it assesses how
causal factors affect the
market volume.
Causal modelling for Interventions
• Example 2: instead of doing a situational assessment, you are now
asked to decide how much to spend on advertising for these
products.
◦ You need to set an objective, e.g. high market share (the proportion of sales
through your retailers to the total number sold).
◦ So the decision variable is “Advertise”.
◦ Simplify the intervention decision: (1) run an advertising campaign or (2) do not run one.
◦ Further, assume that you will know the price at the time you set “Advertise”.
Influence diagram
• Often, rectangle shape refers to strategic
option (i.e. decision point, choice variable,
value directly controlled by a strategic
agent – decision making agent)
• Hexagon shape refers to objective (e.g.
profit, value, market share, etc.). Decisions
are made to optimise the objective.
• Circle shape refers to probabilistic
variables that are chance variables,
uncertain quantities, environmental factors
and other elements outside the direct
control of strategic agents.
[Diagram annotations:]
◦ [+] More Advertising leads to a greater certainty of a larger number of unit sales.
◦ [+] Higher Sales lead to a greater certainty of a high Market Share.
◦ [−] Higher Price leads to a greater certainty of a smaller number of unit sales.
◦ The link from Price to Advertise is only an informational link: you know the price when deciding on Advertise.
From causal diagrams to mathematical
equations
• The simplest form of empirical model would be a regression model such as
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_5 x_5 + \varepsilon$
where $y$ is the dependent variable, $x_1, \dots, x_5$ are the independent variables, the $\beta$s are the coefficients with respect to each independent variable $x$, and $\varepsilon$ is the error term.
• This equation fails to capture the actual relationships among the independent variables ($x_1$ to $x_5$).
Shortcomings
• Consider X1, X2 and X4: the associations among these variables are clear, hence we say that this model suffers from a multicollinearity problem.
• Also, we cannot use standard
significance tests to reliably
determine which independent
variables exert the most
influence.
A solution (not discussed further in this unit)
• It is possible to use a structural equation model (SEM) (stepwise regression) via a two-stage regression.
• Stage 1
• Stage 2: using the estimated value of the independent variable obtained from the stage 1 regression.
Summaries
• Causal relationships are crucial for (1) situational assessments and (2)
interventions, as part of business analytics.
• If there is a cause-and-effect relationship between two variables x and y, there is
statistical association.
• But (statistical) correlation/association does not necessarily imply causation.
• Causal thinking and graphs are very useful because
◦ They capture both causality and statistical association
◦ They assist with both situational assessment and intervention tasks in business analytics
◦ From a managerial perspective, they allow identification of relevant stakeholders (agents, people,
departments, etc.) involved in analytics projects as well as resource allocation.
Analytics & Happiness
• What values do business analytics deliver?
◦ Happiness/satisfaction matters in every corner of our lives: overall life, work,
school, business, etc.
◦ Overall aims are to increase satisfaction.
◦ Situational analysis informs interventions: how?

• Our use of the happiness case study is to illustrate regression analysis.
Your satisfaction (happiness) matters!
• Discuss the following questions from your own experience and
knowledge
◦ What makes you happy = what are the causes of your own happiness?
◦ What makes you sad = what are the causes of your own sadness?

• Draw a causal graph (with directed paths)
Happiness
and Income

Source: World Happiness Report 2024


Life Satisfaction & Income across Countries in 2023
[Scatter plot: Life Ladder (y-axis) against Log GDP per capita (x-axis). Let’s plot the data; we can add a trendline in Excel. Trendline: y = 0.8263x + 4.8509, R² = 0.6127.]

A linear relationship:
The happiness score is the dependent variable; Log GDP per capita is the independent variable.
Excel Trendline Tool
• Right-click on the data series and choose Add Trendline from the pop-up menu.
• Check the boxes Display Equation on chart and Display R-squared value on chart.
Simple linear regression using least squares
• Simple linear regression model:
$Y = \beta_0 + \beta_1 X + \varepsilon$   (8.1)
• We estimate the parameters (the βs) from the sample data:
$\hat{Y} = b_0 + b_1 X$   (8.2)
• Once estimated, we can
◦ assess/explain whether X is an important factor explaining Y,
◦ “predict” the value of Y given a specific value of X:
$\hat{Y}_i = b_0 + b_1 X_i$
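A minimal sketch of estimating $b_0$ and $b_1$ outside Excel, using numpy with made-up data standing in for the happiness example:

```python
import numpy as np

# Made-up data: log GDP per capita (X) and happiness scores (Y).
x = np.array([7.5, 8.2, 9.0, 9.4, 10.1, 10.8, 11.2])
y = np.array([4.0, 4.6, 5.2, 5.6, 6.3, 6.9, 7.1])

# np.polyfit with degree 1 returns the least-squares slope and intercept.
b1, b0 = np.polyfit(x, y, deg=1)
print(f"Y_hat = {b0:.3f} + {b1:.3f} * X")

# Predict Y for a given X, e.g. X = 9.392.
print("Prediction at X = 9.392:", round(b0 + b1 * 9.392, 3))
```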
Least squares regression
• Residuals are the observed errors associated with estimating the value of the dependent variable using the regression line:
$e_i = Y_i - \hat{Y}_i$   (8.3)
• The best-fitting line minimizes the sum of squares of the residuals.
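For reference, minimizing the sum of squared residuals yields the standard closed-form estimates (a textbook result, not shown on the slide):
$b_1 = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$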
Simple Linear Regression with Excel
• Using Analysis Toolpak:
◦ Data > Data Analysis > Regression
Results: Regression Statistics (metrics)
• Multiple R:
◦ sample correlation coefficient
◦ varies from 0 to 1

• R Square:
◦ coefficient of determination
◦ varies from 0 (no fit) to 1 (perfect fit)

• Adjusted R Square:
◦ R Square adjusted for sample size and the number of X variables

• Standard Error: the variability between observed and predicted Y values
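The adjustment mentioned above follows the usual formula, where $n$ is the sample size and $k$ the number of X variables:
$\bar{R}^2 = 1 - (1 - R^2)\,\dfrac{n - 1}{n - k - 1}$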
Results: Regression Statistics (metrics)
• df: degrees of freedom

• SS: sum of squares

• MS: mean square (SS divided by df)
Interpreting Regression Statistics
• R Square = 0.613 means that 61.3% of the variation in the happiness level is explained by the model, in this case by the log of income per capita.
• The remaining 38.7% (100% − 61.3%) is UNEXPLAINED.
• Adjusted R Square is often used.
F-test (Analysis of Variance)
• ANOVA conducts an F-test to determine whether variation in Y is due to varying levels of X.
• ANOVA is used to test for significance of regression:
◦ H0: population slope coefficient = 0
◦ H1: population slope coefficient ≠ 0

• Excel reports the p-value (Significance F).


• Rejecting H0 indicates that X explains variation in Y.
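For reference, the F-statistic behind the Significance F value is the ratio of explained to unexplained variance ($k$ = number of X variables, $n$ = sample size):
$F = \dfrac{MSR}{MSE} = \dfrac{SSR / k}{SSE / (n - k - 1)}$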
Interpreting Coefficients
• Intercept: often not important
• Log GDP per capita: 3 elements
$Y = \beta_0 + \beta_1 X + \varepsilon$   (8.1)
◦ Direction of the relationship: a positive value
◦ Magnitude of the relationship: 0.742, meaning that for each one-point increase in Log GDP per capita, the happiness level increases by 0.742
◦ Statistical strength of the relationship:
Interpreting Coefficients
• Log GDP per capita:
◦ The statistical strength of the relationship can be assessed using hypothesis testing.
$Y = \beta_0 + \beta_1 X + \varepsilon$   (8.1)
Testing Hypotheses for Regression Coefficients
• We would like to test whether the coefficient on log(GDP) is statistically different from zero.
• If the coefficient (β1) = 0, what does this mean?
• If the coefficient (β1) ≠ 0, what does this mean? (You should also consider one-tailed tests.)
◦ Test statistic: $t = \dfrac{b_1 - 0}{\text{standard error of } b_1}$   (8.8)
◦ P-value approach
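A short sketch of computing the t statistic and its two-tailed p-value by hand; the standard error and sample size below are assumed for illustration and are not taken from the Excel output:

```python
from scipy import stats

b1 = 0.742      # estimated slope (from the example output)
se_b1 = 0.052   # illustrative standard error (assumed for this sketch)
n = 140         # illustrative sample size (assumed)

t_stat = (b1 - 0) / se_b1
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"t = {t_stat:.2f}, two-tailed p-value = {p_two_tailed:.4f}")
```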
Interpreting Coefficients
• Log GDP per capita:
◦ We can use the p-value to assess a two-tailed test.
$Y = \beta_0 + \beta_1 X + \varepsilon$   (8.1)
◦ H0: β1 = 0 vs H1: β1 ≠ 0
◦ In this example, the p-value is nearly zero (< 5%), hence there is sufficient evidence to conclude that the true β1 is not zero. This means that there exists a relationship between the happiness level and the log of GDP per capita.
◦ We can also conduct a one-tailed test.
Confidence Intervals for Regression Coefficient
• Confidence intervals (Lower
95% and Upper 95% values in
the output) provide information
about the unknown values of
the true regression
coefficients, accounting for
sampling error.
• For this example, a 95%
confidence interval for the
income variable is
[0.638;0.845].
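These bounds follow the usual construction around the estimate, where $t_{\alpha/2,\,n-2}$ is the critical t value:
$b_1 \pm t_{\alpha/2,\,n-2}\,\mathrm{SE}(b_1)$
(For this example, 0.742 plus or minus roughly 1.98 times an implied standard error of about 0.052 gives approximately [0.638, 0.845].)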
Prediction
• Intercept: often not important

• If you know the value of Log GDP per capita (e.g. 7), you can predict the value of the happiness level.
$Y = \beta_0 + \beta_1 X + \varepsilon$   (8.1)

• Predicted happiness level for Vietnam = −1.411 + 0.742 × 9.392 = 5.558
Confidence Intervals & Prediction
• Although we predicted for Vietnam: −1.411 + 0.742 × 9.392 = 5.558,
• if the true population parameters are at the extremes of the confidence intervals, the estimate might be as low as −1.411 + 0.638 × 9.392 = 4.581 or as high as −1.411 + 0.845 × 9.392 = 6.525.
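A tiny sketch reproducing the point prediction and the coefficient-bound range above:

```python
b0 = -1.411
b1_point, b1_low, b1_high = 0.742, 0.638, 0.845
x_vietnam = 9.392  # log GDP per capita

predict = lambda b1: b0 + b1 * x_vietnam
print("point prediction:", round(predict(b1_point), 3))  # about 5.558
print("lower bound     :", round(predict(b1_low), 3))    # about 4.581
print("upper bound     :", round(predict(b1_high), 3))   # about 6.525
```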
Residual analysis
• Residual = Actual Y value − Predicted Y value

• Standard residual = residual / standard deviation (of the residuals)

• Outliers: standard residuals outside ±2 or ±3 indicate potential outliers.
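A minimal sketch of flagging potential outliers from standard residuals; the actual and predicted values are illustrative stand-ins for the regression output:

```python
import numpy as np

# Illustrative actual and predicted values (stand-ins for the regression output).
y_actual = np.array([5.6, 4.9, 6.1, 7.2, 3.0, 6.4])
y_predicted = np.array([5.4, 5.1, 6.0, 6.6, 5.2, 6.3])

residuals = y_actual - y_predicted
standard_residuals = residuals / residuals.std(ddof=1)

# Flag observations whose standard residual is outside +/- 2.
for i, z in enumerate(standard_residuals):
    if abs(z) > 2:
        print(f"Observation {i}: standard residual {z:.2f} -> potential outlier")
```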


Residual Outputs – Residual Plot

[Residual plot: residuals (y-axis) plotted against Log GDP per capita (x-axis).]
What are the drawbacks of simple linear
regression models?
• Consider the case of happiness:
Multiple linear regression
• Consider the case study of happiness (at national level)
• What are possible “causes” and/or “factors” that can explain
variations in the Happiness level across countries?
Multiple linear regression
• A linear regression model with more than one independent variable is
called a multiple linear regression model.
Y   0  1 X 1   2 X 2     k X k   8.10 
◦ Y is the dependent variable, Xi are the independent (explanatory) variables;
◦ βi are the regression coefficients for the independent variables, ε the error term.

• We estimated the particle regression coefficients bi

Yˆ  b0  b1 X 1  b2 X 2    bk X k 8.11
A causal graph
(We are ignoring here possible causal and statistical relationships among the independent variables.)

[Causal graph: Wealth, Freedom, Social support, Health, Generosity, Perception of corruption, and Others all point to Happiness; most links are marked positive (+), with Perception of corruption marked negative (−).]
[Scatter plots: Happiness (Cantril Ladder) against each explanatory variable. Variable definitions:]
◦ Social support: “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
◦ Health: the time series of healthy life expectancy at birth.
◦ Freedom: “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
◦ Generosity: the national average of GWP responses to the question, “Have you donated money to a charity in the past month?”, on GDP per capita.
◦ Perception of corruption: the average of binary answers to two GWP questions, “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” Where data for government corruption are missing, the perception of business corruption is used as the overall corruption-perception measure.
Multiple regression

[Regression output highlights:]
• Larger adjusted R-squared
• F-test with a p-value of 0
• Every X variable has a two-tailed t-test (p-value) reported in its column
ANOVA for Multiple Regression
• ANOVA tests for significance of the entire model. That is, it computes
an F-statistic testing the hypotheses:
H 0 : 1   2     k  0
H1 : at least one  j is not 0

P-value = 0.000
Reject H0
Interpreting the coefficients
• α₂ = 0.175, p-value = 0.074
◦ If wealth (log of GDP per capita) increases by 1 unit, holding all the other independent variables constant, the value of happiness will increase by 0.175, significant at the 10% level.
◦ Equivalently, a 1% increase in GDP per capita will increase the happiness score by about 0.175/100 (0.00175), significant at the 10% level.

• α₃ = 3.55, p-value = 0.000
◦ If social support increases by 1 unit, holding all the other independent variables constant, the value of happiness will increase by 3.55, significant at the 1% level.
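The “1%” reading above uses the standard log-level approximation (a worked restatement of the arithmetic, not extra output): holding the other variables constant,
$\Delta \hat{Y} \approx \alpha_2 \,\Delta \ln(\text{GDP}) = \alpha_2 \,\dfrac{\Delta \text{GDP}}{\text{GDP}}$
so a 1% rise in GDP per capita changes predicted happiness by about $\alpha_2 / 100 = 0.175/100 = 0.00175$.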
Should I include a new Xi variable?
• Some argue that a good regression model should include only
significant independent variables.
◦ But it is not always clear exactly what will happen when we add or remove variables from a model: variables that are (or are not) significant in one model may (or may not) be significant in another.
◦ Do not drop all insignificant variables at one time;
◦ take a more structured approach instead.
Should I include a new Xi variable?
• Using adjusted R-square
◦ Adding an independent variable to a regression model often increases the value of R-square.
◦ Adjusted R-square reflects both the number of Xi variables and the sample size.
◦ Adjusted R-square may either increase or decrease when an Xi variable is added or dropped.
◦ An increase in adjusted R-square indicates the model has improved.
◦ But some prefer simpler models (i.e. with fewer Xi variables) when there are only minor differences in the adjusted R-square scores.
Systematic Model Building Approach
1. Consider causal graphs
2. Descriptive analysis & checking for outliers in both Y and X variables
3. Correlation matrix of all available variables
4. Construct a model with all available independent variables and examine the value of
coefficients and p-values for each coefficient.
5. If a coefficient’s p-value is greater than 10%, consider removing that variable and run step 4 again; check the adjusted R-square again as well (see the sketch after this list).
6. Once the majority (or all) of the X variables are statistically significant and the signs of the coefficients are consistent with expectations, you are closer to a good model.
7. Check all assumptions (next week’s learning)
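A rough sketch of the backward-elimination loop implied by steps 4 and 5, assuming a pandas DataFrame df with a dependent-variable column and candidate X columns; this is a simplification of the systematic approach, not a replacement for judgement about causal graphs and expected signs:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df: pd.DataFrame, y_col: str, x_cols: list, p_threshold: float = 0.10):
    """Drop the least significant X variable one at a time until all p-values <= threshold."""
    x_cols = list(x_cols)
    while x_cols:
        X = sm.add_constant(df[x_cols])
        model = sm.OLS(df[y_col], X).fit()
        pvalues = model.pvalues.drop("const")      # ignore the intercept
        worst = pvalues.idxmax()
        if pvalues[worst] <= p_threshold:
            return model                           # all remaining X variables are significant
        x_cols.remove(worst)                       # step 5: drop and re-run step 4
    return None

# Usage (hypothetical column names):
# final_model = backward_eliminate(df, "happiness", ["log_gdp", "social_support", "freedom"])
```

After each drop, the adjusted R-square and the signs of the remaining coefficients should still be inspected, as in steps 5 and 6.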
