Topic 6 Understanding Causality and Regression

Topic 6 Linear Regression

Vincent Hoang (2022), Lecture 9, 10


Camm et al. (2016), Chapter 7
Recall Topic 4: measures of association
• Correlation: two variables are said to have a strong statistical relationship
with one another if they appear to move together.
◦ Positive or negative relationships (direction of the relationship)
◦ Strong, moderate or weak relationships (strength of the relationship)

• If cor(x,y) is positive or negative (regardless of the strength of the correlation),
can we conclude …
◦ x causes y? or
◦ y causes x? or
◦ something else causes both? or
◦ anything else?
Dependence or correlation?
• Dependence:
◦ Variables are dependent on each other if the value of one variable gives
information about the distribution of the other.
◦ What are key statistics of a distribution? For example normal distribution?

• Is statistical correlation always meaningful, especially for
prediction purposes? (i.e. predictive analytics)
• Remember that “correlation does not imply causation”
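A quick simulation makes this concrete. In the made-up example below (all variable names and coefficients are invented for illustration), hot weather drives both ice-cream sales and electricity use, so the two series are strongly correlated even though neither causes the other:

```python
import random

random.seed(42)

# A hidden common cause: daily temperature
temperature = [random.gauss(25, 5) for _ in range(10_000)]

# Two variables that both depend on temperature but not on each other
ice_cream = [0.8 * t + random.gauss(0, 1) for t in temperature]
electricity = [1.5 * t + random.gauss(0, 1) for t in temperature]

def corr(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

r = corr(ice_cream, electricity)
print(f"cor(ice cream, electricity) = {r:.2f}")  # strong, yet neither causes the other
```

The correlation is near 1 purely because of the shared cause, which is exactly why correlation alone cannot establish causation.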
Causality
• Causality describes a relationship between two (or
more) things (phenomena, events, variables, etc.) in
which a change in one causes a change in another.
• In this diagram, A causes B under certain
conditions.
◦ So, if we observe an effect, we can necessarily infer that there is a cause prior
to the effect.
◦ If there is a cause, the effect will not necessarily come about.
◦ But if a cause and all the other required conditions are in place, it is very
likely that the cause will produce its effect(s).
Causal thinking & business decision making
• Two related scenarios
1 Situational assessment
◦ Consider any business situation (i.e. business problem that needs to be
solved)
◦ We would like to assess that situation, then we often ask “how did that
happen?”
◦ Often used in Root Cause Analysis
2 Interventions
Advanced analytics & root cause analysis
• A machine learning model can be trained to
◦ analyse the equipment’s data output under regular “healthy” operating
conditions,
◦ detect “anomalies” (i.e. any pattern of deviation from “healthy” conditions),
◦ predict the “behavioural” pattern of the anomaly, and
◦ send an alert if the predicted values exceed the “normal” threshold.

• Applications: early detection of safety issues, machine failures, more
efficient electricity consumption, predicting quality deviations,
adjusting processes to prevent material waste, etc.
Source: https://medium.datadriveninvestor.com/root-cause-analysis-in-the-age-of-industry-4-0-9516af5fb1d0
Causality & interventions
• Important business decisions involve the use of limited (scarce) resources.
• The trade-off in the form of a resource-allocation decision:
◦ Should resources (time, equipment, land, …) be devoted to project A or project B?

• We can loosen a constraint but that typically requires other scarce resources.
• A decision on which objects to control or change (i.e. a managerial
intervention) typically precedes any decision on how to control or change
them.
• Understanding causality is crucial to making effective interventions.
[Diagram: the intervention is running an advertising campaign; the objective is the outcome we want to increase]
Causal modelling
• Consider a decision on the purchase of new equipment. (In the diagram, an
arrow → denotes a causal relationship; + suggests a positive relationship.)
◦ Quality has two levels: high or low.
◦ High-quality equipment can perform more tasks, hence increases production
productivity, but its parts are more expensive.
◦ Maintenance cost: the greater the quality of the equipment, the more
expensive the parts, hence the higher the maintenance cost.
Foundations for causal graphs
• Causal graphs are directed acyclic
graphs (DAGs). They have
◦ a set of vertices (or nodes) representing
the variables in the model,
◦ a set of edges (or links) representing the
connections between variables,
◦ directed paths between nodes: an arrow
points from a cause to its effect.
◦ There are no cycles in a DAG.
Feedback loops & time dimensions
• Consider a relationship between joy and
physical exercise.
◦ Is there any causal relationship between them?
◦ If yes, which variable is cause and which is
effect?

• We can convert cycles into directed acyclic
graphs by adding a time dimension.
◦ At period 0: joy is a cause leading to more
exercise.
◦ At period 1: feedback from exercise (period 0) to
joy (period 1).
Structures in causal graphs
• There are three building blocks:
◦ Chain: one variable (X) causes another (Y), which causes another (Z).
◦ Fork: one variable (X) causes two other variables (Y & Z). X is a common
cause of both Y and Z.
◦ Collider: two variables (X, Y) cause a third (Z). Z is a common effect of
both X & Y.
Chain
• Example: X learning efforts, Y employability, Z chance of getting a
job.
◦ Y depends on X for its value (hence X and Y are dependent)
◦ Z depends on Y for its value (hence Y and Z are dependent)
◦ Z depends on Y, which depends on X,
◦ hence X and Z are also dependent: the dependence of X and Z arises because Y is free to change.
◦ What if we hold Y constant (fixed)? Then changes in X are not linked to changes in Z.
Therefore, statistically, we say that X and Z are conditionally independent given Y.
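This conditional independence can be checked numerically. The sketch below (simulated data with invented coefficients) builds the chain effort → employability → job chance from the example and compares the raw correlation of X and Z with their correlation after Y's linear influence is regressed out:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Chain from the example: learning effort -> employability -> chance of a job
effort = rng.normal(size=n)                            # X
employability = 2.0 * effort + rng.normal(size=n)      # Y depends on X
job_chance = 1.5 * employability + rng.normal(size=n)  # Z depends only on Y

def residual(a, b):
    """Residual of regressing a on b (removes b's linear influence)."""
    slope, intercept = np.polyfit(b, a, 1)
    return a - (intercept + slope * b)

r_marginal = np.corrcoef(effort, job_chance)[0, 1]
r_given_y = np.corrcoef(residual(effort, employability),
                        residual(job_chance, employability))[0, 1]

print(f"cor(X, Z)     = {r_marginal:.2f}")  # clearly non-zero
print(f"cor(X, Z | Y) = {r_given_y:.2f}")   # close to zero
```

Holding Y fixed (here via residuals) breaks the X–Z association, exactly as the chain structure predicts.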
Fork
• Example: X is temperature, Y sales of ice cream and Z sales of fan.
◦ Y depends on X for its value (X and Y are dependent)
◦ Z depends on X for its value (X and Z are dependent)
◦ We can still say that (statistically) Y and Z are (statistically) dependent because
changes in Y reflect changes in X which lead to changes in Z.
◦ If you calculate correlation values, what would you expect?
◦ Again correlation does not imply causation.
◦ It is easy to see that if we hold X fixed, changes in Y are no longer linked to
changes in Z.
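The same numerical check applies to the fork. In the sketch below (simulated data, invented coefficients), temperature drives both ice-cream and fan sales; the two sales series are correlated, but conditioning on temperature makes the association vanish:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Fork from the example: temperature is a common cause of both sales series
temperature = rng.normal(size=n)                    # X
ice_cream = 1.2 * temperature + rng.normal(size=n)  # Y
fan_sales = 0.9 * temperature + rng.normal(size=n)  # Z

def residual(a, b):
    """Residual of regressing a on b (removes b's linear influence)."""
    slope, intercept = np.polyfit(b, a, 1)
    return a - (intercept + slope * b)

r_marginal = np.corrcoef(ice_cream, fan_sales)[0, 1]
r_given_x = np.corrcoef(residual(ice_cream, temperature),
                        residual(fan_sales, temperature))[0, 1]

print(f"cor(Y, Z)     = {r_marginal:.2f}")  # positive, but not causal
print(f"cor(Y, Z | X) = {r_given_x:.2f}")   # close to zero once X is held fixed
```

The positive correlation between the two sales series is entirely spurious: it disappears once the common cause is held fixed.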
Collider
• X is competence (at work), Y is Networking , Z: Promotion (at work)
• Both X and Y are causes of Z
• X and Z (similarly Y and Z) are dependent
• X and Y are independent: they neither cause the other nor have a
common cause.
◦ However, statistically we can see that if we hold Z fixed, then if X changes, Y
must also change in a certain way. Why?
◦ Hence we say X and Y are conditionally dependent given Z.
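The collider behaves in the opposite way, which the sketch below demonstrates (the promotion threshold and the sample sizes are invented for illustration): competence and networking are generated independently, yet among people who were promoted they become negatively associated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Collider: X and Y independently cause Z
competence = rng.normal(size=n)               # X
networking = rng.normal(size=n)               # Y, independent of X
promoted = competence + networking > 1.0      # Z: promotion needs a high combined score

r_all = np.corrcoef(competence, networking)[0, 1]
r_promoted = np.corrcoef(competence[promoted], networking[promoted])[0, 1]

print(f"cor(X, Y) overall        = {r_all:.2f}")       # near zero: independent
print(f"cor(X, Y) among promoted = {r_promoted:.2f}")  # negative: dependent given Z
```

Why? Within the promoted group, anyone with low competence must have had high networking to clear the promotion bar, so the two variables become negatively related once we condition on the common effect.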
Observed associations
• We can observe associations between two variables in the data.
• However, these associations arise from two mechanisms
◦ Causal associations
◦ Non-causal associations

• So again, correlation (association) does not imply causation.


Draw assumptions before
making conclusions!
• Consider 3 variables: how
many possible causal models are there?
• Statistical association does
not imply causation.
• Hence, it is better to use
knowledge to draw
assumptions (causal graphs)
prior to making conclusions
regarding causality.
Causal modelling for market volume
• Suppose you are asked to make an assessment of the size of the
market for laptop computers.
• The following variables are relevant:
◦ Price: average price per unit
◦ Advertising: the amount of money spent on advertising products
◦ Number of Customers visiting the shop
◦ Media Hype: whether independent media sources report on or display related
products
◦ Market Volume: the total amount of goods sold for your product category
Price & Volume
• The causal relationship between Price & Volume?

• How about the Number of Customers visiting the shop and Volume?

• Any relationship between Price and Number of Customers?


Advertising & Volume
• Do you expect that higher advertising expenditure will lead to higher sales
(market volume)?

• But how about the impact of advertising and number of customers on sales?

• Also how about the effects of advertising and media hype on sales?
Causal model for assessing market volume
• Now we put all elements
together, this is our causal
model for situational
assessment.
• Note that there is no (business)
goal / objective in terms of
optimisation or decision
making.
• Rather, it assesses how causal
factors affect the market volume.
Causal modelling for Interventions
• Example 2: instead of doing a situational assessment, you are now
asked to decide how much to spend on advertising for these
products.
◦ You need to set an objective, e.g. high market share (the proportion of sales
through your retailers to the total number sold).
◦ So the decision variable is “Advertise”.
◦ Simplify the intervention decision: (1) run an advertising campaign or (2) do not.
◦ Further simplify that you will know the price at the time you set “Advertise”.
Influence diagram
• Often, rectangle shape refers to strategic
option (i.e. decision point, choice variable,
value directly controlled by a strategic agent –
decision making agent)
• Hexagon shape refers to the objective (e.g.
profit, value, market share, etc.). Decisions are
made to optimise the objective.
• Circle shape refers to probabilistic variables
that are chance variables, uncertain
quantities, environmental factors and other
elements outside the direct control of strategic
agents.
[+] More advertising leads to a greater certainty of a larger number of unit sales.
[+] Higher sales lead to a greater certainty of a high market share.
[-] Higher price leads to a greater certainty of a smaller number of unit sales.
The link from Price to Advertise is only an informational link: you know the price when deciding on Advertise.
From causal diagrams to mathematical equations
• The simplest form of empirical model would be a regression model of the form
Y = β₀ + β₁x₁ + ⋯ + β₆x₆ + ε,
where Y is the dependent variable, the β coefficients correspond to the
independent variables x₁ … x₆, and ε is called the error term.
• This equation fails to capture the actual relationships among the
independent variables (x₁ … x₆).
Shortcomings
• Consider X1, X2 and X4:
associations among these
variables are clear, hence we call
that this model suffers from
multicollinearity problem.
• Also, we cannot use standard
significance tests to reliably
determine which independent
variables exert the most influence.
A solution (not discussed further in this unit)
• It is possible to use a structural equation model (SEM), estimated
via a two-stage regression.
• Stage 1: regress the problematic independent variable on its causes.
• Stage 2: use the estimated value of that independent variable, obtained
from the stage-1 regression, in the main equation.
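A minimal numerical sketch of the two-stage idea, assuming a simple simulated structure (the variables, coefficients, and data are all invented and are not the unit's worked example): stage 1 regresses the troublesome independent variable on its own cause, and stage 2 replaces it with its fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Simulated causal structure: w causes x, and x causes y (true effect of x on y = 2.0)
w = rng.normal(size=n)
x = 1.5 * w + rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Stage 1: regress x on its cause w, keep the fitted values
A1 = np.column_stack([np.ones(n), w])
b1, *_ = np.linalg.lstsq(A1, x, rcond=None)
x_hat = A1 @ b1

# Stage 2: regress y on the stage-1 fitted values of x
A2 = np.column_stack([np.ones(n), x_hat])
b2, *_ = np.linalg.lstsq(A2, y, rcond=None)

print(f"stage-2 slope estimate: {b2[1]:.2f}")  # close to the true effect of 2.0
```

The stage-2 slope recovers the structural effect of x on y because only the part of x explained by its cause w is used.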
Summaries
• Causal relationships are crucial for (1) situational assessments and (2)
interventions, as part of business analytics.
• If there is a cause-and-effect relationship between two variables x and y, there is
statistical association.
• But (statistical) correlation/association does not necessarily imply causation.
• Causal thinking and graphs are very useful because
◦ They capture both causality and statistical association
◦ They assist with both situational assessment and intervention tasks in business analytics
◦ From a managerial perspective, they allow identification of the relevant stakeholders (agents,
people, departments, etc.) involved in analytics projects, as well as resource allocation.
Analytics & Happiness
• What value does business analytics deliver?
◦ Happiness/satisfaction matters in every corner of our lives: overall life, work,
school, business, etc.
◦ The overall aim is to increase satisfaction.
◦ Situational analysis informs interventions: how?

• We use the happiness case study to illustrate regression analysis.


Your satisfaction (happiness) matters!
• Discuss the following questions from your own experience and
knowledge
◦ What makes you happy = what are the causes of your own happiness?
◦ What makes you sad = what are the causes of your own sadness?

• Draw a causal graph (with directed paths)


Happiness
and Income

Source: World Happiness Report 2024


Life Satisfaction & Income across Countries in 2023
• Let’s plot the data. We can add a trendline in Excel.
[Scatter plot: Life Ladder (vertical axis) against Log GDP per capita (horizontal axis, roughly 3.000–8.000), with the fitted trendline f(x) = 0.826x + 4.851 and R² = 0.613]
• A linear relationship: the Happiness score is the dependent variable;
Log GDP per capita is the independent variable.
Excel Trendline Tool
• Right-click on the data series and choose Add Trendline from the pop-up menu.
• Check the boxes Display Equation on chart and Display R-squared value on chart.
Simple linear regression using least-square
• Simple linear regression model:
Y = β₀ + β₁X + ε   (8.1)
• We estimate the parameters (βs) from the sample data:
Ŷ = b₀ + b₁X   (8.2)
• Once estimated, we can
◦ assess/explain whether X is an important factor explaining Y,
◦ “predict” the value of Y given a specific value of X: Ŷᵢ = b₀ + b₁Xᵢ
Least square regression
• Residuals are the observed errors associated with estimating the value of the
dependent variable using the regression line:
eᵢ = Yᵢ − Ŷᵢ   (8.3)
• The best-fitting line minimizes the sum of
squares of the residuals.
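The least-squares line can be computed directly from these formulas. A small sketch with made-up (x, y) observations, not the happiness dataset:

```python
import numpy as np

# Made-up (x, y) observations for illustration
x = np.array([3.2, 4.1, 5.0, 5.8, 6.7, 7.5, 8.3])
y = np.array([3.9, 4.6, 5.2, 5.7, 6.5, 6.9, 7.8])

# Least-squares estimates: b1 = S_xy / S_xx, b0 = ybar - b1 * xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

y_hat = b0 + b1 * x    # fitted values, equation (8.2)
residuals = y - y_hat  # e_i = Y_i - Yhat_i, equation (8.3)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"sum of residuals = {residuals.sum():.6f}")  # ~0 by construction
```

The residuals sum to (essentially) zero because the least-squares line always passes through (x̄, ȳ).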
Simple Linear Regression with Excel
• Using Analysis Toolpak:
◦ Data > Data Analysis > Regression
Results: Regression Statistics (metrics)
• Multiple R:
◦ sample correlation coefficient
◦ varies from -1 to + 1
◦ negative if slope is negative

• R Square:
◦ coefficient of determination
◦ varies from 0 (no fit) to 1 (perfect fit)

• Adjusted R Square:
◦ Adjusted R square for sample size
and number of X variables.

• Standard error: variability between
observed and predicted Y values
Interpreting Regression Statistics
• R square = 0.613 means that
61.3% of the variation in the
happiness level is explained
by the model, in this case by
the log value of income per
capita.
• The remaining 38.7% (100% −
61.3%) is UNEXPLAINED.
• Adjusted R-square is often
used.
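R² and adjusted R² follow directly from the residuals. A sketch using the least-squares formulas on illustrative data (not the World Happiness figures):

```python
import numpy as np

# Illustrative, roughly linear data
x = np.array([3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([4.2, 4.9, 5.1, 6.0, 6.3, 7.1, 7.2, 8.1])

# Fit by least squares
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# R^2 = 1 - SSE/SST: share of the variation in y explained by the model
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2 = 1 - sse / sst

# Adjusted R^2 penalises for sample size n and number of predictors k
n, k = len(y), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Adjusted R² is always at most R², and the gap widens as predictors are added or the sample shrinks.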
F-test (Analysis of Variance)
• ANOVA conducts an F - test to determine whether variation in Y is
due to varying levels of X.
• ANOVA is used to test for significance of regression:
◦ H0: population slope coefficient = 0
◦ H1: population slope coefficient ≠ 0

• Excel reports the p-value (Significance F).
• Rejecting H0 indicates that X explains variation in Y.
Interpreting Coefficients
• Intercept: often not important
• Log GDP per capita: 3
Y   0  1 X   8.1
elements
◦ Direction of the relationship: positive
value
◦ The magnitude of the relationship:
0.742, meaning that for each one-point
increase in the Log GDP per capita, the
happiness level increase by 0.742.
◦ Statistical strength of the relationship:
Interpreting Coefficients
• Log GDP per capita:
◦ The statistical strength of the relationship
can be assessed using hypothesis testing
(model: Y = β₀ + β₁X + ε).
Testing Hypotheses for Regression Coefficients
• We would like to test if the coefficient on log(GDP) is statistically
significantly different from zero.
• If the coefficient (β₁) = 0, what does this mean?
• If the coefficient (β₁) ≠ 0, what does this mean? (You should also consider
one-tailed tests.)
◦ Test statistic: t = (b₁ − 0) / standard error(b₁)   (8.8)
◦ P-value approach
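The test statistic in (8.8) can be computed from the regression fit. A sketch with simulated data (the true slope and noise level are invented); for large samples, |t| greater than about 2 corresponds to a two-tailed p-value below 5%:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

x = rng.normal(7, 1, size=n)               # stand-in for log GDP per capita
y = 4.8 + 0.8 * x + rng.normal(0, 0.7, n)  # true slope 0.8 (made up)

# Least-squares slope and its standard error
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # residual standard error, n-2 dof
se_b1 = s / np.sqrt(sxx)

t_stat = (b1 - 0) / se_b1  # equation (8.8): testing H0: beta1 = 0
print(f"b1 = {b1:.3f}, se = {se_b1:.3f}, t = {t_stat:.1f}")
```

Here the t statistic is far above 2, so H0: β₁ = 0 would be rejected, mirroring the near-zero p-value in the happiness regression.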
Interpreting Coefficients
• Log GDP per capita:
◦ We can use the p-value to assess a two-tailed
test (model: Y = β₀ + β₁X + ε).
◦ H0: β₁ = 0 vs H1: β₁ ≠ 0
◦ In this example, the p-value is nearly zero (<
5%), hence there is sufficient evidence to
conclude that the true β₁ is not zero. This means
that there exists a relationship between the
happiness level and the log of GDP per capita.
◦ We can also conduct a one-tailed test.
Confidence Intervals for Regression Coefficient
• Confidence intervals (Lower
95% and Upper 95% values in
the output) provide information
about the unknown values of
the true regression
coefficients, accounting for
sampling error.
• For this example, a 95%
confidence interval for the
income variable is
[0.638;0.845].
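For large samples, the 95% confidence interval is approximately b₁ ± 1.96 × se(b₁). The sketch below back-solves an implied standard error from the reported interval [0.638, 0.845] (an approximation introduced for illustration; real regression output reports the standard error directly):

```python
# Reported coefficient and an implied standard error (approximate)
b1 = 0.742
se_b1 = (0.845 - 0.638) / (2 * 1.96)  # half-width of the 95% CI divided by t* ~ 1.96

lower = b1 - 1.96 * se_b1
upper = b1 + 1.96 * se_b1
print(f"95% CI for the income slope: [{lower:.3f}, {upper:.3f}]")
```

Reconstructing the interval this way recovers the reported bounds to within rounding, confirming the ± t* × se arithmetic.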
Prediction
• Intercept: often not important
• If you know the value of Log GDP per
capita (model: Y = β₀ + β₁X + ε), you can
predict the value of the happiness level.
• Predicted happiness level for Vietnam = −1.411 +
0.742 × 9.392 = 5.558
Confidence Intervals & Prediction
• Although we predicted 5.558 for Vietnam
(−1.411 + 0.742 × 9.392),
• if the true population parameters are at
the extremes of the confidence interval,
the estimate might be as low as
−1.411 + 0.638 × 9.392 = 4.581
or as high as
−1.411 + 0.845 × 9.392 = 6.525
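These figures are just the regression equation evaluated at Vietnam's log GDP per capita with the point estimate and the two interval endpoints; a quick arithmetic check:

```python
log_gdp_vietnam = 9.392
intercept = -1.411

# Evaluate the fitted equation at the point estimate and the 95% CI endpoints
for slope, label in [(0.742, "point estimate"),
                     (0.638, "lower 95%"),
                     (0.845, "upper 95%")]:
    predicted = intercept + slope * log_gdp_vietnam
    print(f"{label}: {predicted:.3f}")
```
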
Residual analysis
• Residual = Actual Y value − Predicted Y value
• Standardized residual = residual / standard deviation of the residuals

• Outliers: standardized residuals outside ±2 or ±3 are potential outliers.

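Standardized residuals and the ±2/±3 outlier screen can be computed directly. A sketch on made-up data with one deliberately planted outlier:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100

x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.5, n)
y[10] += 5.0  # plant one clear outlier at index 10

# Fit by least squares and compute residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Standardized residual = residual / standard deviation of the residuals
std_residuals = residuals / residuals.std(ddof=2)
outliers = np.where(np.abs(std_residuals) > 3)[0]
print(f"potential outliers at indices: {outliers}")
```

The planted observation stands out well beyond ±3 standardized residuals, which is exactly the pattern a residual plot makes visible.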
Residual Outputs – Residual Plot
[Residual plot: Residuals (vertical axis, roughly −3 to 3) against Log GDP per capita (horizontal axis, 6.000–12.000)]

What are drawbacks of simple linear
regression models?
• Consider the case of happiness:
Multiple linear regression
• Consider the case study of happiness (at national level)
• What are possible “causes” and/or “factors” that can explain
variations in the Happiness level across countries?
Multiple linear regression
• A linear regression model with more than one independent variable is
called a multiple linear regression model:
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₖXₖ + ε   (8.10)
◦ Y is the dependent variable; the Xᵢ are the independent (explanatory) variables;
◦ the βᵢ are the regression coefficients for the independent variables; ε is the error term.
• We estimate the partial regression coefficients bᵢ:
Ŷ = b₀ + b₁X₁ + b₂X₂ + ⋯ + bₖXₖ   (8.11)
A causal graph
• We are ignoring here possible causal and statistical relationships among the independent variables.
[Diagram: Wealth (+), Freedom (+), Social support (+), Health (+), Generosity (+), Perception of corruption (−), and Others → Happiness]
Happiness (Cantril Ladder) and its correlates
• Social support: “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
• Healthy life expectancy: the time series of healthy life expectancy at birth.
• Freedom: “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
• Generosity: the residual of regressing the national average of GWP responses to the question “Have you donated money to a charity in the past month?” on GDP per capita.
• Perceptions of corruption: the average of binary answers to two GWP questions, “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” Where data for government corruption are missing, the perception of business corruption is used as the overall corruption-perception measure.
[Each panel: scatter of Life Ladder against the variable, with fitted values]
Multiple regression
• Larger adjusted R-squared.
• F-test with a p-value of 0.
• Every X variable has a two-tailed t-test reported in the p-value column.
ANOVA for Multiple Regression
• ANOVA tests for the significance of the entire model. That is, it computes
an F-statistic testing the hypotheses:
H₀: β₁ = β₂ = ⋯ = βₖ = 0
H₁: at least one βⱼ is not 0
Interpreting the coefficients
• b(wealth) = 0.175, p-value = 0.074:
if wealth (log of GDP per capita) increases by 1 unit, holding all the other
independent variables constant, the value of happiness will increase by 0.175;
significant at the 10% level.
• b(social support) = 3.55, p-value = 0.000:
if social support increases by 1 unit, holding all the other independent variables
constant, the value of happiness will increase by 3.55; significant at the 1% level.
Should I include a new Xi variable?
• Some argue that a good regression model should include only
significant independent variables.
◦ But it is not always clear exactly what will happen when we add or remove variables
from a model: variables that are (or are not) significant in one model may (or may
not) be significant in another.
◦ Do not drop all insignificant variables at once;
◦ take a more structured approach instead.
Should I include a new Xi variable?
• Using adjusted R-square
◦ Adding an independent variable to a regression model often increases the value of
R-square.
◦ Adjusted R-square reflects both the number of Xᵢ variables and the sample size.
◦ Adjusted R-square may either increase or decrease when an Xᵢ variable is added
or dropped.
◦ An increase in adjusted R-square indicates the model has improved.
◦ But some prefer simpler models (i.e. with fewer Xᵢ variables) when there are only
minor differences in the adjusted R-square scores.
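This trade-off can be demonstrated by adding a pure-noise variable to a regression (simulated data, invented coefficients): R² never decreases when a variable is added, while adjusted R² may move either way.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60

x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, n)
noise = rng.normal(size=n)  # an irrelevant candidate variable

def r2_and_adj(X, y):
    """Fit OLS and return (R^2, adjusted R^2)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    k = X.shape[1] - 1  # number of predictors, excluding the intercept
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise])

# R^2 never decreases when a column is added; adjusted R^2 may rise or fall
print("without noise:", [round(v, 4) for v in r2_and_adj(X_small, y)])
print("with noise:   ", [round(v, 4) for v in r2_and_adj(X_big, y)])
```

Because adjusted R² charges a price for each extra predictor, it is the better yardstick when comparing models with different numbers of X variables.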
Systematic Model Building Approach
1. Consider causal graphs.
2. Descriptive analysis and checking for outliers in both the Y and X variables.
3. Correlation matrix of all available variables.
4. Construct a model with all available independent variables and examine the value of the
coefficients and the p-value for each coefficient.
5. If a p-value > 10%, consider removing that variable and run step 4 again. You should also
check the adjusted R-square again.
6. Once the majority (or all) of the X variables are statistically significant and the signs of the
coefficients are consistent with expectations, you are closer to a good model.
7. Check all assumptions (next week’s learning).
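Steps 4–6 can be sketched as a simple backward-elimination loop (simulated data; the variable names, the 1.7 cutoff as a rough 10% two-tailed threshold, and the stopping rule are illustrative assumptions, not a substitute for judgement or the causal graph in step 1):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200

# Two genuine predictors and one irrelevant one (all simulated)
X_all = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X_all[:, 0] + 1.5 * X_all[:, 1] + rng.normal(0, 1, n)
names = ["x1", "x2", "x3"]

def fit(Xc, y):
    """OLS fit; returns coefficients and their t statistics (intercept first)."""
    X = np.column_stack([np.ones(len(y)), Xc])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return b, b / se

cols = [0, 1, 2]
while len(cols) > 1:
    b, t = fit(X_all[:, cols], y)
    weakest = int(np.argmin(np.abs(t[1:])))  # ignore the intercept's t
    if abs(t[1:][weakest]) >= 1.7:           # roughly a 10% two-tailed cutoff
        break                                # all remaining variables look significant
    print("dropping", names[cols[weakest]])
    cols.pop(weakest)

print("kept:", [names[c] for c in cols])
```

With most random draws the irrelevant x3 is dropped while the two genuine predictors survive; in practice each removal should also be checked against the adjusted R-square, as step 5 says.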
