Module 3 - Data Analysis_S RM
OVERVIEW
• There are many situations where it is important to
understand the relationship between two variables in a
dataset
• We explore:
• the importance of understanding relationships
between variables in decision making
• some different types of relationships between variables
Suppose we are a car dealership that wants to understand the
relationship between the number of cars sold per year and the
number of years’ experience a car salesperson has
We would expect that the more experience a car salesperson
has, the more cars they are likely to sell
It would be interesting to try and quantify this relationship to
assess the effect experience has on selling cars
Quantifying this relationship is useful for two reasons:
1. It allows us to make predictions about the outcome we are trying to
explore:
e.g., annual car sales
2. It allows us to understand if, or how much, a particular factor
contributes to annual car sales:
e.g., the amount of experience of the salesperson
Two variables are related if a change (an increase or decrease) in
one variable is accompanied by a change in the value of the other
The two variables may either move in the same direction or
in the opposite direction
When considering the relationship between two variables,
we need to look at two aspects of their relationship:
correlation and causation
We will explore these in more detail in the next two videos
Correlation – a measure of strength of linear relationship
between two variables
Correlation coefficient – a number between -1 and 1.
A correlation of 0 indicates that the two variables have no linear
relationship to each other.
A positive correlation coefficient indicates a linear relationship
for which one variable increases as the other also increases.
A negative correlation coefficient indicates a linear relationship
in which one variable increases as the other decreases.
CORRELATION
Correlation is positive when the values increase
together, and
Correlation is negative when one value decreases as
the other increases
A correlation is assumed to be linear (following a
line)
Source: https://www.mathsisfun.com/data/correlation.html
Correlation is measured by the correlation coefficient “r”, which
measures the joint variability of the two variables
This is also called the standardised covariance, and ranges between -1
and 1
The closer to -1, the stronger the negative linear relationship (i.e.,
move in opposite directions)
The closer to 1, the stronger the positive linear relationship (i.e., move
in the same direction)
The closer to 0, the weaker the linear relationship
There is a formula to calculate the correlation coefficient “r”
The formula in Excel is =CORREL(dataA, dataB), where
dataA represents the first data set, and
dataB represents the second data set
In this course, you can calculate correlation using Excel
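For reference, the formula Excel computes is the Pearson correlation coefficient:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the two data sets.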
Use Bon Appetit data to show correlation between different pairs
of variables
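Since the Bon Appetit data itself is not reproduced here, the sketch below uses made-up numbers purely to illustrate the mechanics; in Python, numpy computes the same value as Excel's CORREL:

    import numpy as np

    # Hypothetical pair of variables (placeholders for two Bon Appetit columns)
    ad_spend = np.array([120, 150, 170, 200, 240, 260])
    sales = np.array([5.1, 5.8, 6.0, 6.9, 7.5, 7.8])

    # np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
    r = np.corrcoef(ad_spend, sales)[0, 1]
    print(f"r = {r:.3f}")  # a value near 1 indicates a strong positive linear relationship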
OVERVIEW
• Causation is another important relationship between two variables
that can often be confused with correlation
• We explore:
• the definition of causation
• the relationship between correlation and
causation
• https://www.youtube.com/watch?v=VMUQSMFGBDo
Causation is where one variable or event causes changes in
another variable or event
A high correlation does not always imply causation – you will
often hear the phrase “correlation does not imply causation”
While high correlation might suggest that one variable may
affect the outcome of another, this could be purely coincidental
Correlation does not prove one thing causes the other:
one thing might cause the other
the other might cause the first to happen
they may be linked by a different thing
or it could be random chance!
There can be many reasons why the data has a good
correlation.
EXAMPLE
The ice cream shop finds out how many sunglasses were sold by a large store each day
and compares them to its ice cream sales:
Does this mean that sunglasses make people want ice cream?
We saw how the relationship between variables can
be explored with the creation of scatter plots
We can use a technique called linear regression to
better understand the relationship between two
variables
Regression analysis assists decision-making by allowing a deeper
understanding of the relationship between variables
There are different kinds of regression analysis, but in this course,
we're going to focus on linear regression
Linear regression analysis is used to predict the value of a
variable based on the value of another variable
The variable you want to predict is called the dependent variable
The variable you are using to predict the other variable's value is
called the independent variable
The linear regression equation is: Y= a + bX, where:
Y is the dependent variable (the variable on the Y axis)
X is the independent variable (plotted on the X axis)
b is the slope of the line, and
a is the y-intercept
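As a purely illustrative example with made-up numbers: if a = 5 and b = 3 in the car dealership scenario, then a salesperson with X = 10 years of experience would be predicted to sell $Y = 5 + 3 \times 10 = 35$ cars per year.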
In the simple linear regression model, where y = b0 + b1x + u, we typically
refer to y as the
Dependent Variable, or
Left-Hand Side Variable, or
Explained Variable, or
Regressand
In the simple linear regression of y on x, we typically refer to x as
the
Independent Variable, or
Right-Hand Side Variable, or
Explanatory Variable, or
Regressor, or
Covariate, or
Control Variable
wage = b0 + b1educ + u
A simple first assumption is that the average value of the error
term u in the population is zero:
E(u) = 0
We need to make a crucial assumption about how u and x are
related
We want it to be the case that knowing something about x does
not give us any information about u, so that they are completely
unrelated. That is, that
E(u|x) = E(u) = 0, which implies
E(y|x) = b0 + b1x
E(y|x) as a linear function of x, where for any x
the distribution of y is centered about E(y|x)
[Figure: for two values x1 and x2, the conditional density f(y) is centered on the line E(y|x) = b0 + b1x]
Basic idea of regression is to estimate the population
parameters from a sample
Let {(xi,yi): i=1, …,n} denote a random sample of size n
from the population
For each observation in this sample, it will be the case
that
yi = b0 + b1xi + ui
Population regression line, sample data points
and the associated error terms
[Figure: sample points (x1, y1), ..., (x4, y4) scattered around the population regression line E(y|x) = b0 + b1x, with the error terms u1, ..., u4 drawn as vertical distances from each point to the line]
Intuitively, OLS is fitting a line through the sample points such
that the sum of squared residuals is as small as possible, hence
the term least squares
The residual, û, is an estimate of the error term, u, and is the
difference between the fitted line (sample regression function)
and the sample point
Sample regression line, sample data points
and the associated estimated error terms
[Figure: the same sample points with the fitted sample regression line $\hat{y} = \hat{b}_0 + \hat{b}_1 x$, and the estimated residuals û1, ..., û4 drawn as vertical distances from each point to the fitted line]
To derive the OLS estimates we need to realize that our main
assumption of E(u|x) = E(u) = 0 also implies that
Cov(x,u) = E(xu) = 0
$$\hat{b}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad \text{provided that } \sum_{i=1}^{n}(x_i - \bar{x})^2 > 0$$
The slope estimate is the sample covariance between x and y
divided by the sample variance of x
If x and y are positively correlated, the slope will be positive
If x and y are negatively correlated, the slope will be negative
Only need x to vary in our sample
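A minimal sketch of these estimators in Python (the data values are hypothetical; the intercept formula $\hat{b}_0 = \bar{y} - \hat{b}_1\bar{x}$ is a standard result, which makes the fitted line pass through the point of sample means):

    import numpy as np

    def ols_simple(x, y):
        """Estimate b0 and b1 in the simple regression y = b0 + b1*x + u."""
        x_bar, y_bar = x.mean(), y.mean()
        # Slope: sample covariance of x and y divided by sample variance of x
        b1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
        # Intercept: the fitted line passes through (x_bar, y_bar)
        b0 = y_bar - b1 * x_bar
        return b0, b1

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical data
    b0, b1 = ols_simple(x, y)
    print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")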
We can think of each observation as being made
up of an explained part and an unexplained part,
$y_i = \hat{y}_i + \hat{u}_i$. We then define the following:
$\sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares (SST)
$\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ is the explained sum of squares (SSE)
$\sum_{i=1}^{n}\hat{u}_i^2$ is the residual sum of squares (SSR)
with SST = SSE + SSR
We can compute the fraction of the total sum of squares (SST) that
is explained by the model; call this the R-squared of the regression:
R2 = SSE/SST = 1 – SSR/SST
R2 = coefficient of determination: the proportion of variation
explained by the independent variable (regression model)
$0 \leq R^2 \leq 1$
The square root of R2 is the sample correlation coefficient, r
(where the sign of r is the same as the slope of the fitted line)
Use CEOSAL1
For the population of chief executive officers, let y be annual salary
(salary) in thousands of dollars.
Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1,452.6
indicates a salary of $1,452,600.
Let x be the average return on equity (roe) for the CEO’s firm for the
previous three years. (Return on equity is defined in terms of net
income as a percentage of common equity.) For example, if roe is 10,
then average return on equity is 10%.
Estimate the relationship between this measure of firm performance
and CEO compensation. Comment on R-squared.
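One way this could be done in Python, assuming the CEOSAL1 data has been exported to a CSV file named ceosal1.csv with columns salary and roe (the file name and export step are assumptions for this sketch):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("ceosal1.csv")  # assumed export of the CEOSAL1 dataset
    x = df["roe"].to_numpy()         # average return on equity, in percent
    y = df["salary"].to_numpy()      # annual salary, in thousands of dollars

    # OLS estimates of slope and intercept
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()

    # R-squared = 1 - SSR/SST
    resid = y - (b0 + b1 * x)
    r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
    print(f"salary-hat = {b0:.1f} + {b1:.2f} roe, R-squared = {r2:.3f}")

A low R-squared here would mean that roe explains only a small share of the variation in CEO salaries.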
The OLS estimates of b1 and b0 are unbiased
Proof of unbiasedness depends on our 4 assumptions – if any
assumption fails, then OLS is not necessarily unbiased
Remember unbiasedness is a description of the estimator – in a
given sample we may be “near” or “far” from the true parameter
Now we know that the sampling distribution of our estimate is
centered around the true parameter
Want to think about how spread out this distribution is
Much easier to think about this variance under an additional
assumption, so
Assume Var(u|x) = s2 (Homoskedasticity)
Homoskedastic Case
[Figure: the conditional densities f(y|x) at x1 and x2 have identical spread and are centered on E(y|x) = b0 + b1x]
Heteroskedastic Case
[Figure: the conditional densities f(y|x) at x1, x2, and x3 fan out with increasing spread around E(y|x) = b0 + b1x]
wage = b0 + b1educ + u
If we also make the homoskedasticity assumption, then
Var(u|educ) = s2 does not depend on the level of education, which
is the same as assuming Var(wage|educ) = s2 .
We don’t know what the error variance, s2, is, because we don’t
observe the errors, ui
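Although not shown on these slides, the standard way forward (a well-known result, stated here for completeness) is to estimate the error variance, written s2 above, from the residuals:

$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2 = \frac{SSR}{n-2}$$

dividing by n − 2 because two parameters, b0 and b1, were estimated.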
OVERVIEW
We:
• explore probability distributions
• find examples of normal distributions
• explore confidence intervals and how
they relate to decision making
PROBABILITY DISTRIBUTION
Data can be "distributed" (spread out) in different ways
NORMAL PROBABILITY DISTRIBUTION
There are also many cases where the data tends to be around a central value with no
bias left or right, following the familiar bell-shaped curve of the normal distribution
In calculating the standard deviation, we find that generally:
It is helpful to know the standard deviation, because
we can say that any value is:
likely to be within 1 standard deviation (68 out of 100
should be)
very likely to be within 2 standard deviations (95 out
of 100 should be)
almost certainly within 3 standard deviations (997 out
of 1000 should be)
A value more than three standard deviations from the
mean is likely to be an outlier – a measurement error
or an anomaly.
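These percentages can be checked with a quick simulation (a sketch using simulated data, not part of the original material):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=0.0, scale=1.0, size=100_000)  # simulated normal data

    for k in (1, 2, 3):
        share = np.mean(np.abs(data) <= k)  # fraction within k standard deviations
        print(f"within {k} sd: {share:.3f}")  # expect roughly 0.683, 0.954, 0.997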
STANDARDIZING
Any Normal Distribution can be converted to the Standard
Normal Distribution.
Source: https://www.mathsisfun.com/data/standard-normal-distribution.html
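The conversion uses the z-score:

$$z = \frac{x - \mu}{\sigma}$$

where μ is the mean and σ the standard deviation; the standardised values then have mean 0 and standard deviation 1.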
In business, you sometimes need to give an answer with some
consideration given to how confident you are about that
answer
This is where confidence intervals, which rely on underlying
probability distributions, can help
A useful estimate would indicate a range of values and the
probability that the actual value is within that range
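For example, a common normal-based 95% confidence interval for a population mean (a standard textbook formula, shown here for concreteness) is

$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$$

where $\bar{x}$ is the sample mean, σ the population standard deviation, and n the sample size; the multiplier 1.96 comes from the 95% coverage of the standard normal distribution.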
OVERVIEW
In this video, we:
• examine how to test whether a given hypothesis is
correct based on the available datasets
• take a look at the general hypothesis testing process
• https://www.youtube.com/watch?v=ZzeXCKd5a18
A hypothesis is a statement that might be true
Researchers generally formulate a hypothesis and then collect
data to test whether the hypothesis is true or not
A sample is generally selected from a larger group (the
"population") that will, hopefully, let you find out things about
the larger group
Samples should be chosen randomly
Example: you ask 100 randomly chosen people at a soccer match
what their main job is. Your sample is the 100, while the
population is all the people at that match.
Hypothesis testing involves a null hypothesis and an alternative
hypothesis
H0: The null hypothesis: is a statement of no effect, relationship, or
difference between two or more groups or factors
e.g., There is no difference in the incidence of skin cancer across ages 0 to 5
years.
H1: The alternative hypothesis: is the statement that there is an
effect or difference. This is usually the hypothesis the researcher is
interested in proving.
e.g., The incidence of skin cancer differs with age.
The investigator needs to set a “level of significance” (α)
This is how confident they need to be before they reject the
null hypothesis and accept the alternative hypothesis
A significance level of 5% (α = 0.05) indicates that the
investigator will reject the null hypothesis only if the observed
result would have less than a 5% chance of occurring when the
null hypothesis is actually true
In other words, α is the probability the investigator is willing to
accept of wrongly rejecting a null hypothesis that is in fact true
A p-value:
is the probability that a difference at least as large as the one
observed could have occurred just by random chance, assuming the
null hypothesis is true
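As an illustrative sketch of the whole process on simulated data (scipy's two-sample t-test is one common way to obtain a p-value; the group values below are made up):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    group_a = rng.normal(loc=10.0, scale=2.0, size=50)  # simulated sample A
    group_b = rng.normal(loc=11.0, scale=2.0, size=50)  # simulated sample B

    # Two-sample t-test of H0: the two group means are equal
    t_stat, p_value = stats.ttest_ind(group_a, group_b)

    alpha = 0.05  # significance level
    if p_value < alpha:
        print(f"p = {p_value:.4f} < {alpha}: reject H0")
    else:
        print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")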