Module 3 - Data Analysis


OVERVIEW
• There are many situations where it is important to
understand the relationship between two variables in a
dataset
• We explore:
• the importance of understanding relationships
between variables in decision making
• some different types of relationships between variables
 Suppose we are a car dealership that wants to understand the
relationship between the number of cars sold per year and the
number of years’ experience a car salesperson has
 We would expect that the more experience a car salesperson
has, the more cars they are likely to sell
 It would be interesting to try to quantify this relationship to assess the effect experience has on selling cars
Quantifying this relationship is useful for two reasons:
1. It allows us to make predictions about what we are trying to explore:
e.g., annual car sales
2. It allows us to understand if, or how much, a particular factor
contributes to annual car sales:
e.g., the amount of experience of the salesperson
 Two variables are related if an increase or decrease in one is accompanied by a change in the value of the other
 The two variables may either move in the same direction or
in the opposite direction
 When considering the relationship between two variables,
we need to look at two aspects of their relationship:
correlation and causation
 We will explore these in more detail in the next two videos
 Correlation – a measure of strength of linear relationship
between two variables
 Correlation coefficient – a number between -1 and 1.
 A correlation of 0 indicates that the two variables have no linear
relationship to each other.
 A positive correlation coefficient indicates a linear relationship
for which one variable increases as the other also increases.
 A negative correlation coefficient indicates a linear relationship in which one variable increases as the other decreases.
CORRELATION
 Correlation is positive when the values increase
together, and
 Correlation is negative when one value decreases as
the other increases
 A correlation is assumed to be linear (following a
line)

 Source: https://www.mathsisfun.com/data/correlation.html
 Correlation is measured by the correlation coefficient “r”, which
measures the joint variability of the two variables
 This is also called the standardised covariance, and ranges between -1
and 1
 The closer to -1, the stronger the negative linear relationship (i.e.,
move in opposite directions)
 The closer to 1, the stronger the positive linear relationship (i.e., move
in the same direction)
 The closer to 0, the weaker the linear relationship
 There is a formula to calculate the correlation coefficient “r”
 The formula in Excel is =CORREL(dataA, dataB)
 dataA represents the first data range
 dataB represents the second data range
 In this course, you can calculate correlation using Excel
 Use Bon Appetit data to show correlation between different pairs
of variables
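For those working outside Excel, here is a minimal Python sketch of the same calculation (the small x and y arrays are made-up illustration values, not the Bon Appetit data):

    import numpy as np

    # Made-up illustration data: years of experience (x) and cars sold per year (y)
    x = np.array([1, 2, 3, 5, 8, 10])
    y = np.array([12, 15, 18, 24, 30, 35])

    # np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
    r = np.corrcoef(x, y)[0, 1]
    print(f"r = {r:.3f}")  # close to 1: a strong positive linear relationship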
OVERVIEW
• Causation is another important relationship between two variables
that can often be confused with correlation
• We explore:
• the definition of causation
• the relationship between correlation and
causation
• https://www.youtube.com/watch?v=VMUQSMFGBDo
 Causation is where one variable or event causes changes in
another variable or event
 A high correlation does not always imply causation – you will
often hear the phrase “correlation does not imply causation”
 While high correlation might suggest that one variable may
affect the outcome of another, this could be purely coincidental
 Correlation does not prove one thing causes the other:
 one thing might cause the other
 the other might cause the first to happen
 they may be linked by a different thing
 or it could be random chance!
 There can be many reasons why two variables show a strong correlation.
EXAMPLE
An ice cream shop finds out how many sunglasses were sold by a large store each day and compares them to its ice cream sales:

Does this mean that sunglasses make people want ice cream?
 We saw how the relationship between variables can
be explored with the creation of scatter plots
 We can use a technique called linear regression to
better understand the relationship between two
variables
 Regression analysis assists decision-making by allowing a deeper
understanding of the relationship between variables
 There are different kinds of regression analysis, but in this course,
we're going to focus on linear regression
 Linear regression analysis is used to predict the value of a
variable based on the value of another variable
 The variable you want to predict is called the dependent variable
 The variable you are using to predict the other variable's value is
called the independent variable
 The linear regression equation is: Y= a + bX, where:
 Y is the dependent variable (the variable on the Y axis)
 X is the independent variable (plotted on the X axis)
 b is the slope of the line, and
 a is the y-intercept
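As an illustration, the intercept a and slope b can be estimated by least squares. This is a sketch on made-up data using NumPy's polyfit, which fits a degree-1 polynomial (a straight line):

    import numpy as np

    # Made-up data: X is the independent variable, Y the dependent variable
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    # Degree-1 polyfit returns [b, a] for the least-squares line Y = a + bX
    b, a = np.polyfit(X, Y, 1)
    print(f"Y = {a:.2f} + {b:.2f}X")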
 In the simple linear regression model, where y = b0 + b1x + u, we typically
refer to y as the
 Dependent Variable, or
 Left-Hand Side Variable, or
 Explained Variable, or
 Regressand

 In the simple linear regression of y on x, we typically refer to x as
the
 Independent Variable, or
 Right-Hand Side Variable, or
 Explanatory Variable, or
 Regressor, or
 Covariate, or
 Control Variable

 wage = b0 + b1educ + u
 wage: measured in dollars per hour
 educ: years of education
 b1 measures the change in hourly wage given another year of education, holding all other factors fixed.
 Some of those factors include labor force experience, innate ability, tenure with current employer, work ethic, and numerous other things.
 Use WAGE1
 For the population of people in the workforce in 1976, let y =
wage, where wage is measured in dollars per hour. Thus, for a
particular person, if wage = 6.75, the hourly wage is $6.75.
 Let x = educ denote years of schooling; for example, educ = 12
corresponds to a complete high school education.
 Estimate the impact of education on wage
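One way to run this estimate in Python, as a sketch: it assumes the third-party wooldridge package (which distributes the textbook datasets, with the relevant columns named wage and educ) and statsmodels are installed:

    import wooldridge
    import statsmodels.formula.api as smf

    # Load WAGE1 (1976 CPS data) from the wooldridge package
    df = wooldridge.data('wage1')

    # OLS of hourly wage on years of education: wage = b0 + b1*educ + u
    model = smf.ols('wage ~ educ', data=df).fit()
    print(model.params)     # b0 (Intercept) and b1 (educ)
    print(model.rsquared)   # fraction of wage variation explained by educ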
 The average value of u, the error term, in the population is 0. That is,
 E(u) = 0
 This is not a restrictive assumption, since we can always use b0 to normalize E(u) to 0
 We need to make a crucial assumption about how u and x are
related
 We want it to be the case that knowing something about x does not give us any information about u, so that they are completely unrelated. That is,
 E(u|x) = E(u) = 0, which implies
 E(y|x) = b0 + b1x

Figure: E(y|x) as a linear function of x – for any value of x, the distribution of y (density f(y)) is centered about E(y|x) = b0 + b1x.
 The basic idea of regression is to estimate the population parameters from a sample
 Let {(xi,yi): i=1, …,n} denote a random sample of size n
from the population
 For each observation in this sample, it will be the case
that
 yi = b0 + b1xi + ui

Figure: the population regression line E(y|x) = b0 + b1x drawn through sample data points (x1, y1) … (x4, y4), with the associated error terms u1 … u4 shown as vertical distances from each point to the line.
 Intuitively, OLS is fitting a line through the sample points such
that the sum of squared residuals is as small as possible, hence
the term least squares
 The residual, û, is an estimate of the error term, u, and is the
difference between the fitted line (sample regression function)
and the sample point

Figure: the sample regression line ŷ = b̂0 + b̂1x drawn through the same sample points, with the estimated error terms (residuals) û1 … û4 shown as vertical distances from each point to the fitted line.
 To derive the OLS estimates we need to realize that our main
assumption of E(u|x) = E(u) = 0 also implies that

 Cov(x,u) = E(xu) = 0

 The OLS slope estimate is
b̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)², with the sums running over i = 1, …, n,
 provided that Σ (xi − x̄)² > 0
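A direct translation of this formula into Python (a sketch on made-up data; the intercept estimate then follows from the standard result b̂0 = ȳ − b̂1x̄):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up sample
    y = np.array([2.0, 4.5, 5.5, 8.0, 9.5])

    xbar, ybar = x.mean(), y.mean()
    b1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0_hat = ybar - b1_hat * xbar
    print(f"b1_hat = {b1_hat:.3f}, b0_hat = {b0_hat:.3f}")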
 The slope estimate is the sample covariance between x and y
divided by the sample variance of x
 If x and y are positively correlated, the slope will be positive
 If x and y are negatively correlated, the slope will be negative
 We only need x to vary in our sample
We can think of each observation as being made up of an explained part and an unexplained part, yi = ŷi + ûi. We then define the following:
 Σ (yi − ȳ)² is the total sum of squares (SST)
 Σ (ŷi − ȳ)² is the explained sum of squares (SSE)
 Σ ûi² is the residual sum of squares (SSR)
Then SST = SSE + SSR
 How do we think about how well our sample regression line fits
our sample data?

We can compute the fraction of the total sum of squares (SST) that is explained by the model; call this the R-squared of the regression

 R2 = SSE/SST = 1 – SSR/SST
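Continuing the earlier made-up example, this decomposition and the R-squared can be checked directly:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # same made-up sample as before
    y = np.array([2.0, 4.5, 5.5, 8.0, 9.5])
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    y_hat = b0 + b1 * x      # fitted values
    u_hat = y - y_hat        # residuals

    SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
    SSE = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
    SSR = np.sum(u_hat ** 2)               # residual sum of squares

    print(np.isclose(SST, SSE + SSR))      # True: SST = SSE + SSR
    print("R-squared:", SSE / SST)         # equals 1 - SSR/SST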

 R2 = coefficient of determination: the proportion of variation
explained by the independent variable (regression model)
0 ≤ R2 ≤ 1
 The square root of R2 is the sample correlation coefficient, r
(where the sign of r is the same as the slope of the fitted line)
 Use CEOSAL1
 For the population of chief executive officers, let y be annual salary
(salary) in thousands of dollars.
 Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1,452.6 indicates a salary of $1,452,600.
 Let x be the average return on equity (roe) for the CEO’s firm for the
previous three years. (Return on equity is defined in terms of net
income as a percentage of common equity.) For example, if roe is 10,
then average return on equity is 10%.
 Estimate the relationship between this measure of firm performance
and CEO compensation. Comment on R-squared.
 The OLS estimates of b1 and b0 are unbiased
 Proof of unbiasedness depends on our 4 assumptions – if any
assumption fails, then OLS is not necessarily unbiased
 Remember unbiasedness is a description of the estimator – in a
given sample we may be “near” or “far” from the true parameter

 Now we know that the sampling distribution of our estimate is
centered around the true parameter
 Want to think about how spread out this distribution is
 Much easier to think about this variance under an additional
assumption, so
 Assume Var(u|x) = σ² (homoskedasticity)

Figure: the homoskedastic case – the conditional densities f(y|x) at x1 and x2 have the same spread around the line E(y|x) = b0 + b1x.
Figure: the heteroskedastic case – the spread of f(y|x) around the line E(y|x) = b0 + b1x changes with x (here it increases from x1 to x3).
 wage = b0 + b1educ + u
 If we also make the homoskedasticity assumption, then Var(u|educ) = σ² does not depend on the level of education, which is the same as assuming Var(wage|educ) = σ².
 We don't know what the error variance, σ², is, because we don't observe the errors, ui
 What we observe are the residuals, ûi
 We can use the residuals to form an estimate of the error variance: σ̂² = SSR / (n − 2), where dividing by n − 2 rather than n corrects for the two estimated parameters b0 and b1
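Continuing the same made-up example, a sketch of this estimate:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # same made-up sample as before
    y = np.array([2.0, 4.5, 5.5, 8.0, 9.5])
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    u_hat = y - (b0 + b1 * x)                  # residuals

    n = len(y)
    sigma2_hat = np.sum(u_hat ** 2) / (n - 2)  # SSR/(n-2): b0 and b1 use two degrees of freedom
    print("estimated error variance:", sigma2_hat)
    print("standard error of the regression:", np.sqrt(sigma2_hat))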
OVERVIEW
We:
• explore probability distributions
• find examples of normal distributions
• explore confidence intervals and how
they relate to decision making
PROBABILITY DISTRIBUTION
Data can be "distributed" (spread out) in different ways
NORMAL PROBABILITY DISTRIBUTION
There are also many cases where the data tends to be around a central value with no bias left or right, like this:
• The blue curve is a normal distribution
• The yellow histogram shows some data that follows it closely, but not perfectly (which is usual)
• Often called a bell curve
NORMAL PROBABILITY DISTRIBUTION
Many things closely follow a normal distribution:
• heights of people
• size of things produced by machines
• errors in measurements
• blood pressure
• marks on a test
STANDARD DEVIATION
Standard deviation is a measure of how spread out numbers are.
In calculating the standard deviation, we find that, generally:
 It is helpful to know the standard deviation, because
we can say that any value is:
 likely to be within 1 standard deviation (68 out of 100
should be)
 very likely to be within 2 standard deviations (95 out
of 100 should be)
 almost certainly within 3 standard deviations (997 out
of 1000 should be)
 A value more than three standard deviations from the
mean is likely to be an outlier – a measurement error
or an anomaly.
STANDARDIZING
Any normal distribution can be converted to the standard normal distribution.
To convert a value to a standard score (z-score):
• first subtract the mean,
• then divide by the standard deviation
Source: https://www.mathsisfun.com/data/standard-normal-distribution.html
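A two-line sketch of the conversion (the numbers are made up):

    # Convert a value to a z-score: subtract the mean, then divide by the SD
    value, mean, sd = 26.0, 38.8, 11.4   # made-up numbers
    z = (value - mean) / sd
    print(f"z = {z:.2f}")                # about -1.12: 1.12 SDs below the mean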
 In business, you sometimes need to give an answer along with an indication of how confident you are in that answer
 This is where confidence intervals can help; they rely on underlying probability distributions
 A useful estimate indicates a range of values together with the probability that the actual value falls within that range
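As a common example, a 95% confidence interval for a population mean can be built from a sample. This sketch uses made-up data and the normal approximation (with small samples, a t critical value would replace 1.96):

    import numpy as np

    sample = np.array([5.2, 4.8, 6.1, 5.5, 5.9, 4.7, 5.3, 6.0, 5.6, 5.1])  # made-up data
    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean

    # 95% interval under the normal approximation: mean +/- 1.96 standard errors
    low, high = mean - 1.96 * se, mean + 1.96 * se
    print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")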
OVERVIEW
In this video, we:
• examine how to test whether a given hypothesis is
correct based on the available datasets
• take a look at the general hypothesis testing process
• https://www.youtube.com/watch?v=ZzeXCKd5a18
 A hypothesis is a statement that might be true
 Researchers generally formulate a hypothesis and then collect
data to test whether the hypothesis is true or not
 A sample is generally selected from a larger group (the
"population") that will, hopefully, let you find out things about
the larger group
 Samples should be chosen randomly
 Example: you ask 100 randomly chosen people at a soccer match
what their main job is. Your sample is the 100, while the
population is all the people at that match.
 Hypothesis testing involves a null hypothesis and an alternative
hypothesis
 H0: The null hypothesis: is a statement of no effect, relationship, or
difference between two or more groups or factors
 e.g., There is no difference in the incidence of skin cancer across ages 0 to 5
years.
 H1: The alternative hypothesis: is the statement that there is an
effect or difference. This is usually the hypothesis the researcher is
interested in proving.
 e.g., The incidence of skin cancer differs with age.
 The investigator needs to set a "level of significance" (α)
 This sets how strong the evidence must be before they reject the null hypothesis and accept the alternative hypothesis
 A significance level of 5% (α = 0.05) indicates that the investigator will reject the null hypothesis only if there is less than a 5% probability of seeing results this extreme when the null hypothesis is actually true
 In other words, the alternative hypothesis is accepted only if the observed results would be very unlikely (probability below 5%) under the null hypothesis
 A p-value:
 is a measure of the probability that an observed difference could have occurred just by random chance
 helps determine the significance of the results in relation to the null hypothesis
 Once the p-value is determined, the outcome of the hypothesis
test follows:
 If the p-value is less than or equal to α (significance level), then
the null hypothesis is rejected and the alternative hypothesis is
accepted
 If the p-value is greater than α, then the null hypothesis is
retained and the alternative hypothesis is rejected
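A minimal sketch of this decision rule, using a one-sample t-test from SciPy on made-up data (H0: the population mean is 5.0):

    import numpy as np
    from scipy import stats

    alpha = 0.05                                                  # significance level
    sample = np.array([5.6, 6.1, 5.8, 6.4, 5.9, 6.2, 5.7, 6.0])  # made-up data

    # H0: population mean = 5.0; H1: population mean != 5.0
    t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

    if p_value <= alpha:
        print(f"p = {p_value:.4f} <= {alpha}: reject H0, accept H1")
    else:
        print(f"p = {p_value:.4f} > {alpha}: retain H0")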
 EXCEL: The media company
 STATA: WES
