Module 3 - Data Analysis_S RM
OVERVIEW
• There are many situations where it is important to
understand the relationship between two variables in a
dataset
• We explore:
• the importance of understanding relationships
between variables in decision making
• some different types of relationships between variables
Suppose we are a car dealership that wants to understand the
relationship between the number of cars sold per year and the
number of years’ experience a car salesperson has
We would expect that the more experience a car salesperson
has, the more cars they are likely to sell
It would be interesting to try and quantify this relationship to
assess the effect experience has on selling cars
Quantifying this relationship is useful for two reasons:
1. It allows us to make predictions about the outcome we are trying to
explore:
e.g., annual car sales
2. It allows us to understand if, or how much, a particular factor
contributes to annual car sales:
e.g., the amount of experience of the salesperson
Two variables are related if a change (an increase or decrease) in
one variable is accompanied by a change in the value of the other
The two variables may either move in the same direction or
in the opposite direction
When considering the relationship between two variables,
we need to look at two aspects of their relationship:
correlation and causation
We will explore these in more detail in the next two videos
Correlation – a measure of strength of linear relationship
between two variables
Correlation coefficient – a number between -1 and 1.
A correlation of 0 indicates that the two variables have no linear
relationship to each other.
A positive correlation coefficient indicates a linear relationship
for which one variable increases as the other also increases.
A negative correlation coefficient indicates a linear relationship
in which one variable increases as the other decreases.
CORRELATION
Correlation is positive when the values increase
together, and
Correlation is negative when one value decreases as
the other increases
A correlation is assumed to be linear (following a
line)
Source: https://www.mathsisfun.com/data/correlation.html
Correlation is measured by the correlation coefficient “r”, which
measures the joint variability of the two variables
This is also called the standardised covariance, and ranges between -1
and 1
The closer to -1, the stronger the negative linear relationship (i.e.,
move in opposite directions)
The closer to 1, the stronger the positive linear relationship (i.e., move
in the same direction)
The closer to 0, the weaker the linear relationship
There is a formula to calculate the correlation coefficient “r”
The formula in Excel is =CORREL(dataA, dataB), where
dataA represents the first data set, and
dataB represents the second data set
In this course, you can calculate correlation using Excel
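For reference, the formula Excel computes is the Pearson correlation coefficient:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the two data sets.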
Use Bon Appetit data to show correlation between different pairs
of variables
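Since the Bon Appetit data itself is not reproduced here, the sketch below uses made-up numbers purely to illustrate the mechanics; in Python, numpy computes the same value as Excel's CORREL:

    import numpy as np

    # Hypothetical pair of variables (placeholders for two Bon Appetit columns)
    ad_spend = np.array([120, 150, 170, 200, 240, 260])
    sales = np.array([5.1, 5.8, 6.0, 6.9, 7.5, 7.8])

    # np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
    r = np.corrcoef(ad_spend, sales)[0, 1]
    print(f"r = {r:.3f}")  # a value near 1 indicates a strong positive linear relationship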
OVERVIEW
• Causation is another important relationship between two variables
that can often be confused with correlation
• We explore:
• the definition of causation
• the relationship between correlation and
causation
• https://www.youtube.com/watch?v=VMUQSMFGBDo
Causation is where one variable or event causes changes in
another variable or event
A high correlation does not always imply causation – you will
often hear the phrase “correlation does not imply causation”
While high correlation might suggest that one variable may
affect the outcome of another, this could be purely coincidental
Correlation does not prove one thing causes the other:
one thing might cause the other
the other might cause the first to happen
they may be linked by a different thing
or it could be random chance!
There can be many reasons why the data has a good
correlation.
EXAMPLE
The ice cream shop finds out how many sunglasses were sold by a large store each day
and compares them to its ice cream sales:
Does this mean that sunglasses make people want ice cream?
We saw how the relationship between variables can
be explored with the creation of scatter plots
We can use a technique called linear regression to
better understand the relationship between two
variables
Regression analysis assists decision-making by allowing a deeper
understanding of the relationship between variables
There are different kinds of regression analysis, but in this course,
we're going to focus on linear regression
Linear regression analysis is used to predict the value of a
variable based on the value of another variable
The variable you want to predict is called the dependent variable
The variable you are using to predict the other variable's value is
called the independent variable
The linear regression equation is: Y= a + bX, where:
Y is the dependent variable (the variable on the Y axis)
X is the independent variable (plotted on the X axis)
b is the slope of the line, and
a is the y-intercept
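As a purely illustrative example with made-up numbers: if a = 5 and b = 3 in the car dealership scenario, then a salesperson with X = 10 years of experience would be predicted to sell $Y = 5 + 3 \times 10 = 35$ cars per year.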
In the simple linear regression model, where y = b0 + b1x + u, we typically
refer to y as the
Dependent Variable, or
Left-Hand Side Variable, or
Explained Variable, or
Regressand
In the simple linear regression of y on x, we typically refer to x as
the
Independent Variable, or
Right-Hand Side Variable, or
Explanatory Variable, or
Regressor, or
Covariate, or
Control Variable
wage = b0 + b1educ + u
A simple first assumption is that the average value of the error
term u in the population is zero:
E(u) = 0
We need to make a crucial assumption about how u and x are
related
We want it to be the case that knowing something about x does
not give us any information about u, so that they are completely
unrelated. That is, that
E(u|x) = E(u) = 0, which implies
E(y|x) = b0 + b1x
E(y|x) as a linear function of x, where for any x
the distribution of y is centered about E(y|x)
[Figure: for two values x1 and x2, the conditional density f(y) is centered on the line E(y|x) = b0 + b1x]
Basic idea of regression is to estimate the population
parameters from a sample
Let {(xi,yi): i=1, …,n} denote a random sample of size n
from the population
For each observation in this sample, it will be the case
that
yi = b0 + b1xi + ui
Population regression line, sample data points
and the associated error terms
[Figure: sample points (x1, y1), ..., (x4, y4) scattered around the population regression line E(y|x) = b0 + b1x, with the error terms u1, ..., u4 drawn as vertical distances from each point to the line]
Intuitively, OLS is fitting a line through the sample points such
that the sum of squared residuals is as small as possible, hence
the term least squares
The residual, û, is an estimate of the error term, u, and is the
difference between the fitted line (sample regression function)
and the sample point
Sample regression line, sample data points
and the associated estimated error terms
[Figure: the same sample points with the fitted sample regression line $\hat{y} = \hat{b}_0 + \hat{b}_1 x$, and the estimated residuals û1, ..., û4 drawn as vertical distances from each point to the fitted line]
To derive the OLS estimates we need to realize that our main
assumption of E(u|x) = E(u) = 0 also implies that
Cov(x,u) = E(xu) = 0
$$\hat{b}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad \text{provided that } \sum_{i=1}^{n}(x_i - \bar{x})^2 > 0$$
The slope estimate is the sample covariance between x and y
divided by the sample variance of x
If x and y are positively correlated, the slope will be positive
If x and y are negatively correlated, the slope will be negative
Only need x to vary in our sample
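A minimal sketch of these estimators in Python (the data values are hypothetical; the intercept formula $\hat{b}_0 = \bar{y} - \hat{b}_1\bar{x}$ is a standard result, which makes the fitted line pass through the point of sample means):

    import numpy as np

    def ols_simple(x, y):
        """Estimate b0 and b1 in the simple regression y = b0 + b1*x + u."""
        x_bar, y_bar = x.mean(), y.mean()
        # Slope: sample covariance of x and y divided by sample variance of x
        b1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
        # Intercept: the fitted line passes through (x_bar, y_bar)
        b0 = y_bar - b1 * x_bar
        return b0, b1

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical data
    b0, b1 = ols_simple(x, y)
    print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")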
We can think of each observation as being made
up of an explained part and an unexplained part,
$y_i = \hat{y}_i + \hat{u}_i$. We then define the following:
$\sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares (SST)
$\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ is the explained sum of squares (SSE)
$\sum_{i=1}^{n}\hat{u}_i^2$ is the residual sum of squares (SSR)
with SST = SSE + SSR
We can compute the fraction of the total sum of squares (SST) that
is explained by the model; call this the R-squared of the regression:
R2 = SSE/SST = 1 – SSR/SST
R2 = coefficient of determination: the proportion of variation
explained by the independent variable (regression model)
$0 \leq R^2 \leq 1$
The square root of R2 is the sample correlation coefficient, r
(where the sign of r is the same as the slope of the fitted line)
Use CEOSAL1
For the population of chief executive officers, let y be annual salary
(salary) in thousands of dollars.
Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1,452.6
indicates a salary of $1,452,600.
Let x be the average return on equity (roe) for the CEO’s firm for the
previous three years. (Return on equity is defined in terms of net
income as a percentage of common equity.) For example, if roe is 10,
then average return on equity is 10%.
Estimate the relationship between this measure of firm performance
and CEO compensation. Comment on R-squared.
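One way this could be done in Python, assuming the CEOSAL1 data has been exported to a CSV file named ceosal1.csv with columns salary and roe (the file name and export step are assumptions for this sketch):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("ceosal1.csv")  # assumed export of the CEOSAL1 dataset
    x = df["roe"].to_numpy()         # average return on equity, in percent
    y = df["salary"].to_numpy()      # annual salary, in thousands of dollars

    # OLS estimates of slope and intercept
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()

    # R-squared = 1 - SSR/SST
    resid = y - (b0 + b1 * x)
    r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
    print(f"salary-hat = {b0:.1f} + {b1:.2f} roe, R-squared = {r2:.3f}")

A low R-squared here would mean that roe explains only a small share of the variation in CEO salaries.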
The OLS estimates of b1 and b0 are unbiased
Proof of unbiasedness depends on our 4 assumptions – if any
assumption fails, then OLS is not necessarily unbiased
Remember unbiasedness is a description of the estimator – in a
given sample we may be “near” or “far” from the true parameter
Now we know that the sampling distribution of our estimate is
centered around the true parameter
Want to think about how spread out this distribution is
Much easier to think about this variance under an additional
assumption, so
Assume Var(u|x) = s2 (Homoskedasticity)
Homoskedastic Case
[Figure: the conditional densities f(y|x) at x1 and x2 have identical spread and are centered on E(y|x) = b0 + b1x]
Heteroskedastic Case
[Figure: the conditional densities f(y|x) at x1, x2, and x3 fan out with increasing spread around E(y|x) = b0 + b1x]
wage = b0 + b1educ + u
If we also make the homoskedasticity assumption, then
Var(u|educ) = s2 does not depend on the level of education, which
is the same as assuming Var(wage|educ) = s2 .
We don’t know what the error variance, s2, is, because we don’t
observe the errors, ui
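Although not shown on these slides, the standard way forward (a well-known result, stated here for completeness) is to estimate the error variance, written s2 above, from the residuals:

$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2 = \frac{SSR}{n-2}$$

dividing by n − 2 because two parameters, b0 and b1, were estimated.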
OVERVIEW
We:
• explore probability distributions
• find examples of normal distributions
• explore confidence intervals and how
they relate to decision making
PROBABILITY DISTRIBUTION
Data can be "distributed" (spread out) in different ways
NORMAL PROBABILITY DISTRIBUTION
There are also many cases where the data tends to be around a central value with no
bias left or right, following the familiar bell-shaped curve of the normal distribution
In calculating the standard deviation, we find that generally:
It is helpful to know the standard deviation, because
we can say that any value is:
likely to be within 1 standard deviation (68 out of 100
should be)
very likely to be within 2 standard deviations (95 out
of 100 should be)
almost certainly within 3 standard deviations (997 out
of 1000 should be)
A value more than three standard deviations from the
mean is likely to be an outlier – a measurement error
or an anomaly.
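These percentages can be checked with a quick simulation (a sketch using simulated data, not part of the original material):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=0.0, scale=1.0, size=100_000)  # simulated normal data

    for k in (1, 2, 3):
        share = np.mean(np.abs(data) <= k)  # fraction within k standard deviations
        print(f"within {k} sd: {share:.3f}")  # expect roughly 0.683, 0.954, 0.997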
STANDARDIZING
Any Normal Distribution can be converted to the Standard
Normal Distribution.
Source: https://www.mathsisfun.com/data/standard-normal-distribution.html
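The conversion uses the z-score:

$$z = \frac{x - \mu}{\sigma}$$

where μ is the mean and σ the standard deviation; the standardised values then have mean 0 and standard deviation 1.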
In business, you sometimes need to give an answer with some
consideration given to how confident you are about that
answer
This is where confidence intervals, which rely on underlying
probability distributions, can help
A useful estimate would indicate a range of values and the
probability that the actual value is within that range
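For example, a common normal-based 95% confidence interval for a population mean (a standard textbook formula, shown here for concreteness) is

$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$$

where $\bar{x}$ is the sample mean, σ the population standard deviation, and n the sample size; the multiplier 1.96 comes from the 95% coverage of the standard normal distribution.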
OVERVIEW
In this video, we:
• examine how to test whether a given hypothesis is
correct based on the available datasets
• take a look at the general hypothesis testing process
• https://www.youtube.com/watch?v=ZzeXCKd5a18
A hypothesis is a statement that might be true
Researchers generally formulate a hypothesis and then collect
data to test whether the hypothesis is true or not
A sample is generally selected from a larger group (the
"population") that will, hopefully, let you find out things about
the larger group
Samples should be chosen randomly
Example: you ask 100 randomly chosen people at a soccer match
what their main job is. Your sample is the 100, while the
population is all the people at that match.
Hypothesis testing involves a null hypothesis and an alternative
hypothesis
H0: The null hypothesis: is a statement of no effect, relationship, or
difference between two or more groups or factors
e.g., There is no difference in the incidence of skin cancer across ages 0 to 5
years.
H1: The alternative hypothesis: is the statement that there is an
effect or difference. This is usually the hypothesis the researcher is
interested in proving.
e.g., The incidence of skin cancer differs with age.
The investigator needs to set a “level of significance” (α)
This is how confident they need to be before they reject the
null hypothesis and accept the alternative hypothesis
A significance level of 5% (α = 0.05) indicates that the
investigator will reject the null hypothesis only if the observed
result would have less than a 5% chance of occurring when the
null hypothesis is actually true
In other words, α is the probability the investigator is willing to
accept of wrongly rejecting a null hypothesis that is in fact true
A p-value:
is the probability that a difference at least as large as the one
observed could have occurred just by random chance, assuming the
null hypothesis is true
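As an illustrative sketch of the whole process on simulated data (scipy's two-sample t-test is one common way to obtain a p-value; the group values below are made up):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    group_a = rng.normal(loc=10.0, scale=2.0, size=50)  # simulated sample A
    group_b = rng.normal(loc=11.0, scale=2.0, size=50)  # simulated sample B

    # Two-sample t-test of H0: the two group means are equal
    t_stat, p_value = stats.ttest_ind(group_a, group_b)

    alpha = 0.05  # significance level
    if p_value < alpha:
        print(f"p = {p_value:.4f} < {alpha}: reject H0")
    else:
        print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")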