BS Topic 3

This document belongs to ESCP Business School. It cannot be modified nor distributed without the authors’ consent.

BUSINESS STATISTICS

Prof. Lynn FARAH

Describing Relationships between Two


Statistical Variables
https://www.researchgate.net/publication/23782639_Second-to-Fourth_Digit_Ratio_Predicts_Success_among_High-Frequency_Financial_Traders

Crossing two QUALITATIVE variables:

To study relationships between two qualitative (categorical) variables, start with a contingency table.

Its contents are the counts organized by the categories of both variables, with:
• Variable 1 ⇿ rows
• Variable 2 ⇿ columns

Each cell of the table gives the count for a combination of values of the two variables.
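The construction above can be sketched in a few lines of Python. The paired observations below are invented for illustration, not taken from any real study:

```python
from collections import Counter

# Invented paired observations: (variable 1 value, variable 2 value).
pairs = [
    ("Parlor", "Yes"), ("Parlor", "No"), ("Elsewhere", "No"),
    ("None", "No"), ("None", "No"), ("None", "Yes"),
]

counts = Counter(pairs)  # one count per (row, column) combination
rows = sorted({r for r, _ in pairs})
cols = sorted({c for _, c in pairs})

# Contingency table: variable 1 -> rows, variable 2 -> columns;
# each cell holds the count for that combination of values.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
for r in rows:
    print(r, table[r])
```

Summing all cells recovers the total number of observations, which is a quick sanity check on the table.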

4
Example
The data below come from a study of 626 people attending a particular American medical centre. The study looked at where people had their tattoos and whether they had Hepatitis C.

[Contingency table of counts not reproduced here; the slide labels its marginal and conditional distributions.]
5
Conditional percentages: percent of what?

Compare:
1. What percent of individuals with no tattoo have Hep C?
2. What percent of individuals with Hep C have no tattoo?
3. What percent (?) have no tattoo and do have Hep C?

The clue is in the underlined phrase: always look for what follows the "of". If there is no "of", then implicitly it is "what percent of everyone".
6
Are Hepatitis C and Tattoo Origin independent?

How did we obtain the table below?

Let's compare the conditional distributions shown in the columns:

About 33% of people with tattoos from a commercial parlor have Hep C; only 3.5% of people with no tattoo have Hep C.

The percentage of people with Hepatitis C varies across the categories of Tattoo origin.
⇒ the variables are clearly not independent!
7
Are Hepatitis C and Tattoo Origin independent?

We can also see this using a stacked bar chart:
• One bar for each category of the first variable (origin of tattoo)
• Each bar represents 100% of that category
• The bar has segments corresponding to the categories of the second variable (Individual has Hep C? Yes/No)

The bars don't look the same
⇒ the variables are not independent!

[Stacked bar chart: categories Commercial parlor, Elsewhere, No Tattoo on the x-axis; each bar split into Yes/No segments for "Individual has Hep C?"]
8
Two qualitative/categorical variables: Independence

A pair of qualitative/categorical variables are independent if the distribution of one variable is the same for each of the categories of the other variable.

This can be recognized in the relative conditional frequency representation of the contingency table:
• Look at the "percentage by column (or row) totals"
• If the percentages corresponding to the conditional distributions are very different, then the variables are not independent.
• If the percentages look similar, the variables are probably independent.
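A minimal Python sketch of this check, using invented counts (rows = Hep C status, columns = tattoo origin; the real study's cell counts are not reproduced on the slide):

```python
# Invented counts; rows = Hep C status, columns = tattoo origin.
table = {
    "Hep C":    {"Parlor": 30, "Elsewhere": 10, "None": 10},
    "No Hep C": {"Parlor": 70, "Elsewhere": 90, "None": 290},
}

cols = ["Parlor", "Elsewhere", "None"]
col_totals = {c: sum(table[r][c] for r in table) for c in cols}

# "Percentage by column totals": the conditional distribution of the
# row variable within each column category.
cond = {c: {r: round(100 * table[r][c] / col_totals[c], 1) for r in table}
        for c in cols}
for c in cols:
    print(c, cond[c])

# If these column distributions differ markedly (as they do here),
# the variables are not independent.
```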

9
Chi-Squared Statistic & Cramér's Coefficient

The Chi-squared statistic is a measure of association (dependence) in a contingency table.
- It is calculated by comparing the observed contingency table of counts to an artificial table of "expected counts": the table we would get in theory if the two variables were independent.
- In a theoretical situation of total independence, the value of the Chi-squared statistic is zero (this never happens with real-life data!).
- The Chi-squared statistic can reach very high values (it grows with the sample size and the number of rows/columns in the contingency table) ⇒ a problem when assessing how strong the dependence is!

Hence, we derive Cramér's coefficient from the Chi-squared statistic. It allows us to assess the strength of the association on a 0 to 1 scale (0 = total independence; 1 = total dependence).
Formulas for the curious ones only ☺

To calculate the expected counts for the artificial table:
Exp_ij = (N_i. × N_.j) / N
where N_i. is the row total, N_.j the column total, and N the total count.

To calculate the Chi-squared statistic:
χ² = Σ_i Σ_j (Obs_ij − Exp_ij)² / Exp_ij

To calculate Cramér's coefficient (the formula accounts for the sample size and the number of rows p and columns q):
V = √( χ² / (N × min(p − 1; q − 1)) )
11
Chi-Squared Statistic & Cramér's Coefficient

Chi-squared = 57.91
How small/big is this value? We can't assess it directly, so we calculate Cramér's coefficient.

Cramér's coefficient = 0.3
On a scale from 0 to 1, this value indicates a fairly weak dependence between Hep C and Tattoo origin.
12
One more Example: Gender & Satisfaction

Satisfaction \ Gender   Women   Men   Total
--                         53    32      85
-                         156    88     244
+                          92   200     292
++                         27   152     179
Total                     328   472     800

Is there an association between gender and satisfaction w.r.t. cafeteria food among the staff on the Paris campus?
13
Row percentages

Satisfaction \ Gender   Women   Men   Total
--                         62    38     100
-                          64    36     100
+                          32    68     100
++                         15    85     100
Margin                     41    59     100
14
Column percentages

Satisfaction \ Gender   Women   Men   Margin
--                         16     7      11
-                          48    19      30
+                          28    42      37
++                          8    32      22
Total                     100   100     100
15
Observed Situation

Satisfaction \ Gender   Women   Men   Total
--                         53    32      85
-                         156    88     244
+                          92   200     292
++                         27   152     179
Total                     328   472     800

Artificial/Expected Independence Situation

Satisfaction \ Gender   Women   Men   Total
--                         35    50      85
-                         100   144     244
+                         120   172     292
++                         73   106     179
Total                     328   472     800

16
Calculating the Chi-Squared Statistic:
χ² = 129

Deriving Cramér's V coefficient:
V = 0.4

⇒ There is a moderate association (dependence) between gender and satisfaction (w.r.t. food in the cafeteria).
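These results can be reproduced from the observed table with a short Python sketch of the formulas given earlier:

```python
from math import sqrt

# Observed counts (rows: --, -, +, ++; columns: Women, Men).
obs = [[53, 32], [156, 88], [92, 200], [27, 152]]

row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]
n = sum(row_tot)  # 800

# Expected counts under independence: Exp_ij = N_i. * N_.j / N
exp = [[ri * cj / n for cj in col_tot] for ri in row_tot]

# Chi-squared statistic: sum of (Obs - Exp)^2 / Exp over all cells.
chi2 = sum((o - e) ** 2 / e for orow, erow in zip(obs, exp)
           for o, e in zip(orow, erow))

# Cramer's V on a 0-1 scale (p rows, q columns).
p, q = len(obs), len(obs[0])
v = sqrt(chi2 / (n * min(p - 1, q - 1)))

print(round(chi2, 1), round(v, 2))  # chi-squared ~ 129.7, V ~ 0.40
```

The slide rounds these to 129 and 0.4; the small difference is only rounding in the intermediate tables.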

17

Crossing two QUANTITATIVE variables:

To study relationships between two quantitative variables, start with a scatterplot. Each individual can be represented by a point (xi, yi) in the plane defined by the two variables.

1st step: Choose the variable to be placed on the x-axis, the explanatory (predictor) variable, and the one to be placed on the y-axis, the response (predicted) variable.

2nd step: Draw the scatterplot.
19
https://ourworldindata.org/grapher/alcohol-consumption-vs-gdp-per-capita

20
Which is Response and which is Explanatory?

• Baseball teams: scores on runs & tickets sold
  Tickets = Response (y), Runs = Explanatory (x)
  ⇒ Do baseball teams that score more runs also sell more tickets?

• Students: SAT scores & grades
  Grades = Response (y), SAT score = Explanatory (x)
  ⇒ Do students with higher SAT scores get better grades?

• People: BMI & wrist size
  BMI = Response (y), Wrist Size = Explanatory (x)
  ⇒ Can we estimate a person's BMI by measuring their wrist size?
21
We need to examine the scatterplot and answer the questions below:

• Is there a relationship between the 2 variables?
• If yes, can we evaluate the strength of this relationship?
• Can we model this relationship?

⇒ Direction, Strength & Shape
⇒ Unusual Features (outliers, clusters)
22

Do not confuse correlation and causality!

Is there a relationship between the life expectancy in a country and the number of televisions?

Does sending more televisions to certain countries improve life expectancy?
⇒ There is an underlying "lurking variable": the economic development of the country.
26
Do not confuse correlation and causality!

For cities in a certain country, the number of places of worship and the number of homicides are positively correlated. How can that be explained?
⇒ There is an underlying "lurking variable": the size of the city. Larger cities have more places of worship, and more homicides.

Two variables may be related even if neither one is the cause of the other.

Explanations for relationships:
1. Causality: find a causal mechanism
2. Coincidence: look for more data
3. Common underlying cause: a "lurking variable"

⇒ Scatterplots and correlations never prove causation!
27
28
Linear Correlation

Given a linear relationship, a number between −1 and 1 can be assigned to a scatterplot to measure the intensity of the correlation: the linear correlation coefficient r.

Strong negative (−1) … Weak negative … 0 (No relationship) … Weak positive … Strong positive (+1)

Formula: r = cov(x, y) / (s_x × s_y), where cov(x, y) = Σ (x_i − x̄)(y_i − ȳ) / n
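The formula above can be sketched directly in Python; the data points here are invented for illustration:

```python
from math import sqrt

def corr(x, y):
    """r = cov(x, y) / (s_x * s_y), with
    cov(x, y) = sum((xi - x_bar) * (yi - y_bar)) / n."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return cov / (sx * sy)

# Invented data with a strong positive linear trend: r should be close to +1.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(corr(x, y), 3))
```

Swapping y for a decreasing sequence would push r toward −1, and shuffling it randomly would push r toward 0.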

29
30
For each scatterplot below (not reproduced here), suggest an approximate value for the linear correlation coefficient r.

Regression Line

Given a linear relationship between two quantitative variables, a linear model can be constructed.

Although the points in a scatterplot usually do not all lie on a straight line, some lines do a better job than others of describing the relationship seen in the scatterplot.

We aim to find the line that does the best job of modelling the relationship: the line of best fit.

Even that model won't be perfect: some points will be above the line and some will be below it.

33
34
Example
The data collected describe the number of manatees killed in the Florida waterways and the number of powerboats registered, for each year from 1988 to 2000.

Who: The years 1988 to 2000; 1 point represents 1 year
What: (Y) The number of manatees killed
      (X) The number of powerboats registered, in 1000s
Why: To better understand the cause of manatee deaths
35
36
Fitted regression line:
ŷ = −45.671 + 0.131 x
37
Where does the equation come from?

The residual for a point in the scatterplot is the difference (in the y-direction, vertically) between the observed y-value of that case and the y-value predicted by the line (denoted ŷ).

residual = observed − predicted = y − ŷ

- A negative residual means the predicted value is too big (overestimate).
- A positive residual means the predicted value is too small (underestimate).
38
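A small sketch of this sign convention, using the manatee model fitted earlier in the slides; the (x, observed y) pairs are invented:

```python
# Manatee model from the slides: y_hat = -45.671 + 0.131 x,
# with x = powerboat registrations in thousands.
def predicted(x):
    return -45.671 + 0.131 * x

observations = [(500, 24), (600, 30)]  # invented (x, observed y) pairs

for x, y in observations:
    resid = y - predicted(x)  # residual = observed - predicted
    verdict = "underestimate" if resid > 0 else "overestimate"
    print(x, round(resid, 2), verdict)
```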
Where does the equation come from?

There is a unique line which minimizes the sum of the squares of the residuals over all points in the scatterplot
⇒ It is considered THE line of best fit.

⇒ Making Σ (y − ŷ)² as small as possible leads to the following coefficients for the line equation:

slope = r × (s_y / s_x)        intercept = ȳ − slope × x̄
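These coefficient formulas can be sketched in Python; the dataset is invented, and the code uses the equivalent form slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which is the same as r × s_y / s_x:

```python
def fit_line(x, y):
    """Line of best fit: slope = r * s_y / s_x, written here as
    sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar) ** 2);
    intercept = y_bar - slope * x_bar."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Invented dataset close to the line y = 2x.
x = [1, 2, 3, 4]
y = [2.0, 4.1, 5.9, 8.0]
slope, intercept = fit_line(x, y)
print(round(slope, 3), round(intercept, 3))  # about 1.98 and 0.05
```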

39
Compared to all other imaginable lines, the line of best fit yields the smallest value for the sum of squares of residuals: 1233.204.

But how big/small is this value? And what does it mean?

We calculate the r² value:
r² ≈ 10520 / (10520 + 1233) = 10520 / 11753 = 0.895 = 89.5%
40
But what does the r² value mean?

It represents the percentage of the variation in Y which has been accounted for by the model in terms of X.

Thus 89.5% of the variation in manatee deaths can be attributed to variation in powerboat registrations, via the linear regression model.

There is still 10.5% of the variation in manatee deaths unexplained by the model (and hence left in the residuals), which may be due to other factors (climate, food sources, environment, random variation from year to year, etc.).

It is also simply equal to r² = (0.946)² = 0.895.

41
The Domain of a Regression Model
Interpolation vs. Extrapolation

42
The Domain of a Regression Model
Interpolation vs. Extrapolation

The model for manatee deaths as a function of powerboat registrations was constructed from powerboat registration counts (the x-values) between about 450 and 750 thousand registrations. This is called the domain of the model.

A model can be used to determine y-values (manatee deaths) for x-values (powerboat registrations) within the domain, but outside this domain it may not make sense.

For example, the y-intercept, which corresponds to x = 0, is very far outside this range for our model, and in this case (being negative) it clearly doesn't make sense.

Let's predict the number of manatees killed when 500 thousand powerboats are registered. If there are 500 thousand powerboat registrations, the model says:
ŷ = −45.671 + 0.131(500) = 19.83
That is, there would be about 20 manatees killed.
The Domain of a Regression Model
Interpolation vs. Extrapolation

When the x-value being used is within the domain of the model, such an estimate is called an interpolation. Interpolations are safe.

When the x-value being used is outside the domain of the model (here [450; 750]), it is called an extrapolation.

Be careful making extrapolations! Close to the domain it may be reasonable to do so, but far from the domain it is unlikely to be an appropriate use of the model.
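This advice can be captured as a small guard around the fitted model; the domain bounds [450, 750] and the equation are taken from the slides, while the warning mechanism is just one illustrative design choice:

```python
DOMAIN = (450, 750)  # thousands of powerboat registrations, from the slides

def predict_manatees(boats_thousands):
    """Apply y_hat = -45.671 + 0.131 x, flagging an extrapolation
    whenever x lies outside the model's domain."""
    y_hat = -45.671 + 0.131 * boats_thousands
    if not DOMAIN[0] <= boats_thousands <= DOMAIN[1]:
        print(f"warning: x = {boats_thousands} is an extrapolation")
    return y_hat

print(round(predict_manatees(500), 2))  # interpolation: 19.83
```

Calling `predict_manatees(0)` would trigger the warning and return the (meaningless) negative y-intercept.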

44
