BS Topic 3

This document belongs to ESCP Business School. It cannot be modified nor distributed without the authors’ consent.

BUSINESS STATISTICS

Prof. Lynn FARAH

Describing Relationships between Two


Statistical Variables
https://www.researchgate.net/publication/23782639_Second-to-Fourth_Digit_Ratio_Predicts_Success_among_High-Frequency_Financial_Traders

Crossing two QUALITATIVE variables:

To study relationships between two qualitative (categorical) variables, start with a contingency table.

Its contents are the counts organized by the categories of both variables, with:
• Variable 1 ⇿ rows
• Variable 2 ⇿ columns

Each cell of the table gives the count for a combination of values of the two variables.
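The construction above can be sketched in a few lines of Python. The paired observations below are invented for illustration, not taken from any real study:

```python
from collections import Counter

# Invented paired observations: (variable 1 value, variable 2 value).
pairs = [
    ("Parlor", "Yes"), ("Parlor", "No"), ("Elsewhere", "No"),
    ("None", "No"), ("None", "No"), ("None", "Yes"),
]

counts = Counter(pairs)  # one count per (row, column) combination
rows = sorted({r for r, _ in pairs})
cols = sorted({c for _, c in pairs})

# Contingency table: variable 1 -> rows, variable 2 -> columns;
# each cell holds the count for that combination of values.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
for r in rows:
    print(r, table[r])
```

Summing all cells recovers the total number of observations, which is a quick sanity check on the table.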

4
Example
The data below come from a study of 626 people attending a particular American medical centre. The study looked at where people had their tattoos and whether they had Hepatitis C.

[Contingency table of counts not reproduced here; the slide labels its marginal and conditional distributions.]
5
Conditional percentages: percent of what?

Compare:
1. What percent of individuals with no tattoo have Hep C?
2. What percent of individuals with Hep C have no tattoo?
3. What percent (?) have no tattoo and do have Hep C?

The clue is in the underlined phrase: always look for what follows the "of". If there is no "of", then implicitly it is "what percent of everyone".
6
Are Hepatitis C and Tattoo Origin independent?

How did we obtain the table below?

Let's compare the conditional distributions shown in the columns:

About 33% of people with tattoos from a commercial parlor have Hep C; only 3.5% of people with no tattoo have Hep C.

The percentage of people with Hepatitis C varies across the categories of Tattoo origin.
⇒ the variables are clearly not independent!
7
Are Hepatitis C and Tattoo Origin independent?

We can also see this using a stacked bar chart:
• One bar for each category of the first variable (origin of tattoo)
• Each bar represents 100% of that category
• The bar has segments corresponding to the categories of the second variable (Individual has Hep C? Yes/No)

The bars don't look the same
⇒ the variables are not independent!

[Stacked bar chart: categories Commercial parlor, Elsewhere, No Tattoo on the x-axis; each bar split into Yes/No segments for "Individual has Hep C?"]
8
Two qualitative/categorical variables: Independence

A pair of qualitative/categorical variables are independent if the distribution of one variable is the same for each of the categories of the other variable.

This can be recognized in the relative conditional frequency representation of the contingency table:
• Look at the "percentage by column (or row) totals"
• If the percentages corresponding to the conditional distributions are very different, then the variables are not independent.
• If the percentages look similar, the variables are probably independent.
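A minimal Python sketch of this check, using invented counts (rows = Hep C status, columns = tattoo origin; the real study's cell counts are not reproduced on the slide):

```python
# Invented counts; rows = Hep C status, columns = tattoo origin.
table = {
    "Hep C":    {"Parlor": 30, "Elsewhere": 10, "None": 10},
    "No Hep C": {"Parlor": 70, "Elsewhere": 90, "None": 290},
}

cols = ["Parlor", "Elsewhere", "None"]
col_totals = {c: sum(table[r][c] for r in table) for c in cols}

# "Percentage by column totals": the conditional distribution of the
# row variable within each column category.
cond = {c: {r: round(100 * table[r][c] / col_totals[c], 1) for r in table}
        for c in cols}
for c in cols:
    print(c, cond[c])

# If these column distributions differ markedly (as they do here),
# the variables are not independent.
```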

9
Chi-Squared Statistic & Cramér's Coefficient

The Chi-squared statistic is a measure of association (dependence) in a contingency table.
- It is calculated by comparing the observed contingency table of counts to an artificial table of "expected counts": the table we would get in theory if the two variables were independent.
- In a theoretical situation of total independence, the value of the Chi-squared statistic is zero (this never happens with real-life data!).
- The Chi-squared statistic can reach very high values (it grows with the sample size and the number of rows/columns in the contingency table) ⇒ a problem when assessing how strong the dependence is!

Hence, we derive Cramér's coefficient from the Chi-squared statistic. It allows us to assess the strength of the association on a 0 to 1 scale (0 = total independence; 1 = total dependence).
Formulas for the curious ones only ☺

To calculate the expected counts for the artificial table:
Exp_ij = (N_i. × N_.j) / N
where N_i. is the row total, N_.j the column total, and N the total count.

To calculate the Chi-squared statistic:
χ² = Σ_i Σ_j (Obs_ij − Exp_ij)² / Exp_ij

To calculate Cramér's coefficient (the formula accounts for the sample size and the number of rows p and columns q):
V = √( χ² / (N × min(p − 1; q − 1)) )
11
Chi-Squared Statistic & Cramér's Coefficient

Chi-squared = 57.91
How small/big is this value? We can't assess it directly, so we calculate Cramér's coefficient.

Cramér's coefficient = 0.3
On a scale from 0 to 1, this value indicates a fairly weak dependence between Hep C and Tattoo origin.
12
One more Example: Gender & Satisfaction

Satisfaction \ Gender   Women   Men   Total
--                         53    32      85
-                         156    88     244
+                          92   200     292
++                         27   152     179
Total                     328   472     800

Is there an association between gender and satisfaction w.r.t. cafeteria food among the staff on the Paris campus?
13
Row percentages

Satisfaction \ Gender   Women   Men   Total
--                         62    38     100
-                          64    36     100
+                          32    68     100
++                         15    85     100
Margin                     41    59     100
14
Column percentages

Satisfaction \ Gender   Women   Men   Margin
--                         16     7      11
-                          48    19      30
+                          28    42      37
++                          8    32      22
Total                     100   100     100
15
Observed Situation

Satisfaction \ Gender   Women   Men   Total
--                         53    32      85
-                         156    88     244
+                          92   200     292
++                         27   152     179
Total                     328   472     800

Artificial/Expected Independence Situation

Satisfaction \ Gender   Women   Men   Total
--                         35    50      85
-                         100   144     244
+                         120   172     292
++                         73   106     179
Total                     328   472     800

16
Calculating the Chi-Squared Statistic:
χ² = 129

Deriving Cramér's V coefficient:
V = 0.4

⇒ There is a moderate association (dependence) between gender and satisfaction (w.r.t. food in the cafeteria).
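These results can be reproduced from the observed table with a short Python sketch of the formulas given earlier:

```python
from math import sqrt

# Observed counts (rows: --, -, +, ++; columns: Women, Men).
obs = [[53, 32], [156, 88], [92, 200], [27, 152]]

row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]
n = sum(row_tot)  # 800

# Expected counts under independence: Exp_ij = N_i. * N_.j / N
exp = [[ri * cj / n for cj in col_tot] for ri in row_tot]

# Chi-squared statistic: sum of (Obs - Exp)^2 / Exp over all cells.
chi2 = sum((o - e) ** 2 / e for orow, erow in zip(obs, exp)
           for o, e in zip(orow, erow))

# Cramer's V on a 0-1 scale (p rows, q columns).
p, q = len(obs), len(obs[0])
v = sqrt(chi2 / (n * min(p - 1, q - 1)))

print(round(chi2, 1), round(v, 2))  # chi-squared ~ 129.7, V ~ 0.40
```

The slide rounds these to 129 and 0.4; the small difference is only rounding in the intermediate tables.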

17

Crossing two QUANTITATIVE variables:

To study relationships between two quantitative variables, start with a scatterplot. Each individual can be represented by a point (xi, yi) in the plane defined by the two variables.

1st step: Choose the variable to be placed on the x-axis, the explanatory (predictor) variable, and the one to be placed on the y-axis, the response (predicted) variable.

2nd step: Draw the scatterplot.
19
https://ourworldindata.org/grapher/alcohol-consumption-vs-gdp-per-capita

20
Which is Response and which is Explanatory?

• Baseball teams: scores on runs & tickets sold
  Tickets = Response (y), Runs = Explanatory (x)
  ⇒ Do baseball teams that score more runs also sell more tickets?

• Students: SAT scores & grades
  Grades = Response (y), SAT score = Explanatory (x)
  ⇒ Do students with higher SAT scores get better grades?

• People: BMI & wrist size
  BMI = Response (y), Wrist Size = Explanatory (x)
  ⇒ Can we estimate a person's BMI by measuring their wrist size?
21
We need to examine the scatterplot and answer the questions below:

• Is there a relationship between the 2 variables?
• If yes, can we evaluate the strength of this relationship?
• Can we model this relationship?

⇒ Direction, Strength & Shape
⇒ Unusual Features (outliers, clusters)
22

Do not confuse correlation and causality!

Is there a relationship between the life expectancy in a country and the number of televisions?

Does sending more televisions to certain countries improve life expectancy?
⇒ There is an underlying "lurking variable": the economic development of the country.
26
Do not confuse correlation and causality!

For cities in a certain country, the number of places of worship and the number of homicides are positively correlated. How can that be explained?
⇒ There is an underlying "lurking variable": the size of the city. Larger cities have more places of worship, and more homicides.

Two variables may be related even if neither one is the cause of the other.

Explanations for relationships:
1. Causality: find a causal mechanism
2. Coincidence: look for more data
3. Common underlying cause: a "lurking variable"

⇒ Scatterplots and correlations never prove causation!
27
28
Linear Correlation

Given a linear relationship, a number between −1 and 1 can be assigned to a scatterplot to measure the intensity of the correlation: the linear correlation coefficient r.

Strong negative (−1) … Weak negative … 0 (No relationship) … Weak positive … Strong positive (+1)

Formula: r = cov(x, y) / (s_x × s_y), where cov(x, y) = Σ (x_i − x̄)(y_i − ȳ) / n
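The formula above can be sketched directly in Python; the data points here are invented for illustration:

```python
from math import sqrt

def corr(x, y):
    """r = cov(x, y) / (s_x * s_y), with
    cov(x, y) = sum((xi - x_bar) * (yi - y_bar)) / n."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return cov / (sx * sy)

# Invented data with a strong positive linear trend: r should be close to +1.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(corr(x, y), 3))
```

Swapping y for a decreasing sequence would push r toward −1, and shuffling it randomly would push r toward 0.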

29
30
For each scatterplot below (not reproduced here), suggest an approximate value for the linear correlation coefficient r.

Regression Line

Given a linear relationship between two quantitative variables, a linear model can be constructed.

Although the points in a scatterplot usually do not all lie on a straight line, some lines do a better job than others of describing the relationship seen in the scatterplot.

We aim to find the line that does the best job of modelling the relationship: the line of best fit.

Even that model won't be perfect: some points will be above the line and some will be below it.

33
34
Example
The data collected describe the number of manatees killed in the Florida waterways and the number of powerboats registered, for each year from 1988 to 2000.

Who: The years 1988 to 2000; 1 point represents 1 year
What: (Y) The number of manatees killed
      (X) The number of powerboats registered, in 1000s
Why: To better understand the cause of manatee deaths
35
36
Fitted regression line:
ŷ = −45.671 + 0.131 x
37
Where does the equation come from?

The residual for a point in the scatterplot is the difference (in the y-direction, vertically) between the observed y-value of that case and the y-value predicted by the line (denoted ŷ).

residual = observed − predicted = y − ŷ

- A negative residual means the predicted value is too big (overestimate).
- A positive residual means the predicted value is too small (underestimate).
38
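A small sketch of this sign convention, using the manatee model fitted earlier in the slides; the (x, observed y) pairs are invented:

```python
# Manatee model from the slides: y_hat = -45.671 + 0.131 x,
# with x = powerboat registrations in thousands.
def predicted(x):
    return -45.671 + 0.131 * x

observations = [(500, 24), (600, 30)]  # invented (x, observed y) pairs

for x, y in observations:
    resid = y - predicted(x)  # residual = observed - predicted
    verdict = "underestimate" if resid > 0 else "overestimate"
    print(x, round(resid, 2), verdict)
```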
Where does the equation come from?

There is a unique line which minimizes the sum of the squares of the residuals over all points in the scatterplot
⇒ It is considered THE line of best fit.

⇒ Making Σ (y − ŷ)² as small as possible leads to the following coefficients for the line equation:

slope = r × (s_y / s_x)        intercept = ȳ − slope × x̄
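These coefficient formulas can be sketched in Python; the dataset is invented, and the code uses the equivalent form slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which is the same as r × s_y / s_x:

```python
def fit_line(x, y):
    """Line of best fit: slope = r * s_y / s_x, written here as
    sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar) ** 2);
    intercept = y_bar - slope * x_bar."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Invented dataset close to the line y = 2x.
x = [1, 2, 3, 4]
y = [2.0, 4.1, 5.9, 8.0]
slope, intercept = fit_line(x, y)
print(round(slope, 3), round(intercept, 3))  # about 1.98 and 0.05
```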

39
Compared to all other imaginable lines, the line of best fit yields the smallest value for the sum of squares of residuals: 1233.204.

But how big/small is this value? And what does it mean?

We calculate the r² value:
r² ≈ 10520 / (10520 + 1233) = 10520 / 11753 = 0.895 = 89.5%
40
But what does the r² value mean?

It represents the percentage of the variation in Y which has been accounted for by the model in terms of X.

Thus 89.5% of the variation in manatee deaths can be attributed to variation in powerboat registrations, via the linear regression model.

There is still 10.5% of the variation in manatee deaths unexplained by the model (and hence left in the residuals), which may be due to other factors (climate, food sources, environment, random variation from year to year, etc.).

It is also simply equal to r² = (0.946)² = 0.895.

41
The Domain of a Regression Model
Interpolation vs. Extrapolation

42
The Domain of a Regression Model
Interpolation vs. Extrapolation

The model for manatee deaths as a function of powerboat registrations was constructed from powerboat registration counts (the x-values) between about 450 and 750 thousand registrations. This is called the domain of the model.

A model can be used to determine y-values (manatee deaths) for x-values (powerboat registrations) within the domain, but outside this domain it may not make sense.

For example, the y-intercept, which corresponds to x = 0, is very far outside this range for our model, and in this case (being negative) it clearly doesn't make sense.

Let's predict the number of manatees killed when 500 thousand powerboats are registered. If there are 500 thousand powerboat registrations, the model says:
ŷ = −45.671 + 0.131(500) = 19.83
That is, there would be about 20 manatees killed.
The Domain of a Regression Model
Interpolation vs. Extrapolation

When the x-value being used is within the domain of the model, such an estimate is called an interpolation. Interpolations are safe.

When the x-value being used is outside the domain of the model (here [450; 750]), it is called an extrapolation.

Be careful making extrapolations! Close to the domain it may be reasonable to do so, but far from the domain it is unlikely to be an appropriate use of the model.
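This advice can be captured as a small guard around the fitted model; the domain bounds [450, 750] and the equation are taken from the slides, while the warning mechanism is just one illustrative design choice:

```python
DOMAIN = (450, 750)  # thousands of powerboat registrations, from the slides

def predict_manatees(boats_thousands):
    """Apply y_hat = -45.671 + 0.131 x, flagging an extrapolation
    whenever x lies outside the model's domain."""
    y_hat = -45.671 + 0.131 * boats_thousands
    if not DOMAIN[0] <= boats_thousands <= DOMAIN[1]:
        print(f"warning: x = {boats_thousands} is an extrapolation")
    return y_hat

print(round(predict_manatees(500), 2))  # interpolation: 19.83
```

Calling `predict_manatees(0)` would trigger the warning and return the (meaningless) negative y-intercept.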

44
