BS Topic 3
BUSINESS STATISTICS
Crossing two QUALITATIVE
variables:
To study relationships between two qualitative (categorical)
variables, start with a contingency table.
Example
The data below comes from a study of 626 people attending a particular
American medical centre.
The study looked at where people had their tattoo and whether they
have Hepatitis C.
[Contingency table with annotations: Marginal Distribution, Conditional Distribution]
Conditional percentages: percent of what?
Compare:
1. What percent of individuals with no tattoo have Hep C?
2. What percent of individuals with Hep C have no tattoo?
3. What percent (?) have no tattoo and do have Hep C?
The clue is in the phrase that follows the “of”: always look for it.
If there is no “of”, then implicitly it is “what percent of everyone”.
Are Hepatitis C and Tattoo Origin independent?
[Bar chart: percentages by tattoo category (Commercial parlor, Elsewhere, No Tattoo)]
The bars don’t look the same ⇒ the variables are not independent!
Two qualitative/categorical variables: Independence
Chi-Squared Statistic & Cramer’s Coefficient
The Chi-squared statistic is a measure of association (dependence) in a
contingency table.
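Formulas for the curious ones only :)
To calculate the expected counts for the artificial table:
$\mathrm{Exp}_{ij} = \frac{N_{i.} \times N_{.j}}{N}$
where $N_{i.}$ is the row total, $N_{.j}$ the column total, and $N$ the total count.
To calculate the Chi-squared statistic:
$\chi^2 = \sum_i \sum_j \frac{(\mathrm{Obs}_{ij} - \mathrm{Exp}_{ij})^2}{\mathrm{Exp}_{ij}}$
To calculate Cramer’s coefficient (the formula accounts for the sample size and the number of rows/columns):
$\gamma = \sqrt{\frac{\chi^2}{N \times \min(p-1;\, q-1)}}$
where $p$ and $q$ are the numbers of rows and columns of the table.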
Chi-Squared Statistic & Cramer’s Coefficient
Chi-square = 57.91
How small or big is this value? We can’t assess it on its own, so we calculate
Cramer’s coefficient.
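As a worked illustration, the coefficient for the Hepatitis C example can be computed from the value above. This is a sketch under two assumptions that are not stated explicitly here: that the table has 2 rows (Hepatitis C: yes/no) and 3 columns (Commercial parlor, Elsewhere, No Tattoo), and that Cramer’s coefficient takes its usual square-root form.

```latex
\gamma = \sqrt{\frac{\chi^2}{N \times \min(p-1;\, q-1)}}
       = \sqrt{\frac{57.91}{626 \times \min(2-1;\, 3-1)}}
       = \sqrt{\frac{57.91}{626}}
       \approx 0.30
```

Because the coefficient always lies between 0 and 1, a value of about 0.30 is easier to judge than the raw 57.91.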
Observed counts
Gender         Women   Men   Total
Satisfaction
--                53    32      85
-                156    88     244
+                 92   200     292
++                27   152     179

Row percentages
Gender         Women   Men   Total
Satisfaction
--                62    38     100
-                 64    36     100
+                 32    68     100
++                15    85     100
Margin            41    59     100
Column percentages
Gender         Women   Men   Margin
Satisfaction
--                16     7      11
-                 48    19      30
+                 28    42      37
++                 8    32      22
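The row and column percentages above can be recomputed from the observed counts. The sketch below is only an illustration: the function names and the dictionary layout are assumptions made for this example, and only the counts come from the satisfaction table.

```python
# Observed counts from the satisfaction table.
# Keys are satisfaction levels; each tuple is (Women, Men).
observed = {
    "--": (53, 32),
    "-": (156, 88),
    "+": (92, 200),
    "++": (27, 152),
}


def row_percentages(table):
    """Percent OF each satisfaction level: each row is divided by its row total."""
    return {
        level: tuple(100 * count / sum(counts) for count in counts)
        for level, counts in table.items()
    }


def column_percentages(table):
    """Percent OF each gender: each column is divided by its column total."""
    column_totals = [sum(column) for column in zip(*table.values())]
    return {
        level: tuple(100 * count / total for count, total in zip(counts, column_totals))
        for level, counts in table.items()
    }


row_pct = row_percentages(observed)
col_pct = column_percentages(observed)

for level in observed:
    rows = [round(p) for p in row_pct[level]]
    cols = [round(p) for p in col_pct[level]]
    print(f"{level:>2}  row % (Women, Men): {rows}  column % (Women, Men): {cols}")
```

The same count answers two different questions depending on the total it is divided by: 53 is about 62% of the people in the "--" row, but only about 16% of the women.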
Observed Situation
Gender         Women   Men   Total
Satisfaction
--                53    32      85
-                156    88     244
+                 92   200     292
++                27   152     179
Total            328   472     800

Artificial/Expected Independence Situation
Gender         Women   Men   Total
Satisfaction
--                35    50      85
-                100   144     244
+                120   172     292
++                73   106     179
Total            328   472     800
Calculating the Chi-Squared Statistic:
χ² = 129
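The calculation for the satisfaction-by-gender table can be sketched as follows. The function names are assumptions made for this example, and the square-root form of Cramer’s coefficient is the usual definition, assumed here. The sketch computes the expected counts from the row and column totals, then the Chi-squared statistic, once with unrounded expected counts and once with expected counts rounded to whole numbers.

```python
from math import sqrt

# Observed counts: rows are satisfaction levels (--, -, +, ++),
# columns are (Women, Men).
observed = [
    [53, 32],
    [156, 88],
    [92, 200],
    [27, 152],
]


def expected_counts(obs):
    """Expected count under independence: row total * column total / total count."""
    row_totals = [sum(row) for row in obs]
    col_totals = [sum(col) for col in zip(*obs)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared(obs, exp):
    """Sum over all cells of (observed - expected)^2 / expected."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
    )


def cramers_coefficient(chi2, n, n_rows, n_cols):
    """Assumed square-root form: sqrt(chi2 / (N * min(p - 1, q - 1)))."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


expected = expected_counts(observed)
rounded_expected = [[round(e) for e in row] for row in expected]

chi2_exact = chi_squared(observed, expected)
chi2_rounded = chi_squared(observed, rounded_expected)

n = sum(sum(row) for row in observed)
gamma = cramers_coefficient(chi2_exact, n, len(observed), len(observed[0]))

print("expected (rounded):", rounded_expected)
print("chi-squared, unrounded expected counts:", round(chi2_exact, 2))
print("chi-squared, rounded expected counts:  ", round(chi2_rounded, 2))
print("Cramer's coefficient:", round(gamma, 2))
```

With unrounded expected counts the statistic comes out slightly above 129. Rounding the expected counts to whole numbers first gives a value that rounds to 129, which suggests (but does not prove) that this is how the figure above was obtained.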
1st step:
Choose the variable to be placed on the x-axis, the explanatory (predictor)
variable, and the one to be placed on the y-axis, the response (predicted)
variable.
2nd step:
Draw the scatterplot.
https://ourworldindata.org/grapher/alcohol-consumption-vs-gdp-per-capita
Which is Response and which is Explanatory?
For cities in a certain country, the number of places of worship and the
number of homicides are positively correlated. How can that be explained?
Two variables may be related even if neither one is the “cause” of the other.
Linear Correlation
Given a linear relationship, a number between -1 and 1 can be
assigned to a scatterplot to measure the intensity of the correlation –
the linear correlation coefficient r:
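Formula: $r = \frac{\mathrm{Cov}(x, y)}{s_x \, s_y}$ where $\mathrm{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n}$
($s_x$ and $s_y$ are the standard deviations of x and y, and $n$ is the number of points.)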
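The formula for r can be sketched in a few lines of code. The data below are hypothetical values chosen only to make the arithmetic easy to follow, and the function names are assumptions made for this example. The covariance and the standard deviations both divide by n here; using n − 1 in both places would give the same r, because the divisor cancels.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Average product of deviations from the means (divides by n)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)


def std_dev(values):
    """Standard deviation with the same divisor n as the covariance."""
    return sqrt(covariance(values, values))


def correlation(xs, ys):
    """Linear correlation coefficient r = Cov(x, y) / (s_x * s_y)."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


# Hypothetical data, not taken from any dataset in this course.
xs = [1, 2, 3, 4, 5]
ys_up = [3, 5, 7, 9, 11]    # points exactly on an increasing line
ys_down = [10, 8, 6, 4, 2]  # points exactly on a decreasing line
ys_noisy = [2, 1, 4, 3, 5]  # increasing trend with some scatter

print(correlation(xs, ys_up))
print(correlation(xs, ys_down))
print(correlation(xs, ys_noisy))
```

Points lying exactly on an increasing line give r = 1, points exactly on a decreasing line give r = −1, and scattered points with an upward trend give a value in between.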
For each scatterplot below, suggest an approximate value for the
linear correlation coefficient r
Regression Line
We aim to find the line that does the best job in modelling the
relationship: line of best fit.
Even that model won’t be perfect: some points will be above the line
and some will be below it.
Example
The data describe the number of manatees killed in Florida waterways and
the number of powerboats registered, for each year from 1988 to 2000.
ŷ = −45.671 + 0.131x
(x: powerboat registrations in thousands; ŷ: predicted number of manatees killed)
There is a unique line which minimizes the sum of the squares of the
residuals for all points in the scatterplot
⇒ It is considered THE line of best fit
$\text{slope} = r \cdot \frac{s_y}{s_x}$ and $\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$
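The slope and intercept formulas can be sketched in code as follows. The data are hypothetical (the manatee data are not reproduced here), and the function names are assumptions made for this example. The last lines compare the sum of squared residuals of the fitted line with that of a slightly different line.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Return (slope, intercept) using slope = r * s_y / s_x
    and intercept = y_bar - slope * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    r = cov / (s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum over all points of (observed y - predicted y)^2."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to keep the arithmetic simple.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

slope, intercept = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)
nudged = sum_squared_residuals(xs, ys, slope + 0.1, intercept)

print("slope:", slope, "intercept:", intercept)
print("sum of squared residuals, fitted line:", best)
print("sum of squared residuals, nudged line:", nudged)
```

Changing the slope (or the intercept) away from the fitted values can only increase the sum of squared residuals, which is what makes this line the line of best fit.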
Compared to all other imaginable lines, the line of best fit yields the
smallest value for the sum of squares of residuals: 1233.204
The Domain of a Regression Model
Interpolation vs. Extrapolation
The model for manatee deaths as a function of powerboat registrations was
constructed from x-values (powerboat registrations) between about 450 and
750 thousand.
This range is called the domain of the model.
A model can be used to determine y-values (manatee deaths) for x-values
(powerboat registrations) within the domain, but outside this domain it may not
make sense.
For example, the y-intercept, which is the predicted value when x = 0, is very far
outside this range for the model in our example, and in this case (being negative)
it clearly doesn’t make sense.
Let’s predict the number of manatees killed when 500 thousand powerboats
are registered.
If there are 500 thousand powerboat registrations, the model says:
ŷ = -45.671 + 0.131(500) = 19.83
That is, about 20 manatees would be killed.
When the x-value being used is within the domain of the model, such an
estimate is called an interpolation. Interpolations are safe.
When the x-value being used is outside the domain of the model (here
[450;750]), it is called an extrapolation.
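The distinction can be sketched in code. The coefficients and the domain [450; 750] are the ones given for the manatee model. The function name and the returned label are assumptions made for this example.

```python
# Coefficients of the manatee model and its domain
# (x is in thousands of powerboat registrations).
INTERCEPT = -45.671
SLOPE = 0.131
DOMAIN = (450, 750)


def predict_manatees(powerboats_thousands):
    """Return the predicted number of manatees killed and whether the
    prediction is an interpolation or an extrapolation."""
    predicted = INTERCEPT + SLOPE * powerboats_thousands
    low, high = DOMAIN
    if low <= powerboats_thousands <= high:
        kind = "interpolation"
    else:
        kind = "extrapolation"
    return predicted, kind


for x in (500, 0, 900):
    predicted, kind = predict_manatees(x)
    print(f"x = {x}: predicted = {predicted:.2f} ({kind})")
```

For x = 500 the prediction is an interpolation. For x = 0 the formula still returns a number, but it is negative and lies far outside the domain, so it should not be trusted.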