Skript - Statistik 1 - en - WS2425
Statistics 1
Lecture Slides
Prof. Dr. Jens Perret
Version 2024
ISM 2024 Perret 1
Disclaimer
Any use of this lecture script, in full or in part, outside of the ISM and of events organised by it, is
prohibited without the prior consent of the school.
The author (or authors) is responsible for the content of this script.
Please be advised that trying to obtain additional information about the content or structure of exams
from lecturers, in particular external lecturers, may be considered an attempt to gain an unfair
advantage over other students, i.e. cheating.
Allowed tools:
• ISM-Formulary Part 1
• Non-programmable calculators are allowed.
Please check the list of allowed calculator models on the ISM-net. If you cannot find your calculator
on the list, please contact Prof. Dr. Jens Perret ([email protected]), the coordinator of the
module, at least one month before the exam.
All slides marked as excursus are not relevant for the exam.
Perret, J.K. (2022): Workbook Statistics, Available at the ISM Moodle Platform
Fahrmeir L., R. Künstler, I. Pigeot and G. Tutz (2011): Statistik, Springer, 7th edition.
If you suspect you are among the 98.9% check out the additional online course materials:
For the lectures you can find accompanying explanatory and exercise videos at the ISM moodle
platform online at
moodle.ism.de
The materials made available there, as well as the materials linked to in this slide set, are all of the
exam questions from previous semesters and the mock exams simulate the structure of current exams.
Excel Tutorials
01 Statistical Basics
1.1 Basic Terms
1.2 Scale levels
1.3 Types of data
1.4 Diagrams and histograms
02 Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
2.4 Measures of distribution
2.5 Boxplots
03 Bivariate Statistics
3.1 Contingency tables
3.2 Scatterplots
3.3 Measures of association
3.3.1 Covariance and Pearson‘s Correlation Coefficient
3.3.2 Spearman‘s Rank-correlation coefficient
3.3.3 χ2-Statistic and Contingency
3.3.4 Scale levels and Effect sizes
3.4 Simple linear regression
01
Statistical Basics
1.1 Basic Terms
1.2 Scale levels
1.3 Types of data
Population
Set of all feasible statistical units
Subpopulation
Part of the population
Sample
Subset of the population (usually significantly smaller), set of statistical units that is considered in an
analysis.
Characteristic
(also: variable) Property of a statistical unit that is measured or studied
Characteristic value
Value that a characteristic takes
Continuous characteristics can take any value (e.g. variables that can be measured as exactly as
necessary)
Discrete characteristics can only take a finite (or countably infinite) number of values. On an axis gaps
exist between distinct values (e.g. age in years)
Quasi-continuous characteristics are actually discrete but usually are considered to be continuous
(e.g. monetary variables like returns or income)
Descriptive characteristics can take more than one value (e.g. mobile numbers)
Exercises on Continuity
01
Statistical Basics
1.1 Basic Terms
1.2 Scale levels
1.3 Types of data
Ordinally scaled characteristics can be ordered. However, the distance between two values cannot
sensibly be interpreted.
(e.g. categories of quality, marks, ratings,…)
Interval scaled characteristics can be ordered and the distances between two values can sensibly be
interpreted. However, no absolute zero exists.
(e.g. year of birth, temperature in degrees Celsius, ...)
Ratio scaled characteristics can be ordered, distances can be interpreted and they have a natural
absolute zero.
(e.g. age, height, monetary values, ...)
The interval and ratio scales are jointly referred to as the cardinal or metric scale.
Scale level: Permitted operations
Nominal scale: Frequencies
Ordinal scale: Plus ordering, ranking
Interval scale: Plus addition and subtraction
Ratio scale: Plus multiplication and division
Exercises on Descriptiveness
01
Statistical Basics
1.2 Scale levels
1.3 Types of data
1.4 Diagrams and histograms
Panel data
In a panel, data are collected for a number of statistical units at different points in time. The important
aspect of a panel is that the set of participants does not change over time.
(e.g. panel surveys, developments of unemployment rates across all EU countries)
Types of cross-sectional data
[Figure: population and sample (x1, x2, ..., xn), linked by sampling and estimation]
The ordered data set for the patients‘ age looks as follows:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10)
18 21 21 23 25 34 37 46 56 89
ai ni ri ci
1 2 0.0156 0.0156
2 30 0.2344 0.2500
3 37 0.2891 0.5391
4 28 0.2188 0.7578
5 23 0.1797 0.9375
6 8 0.0625 1.0000
Class Frequency
[u0; u1) n1
[u1; u2) n2
⁞ ⁞
[uk-1; uk) nk
Sum: n = n1 + .... + nk
01
Statistical Basics
1.2 Scale levels
1.3 Types of data
1.4 Diagrams and histograms
A first illustration of the collected data is usually achieved via one of the following types of figures:
• Pie chart
• Block chart
• Bar chart
• Line chart
• Histogram
Pie charts:
The sectors are sized and drawn according to the frequencies.
Block charts:
Blocks are drawn according to the relative or absolute frequencies.
Bar charts:
Bars are drawn according to the relative or absolute frequencies.
Line charts:
Successive points, placed according to the relative or absolute frequencies, are connected with
continuous lines. Line charts are used almost exclusively to illustrate time series.
Histogram:
• Is used in particular with classified data.
• Attention! In contrast to block charts, it is the area of each block, not its height, that is
proportional to the underlying frequency. Thus the height of each block is given as:
Height of block i = $r_i / \Delta_i$, so that
Area of the block = Width ∙ Height = $\Delta_i \cdot \frac{r_i}{\Delta_i} = r_i$ with $\Delta_i = u_i - u_{i-1}$
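The height rule for histograms can be checked with a few lines of Python. The following is only a sketch: it uses the classes and frequencies of the tip dataset from the exercises in this script, and the variable names are chosen here for illustration.

```python
# Class boundaries u_0, ..., u_k and absolute frequencies n_i of the tip dataset
bounds = [0, 1, 2, 3, 4, 5]
counts = [3, 4, 4, 2, 7]

n = sum(counts)
heights = []
for i, n_i in enumerate(counts):
    width = bounds[i + 1] - bounds[i]  # class width: upper minus lower boundary
    r_i = n_i / n                      # relative frequency of the class
    heights.append(r_i / width)        # block height = relative frequency / width

# The block areas (width * height) add up to 1
total_area = sum(h * (bounds[i + 1] - bounds[i]) for i, h in enumerate(heights))
print(heights, total_area)
```

Because all classes have width 1 here, the heights coincide with the relative frequencies; with unequal class widths they would differ.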
02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.1 Mode
2.2.2 Median
Mode:
= most frequent characteristic value (if unique)
• Simple to calculate
• Can be used even with nominally scaled data
But
• Very primitive measure of central tendency which should not be the sole criterion if a higher level
scale is present!
$x_M = a_i$ with $r_i = \max_j r_j$
For raw and grouped data, the mode is thus the characteristic value that occurs most often, i.e. the value
with the largest n_i and the largest r_i. If this value is not unique, the mode is usually not reported.
For classified data, the mode is the center (the middle value) m_i of the class that reports the highest frequency.
Exercise:
Calculate the mode for the...
Patient dataset Oktoberfest dataset Tip dataset
i xi ai ni Class ni
1 25 1 2 [0; 1) 3
2 21 2 30 [1; 2) 4
3 18 3 37 [2; 3) 4
4 37 4 28 [3; 4) 2
5 56 5 23 [4; 5) 7
6 89 6 8
7 46
8 23
9 21
10 34
Solution:
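One possible way to work through this exercise is sketched below in Python. The helper `mode_from_counts` is a hypothetical name introduced for this illustration; it returns nothing when the most frequent value is not unique, following the convention stated above.

```python
from collections import Counter

patients = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]               # raw data
oktoberfest = {1: 2, 2: 30, 3: 37, 4: 28, 5: 23, 6: 8}            # a_i -> n_i
tips = {(0, 1): 3, (1, 2): 4, (2, 3): 4, (3, 4): 2, (4, 5): 7}    # class -> n_i

def mode_from_counts(counts):
    """Return the key with the highest frequency, or None if it is not unique."""
    top = max(counts.values())
    winners = [key for key, value in counts.items() if value == top]
    return winners[0] if len(winners) == 1 else None

mode_patients = mode_from_counts(Counter(patients))
mode_oktoberfest = mode_from_counts(oktoberfest)
lower, upper = mode_from_counts(tips)   # modal class (unique in this dataset)
mode_tips = (lower + upper) / 2         # class center m_i of the modal class

print(mode_patients, mode_oktoberfest, mode_tips)
```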
02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.1 Mode
2.2.2 Median
Median:
Idea:
Calculate the middle (not the mean) of a data set, i.e. the median divides the data set in two parts that
each contain 50% of the data points.
• In many cases much harder to calculate than the arithmetic mean
• Robust against outliers
• Can be used with data that is only ordinally scaled
But
• If the data is metrically scaled, information may be lost!
For grouped data, the median lies in the first group i whose cumulative relative frequency reaches 0.5:
$c_i = \sum_{j=1}^{i} r_j \ge 0.5$
Here $\sum_{j=1}^{i-1} r_j$ is the cumulated frequency of class i-1 and $r_i$ is the relative frequency of class i.
Exercise:
Calculate the median for the...
Patient dataset Oktoberfest dataset Tip dataset
i xi ai ni Class ni
1 25 1 2 [0; 1) 3
2 21 2 30 [1; 2) 4
3 18 3 37 [2; 3) 4
4 37 4 28 [3; 4) 2
5 56 5 23 [4; 5) 7
6 89 6 8
7 46
8 23
9 21
10 34
Solution:
02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.2 Median
2.2.3 Quantiles
Special quantiles:
• Quartile: In this case the data set is divided into four parts.
x0.25 is the first / lower quartile and x0.75 is the third / upper quartile. x0.5, the second /
“middle” quartile, is the median.
• Percentile: Quantiles that divide the data set at specific percentages are referred to as percentiles, e.g. the
5% percentile x0.05
$c_i = \sum_{j=1}^{i} r_j \ge p$
Here $\sum_{j=1}^{i-1} r_j$ is the cumulated frequency of class i-1 and $r_i$ is the relative frequency of class i.
Exercise:
Calculate the 20% quantile of the...
Patient dataset Oktoberfest dataset Tip dataset
i xi ai ni Class ni
1 25 1 2 [0; 1) 3
2 21 2 30 [1; 2) 4
3 18 3 37 [2; 3) 4
4 37 4 28 [3; 4) 2
5 56 5 23 [4; 5) 7
6 89 6 8
7 46
8 23
9 21
10 34
Solution:
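The quantile exercise for the raw and the grouped dataset can be sketched in Python as follows. The convention used is an assumption made for this sketch: if n·p is an integer, the two neighbouring ordered values are averaged, otherwise the value at position ⌈n·p⌉ is taken. This convention reproduces the quartiles used in the interquartile range examples of this script; the classified tip dataset would additionally require interpolation within the class and is not covered here.

```python
import math

def quantile(data, p):
    """p-quantile of raw data for 0 < p < 1 (convention assumed for this sketch)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    xs = sorted(data)
    k = len(xs) * p
    if abs(k - round(k)) < 1e-9:         # n*p is an integer
        k = int(round(k))
        return (xs[k - 1] + xs[k]) / 2   # average of x_(np) and x_(np+1)
    return xs[math.ceil(k) - 1]          # ordered value at position ceil(n*p)

patients = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
# Expand the grouped Oktoberfest data (a_i -> n_i) into a raw list
oktoberfest = [a for a, n in {1: 2, 2: 30, 3: 37, 4: 28, 5: 23, 6: 8}.items()
               for _ in range(n)]

print(quantile(patients, 0.2), quantile(oktoberfest, 0.2))
print(quantile(patients, 0.5), quantile(oktoberfest, 0.5))
```

Other textbooks and software packages use different quantile conventions, so results may deviate slightly from those of a spreadsheet.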
02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.3 Quantiles
2.2.4 Arithmetic mean
Arithmetic mean
Idea: calculate average value
• Simple to calculate and well known
Exercise:
Calculate the arithmetic mean for the...
Patient dataset Oktoberfest dataset Tip dataset
i xi ai ni Class ni
1 25 1 2 [0; 1) 3
2 21 2 30 [1; 2) 4
3 18 3 37 [2; 3) 4
4 37 4 28 [3; 4) 2
5 56 5 23 [4; 5) 7
6 89 6 8
7 46
8 23
9 21
10 34
Solution:
Patient dataset:
i xi
1 25
2 21
3 18
4 37
5 56
6 89
7 46
8 23
9 21
10 34
Sum xi: 370
x̄ = 370 / 10 = 37

Oktoberfest dataset:
ai ni
1 2
2 30
3 37
4 28
5 23
6 8
Sum ni: 128
Sum ai·ni: 448
x̄ = 448 / 128 = 3.5

Tip dataset:
Class mi ni
[0; 1) 0.5 3
[1; 2) 1.5 4
[2; 3) 2.5 4
[3; 4) 3.5 2
[4; 5) 4.5 7
Sum ni: 20
Sum mi·ni: 56
x̄ = 56 / 20 = 2.8
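The three means of the solution can be reproduced with a short Python sketch. The variable names are chosen for illustration only; for the classified tip dataset the class centers m_i stand in for the unknown individual values.

```python
patients = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
oktoberfest = {1: 2, 2: 30, 3: 37, 4: 28, 5: 23, 6: 8}       # a_i -> n_i
tips = {0.5: 3, 1.5: 4, 2.5: 4, 3.5: 2, 4.5: 7}              # class center m_i -> n_i

# Raw data: sum of the values divided by their number
mean_patients = sum(patients) / len(patients)

# Grouped data: values weighted by their absolute frequencies
mean_oktoberfest = sum(a * n for a, n in oktoberfest.items()) / sum(oktoberfest.values())

# Classified data: class centers weighted by the class frequencies
mean_tips = sum(m * n for m, n in tips.items()) / sum(tips.values())

print(mean_patients, mean_oktoberfest, mean_tips)
```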
2.2 Measures of Central Tendency
Overview Online Exercises
02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.4 Arithmetic mean
2.2.5 Weighted arithmetic mean
2.2.6 Geometric mean
2.3 Measures of dispersion
$\bar{x}_w = \sum_{i=1}^{n} w_i x_i$ (for weights $w_i$ that sum to 1)
Alternative formula:
$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Example:
During a race a Chinese cyclist rides at 42 km/h for 2.5 hours and afterwards at 31 km/h for 3
hours.
What is his average speed during the race?
Solution:
It holds that
• In the first 2.5 hours he covers 105 km.
• In the second part he covers 93 km.
⇒ He covers 198 km in 5.5 hours, so his average speed is 198 km / 5.5 h = 36 km/h.
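The cyclist example corresponds to a weighted arithmetic mean in which the riding times act as weights. A minimal Python sketch, with the function name `weighted_mean` chosen for this illustration:

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

speeds = [42, 31]    # km/h
hours = [2.5, 3]     # riding time at each speed, used as weights

average_speed = weighted_mean(speeds, hours)
print(average_speed)  # total distance (105 km + 93 km) divided by total time (5.5 h)
```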
02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.5 Weighted arithmetic mean
2.2.6 Geometric mean
2.2.7 Harmonic mean
It is uncommon to use the geometric mean for classified data as too much information is lost through
the classification process.
Exercise:
The value of a stock increases by 30% in the first year and decreases by 20% in the second year.
What is the average annual growth rate?
Solution:
(by using growth factors)
Starting with the initial capital K0, the final capital after two years can be calculated via:
$K_2 = K_0 (1 + 0.3)(1 - 0.2) = K_0 \cdot 1.3 \cdot 0.8$
If the capital grew evenly at a constant rate i, it should hold that:
$K_2 = K_0 (1 + i)^2$
Therefore i can be determined via the geometric mean as follows:
$i = \sqrt{1.3 \cdot 0.8} - 1 = 0.0198 = 1.98\%$
Caution! The geometric mean needs to be applied to the growth factors (1 + i), not to the growth
rates i!
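The stock example can be verified with a small Python sketch. The function name is hypothetical; the point of the sketch is that the growth rates are first converted into growth factors before the geometric mean is taken.

```python
import math

def average_growth_rate(rates):
    """Average growth rate via the geometric mean of the growth factors (1 + rate)."""
    factors = [1 + r for r in rates]
    return math.prod(factors) ** (1 / len(factors)) - 1

rates = [0.30, -0.20]                       # +30% in year one, -20% in year two
i = average_growth_rate(rates)
naive = sum(rates) / len(rates)             # arithmetic mean of the rates, for comparison

print(round(i, 4), naive)
```

The arithmetic mean of the rates (5%) overstates the growth, since 1.05² = 1.1025 while the stock is only worth 1.04 times its initial value.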
02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.5 Weighted arithmetic mean
2.2.6 Geometric mean
2.2.7 Harmonic mean
Question: You drive a given route in one direction at a speed of 100 km/h and on the way back you
drive at 200 km/h. What was your average speed?
Use:
Averaging of ratios (quotients) if the numerator distribution is known
Example:
The cyclist Anna Bolika rides 90 km at a constant speed of 36 km/h. Afterwards she rides
40 km at a constant speed of 32 km/h.
What was her average speed over the whole distance?
Solution:
• For the first 90 km she needs $\frac{90}{36} = 2.5$ hours.
• The other 40 km take her $\frac{40}{32} = 1.25$ hours.
⇒ She cycles for 3.75 hours in total and covers 130 km.
The average speed thus is $\frac{130\ \text{km}}{3.75\ \text{h}} = 34.\overline{6}\ \frac{\text{km}}{\text{h}}$
Solution 2: (using the weighted harmonic mean)
$\bar{x}_h = \left( \frac{90}{90 + 40} \cdot \frac{1}{36} + \frac{40}{90 + 40} \cdot \frac{1}{32} \right)^{-1} \approx 34.67$
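Both harmonic mean examples, the round trip from the opening question and the cyclist, can be reproduced with the following Python sketch. The function name `weighted_harmonic_mean` is an assumption of this illustration; the distances serve as weights.

```python
def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum(w_i) / sum(w_i / x_i)."""
    return sum(weights) / sum(w / x for x, w in zip(values, weights))

# Cyclist: 90 km at 36 km/h, then 40 km at 32 km/h
cyclist_speed = weighted_harmonic_mean([36, 32], [90, 40])

# Round trip: the same distance at 100 km/h and at 200 km/h (equal weights)
round_trip_speed = weighted_harmonic_mean([100, 200], [1, 1])

print(round(cyclist_speed, 2), round(round_trip_speed, 2))
```

The round trip shows why the arithmetic mean (150 km/h) is misleading here: more time is spent on the slower leg, so the average speed is lower.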
02
Univariate Statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
2.3.1 Range
2.3.2 Interquartile range
The range is a measure of dispersion, but a rather unsuitable one as it is very susceptible to
outliers. It is more useful as a point of orientation when deciding on the classification of data.
Examples:
Patients dataset: R = 89 – 18 = 71
Oktoberfest dataset: R = 6 – 1 = 5
02
Univariate Statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
2.3.1 Range
2.3.2 Interquartile range
Formula:
IQA = x0.75 – x0.25
Note:
• Impervious to outliers
• Linked to the quartile deviation
But:
• Cannot be directly linked to the standard deviation
Examples:
Patient dataset: IQA = 46 – 21 = 25
Oktoberfest dataset: IQA = 4 – 2.5 = 1.5
02
Univariate Statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
2.3.2 Interquartile range
2.3.3 Variance and Standard deviation
Problem:
• Oftentimes dispersion needs to be minimized (see the derivation of the regression model)
• Minimization usually requires differentiation
• But: The absolute value function is not differentiable at all points (it has a kink at zero)
Solution:
Use quadratic deviations!
2.3.3 Variance and Standard deviation
For the different types of data for the theoretical variance we thus get:
(The expected value μ is known.)
For the sample variance of a partial sample the following formulae hold (corrected variance):
(The expected value μ is approximated by the arithmetic mean.)
For σ² and s² the so-called displacement law holds, which makes calculating the variance much easier:
Note:
• Very important in the context of inductive statistics
But:
• Susceptible to outliers
• Uses squared units of the original data set, which makes it hard to interpret.
Solution: Taking the square root → Standard deviation
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
Note:
• Advantage as compared to the variance: same unit as the original data set
• Most common measure of dispersion
But:
• Still susceptible to outliers
Patient dataset:
i xi xi²
1 25 625
2 21 441
3 18 324
4 37 1369
5 56 3136
6 89 7921
7 46 2116
8 23 529
9 21 441
10 34 1156
Sum: 370 18058
x̄ = 37
s² = 10/9 ∙ (0.1 ∙ 18058 – 37²) = 485.33
s = 22.03

Oktoberfest dataset:
ai ni ni·ai²
1 2 2
2 30 120
3 37 333
4 28 448
5 23 575
6 8 288
Sum: 128 1766
x̄ = 3.5
s² = 128/127 ∙ (1766/128 – 3.5²) = 1.5591
s = 1.2486
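The two worked examples use the displacement law in the form s² = n/(n−1) ∙ (mean of the squares − squared mean). The Python sketch below, with function names chosen for this illustration, compares that shortcut with the direct definition of the corrected sample variance for the patient dataset.

```python
import math

patients = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]

def sample_variance_direct(xs):
    """Corrected sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shift(xs):
    """Same quantity via the displacement law, as in the worked tables."""
    n = len(xs)
    mean = sum(xs) / n
    mean_of_squares = sum(x * x for x in xs) / n
    return n / (n - 1) * (mean_of_squares - mean ** 2)

s2_direct = sample_variance_direct(patients)
s2_shift = sample_variance_shift(patients)
s = math.sqrt(s2_direct)

print(round(s2_direct, 2), round(s2_shift, 2), round(s, 2))
```

Both routes give the same value up to floating-point rounding; the shortcut only needs the sums of x and x², which is why it is convenient for calculations by hand.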
02
Univariate Statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
2.3.2 Interquartile range
2.3.3 Variance and Standard deviation
The coefficient of variation relates the standard deviation to the expected value, or the sample standard
deviation to the arithmetic mean:
$V = \frac{\sigma}{\mu}$ or $V = \frac{s}{\bar{x}}$
Exercise:
Decide whether the dispersion is larger in the Oktoberfest dataset or in the patient dataset.
Solution:
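Measures Patient dataset:
x̄ = 37 and s = 22.03 ⇒ V = 0.60
Measures Oktoberfest dataset:
x̄ = 3.5 and s = 1.25 ⇒ V = 0.36
Measures Tip dataset:
x̄ = 2.8 and s = 1.53 ⇒ V = 0.55
Relative to the mean, the dispersion is therefore larger in the patient dataset than in the Oktoberfest dataset.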
Random Exercise
02
Univariate Statistics
2.3 Measures of dispersion
2.4 Measures of distribution
2.4.1 Skewness
2.4.2 Kurtosis
2.5 Boxplots
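Theoretical:
$S = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \mu}{\sigma} \right)^3$
In relation to a sample:
$S = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$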
02
Univariate Statistics
2.3 Measures of dispersion
2.4 Measures of distribution
2.4.1 Skewness
2.4.2 Kurtosis
2.5 Boxplots
Theoretical:
$W = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \mu}{\sigma} \right)^4$
In relation to a sample:
$W = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4$
02
Univariate Statistics
2.3 Measures of dispersion
2.4 Measures of distribution
2.5 Boxplots
5-Number-Summary:
x(1) Sample minimum
x0.25 Lower quartile
x0.5 Median
x0.75 Upper quartile
x(n) Sample maximum
• Possible outliers and their severity can be detected via the whiskers, which extend to
the maximum and minimum.
• Additionally, the length of the box represents the interquartile range and thus reports on a measure
of dispersion.
• The boxplot also answers questions regarding the skewness and symmetry of the data set:
• If the median is situated (more or less) in the middle of the box, this indicates a (more or less)
symmetrical distribution.
• If the median lies closer to the lower quartile, this indicates a right-skewed
distribution.
• If the median lies closer to the upper quartile, this indicates a left-skewed distribution.
Example:
Construct a boxplot for
• the patient dataset
• the Oktoberfest dataset
Solution:
[Figure: boxplots of the patient dataset (axis from 0 to 100) and the Oktoberfest dataset (axis from 0 to 7)]
03
Bivariate Statistics
3.1 Contingency tables
3.2 Scatterplots
3.3 Measures of association
Up to this point, only univariate data has been considered, i.e. a single variable / characteristic at a
time. In this part the interaction of two variables is studied.
Raw data:
Bivariate raw data is given as data points: (x1, y1),…, (xn, yn)
Grouped data:
Bivariate grouped data is given in a two-dimensional frequency table, a so called contingency table:
b1 b2 ... bl
a1 n11 n12 ... n1l n1●
a2 n21 n22 ... n2l n2●
⁞ ⁞ ⁞ ⁞ ⁞
ak nk1 nk2 ... nkl nk●
n●1 n●2 ... n●l n
Classified data:
Bivariate classified data is given via a two-dimensional contingency table:
Example:
Age group Sex
Male Female Total
Below 3 years 1,018,505 966,018 1,984,523
3 up to less than 6 years 1,041,011 984,172 2,025,183
6 up to less than 15 years 3,485,685 3,309,900 6,795,585
15 up to less than 18 years 1,195,380 1,133,681 2,329,061
18 up to less than 25 years 3,325,707 3,194,751 6,520,458
25 up to less than 30 years 2,455,885 2,416,648 4,872,533
30 up to less than 40 years 4,763,360 4,731,444 9,494,804
40 up to less than 50 years 6,756,735 6,594,133 13,350,868
50 up to less than 65 years 8,081,342 8,247,217 16,328,559
65 up to less than 75 years 4,246,483 4,788,107 9,034,590
75 years and more 2,775,848 4,707,683 7,483,531
Total 39,145,941 41,073,754 80,219,695
Population of Germany 09.05.2011 (Census date) by sex and age groups (Source: Destatis)
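$n_{i\bullet} = \sum_{j=1}^{l} n_{ij}$
$n_{\bullet j} = \sum_{i=1}^{k} n_{ij}$
$r_{i\bullet} = \sum_{j=1}^{l} r_{ij}$
$r_{\bullet j} = \sum_{i=1}^{k} r_{ij}$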
Exercise:
Using the information given below, calculate the missing data in the contingency table that
describes the relation between place of residence and chosen mode of commuting to work.
Solution:
Public
Car By foot Bicycle Sum
Transport
Essen 150 230 265 70 715
Wuppertal 110 250 25 10 395
Köln 400 300 240 80 1020
Dortmund 610 120 120 20 870
Sum 1270 900 650 180 3000
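The marginal frequencies of the solution table can be checked with a short Python sketch. The joint frequencies are entered row by row (Essen, Wuppertal, Köln, Dortmund); the column labels are left out on purpose, because only the row and column sums matter here.

```python
# Joint absolute frequencies n_ij, one inner list per city
table = [
    [150, 230, 265, 70],    # Essen
    [110, 250, 25, 10],     # Wuppertal
    [400, 300, 240, 80],    # Köln
    [610, 120, 120, 20],    # Dortmund
]

row_sums = [sum(row) for row in table]          # marginal frequencies of the rows
col_sums = [sum(col) for col in zip(*table)]    # marginal frequencies of the columns
n = sum(row_sums)                               # total number of observations

# Relative joint frequencies r_ij = n_ij / n
relative = [[value / n for value in row] for row in table]

print(row_sums, col_sums, n)
```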
03
Bivariate Statistics
3.1 Contingency tables
3.2 Scatterplots
3.3 Measures of association
Scatterplot:
03
Bivariate Statistics
3.2 Scatterplots
3.3 Measures of association
3.3.1 Covariance and Pearson‘s Correlation coefficient
3.3.2 Spearman‘s Rank-correlation coefficient
Covariance
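$\sigma_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
Thus, with the quadrants taken relative to the point $(\bar{x}, \bar{y})$ and each data point spanning a
rectangle with signed area $(x_i - \bar{x})(y_i - \bar{y})$:
• Many and large rectangles in the first and third quadrant lead to positive values (“positive relation”)
• Many and large rectangles in the second and fourth quadrant lead to negative values (“negative
relation”)
Problem:
Covariance is not bounded and can take any real value!
Solution:
Standardization
$r_{xy} = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{x} \cdot \bar{y}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \cdot \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\overline{x \cdot y} - \bar{x} \cdot \bar{y}}{\sqrt{\overline{x^2} - \bar{x}^2} \cdot \sqrt{\overline{y^2} - \bar{y}^2}}$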
Attention!
A correlation coefficient close to zero only indicates that no linear relation exists; a non-linear
relation can still be present.
Example:
A hospital has measured the number of sold entrance tickets for a neighboring ski resort x (in
thousands) as well as the number y of patients that needed to be treated for broken bones:
i xi yi
1 5 12
2 6 14
3 5.5 9
4 2 4
5 3.8 7
6 4.4 10
7 6.2 13
8 5.6 12
9 4.2 7
10 5.9 15
Solution: (Approach 1)
i xi yi (xi - x̄) (yi - ȳ) (xi - x̄)² (yi - ȳ)² (xi - x̄)(yi - ȳ)
1 5 12 0.14 1.7 0.0196 2.89 0.238
2 6 14 1.14 3.7 1.2996 13.69 4.218
3 5.5 9 0.64 -1.3 0.4096 1.69 -0.832
4 2 4 -2.86 -6.3 8.1796 39.69 18.018
5 3.8 7 -1.06 -3.3 1.1236 10.89 3.498
6 4.4 10 -0.46 -0.3 0.2116 0.09 0.138
7 6.2 13 1.34 2.7 1.7956 7.29 3.618
8 5.6 12 0.74 1.7 0.5476 2.89 1.258
9 4.2 7 -0.66 -3.3 0.4356 10.89 2.178
10 5.9 15 1.04 4.7 1.0816 22.09 4.888
Sum: 48.6 103 x x 15.104 112.1 37.22
x̄ = 4.86, ȳ = 10.3, $r_{xy} = \frac{37.22}{\sqrt{15.104} \cdot \sqrt{112.1}} = 0.905$
Solution: (Approach 2)
i xi yi xi2 yi2 xiyi
1 5 12 25 144 60
2 6 14 36 196 84
3 5.5 9 30.25 81 49.5
4 2 4 4 16 8
5 3.8 7 14.44 49 26.6
6 4.4 10 19.36 100 44
7 6.2 13 38.44 169 80.6
8 5.6 12 31.36 144 67.2
9 4.2 7 17.64 49 29.4
10 5.9 15 34.81 225 88.5
Sum: 48.6 103 251.3 1173 537.8
Average: 4.86 10.3 25.13 117.3 53.78
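Both approaches lead to the same correlation coefficient, which the following Python sketch illustrates for the ski resort data. The function names are chosen for this illustration: the first works with the deviations from the means (Approach 1), the second with the averages of x, y, x², y² and x∙y (Approach 2).

```python
import math

x = [5, 6, 5.5, 2, 3.8, 4.4, 6.2, 5.6, 4.2, 5.9]   # tickets sold (in thousands)
y = [12, 14, 9, 4, 7, 10, 13, 12, 7, 15]           # patients with broken bones

def mean(values):
    return sum(values) / len(values)

def pearson_deviations(x, y):
    """Approach 1: sums of products and squares of the deviations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_moments(x, y):
    """Approach 2: averages of x*y, x^2 and y^2 minus products of the means."""
    mx, my = mean(x), mean(y)
    mxy = mean([a * b for a, b in zip(x, y)])
    mxx = mean([a * a for a in x])
    myy = mean([b * b for b in y])
    return (mxy - mx * my) / math.sqrt((mxx - mx ** 2) * (myy - my ** 2))

r1 = pearson_deviations(x, y)
r2 = pearson_moments(x, y)
print(round(r1, 3), round(r2, 3))
```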
Exercise 8.1
Exercise 8.2
Exercise 9.1
Exercise 9.2
Random Exercise
03
Bivariate Statistics
3.2 Scatterplots
3.3 Measures of association
3.3.1 Covariance and Pearson‘s Correlation coefficient
3.3.2 Spearman‘s Rank-correlation coefficient
Example
Original Data set: E C A B F D
Ranked Data set: 5 3 1 2 6 4
Example
Original Data set: A C A B C C
A exists twice and takes rank 1 and 2, thus in both cases the average rank of 1.5 is used (average of the
two ranks)
B exists once at rank 3, thus a rank of 3 is used
C exists thrice on ranks 4 through 6, thus in all three cases the average rank of 5 is used (average of the
three ranks)
Ranked Data set: 1.5 5 1.5 3 5 5
If the dataset is at least ordinally scaled for both variables a ranking can be established, where R(xi) is
the rank of xi in the dataset. For an ordered list it holds that R(x(i)) = i
If two data points hold the same rank we call it a tie. If every rank is only assigned once we say that no
ties exist.
Mathematically the correlation coefficient by Spearman results from applying the correlation
coefficient by Pearson to the ranking. It simplifies for the situation without ties.
Spearman’s correlation coefficient for raw data (with ties)
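$R_{xy} = \frac{\overline{R(x) \cdot R(y)} - \overline{R(x)} \cdot \overline{R(y)}}{\sqrt{\overline{R(x)^2} - \overline{R(x)}^2} \cdot \sqrt{\overline{R(y)^2} - \overline{R(y)}^2}}$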
Example:
The following table summarizes the ECTS marks of six randomly selected pupils in mathematics (xi)
and physics (yi) :
xi yi
B A
B B
A B
C C
E D
D D
Calculate Spearman’s rank correlation coefficient.
Solution:
i xi yi R(xi) R(yi) R(xi)² R(yi)² R(xi)R(yi)
1 B A 2.5 1 6.25 1 2.5
2 B B 2.5 2.5 6.25 6.25 6.25
3 A B 1 2.5 1 6.25 2.5
4 C C 4 4 16 16 16
5 E D 6 5.5 36 30.25 33
6 D D 5 5.5 25 30.25 27.5
Sum 21 21 90.5 90 87.75
Average 3.5 3.5 15.0833 15 14.625
$R_{xy} = \frac{14.625 - 3.5 \cdot 3.5}{\sqrt{15.0833 - 3.5 \cdot 3.5} \cdot \sqrt{15 - 3.5 \cdot 3.5}} = \frac{2.375}{2.7913} = 0.8509$
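The ranking with ties and the resulting coefficient can be reproduced with the Python sketch below. The helper names are assumptions of this illustration. Marks are ranked alphabetically, so that A receives the lowest rank, and tied values receive the average of the ranks they occupy.

```python
import math

x = ["B", "B", "A", "C", "E", "D"]   # marks in mathematics
y = ["A", "B", "B", "C", "D", "D"]   # marks in physics

def average_ranks(values):
    """Rank the values in ascending order; ties get the mean of their rank positions."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1            # first position occupied by v
        count = ordered.count(v)                # number of tied observations
        rank_of[v] = first + (count - 1) / 2    # mean of the tied positions
    return [rank_of[v] for v in values]

def mean(values):
    return sum(values) / len(values)

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    mab = mean([p * q for p, q in zip(a, b)])
    return (mab - ma * mb) / math.sqrt(
        (mean([p * p for p in a]) - ma ** 2) * (mean([q * q for q in b]) - mb ** 2)
    )

rx, ry = average_ranks(x), average_ranks(y)
spearman = pearson(rx, ry)   # Pearson's coefficient applied to the ranks
print(rx, ry, round(spearman, 4))
```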
Exercise 7.1
Exercise 7.2
Random Exercise
03
Bivariate Statistics
3.2 Scatterplots
3.3 Measures of association
3.3.2 Spearman‘s Rank-correlation coefficient
3.3.3 χ2-Statistic and Contingency
3.3.4 Scale levels and Effect sizes
Distinction:
• Contingency: Measure of association between nominally scaled variables
• Measure of association by Yule
• Contingency coefficient by Pearson
• Correlation: Measure of association between metrically or ordinally scaled variables
• Covariance and correlation coefficient by Bravais-Pearson (metric scale)
• Rank-correlation by Spearman (ordinal scale)
If at least one of the variables is nominally scaled, the association is referred to as contingency.
In the special case that both variables can only take two values, we simply call it association, and the
corresponding contingency table is called a four-field table (2×2 table):
b1 b2
a1 n11 n12 n1●
a2 n21 n22 n2●
n●1 n●2 n
Example:
A survey on smoking behavior collected data from 200 students and resulted in the following dataset :
Solution:
$Y = \frac{30 \cdot 50 - 70 \cdot 50}{30 \cdot 50 + 70 \cdot 50} = -0.4$
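If a contingency table has more than two rows or columns, the Yule coefficient can no longer be
calculated. In this case the contingency coefficient is calculated in several steps:
χ² (Chi squared) for any contingency table
$\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$
with the expected frequencies $e_{ij} = \frac{n_{i\bullet} \cdot n_{\bullet j}}{n}$
Contingency coefficient
$K = \sqrt{\frac{\chi^2}{\chi^2 + n}} \in \left[ 0; \sqrt{\frac{M - 1}{M}} \right]$
with M = min{I; J} (I is the number of rows and J is the number of columns of the contingency table)
Corrected contingency coefficient
$K^* = \sqrt{\frac{M}{M - 1}} \cdot K \in [0; 1]$
with M = min{I; J}.
Cramer's V
$V = \sqrt{\frac{\chi^2}{n \cdot \min(I - 1; J - 1)}} \in [0; 1]$
I is the number of rows and J is the number of columns of the contingency table.
Cramer's Phi
$\Phi = \sqrt{\frac{\chi^2}{n}}$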
Example:
Calculate the corrected contingency coefficient for the following dataset :
Smoker Non-smoker
Female 30 70 100
Male 50 50 100
80 120 200
Solution:
Expected frequencies: Smoker Non-smoker
Female 40 60 100
Male 40 60 100
80 120 200
$K = \sqrt{\frac{8.33}{8.33 + 200}} = 0.2$
$V = \sqrt{\frac{8.33}{200 \cdot \min(1; 1)}} = 0.2041$
M = 2
K* = 1.4142 ∙ 0.2 = 0.2828
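The steps of this solution, together with the Yule coefficient from the earlier example, can be traced in a Python sketch. Variable names are chosen for illustration; the expected frequencies are computed as row sum times column sum divided by n, which matches the table of expected frequencies above.

```python
import math

observed = [
    [30, 70],   # female: smoker, non-smoker
    [50, 50],   # male: smoker, non-smoker
]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
n = sum(row_sums)

# Expected frequencies under independence
expected = [[r * c / n for c in col_sums] for r in row_sums]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

M = min(len(row_sums), len(col_sums))
K = math.sqrt(chi2 / (chi2 + n))              # contingency coefficient
K_corrected = math.sqrt(M / (M - 1)) * K      # corrected contingency coefficient
V = math.sqrt(chi2 / (n * min(len(row_sums) - 1, len(col_sums) - 1)))   # Cramer's V

# Yule coefficient, only defined for 2x2 tables
(a, b), (c, d) = observed
yule = (a * d - b * c) / (a * d + b * c)

print(round(chi2, 2), round(K, 4), round(K_corrected, 4), round(V, 4), yule)
```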
Interpreting contingency:
Does a value of 0.2 represent strong or weak association?
• The interpretation is often wrong because the value is compared to the absolute
value of a correlation coefficient, which is not appropriate.
• While in theory the corrected contingency coefficient can take any value between 0 and 1, in practice
values are usually far below 1.
• Thus, statements regarding the strength of the association should always be supplemented by a
contingency test (to be discussed in statistics 2).
• Also the type of association can only be determined by conditional distributions.
Exercise 6.1
Exercise 6.2
Exercise 6.3
Exercise 6.4
Random Exercise
03
Bivariate Statistics
3.2 Scatterplots
3.3 Measures of association
3.3.2 Spearman‘s Rank-correlation coefficient
3.3.3 χ2-Statistic and Contingency coefficient
3.3.4 Scale levels and Effect sizes
Bivariate Data and Scale levels: (Which indicator can be calculated when?)
Grouping
0.0 < |Measure| < 0.1 No Association
0.1 < |Measure| < 0.3 Weak Association
0.3 < |Measure| < 0.5 Moderate Association
0.5 < |Measure| < 1.0 Strong Association
03
Bivariate Statistics
3.2 Scatterplots
3.3 Measures of association
3.4 Simple linear regression
Goal: The dependent variable y should be described via the independent variable x.
The goal lies in finding a function (line) that when set amidst the scatterplot
minimizes the sum of squared distances between the line and all of the points
of the scatterplot.
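[Figure: scatterplots with a fitted regression line, showing an observed value $y_i$ and the corresponding fitted value $\hat{y}_i$]
Regression line: $\hat{y} = b_0 + b_1 x$ with
• $b_1 = \frac{\overline{x \cdot y} - \bar{x} \cdot \bar{y}}{\overline{x^2} - \bar{x}^2}$
• $b_0 = \bar{y} - b_1 \bar{x}$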
a) Calculate the linear regression line that explains the costs in relation to the produced quantity.
b) Dina Vier receives a new order of 2,800 picture books. Which total costs can be expected?
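$\bar{y}$ is the average over the $y_i$.
$R^2 = \frac{\text{Explained variance}}{\text{Total variance}} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
It holds that 0 ≤ R² ≤ 1.
The coefficient gives the share of the total variance that can be explained via the regression line.
$R^2 = r_{xy}^2 = \frac{(\overline{x \cdot y} - \bar{x} \cdot \bar{y})^2}{(\overline{x^2} - \bar{x}^2)(\overline{y^2} - \bar{y}^2)}$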
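As an illustration of the regression formulas, the Python sketch below fits a line to the ski resort data from the correlation example. That choice of dataset is an assumption of this sketch, since the data of the picture book exercise is not reproduced here. It also checks that the ratio of explained to total variance equals the squared correlation coefficient.

```python
x = [5, 6, 5.5, 2, 3.8, 4.4, 6.2, 5.6, 4.2, 5.9]   # tickets sold (in thousands)
y = [12, 14, 9, 4, 7, 10, 13, 12, 7, 15]           # patients with broken bones

def mean(values):
    return sum(values) / len(values)

mx, my = mean(x), mean(y)
mxy = mean([a * b for a, b in zip(x, y)])
mxx = mean([a * a for a in x])
myy = mean([b * b for b in y])

b1 = (mxy - mx * my) / (mxx - mx ** 2)   # slope
b0 = my - b1 * mx                        # intercept

y_hat = [b0 + b1 * a for a in x]         # fitted values on the regression line

# Coefficient of determination: explained variance / total variance
r2 = sum((f - my) ** 2 for f in y_hat) / sum((b - my) ** 2 for b in y)

# Squared correlation coefficient, computed from the averages
r2_from_r = (mxy - mx * my) ** 2 / ((mxx - mx ** 2) * (myy - my ** 2))

print(round(b1, 4), round(b0, 4), round(r2, 4))
```

A prediction for a new x value is obtained by inserting it into b0 + b1·x; predictions far outside the observed range of x should be treated with caution.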
Grouping
0.00 < R2 < 0.01 No explanatory power
0.01 < R2 < 0.09 Weak explanatory power
0.09 < R2 < 0.25 Moderate explanatory power
0.25 < R2 < 1.00 Strong explanatory power
Exercise 10.1
Exercise 10.2
Exercise 10.3
Exercise 10.4
Random Exercise