
Skript - Statistik 1 - en - WS2425

The document outlines the course structure and examination details for 'Statistics 1' taught by Prof. Dr. Jens Perret at ISM. It covers essential statistical concepts, types of data, and scale levels, emphasizing the importance of understanding terminology for further studies. Additionally, it provides guidelines on exam tools, references, and resources available for students to enhance their understanding of statistics.


Bachelor of Arts / Science

Statistics 1
Lecture Slides
Prof. Dr. Jens Perret
Version 2024
ISM 2024 Perret 1
Disclaimer
Any use of this lecture script, in full or in part, outside of the ISM and of events organised by it, is
prohibited without the prior consent of the school.
The author (or authors) is responsible for the content of this script.

Lecture scripts cannot be cited in scholarly works.

Please be advised that trying to obtain additional information about the content or structure of exams
from lecturers, in particular external lecturers, may be considered an attempt to gain an unfair
advantage over other students, i.e. cheating.

ISM International School of Management GmbH
Otto-Hahn-Str. 19
44227 Dortmund
www.ism.de



Description of the Module
Mathematics and statistics are central aids to the depiction, analysis and interpretation of economic
matters. Beyond this they also make a contribution to decision-making in optimisation tasks in
enterprises. The “Mathematical basics” module teaches mathematical principles (“Business
mathematics” course) and basic descriptive statistics knowledge (“Statistics 1” course).



Exam
Written Exam:
• 120 minutes (60 minutes Business Mathematics / 60 minutes Statistics 1)
• 100 points (50 points Business Mathematics / 50 points Statistics 1)
• The exam is passed if the sum of points is at least 50.

Allowed tools:
• ISM-Formulary Part 1
• Non-programmable calculators are allowed.
Please check the list of allowed calculator models in the ISM-net. If you cannot find your calculator
on the list, please contact Prof. Dr. Jens Perret ([email protected]), the coordinator of the
module, at least one month before the exam.

All slides marked as excursus are not relevant for the exam.



Statistics 1
References:
Keller, G. (2014): Statistics for Management and Economics, Cengage Learning, Stamford, CT, 10. Edition.

Perret, J.K. (2022): Workbook Statistics, Available at the ISM Moodle Platform

German Language References:


Bamberg, G.; Baur F. und Krapp, M. (2012): Statistik, Oldenbourg, Munich, 17. Edition.

Fahrmeir L., R. Künstler, I. Pigeot und G. Tutz (2011): Statistik, Springer, 7. Edition.

Jeske, R. (2017): Kochbuch der Quantitativen Methoden 3: Statistik, Lulu, 2. Edition.

Bamberg, G.; Baur F. und Krapp, M. (2012): Statistik-Arbeitsbuch – Übungsaufgaben, Fallstudien,
Lösungen, Oldenbourg, Munich, 9. Edition.

Schwarz, J. (2013): Aufgabensammlung zur Statistik, NWB Verlag, Herne, 7. Edition.



Statistics 1
Statistics about Statistics
0.1% of all students understand Statistics just by reading the slides
1% of all students understand Statistics just by attending the lectures
98.9% need additional materials and exercises to understand it

If you suspect you are among the 98.9%, check out the additional online course materials:
For the lectures you can find accompanying explanatory and exercise videos on the ISM Moodle
platform at
moodle.ism.de
The materials made available there, as well as the materials linked in this slide set, are all
exam questions from previous semesters; the mock exams simulate the structure of current exams.



Statistics 1
Direct Links:
This script has direct links to the learning platform as well as to tutorials on how the contents covered
can be implemented in Microsoft Excel. Using the following buttons, which are distributed throughout the
script, you can directly access the thematically relevant content in the PDF version of the script.

Learning Management Platform Moodle (Theory)

Learning Management Platform Moodle (Exercise)

Learning Management Platform Moodle (Random Exercise)

Excel Tutorials



Statistics 1
Contents

01 Statistical Basics
1.1 Basic Terms
1.2 Scale levels
1.3 Types of data
1.4 Diagrams and histograms

02 Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
2.4 Measures of distribution
2.5 Boxplots



Statistics 1
Contents

03 Bivariate Statistics
3.1 Contingency tables
3.2 Scatterplots
3.3 Measures of association
3.3.1 Covariance and Pearson's Correlation Coefficient
3.3.2 Spearman's Rank Correlation Coefficient
3.3.3 χ²-Statistic and Contingency
3.3.4 Scale levels and Effect sizes
3.4 Simple linear regression



Statistics

01
Statistical Basics
1.1 Basic Terms
1.2 Scale levels
1.3 Types of data


Why do I need this?
Statistics, like mathematics, has its own language.
Understanding this lecture and the ones that follow requires knowing its terminology.
This terminology will be needed again in more advanced courses of the bachelor's and master's
programmes.
In particular, scale levels play a crucial role in deciding which statistical methods may be used at all
and which interpretations are possible.



Case Study
In a questionnaire the following questions are asked.
(To store and analyze the data, information on descriptiveness and scale level is required.)

How strongly do you agree that luxury retailers overcharge?
O Fully agree (-2)   O Rather agree (-1)   O Indifferent (0)   O Rather not agree (1)   O Not agree (2)
→ not descriptive, quasimetric

Which of the following brands do you know? (Multiple answers possible)
O BMW   O Louis Vuitton   O Haribo   O Lufthansa   O Deutsche Bahn   O ISM
→ descriptive, nominal

Please state your age.
____________
→ not descriptive, metric

Please select your gender.
O Male   O Female   O Diverse
→ not descriptive, nominal



1.1 Basic Terms
Statistical units
(also: elementary unit, unit of observation) Person or object whose characteristics are measured.

Population
Set of all feasible statistical units

Subpopulation
Part of the population

Sample
Subset of the population (usually significantly smaller), set of statistical units that is considered in an
analysis.

Characteristic
(also: variable) Property that is measured or studied

Characteristic value
Value that a characteristic takes



1.1 Basic Terms
Quantitative characteristics can be measured mathematically. They can be counted or measured.
(e.g. age, size, income, return,…)

Qualitative characteristics can be distinguished by their type
(e.g. sex, nationality, …)

Continuous characteristics can take any value (e.g. variables that can be measured as exactly as
necessary)

Discrete characteristics can only take a finite (or countably infinite) number of values. On an axis gaps
exist between distinct values (e.g. age in years)

Quasi-continuous characteristics are actually discrete but usually are considered to be continuous
(e.g. monetary variables like returns or income)

Descriptive characteristics can take more than one value per statistical unit (e.g. mobile numbers: one
person may have several)



1.1 Basic Terms
Exercise:
Which of the following characteristics are discrete? Which are continuous?
- Capacity of beer bottles
- Number of oranges in a box
- Income of officials in pay group A5
- Temperature in a fridge
- Number of white stags in the Harz
- Frequency of radio stations (in kHz)



1.1 Basic Terms
Solution:
Which of the following characteristics are discrete? Which are continuous?
- Capacity of beer bottles: continuous
- Number of oranges in a box: discrete
- Income of officials in pay group A5: discrete
- Temperature in a fridge: continuous
- Number of white stags in the Harz: discrete
- Frequency of radio stations (in kHz): continuous



1.1 Basic Terms
Overview Online Exercises

Exercises on Qualitative vs. Quantitative

Exercises on Continuity



Statistics

01
Statistical Basics
1.1 Basic Terms
1.2 Scale levels
1.3 Types of data


1.2 Scale levels
Nominally scaled characteristics can only be differentiated by their type. They cannot be ordered in
any consistent manner.
(e.g. sex, nationality,…)

Ordinally scaled characteristics can be ordered. However, the distance between two values cannot
sensibly be interpreted.
(e.g. categories of quality, marks, ratings,…)

Interval scaled characteristics can be ordered and the distances between two values can sensibly be
interpreted. However, no absolute zero exists.
(e.g. year of birth, temperature in degree celsius,...)

Ratio scaled characteristics can be ordered, distances can be interpreted and they have a natural
absolute zero.
(e. g. age, height, monetary values,…)

The interval and ratio scale are summarized under the term cardinal or metric scale.



1.2 Scale levels
Possible mathematical operations:

Nominal scale
Frequencies

Ordinal scale
Plus: Ordering, ranking

Interval scale
Plus: Addition and subtraction

Ratio scale
Plus: Multiplication and division
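
The cumulative structure of the four scale levels (each level permits everything the level below permits, plus something new) can be sketched in Python. The names `SCALE_ORDER`, `NEW_OPERATIONS` and `allowed_operations` are illustrative assumptions and not part of the course material.

```python
# Operations that become available at each scale level (taken from the slide).
# All identifiers here are hypothetical helper names for illustration only.
SCALE_ORDER = ["nominal", "ordinal", "interval", "ratio"]
NEW_OPERATIONS = {
    "nominal": ["frequencies"],
    "ordinal": ["ordering", "ranking"],
    "interval": ["addition", "subtraction"],
    "ratio": ["multiplication", "division"],
}


def allowed_operations(scale):
    """Return every operation permitted at `scale`, including inherited ones."""
    position = SCALE_ORDER.index(scale)
    operations = []
    for level in SCALE_ORDER[: position + 1]:
        operations.extend(NEW_OPERATIONS[level])
    return operations


print(allowed_operations("ordinal"))
print(allowed_operations("interval"))
```

For an interval-scaled characteristic such as temperature in degrees Celsius, the list contains addition and subtraction but not division, which is why statements like "twice as warm" are not meaningful on that scale.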



1.2 Scale levels
Exercise:
Determine for each of the following characteristics the scale level:
- Banking account number
- Water consumption in liters per capita
- Number of clients of a bank
- Quality of a camping area
- Passengers in a car
- Travelling speed of a plane
- Types of schools
- Temperature in degree celsius
- Access delay with a computer
- Score at iceskating
- Magnitude of an earthquake
- Grouping of students according to their nationality
- Lens power of glasses



1.2 Scale levels
Solution:
Determine for each of the following characteristics the scale level:
- Banking account number nominal
- Water consumption in liters per capita ratio
- Number of clients of a bank ratio
- Quality of a camping area ordinal
- Passengers in a car ratio
- Travelling speed of a plane ratio
- Types of schools nominal
- Temperature in degree celsius interval
- Access delay with a computer ratio
- Score at iceskating ordinal
- Magnitude of an earthquake ordinal
- Grouping of students according to their nationality nominal
- Lens power of glasses ordinal / interval



1.2 Scale levels
Overview Online Exercises

Exercises on Descriptiveness

Exercises on Scale Levels



Statistics

01
Statistical Basics
1.2 Scale levels
1.3 Types of data
1.4 Diagrams and histograms


1.3 Types of data
Cross-sectional data
With cross-sectional data, many statistical units are considered at one single point in time.
(e.g. most surveys, GDP of all countries in the world in 2017)

Time series data
A time series consists of data for one statistical unit at a number of points in time.
(e.g. income development of an employee, development of Germany's government debt ratio)

Panel data
In a panel, data are collected for a number of statistical units at different points in time. The important
aspect of panels is that the set of participants does not change over time.
(e.g. panel surveys, development of unemployment rates across all EU countries)



1.3 Types of data

[Figure: Types of cross-sectional data: original data (raw data), frequency table (unclassified), frequency table (classified)]



1.3 Types of data
Original data (also: raw data) is given as a series of observations:

$x_1, x_2, \dots, x_n$

The number n gives the sample size.

[Figure: Population and Sample, connected by Sampling and Estimation]



1.3 Types of data
For ten patients of an emergency room the age has been noted:
25, 21, 18, 37, 56, 89, 46, 23, 21, 34

The ordered data set for the patients‘ age looks as follows:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10)
18 21 21 23 25 34 37 46 56 89



1.3 Types of data
Grouped data is given as a frequency table in which each characteristic value $a_i$ is listed together
with its absolute frequency $n_i$:

Characteristic value   Absolute frequency (n_i)   Relative frequency (r_i)   Cumulated frequency (c_i)
a_1                    n_1                        r_1                        c_1
a_2                    n_2                        r_2                        c_2
⁞                      ⁞                          ⁞                          ⁞
a_k                    n_k                        r_k                        c_k
Sum:                   n = n_1 + ... + n_k        1                          -



1.3 Types of data
A random sample of 128 visitors of the Oktoberfest yielded the following frequency table regarding
the beer consumption during their visit:

Consumed liters of beer   Abs. frequency   Rel. frequency   Cumulated frequency
1                         2                0.0156           0.0156
2                         30               0.2344           0.2500
3                         37               0.2891           0.5391
4                         28               0.2188           0.7578
5                         23               0.1797           0.9375
6                         8                0.0625           1.0000
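
A minimal Python sketch of how the relative and cumulated frequencies follow from the absolute frequencies of the Oktoberfest sample. The variable names are illustrative assumptions; only the counts come from the table.

```python
# Absolute frequencies from the Oktoberfest table: litres of beer -> visitors.
absolute = {1: 2, 2: 30, 3: 37, 4: 28, 5: 23, 6: 8}

n = sum(absolute.values())  # sample size

rows = []
cumulated_count = 0
for value in sorted(absolute):
    cumulated_count += absolute[value]
    relative = absolute[value] / n    # r_i = n_i / n
    cumulated = cumulated_count / n   # c_i = (n_1 + ... + n_i) / n
    rows.append((value, absolute[value], relative, cumulated))

for value, n_i, r_i, c_i in rows:
    print(f"{value}  {n_i:3d}  {r_i:.4f}  {c_i:.4f}")
```

Cumulating the counts and dividing once, rather than adding up already rounded relative frequencies, avoids small rounding drift in the last decimal place.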



1.3 Types of data
Classified data is given as a frequency table in which each class $[u_{i-1}; u_i)$ is listed together with
its (absolute) frequency $n_i$:

Class            Frequency
[u_0; u_1)       n_1
[u_1; u_2)       n_2
⁞                ⁞
[u_{k-1}; u_k)   n_k
Sum:             n = n_1 + ... + n_k



1.3 Types of data
The distribution of German pharmacies' turnover (in million € excluding VAT) in 2012 is given as:
Class ri Class ri Class ri Class ri
[0; 0.75) 0.070 [1.75; 2) 0.112 [3; 3.25) 0.022 [4.25; 4.5) 0.005
[0.75; 1) 0.096 [2; 2.25) 0.087 [3.25; 3.5) 0.016 [4.5; 4.75) 0.003
[1; 1.25) 0.136 [2.25; 2.5) 0.051 [3.5; 3.75) 0.014 [4.75; 5) 0.004
[1.25;1.5) 0.151 [2.5;2.75) 0.049 [3.75; 4) 0.010 [5; 10] 0.014
[1.5; 1.75) 0.120 [2.75; 3) 0.033 [4; 4.25) 0.007

Source: ABDA – Bundesvereinigung Deutscher Apothekerverbände



1.3 Types of data
Exercise:
Student Witta Miene works at a restaurant's salad bar where the price is calculated per gram. This
results in very uneven tips, as most customers just round up. Twenty randomly selected tips (in €) are
summarized in the following table:

3.20 2.79 4.07 4.13 2.63
1.34 2.51 0.07 0.36 2.69
1.43 4.78 0.82 4.09 1.22
4.04 4.69 4.47 3.11 1.73

Classify the data into classes with a width of 1 € each.



1.3 Types of data
Solution:

Class    ni (abs. frequency)   ri (rel. frequency)   ci (cum. frequency)
[0; 1)   3                     0.15                  0.15
[1; 2)   4                     0.20                  0.35
[2; 3)   4                     0.20                  0.55
[3; 4)   2                     0.10                  0.65
[4; 5)   7                     0.35                  1.00
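
The classification of the twenty tips can be reproduced with a short Python sketch. It assumes classes of the form [lower; lower + 1), i.e. closed on the left; none of the tips lies on a class boundary, so the counts do not depend on that choice. Variable names are illustrative.

```python
import math

tips = [3.20, 2.79, 4.07, 4.13, 2.63,
        1.34, 2.51, 0.07, 0.36, 2.69,
        1.43, 4.78, 0.82, 4.09, 1.22,
        4.04, 4.69, 4.47, 3.11, 1.73]

width = 1.0  # class width in euros

# Count how many tips fall into each class [lower; lower + width).
counts = {}
for tip in tips:
    lower = math.floor(tip / width) * width
    counts[lower] = counts.get(lower, 0) + 1

# Build the frequency table: bounds, n_i, r_i, c_i.
n = len(tips)
cumulated_count = 0
table = []
for lower in sorted(counts):
    cumulated_count += counts[lower]
    table.append((lower, lower + width, counts[lower],
                  counts[lower] / n, cumulated_count / n))

for lower, upper, n_i, r_i, c_i in table:
    print(f"[{lower:.0f}; {upper:.0f})  {n_i}  {r_i:.2f}  {c_i:.2f}")
```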



Statistics

01
Statistical Basics
1.2 Scale levels
1.3 Types of data
1.4 Diagrams and histograms



1.4 Diagrams and histograms

A first illustration of the collected data is usually achieved via one of the following types of figures:
• Pie chart
• Block chart
• Bar chart
• Line chart
• Histogram



1.4 Diagrams and histograms

Pie charts:
According to the frequencies sectors are calculated and drawn:



1.4 Diagrams and histograms

Block charts:
According to the relative or absolute frequencies blocks are drawn:



1.4 Diagrams and histograms

Bar charts:
According to relative or absolute frequencies bars are drawn:



1.4 Diagrams and histograms

Line charts:
Based on the relative or absolute frequencies, successive points are connected with
continuous lines. Line charts are used almost exclusively to illustrate time series.


1.4 Diagrams and histograms

Histogram:
• Is used in particular with classified data.
• Attention! In contrast to block charts, it is the area of each block, not its height, that is
proportional to the underlying frequency. The area of each block is therefore:

$$\text{Area of the block} = \text{Width} \cdot \text{Height} = \Delta_i \cdot \frac{r_i}{\Delta_i} = r_i \quad \text{with } \Delta_i = u_i - u_{i-1}$$

• The height is also referred to as the frequency density $f_i$:

$$f_i = \frac{r_i}{\Delta_i}$$
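
To illustrate why the height must be a density when class widths differ, the following sketch computes $f_i$ for three classes of the pharmacy turnover table. The function name `frequency_density` is an illustrative assumption; the class bounds and relative frequencies are taken from that table.

```python
# (lower bound, upper bound, relative frequency) for three classes of the
# pharmacy turnover table; the classes have widths 0.75, 0.25 and 5.
classes = [(0.0, 0.75, 0.070), (0.75, 1.0, 0.096), (5.0, 10.0, 0.014)]


def frequency_density(lower, upper, relative_frequency):
    """Height of the histogram block: f_i = r_i / (u_i - u_{i-1})."""
    return relative_frequency / (upper - lower)


densities = [frequency_density(*c) for c in classes]

for (lower, upper, r_i), f_i in zip(classes, densities):
    area = f_i * (upper - lower)  # the block area reproduces r_i
    print(f"[{lower}; {upper})  r_i = {r_i:.3f}  f_i = {f_i:.4f}  area = {area:.3f}")
```

The narrow class [0.75; 1) ends up with a much taller block than the wide class [5; 10], even though their relative frequencies differ far less than their heights.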



Statistics

02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.3 Measures of dispersion



2.1 Scale levels and statistics
Descriptive Statistics: overview of measures by scale level
(Univariate: central tendency, dispersion, distribution. Bivariate: association, causality / modeling.)

Nominal
• Central tendency: Mode (most frequent value)
• Dispersion: -
• Distribution: -
• Association: χ²-Statistic (unlimited); Corrected Contingency Coefficient (interval [0; 1])
• Causality (Modeling): -

Ordinal
• Central tendency: Median (value in the middle); Quartiles
• Dispersion: Range (Max - Min); Interquartile range (50% in the middle)
• Distribution: -
• Association: Spearman's Rank Correlation (interval [-1; 1])
• Causality (Modeling): -

Metric
• Central tendency: Mean (average)
• Dispersion: Standard Deviation (root of the variance); Variance (squared deviation from the mean); Coefficient of Variation (unscaled deviation)
• Distribution: Skewness (small vs. large values); Kurtosis (heterogeneity of values)
• Association: Covariance (unlimited); Pearson's Correlation (interval [-1; 1])
• Causality (Modeling): Simple Linear Regression



Statistics

02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.1 Mode
2.2.2 Median
2.2.3 Quantiles

2.3 Measures of dispersion



Why do I need this?
A brief overview
It often makes sense to summarize large data sets for readers of statistical reports.
This requires statistics that capture the central tendency of a data set by condensing it into a
single value.
The choice of the most suitable measure of central tendency depends on the scale levels of the
variables.



Case Study
At the beginning of any data analysis, the sample used is usually described:
Excerpt from a study on luxury and e-commerce

Variables         Older generation   Younger generation
Age (Median)      50.4               22.8
Income (Median)   6,242.13€          1,674.02€
Participants      80                 71

Source: Perret, Horn, Holthaus (2022)


2.2 Measures of central tendency

Meaning behind measures of central tendency (in general):


• Describes the center of the data set (not if using quantiles or percentiles)
• Attention!
Only sensible if the underlying distribution is unimodal (one peak).
If the distribution has more than one peak a figure is better suited than trying to reduce the data
set to one single value!



2.2.1 Mode

Mode:
= most frequent characteristic value (if unique)
• Simple to calculate
• Can be used even with nominally scaled data
But
• Very primitive measure of central tendency which should not be the sole criterion if a higher level
scale is present!



2.2.1 Mode

Mode (Unclassified frequency table):

$$x_M = a_i \quad \text{with } r_i = \max_j r_j$$

The mode is thus the characteristic value that occurs most often, i.e. the value with the largest $n_i$
and the largest $r_i$. If this value is not unique, the mode is usually not reported.

Mode (Classified frequency table):

$$x_M = m_i \quad \text{with } \frac{r_i}{u_i - u_{i-1}} = \max_j \frac{r_j}{u_j - u_{j-1}}$$

The mode is thus the center (the middle value) $m_i$ of the class with the highest frequency density.
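
Both mode definitions can be sketched in Python. The function names are illustrative assumptions; the data are the Oktoberfest and tip frequency tables used in this script. Following the remark above, the sketch returns `None` when the maximum is not unique.

```python
def mode_unclassified(frequencies):
    """frequencies: dict mapping value a_i -> absolute frequency n_i."""
    highest = max(frequencies.values())
    candidates = [value for value, count in frequencies.items() if count == highest]
    return candidates[0] if len(candidates) == 1 else None


def mode_classified(classes):
    """classes: list of (lower bound, upper bound, absolute frequency).

    Returns the midpoint of the class with the highest frequency density.
    """
    densities = [freq / (upper - lower) for lower, upper, freq in classes]
    highest = max(densities)
    best = [c for c, d in zip(classes, densities) if d == highest]
    if len(best) != 1:
        return None
    lower, upper, _ = best[0]
    return (lower + upper) / 2


oktoberfest = {1: 2, 2: 30, 3: 37, 4: 28, 5: 23, 6: 8}
tips = [(0, 1, 3), (1, 2, 4), (2, 3, 4), (3, 4, 2), (4, 5, 7)]

print(mode_unclassified(oktoberfest))  # 3
print(mode_classified(tips))           # 4.5
```

Dividing by the sample size would not change which class has the largest density, so absolute frequencies are used directly.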



2.2.1 Mode

Exercise:
Calculate the mode for the...
Patient dataset Oktoberfest dataset Tip dataset
i xi ai ni Class ni
1 25 1 2 [0; 1) 3
2 21 2 30 [1; 2) 4
3 18 3 37 [2; 3) 4
4 37 4 28 [3; 4) 2
5 56 5 23 [4; 5) 7
6 89 6 8
7 46
8 23
9 21
10 34



2.2.1 Mode

Solution:

Patient dataset Oktoberfest dataset Tip dataset


i xi ai ni Class ni
1 25 1 2 [0; 1) 3
2 21 2 30 [1; 2) 4
3 18 3 37 [2; 3) 4
4 37 4 28 [3; 4) 2
5 56 5 23 [4; 5) 7
6 89 6 8
7 46
8 23
9 21
10 34
xM = 21 xM = 3 xM = 4.5
Statistics

02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.1 Mode
2.2.2 Median
2.2.3 Quantiles

2.3 Measures of dispersion



2.2.2 Median

Median:
Idea:
Calculate the middle (not the mean) of a data set, i.e. the median divides the data set in two parts that
each contain 50% of the data points.
• In many cases much harder to calculate than the arithmetic mean
• Insusceptible to outliers
• Can be used with only ordinally scaled data

But
• If data is metrically scaled additional information might be lost!



2.2.2 Median

Median (Raw data):


n odd:
$$x_{0.5} = x_{((n+1)/2)}$$
n even:
$$x_{0.5} = 0.5\,\big(x_{(n/2)} + x_{(n/2+1)}\big)$$

Median (Grouped data):

$x_{0.5} = a_i$, where $a_i$ is the first value whose cumulated frequency $c_i$ is greater than or equal to 0.5



2.2.2 Median

Median (Classified frequency table):


If i is the first class for which the cumulated frequency $c_i$ is larger than or equal to 0.5:

$$c_i = \sum_{j=1}^{i} r_j \ge 0.5$$

then the median is given as:

$$x_{0.5} = u_{i-1} + \frac{0.5 - \sum_{j=1}^{i-1} r_j}{r_i}\,(u_i - u_{i-1})$$

Here $u_{i-1}$ is the lower bound of class i and $u_i$ is the upper bound of class i.

$\sum_{j=1}^{i-1} r_j$ is the cumulated frequency of class i-1 and $r_i$ is the relative frequency of class i.



2.2.2 Median

Exercise:
Calculate the median for the...
Patient dataset Oktoberfest dataset Tip dataset
i xi ai ni Class ni
1 25 1 2 [0; 1) 3
2 21 2 30 [1; 2) 4
3 18 3 37 [2; 3) 4
4 37 4 28 [3; 4) 2
5 56 5 23 [4; 5) 7
6 89 6 8
7 46
8 23
9 21
10 34



2.2.2 Median

Solution:

Patient dataset Oktoberfest dataset Tip dataset


i xi x(i) ai ni ri Class ni ri
1 25 18 1 2 0.02 [0; 1) 3 0.15
2 21 21 2 30 0.23 [1; 2) 4 0.2
3 18 21 3 37 0.29 [2; 3) 4 0.2
4 37 23 4 28 0.22 [3; 4) 2 0.1
5 56 25 5 23 0.18 [4; 5) 7 0.35
6 89 34 6 8 0.06
7 46 37
8 23 46
9 21 56
10 34 89
x0.5 = 29.5 x0.5 = 3 x0.5 = 2.75
Statistics

02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.2 Median
2.2.3 Quantiles
2.2.4 Arithmetic mean

2.3 Measures of dispersion



2.2.3 Quantiles

The idea of the median can be generalized:


• The median divides the data set into two parts that each contain 50% of the data.
• Now: Find the value that divides the data set in any two parts, e.g. 90% and 10%.
The p-quantile is named after the share of the data that lies below it; this share is its index:
• $x_p$ is the value that a share p of the data does not exceed and a share (1 − p) of the data does
not fall below.

Special quantiles:
• Quartiles: the data set is divided into quarters.
$x_{0.25}$ is the first / lower quartile and $x_{0.75}$ is the third / upper quartile. $x_{0.5}$, the second /
"middle" quartile, is the median.
• Percentiles: quantiles that divide the data at specific percentages, e.g. the
5% percentile $x_{0.05}$.



2.2.3 Quantiles

Quantile (Raw data):


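$$x_p = \begin{cases} x_{(\lceil np \rceil)} & \text{if } np \text{ is not an integer} \\ 0.5\,\big(x_{(np)} + x_{(np+1)}\big) & \text{if } np \text{ is an integer} \end{cases}$$

where $\lceil x \rceil$ denotes the smallest integer larger than or equal to x.

Quantile (Grouped data):

$$x_p = \begin{cases} a_i & \text{if } \sum_{j=1}^{i-1} r_j < p \text{ and } \sum_{j=1}^{i} r_j > p \\ 0.5\,(a_i + a_{i+1}) & \text{if } \sum_{j=1}^{i-1} r_j < p \text{ and } \sum_{j=1}^{i} r_j = p \end{cases}$$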
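
The two case distinctions above can be sketched in Python as follows. The function names are illustrative assumptions, the sketch is restricted to 0 < p < 1, and a small tolerance is used to decide whether np (or a cumulated frequency) hits its target exactly, because floating-point products such as n·p are not always exact.

```python
import math


def quantile_raw(data, p):
    """p-quantile of raw data (0 < p < 1), following the raw-data formula."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ordered = sorted(data)
    position = len(ordered) * p
    if abs(position - round(position)) < 1e-9:      # np is an integer
        k = round(position)
        # x_(np) and x_(np+1) in 1-based notation are indices k-1 and k here.
        return 0.5 * (ordered[k - 1] + ordered[k])
    return ordered[math.ceil(position) - 1]          # x_(ceil(np))


def quantile_grouped(frequencies, p):
    """p-quantile of grouped data; frequencies maps a_i -> n_i (0 < p < 1)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    values = sorted(frequencies)
    n = sum(frequencies.values())
    count = 0
    for index, value in enumerate(values):
        count += frequencies[value]
        if abs(count / n - p) < 1e-9:                # cumulated frequency equals p
            return 0.5 * (value + values[index + 1])
        if count / n > p:                            # first value that exceeds p
            return value


patients = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
oktoberfest = {1: 2, 2: 30, 3: 37, 4: 28, 5: 23, 6: 8}

print(quantile_raw(patients, 0.5), quantile_raw(patients, 0.2))
print(quantile_grouped(oktoberfest, 0.5), quantile_grouped(oktoberfest, 0.2))
```

Note that statistical software often uses other interpolation conventions for quantiles, so results from a library function may differ slightly from this definition.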



2.2.3 Quantiles

Quantile (Classified data):


If i is the first class for which the cumulated frequency $c_i$ is larger than or equal to p:

$$c_i = \sum_{j=1}^{i} r_j \ge p$$

then the p-quantile is given as:

$$x_p = u_{i-1} + \frac{p - \sum_{j=1}^{i-1} r_j}{r_i}\,(u_i - u_{i-1})$$

Here $u_{i-1}$ is the lower bound of class i and $u_i$ is the upper bound of class i.

$\sum_{j=1}^{i-1} r_j$ is the cumulated frequency of class i-1 and $r_i$ is the relative frequency of class i.
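
The interpolation for classified data can be sketched in Python. The function name `quantile_classified` and the tolerance are illustrative assumptions; the data are the classified tips. With p = 0.5 the same function gives the median.

```python
def quantile_classified(classes, p):
    """p-quantile for classified data by linear interpolation within a class.

    classes: list of (lower bound, upper bound, absolute frequency),
    sorted by lower bound; 0 < p <= 1 is assumed.
    """
    n = sum(freq for _, _, freq in classes)
    below = 0.0  # cumulated relative frequency of all earlier classes
    for lower, upper, freq in classes:
        r_i = freq / n
        if below + r_i >= p - 1e-12:  # first class whose c_i reaches p
            return lower + (p - below) / r_i * (upper - lower)
        below += r_i


tips = [(0, 1, 3), (1, 2, 4), (2, 3, 4), (3, 4, 2), (4, 5, 7)]

print(round(quantile_classified(tips, 0.5), 4))  # median
print(round(quantile_classified(tips, 0.2), 4))  # 20% quantile
```

The interpolation assumes that the observations are spread evenly within each class, which is the usual working assumption once the raw data are no longer available.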



2.2.3 Quantiles

Exercise:
Calculate the 20% quantile of the...
Patient dataset Oktoberfest dataset Tip dataset
i xi ai ni Class ni
1 25 1 2 [0; 1) 3
2 21 2 30 [1; 2) 4
3 18 3 37 [2; 3) 4
4 37 4 28 [3; 4) 2
5 56 5 23 [4; 5) 7
6 89 6 8
7 46
8 23
9 21
10 34



2.2.3 Quantiles

Solution:

Patient dataset Oktoberfest dataset Tip dataset


i xi x(i) ai ni ri Class ni ri
1 25 18 1 2 0.02 [0; 1) 3 0.15
2 21 21 2 30 0.23 [1; 2) 4 0.2
3 18 21 3 37 0.29 [2; 3) 4 0.2
4 37 23 4 28 0.22 [3; 4) 2 0.1
5 56 25 5 23 0.18 [4; 5) 7 0.35
6 89 34 6 8 0.06
7 46 37
8 23 46
9 21 56
10 34 89
x0.2 = 21 x0.2 = 2 x0.2 = 1.25
Statistics

02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.3 Quantiles
2.2.4 Arithmetic mean
2.2.5 Weighted arithmetic mean

2.3 Measures of dispersion



2.2.4 Arithmetic Mean

Arithmetic mean
Idea: calculate average value
• Simple to calculate and well known

• BUT very susceptible to outliers and requires a metric scaled variable



2.2.4 Arithmetic Mean

Arithmetic Mean (Raw data):


$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Arithmetic Mean (Frequency table without classification):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i a_i = \sum_{i=1}^{k} \frac{n_i}{n}\, a_i = \sum_{i=1}^{k} r_i a_i$$

Arithmetic Mean (Frequency table with classified data):

$$\bar{x} = \frac{n_1 m_1 + n_2 m_2 + \dots + n_k m_k}{n} = \frac{1}{n}\sum_{i=1}^{k} n_i m_i = \sum_{i=1}^{k} \frac{n_i}{n}\, m_i$$

where $m_i$ is the middle of class i.
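
A short Python sketch of the three variants, applied to the patient, Oktoberfest and tip data used in this script. Variable names are illustrative assumptions.

```python
patients = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]                 # raw data
oktoberfest = {1: 2, 2: 30, 3: 37, 4: 28, 5: 23, 6: 8}              # a_i -> n_i
tips = [(0, 1, 3), (1, 2, 4), (2, 3, 4), (3, 4, 2), (4, 5, 7)]      # (lower, upper, n_i)

# Raw data: sum of the observations divided by their number.
mean_raw = sum(patients) / len(patients)

# Unclassified frequency table: each value weighted by its frequency.
n_oktoberfest = sum(oktoberfest.values())
mean_grouped = sum(a_i * n_i for a_i, n_i in oktoberfest.items()) / n_oktoberfest

# Classified data: class midpoints m_i stand in for the unknown observations.
n_tips = sum(n_i for _, _, n_i in tips)
mean_classified = sum((lower + upper) / 2 * n_i for lower, upper, n_i in tips) / n_tips

print(mean_raw, mean_grouped, mean_classified)
```

The classified mean is only an approximation of the mean of the underlying raw data, since every observation is replaced by the midpoint of its class.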



2.2.4 Arithmetic Mean

Exercise:
Calculate the arithmetic mean for the...
Patient dataset Oktoberfest dataset Tip dataset
i xi ai ni Class ni
1 25 1 2 [0; 1) 3
2 21 2 30 [1; 2) 4
3 18 3 37 [2; 3) 4
4 37 4 28 [3; 4) 2
5 56 5 23 [4; 5) 7
6 89 6 8
7 46
8 23
9 21
10 34



2.2.4 Arithmetic Mean

Solution:
Patient dataset
i     1    2    3    4    5    6    7    8    9    10
xi    25   21   18   37   56   89   46   23   21   34
Sum xi = 370, $\bar{x} = 370 / 10 = 37$

Oktoberfest dataset
ai    1    2    3    4    5    6
ni    2    30   37   28   23   8
Sum ni = 128, Sum ai·ni = 448, $\bar{x} = 448 / 128 = 3.5$

Tip dataset
Class   [0; 1)   [1; 2)   [2; 3)   [3; 4)   [4; 5)
mi      0.5      1.5      2.5      3.5      4.5
ni      3        4        4        2        7
Sum ni = 20, Sum mi·ni = 56, $\bar{x} = 56 / 20 = 2.8$
2.2 Measures of Central Tendency
Overview Online Exercises

Exercise 2.1 (unclassified) Exercise 2.6 (unclassified)

Exercise 2.2 (unclassified) Exercise 5.1 (classified)

Exercise 2.3 (unclassified) Exercise 5.2 (classified)

Exercise 2.4 (unclassified) Exercise 5.3 (classified)

Exercise 2.5 (unclassified) Exercise 5.4 (classified)

Random Exercise (unclassified) Random Exercise (classified)



Statistics

02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.4 Arithmetic mean
2.2.5 Weighted arithmetic mean
2.2.6 Geometric mean
2.3 Measures of dispersion



2.2.5 Weighted arithmetic mean

Weighted arithmetic mean:


Up to this point: the same weight (1/n) is used for all observations.
Idea: purposely assign different weights to the observations
(with n weights $w_i$ for the n observations $x_i$):

$$\bar{x}_w = \sum_{i=1}^{n} w_i x_i \quad \text{with } w_i \ge 0 \text{ and } \sum_{i=1}^{n} w_i = 1$$

Alternative formula (for weights that do not sum to 1):

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$



2.2.5 Weighted arithmetic mean

Additional important applications for the weighted arithmetic mean:


Averaging of ratios (quotients), if the distribution of the denominator is known.

Example:
During a race a Chinese cyclist rides at 42 km/h for 2.5 hours and afterwards at 31 km/h for 3
hours.
What is his average speed during the race?



2.2.5 Weighted arithmetic mean

Solution:
It holds that
• In the first 2.5 hours he covers 105 km.
• In the second part he covers 93 km.

⇒ In total he rides for 5.5 hours and covers a distance of 198 km.

His average speed thus amounts to 198 km / 5.5 h = 36 km/h.

Solution 2: (using the weighted arithmetic mean)

$$\bar{x}_w = \frac{2.5}{3 + 2.5} \cdot 42 + \frac{3}{3 + 2.5} \cdot 31 = 36$$
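
The second solution corresponds to the alternative formula for the weighted arithmetic mean, with the riding times as weights. A minimal Python sketch, in which the function name `weighted_mean` is an illustrative assumption:

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean; weights are normalised by their sum."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


speeds = [42, 31]   # km/h
hours = [2.5, 3]    # time ridden at each speed, used as weights

print(weighted_mean(speeds, hours))  # 36.0
```

Speed is distance per time, so the denominator (time) is the known quantity here and serves as the weight; a simple average of 42 and 31 would give 36.5 and overstate the result.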



2.2.5 Weighted arithmetic mean
Exercise:
Six pairs participate in the finals of a dancing contest. Their performance is evaluated by five judges
R1,..., R5. Each judge can give every mark from 1 to 6 only once, where 1 denotes the best
and 6 the worst performance:
Pair R1 R2 R3 R4 R5
1 1 2 1 1 6
2 5 4 5 5 5
3 4 5 3 4 4
4 2 1 2 3 2
5 6 6 6 6 1
6 3 3 4 2 3
For the final score averages should be calculated.
Construct a ranking of the pairs by calculating these means:
a) The arithmetic mean
b) The weighted mean where the best and the worst mark are dropped and the remaining marks are each
weighted with an equal weight of 1/3.



2.2.5 Weighted arithmetic mean
Solution:
Pair   R1   R2   R3   R4   R5   $\bar{x}$   $\bar{x}_w$
1      1    2    1    1    6    2.2         1.33
2      5    4    5    5    5    4.8         5
3      4    5    3    4    4    4           4
4      2    1    2    3    2    2           2
5      6    6    6    6    1    5           6
6      3    3    4    2    3    3           3
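
The two averages and the resulting rankings can be checked with a short Python sketch. The function name `trimmed_mean` is an illustrative assumption; with five marks, dropping the best and the worst leaves three marks with a weight of 1/3 each.

```python
# Marks of the five judges R1..R5 for each pair (1 = best, 6 = worst).
marks = {
    1: [1, 2, 1, 1, 6],
    2: [5, 4, 5, 5, 5],
    3: [4, 5, 3, 4, 4],
    4: [2, 1, 2, 3, 2],
    5: [6, 6, 6, 6, 1],
    6: [3, 3, 4, 2, 3],
}


def arithmetic_mean(values):
    return sum(values) / len(values)


def trimmed_mean(values):
    """Drop one best and one worst mark, then average the rest equally."""
    kept = sorted(values)[1:-1]
    return sum(kept) / len(kept)


# Lower averages are better, so sorting in ascending order yields the ranking.
ranking_mean = sorted(marks, key=lambda pair: arithmetic_mean(marks[pair]))
ranking_trimmed = sorted(marks, key=lambda pair: trimmed_mean(marks[pair]))

print(ranking_mean)     # [4, 1, 6, 3, 2, 5]
print(ranking_trimmed)  # [1, 4, 6, 3, 2, 5]
```

Judge R5's outlying marks change the winner: with the plain mean pair 4 is ranked first, while dropping the extreme marks puts pair 1 ahead.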



Statistics

02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.5 Weighted arithmetic mean
2.2.6 Geometric mean
2.2.7 Harmonic mean

2.3 Measures of dispersion



2.2.6 Geometric mean

Geometric mean (Raw data):


$$\bar{x}_g = \sqrt[n]{x_1 \cdot \ldots \cdot x_n}$$

Geometric mean (Grouped data):

$$\bar{x}_g = \sqrt[n]{a_1^{n_1} \cdot \ldots \cdot a_k^{n_k}}$$

It is uncommon to use the geometric mean for classified data as too much information is lost through
the classification process.

Use of the geometric mean: Averaging growth rates

Exercise:
The value of a stock increases by 30% in the first year and decreases by 20% in the second year.
What is the average annual growth rate?



2.2.6 Geometric mean

Solution:
(by using growth factors)
Starting with the initial capital $K_0$, the end capital after two years can be calculated via:
$$K_2 = K_0 (1 + 0.3)(1 - 0.2) = K_0 \cdot 1.3 \cdot 0.8$$
If the capital grew evenly at a constant rate i, it should hold that:
$$K_2 = K_0 (1 + i)^2$$
Therefore i can be determined via the geometric mean as follows:
$$i = \sqrt{1.3 \cdot 0.8} - 1 = 0.0198 = 1.98\%$$
Caution! The geometric mean needs to be applied to the growth factors (1 + i), not to the growth
rates i!
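
The calculation can be sketched in Python. The function name `geometric_mean` is an illustrative assumption; `math.prod` requires Python 3.8 or later.

```python
import math


def geometric_mean(factors):
    """n-th root of the product of n positive growth factors."""
    return math.prod(factors) ** (1 / len(factors))


growth_factors = [1 + 0.30, 1 - 0.20]   # +30% in year 1, -20% in year 2

average_rate = geometric_mean(growth_factors) - 1
print(f"{average_rate:.4f}")             # 0.0198

# Check: two years at the average rate reproduce the actual end value.
print(f"{(1 + average_rate) ** 2:.2f}")  # 1.04
```

Averaging the rates arithmetically would give (30% − 20%) / 2 = 5% per year, which would imply an end value of about 1.1025 instead of the actual 1.04.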



Statistics

02
Univariate Statistics
2.1 Scale levels and statistics
2.2 Measures of central tendency
2.2.5 Weighted arithmetic mean
2.2.6 Geometric mean
2.2.7 Harmonic mean

2.3 Measures of dispersion



2.2.7 Harmonic mean

Question: You drive a given route in one direction with a speed of 100km/h and on the way back you
drive 200km/h. What has been your average speed?

At first glance the harmonic mean might seem strange…


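Harmonic mean (Raw data):

$$\bar{x}_h = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}} = \left(\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}\right)^{-1}$$

Harmonic mean (Grouped data):

$$\bar{x}_h = \frac{1}{\frac{1}{n}\sum_{i=1}^{k} \frac{n_i}{a_i}} = \left(\frac{1}{n}\sum_{i=1}^{k} \frac{n_i}{a_i}\right)^{-1}$$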



2.2.7 Harmonic mean

Use:
Averaging of ratios (quotients) if the numerator distribution is known

Example:
The cyclist Anna Bolika rides 90 km at a constant speed of 36 km/h. Afterwards she rides
40 km at a constant speed of 32 km/h.
What was her average speed over the whole distance?



2.2.7 Harmonic mean

Solution:
• For the first 90 km she needs 90 / 36 = 2.5 hours.
• The other 40 km take her 40 / 32 = 1.25 hours.

⇒ She cycles for 3.75 hours in total and covers 130 km.
The average speed is thus 130 km / 3.75 h = $34.\overline{6}$ km/h.
Solution 2: (using the weighted harmonic mean)

$$\bar{x}_h = \left(\frac{90}{90 + 40} \cdot \frac{1}{36} + \frac{40}{90 + 40} \cdot \frac{1}{32}\right)^{-1} \approx 34.67$$
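
A minimal Python sketch of the weighted harmonic mean with the distances as weights. The function name `weighted_harmonic_mean` is an illustrative assumption.

```python
def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: total weight divided by the sum of w_i / x_i."""
    return sum(weights) / sum(w / x for w, x in zip(weights, values))


# Cyclist example: speeds in km/h, weighted by the distances in km.
print(round(weighted_harmonic_mean([36, 32], [90, 40]), 2))    # 34.67

# Round trip at 100 km/h and 200 km/h over the same distance (equal weights).
print(round(weighted_harmonic_mean([100, 200], [1, 1]), 2))    # 133.33
```

Here the numerator of the ratio (distance) is known, so each term w / x is the time spent on a segment; the round-trip result shows why the intuitive answer of 150 km/h is too high.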



Statistics

02
Univariate Statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
2.3.1 Range
2.3.2 Interquartile range

2.3.3 Variance and Standard deviation


2.4 Measures of distribution


Why do I need this?
To make the scope of the data graspable
While measures of central tendency summarize the data set into a single value, this value does not tell
us anything about the fluctuation of the data points around this measure.
While measures of dispersion measure the scope or the severity of deviations, measures of
distribution illustrate the type of deviation.
The combination of measures of central tendency and dispersion gives the basis for a description of
the variables used in an analysis. Distributions like the normal distribution (as the single most
important distribution) are primarily defined via the mean and the standard deviation.
Like with measures of central tendency, the choice of measure depends on the scale levels of the
relevant variables.



Case Study
Every analysis starts with a description of the sample used:
Excerpt from a study on social media influencers

Variable                  Follower      Followed  Posts     Age    Growth Rate  Posting Frequency  Engagement Rate
Minimum                   50,247        4         72        19     -0.81        0.03               0.16%
Maximum                   68,689,593    6,558     17,236    43     2.92         6.62               16.81%
Mean                      1,766,737.22  991.58    3,724.81  29.55  0.12         1.21               2.04%
Coefficient of Variation  2.80          0.83      0.74      0.14   2.26         0.70               0.98
Participants              255           255       255       255    255          255                255

Source: Perret, Edler (2022)

2.3.1 Range

Range (Raw data): R = x(n) – x(1)

Range (Grouped data): R = ak – a1

Range (Classified data): uncommon

The range is a measure of dispersion, but a rather unsuitable one: it depends only on the two most extreme
values and is therefore very susceptible to outliers. It mainly serves as orientation when choosing the
classes for classified data.

Examples:
Patients dataset: R = 89 – 18 = 71
Oktoberfest dataset: R = 6 – 1 = 5


Statistics

02
Univariate Statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
2.3.1 Range
2.3.2 Interquartile range

2.3.3 Variance and Standard deviation


2.4 Measures of distribution


2.3.2 Interquartile range

Formula:
IQR = x0.75 – x0.25

Note:
• Robust against outliers
• Linked to the quartile deviation (half the interquartile range)

But:
• Cannot be directly linked to the standard deviation

Examples:
Patient dataset: IQR = 46 – 21 = 25
Oktoberfest dataset: IQR = 4 – 2.5 = 1.5
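
The range and interquartile range of both datasets can be recomputed as follows. This sketch assumes a common textbook quantile rule (average the two neighbouring values if n∙p is an integer, otherwise round the position up); the slides do not restate the rule here, but it reproduces the values above. The Oktoberfest data is expanded from the frequency table of the example in section 2.3.3.

```python
import math


def quantile(data, p):
    """p-quantile for 0 < p < 1: average two neighbours if n*p is an integer, else round up."""
    xs = sorted(data)
    pos = len(xs) * p
    if pos == int(pos):
        k = int(pos)
        return (xs[k - 1] + xs[k]) / 2
    return xs[math.ceil(pos) - 1]


def value_range(data):
    return max(data) - min(data)


def iqr(data):
    return quantile(data, 0.75) - quantile(data, 0.25)


patients = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
# (value, absolute frequency) pairs expanded into raw data
oktoberfest = [a for a, n in [(1, 2), (2, 30), (3, 37), (4, 28), (5, 23), (6, 8)] for _ in range(n)]

print(value_range(patients), iqr(patients))        # 71 25
print(value_range(oktoberfest), iqr(oktoberfest))  # 5 1.5
```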



Statistics

02
Univariate Statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
2.3.2 Interquartile range
2.3.3 Variance and Standard deviation

2.3.4 Coefficient of variation

2.4 Measures of distribution


2.3.3 Variance and Standard deviation

How to measure dispersion?


First idea:
\[ d_x = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]

= average absolute deviation from the mean

Problem:
• Oftentimes dispersion needs to be minimized (see the derivation of the regression model)
• Minimization usually requires differentiation
• But: The absolute value function is not differentiable at 0

Solution:
Use squared deviations!
2.3.3 Variance and Standard deviation

The difference between theory and sample:

            Theory              Census                           Sample
Mean        Expected value μ    Arith. mean X̄                    Arith. mean x̄
Variance    Variance σ²         Uncorrected sample variance S²   Corrected sample variance s²

Example: (6-sided die)

Theory: μ = 3.5 and σ² = 2.9167
Census: Not possible, as the die can be thrown an infinite number of times.
Sample: Throw it three times: 2 4 3
        x̄ = 3 and s² = 1
        Throw it three times: 6 1 2
        x̄ = 3 and s² = 7
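
The die example can be verified with Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n − 1. A minimal sketch:

```python
import statistics

faces = [1, 2, 3, 4, 5, 6]

mu = statistics.mean(faces)            # expected value of a fair die
sigma2 = statistics.pvariance(faces)   # theoretical variance, divides by n

s2_a = statistics.variance([2, 4, 3])  # corrected sample variance, divides by n - 1
s2_b = statistics.variance([6, 1, 2])

print(mu, round(sigma2, 4))            # 3.5 2.9167
print(f"{s2_a:.2f} {s2_b:.2f}")        # 1.00 7.00
```

Both samples have the same mean of 3, yet their variances differ considerably, which illustrates how strongly small samples can fluctuate around the theoretical value.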
2.3.3 Variance and Standard deviation

For the theoretical variance we thus get, depending on the type of data:
(The expected value μ is known.)

Variance (Raw data)
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \]
Variance (Grouped data)
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{k} n_i (a_i - \mu)^2 \]
Variance (Classified data)
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{k} n_i (m_i - \mu)^2 \]


2.3.3 Variance and Standard deviation

For the sample variance of a census (whole population) the uncorrected variance S² is used:
(The expected value μ is approximated by the arithmetic mean.)

Sample Variance (Raw data)
\[ S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Sample Variance (Grouped data)
\[ S^2 = \frac{1}{n} \sum_{i=1}^{k} n_i (a_i - \bar{x})^2 \]
Sample Variance (Classified data)
\[ S^2 = \frac{1}{n} \sum_{i=1}^{k} n_i (m_i - \bar{x})^2 \]


2.3.3 Variance and Standard deviation

For the sample variance of a partial sample the following formulae hold (corrected variance):
(The expected value μ is approximated by the arithmetic mean.)

Sample Variance (Raw data)
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Sample Variance (Grouped data)
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{k} n_i (a_i - \bar{x})^2 \]
Sample Variance (Classified data)
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{k} n_i (m_i - \bar{x})^2 \]


2.3.3 Variance and Standard deviation

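For σ² and s² the so-called displacement law holds, which makes calculating the variance much easier:

Theoretical Variance and corrected Sample Variance (Raw data)
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \mu^2 = \overline{x^2} - \mu^2 \qquad s^2 = \frac{n}{n-1} \left( \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2 \right) = \frac{n}{n-1} \left( \overline{x^2} - \bar{x}^2 \right) \]
Theoretical Variance and corrected Sample Variance (Grouped data)
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{k} n_i a_i^2 - \mu^2 \qquad s^2 = \frac{n}{n-1} \left( \frac{1}{n} \sum_{i=1}^{k} n_i a_i^2 - \bar{x}^2 \right) \]
Theoretical Variance and corrected Sample Variance (Classified data)
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{k} n_i m_i^2 - \mu^2 \qquad s^2 = \frac{n}{n-1} \left( \frac{1}{n} \sum_{i=1}^{k} n_i m_i^2 - \bar{x}^2 \right) \]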


2.3.3 Variance and Standard deviation

Note:
• Very important in the context of inductive statistics
But:
• Susceptible to outliers
• Measured in squared units of the original data and thus hard to interpret.
Solution: Take the square root → Standard deviation
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2} \]
\[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]


2.3.3 Variance and Standard deviation

Note:
• Advantage as compared to the variance: same unit as the original data set
• Most common measure of dispersion
But:
• Still susceptible to outliers



2.3.3 Variance and Standard deviation

Example Patient dataset:

i      xi    xi²
1      25    625
2      21    441
3      18    324
4      37    1,369
5      56    3,136
6      89    7,921
7      46    2,116
8      23    529
9      21    441
10     34    1,156
Sum:   370   18,058

x̄ = 370/10 = 37
s² = 10/9 ∙ (0.1 ∙ 18,058 – 37²) = 485.33
s = 22.03
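
As a cross-check, the corrected sample variance of the patient dataset can be computed both from the definition and via the displacement law. A minimal Python sketch (variable names are illustrative):

```python
patients = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
n = len(patients)
mean = sum(patients) / n

# Definition: squared deviations from the mean, divided by n - 1
s2_definition = sum((x - mean) ** 2 for x in patients) / (n - 1)

# Displacement law: n/(n-1) * (mean of the squares - squared mean)
mean_of_squares = sum(x * x for x in patients) / n
s2_shortcut = n / (n - 1) * (mean_of_squares - mean ** 2)

print(round(s2_definition, 2), round(s2_shortcut, 2))  # 485.33 485.33
print(round(s2_definition ** 0.5, 2))                  # 22.03
```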



2.3.3 Variance and Standard deviation

Example Oktoberfest dataset:

ai     ni    ni∙ai²
1      2     2
2      30    120
3      37    333
4      28    448
5      23    575
6      8     288
Sum:   128   1,766

x̄ = 3.5
s² = 128/127 ∙ (1,766/128 – 3.5²) = 1.5591
s = 1.2486


2.3.3 Variance and Standard deviation

Example Tip dataset:

Class    mi    ni   ni∙mi²
[0; 1)   0.5   3    0.75
[1; 2)   1.5   4    9.00
[2; 3)   2.5   4    25.00
[3; 4)   3.5   2    24.50
[4; 5)   4.5   7    141.75
Sum:     x     20   201.00

x̄ = 2.8
s² = 20/19 ∙ (201/20 – 2.8²) = 2.3263
s = 1.5252
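
The grouped and classified examples follow the same pattern, so one small function covers both. This is a sketch; `grouped_mean_and_variance` is a hypothetical helper that applies the displacement law to (value or class midpoint, absolute frequency) pairs.

```python
def grouped_mean_and_variance(values, counts):
    """Mean and corrected variance from values (or class midpoints) and absolute frequencies."""
    n = sum(counts)
    mean = sum(c * v for v, c in zip(values, counts)) / n
    mean_of_squares = sum(c * v * v for v, c in zip(values, counts)) / n
    return mean, n / (n - 1) * (mean_of_squares - mean ** 2)


okt_mean, okt_s2 = grouped_mean_and_variance([1, 2, 3, 4, 5, 6], [2, 30, 37, 28, 23, 8])
tip_mean, tip_s2 = grouped_mean_and_variance([0.5, 1.5, 2.5, 3.5, 4.5], [3, 4, 4, 2, 7])

print(okt_mean, round(okt_s2, 4))            # 3.5 1.5591
print(round(tip_mean, 2), round(tip_s2, 4))  # 2.8 2.3263
```

For classified data the result is only an approximation, since every observation is replaced by its class midpoint.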



Statistics

02
Univariate Statistics
2.2 Measures of central tendency
2.3 Measures of dispersion
2.3.2 Interquartile range
2.3.3 Variance and Standard deviation

2.3.4 Coefficient of variation


2.4 Measures of distribution


2.3.4 Coefficient of variation

Up to this point: measures of absolute dispersion


Consider the Oktoberfest data set again:
Does the dispersion differ if we measure in milliliters instead of liters?
Yes and no…
• The absolute dispersion depends on the unit of measurement: in milliliters the standard deviation is
1,000 times as large as in liters.
• The dispersion of the data set itself, however, should be independent of the unit. Therefore a
relative measure of dispersion is sometimes sought.

The coefficient of variation relates the standard deviation to the expected value, or the sample standard
deviation to the arithmetic mean:
\[ V = \frac{\sigma}{\mu} \quad \text{or} \quad V = \frac{s}{\bar{x}} \]


2.3.4 Coefficient of variation

Exercise:
Decide whether the dispersion is larger in the Oktoberfest dataset or in the patient dataset.



2.3.4 Coefficient of variation

Solution:
Measures Patient dataset:
x̄ = 37 and s = 22.03 ⇒ V = 0.60
Measures Oktoberfest dataset:
x̄ = 3.5 and s = 1.25 ⇒ V = 0.36
Measures Tip dataset:
x̄ = 2.8 and s = 1.53 ⇒ V = 0.55

The data in the patient dataset shows the highest relative dispersion.
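
The unit independence of the coefficient of variation can be illustrated with a short Python sketch. It uses `statistics.stdev` (division by n − 1); `coefficient_of_variation` is a hypothetical helper name, and the Oktoberfest data is expanded from its frequency table.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation (n - 1) divided by the arithmetic mean."""
    return statistics.stdev(data) / statistics.mean(data)


patients = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]
litres = [a for a, n in [(1, 2), (2, 30), (3, 37), (4, 28), (5, 23), (6, 8)] for _ in range(n)]
millilitres = [1000 * x for x in litres]

print(f"{coefficient_of_variation(patients):.2f}")     # 0.60
print(f"{coefficient_of_variation(litres):.2f}")       # 0.36
print(f"{coefficient_of_variation(millilitres):.2f}")  # 0.36
```

The standard deviation in milliliters is 1,000 times the one in liters, but so is the mean, so the ratio stays the same.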



2.3 Measures of Dispersion
Overview Online Exercises

Exercise 3.1 Exercise 3.5

Exercise 3.2 Exercise 3.6

Exercise 3.3 Exercise 3.7

Exercise 3.4 Exercise 3.8

Random Exercise



Statistics

02
Univariate Statistics
2.3 Measures of dispersion
2.4 Measures of distribution
2.4.1 Skewness
2.4.2 Kurtosis
2.5 Boxplots


2.4.1 Skewness

Which other measures might be interesting in describing a data set?


• Skewness versus symmetry of a data set

Theoretical:
\[ S = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \mu}{\sigma} \right)^3 \]

In relation to a sample:
\[ S = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3 \]


2.4.1 Skewness

Histogram of a symmetric distribution (S = 0):
(The same amount of large and small values)
[Figure: histogram over the values 1 to 13]


2.4.1 Skewness

Histogram of a rightwards skewed distribution (S > 0):
(Larger amount of small values than of large values / Mode < Median < Mean)
[Figure: histogram]


2.4.1 Skewness

Histogram of a leftwards skewed distribution (S < 0):
(Larger amount of large values than of small values / Mode > Median > Mean)
[Figure: histogram]


Statistics

02
Univariate Statistics
2.3 Measures of dispersion
2.4 Measures of distribution
2.4.1 Skewness
2.4.2 Kurtosis
2.5 Boxplots



2.4.2 Kurtosis

Which other measures might be interesting in describing a data set?


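• Kurtosis: describes the behavior in the outlying parts (tails) of a distribution

Theoretical:
\[ W = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \mu}{\sigma} \right)^4 \]

In relation to a sample:
\[ W = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 \]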
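
Skewness and kurtosis are the third and fourth standardized moments, so both can be sketched with one helper. The code below follows the sample versions (division by n − 1, standardization with s) and uses the fourth power for the kurtosis, as in the theoretical formula. The function names are hypothetical; other software may use different bias corrections and report other values.

```python
import statistics


def standardized_moment(data, power):
    """Sum of ((x - mean) / s) ** power, divided by n - 1 (sample version)."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return sum(((x - mean) / s) ** power for x in data) / (len(data) - 1)


def skewness(data):
    return standardized_moment(data, 3)


def kurtosis(data):
    return standardized_moment(data, 4)


symmetric = [1, 2, 3, 4, 5]
patients = [25, 21, 18, 37, 56, 89, 46, 23, 21, 34]

print(abs(skewness(symmetric)) < 1e-12)  # True: symmetric data has skewness 0
print(round(kurtosis(symmetric), 4))     # 1.36
print(skewness(patients) > 0)            # True: the large value 89 skews the data to the right
```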



2.4.2 Kurtosis

Histogram of a normally arched (mesokurtic) distribution such as the normal distribution (W = 3):
[Figure: histogram over the range -3 to 3]


2.4.2 Kurtosis

Histogram of a peaked (leptokurtic) distribution (W > 3):
(Values very close to the mean, rather homogeneous)
[Figure: histogram over the values 1 to 21]


2.4.2 Kurtosis

Histogram of a flat (platykurtic) distribution (W < 3):
(Few values that are close to the mean, rather heterogeneous)
[Figure: histogram over the values 1 to 21]


Statistics

02
Univariate Statistics
2.3 Measures of dispersion
2.4 Measures of distribution
2.5 Boxplots



2.5 5-Number-Summary

Which information is important when describing a data set?

5-Number-Summary:
x(1) Sample minimum
x0.25 Lower quartile
x0.5 Median
x0.75 Upper quartile
x(n) Sample maximum



2.5 Boxplots

Graphical illustration of the 5-Number-Summary: The Boxplot:

Sample maximum x(n)

Upper quartile x0.75


Median x0.5

Lower quartile x0.25

Sample minimum x(1)



2.5 Boxplots

Interpretation of the boxplot:


• The boxplot reports the median (line in the middle of the box) as the main measure of central
tendency, which is robust against outliers.

• Possible outliers and their severity can be detected via the whiskers, which extend to the
minimum and maximum.

• Additionally, the length of the box represents the interquartile range and thus reports a measure
of dispersion.


2.5 Boxplots

• The boxplot also answers questions regarding the skewness and symmetry of the data set:

• If the median is situated (more or less) in the middle of the box, this indicates a (more or less)
symmetrical distribution.

• If the median tends more towards the lower quartile this indicates a rightwards skewed
distribution.

• If the median tends more towards the upper quartile this indicates a leftwards skewed distribution.

• Again, this interpretation of skewness is robust, i.e. insusceptible to outliers.


2.5 Boxplots

Example:
Construct a boxplot for
• the patient dataset
• the Oktoberfest dataset



2.5 Boxplots

Solution:

[Figure: boxplots of the patient dataset (axis from 0 to 100) and the Oktoberfest dataset (axis from 0 to 7)]


2.5 Boxplots and Quantiles
Overview Online Exercises

Exercise 4.1 Exercise 4.5

Exercise 4.2 Exercise 4.6

Exercise 4.3 Exercise 4.7

Exercise 4.4 Exercise 4.8



Statistics

03
Bivariate Statistics
3.1 Contingency tables
3.2 Scatterplots
3.3 Measures of association



3.1 Contingency tables

Up to this point: univariate data, which means that only a single variable / characteristic has been
considered. In this part the interaction of two variables is studied.

Raw data:
Bivariate raw data is given as data points: (x1, y1),…, (xn, yn)

Grouped data:
Bivariate grouped data is given in a two-dimensional frequency table, a so-called contingency table:

       b1    b2    ...   bl
a1     n11   n12   ...   n1l   n1●
a2     n21   n22   ...   n2l   n2●
⁞      ⁞     ⁞           ⁞     ⁞
ak     nk1   nk2   ...   nkl   nk●
       n●1   n●2   ...   n●l   n


3.1 Contingency tables

Classified data:
Bivariate classified data is given via a two-dimensional contingency table:

[b0*; b1*) [b1*; b2*) ... [bl-1*; bl*)


[a0*; a1*) n11 n12 ... n1l n1●
[a1*; a2*) n21 n22 ... n2l n2●
⁞ ⁞ ⁞ ⁞ ⁞
[ak-1*; ak*) nk1 nk2 ... nkl nk●
n●1 n●2 ... n●l n



3.1 Contingency tables

Example:
Age group Sex
Male Female Total
Below 3 years 1,018,505 966,018 1,984,523
3 up to less than 6 years 1,041,011 984,172 2,025,183
6 up to less than 15 years 3,485,685 3,309,900 6,795,585
15 up to less than 18 years 1,195,380 1,133,681 2,329,061
18 up to less than 25 years 3,325,707 3,194,751 6,520,458
25 up to less than 30 years 2,455,885 2,416,648 4,872,533
30 up to less than 40 years 4,763,360 4,731,444 9,494,804
40 up to less than 50 years 6,756,735 6,594,133 13,350,868
50 up to less than 65 years 8,081,342 8,247,217 16,328,559
65 up to less than 75 years 4,246,483 4,788,107 9,034,590
75 years and more 2,775,848 4,707,683 7,483,531
Total 39,145,941 41,073,754 80,219,695

Population of Germany 09.05.2011 (Census date) by sex and age groups (Source: Destatis)



3.1 Contingency tables

Marginal distribution of bivariate data

Absolute marginal frequencies
\[ n_{i\bullet} = \sum_{j=1}^{l} n_{ij} \qquad n_{\bullet j} = \sum_{i=1}^{k} n_{ij} \]

Relative marginal frequencies
\[ r_{i\bullet} = \sum_{j=1}^{l} r_{ij} \qquad r_{\bullet j} = \sum_{i=1}^{k} r_{ij} \]


3.1 Contingency tables

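Conditional distribution of bivariate data

Conditional absolute frequencies
\[ n_{i|j} = \frac{n_{ij}}{n_{\bullet j}} \qquad n_{j|i} = \frac{n_{ij}}{n_{i\bullet}} \]

Conditional relative frequencies
\[ r_{i|j} = \frac{r_{ij}}{r_{\bullet j}} \qquad r_{j|i} = \frac{r_{ij}}{r_{i\bullet}} \]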


3.1 Contingency tables

Exercise:
Using the information given below, calculate the missing entries of the contingency table that
describes the relation between place of residence and chosen mode of commuting to work.

Car Public Transport By foot Bicycle Sum
Essen 150 265 715
Wuppertal 250 10
Köln 400 300 240 80
Dortmund 610 120 20 870
Sum 900 650 3000



3.1 Contingency tables

Solution:
            Car    Public Transport   By foot   Bicycle   Sum
Essen       150    230                265       70        715
Wuppertal   110    250                25        10        395
Köln        400    300                240       80        1020
Dortmund    610    120                120       20        870
Sum         1270   900                650       180       3000
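
The marginal and conditional frequencies of the completed table can be computed in a few lines. This is a minimal Python sketch; the dictionary layout and variable names are choices made for illustration only.

```python
modes = ["Car", "Public Transport", "By foot", "Bicycle"]
table = {
    "Essen":     [150, 230, 265, 70],
    "Wuppertal": [110, 250, 25, 10],
    "Köln":      [400, 300, 240, 80],
    "Dortmund":  [610, 120, 120, 20],
}

# Absolute marginal frequencies: row sums and column sums
row_sums = {city: sum(row) for city, row in table.items()}
col_sums = dict(zip(modes, (sum(col) for col in zip(*table.values()))))
n = sum(row_sums.values())

# Conditional distribution of the commuting mode given the city: n_ij divided by the row sum
mode_given_city = {city: [v / row_sums[city] for v in row] for city, row in table.items()}

print(row_sums)  # {'Essen': 715, 'Wuppertal': 395, 'Köln': 1020, 'Dortmund': 870}
print(col_sums)  # {'Car': 1270, 'Public Transport': 900, 'By foot': 650, 'Bicycle': 180}
print(n)         # 3000
print(round(mode_given_city["Dortmund"][0], 3))  # 0.701 (share of car commuters in Dortmund)
```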



Statistics

03
Bivariate Statistics
3.1 Contingency tables
3.2 Scatterplots
3.3 Measures of association



3.2 Scatterplots

Scatterplot:



Statistics

03
Bivariate Statistics
3.2 Scatterplots
3.3 Measures of association
3.3.1 Covariance and Pearson‘s Correlation coefficient
3.3.2 Spearman‘s Rank-correlation coefficient
3.3.3 χ2-Statistic and Contingency


Why do I need this?
Basis for empirical market and opinion research
In classical market and opinion research it is often the analysis of relations or differences that is of
particular interest.
Like with the measures discussed earlier, the choice of a suitable measure of association depends on
the scale levels of the variables involved.
If, as is the case in the following chapters, measures of association are only calculated but no tests of
their significance are conducted, they are considered descriptive tools, since they only describe
properties of the underlying sample, but do not allow for any inference regarding the population.


Case Study
Study on the relation between learning and performance goal orientation and professional
success (Spearman rank correlation)

Variable                                 Income   Hierarchy   Self-evaluation   Success Satisfaction   Work Satisfaction
Learning Goal Orientation                0.08     0.05        0.09              0.16*                  0.17**
Approach-Performance Goal Orientation    0.02     0.02        0.10              0.01                   0.07
Avoidance-Performance Goal Orientation   -0.13*   -0.16*      -0.16*            -0.21                  -0.17**

Source: Böge, Perret, Netzel (2021)

3.3.1 Covariance and Pearson's Correlation coefficient

Covariance
\[ \sigma_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

Covariance (Displacement law)
\[ \sigma_{xy} = \frac{1}{n} \sum_{i=1}^{n} x_i \cdot y_i - \bar{x} \cdot \bar{y} = \overline{xy} - \bar{x} \cdot \bar{y} \]


3.3.1 Covariance and Pearson's Correlation coefficient



3.3.1 Covariance and Pearson's Correlation coefficient

Thus, reading each product (xi – x̄)(yi – ȳ) as the signed area of a rectangle between the data point and
the center (x̄, ȳ):
• Many and large rectangles in the first and third quadrant lead to positive values (“positive relation”)
• Many and large rectangles in the second and fourth quadrant lead to negative values (“negative
relation”)

Problem:
Covariance is not bounded and can take any real value!

Solution:
Standardization


3.3.1 Covariance and Pearson's Correlation coefficient

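Correlation coefficient by Bravais and Pearson:
(Standardization of the covariance with the standard deviations of x and y)
\[ r_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \in [-1; 1] \]

\[ r_{xy} = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i \cdot y_i - \bar{x} \cdot \bar{y}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2} \, \sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2 - \bar{y}^2}} = \frac{\overline{xy} - \bar{x} \cdot \bar{y}}{\sqrt{\overline{x^2} - \bar{x}^2} \, \sqrt{\overline{y^2} - \bar{y}^2}} \]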


3.3.1 Covariance and Pearson's Correlation coefficient

Interpretation of the correlation coefficient by Bravais-Pearson


The correlation coefficient by Bravais-Pearson measures the linear relation between two variables:
rxy = 1    All observations are situated on a line with a positive slope
rxy = -1   All observations are situated on a line with a negative slope
rxy = 0    No linear relation exists.

Attention!
A correlation coefficient close to zero only rules out a linear relation. A non-linear relation (for
example a U-shaped one) can still exist.


3.3.1 Covariance and Pearson's Correlation coefficient

Example:
A hospital has measured the number of sold entrance tickets for a neighboring ski resort x (in
thousands) as well as the number y of patients that needed to be treated for broken bones:

i xi yi
1 5 12
2 6 14
3 5.5 9
4 2 4
5 3.8 7
6 4.4 10
7 6.2 13
8 5.6 12
9 4.2 7
10 5.9 15

Calculate the correlation coefficient.



3.3.1 Covariance and Pearson's Correlation coefficient

Solution: (Approach 1)

i xi yi (xi - xത) (yi - yത) (xi - xത)2 (yi - yത)2 (xi - xത) (yi - yത)
1 5 12 0.14 1.7 0.0196 2.89 0.238
2 6 14 1.14 3.7 1.2996 13.69 4.218
3 5.5 9 0.64 -1.3 0.4096 1.69 -0.832
4 2 4 -2.86 -6.3 8.1796 39.69 18.018
5 3.8 7 -1.06 -3.3 1.1236 10.89 3.498
6 4.4 10 -0.46 -0.3 0.2116 0.09 0.138
7 6.2 13 1.34 2.7 1.7956 7.29 3.618
8 5.6 12 0.74 1.7 0.5476 2.89 1.258
9 4.2 7 -0.66 -3.3 0.4356 10.89 2.178
10 5.9 15 1.04 4.7 1.0816 22.09 4.888
Sum: 48.6 103 x x 15.104 112.1 37.22

x̄ = 4.86   ȳ = 10.3
\[ r_{xy} = \frac{37.22}{\sqrt{15.104} \cdot \sqrt{112.1}} = 0.905 \]


3.3.1 Covariance and Pearson's Correlation coefficient

Solution: (Approach 2)
i xi yi xi2 yi2 xiyi
1 5 12 25 144 60
2 6 14 36 196 84
3 5.5 9 30.25 81 49.5
4 2 4 4 16 8
5 3.8 7 14.44 49 26.6
6 4.4 10 19.36 100 44
7 6.2 13 38.44 169 80.6
8 5.6 12 31.36 144 67.2
9 4.2 7 17.64 49 29.4
10 5.9 15 34.81 225 88.5
Sum: 48.6 103 251.3 1,173 537.8
Average: 4.86 10.3 25.13 117.3 53.78

\[ r_{xy} = \frac{53.78 - 4.86 \cdot 10.3}{\sqrt{25.13 - 4.86^2} \cdot \sqrt{117.3 - 10.3^2}} = 0.905 \]
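
Approach 2 translates directly into code. The following Python sketch implements the displacement-law version of the Bravais-Pearson coefficient; the function and variable names are illustrative.

```python
import math


def pearson(xs, ys):
    """Bravais-Pearson correlation via means of products and squares (displacement law)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    return cov / math.sqrt(var_x * var_y)


tickets = [5, 6, 5.5, 2, 3.8, 4.4, 6.2, 5.6, 4.2, 5.9]   # sold tickets in thousands
patients = [12, 14, 9, 4, 7, 10, 13, 12, 7, 15]          # treated broken bones

print(round(pearson(tickets, patients), 3))  # 0.905
```

Because numerator and denominator use the same factor 1/n, it cancels out, which is why Approach 1 (sums of deviations) and Approach 2 (means) give the same result.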
3.3.1 Covariance and Pearson's Correlation coefficient

Overview Online Exercises

Exercise 8.1

Exercise 8.2

Exercise 9.1

Exercise 9.2

Random Exercise



Statistics

03
Bivariate Statistics
3.2 Scatterplots
3.3 Measures of association
3.3.1 Covariance and Pearson‘s Correlation coefficient
3.3.2 Spearman‘s Rank-correlation coefficient
3.3.3 χ2-Statistic and Contingency


3.3.2 Spearman's Rank-correlation coefficient

Ranking of Data – No Ties


If no ties exist, i.e. no value occurs more than once, the data can simply be ranked in ascending
order.

Example
Original Data set: E C A B F D
Ranked Data set: 5 3 1 2 6 4


3.3.2 Spearman's Rank-correlation coefficient

Ranking of Data – Ties


If ties are present, i.e. values occur more than once in the data set, the ranks that the tied values
would occupy are averaged.

Example
Original Data set: A C A B C C
A occurs twice and takes ranks 1 and 2, thus in both cases the average rank of 1.5 is used (average of the
two ranks)
B occurs once at rank 3, thus a rank of 3 is used
C occurs three times on ranks 4 through 6, thus in all three cases the average rank of 5 is used (average
of the three ranks)
Ranked Data set: 1.5 5 1.5 3 5 5
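
The averaging of tied ranks can be sketched in Python as follows. `average_ranks` is a hypothetical helper; it is written for clarity rather than speed and assumes the values can be sorted (letters are compared alphabetically).

```python
def average_ranks(values):
    """Ascending ranks; tied values receive the average of the ranks they occupy."""
    ordered = sorted(values)
    ranks = {}
    for v in set(values):
        first = ordered.index(v) + 1   # first (1-based) position of v in the sorted data
        count = ordered.count(v)       # number of occurrences of v
        ranks[v] = first + (count - 1) / 2
    return [ranks[v] for v in values]


print(average_ranks(list("ECABFD")))  # [5.0, 3.0, 1.0, 2.0, 6.0, 4.0]
print(average_ranks(list("ACABCC")))  # [1.5, 5.0, 1.5, 3.0, 5.0, 5.0]
```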



3.3.2 Spearman's Rank-correlation coefficient

If both variables are at least ordinally scaled, a ranking can be established, where R(xi) is
the rank of xi in the dataset. For an ordered list it holds that R(x(i)) = i.
If two data points hold the same rank we call it a tie. If every rank is only assigned once we say that no
ties exist.
Mathematically the correlation coefficient by Spearman results from applying the correlation
coefficient by Pearson to the ranks. It simplifies for the situation without ties.
Spearman’s correlation coefficient for raw data (with ties)
\[ R_{xy} = \frac{\overline{R(x) \cdot R(y)} - \overline{R(x)} \cdot \overline{R(y)}}{\sqrt{\overline{R(x)^2} - \overline{R(x)}^2} \, \sqrt{\overline{R(y)^2} - \overline{R(y)}^2}} \]

Spearman’s correlation coefficient for raw data (without ties)
\[ R_{xy} = 1 - \frac{6 \sum_{i=1}^{n} \left( R(x_i) - R(y_i) \right)^2}{n (n^2 - 1)} \]


3.3.2 Spearman's Rank-correlation coefficient

What does the correlation coefficient by Spearman actually measure?

Interpretation of the correlation coefficient by Spearman


The correlation coefficient by Spearman measures the monotone relation between variables:
Rxy = 1    With every increase of the x-value the y-value also increases
Rxy = -1   With every increase of the x-value the y-value decreases
Rxy = 0    No monotone relation exists between the two variables. A non-monotone relation can,
           however, still exist.

What does monotone relation mean?


The correlation coefficient by Pearson measures how closely the data follow a straight line. Spearman’s
rank correlation coefficient in contrast reflects the monotonicity of the relationship, i.e. whether the
relation is consistently decreasing or consistently increasing, without switching between the two.
3.3.2 Spearman's Rank-correlation coefficient

Example:
The following table summarizes the ECTS marks of six randomly selected pupils in mathematics (xi)
and physics (yi) :
xi yi
B A
B B
A B
C C
E D
D D
Calculate Spearman’s rank correlation coefficient.



3.3.2 Spearman's Rank-correlation coefficient

Solution:
i         xi   yi   R(xi)   R(yi)   R(xi)²    R(yi)²   R(xi)R(yi)
1         B    A    2.5     1       6.25      1        2.5
2         B    B    2.5     2.5     6.25      6.25     6.25
3         A    B    1       2.5     1         6.25     2.5
4         C    C    4       4       16        16       16
5         E    D    6       5.5     36        30.25    33
6         D    D    5       5.5     25        30.25    27.5
Sum                 21      21      90.5      90       87.75
Average             3.5     3.5     15.0833   15       14.625

\[ R_{xy} = \frac{14.625 - 3.5 \cdot 3.5}{\sqrt{15.0833 - 3.5 \cdot 3.5} \cdot \sqrt{15 - 3.5 \cdot 3.5}} = \frac{2.375}{2.7914} \approx 0.8508 \]

The simplified version cannot be used as ties exist.
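
The solution can be reproduced by ranking both variables with averaged ties and then applying the Bravais-Pearson formula to the ranks. A minimal, self-contained Python sketch (helper names are hypothetical; marks are ranked alphabetically, so A receives rank 1):

```python
import math


def average_ranks(values):
    """Ascending ranks; tied values receive the average of the ranks they occupy."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


def spearman(xs, ys):
    """Pearson's coefficient applied to the average ranks (valid with and without ties)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum(a * b for a, b in zip(rx, ry)) / n - mean_x * mean_y
    var_x = sum(a * a for a in rx) / n - mean_x ** 2
    var_y = sum(b * b for b in ry) / n - mean_y ** 2
    return cov / math.sqrt(var_x * var_y)


maths = list("BBACED")
physics = list("ABBCDD")

print(round(spearman(maths, physics), 4))  # 0.8508
```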



3.3.2 Spearman's Rank-correlation coefficient
Overview Online Exercises

Exercise 7.1

Exercise 7.2

Random Exercise



Statistics

03
Bivariate Statistics
3.2 Scatterplots
3.3 Measures of association
3.3.2 Spearman‘s Rank-correlation coefficient
3.3.3 χ2-Statistic and Contingency
3.3.4 Scale levels and Effect sizes


3.3.3 χ2-Statistic and Contingency

Distinction:
• Contingency: Measure of association between nominally scaled variables
• Measure of association by Yule
• Contingency coefficient by Pearson
• Correlation: Measure of association between metrically or ordinally scaled variables
• Covariance and correlation coefficient by Bravais-Pearson (metric scale)
• Rank-correlation by Spearman (ordinal scale)



3.3.3 χ2-Statistic and Contingency

If at least one of the variables is nominally scaled, the association is referred to as a contingency.
In the special case that both variables can only take two values we simply call it association, and the
corresponding contingency table is called a four-field table:

       b1    b2
a1     n11   n12   n1●
a2     n21   n22   n2●
       n●1   n●2   n

A popular measure using the four-field table is the coefficient of association by Yule:
\[ Y = \frac{n_{11} n_{22} - n_{12} n_{21}}{n_{11} n_{22} + n_{12} n_{21}} \in [-1; 1] \]


3.3.3 χ2-Statistic and Contingency

Example:
A survey on smoking behavior collected data from 200 students and resulted in the following dataset :

Smoker Non-smoker Sum


Female 30 70 100
Male 50 50 100
Sum 80 120 200

Calculate Yule’s coefficient of association.

Solution:
\[ Y = \frac{30 \cdot 50 - 70 \cdot 50}{30 \cdot 50 + 70 \cdot 50} = -0.4 \]
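
Yule's coefficient is a one-line computation on the four cells of the table. A small Python sketch, with the hypothetical function name `yule`:

```python
def yule(n11, n12, n21, n22):
    """Yule's coefficient of association for a four-field table."""
    return (n11 * n22 - n12 * n21) / (n11 * n22 + n12 * n21)


# Rows: female, male; columns: smoker, non-smoker
print(yule(30, 70, 50, 50))  # -0.4
```

Swapping the two rows (or the two columns) only flips the sign, so the sign has to be read together with the ordering of the categories.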



3.3.3 χ2-Statistic and Contingency

If a contingency table has more than two rows or columns, the Yule coefficient can no longer be
calculated. The contingency coefficient is then calculated in several steps:
χ² (Chi squared) for any contingency table
\[ \chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \]

Expected frequencies eij:
\[ e_{ij} = \frac{n_{i\bullet} \cdot n_{\bullet j}}{n} \]
Here ni● and n●j are the row and column sums of the contingency table.


3.3.3 χ2-Statistic and Contingency

Contingency coefficient by Pearson
\[ K = \sqrt{\frac{\chi^2}{\chi^2 + n}} \in \left[ 0; \sqrt{\frac{M - 1}{M}} \right] \]
with M = min{I; J} (I is the number of rows and J is the number of columns of the contingency table)

Corrected contingency coefficient by Pearson
\[ K^* = \sqrt{\frac{M}{M - 1}} \cdot K \in [0; 1] \]
with M = min{I; J}.


3.3.3 χ2-Statistic and Contingency

Cramer‘s V
\[ V = \sqrt{\frac{\chi^2}{n \cdot \min(I - 1; J - 1)}} \in [0; 1] \]
I is the number of rows and J is the number of columns of the contingency table.

Cramer‘s Phi
\[ \text{Phi} = \sqrt{\frac{\chi^2}{n}} \]


3.3.3 χ2-Statistic and Contingency

Example:
Calculate the corrected contingency coefficient for the following dataset :

Smoker Non-smoker
Female 30 70 100
Male 50 50 100
80 120 200



3.3.3 χ2-Statistic and Contingency

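Solution:
Expected frequencies:
          Smoker   Non-smoker
Female    40       60           100
Male      40       60           100
          80       120          200

\[ \chi^2 = \frac{(30 - 40)^2}{40} + \frac{(70 - 60)^2}{60} + \frac{(50 - 40)^2}{40} + \frac{(50 - 60)^2}{60} = 2.5 + 1.67 + 2.5 + 1.67 = 8.33 \]

\[ K = \sqrt{\frac{8.33}{8.33 + 200}} = 0.2 \qquad V = \sqrt{\frac{8.33}{200 \cdot \min(1; 1)}} = 0.2041 \]

M = 2
\[ K^* = \sqrt{\frac{2}{2 - 1}} \cdot 0.2 = 1.4142 \cdot 0.2 = 0.2828 \]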
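
The whole chain from expected frequencies to K, K* and V can be sketched in Python for a table of any size. The function name `chi_square` and the list-of-rows layout are choices made for this illustration.

```python
import math


def chi_square(table):
    """Chi-squared statistic and sample size for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n   # e_ij = row sum * column sum / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


table = [[30, 70], [50, 50]]   # rows: female, male; columns: smoker, non-smoker
chi2, n = chi_square(table)
m = min(len(table), len(table[0]))

k = math.sqrt(chi2 / (chi2 + n))          # contingency coefficient
k_corrected = math.sqrt(m / (m - 1)) * k  # corrected contingency coefficient
v = math.sqrt(chi2 / (n * (m - 1)))       # Cramer's V; min(I - 1, J - 1) equals M - 1

print(round(chi2, 2), round(k, 4), round(k_corrected, 4), round(v, 4))  # 8.33 0.2 0.2828 0.2041
```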



3.3.3 χ2-Statistic and Contingency

Interpreting contingency:
Does a value of 0.2 represent strong or weak association?
• The value is often misinterpreted because it is compared directly with the absolute value of a
correlation coefficient, which is not appropriate.
• While in theory the corrected contingency coefficient can take any value between 0 and 1, in practice
values are usually far below 1.
• Thus, statements regarding the strength of the association should always be supplemented by a
contingency test (to be discussed in Statistics 2).
• Also, the type of association can only be determined from the conditional distributions.


3.3.3 χ2-Statistic and Contingency
Overview Online Exercises

Exercise 6.1

Exercise 6.2

Exercise 6.3

Exercise 6.4

Random Exercise



Statistics

03
Bivariate Statistics
3.2 Scatterplots
3.3 Measures of association
3.3.2 Spearman‘s Rank-correlation coefficient
3.3.3 χ2-Statistic and Contingency coefficient
3.3.4 Scale levels and Effect sizes



3.3.4 Scale levels and Effect sizes

Bivariate Data and Scale levels: (Which indicator can be calculated when?)

           Nominal                        Ordinal              Metric
Nominal    χ2 / Contingency Coefficient
Ordinal    χ2 / Contingency Coefficient   Spearman‘s Rank CC
Metric     χ2 / Contingency Coefficient   Spearman‘s Rank CC   Pearson‘s CC


3.3.4 Scale levels and Effect sizes

Effect sizes: (How strong / pronounced is the association?)


Nominal
Use Cramer‘s V
Ordinal
Use Spearman‘s Rank Correlation Coefficient
Metric
Use Pearson‘s Correlation Coefficient

Grouping
0.0 ≤ |Measure| < 0.1 No Association
0.1 ≤ |Measure| < 0.3 Weak Association
0.3 ≤ |Measure| < 0.5 Moderate Association
0.5 ≤ |Measure| ≤ 1.0 Strong Association


Statistics 1

03
Bivariate Statistics
3.2 Scatterplots
3.3 Measures of association
3.4 Simple linear regression



Why do I need this?
Basis for empirical market and opinion research
While correlations only describe a two-sided relation between two variables, regression analysis offers
a tool to illustrate and study directed relations between variables.
In contrast to classical measures of association, regression analysis is not limited to only two variables.
For this reason, as well as its flexible application, regression analysis is considered one of the
central methods in quantitative data analysis.
Like with the measures of association, as long as no tests are conducted, regression analysis is
considered a tool of descriptive statistics.


Case Study
Testing the Catch-Up Hypothesis from economics (Poor countries grow faster than rich countries) for
Russia (Dependent variable is economic growth)

Variable 1994-2013 1994-1999 2000-2013


GDP per capita -0.1153*** -0.3582*** -0.0884***
Constant 1.0628*** 2.3481*** 0.8710***
R2 0.179 0.400 0.248

Source: Perret (2018)


3.4 Simple linear regression

Correlation: x <-> y two-sided relation (no causal effect)

Regression: x -> y one-sided effect of x on y (assumed causal direction)

Goal: The dependent variable y should be described via the independent variable x.
The goal lies in finding a function (line) that when set amidst the scatterplot
minimizes the sum of squared distances between the line and all of the points
of the scatterplot.



3.4 Simple linear regression

[Figure: two scatterplots with the axes x and y]


3.4 Simple linear regression

[Figure: observed value yi and estimated value ŷi]

Goal: Minimize the sum of the squared differences between all observed values yi and the estimated
values ŷi.


3.4 Simple linear regression
Assumptions
• The independent variable x affects the dependent variable y, but not the other way round
• The values of x are exogenously given and not stochastic in nature
• The residuals εi are normally distributed around a mean of 0
• The variable x is uncorrelated with the residuals
• All residuals have a constant variance
(No heteroscedasticity)
• The residuals are not pairwise correlated with each other
(No autocorrelation)

Later on (Statistics 2) the part below also becomes important:


• No variable xi can be written as a linear combination of any other independent variables
(No multicollinearity)



3.4 Simple linear regression
Calculating the coefficients using the method of least squares
• Calculate the difference between the observed values $y_i$ and the estimated values $\hat{y}_i$
\[ y_i - \hat{y}_i \]
• Square the difference
\[ (y_i - \hat{y}_i)^2 \]
• Sum up all squared differences
\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
• Replace the estimated values with the formula for the regression line
\[ \sum_{i=1}^{n} \bigl( y_i - (b_0 + b_1 x_i) \bigr)^2 \]
• Expand the square
\[ \sum_{i=1}^{n} \bigl( y_i^2 - 2 y_i b_0 - 2 y_i b_1 x_i + b_0^2 + 2 b_0 b_1 x_i + b_1^2 x_i^2 \bigr) \]
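• Minimize with respect to $b_0$ and $b_1$ to find the regression line with the smallest sum of squared deviations. The partial derivatives are:
\[ \text{Regarding } b_1: \quad \sum_{i=1}^{n} \bigl( -2 y_i x_i + 2 b_0 x_i + 2 b_1 x_i^2 \bigr) \]
\[ \text{Regarding } b_0: \quad \sum_{i=1}^{n} \bigl( -2 y_i + 2 b_0 + 2 b_1 x_i \bigr) \]
• Set both derivatives equal to 0, divide both equations by -2 and split up the sums
\[ \text{Regarding } b_1: \quad \sum_{i=1}^{n} y_i x_i - b_0 \sum_{i=1}^{n} x_i - b_1 \sum_{i=1}^{n} x_i^2 = 0 \]
\[ \text{Regarding } b_0: \quad \sum_{i=1}^{n} y_i - b_0 n - b_1 \sum_{i=1}^{n} x_i = 0 \]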
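
The two equations obtained from setting the derivatives to 0 form a linear system in $b_0$ and $b_1$, so they can also be solved numerically. The following is a minimal sketch under stated assumptions: the function name `solve_normal_equations` and the toy data are hypothetical, and Cramer's rule is used here only as one convenient way to solve a 2x2 system.

```python
def solve_normal_equations(xs, ys):
    """Solve the two first-order conditions of least squares for (b0, b1).

    b0 * sum(x) + b1 * sum(x^2) = sum(x*y)
    b0 * n      + b1 * sum(x)   = sum(y)
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Determinant of the coefficient matrix [[sum_x, sum_x2], [n, sum_x]]
    det = sum_x * sum_x - n * sum_x2
    if det == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    # Cramer's rule for the 2x2 system
    b0 = (sum_xy * sum_x - sum_x2 * sum_y) / det
    b1 = (sum_x * sum_y - n * sum_xy) / det
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

b0, b1 = solve_normal_equations(xs, ys)
print(b0, b1)
```

Because the toy points lie exactly on a line, the solution recovers that line's intercept and slope.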

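• Divide the second equation by n and solve for $b_0$
\[ \text{Regarding } b_0: \quad \frac{1}{n} \sum_{i=1}^{n} y_i - b_1 \frac{1}{n} \sum_{i=1}^{n} x_i = b_0 \]
• The two sums are the arithmetic means $\bar{x}$ and $\bar{y}$ of x and y
\[ \text{Regarding } b_0: \quad \bar{y} - b_1 \bar{x} = b_0 \]
• Divide the first equation by n and replace $b_0$
\[ \text{Regarding } b_1: \quad \frac{1}{n} \sum_{i=1}^{n} y_i x_i - (\bar{y} - b_1 \bar{x}) \frac{1}{n} \sum_{i=1}^{n} x_i - b_1 \frac{1}{n} \sum_{i=1}^{n} x_i^2 = 0 \]
• Solving for $b_1$
\[ b_1 = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{y} \bar{x}}{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \frac{\overline{x \cdot y} - \bar{x} \cdot \bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{\sigma_{xy}}{\sigma_x^2} \]

\[ b_0 = \bar{y} - b_1 \bar{x} \]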
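
The last equality states that the slope equals the covariance of x and y divided by the variance of x. As a minimal sketch, the following Python snippet computes the slope in both ways and shows that they agree. The function names and the toy data are hypothetical; covariance and variance are computed with the divisor n, which is an assumption matching the 1/n sums in the derivation.

```python
def slope_from_means(xs, ys):
    """b1 = (mean(x*y) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    return (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)


def slope_from_covariance(xs, ys):
    """b1 = covariance(x, y) / variance(x), both with divisor n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    return cov_xy / var_x


# Hypothetical toy data
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

b1_means = slope_from_means(xs, ys)
b1_cov = slope_from_covariance(xs, ys)
print(round(b1_means, 4), round(b1_cov, 4))
```

Both functions return the same slope up to floating-point rounding, since the two expressions are algebraically identical.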

Use the following formulas to deduce the regression line y = b1x + b0:

• $b_1 = \dfrac{\overline{x \cdot y} - \bar{x} \cdot \bar{y}}{\overline{x^2} - \bar{x}^2}$

• $b_0 = \bar{y} - b_1 \bar{x}$

Example:
Dina Vier produces high-quality picture books. The costs yi (in thousand €) for different print runs xi
(in 1,000 pieces) are summarized in the following table:
Costs (thousand €) Produced quantity (1,000)
22.1 3.1
16.3 2.2
17.8 2.1
25.9 2.9
20.5 2.4
28.4 3.3
12.1 1.5
22.5 3.3

a) Calculate the linear regression line that explains the costs in relation to the produced quantity.
b) Dina Vier receives a new order of 2,800 picture books. Which total costs can be expected?

Solution:
i       y_i      x_i     x_i^2    x_i y_i
1       22.1     3.1      9.61     68.51
2       16.3     2.2      4.84     35.86
3       17.8     2.1      4.41     37.38
4       25.9     2.9      8.41     75.11
5       20.5     2.4      5.76     49.20
6       28.4     3.3     10.89     93.72
7       12.1     1.5      2.25     18.15
8       22.5     3.3     10.89     74.25
Sum    165.6    20.8     57.06    452.18
Mean    20.7     2.6      -         -
\[ b_1 = \frac{0.125 \cdot 452.18 - 20.7 \cdot 2.6}{0.125 \cdot 57.06 - 2.6 \cdot 2.6} = 7.255 \quad \text{and} \quad b_0 = 20.7 - 7.255 \cdot 2.6 = 1.837 \]
The relationship between costs and produced quantity can thus be described by: y = 7.255x + 1.837
Thus, for an order of 2,800 books (x = 2.8), costs of y = 7.255 ∙ 2.8 + 1.837 = 22.151 thousand €, i.e. about 22,151 €, can be expected.
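
The calculation above can be checked with a few lines of Python. This is a minimal sketch that uses only the data from the example table; the variable names are chosen here for illustration and are not part of the exercise.

```python
# Data from the example: costs y (thousand EUR) and produced quantity x (1,000 pieces)
costs = [22.1, 16.3, 17.8, 25.9, 20.5, 28.4, 12.1, 22.5]
quantity = [3.1, 2.2, 2.1, 2.9, 2.4, 3.3, 1.5, 3.3]

n = len(costs)
mean_x = sum(quantity) / n
mean_y = sum(costs) / n
mean_xy = sum(x * y for x, y in zip(quantity, costs)) / n
mean_x2 = sum(x * x for x in quantity) / n

# Coefficients of the regression line y = b1*x + b0
b1 = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
b0 = mean_y - b1 * mean_x

# Part b): expected costs for an order of 2,800 books (x = 2.8)
predicted_costs = b0 + b1 * 2.8

print(round(b1, 3), round(b0, 3), round(predicted_costs, 3))
# prints: 7.255 1.837 22.151
```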

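[Figure: decomposition of the deviation of an observed value $y_i$ from the average $\bar{y}$ into an unexplained part (between $y_i$ and $\hat{y}_i$) and an explained part (between $\hat{y}_i$ and $\bar{y}$)]

\[ R^2 = \frac{\text{Explained variance}}{\text{Total variance}} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]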

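The coefficient of determination R² is defined as:
\[ R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad \text{or} \quad R^2 = \frac{s^2_{\hat{y}}}{s^2_y} \]
with $\hat{y} = b_0 + b_1 x$.

It holds that 0 ≤ R² ≤ 1.

The coefficient gives the share of the total variance that can be explained via the regression line.

For the simple linear regression it holds that:

\[ R^2 = r_{xy}^2 = \frac{\left( \overline{x \cdot y} - \bar{x} \cdot \bar{y} \right)^2}{\left( \overline{x^2} - \bar{x}^2 \right) \left( \overline{y^2} - \bar{y}^2 \right)} \]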
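
As a minimal sketch, the following Python snippet computes R² for the picture book example in both ways: as the ratio of explained to total variance and as the squared correlation coefficient. The data are the eight observations from the example; the variable names are assumptions made for illustration.

```python
# Data from the picture book example
costs = [22.1, 16.3, 17.8, 25.9, 20.5, 28.4, 12.1, 22.5]      # y, thousand EUR
quantity = [3.1, 2.2, 2.1, 2.9, 2.4, 3.3, 1.5, 3.3]           # x, 1,000 pieces

n = len(costs)
mean_x = sum(quantity) / n
mean_y = sum(costs) / n
mean_xy = sum(x * y for x, y in zip(quantity, costs)) / n
mean_x2 = sum(x * x for x in quantity) / n
mean_y2 = sum(y * y for y in costs) / n

# Regression line and fitted values
b1 = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * x for x in quantity]

# Variant 1: explained variance divided by total variance
explained = sum((f - mean_y) ** 2 for f in fitted)
total = sum((y - mean_y) ** 2 for y in costs)
r_squared = explained / total

# Variant 2: squared correlation coefficient of x and y
r_xy_squared = (mean_xy - mean_x * mean_y) ** 2 / (
    (mean_x2 - mean_x ** 2) * (mean_y2 - mean_y ** 2)
)

print(round(r_squared, 3), round(r_xy_squared, 3))
```

Both variants give the same value of roughly 0.81, meaning that about four fifths of the variance in costs is explained by the produced quantity.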

Effect sizes (How good are the regression results?)

Grouping
0.00 < R² < 0.01     No explanatory power
0.01 < R² < 0.09     Weak explanatory power
0.09 < R² < 0.25     Moderate explanatory power
0.25 < R² < 1.00     Strong explanatory power
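
The grouping can be written as a small lookup function. This is a minimal sketch: the function name `classify_r_squared` is hypothetical, and because the table does not say to which group a value exactly on a boundary (for example 0.09) belongs, the code assumes that each boundary value counts towards the higher group.

```python
def classify_r_squared(r_squared):
    """Map an R² value to the verbal grouping of the effect size table.

    Assumption: boundary values (0.01, 0.09, 0.25) are assigned to the
    higher group, since the table leaves this open.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R² must lie between 0 and 1")
    if r_squared < 0.01:
        return "No explanatory power"
    if r_squared < 0.09:
        return "Weak explanatory power"
    if r_squared < 0.25:
        return "Moderate explanatory power"
    return "Strong explanatory power"


# R² values from the case study on the Catch-Up Hypothesis
for period, r2 in [("1994-2013", 0.179), ("1994-1999", 0.400), ("2000-2013", 0.248)]:
    print(period, classify_r_squared(r2))
```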

Overview Online Exercises

Exercise 10.1

Exercise 10.2

Exercise 10.3

Exercise 10.4

Random Exercise



Statistics 1

Thank you for your attention!


Remember to evaluate!
