STATISTICS
BSTA01-01_E
MASTHEAD
Publisher:
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt
Mailing address:
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
[email protected]
www.iu.de
BSTA01-01_E
Version No.: 001-2023-1103
Concept: IU Internationale Hochschule GmbH
Author(s): Heike Bornewasser-Hermes
Translation: Heike Bornewasser-Hermes and Nazli Andjic
TABLE OF CONTENTS
STATISTICS
Introduction
Signposts Throughout the Course Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Suggested Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Unit 1
Introduction 9
Unit 2
Analysis Methods of One-Dimensional Data 19
Unit 3
Analysis Methods of Two-Dimensional Data 57
Unit 4
Linear Regression 87
Unit 5
Fundamentals of Probability Theory 103
Unit 6
Special Probability Distributions 135
Unit 7
Statistical Estimation Methods 161
Unit 8
Hypothesis Testing 175
Appendix
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
List of Tables and Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
INTRODUCTION
WELCOME
SIGNPOSTS THROUGHOUT THE COURSE BOOK
This course book contains the core content for this course. Additional learning materials
can be found on the learning platform, but this course book should form the basis for your
learning.
The content of this course book is divided into units, which are divided further into sections. Each section contains only one new key concept to allow you to quickly and efficiently add new learning material to your existing knowledge.
At the end of each section of the digital course book, you will find self-check questions. These questions are designed to help you check whether you have understood the concepts in each section.
For all modules with a final exam, you must complete the knowledge tests on the learning
platform. You will pass the knowledge test for each unit when you answer at least 80% of
the questions correctly.
When you have passed the knowledge tests for all the units, the course is considered finished and you will be able to register for the final assessment. Please ensure that you complete the evaluation prior to registering for the assessment.
Good luck!
SUGGESTED READINGS
GENERAL SUGGESTIONS
Freeman, J., Shoesmith, E., Sweeney, D., Anderson, D., & Williams, T. (2017). Statistics for business and economics (4th ed.). Cengage Learning. http://search.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.45823&site=eds-live&scope=site

Larsen, R. J., & Marx, M. L. (2012). An introduction to mathematical statistics and its application (5th ed.). Prentice Hall.
LEARNING OBJECTIVES
The term “statistics” generally describes two phenomena: (1) the tabular and/or graphical
preparation of data as well as (2) the statistical methods used to collect, prepare, and
draw conclusions from data. From this, it follows that our confrontation with statistics is
omnipresent – whether it is in our studies (e.g., consider how statistical methods are
applied in related courses or bachelor theses) or everyday professional life (e.g., managers
are confronted on a daily basis with statistical evaluations that they have to understand
and interpret).
The aim of your Statistics course book is to teach you the most essential elements of statistical procedures. Methodologically, the course book comes in three parts. First, you will be presented with the theoretical foundations of individual statistical methods, which will then be deepened through the use of small examples and illustrations. Second, you will apply and run the methods learned on application cases. Third, you will learn how to create and interpret conclusions drawn from a "sample."
UNIT 1
INTRODUCTION
STUDY GOALS
Introduction
A business owner wonders how their turnover will develop in the future. The head of a hospital wants to know what its capacities look like in terms of free beds. A private household needs to compare its monthly income and expenditure. All three use statistical analyses – no matter how simple – to get one step closer to their desired insights. The unit discusses the most important goals of statistics, the basic concepts, and the process of statistical investigations.
In the medical field, it is common to investigate whether a newly developed drug actually has the desired effect on patients. For this purpose, experts often conduct experiments in which one group of patients takes a drug for a certain period (i.e., the experimental group), while another group of patients takes a placebo (i.e., the control group). Here, the aim is to determine whether the health status of the experimental group improves in comparison to that of the control group. Numerous data are collected over the course of such an experiment. The health status is always recorded at the beginning of the study, and further data are obtained during the experiment.
Studies in the field of psychology also focus on gaining new insights. For instance, many
psychologists want to increase their knowledge of how much stress different individuals
experience in certain situations. For this purpose, they often carry out experiments to
obtain reliable data on human behavior.
In essence, statistical procedures are necessary to collect findings and draw appropriate
conclusions from different research inquiries – such as those described above – and their
associated data.
To begin this section, let's take a look at a situation from autumn 2022. During this time, numerous parts of the health system were severely tested by the ongoing COVID-19 pandemic. Many hospitals were pushed to their capacity, nurses and doctors were at the limit of their ability to cope, and many employees experienced great psychological strain due to social isolation and the constant need to work from home. Statisticians were essential in obtaining and maintaining an overview of the situation. With this in mind, let's look at some specific examples that illustrate the importance of statistics.
At the time, it was crucial to know the current number of available hospital beds. Thus, in
many regions, the number of free and occupied intensive care beds was analyzed on a
daily basis. Likewise, the proportion of patients infected with COVID-19 in the intensive
care section was also examined to determine the available capacities for patients not
infected with COVID-19. With the help of this information and the necessary statistical
know-how, it was possible to estimate both how long the available capacities would last
and when, in the worst case, medical care would collapse.
Notably, there was also a great interest in observing the interaction between the incidence
of mental illness and the demand for therapists. A researcher at the time might have found
themselves with questions like the following ones:
• Is the pandemic encouraging a trend in mental illnesses that were not so much in the
spotlight before?
• Is the supply of psychotherapeutic services sufficient to support these illnesses?
• How long on average does it take for a sick person to obtain a place in therapy?
Reliable data have proven to be indispensable in investigating all of these questions. Only
with such data could experts determine when the situation was under control. Together,
these three examples should demonstrate how important the field of statistics is. Based
on new findings, it enables us to take the best actions.
Suppose, then, that we want to carry out a study of German hospitals. Among other things, we might ask:
• How many intensive care beds are occupied?
• How many beds in the normal wards are occupied?
• On a scale from "very good" to "very bad," how well are the patients cared for and treated?
These and many other points could be useful to help us obtain an overview of German
hospitals and their care services. With the help of this example, let’s go through the most
important terminology that is repeatedly used in statistical contexts.
In our example, every German hospital that can potentially participate in the study represents an object. Individuals, companies, and other institutions can also be objects. In short, an object is anyone or anything that you would like to gain knowledge about. At the beginning of a study, it is always crucial to know and be able to name the totality of all objects. This defines the population, and in the present study, it includes all carriers of the characteristics that we want to investigate. In this case, all German hospitals form the population of our study.

Object: a person or subject of interest about which one wants to gain information.
Population: the population includes all objects.

It would be desirable to consider the whole population within the framework of our study. However, for reasons of time and cost, you often have a much smaller number of objects at your disposal in practice. A sample consists of the objects that are finally taken into account. In our case, our sample consists of the hospitals that we examine and can make statements about.

Sample: the sample includes all actually examined objects.
Differentiation of Samples

There are many ways to draw a sample from the population. The first one is to draw a random sample. Here, we make a distinction between three types of random sampling. In simple random sampling, the objects are selected randomly, and each entity has an equal chance of being selected. In the previous example, each hospital has the same probability of being included in the sample. This procedure is only possible if the entire population is known.

Simple random sampling: a simple random sample provides the best basis for meaningful results.
Stratified sampling is another type of random sampling. Here, the population is first divided into subpopulations according to relevant variables. Samples are then drawn from the subpopulations. In order to ensure that hospitals of all sizes are included in the example, you could first form three strata of hospitals, such as (1) hospitals with up to 500 patients, (2) those with more than 500 patients and up to 1,000 patients, and (3) those with more than 1,000 patients. Then, hospitals are randomly selected from all three strata.
The final type of random sample discussed here is a cluster sample (sometimes called a lump sample). In this case, naturally existing subsets of the population are randomly selected and then fully investigated. For example, to ensure that hospitals from all areas of Germany are included in the study, you could randomly select some of the 294 German counties. Within the selected counties, all hospitals are then included (Halfens & Meijers, 2013, pp. 249–250).
Note that only random samples are considered to form a suitable basis for statistical analyses that you want to leverage to draw conclusions about the general public. Some inquiries use ad hoc samples, in which the objects that are available at the time of data collection are selected. This is often the case for street surveys that collect opinions on current topics. Suppose that a reporter goes to the center of a major city and asks the people passing by what they think about a new political leader. They might interview everyone who is willing to provide a statement and then include them in the sample, making this an ad hoc sample. Overall, such forms of sampling create insufficient bases for making generally valid statements (Raithel, 2008, pp. 56–57).
Returning to our hospital study, we are interested in a lot of different information about the hospitals in question. For example, we would like to know whether the hospitals are university hospitals, how many intensive care beds are still available, and whether the patients are satisfied with the hospital. All these characteristics that can be measured are known as variables. The values that a variable can assume are, indeed, called values. In other words, a value is a possible observation of a variable. For example, as we have outlined it, the environment of a hospital has two possible values: urban or rural. In contrast, the number of free intensive care beds can contain any number as its value, whereas the general satisfaction of the patients with the hospital has values that range from "very good" to "very poor."

Variable: a variable is a property of interest of an object.
Value: a value is a possible observation of a variable.

The type of value is of decisive importance, and different statistical procedures may be used for individual variables. If the values consist of numbers, far more statistical calculations can be carried out than if they are only described using words. For example, we can calculate how high the minimum number of free intensive care beds is, how many beds are still free on average across all hospitals, and many other measures. However, in terms of the environment, we can only determine how many of the hospitals are in urban areas and how many are in rural areas.
Scales of Measurement

To differentiate between the types of statistical analyses that can be carried out for various characteristics, the variables are assigned to scales of measurement. Scales of measurement reflect how variables are defined. A distinction is generally made between three different scale levels.

Scale of measurement: the scale of measurement determines the possible statistical analyses.
The weakest scale, in terms of evaluation possibilities, is the nominal scale. Nominal scale variables are those whose values are actually names or categories, and they cannot be placed in a meaningful order. For instance, the type of hospital ward is a nominal scale variable. A hospital ward might be, for example, an oncology ward, a trauma surgery ward, or an orthopedics ward. These categories cannot be meaningfully ordered. A special kind of nominal scale variable is a dichotomous variable. A dichotomous variable has exactly two possible values. For instance, if we treat the environment of a hospital as either urban or rural, then the environment is a dichotomous variable. Likewise, we could investigate whether a hospital is a university hospital or not.

Nominal scale: the nominal scale allows the least statistical evaluation possibilities.
The next strongest scale of measurement is the ordinal scale. Ordinal scale variables also measure non-numeric concepts, like names and categories. In contrast to the nominal scale, however, these categories can be placed in a meaningful order. For instance, in our example, the general satisfaction of the patients with the hospital is an ordinal scale variable. If we assume that the options that a patient can select are "very good," "good," "average," "bad," and "very bad," then these are categories that can be sorted in a meaningful way. However, note that the difference between the values cannot be interpreted in a mathematically meaningful way. For example, one cannot say that a "very good" assessment of a hospital is twice as good as a "good" assessment. Both nominal and ordinal scale variables can be referred to as qualitative variables.

Ordinal scale: the ordinal scale allows more statistical evaluation possibilities than the nominal scale but less than the cardinal scale.
The strongest scale of measurement is the cardinal scale, whose values are numbers. We can distinguish between types of cardinal scales in two ways. The first way is to make a distinction between an interval scale and a ratio scale. An interval scale applies to cardinal scale variables that do not have a natural zero point, while a ratio scale applies to variables that do have a natural zero point.

Interval scale: the interval scale has no natural zero point.
Ratio scale: the ratio scale has a natural zero point.

But what do we mean by a natural zero point? Let's consider the examples of temperature and credit balance. Temperature is typically measured in degrees Celsius (°C) or Fahrenheit (°F). Since 0°C does not describe the same degree of heat or cold as 0°F, there is no natural zero point. Therefore, temperature is an interval scale variable because the number zero does not have the same meaning in all its units. If, conversely, you look at the asset or credit balance in your account, it is irrelevant which currency it is measured in. If your credit balance is €0 or $0, it means that there is simply no credit balance. If the number zero has one and the same meaning in all conceivable units, we have a ratio scale variable (Bortz & Schuster, 2010, pp. 12–14).
The cardinal scale is further subdivided into discrete and continuous variables based on the number of different values. A discrete variable refers to countable and distinct values (i.e., a finite or countably infinite number of values). An example of a discrete variable is the number of people living in a household. In the hospital example, the number of free intensive care beds is also a discrete variable. Even though there may only be a few free intensive care beds, the number can be determined by counting each bed. A continuous variable, conversely, refers to a variable that takes on any value within a range, meaning that the number of possible values within that range is infinite. Such variables are able to assume all conceivable intermediate values (Handl & Kuhlenkasper, 2018, p. 10). In the context of the hospital study, suppose that we want to consider the expenditure of each hospital. An infinite number of values is possible simply by measuring the exact amount. Cardinal scale variables can be referred to as quantitative variables. The following figure summarizes the different scale levels discussed in this unit.

Discrete variable: a discrete variable has only a few different values.
Continuous variable: a continuous variable has very many different values.
Data Collection
The first step is to collect data. The data can be primary or secondary. If the data are collected using written or oral surveys or experiments, then we are collecting primary data. In our example, if we involve the selected hospitals in a survey, then we are collecting primary data. However, it is possible to use already-existing data, which is known as secondary data. In the present context, we might be able to use data already collected from the previous year.
Data collection can look very different depending on the temporal dimension. Here, we make a basic distinction between a cross-sectional design and a longitudinal design. If our survey of hospitals is only carried out at one point in time or within a short period of time (approximately two to four weeks), then we have carried out a cross-sectional design survey. However, if we want to find out how the situation changes over time, we should collect data according to a longitudinal design, which means that we repeatedly collect data on the same variables at several successive points in time.

Cross-sectional design: in a cross-sectional design, data are collected only over a short period of time.
Longitudinal design: in a longitudinal design, the same data are repeatedly collected at several successive points in time.

When we are discussing longitudinal design, a distinction is made between trend design and panel design. If we collect data according to a trend design, we repeatedly examine hospitals at regular intervals to investigate our specific questions, and it is not necessary to study the same hospitals at each point in time. However, if we use a panel design, we must precisely fulfill the requirement of evaluating the same objects. In this case, we would need to involve the same hospitals in the study at intervals of, say, half a year.

Panel design: the panel design produces the results with the most information.

The greatest advantage of a panel design is that intraindividual changes can be observed over time. For example, it is possible to determine how the utilization of intensive care beds changes over time. The disadvantages of this design are the panel effects or learning effects (which are less of a problem in the present example) as well as panel mortalities.

Panel effects: this means that the same answers are given over and over again.
Panel mortalities: over the course of time, some participants (e.g., hospitals) may drop out of the survey for various reasons.

Ultimately, studies that are conducted according to a panel design provide the highest information content, followed by trend design and, finally, cross-sectional design (Raithel, 2008, pp. 50–51).

Data Handling
Once the data have been collected, such as through a survey or observation, data handling must be carried out. This means that the collected data are prepared with the help of statistical software such as Excel, SPSS, R, and Stata so that they can be evaluated statistically. This is a very important step because it lays the foundation for the possibility of a clean evaluation. Often, important data are missing or transferred incorrectly. A precise examination of the data, here in this second step, is, therefore, indispensable.
Data Analysis

Once all the data have been carefully entered and processed in a statistical program, the analysis of the data can begin. A distinction is made between three major areas of statistics. The first area is descriptive statistics, where the collected data are first described. Tables, graphs, or measures are used for this purpose. Descriptive statistics, thus, serve the purpose of summarizing the data in a condensed form. For example, we can use the mean value to determine the average number of occupied intensive care beds across all hospitals.

Descriptive statistics: descriptive statistics are used to describe the data collected.

The second important area is inferential statistics. These statistics check the transferability of the descriptive results to the population. Inferential statistics are used when, for example, not all German hospitals can be included in the hospital study. When we only have a selection of them, it becomes important to check whether the descriptive results for the selected hospitals can be generalized to all hospitals. For this purpose, hypotheses are usually formulated that need to be tested within the framework of inferential statistics. For example, we might test a hypothesis about whether university hospitals have a larger catchment area than non-university hospitals.

Inferential statistics: inferential statistics check the transferability of the descriptive results to the population.
The third area is exploratory statistics. These statistics explore new, under-researched areas (i.e., they are used precisely when there is little or no knowledge of the planned research area). At the beginning of the COVID-19 pandemic, all statistical analyses were initially of an exploratory nature, since all researchers were moving into an unexplored area.

Exploratory statistics: exploratory statistics explore new, under-researched areas.
In order to understand the basics of statistics, we must, without exception, deal with cross-sectional data. We will analyze cross-sectional data on paper – without any statistical programs – and examine both descriptive statistics and inferential statistics intensively.
SUMMARY
Statistics involves the analysis of data with the aim of gaining new insights. Statistical work is divided into the three steps of (1) data collection, (2) data preparation, and (3) data analysis. In the last step, a distinction is made between descriptive, inferential, and explorative analyses. A panel design offers the most informative data design. In this case, the same characteristic bearers are repeatedly interviewed at regular intervals on one and the same topic.

As a rule, you only examine data from a sample. However, the composition of the sample should be as representative as possible of the population so that we can draw meaningful conclusions for the public. Within the framework of such studies, the objects are examined based on a wide range of variables. The values of these variables are, in turn, decisive in terms of the types of statistical evaluations that are permitted. Different variables are each assigned to a possible scale of measurement to ensure a suitable differentiation between them.
UNIT 2
ANALYSIS METHODS OF ONE-DIMENSIONAL DATA
STUDY GOALS
Introduction
Let’s assume that we have conducted a small survey in a hospital. We are interested in
finding out how satisfied the patients are with the nursing robots and how many times
they have encountered such robots before this hospital stay. In addition, we asked about
the gender and age of the patients. Before we begin the analyses and, if necessary, test the
previously formulated hypotheses, it is crucial to describe the collected data. This is done
in the context of descriptive statistics.
In this section, we will first discuss univariate analysis. We will examine only one variable at a time and use tables, graphs, and various measures for this purpose. In the next section, we will carry out a bivariate analysis with the help of various correlation analyses. In all these statistical analyses, the scale level is of decisive importance. Depending on the available scale level, we will explain which analyses are feasible and which are not.

Univariate analysis: the univariate analysis examines exactly one variable.
Bivariate analysis: the bivariate analysis examines the relationship between two variables.
Throughout this unit, we will work with a survey of 25 hospital patients who were asked the following four questions:

1. What is your gender? The options are "female," "male," and "diverse."
2. How good are the nursing robots in your opinion? The options are "very good," "good," "satisfactory," "sufficient," and "poor."
3. How many times have you encountered similar nursing robots before? The options are "0," "1," "2," "3," and so on.
4. How old are you? This is given in years.
Table 1: Results of Patient Survey

Patient  Gender  Satisfaction  Previous contact  Age
1        female  good          1                 16
2        female  good          5                 -
3        female  good          0                 50
4        female  good          0                 35
5        male    satisfactory  1                 -
6        female  satisfactory  1                 47
7        female  satisfactory  2                 15
8        female  satisfactory  1                 20
9        male    good          1                 47
10       male    satisfactory  1                 48
11       female  satisfactory  1                 44
12       male    -             1                 -
13       female  good          2                 55
14       female  good          1                 56
15       female  good          0                 35
16       female  sufficient    3                 48
17       female  good          1                 -
18       female  good          1                 52
19       male    very good     0                 49
20       female  sufficient    3                 -
21       female  good          0                 68
22       female  satisfactory  1                 17
23       female  good          1                 26
24       female  satisfactory  2                 39
25       female  satisfactory  1                 -
Each row of this table contains the answers of one patient. For example, the first patient is female, she finds the nursing robots good, she has encountered similar robots once before, and she is 16 years old. It is not uncommon that (1) the interviewees do not want to or cannot answer some questions or (2) they simply forget to answer. For example, the second patient did not write her age, and the 12th patient did not provide any information on his satisfaction with the nursing robots. As you can see, there are also several more gaps in the table.
This sample data set is the basis for the entire lesson. We will look at each of the four questions above separately and use various measures to present the collected data in a clearer way. It should be noted that this compilation of patients forms the sample and not the population.
The scale level is of decisive importance in all forms of statistical analysis, including univariate analysis. The scale level determines which statistical analyses are permitted and which ones are not. Let's now go through the four characteristics of the patients:

• Gender is nominally scaled. The expressions are categories that cannot be meaningfully ordered.
• Satisfaction with nursing robots is ordinally scaled. The expressions are categories that can be meaningfully ordered.
• Both previous contact frequency with similar robots and age are cardinally scaled. The expressions, and the intervals between them, are numbers. We will also clarify whether these variables are discrete or continuous.
  ◦ Previous contact frequency with similar robots only had five different responses (0, 1, 2, 3, and 5). For this reason, this variable should be treated as a discrete one.
  ◦ Age had 16 different values among a total of 19 responses. In such cases, it is recommended to treat this variable as a continuous one and form age categories from the given ages. In the following analyses, we will observe that such a procedure provides much clearer results.
Before starting the evaluation, we must introduce a little notation. One refers to the initial data, which are available with respect to a variable x, as the primal list or raw data. In general terms, the primal list is notated as follows (Bamberg et al., 2022, p. 23):

x_1, x_2, …, x_n

x_1 describes the value for the first person, x_2 the value for the second person, and x_n the value for the nth person. n stands for the total number of persons or the sample size. For the individual persons, a person index i = 1, …, n is created, which runs through all persons. The primal list can also be denoted by x_i for i = 1, …, n. As a starting point for a data analysis, this original list is available for any scale level.

Primal list: the primal list (or raw data) contains the collected data of a variable.
In the table above, each of the four variable columns is actually a primal list. The column "Gender" represents the primal list of gender. All 25 patients provided an indication of their gender, so, in this case, n = 25. The assessment of the nursing robots was skipped by one person, so there are responses for n = 24 patients here. The frequency of previous contact with similar nursing robots was, again, answered by all patients (n = 25), whereas age was only answered by 19 patients (n = 19).
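As a small illustration (a sketch, not part of the course book's material), the sample size n per variable can be determined by counting the non-missing entries of each column; the patient data below are abbreviated, and missing answers are represented by None.

```python
# Abbreviated excerpt of the survey: one tuple per patient
# (gender, satisfaction, previous contact, age); None marks a missing answer.
patients = [
    ("female", "good", 1, 16),
    ("female", "good", 5, None),
    ("male", None, 1, None),
    ("female", "sufficient", 3, 48),
]

columns = ["gender", "satisfaction", "previous contact", "age"]
for idx, name in enumerate(columns):
    n = sum(1 for row in patients if row[idx] is not None)
    print(f"{name}: n = {n}")
```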
Tables and Graphics

Tables and graphics are commonly used to present the collected data in a clear fashion. Both serve the purpose of summarizing the data of a variable. A frequency table is one of the most important tools used to present the collected data of a variable in a clear form. Such a table summarizes which values a variable can take on, how often they occur in a sample, and what proportion they make up of the total number of all the attribute holders in the sample (Bamberg et al., 2022, p. 23). When setting up a frequency table, the important subdivision of variables based on their scale levels becomes noticeable. For this reason, we will discuss the frequency table for each individual scale level.

Frequency table: a frequency table summarizes the collected data of a variable in a compressed form.
Nominal scale variables

It is important to begin here, as when discussing the other scale levels, by clarifying our notation. First of all, a value index

j = 1, …, k

is applied to each individual value. In general, a variable has k different values. The individual values themselves are identified as

a_j with j = 1, …, k.
If we count in the original list the number of variable carriers that assume the individual values, we obtain the absolute frequencies:

n_j with j = 1, …, k

Absolute frequencies: the absolute frequencies count the occurrences of the individual variable values.

If we relate the individual absolute frequencies to the total number of variable carriers in the sample, we obtain the relative frequencies for the respective variable values:

f_j with j = 1, …, k

Relative frequencies: the relative frequencies reflect the proportions of the individual variable values.
Note that the relative frequencies can only take values from 0 to 1. The sum of all relative frequencies f_1 + f_2 + … + f_k must always be 1. If the individual relative frequencies are multiplied by 100, the corresponding percentages are obtained. This information is summarized in the following frequency table. For nominal scale variables, this table basically consists of four columns:

j  a_j  n_j  f_j
1  a_1  n_1  f_1
2  a_2  n_2  f_2
⋮  ⋮    ⋮    ⋮
k  a_k  n_k  f_k
Σ       n    1
Each row of the frequency table summarizes the most important information about one value of the variable.

For the nominal scale variable gender, the original list of the 25 patients reads (f = female, m = male):

f; f; f; f; m; f; f; f; m; m; f; m; f; f; f; f; f; f; m; f; f; f; f; f; f

This yields the following frequency table:

j  a_j     n_j  f_j
1  male     5   0.2
2  female  20   0.8
Σ          25   1
Counting from the original list, we found out that five male patients were interviewed. Dividing the five male patients by the total of all 25 patients results in a relative frequency of 0.2 or 20%. Consequently, 20 female patients remain with a proportion of 0.8 or 80%. The total of all respondents (here, 25) is always the sum under the column of absolute frequencies. The sum of all relative frequencies in the amount of 1 or 100% is noted below the column of relative frequencies.
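The counting behind such a frequency table can be sketched in a few lines of Python (an illustration, not the course book's own material), using the gender primal list from above.

```python
from collections import Counter

# Primal list of gender for the 25 patients (f = female, m = male).
gender = list("ffffmfffmmfmffffffmffffff")

absolute = Counter(gender)                            # absolute frequencies n_j
n = len(gender)
relative = {a: nj / n for a, nj in absolute.items()}  # relative frequencies f_j

for value in sorted(absolute):
    print(value, absolute[value], round(relative[value], 2))
# f appears 20 times (f_j = 0.8), m appears 5 times (f_j = 0.2)
```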
Ordinal scale variables

For ordinal scale variables, just like cardinal scale variables, the frequency table is extended by a fifth column. In this column, the cumulative frequencies are entered, which add up all relative frequencies from the first to the mth relative frequency for any value j = m:

F_m = f_1 + f_2 + … + f_m = Σ_{j=1}^{m} f_j

Cumulative frequencies: the cumulative frequencies sum up the relative frequencies.
j  a_j  n_j  f_j  F_j
1  a_1  n_1  f_1  F_1
2  a_2  n_2  f_2  F_2
⋮  ⋮    ⋮    ⋮    ⋮
k  a_k  n_k  f_k  F_k
Σ       n    1
For the ordinal scale variable "satisfaction with a nursing robot," we can construct the following frequency table based on the original list (vg = very good, g = good, s = satisfactory, and sf = sufficient):

j  a_j           n_j  f_j    F_j
1  very good      1   1/24   1/24
2  good          12   12/24  13/24
3  satisfactory   9   9/24   22/24
4  sufficient     2   2/24   1
Σ                24   1
We can see that 12 patients (half of the respondents) had a good impression of the nursing robots. This represents a proportion of 12/24, or 50%. Let's now look at the cumulative frequencies: In the first row of the table, we find the number of patients who rated the nursing robots "very good," i.e., only one patient. This makes a relative and cumulative proportion of 1/24, or 4.2%. In the cumulative frequency of the second row, the patients who found the robots "very good" or "good" are accounted for. Thus, the sum 1/24 + 12/24 = 13/24 comes up. Approximately 54.2% of the interviewed patients rated the nursing robots at least "good."
In the last row of the cumulative frequency column, we always arrive at 1 or 100%. It should be noted that the frequency table could have started with the worst category ("sufficient"). In most cases, however, one starts with the best expression.
Discrete cardinal variables

The frequency table for discrete cardinal variables takes the same form as the table for ordinal scale variables. Here, too, cumulative frequencies can be calculated, since the nature of the values in the form of numbers allows sorting from the smallest to the largest number. This is exactly the order that is always chosen for cardinal scale variables: One starts with the smallest value and ends with the largest one. The shape of the frequency table repeats that for ordinal scale variables.
j  a_j  n_j  f_j  F_j
1  a_1  n_1  f_1  F_1
2  a_2  n_2  f_2  F_2
⋮  ⋮    ⋮    ⋮    ⋮
k  a_k  n_k  f_k  F_k
Σ       n    1
For the discrete cardinal variable "previous contact," the original list of the 25 patients reads:

1; 5; 0; 0; 1; 1; 2; 1; 1; 1; 1; 1; 2; 1; 0; 3; 1; 1; 0; 3; 0; 1; 1; 2; 1

This yields the following frequency table:

j  a_j  n_j  f_j   F_j
1   0    5   0.2   0.2
2   1   14   0.56  0.76
3   2    3   0.12  0.88
4   3    2   0.08  0.96
5   5    1   0.04  1
Σ       25   1
It can be seen here that the answer "4" previous contacts with such nursing robots does not occur among the answers. For this reason, the frequency table goes directly from "3" to "5." Aside from this, the frequency table can be understood in the same way as the previous one. For example, let's look at the second row of the frequency table: 14 patients had contact with similar nursing robots once before their current hospital stay. This accounts for 14/25 = 0.56, or 56%, whereas 76% (20% + 56%) had such contact at most once.
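A short sketch (illustrative only, not part of the course book) of how the f_j and F_j columns can be computed for the previous-contact data:

```python
from collections import Counter
from itertools import accumulate

contacts = [1, 5, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 0, 3, 1, 1, 0, 3, 0, 1, 1, 2, 1]

n = len(contacts)
counts = Counter(contacts)
values = sorted(counts)                       # a_j sorted from smallest to largest
relative = [counts[a] / n for a in values]    # f_j
cumulative = list(accumulate(relative))       # F_j

for a, f, F in zip(values, relative, cumulative):
    print(a, counts[a], round(f, 2), round(F, 2))
# e.g., value 1: n_j = 14, f_j = 0.56, F_j = 0.76
```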
Continuous cardinal variables

The frequency table for continuous cardinal variables differs from that for discrete variables in the second column. Where individual values were previously listed in each row, there are now classes of values. This means that several individual values are summarized in each class. The first column no longer numbers the individual values but rather the individual classes from 1 to k, so each class is identified by its class number j. Each class is characterized by a lower and an upper limit.

Let the lower limit of a class j be denoted by x*_{j−1}, where * indicates a class boundary and the index j − 1 marks the lower boundary. The upper bound of a class j is denoted by x*_j. If there are k classes in general, then the lower and upper bounds of these classes are as follows:

j = 1: x*_0; x*_1
j = 2: x*_1; x*_2
…
j = k: x*_{k−1}; x*_k
For a concrete variable, these generally formulated class limits are replaced by concrete numbers. We see, for example, that the upper limit of the first class, x*_1, is equal to the lower limit of the second class. If, for age, x*_1 = 20 years, then the first class ends at 20 and the second starts at 20.

However, the question of which class a 20-year-old person would be classified into remains open. This problem is solved by placing different brackets at the class boundaries. At the upper limit of a class, a square bracket "]" is placed. This indicates that this upper limit belongs to the corresponding class. The lower boundary of a class, except for the first lower boundary, is provided with a round bracket "(". This means that the class starts at the next larger number than the lower limit itself. Referring to the example above, a 20-year-old person would be sorted into the first class. A person who is even marginally older than 20 (say, 20.0001 years old) would then be assigned to the second class. Overall, the structure of the frequency table is then as follows:
j  x*_{j−1}, x*_j  n_j  f_j  F_j
1  x*_0, x*_1      n_1  f_1  F_1
2  x*_1, x*_2      n_2  f_2  F_2
⋮  ⋮               ⋮    ⋮    ⋮
k  x*_{k−1}, x*_k  n_k  f_k  F_k
Σ                  n    1
Each row of a frequency table now represents a class of summarized variable values. This means that the absolute frequencies summarize the number of people who fall into this interval. Accordingly, the relative frequency indicates the proportion of those who belong to this interval. The cumulative frequency gives the proportion of people whose value is at most the upper limit of the class.
Finally, the way the class boundaries should be chosen must be clarified. In principle, the class division should be chosen so that each class contains a reasonable share of the observations. This may result in all classes having the same width. However, the classes can also have different widths. The latter variant often makes sense if there are only a few very small and/or very large values. In the ranges with few observations, wider classes are then formed so that these few observations can be grouped into one class. If one works with statistics programs, the values can be divided into classes automatically.
RUNNING EXAMPLE: SURVEY ON CARE ROBOTS
As an example, we will now work with the following class division of age: [15; 30], (30; 45], (45; 50], and (50; 70]. We see that the classes have different widths. The first two classes each cover an age range of 15 years. The third class includes a range of only 5 years, and the last class a wider range of 20 years. Consider the original list:

16; 50; 35; 47; 15; 20; 47; 48; 44; 55; 56; 35; 48; 52; 49; 68; 17; 26; 39

Counting the observations in each class yields the following frequency table:

j  x*_{j−1}, x*_j  n_j  f_j    F_j
1  [15; 30]         5   5/19    5/19
2  (30; 45]         4   4/19    9/19
3  (45; 50]         6   6/19   15/19
4  (50; 70]         4   4/19    1
Σ                  19   1
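The class counts above can be reproduced with a small sketch (illustrative only; the class boundaries are hard-coded from the example, and only the first class also contains its lower boundary):

```python
ages = [16, 50, 35, 47, 15, 20, 47, 48, 44, 55, 56, 35, 48, 52, 49, 68, 17, 26, 39]

# Classes (lower, upper]; only the first class also includes its lower boundary.
classes = [(15, 30), (30, 45), (45, 50), (50, 70)]

n = len(ages)
cumulative = 0.0
for j, (lower, upper) in enumerate(classes, start=1):
    if j == 1:
        n_j = sum(1 for x in ages if lower <= x <= upper)
    else:
        n_j = sum(1 for x in ages if lower < x <= upper)
    f_j = n_j / n
    cumulative += f_j
    print(j, (lower, upper), n_j, round(f_j, 3), round(cumulative, 3))
```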
Graphical Representation
For nominal scale variables and their relative frequencies, there are three possibilities for
graphical representation:
• pie chart
• bar chart
• Pareto chart
In a pie chart, the individual values are given a specific area in the circle according to their share in the sample. Therefore, while drawing the circle diagram, it is important to determine which area the individual variable values occupy in the circle. Since a circle has a total angle of 360°, the individual angles α_j (alpha) for the values a_j are determined by

α_j = f_j · 360° for j = 1, …, k

Pie chart: a pie chart shows the frequency distribution of a variable.

The relative frequency is, therefore, multiplied by 360° to obtain the corresponding angle. In this case, the following angles are obtained with the help of the frequency table: α_male = 0.2 · 360° = 72° and α_female = 0.8 · 360° = 288°.
Figure 2: Pie Chart for Gender
We can see that most of the 25 patients are female, since this variable expression occupies
the largest area in the circle.
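As a small aside, the conversion of relative frequencies into pie chart angles is a one-line computation. The following sketch (not from the course book) reproduces the 72° and 288° for gender and could be reused for any frequency table:

```python
relative_frequencies = {"male": 0.2, "female": 0.8}

# Each value receives f_j * 360 degrees of the circle.
angles = {a: f * 360 for a, f in relative_frequencies.items()}
print(angles)  # {'male': 72.0, 'female': 288.0}
```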
In a bar chart, which is also known as a column chart, the variable values are plotted on the x-axis and the relative frequencies on the y-axis. Finally, a bar is drawn over each value with a height equal to its relative frequency.

Bar chart: a bar chart displays the frequency distribution of a variable in the form of bars or rods.

Using the data from our example, we obtain the following bar chart. Of course, the same result can be read from the bar chart as from the pie chart.
A Pareto chart is a special form of the bar chart. It simply arranges the bars in the chart according to the height of their relative frequencies. This arrangement can be ascending or descending. For the present variable of gender, the Pareto chart is not particularly useful, as there are only two values.

Pareto chart: a Pareto chart orders the values according to the size of their occurrence.

For ordinal scale variables, two types of charts are used:

• pie charts
• bar charts

A Pareto chart is not used for this scale level, since the meaningful order of the variable values should not be changed. To create the pie chart, the angles in the circle for the individual variable values must first be determined. This is done in the same way as in the nominal scale example above.
This results in the following pie chart (α_1 = 15°, α_2 = 180°, α_3 = 135°, α_4 = 30°).

As we can see, most of the inpatients rated their satisfaction with the nursing robots "good" or "satisfactory," as these answers occupy the largest areas in the chart.
The construction of the bar chart is also identical to that of nominal scale variables. Consequently, we obtain the following diagram.
Figure 5: Bar Chart for Satisfaction
This bar chart also gives us the same insights, given that the two highest bars are for
“good” and “satisfactory,” respectively.
For the graphical representation of a discrete cardinal variable, only the bar chart is used.
This is done as follows for the variable “previous contact.” As a rule, a pie chart is not used
if the values are numbers.
Figure 6: Bar Chart for Previous Contact
Most of the patients have had contact with similar nursing robots once before, followed by
those who have had no such contact at all. Very few patients have had contact with similar
nursing robots twice or even more.
A histogram is used only for continuous variables. As we will see, the histogram is a completely different kind of diagram from those we have seen for the previous scale levels (Fahrmeir et al., 2016, p. 38). The reason for this is the class formation. The histogram ensures that the individual classes can be compared with each other even though they are often (as in the present example) of different widths.

Histogram: a histogram is drawn only for continuous variables.

For each class, a so-called density is calculated:

f(x) = f_j / Δ_j for all j = 1, …, k

where Δ_j (delta) is the class width. The histogram is finally created by carrying out this calculation of the densities for all classes.

To draw a histogram, the densities are calculated in the first step. Next, a rectangle is drawn over each class at the height of the calculated density. The area of each rectangle is characterized by the fact that it reflects the relative frequency f_j of the corresponding class. Thus, the total area under the histogram must be 1.
j  x*_{j−1}, x*_j  n_j  f_j    F_j    f(x)
1  [15; 30]         5   5/19    5/19  (5/19)/15 ≈ 0.018
2  (30; 45]         4   4/19    9/19  (4/19)/15 ≈ 0.014
3  (45; 50]         6   6/19   15/19  (6/19)/5 ≈ 0.063
4  (50; 70]         4   4/19    1     (4/19)/20 ≈ 0.011
Σ                  19   1
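The density column can be checked with a short sketch (illustrative only; the class boundaries and counts are hard-coded from the age example):

```python
# (lower, upper, absolute frequency) for each age class from the example.
age_classes = [(15, 30, 5), (30, 45, 4), (45, 50, 6), (50, 70, 4)]
n = sum(n_j for _, _, n_j in age_classes)  # 19 patients with an age response

for lower, upper, n_j in age_classes:
    width = upper - lower          # class width Δ_j
    f_j = n_j / n                  # relative frequency
    density = f_j / width          # height of the histogram bar
    area = density * width         # equals f_j again: the areas sum to 1
    print(f"({lower}, {upper}]: density = {density:.3f}, area = {area:.3f}")
```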
Figure 7: Histogram for Age
As we can see, most of the 19 patients are older than 45 and at most 50 years old. The absolute frequency is highest in this class (6), but such a clear difference from the other classes only becomes apparent in the histogram. Due to the small width of this class in relation to the other classes, a relatively high density results. Now, let's consider this class and clarify once again why the area of this rectangle is equal to the relative frequency. The area of a rectangle is obtained by multiplying the width and the height. In this case, the width is 5 and the height is 0.063 according to the density. With 5 · 0.063, we get 0.315, or approximately 6/19.
Location Parameters

Table 11: Location Parameters and Their Possible Applications

Scale of measurement            Mode  Quantiles  Mean value
Nominal                          ✓     -          -
Ordinal                          ✓     (✓)        -
Cardinal (discrete/continuous)   ✓     ✓          ✓

While the mode can be determined for any scale level, quantiles and the mean place certain requirements on the scale level. In the following sections, we will present the individual measures. It will be shown what the individual measures mean and how they can be determined and interpreted.
Mode

The mode, which is also known as the modal value, is the value of a variable that occurs most frequently in the sample (Bamberg et al., 2022, p. 16). The mathematical abbreviation of the mode is

x_mod.

Mode: the mode represents the most frequent variable value.

A variable can have one mode or several modes. If there is only one mode, it is called a unimodal distribution. A distribution with two modes is called bimodal, and one with more than two modes is called multimodal. We will discuss the determination of the mode or modes for the different scale levels.
Nominal scale variables

The mode can be determined using the frequency table. To do this, look at the row with the highest absolute or relative frequency and read off the corresponding value. Alternatively, you can look at the pie chart or the bar chart to determine the mode. The value with the largest area in the circle or the largest bar represents the mode. Recall the frequency table, pie chart, and bar chart for the gender of our 25 patients. The value "female" is the most frequently represented one among the patients. Consequently, x_mod = female.

Ordinal scale variables

For the ordinal scale variable of satisfaction with the nursing robots, let's look back at the table and two graphs. The table and the graphs all show that the value "good" was the one most frequently chosen by the patients. Accordingly, x_mod = good.
Discrete cardinal variables

The analysis of the number of previous contacts with similar nursing robots showed that a single previous contact was mentioned most frequently (by 14 patients, a share of 56%). Therefore, x_mod = 1.
Continuous cardinal variables

Only for continuous variables must we determine the mode in a slightly different way. Because of the different widths of the classes, we cannot simply fall back on the absolute or relative frequencies; we should determine the mode on the basis of the densities. The class with the greatest density is the modal class. The frequency table for age including densities shows that the density is greatest in the third class at 0.063. Consequently, the mode is x_mod = (45; 50]. The mode is, therefore, a whole class in this case.

With the help of the histogram, we would also obtain this result because the rectangle over the class 45 to 50 is the highest compared to the other classes. In some literature, the mode for a continuous variable is also given in the form of a single number. In that case, the middle of the class with the highest density is given as the mode. The middle between 45 and 50 would be (45 + 50)/2 = 47.5.
Quantiles
Another measure of the position of a variable is a quantile. A quantile is a variable value that is not exceeded by a certain proportion of objects (Fahrmeir et al., 2016, p. 60). This proportion can be chosen arbitrarily. Mathematically, one generally notates a quantile by

x_p.

It indicates the variable value that is not exceeded by p · 100 percent of the variable carriers. The remaining (1 − p) · 100 percent of the values are, therefore, at least as large as the quantile.

Quantile: a quantile is determined by a value that is not exceeded by p · 100 percent of the trait carriers.
There are three quantiles that are considered particularly important in statistics. One is the median x_0.5. This lies exactly in the middle of the ordered data set: 50% of the variable values are at most as large as x_0.5. Accordingly, 50% of the variable values are at least as large as x_0.5. Two other important quantiles are the quartile x_0.25 (i.e., the lower quartile: the value not exceeded by 25% of the objects) and the quartile x_0.75 (i.e., the upper quartile: the value not exceeded by 75% of the objects). Together with the median, these divide the sorted data set into four equal sections.

Median: the median is the most important quantile and forms the center of the ordered data set.
Quartiles: quartiles, together with the median, provide four equally sized ranges in the ordered data set.
In principle, quantiles cannot be determined for nominal scale variables because their calculation requires sorting of the observations. As we will see later, calculating them for ordinal scale variables fails in some places. For cardinal scale variables, whether discrete or continuous, quantiles can be computed in any case. We can determine the quantiles either based on a primal list or a frequency table. Let's start with the determination from a primal list.
The determination of quantiles from a primal list is done in the same way for all possible scale levels. We start from the primal list

x_1, x_2, …, x_n,

which contains, in an unsorted order, all values of the n variable carriers of a sample. This list must be sorted in the first step from small to large. Thus, the following ordered data set is created:

x_(1), x_(2), …, x_(n)

x_(1) stands for the smallest and x_(n) for the largest observation. The individual quantiles can now be determined according to a general procedure, independent of p. The aim is to find the position in the ordered data set at which the sought-after quantile is located:

x_p = x_(⌈n·p⌉)                    if n·p is not an integer
x_p = (x_(n·p) + x_(n·p + 1)) / 2  if n·p is an integer

Here, ⌈n·p⌉ denotes n·p rounded up to the next integer.
In the first step, depending on the quantile x_p to be calculated, n · p is calculated with the help of the sample size. Let's assume that we are looking for the quantile x_0.4 and n = 11. This would mean that n · p = 11 · 0.4 = 4.4. This is not an integer. Accordingly, following the first line of the above formula, the next integer after 4.4 is used as the position of the quantile we are looking for. Thus, the quantile x_0.4 is at the fifth position in an ordered data set with 11 observations.
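For numeric data, the rule above translates directly into a short function (a sketch, not the course book's own notation); math.ceil plays the role of "the next integer."

```python
import math

def quantile(values, p):
    """Quantile x_p of a primal list, following the rule described in the text."""
    data = sorted(values)              # ordered data set x_(1), ..., x_(n)
    n = len(data)
    position = n * p
    if position != int(position):      # n*p is not an integer: take the next position up
        return data[math.ceil(position) - 1]
    k = int(position)                  # n*p is an integer: average two neighbours
    return (data[k - 1] + data[k]) / 2

contacts = [1, 5, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 0, 3, 1, 1, 0, 3, 0, 1, 1, 2, 1]
print(quantile(contacts, 0.25), quantile(contacts, 0.5), quantile(contacts, 0.75))  # 1 1 1
```

Applied to the 19 age values from the survey, the same function reproduces the quartiles 26, 47, and 50 that are derived later in this section.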
If we have a frequency table, we must distinguish which scale level is available for the determination of quantiles. So, let's look at our individual variables according to the scale of measurement and determine the most important quantiles, namely the median as well as the two quartiles, both from a primal list and from the corresponding frequency table. As already mentioned, a determination for our nominal scale variable is not possible.

For the variable "satisfaction with nursing robots," we take the unsorted values from the satisfaction column of the patient survey (n = 24 patients). We first put these into an ordered form, starting with the best score and ending with the worst one:
vg; g; g; g; g; g; g; g; g; g; g; g; g; s; s; s; s; s; s; s; s; s; sf; sf
To determine the median for "satisfaction with nursing robots," we multiply the number of patients, 24, by 0.5, which gives us 12. Since 12 is an integer, we take the values at the 12th and the following 13th position of the sorted data set and determine the median as the average of these two:

24 · 0.5 = 12 (integer)
x_0.5 = (x_(12) + x_(13)) / 2 = (good + good) / 2 = good

Here, it has already become recognizable why the determination of quantiles for ordinal scale variables is quite questionable: No average can be formed from two words. However, it seems possible at this point, since both positions contain the value "good." Overall, the result means that 50% of the patients rated their satisfaction with the nursing robots as at most "good." The remaining 50% rated it as at least "good."
According to this procedure, both the lower and the upper quartile can be determined:

24 · 0.25 = 6 (integer)
x_0.25 = (x_(6) + x_(7)) / 2 = (good + good) / 2 = good

24 · 0.75 = 18 (integer)
x_0.75 = (x_(18) + x_(19)) / 2 = (satisfactory + satisfactory) / 2 = satisfactory

25% of the respondents rated their satisfaction with the nursing robots as at most "good," while 75% rated it as at most "satisfactory."
If we had a frequency table available, the determination of quantiles would be much easier. For any quantile x_p, we have to check in the column of cumulative frequencies where the cumulative frequency exceeds p for the first time.

j  a_j           n_j  f_j    F_j
1  very good      1   1/24   1/24
2  good          12   12/24  13/24
3  satisfactory   9   9/24   22/24
4  sufficient     2   2/24   1
Σ                24   1
For the median, for example, we go through the cumulative frequencies. In the first row, the cumulative frequency of 1/24 is still less than 0.5. In the second row, it is greater than 0.5 for the first time, with 13/24 = 0.542, whereupon the value in the second row equals the median. The procedure for the two quartiles is analogous:

x_0.5 = good, since 13/24 > 0.5; x_0.25 = good, since 13/24 > 0.25
x_0.75 = satisfactory, since 22/24 > 0.75
The interpretation is, of course, identical to that for the original list. It should also be mentioned that, according to the same principle, all other quantiles can be determined both from the original list and from the frequency table.

All 25 patients provided information on how many times they have had contact with similar nursing robots before. The following original list

1; 5; 0; 0; 1; 1; 2; 1; 1; 1; 1; 1; 2; 1; 0; 3; 1; 1; 0; 3; 0; 1; 1; 2; 1

is first sorted from small to large for the determination of the three most important quantiles:

0; 0; 0; 0; 0; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 2; 2; 2; 3; 3; 5
These can now be determined in the same way as for ordinal scale variables. What is different here is that, for example, in the case of the median, multiplying 0.5 by 25 results in the number 12.5, which is not an integer. We therefore take the next largest integer (13) and read off the median at the 13th position:

25 · 0.5 = 12.5 (not an integer)
x_0.5 = x_(⌈12.5⌉) = x_(13) = 1

For the two quartiles, the calculation of the correct position is done in the same way:

25 · 0.25 = 6.25 (not an integer)
x_0.25 = x_(⌈6.25⌉) = x_(7) = 1

25 · 0.75 = 18.75 (not an integer)
x_0.75 = x_(⌈18.75⌉) = x_(19) = 1
What is striking here is that all three quantiles take on the same value. This always happens when one value occurs particularly frequently in the data set. For the determination from the frequency table, we proceed in the same way as for ordinal scale variables.
Table 13: Frequency Table for the Previous Contact (2)
j  a_j  n_j  f_j   F_j
1   0    5   0.2   0.2
2   1   14   0.56  0.76
3   2    3   0.12  0.88
4   3    2   0.08  0.96
5   5    1   0.04  1
Σ       25   1
This gives us the following results, which are identical to those from the original list:

x_0.5 = 1, since 0.76 > 0.5; x_0.25 = 1, since 0.76 > 0.25; x_0.75 = 1, since 0.76 > 0.75
For the continuous variable age, the original list of the 19 responses

16; 50; 35; 47; 15; 20; 47; 48; 44; 55; 56; 35; 48; 52; 49; 68; 17; 26; 39

is again sorted from small to large:

15; 16; 17; 20; 26; 35; 35; 39; 44; 47; 47; 48; 48; 49; 50; 52; 55; 56; 68
Based on this sorted data set, the determination of the quantiles is done as already known:

19 · 0.5 = 9.5 (not an integer)
x_0.5 = x_(⌈9.5⌉) = x_(10) = 47

19 · 0.25 = 4.75 (not an integer)
x_0.25 = x_(⌈4.75⌉) = x_(5) = 26

19 · 0.75 = 14.25 (not an integer)
x_0.75 = x_(⌈14.25⌉) = x_(15) = 50

Therefore, 25% of the patients are 26 years old or less, 50% are at most 47 years old, and 75% are not older than 50.
The procedure for determining quantiles for continuous variables is completely different if only a frequency table with classes is available. If we want to determine an arbitrary quantile x_p, we first look for the row in which the cumulative frequency is greater than p for the first time. Once the row is found, we know in which class the quantile we are looking for must be located. Then, the lower limit x*_{j−1} of the class found, the relative frequency f_j of this same class, the width of the class Δ_j, the cumulative frequency of the previous classes F(x*_{j−1}), and the cumulative proportion p given by the quantile are used to calculate the quantile (Fahrmeir et al., 2016, p. 55):

x_p = x*_{j−1} + ((p − F(x*_{j−1})) / f_j) · Δ_j

With this in mind, let's look at the determination based on the following frequency table.
j  x*_{j−1}, x*_j  n_j  f_j    F_j
1  [15; 30]         5   5/19    5/19
2  (30; 45]         4   4/19    9/19
3  (45; 50]         6   6/19   15/19
4  (50; 70]         4   4/19    1
Σ                  19   1
Now, let’s take a closer look at the median x0,5: Only in the third row the cumulative fre-
quency is greater than 0.5 for the first time; so, we already know that our result must lie
somewhere between 45 and 50. For the calculation, we start with 45 as the lower limit.
Next, we add a certain fraction. The numerator of the fraction starts with the proportion
given by the median in the amount of 0.5, from which the cumulative frequency of the
previous class in the amount of 9/19 is subtracted. The denominator is the relative
frequency of the relevant class, which is 6/19 here. We multiply this fraction by the width
of the class in the amount of 5 (50 − 45 = 5):
x0.5 = 45 + ((0.5 − 9/19) / (6/19)) · 5 = 45.42
This gives us a median of 45.42. We proceed in the same way to determine the two
quartiles. It is worth mentioning that the lower quartile falls directly into the first class,
since 5/19 is already greater than 0.25. So, when the cumulative frequency of the previous
class has to be subtracted in the numerator of the fraction, the result is simply 0 (i.e., there
is no one younger than 15 years old). We get the following results for the two quartiles:
x0.25 = 15 + ((0.25 − 0) / (5/19)) · 15 = 29.25
x0.75 = 45 + ((0.75 − 9/19) / (6/19)) · 5 = 49.375
The interpretation of the content is the same as in the original list. However, the results
are not identical with those from the original list. This is common in the context of contin-
uous features. By forming classes, the original observations are no longer available. We
only know how many persons are in each class. The exact age of each person is no longer
considered. We only get the exact results from the original list. If this is available, it is
always advisable to use it for the calculation of individual measures.
Mean Value
The classic and probably best-known measure for describing a characteristic is the mean
value, which is also called the average value or arithmetic mean (note that x̄ is read as
"x bar"). The mean value indicates which variable value is assumed on average by the
objects (Fahrmeir et al., 2016, p. 50); it summarizes all the data of a variable in a single
value. The determination of x̄ is possible for the cardinal scale without exception. It can
again be calculated from both an original list and a frequency table. We go through both
variants for discrete and continuous variables.
Let’s start by determining the average previous contact with nursing robots based on the
primal list. To do this, we use the following formula:
x̄ = (1/n) · ∑ xi (summing over i = 1, …, n)
Only the variable values of all n objects are summed up and divided by n. For our
25 patients with the data
1; 5; 0; 0; 1; 1; 2; 1; 1; 1; 1; 1; 2; 1; 0; 3; 1; 1; 0; 3; 0; 1; 1; 2; 1,
x̄ = (1/25) · (1 + 5 + … + 2 + 1) = 1.24
Therefore, 1.24 patients have been in contact with similar nursing robots before on aver-
age. If we do not have the original list but the frequency table instead, the mean value can
be calculated using either the absolute or relative frequencies. Both variants must lead to
the same result. Of course, it is sufficient to calculate only one:
x̄ = (1/n) · ∑ aj · nj = ∑ aj · fj (summing over j = 1, …, k)
If the individual values aj are multiplied by the number of their occurrences nj, then the sum
must be divided by n. If the values are weighted directly with their proportions fj, then the
division by the sample size is omitted. So, let's look at this for our example.
j aj nj fj Fj
1 0 5 0.2 0.2
2 1 14 0.56 0.76
3 2 3 0.12 0.88
4 3 2 0.08 0.96
5 5 1 0.04 1
Σ 25 1
The average previous contact with nursing robots also leads to a result of 1.24 based on
the frequency table:
x̄ = (1/25) · (0 · 5 + 1 · 14 + … + 5 · 1) = 1.24
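As a quick cross-check, the following Python sketch computes the mean once from the original list and once from the frequency table, using either the absolute or the relative frequencies; the variable names are illustrative only.

contacts = [1, 5, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2,
            1, 0, 3, 1, 1, 0, 3, 0, 1, 1, 2, 1]

# Mean from the original list
mean_list = sum(contacts) / len(contacts)                # -> 1.24

# Mean from the frequency table: values a_j with absolute frequencies n_j
table = [(0, 5), (1, 14), (2, 3), (3, 2), (5, 1)]
n = sum(nj for _, nj in table)
mean_abs = sum(aj * nj for aj, nj in table) / n          # -> 1.24
mean_rel = sum(aj * (nj / n) for aj, nj in table)        # -> 1.24
print(mean_list, mean_abs, mean_rel)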
For the age of the 19 patients with the original list
16; 50; 35; 47; 15; 20; 47; 48; 44; 55; 56; 35; 48; 52; 49; 68; 17; 26; 39
the average age can be calculated in the same way as for the discrete variable:
x̄ = (1/19) · (16 + 50 + … + 26 + 39) = 40.368
Therefore, the patients surveyed are 40.368 years old on average. If we have a frequency
table, we can work again with either the absolute or relative frequencies. What is impor-
tant and different, however, is that these are now multiplied by the respective class mean.
x̄ = (1/n) · ∑ mj · nj = ∑ mj · fj with mj = (x*j−1 + x*j) / 2 (summing over j = 1, …, k)
If we look at the frequency table, we can determine, for example, the middle of the first
class from 15 to 30 by taking the average of these two limits: (15 + 30) / 2 = 22.5.
j    x*j−1 ; x*j    nj    fj      Fj
1    [15; 30]       5     5/19    5/19
2    (30; 45]       4     4/19    9/19
3    (45; 50]       6     6/19    15/19
4    (50; 70]       4     4/19    1
Σ                   19    1
With this knowledge, we arrive at the following average age based on the table:
x̄ = (1/19) · (22.5 · 5 + 37.5 · 4 + 47.5 · 6 + 60 · 4)
  = 22.5 · (5/19) + … + 60 · (4/19) = 41.447
As with the quantiles, the average value just calculated differs from that from the original
list. We already know the reason for this. Here, too, it is recommended to use the original
list for the calculation of the average value if it is available.
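A corresponding Python sketch for the grouped case, again assuming the class limits as shown in the frequency table above (the last class running from 50 to 70):

# Mean of a continuous variable from a frequency table: use the class midpoints m_j
age_classes = [(15, 30, 5), (30, 45, 4), (45, 50, 6), (50, 70, 4)]
n = sum(nj for _, _, nj in age_classes)
mean_grouped = sum(((lo + hi) / 2) * nj for lo, hi, nj in age_classes) / n
print(round(mean_grouped, 3))   # -> 41.447, slightly above the 40.368 from the original list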
Finally, let’s look at a special property of the mean value. We will now consider the original
list of ages again. We have calculated an average age of x̄ = 40.368 for these 19 individu-
als. Imagine we want to study this group of patients again two years later. All 19 individu-
als have, thus, become two years older. We can note that xi + 2 holds for all i = 1, …, n.
What does this do to the mean? If everyone has become two years older, the original list is
updated as follows:
18; 52; 37; 49; 17; 22; 49; 50; 46; 57; 58; 37; 50; 54; 51; 70; 19; 28; 41
x̄ = (1/19) · (18 + 52 + … + 28 + 41) = 42.368
The average age has, therefore, increased by two years compared with the previous aver-
age age. This is not a coincidence but the rule: If all values of a sample are increased or
decreased by a certain value, the mean value also increases or decreases by this value.
Suppose that all ages were doubled (xi · 2 for all i = 1, …, n). So, the first person is
32 years old rather than 16. The average value would double, too. Overall, if the original
data are transformed linearly (i.e., a value a is added or subtracted and/or the data are
multiplied by a value b) such that new data yi are created, then the aver-
age value of the new data changes by the exact amount of the addition, subtraction, or
multiplication:
yi = a + b · xi    ȳ = a + b · x̄
Referring to the examples above, a would represent the fact that all participants are now
two years older (a = 2). In contrast, b would stand for the doubling of the age (b = 2) and,
thus, the doubling of the mean value.
We will now take a closer look at the mean and the median. Often a question arises about
which of them is better suited to describe the location of a variable. This critically depends
on the data set.
Imagine we are looking at the ages of 20 people. Suppose that 19 of them are between 30
and 35 years old and only one person is 63. The median does not consider the 63 year old
person. It focuses on the middle of the data set, i.e., somewhere between 30 and 35. Thus,
it is generally considered robust. The mean, conversely, takes all ages into account when
summing them up. Thus, the 63 year old person will cause the mean to be pulled up and
to be even higher than 35. Therefore, we would have an average age that is higher than
that of 19 respondents (out of a total of 20). Accordingly, the mean is very sensitive to
outliers (an outlier is a measurement value that does not fit into an expected series of
measurements or generally does not meet expectations).
We can state that the mean is always well suited when the data set is not affected by
extreme outliers. The median is not affected by such situations. Even if each of the two
measures is not always well suited on its own, the comparison of the two helps enormously
to assess the distribution of a variable from a statistical point of view. Thus, we distinguish
between symmetrical and asymmetrical distributions (Fahrmeir et al., 2016, p. 56):
x̄ ≈ x0.5 : symmetrical distribution
x̄ > x0.5 : right skewed distribution
x̄ < x0.5 : left skewed distribution
A symmetrical distribution is always present when the mean and median are approxi-
mately equal. Graphically, this shows up in a bar chart or histogram in such a way that we
can see an even slope on both sides of the highest bar.
A skewed distribution can occur in two ways. If the mean is greater than the median, we
are dealing with a right skewed distribution. Consequently, the bars decrease towards the
right. A left skewed distribution, on the other hand, is when the mean is smaller than the
median. The distribution then decreases to the left in the diagram. Sometimes, a distribu-
tion cannot be sorted directly into one of the three categories, for example, when the two
largest bars are of approximately the same size.
Since mean values can only be calculated for cardinal scale variables, this division into
symmetry and asymmetry also only applies to that scale level.
2.3 Measures of Dispersion
In this section, we discuss the measures of dispersion in the context of univariate data
analysis. The aim here is to find out whether the surveyed objects are similar with respect
to a variable or whether they differ from each other. If, in the context of an age survey, all
participants are very similar in age, there will be only a small amount of dispersion. If,
however, they differ very much in age, the dispersion will be correspondingly larger. Meas-
ures of dispersion can only be determined for cardinal scale variables both from an origi-
nal list and a frequency table. We now explain the measures of dispersion range and inter-
quartile range as well as the two most important ones of sample variance and standard
deviation.
Range
The range R is probably the simplest measure to describe the dispersion of a variable
(Bamberg et al., 2022, p. 20); it shows the distance from the smallest to the largest
expression. We only have to subtract the smallest expression x(1) from the largest
expression x(n):
R = x(n) − x(1)
For the number of previous contacts with similar nursing robots, we find the largest obser-
vation on the far right and the smallest on the left in the sorted data set:
0; 0; 0; 0; 0; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 2; 2; 2; 3; 3; 5
R=5−0=5
Those with the most contact had five more contacts than those with the least contact. The
result could also be read from the frequency table (i.e., value in the last line - value in the
first line).
It is the same for the age of the respondents. For the sorted data set
15; 16; 17; 20; 26; 35; 35; 39; 44; 47; 47; 48; 48; 49; 50; 52; 55; 56; 68
we obtain
R = 68 − 15 = 53.
There is a 53-year difference between the youngest and oldest person. Note that it may
not be advisable here to determine the range from the frequency table, as the smallest
class lower limit and the largest class upper limit do not necessarily have to correspond to
the smallest and largest expressions.
Interquartile Range
The second measure of dispersion, the interquartile range IQR, is more suitable in the case
of existing outliers; it shows the distance covered by the central 50% of the objects. It
focuses on the variation in the central 50% of observations (Fahrmeir et al., 2016, p. 61).
Sorting the data from small to large, the interquartile range marks the distance from the
lower quartile to the upper quartile:
IQR = x0.75 − x0.25
To calculate the interquartile range, it is, therefore, necessary to calculate the two
quartiles. Since we have already discussed this in the previous section, we only explain the
calculation of the IQR briefly.
Regarding previous contact with nursing robots, we determined a value of 1 for both the
lower and upper quartiles (original list identical to frequency table). The interquartile
range is, therefore, calculated as follows:
IQR = 1 − 1 = 0 .
The central 50% of persons do not differ from each other in any way in their frequency of
previous contact. There is no dispersion among them.
For the age of the patients, we determined quartiles of x0.25 = 26 and x0.75 = 50 from the
original list, so that
IQR = 50 − 26 = 24
The central 50% of the patients sorted by age, therefore, differ from each other by a maxi-
mum of 24 years. Using the results of the frequency table, we arrive at a result of
IQR = 49.375 − 29.25 = 20.125. Here, too, the results based on the original list should
be preferred.
Sample Variance and Standard Deviation
The sample variance s² and the standard deviation s are probably the most important
measures to describe dispersion (Fahrmeir et al., 2016, p. 65). The standard deviation
indicates the average deviation from the mean, and the sample variance is required to
obtain the standard deviation. Compared to the other two measures, they involve every
single observation in the calculation. They check how far each individual observation is
from the mean. In the first step, the sample variance is always calculated by
s² = (1/(n − 1)) · ∑ (xi − x̄)² = (n/(n − 1)) · (x2 − x̄²),
where the sums run over i = 1, …, n, x̄² is the squared mean, and x2 denotes the mean of
the squared observations.
The first variant to the right of the first equal sign shows exactly what has just been descri-
bed. The mean value x̄ is subtracted from each observation xi. This difference is squared.
After this has been done for all observations and these squared differences have been
summed up, we divide the result by n − 1 (to know the reason why it is divided by n − 1,
please refer to Fahrmeir et al., 2016, p. 65).
An alternative calculation form is offered to the right of the second equal sign. For this var-
iant, only two mean values must be calculated. First, the simple mean, which is finally
squared: x̄². Second, the mean of all squared observations must be calculated using
x2 = (1/n) · ∑ xi² (summing over i = 1, …, n). Each individual observation is squared for
this purpose, and the mean is finally calculated from the squared observations.
The second variant is highly recommended for computation by hand, since it is considered
to be less prone to error. For this reason, we restrict ourselves to exactly this in the follow-
ing explanations. The sample variance cannot be interpreted due to squaring. For this rea-
son, the root of the sample variance is taken to get the standard deviation:
s = √s²
This can be interpreted very well in terms of content in conjunction with the mean: The
average of the characteristic is x̄ ± s. Let's now explain the calculation for the two cardinal
scale variables from both the original list and the frequency table.
For the previous contact with nursing robots with the original list
1; 5; 0; 0; 1; 1; 2; 1; 1; 1; 1; 1; 2; 1; 0; 3; 1; 1; 0; 3; 0; 1; 1; 2; 1
we obtain
x̄ = (1/25) · (1 + 5 + … + 2 + 1) = 1.24
x2 = (1/25) · (1² + 5² + … + 2² + 1²) = 2.76
s² = (25/24) · (2.76 − 1.24²) = 1.273
s = √1.273 = 1.128
It is recommended to calculate the sample variance in the order given above. One should
start with the simple mean value 1.24, which enters the formula of the sample variance in
squared form as 1.24². The mean of the squared observations in the second row requires
squaring each observation. The resulting 2.76 also enters the formula for s². Finally, for s²,
we need the sample size of 25, which enters the formula with 25 in the numerator and
25 − 1 = 24 in the denominator. The result 1.273 should not be interpreted.
Only by forming the standard deviation by taking the root from 1.273 do we get an inter-
pretable result of 1.128. The average previous contact with similar nursing robots is
1.24 ± 1.128 times. Whether this spread is large or small usually depends on the context
as well as the units measured. If we had a frequency table, we would only have to adjust
the way the mean is calculated in the first two steps:
x̄ = 0 · 0.2 + 1 · 0.56 + … + 5 · 0.04 = 1.24
x2 = 0² · 0.2 + 1² · 0.56 + … + 5² · 0.04 = 2.76
Here, the mean values were calculated based on the relative frequencies. Care must be
taken when calculating x2; the values must be squared in each case. Since the two subse-
quent calculation steps are analogous, no further illustration is given.
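The shortcut formula for the sample variance can also be sketched in Python; the function name is an illustrative choice, and the printed results agree with the calculation by hand.

from math import sqrt

def sample_variance(data):
    n = len(data)
    mean = sum(data) / n                       # simple mean
    mean_sq = sum(x * x for x in data) / n     # mean of the squared observations
    return n / (n - 1) * (mean_sq - mean ** 2)

contacts = [1, 5, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2,
            1, 0, 3, 1, 1, 0, 3, 0, 1, 1, 2, 1]
s2 = sample_variance(contacts)
print(round(s2, 3), round(sqrt(s2), 3))        # -> 1.273 1.128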
Since the procedure for the age based on the original list is similar to the one above, we
proceed with the following data:
16; 50; 35; 47; 15; 20; 47; 48; 44; 55; 56; 35; 48; 52; 49; 68;
17; 26; 39
x̄ = (1/19) · (16 + 50 + … + 26 + 39) = 40.368
x2 = (1/19) · (16² + 50² + … + 26² + 39²) = 1851
s² = (19/18) · (1851 − 40.368²) = 233.690
s = √233.690 = 15.287
The mean age of the 19 patients is 40.368 ± 15.287 years. From the frequency table, as
with the rest of the measures, we arrive at a slightly different result.
j    x*j−1 ; x*j    nj    fj      Fj
1    [15; 30]       5     5/19    5/19
2    (30; 45]       4     4/19    9/19
3    (45; 50]       6     6/19    15/19
4    (50; 70]       4     4/19    1
Σ                   19    1
x̄ = 22.5 · (5/19) + … + 60 · (4/19) = 41.447
x2 = 22.5² · (5/19) + … + 60² · (4/19) = 1899.671
s² = (19/18) · (1899.671 − 41.447²) = 191.918
s = √191.918 = 13.85
Finally, the calculation of the mean value x2 should be discussed again. The class centers
must now be squared in each case. With slightly different mean values we consequently
also come to slightly different results for s2 and s.
We recall the special property of the mean: Under a linear transformation of all the data, it
changes exactly by the added/subtracted and/or multiplied value. In the case of the sample
variance and standard deviation, this looks somewhat different. Here, too, we consider the
original list of the ages again for this purpose:
16; 50; 35; 47; 15; 20; 47; 48; 44; 55; 56; 35; 48; 52; 49; 68;
17; 26; 39
Based on this, we have obtained a sample variance of s² = 233.690 and a standard devia-
tion of s = 15.287. If we now look at all persons again two years later, meaning that xi + 2
applies to all i = 1, …, n, then this does not affect the sample variance and standard
deviation at all. According to the modified original list, all patients have become two years
older, and the individual numbers have changed:
18; 52; 37; 49; 17; 22; 49; 50; 46; 57; 58; 37; 50; 54; 51; 70; 19; 28; 41
However, the distances between the individual people and, thus, the variation/scatter
between them does not change.
In summary, this means that increasing or decreasing all values by a certain amount a
does not lead to any change in the sample variance or standard deviation. The situation is
different if all values of the sample are multiplied by a certain number b. The original pri-
mal list above, if all values were doubled (xi · 2 for all i = 1, …, n), would result in
32; 100; 70; 94; 30; 40; 94; 96; 88; 110; 112; 70; 96; 104; 98; 136; 34; 52; 78.
Here, we can see very well that not only have all values changed, but their distances to
each other and, thus, the dispersion have also taken on a completely different form. There
are now suddenly much larger differences between the individual values. Whereas in the
original list there were 34 years between the first (16 years) and the second (50 years)
person, there are now 68 years (32 and 100 years) between them after the doubling. This
leads to the following basic rule: Multiplying all values by b leads to a change in the sample
variance by the factor b² and in the standard deviation by the factor b (more precisely, by
the absolute value |b| if b is negative). Overall, we hold the following:
yi = a + b · xi    sy² = b² · sx²    sy = |b| · sx
If all original data xi are increased or decreased by a, this has no effect on the sample
variance sy² or standard deviation sy of the new data yi. If, conversely, all original values xi
are multiplied by b, then the new sample variance sy² changes by the factor b² compared to
the previous sx². For the new standard deviation sy, this results in a change by the factor |b|.
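A short numerical check of these transformation rules, here using Python's statistics module (which also uses the n − 1 denominator for the sample variance); the variable names are illustrative.

from statistics import stdev, variance

ages = [16, 50, 35, 47, 15, 20, 47, 48, 44, 55, 56, 35, 48, 52, 49, 68, 17, 26, 39]
shifted = [x + 2 for x in ages]     # everyone two years older
doubled = [2 * x for x in ages]     # all ages doubled

print(round(variance(ages), 3), round(variance(shifted), 3))   # identical values
print(round(variance(doubled), 3))                             # four times as large
print(round(stdev(ages), 3), round(stdev(doubled), 3))         # the second is doubled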
SUMMARY
Descriptive univariate (i.e., one-dimensional) analysis is a very impor-
tant part of statistical analysis. It is used to describe each variable collec-
ted before going into deeper analysis. We used frequency tables, graphs,
and measures to describe them in more detail. Recall that we covered
popular visualization methods such as pie charts, bar charts, and Pareto
charts using our example of a hospital with nursing robots.
In all the analyses we used, the scale level of the variables considered
was the main focus. This is because depending on the nature of a varia-
ble, certain statistical analyses may or may not be permitted.
UNIT 3
ANALYSIS METHODS OF TWO-
DIMENSIONAL DATA
STUDY GOALS
Introduction
Is there a correlation between gender and income? Is there a dependency between hours
worked and cigarette consumption? These and other questions are what we want to
answer in this unit. When we look at two variables together, such as gender and income or
hours worked and cigarette consumption, and want to uncover a possible relationship, we
use bivariate analysis. This is also referred to as analyzing two-dimensional data. As in uni-
variate analysis, the scale level of the two variables under investigation is crucial in bivari-
ate analysis. Therefore, the following cases must be distinguished: both variables are nominally scaled, both are ordinally scaled, both are cardinally scaled, or the two variables have different scale levels.
We will first explain the three cases in which the scale levels of the two variables are iden-
tical. Then, we will describe the procedure for two different scale levels.
Table 18: Initial Data for Two Nominal Scale Variables (i, gender, smoking behavior; f = female, m = male, S = smoker, N = non-smoker)
1 f N 2 f N 3 f N 4 f S
5 m S 6 f N 7 f N 8 f N
9 m N 10 m S 11 f S 12 m N
13 f N 14 f N 15 f S 16 f N
17 f N 18 f N 19 m N 20 f S
21 f N 22 f N 23 f N 24 f N
25 f N 26 m S 27 m N 28 m N
29 f N 30 f S 31 f N 32 m S
33 f N 34 f N 35 f N 36 m N
37 m S 38 m S 39 m N 40 f N
41 f S 42 f S 43 f N 44 f N
45 f N 46 m N 47 f N 48 f N
From the answers, we know that, for instance, the first respondent is a female
non-smoker and the 27th respondent is a male non-smoker. We now want to
find out whether there is a correlation or dependence between gender and
smoking behavior in this sample. Therefore, we need to examine A: gender and
B: smoking behavior (please note that the order is arbitrary). The variable val-
ues of the two features are as follows:
• A: female (A1), male (A2)
• B: smoker (B1), non-smoker (B2)
Both A and B have I = 2 and J = 2 variable values. The order in which the val-
ues are arranged (first female and then male or vice versa) is unimportant. The
aim of the bivariate analysis is to consider the two variables together. When they
are considered together, I · J different variable values result in
Ai, Bj with i = 1, 2, …, I; j = 1, 2, …, J .
In the present example, there are the following four combinations of the values: female smoker (A1, B1), female non-smoker (A1, B2), male smoker (A2, B1), and male non-smoker (A2, B2).
The first step of the data analysis is to summarize the collected data in a more compressed
way with regard to the question. This is done by summarizing the absolute frequencies nij
for the individual expressions (Ai, Bj) in a contingency table for absolute frequencies; this
table forms the counterpart to the frequency table in the univariate case. The structure of
this contingency table is as follows (Bamberg et al., 2022, p. 30):
Table 19: General Structure of a Contingency Table With Absolute Frequencies
A/B    B1     B2     …    BJ
A1     n11    n12    …    n1J    n1.
A2     n21    n22    …    n2J    n2.
⋮      ⋮      ⋮      ⋱    ⋮      ⋮
AI     nI1    nI2    …    nIJ    nI.
       n.1    n.2    …    n.J    n
The rows of the table contain the values of the variable A. The table, therefore, has as
many rows as the values of the variable A. The individual columns of the table mark the
values of the variable B. The contingency table is composed of two areas regarding the
absolute frequencies. In the center of the table are the absolute frequencies nij of the vari-
able values Ai, Bj for all combinations i = 1, 2, …, I; j = 1, 2, …, J. The absolute
frequency n21, thus, stands for the number of people who have the variable values A2 and
B1. The margin of the table forms the second part of the contingency table where the mar-
ginal frequencies ni . or n . j are placed. Next, ni . sums up the absolute frequencies of the
row i:
60
J
ni . = ∑ nij
j=1
I
n . j = ∑ nij
i=1
Finally, the lower right corner of the contingency table with n always contains the total
number of people involved in the sample.
Table 20: Contingency Table With Absolute Frequencies for the Variables of Gender and Smoking Behavior
A/B       Smoker (B1)    Non-smoker (B2)
f (A1)    7              27                 34
m (A2)    6              8                  14
          13             35                 48
Looking at the initial table, we know that there are seven female smokers and 27
female non-smokers. Therefore, there are a total of 34 female nurses, which are
divided into smokers and non-smokers according to the above values. This
information fills the first row. The second row contains the information about
male nurses. Based on the initial table, there are six smokers and eight non-
smokers among them, which brings us to a total of 14 male nurses.
The sum of the first column and, thus, the total number of smokers is 13. There-
fore, the total number of non-smokers is 35. If we did not already know it, we
could calculate the total of all nurses in three different ways. All four values in
the center of the table must add up to the 48 present here. The sum of female
and male nurses and the sum of smokers and non-smokers both must also add
up to a total of 48 nurses.
We can see that among both female and male nurses, there are more non-smok-
ers (27 and eight, respectively) than smokers (seven and six, respectively). How-
ever, it should also be noted that there are many more female than male nurses
in our sample.
Based on this contingency table, the question that now arises is whether there is
a correlation between gender and smoking behavior. Does it make a difference if
one is male or female in terms of smoking behavior? This would be the case if,
for example, there were more female smokers than non-smokers and, con-
versely, more male non-smokers than smokers. Since there are more non-smok-
ers than smokers in both sexes, the tendency is that there is little or no correla-
tion.
In order to answer this question in a concrete and meaningful way, a measure called the
corrected contingency coefficient is used (Bamberg et al., 2022, pp. 36–37). This coefficient
quantifies the relationship between two variables, at least one of which is nominally scaled,
and requires four calculation steps.
Step 1: Calculation of the expected frequencies
In the first step, the expected frequencies or the absolute frequencies are determined
under descriptive independence. These are the absolute frequencies for which the two
variables are not related in any way. For instance, the proportion of smokers among
women would be exactly equal to the proportion of smokers among men. The expected
frequency for the variable values Ai and Bj is denoted by ñij and calculated by
ñij = (ni. · n.j) / n
Thus, one takes the product of the marginal frequencies and divides this by the sample
size. In the case of descriptive independence, nij = ñij holds for all combinations of i and j.
To show that the two variables are dependent on each other in some way, nij ≠ ñij must
hold for at least one pair (Ai, Bj).
Table 21: Contingency Table With Only Marginal Frequencies for the
Variables of Gender and Smoking Behavior
A/B       Smoker (B1)    Non-smoker (B2)
f (A1)                                       34
m (A2)                                       14
          13             35                  48
ñ11 = (34 · 13) / 48 = 9.208
ñ12 = (34 · 35) / 48 = 24.792
ñ21 = (14 · 13) / 48 = 3.792
ñ22 = (14 · 35) / 48 = 10.208
We now explain the first expected frequency ñ11. Consider the first row (female)
and first column (smokers). We take the sum of the row (34 female nurses), mul-
tiply it by the sum of the column (13 smokers), and divide the answer by the
total sum of individuals (48). The other three fields are calculated in the same
way: The row sum is multiplied by the column sum and divided by the total
number of individuals.
Step 2: Calculation of the distances between the absolute and expected frequencies
The second step is to see how far the actual absolute frequencies are from the expected
ones. With the expected frequencies, we now have a reference value that we know how to
interpret. If they are available, independence between the two characteristics applies.
Therefore, the further away the actual frequencies are from the expected frequencies, the
stronger the correlation must be. To measure the distance of these frequencies, the χ2
(Greek letter: Chi) is calculated by
χ² = ∑ ∑ (nij − ñij)² / ñij (summing over i = 1, …, I and j = 1, …, J)
Thus, in each field of the center of the contingency table, the expected frequency is sub-
tracted from the absolute one. This difference is squared since some distances may be
negative and others positive, which could cause them to cancel each other out to a 0.
Finally, the squared difference is divided by the expected frequency to put the distances in
proper relation: If one has only a small sample size, a difference of, let's say, 5 weighs
proportionally more than with a large sample size. This is done for each field of the contingency
table. Subsequently, all summands are added up. If the contingency table consists of two
rows and two columns (i.e., both variables have two values each), then χ2 can be calcula-
ted by
χ² = n · (n11 · n22 − n12 · n21)² / (n1. · n2. · n.1 · n.2)
This makes the first step superfluous because this alternative is based solely on the num-
bers that are in the original contingency table for absolute frequencies.
χ² = (7 − 9.208)²/9.208 + (27 − 24.792)²/24.792 + (6 − 3.792)²/3.792 + (8 − 10.208)²/10.208
   = 0.529 + 0.197 + 1.286 + 0.478
   = 2.49
Alternatively, since both variables have only two values each, the shortcut formula yields the same result:
χ² = 48 · (7 · 8 − 27 · 6)² / (34 · 14 · 13 · 35) = 2.49
How did we proceed from here? The numerator contains the total number of
people (48). In the first step, the two values on the diagonal (7 and 8) are multi-
plied in the parentheses. The product of the two numbers on the secondary
diagonal (27 and 6) is subtracted from the result. The content of the parentheses
must be squared in total. In the denominator, all edge frequencies (34, 14, 13,
and 35) are multiplied together. Since χ² can take any value from 0 to infinity, the number
2.49 is difficult to assess on its own.
Step 3: Calculation of the contingency coefficient
For this reason, the contingency coefficient K is calculated in the third step:
K = √(χ² / (χ² + n))
It can only assume values between 0 and 1. We would obtain a 0 if the expected
frequencies corresponded exactly to the absolute frequencies, and χ2 = 0 would also
apply. The further apart the absolute and expected frequencies are from each other, the
larger χ2 and, thus, K become.
K = √(2.49 / (2.49 + 48)) = 0.222
However, we will not interpret this result yet. There is still one last calculation
step to be done.
Step 4: Calculation of the corrected contingency coefficient
In the last step, the contingency coefficient is corrected. This results in K*. This is because
the more expressions I and J that the two variables A and B have, the larger the number
of cells in the contingency table is and the larger χ2 becomes automatically (since χ2 con-
sists of several summands and one would always add up more). This influence is elimina-
ted here in the last step by
K* = K / Kmax with Kmax = √((M − 1) / M) and M = min(I, J)
The function min is a minimum function and selects the smaller number of columns or
rows. This results in M , which is used for Kmax and finally used to calculate K*.
M = min(2, 2) = 2
Kmax = √((2 − 1) / 2) = 0.707
K* = 0.222 / 0.707 = 0.314
Let’s explain it now. Start with the number of rows (2) and columns (2) of the
contingency table. If these two numbers are in curved brackets, you choose the
smaller number of the two. Since we are dealing with the same number twice,
the decision is trivial. If, however, we included occasional smokers in the smok-
ing behavior, the contingency table would consist of three columns. One would
then choose the smaller of 2 and 3. Once 2 is selected, it is substituted
for M in the formula for Kmax. This gives the above 0.707, and the contingency
coefficient from the third step, 0.222, is finally divided by this 0.707.
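All four calculation steps can be combined into one small Python sketch. The function below is an illustrative implementation for an arbitrary contingency table of absolute frequencies (not part of the course material); applied to our example, it reproduces the value 0.314.

from math import sqrt

def corrected_contingency(table):
    # table: nested list of absolute frequencies n_ij (rows = values of A)
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    # Steps 1 and 2: expected frequencies and chi-square distance
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (nij - expected) ** 2 / expected
    # Step 3: contingency coefficient K
    K = sqrt(chi2 / (chi2 + n))
    # Step 4: correction with M = min(I, J)
    M = min(len(row_sums), len(col_sums))
    K_max = sqrt((M - 1) / M)
    return K / K_max

smoking = [[7, 27],    # female: smokers, non-smokers
           [6, 8]]     # male:   smokers, non-smokers
print(round(corrected_contingency(smoking), 3))   # -> approx. 0.314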
This leaves only the interpretation of the corrected contingency coefficient. The values
that K* can assume lie between 0 and 1, inclusive. It assumes the value 0 precisely when
there is descriptive independence; the two variables are therefore not related in any way.
The higher the value for K*, the stronger the correlation. For the interpretation, the
strength of the correlation is assessed using a classification scheme.
RUNNING EXAMPLE: SMOKING HABITS AND GENDER
Since K* = 0.314 applies in the present example, there is a weak dependence
or correlation between gender and smoking behavior. Even if we had found a
strong correlation here, the mere number would not tell us what the nature of the
correlation is. So, whether male nurses are likely to smoke more than female nurses
or vice versa can only be read from the original contingency table.
We now turn to the case in which both variables are ordinally scaled. The two ordinal scale variables are denoted by X and Y in this context. For each object
carrier i in the sample, there is an observation xi for the variable X and an observation yi
for the variable Y . For each object, there is also a pair of points
xi, yi for i = 1, …, n .
Table 23: Initial Data on the Relationship Between Satisfaction With
Care Robots and Satisfaction With Health Status
i    xi            yi
1    good          very good
2    satisfied     satisfied
3    poor          insufficient
4    very good     good
While, for example, the first person rated their satisfaction with the robots as
“good” and satisfaction with their health status as “very good,” the third person
rated both as “poor” and “insufficient,” respectively. The second person rated
both as “satisfactory,” which is average, and the fourth person was in the upper
range regarding satisfaction both with the robots and their own health status.
Now, the question arises how to measure a correlation between two ordinal scale varia-
bles. We have the advantage here that the variable values can be placed in a meaningful
order. This makes it possible, for example, to check whether people who are very satisfied
with the nursing robots are also highly satisfied with their health status. The two variables
can, therefore, be tested for monotonicity, which can mean two things (monotonicity can be
understood as either a correlation in the same direction or a correlation in the opposite
direction):
1. x1 < x2 and y1 < y2
2. x1 < x2 and y1 > y2
Let's go over these relationships in detail. The first type of monotonicity involves a
correlation in the same direction, i.e., a monotonically increasing correlation. This means
that if Person 1 has a smaller value than Person 2 with respect to variable X, then they also
have a smaller value with respect to variable Y. Regarding our example, if satisfaction with
robots is high, so is satisfaction with health status and vice versa.
The second type of monotonic relationship is an opposite or monotonically decreasing
one. In this case, Person 1 has a lower score on one value than Person 2 but a higher score
on the other. Regarding our example, this means that, for example, there is a high level of
satisfaction with the robots but a rather low level with health status.
Now, how can we test whether the two variables are monotonically related to each other?
For this purpose, the observations are replaced by ranks. Within a variable, ranks are
assigned to the characteristic carriers. The one with the best expression usually gets the
rank of 1. The characteristic carrier with the worst expression gets the last rank n. The
ranks for the observations xi of the variable X are denoted by ri. The ranks for the obser-
vations yi of the feature Y are called si.
For the expressions xi ("satisfaction with the care robots"), the person with the best
rating receives rank 1 and the person with the worst rating receives rank 4, resulting
in the ranks ri.
i    xi            yi             ri    si
1    good          very good      2     1
2    satisfied     satisfied      3     3
3    poor          insufficient   4     4
4    very good     good           1     2
For the expressions yi (“satisfaction with health status”), the ranks si are
assigned analogously. It would have been possible for two or more patients to give
the same rating for satisfaction with the robots or health status. In this case, there
would be so-called ties, and average ranks would have to be assigned. However, we
are not going to use such a procedure in this script (if you are interested, please
refer to Fahrmeir et al., 2016, pp. 133–134).
The measure that is based on the ranks and allows a statement about a monotonic corre-
lation is called Spearman's rank correlation coefficient rS and can be calculated as fol-
lows (Bamberg et al., 2022, p. 35):
rS = ∑ (ri − r̄) · (si − s̄) / √( ∑ (ri − r̄)² · ∑ (si − s̄)² )
with all sums running over i = 1, …, n.
It is used to assess the relationship between two variables that are both at least ordinal
scaled in nature. The following is calculated in the numerator: For both the feature X and
the feature Y, we look at how far the respective ranks of each feature carrier are from the
average rank r̄ and s̄, respectively. For each object i, the product of the differences is
formed. These individual products are finally summed over all individuals. For the denom-
inator, the differences already calculated for the numerator are squared and summed. If
there are no ties, the rank correlation coefficient can also be determined using
rS = 1 − (6 · ∑ di²) / (n · (n² − 1)) with di = ri − si (summing over i = 1, …, n)
This variant saves a lot more time and is strongly recommended if there are no ties. We
must calculate only the difference di between the two ranks for each feature carrier. Since
we consider just the cases without ties, we can use the second formula at any time.
r̄ = (2 + 3 + 4 + 1) / 4 = 2.5
s̄ = (1 + 3 + 4 + 2) / 4 = 2.5
Since the same ranks from 1 to n are assigned for both characteristics, the aver-
age rank for both is always identical. Here, this is 2.5 for each case. Subse-
quently, it makes sense to create the following auxiliary table.
Table 25: Auxiliary Table for the Calculation of the Rank Correlation Coefficient (1)
i    ri    si    ri − r̄    si − s̄    (ri − r̄) · (si − s̄)    (ri − r̄)²    (si − s̄)²
1    2     1     −0.5      −1.5      0.75                    0.25         2.25
2    3     3     0.5       0.5       0.25                    0.25         0.25
3    4     4     1.5       1.5       2.25                    2.25         2.25
4    1     2     −1.5      −0.5      0.75                    2.25         0.25
Σ                                    4                       5            5
rS = ∑ (ri − r̄) · (si − s̄) / √( ∑ (ri − r̄)² · ∑ (si − s̄)² ) = 4 / √(5 · 5) = 0.8
Table 26: Auxiliary Table for the Calculation of the Rank Correlation Coefficient (2)
i    xi            yi             ri    si    di = ri − si
1    good          very good      2     1     1
2    satisfied     satisfied      3     3     0
3    poor          insufficient   4     4     0
4    very good     good           1     2     −1
Here, only the two ranks must be subtracted from each other for each object. For
the first person, this is 2 − 1 = 1. If we calculate these differences for all objects,
we obtain the following for the rank correlation coefficient:
rS = 1 − (6 · (1² + 0² + 0² + (−1)²)) / (4 · (4² − 1)) = 1 − 12/60 = 0.8
The result is, of course, the same as in the previous formula. However, we will
get it faster with this variant.
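For data sets without ties, the shortcut formula translates directly into a few lines of Python; the function name and the rank lists are illustrative choices.

def spearman_no_ties(ranks_x, ranks_y):
    # Shortcut formula, valid only if there are no ties in the ranks.
    n = len(ranks_x)
    d_squared = sum((r - s) ** 2 for r, s in zip(ranks_x, ranks_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

r = [2, 3, 4, 1]   # ranks for satisfaction with the nursing robots
s = [1, 3, 4, 2]   # ranks for satisfaction with the health status
print(spearman_no_ties(r, s))   # -> 0.8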
We now interpret our result. rS can only assume values between −1 and +1, inclusive. We
obtain positive values if there is a monotonically increasing correlation or a correlation in
the same direction between the two variables under consideration. If, on the other hand,
rS is in the negative range, a monotonically decreasing or opposite relationship applies.
For rS = 0, there is no monotonic correlation. This does not have to mean that there is
generally no correlation between the considered variables. It is only certain from this
result that the correlation is not monotonic. For any other value between 0 and +1 or
between 0 and −1, a classification scheme helps you interpret the strength of the result.
In our example, rS = 0.8 indicates a monotonically increasing correlation: The more satisfied
the patients are with the nursing robots, the more satisfied they tend to be with their health
status and vice versa. In this interpretation, it is of enormous importance to mention "and
vice versa." If we omitted it from this formulation, we might have gotten the impression that
satisfaction with nursing robots influences satisfaction with health status. However, we do
not know that at all. A correlation only assumes that the variables can influence each other.
A mutual influence or dependence is examined with a correlation, and this is what is
confirmed in the present case.
We now turn to the case in which both variables are cardinally scaled. Again, for each object there is a pair of observations
(xi, yi) for i = 1, …, n
Table 27: Initial Data for the Relationship Between the Age of the
Mothers and Fathers of Young Patients
i xi yi
1 56 60
2 49 55
3 48 46
4 46 52
5 47 56
6 56 51
7 57 71
8 53 60
9 58 61
10 54 58
11 47 49
12 53 65
For instance, the first child has a mother aged 56 years and a father aged 60
years. We now want to ask whether there is a connection between the age of the
mother and the age of the father. Is a comparatively older mother associated
with a relatively older father? Or is an older mother associated with a younger
father and vice versa?
Scatter Plot
To get a first impression of the correlation of the variables, a scatter plot is drawn (Bam-
berg et al., 2022, p. 32). A scatter plot is a coordinate system in which the two characteris-
tics form the axes. It visualizes the relationship between two cardinal scale variables. The
x-axis is described by the variable X and the y-axis by the variable Y. The pairs of points
of the individual persons are drawn as dots, circles, or asterisks in the scatter diagram.
Based on the arrangement of the points in their entirety, it can finally be seen whether
there is a positive or negative correlation.
Figure 9: Scatter Plot for the Correlation Between the Age of the
Mothers and Fathers of Young Patients
The scatter plot shows that the older a child’s mother is (moved further to the
right on the x-axis), the older the father is (moved further up on the y-axis) and
vice versa. Accordingly, a rather younger mother is associated with a younger
father and vice versa. Overall, the scatter plot indicates a positive correlation.
Regarding the interpretation, it is helpful to consider the mean values of the two variables
in the scatter plot. x̄ is entered as a vertical line and ȳ as a horizontal line in the scatter
plot. This results in four quadrants:
• In Quadrant I, there are those people who have an above-average value for both varia-
bles. Thus, the differences to the mean value are positive in each case; both parents are
older than the average.
• Quadrant II contains those objects who are below average with respect to X and above
average with respect to Y. The differences between the individual values of variable X
and the mean value are, therefore, negative (i.e., the mothers are younger than the
average). The differences for the values of Y are positive (i.e., the fathers are older
than the average).
• Quadrant III contains objects who are below the mean for both variables so that the
differences to the mean are negative (i.e., both parents are younger than the average).
• In Quadrant IV, the observations have an above-average value for variable X and a below-
average value for variable Y. Thus, for the values of variable X, positive differences to the
mean are obtained (i.e., the mothers are older than the average). For variable Y, negative
differences are obtained (i.e., the fathers are younger than the average).
If most of the observations are in Quadrants I and III, there is a positive correlation
because both variables tend to the same direction: Either both are below average or above
average. If, conversely, most of the observations are in Quadrants II and IV, there is a nega-
tive correlation. The two variables behave opposite to each other. One variable is below,
and the other above the average.
For our example, the two mean values are x̄ = 52 and ȳ = 57.
Figure 11: Scatter Plot for the Correlation Between the Age of the
Mothers and Fathers of Young Patients Divided Into Quadrants
Eleven of the 12 observations are in Quadrants I and III. This means that there is
a positive correlation between the age of the mother and that of the father.
With the scatter plot, we could see if the correlation between the two variables under con-
sideration was positive or negative. We now need to clarify how strong the correlation is
and which type of correlation can be measured at all. For two nominal scale variables, we
have only checked whether there is a correlation or not. Since ordinal scale variables can
be placed in a meaningful order, it is possible to check for a monotonic correlation. The
distances between two expressions can be interpreted in a mathematically meaningful
way for cardinal scale variables. This means that even stronger analyses can be performed
at this point. Namely, it should now be checked whether the cloud of points in the scatter
plot can be described by a straight line. The closer the points lie to a common straight
line, the stronger the linear relationship is.
A measure which can make a statement in this regard is the Bravais-Pearson correlation
coefficient rx, y (Bamberg et al., 2022, p. 34). It requires a cardinal scale level from both
variables and is defined by
rx, y = ∑ (xi − x̄) · (yi − ȳ) / √( ∑ (xi − x̄)² · ∑ (yi − ȳ)² )
(all sums run over i = 1, …, n)
and, thus, is very similar to Spearman’s rank correlation coefficient. Instead of ranks, the
characteristic expressions themselves are used here. The numerator is also referred to as
covariance (Bortz & Schuster, 2010, p. 153); the covariance measures the relationship
between two cardinal scale variables but is not normalized to a specific range. The
covariance decides the sign of rx, y. If most of the observations are in Quadrants I and III,
the covariance and, thus, rx, y become positive. If, on the other hand, most of the
observations are in Quadrants II and IV, we obtain a negative covariance and a negative
correlation coefficient. In the denominator, we can observe the product of the two sums of
squared deviations under the root. An alternative to the above calculation is the following
formula:
rx, y = (xy − x̄ · ȳ) / √( (x2 − x̄²) · (y2 − ȳ²) )
It is only based on the calculation of mean values and can, therefore, be solved more
quickly. The only new quantity within this formula is the mean xy of the products. It is
calculated by
xy = (1/n) · ∑ xi · yi (summing over i = 1, …, n).
Table 28: Auxiliary Table for the Calculation of the Correlation Coefficient
i    xi    yi    xi − x̄    yi − ȳ    (xi − x̄) · (yi − ȳ)    (xi − x̄)²    (yi − ȳ)²
1 56 60 4 3 12 16 9
2 49 55 -3 -2 6 9 4
3 48 46 -4 -11 44 16 121
4 46 52 -6 -5 30 36 25
5 47 56 -5 -1 5 25 1
6 56 51 4 -6 -24 16 36
7 57 71 5 14 70 25 196
8 53 60 1 3 3 1 9
9 58 61 6 4 24 36 16
10 54 58 2 1 2 4 1
11 47 49 -5 -8 40 25 64
12 53 65 1 8 8 1 64
Σ (sums of the last three columns) 220 210 546
Ultimately, the sums of the last three columns are required for the final determi-
nation of the correlation coefficient. These are substituted into the starting for-
mula as follows:
rx, y = ∑ (xi − x̄) · (yi − ȳ) / √( ∑ (xi − x̄)² · ∑ (yi − ȳ)² ) = 220 / √(210 · 546) = 0.650
The positive covariance 220 is responsible for the fact that the overall result
must also be positive. Since most of the observations are in the first and third
quadrant, such a number must result. The second variant requires the calcula-
tion of the following seven mean values:
x̄ = (1/12) · (56 + 49 + … + 53) = 52
ȳ = (1/12) · (60 + 55 + … + 65) = 57
x̄² = 52² = 2704
ȳ² = 57² = 3249
x2 = (1/12) · (56² + 49² + … + 53²) = 2721.5
y2 = (1/12) · (60² + 55² + … + 65²) = 3294.5
xy = (1/12) · (56 · 60 + 49 · 55 + … + 53 · 65) = 2982.33
The only mean value that we have not calculated in advance is the mean value
xy. To do this, we go through all pairs of parents and multiply the age of the
mother by that of the father. The sum is formed from all pairs of parents and
then divided by the total number of pairs of parents, which is 12.
Using the seven calculated values, we arrive at the same result for the correla-
tion coefficient
rx, y = (xy − x̄ · ȳ) / √( (x2 − x̄²) · (y2 − ȳ²) )
      = (2982.33 − 52 · 57) / √( (2721.5 − 2704) · (3294.5 − 3249) )
      = 18.33 / √(17.5 · 45.5) = 0.650
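The calculation via these mean values can be sketched in Python as follows; the function and variable names are illustrative choices, not part of the course material.

from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n                  # simple means
    mxy = sum(a * b for a, b in zip(x, y)) / n       # mean of the products
    mx2 = sum(a * a for a in x) / n                  # mean of the squared x values
    my2 = sum(b * b for b in y) / n                  # mean of the squared y values
    return (mxy - mx * my) / sqrt((mx2 - mx ** 2) * (my2 - my ** 2))

mothers = [56, 49, 48, 46, 47, 56, 57, 53, 58, 54, 47, 53]
fathers = [60, 55, 46, 52, 56, 51, 71, 60, 61, 58, 49, 65]
print(round(pearson(mothers, fathers), 3))   # -> 0.65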
Let’s now interpret the result rx, y. Just like rS, rx, y can assume values from −1 to +1. If
the coefficient is positive, there is a positive linear relationship between the two variables
X and Y . If one variable assumes large values, this tends to apply to the other variable. A
negative value is obtained for rx, y if there is a negative linear relationship between the
two variables. The two variables behave in opposite ways to each other. If rx, y = 1 for the
correlation coefficient, there is a perfect positive linear relationship. As can be seen in the
following figure, all points can now be connected by a straight line.
Figure 12: Scatter Plot With a Correlation Coefficient of +1
Whether the straight line begins at the origin or not is completely irrelevant. The only
important thing is that all points can be connected by a straight line.
Analogously, for rx, y = −1, there is a perfect negative linear relationship; all points then lie
on a straight line falling from the top left to the bottom right.
Figure 13: Scatter Plot With a Correlation Coefficient of -1
For rx, y = 0, there is no linear connection between the considered variables. It is also said
that the variables are uncorrelated. Note that this only means that there is no linear
relationship; there can still be a relationship that is, for instance, quadratic or exponential. This can
be seen in the following left figure. The points can be linked perfectly, but the connection
is not a linear one. Also, a scatterplot like the one on the right, where the points are dis-
tributed without any fixed negative or positive structure, results in a correlation coefficient
close to 0.
Figure 14: Scatter Plots With a Correlation Coefficient of 0
In the vast majority of cases, the correlation coefficient is somewhere between 0 and 1 or 0
and −1. In analogy to the rank correlation coefficient, a classification scheme is applied to
assess the strength of the relationship.
RUNNING EXAMPLE: AGE OF PARENTS
In our example, the correlation coefficient rx, y = 0.65 for the variables "age of the
mother" and "age of the father" means that there is a strong positive linear relation-
ship between these two variables. So, if the mother is older, this is usually the
case for the father and vice versa.
Let’s now go back to the scatter plot. The points are arranged in ascending form
from bottom left to top right. There is, therefore, a positive connection, which is
why a positive result is also obtained here. It is not surprising that we deviate
slightly from a perfect result of 1, since we cannot connect the points in such a
way to form a straight line.
The three cases that we have discussed so far are those in which the two scale levels of
the variables are identical. If the two variables have different scale levels, the appropriate
measure of association is the one that matches the weaker of the two scale levels.
Problems that can occur with correlations are mainly due to the interpretation of the
results. Basically, one is aware of the fact that correlations are symmetrical in nature.
X and Y (or A and B) are equal and can theoretically influence each other. Therefore, we
cannot read from the result of any association measure that one variable influences the
other and not vice versa.
Spurious correlation
It is often concluded from high correlations that the two variables under consideration are
dependent on one another. However, correlation does not necessarily mean causality. It is
often a matter of spurious correlations, which means that a third variable is responsible
for the high correlation. Suppose there is a high negative correlation between the number
of hairs on the head and income for men. This high correlation is certainly not due to the
fact that these two variables are connected but rather to the fact that a third variable, such
as age, correlates strongly with both variables and, thus, ensures a high correlation
between the number of hairs on the head and income.
Nonsense correlation
The so-called “nonsense correlation” is also closely related to the previous point. One
should never pay too much attention to a high correlation between two totally irrelevant
variables.
Type of correlation
Finally, it should be kept in mind that each measure only captures a particular type of correlation: the corrected contingency coefficient measures association in general, the rank correlation coefficient measures monotonic correlation, and the Bravais-Pearson correlation coefficient measures linear correlation.
SUMMARY
As a second part of the descriptive statistics, a bivariate (i.e., two-dimen-
sional data) analysis should be used if a connection between two varia-
bles is to be checked. This can be employed to find out whether two var-
iables are related in any way. It should be noted that the bivariate
analysis initially assumes that the two variables influence each other.
One should, therefore, never suppose that one variable has a clear direc-
tion of effect on another.
In order to select the right measure of association, the weaker level of
the scale should always be considered as a guide value.
UNIT 4
LINEAR REGRESSION
STUDY GOALS
Introduction
How does age affect income? Does blood alcohol concentration affect reaction time? If so,
how strong is this influence? How do age and IQ affect the ability to concentrate? If we
want to examine how one or more cardinal scale variables affect another cardinal scale
variable, we use linear regression analysis. If the focus is only on the influence of a single
cardinal scale variable on another cardinal scale variable, this is referred to as simple lin-
ear regression. Finally, if we assume that several cardinal scale variables influence another
cardinal scale variable, we are referring to multiple linear regression.
In this unit, we will deal exclusively with simple linear regression. In addition, it should be
made clear that the simple linear regression is only explained in terms of its most impor-
tant features here. It is a very extensive process that must meet numerous requirements
and is, therefore, subject to several tests.
Table 29: Initial Data for Simple Linear Regression
i    xi (blood alcohol concentration in per mille)    yi (reaction time in milliseconds)
1 0 590
2 0.3 581
3 0.5 687
4 0.7 658
5 1 632
6 1.2 645
7 1.4 687
8 1.8 624
9 2.3 702
10 2.5 789
For instance, the first person had 0 alcohol per mille in their blood and achieved
a reaction time of 590 milliseconds. The seventh person, on the other hand, had
1.4 per mille and a much longer reaction time of 687 milliseconds. The inde-
pendent variable X is defined here by the alcohol concentration and the
dependent variable Y by the reaction time because we want to investigate the
influence of alcohol consumption on the reaction time.
In order to get a first impression of the connection between the two variables X and Y , we
look at the data in a scatter plot (like with the correlation analysis). At this point, it is cru-
cial to know which axis in the coordinate system stands for which variable. The independ-
ent variable X is always plotted on the x-axis and the dependent variable Y on the y-axis.
RUNNING EXAMPLE: REACTION TIMES WHILE UNDER THE INFLUENCE
OF ALCOHOL
Regarding our example, we plot the alcohol concentration on the x-axis and the
reaction time on the y-axis. If all objects with their two values are then consid-
ered in the scatter plot, the following form is obtained.
Figure 16: Scatter Plot for the Initial Data of the Simple Linear
Regression
We can see that as alcohol concentration increases (moving further to the right
on the x-axis), reaction time also tends to increase (moving further up on the y-
axis). Indeed, we see a positive correlation here.
If we only wanted to check how strong this connection is, we would use the Bravais-Pear-
son correlation coefficient. This tells us how strong the linear relationship is between the
two variables and how well the points can be described by a linear straight line.
Now, the correlation coefficient of rx, y = 0.742 offers a very good basis for
determining the linear straight line that describes the influence of the alcohol
concentration on the reaction time. This is what simple linear regression does: It
determines the regression line for the relationship between an independent var-
iable X and a dependent variable Y .
The regression line we are looking for has the general form
ŷi = a + b · xi
We know the values for xi and yi; these are shown in the output table or in the scatter plot.
The parameter a is the y-axis intercept of a linear function and b describes the correspond-
ing slope. These two parameters are determined in the linear regression analysis in such a
way that the point cloud can also be described by a linear straight line. Thus, one tries to
find the straight line in which the points in the scatter plot lie as close as possible. The
statistical method used for this is called the least squares method (Handl & Kuhlenkasper,
2018, p. 478).
To understand this better, we will take a closer look at the following figure.
Figure 17: Idea of Simple Linear Regression
There are three points in the figure labeled y1, y2, and y3. These are the exemplary starting
points, which were obtained by combining xi and yi for three arbitrary objects. Specifi-
cally, these are the points (x1, y1) = (2, 2) (at Location 2 on the x-axis and Location 2 on
the y-axis), (x2, y2) = (5, 10), and (x3, y3) = (7, 7). We tried to find a linear line for these
three points; the result is the drawn line. However, none of the three points lies on the
straight line. All of them deviate either upward or downward from it.
For the second point y2, this deviation is drawn in nicely: Actually, one has observed the
point y2, but based on the estimated regression line, one would predict the value ŷ2. Note
that you must always use the roof (hat) on a variable when estimating a value based on a
regression line. Accordingly, one would make an error – also called a residual – if one used
the regression line to forecast y (Bortz & Schuster, 2010, p. 186). A residual is the distance
between the actual y-value and the y-value estimated based on the regression line. The
least squares method aims to minimize the sum of these errors. For the above example
with the three points, the straight line would be selected where the sum of squared
deviations is minimal. We take the squared deviations because some points deviate
upward and others downward from the regression line.
The formulas for a and b, which are obtained by the least squares method, are now as fol-
lows.
We start with the formula for b. This parameter must always be calculated first because we
need the result of b for the calculation of a. Have a look at the formula.
b = (xy − x̄ · ȳ) / (x2 − x̄²)
The numerator should look very familiar: It contains the covariance between the two variables and is, therefore, the same as the numerator of the Bravais-Pearson correlation coefficient. The denominator is not new either; it includes part of the denominator of the correlation coefficient, namely the variance of the independent variable.
Accordingly, to get the result for b, we need to determine five means, as we have already learned. It is recommended to start with the simple means $\bar{x}$ and $\bar{y}$. We can square the mean value $\bar{x}$ directly afterward to get $\bar{x}^2$. In the denominator, we also need the mean $\overline{x^2}$, for which all values of x are squared, summed, and divided by the number of observations. Finally, for the mean $\overline{xy}$, $x_i$ and $y_i$ are multiplied for each individual, all products are added up, and the sum is divided by the number of observations. If all determined mean values are inserted into the above formula, we obtain the coefficient b. This is now a component for calculating the second coefficient a:
$a = \bar{y} - b \cdot \bar{x}$
Once the two coefficients a and b have been determined, the regression line can be drawn.
$\bar{x} = \tfrac{1}{10} \cdot (0.0 + 0.3 + \ldots + 2.5) = 1.17$

$\bar{y} = \tfrac{1}{10} \cdot (590 + 581 + \ldots + 789) = 659.5$

$\bar{x}^2 = 1.17^2 = 1.3689$

$\overline{x^2} = \tfrac{1}{10} \cdot (0.0^2 + 0.3^2 + \ldots + 2.5^2) = 2.001$

$\overline{xy} = \tfrac{1}{10} \cdot (0.0 \cdot 590 + 0.3 \cdot 581 + \ldots + 2.5 \cdot 789) = 805.65$

$b = \dfrac{\overline{xy} - \bar{x} \cdot \bar{y}}{\overline{x^2} - \bar{x}^2} = \dfrac{805.65 - 1.17 \cdot 659.5}{2.001 - 1.3689} = 53.844$

$a = \bar{y} - b \cdot \bar{x} = 659.5 - 53.844 \cdot 1.17 = 596.503$

$\hat{y}_i = 596.503 + 53.844 \cdot x_i$
Again, the hat is used as a symbol for an estimate of the dependent variable, since we can only estimate reaction times based on this equation, which means that we cannot predict them exactly. Often, the two mathematical abbreviations in the equation are replaced by the actual variables. Accordingly, one could also write down the above equation as

estimated reaction time = 596.503 + 53.844 · alcohol concentration
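As an illustration, the least squares formulas can be reproduced with a few lines of Python. This is only a sketch: the full data set is not printed in this course book, so the two lists below are hypothetical stand-ins for the 10 observed alcohol concentrations and reaction times (only the first and last values appear in the text).

# Minimal sketch of the least squares calculation, assuming hypothetical example data
def least_squares(x, y):
    """Return intercept a and slope b using the mean-based formulas from the text."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    x2_bar = sum(v ** 2 for v in x) / n                  # mean of the squared x values
    xy_bar = sum(xi * yi for xi, yi in zip(x, y)) / n    # mean of the products x * y
    b = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data with the same structure as the alcohol/reaction-time example
alcohol = [0.0, 0.3, 0.5, 0.8, 1.0, 1.3, 1.5, 1.9, 2.2, 2.5]
reaction = [590, 581, 620, 635, 655, 670, 690, 701, 740, 789]

a, b = least_squares(alcohol, reaction)
print(a, b)            # regression constant and regression coefficient
print(a + b * 0.4)     # estimated reaction time at 0.4 per mille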
If you have found a regression line and, therefore, know a and b, you have to take a closer look at them. The parameter a is the regression constant. It describes the y-axis intercept of the linear regression line, i.e., the value y assumes when x = 0. Considering our example, a describes which reaction time is to be expected if the alcohol concentration is 0 per mille. Often, this number does not make sense in terms of content since negative values may result for the regression constant.
Regression constant: The regression constant describes the y-intercept of the linear regression line.
More important is the parameter b, which is called the regression coefficient (Bortz & Schuster, 2010, p. 188). It describes the slope of the linear regression line and tells us by how many units y changes when x increases by one unit.
Regression coefficient: The regression coefficient describes the slope of the linear regression line.
In our example, b describes the change in reaction time when the alcohol concentration
increases by 1 per mille. This can produce both a positive and a negative result. Positive
values represent a positive influence of the independent variable on the dependent varia-
ble. Consequently, a negative regression coefficient means that a negative influence is
exerted. It is true that the larger the regression coefficient, the steeper the regression line.
A high positive regression coefficient would indicate a steep positive regression line.
Accordingly, a strong negative regression coefficient indicates a steep negative regression
line (from top left to bottom right).
The great advantage of the established regression line is that we can use it to
make predictions. For any value xi, we can estimate the corresponding value of
the dependent variable yi. Imagine if the police stopped a suspected road user
and asked them to take a Breathalyzer test, which resulted in an alcohol concen-
tration of 0.4 per mille of blood. We could estimate the reaction time of the road
user based on the equation. Please note that only “estimating” the reaction time is possible because we can never be completely sure of hitting it exactly; for that, all points in the scatter plot would have to lie perfectly on a straight line. The estimate is obtained by substituting the alcohol concentration of 0.4 for $x_i$:

$\hat{y} = 596.503 + 53.844 \cdot 0.4 = 618.041$

Thus, for a person with such an alcohol concentration, one would expect a reaction time of 618.041 milliseconds based on the estimated regression equation.
Finally, the regression line can be plotted in the scatter plot.
Figure 18: Scatter Plot for the Initial Data Including Regression Line of
the Simple Linear Regression
The regression line drawn above is now the one that best describes the influence of the alcohol concentration on the reaction time. We find the regression constant of 596.503 where the regression line intersects the y-axis. The regression coefficient of 53.844 cannot be read directly from the graph: if one moves one unit to the right from the y-axis intercept (to Position 1 on the x-axis), the line rises by 53.844 milliseconds.
Overall, it should be mentioned that calculating the two coefficients a and b and, thus,
establishing the regression equation for prediction is crucial. Additionally, the interpreta-
tion of the two calculated coefficients is important. The graphical representation of the
data situation as well as the drawing of the straight lines is of secondary importance.
Correlation Coefficient
At the beginning of this unit, we used the Bravais-Pearson correlation coefficient to check whether there is a linear relationship between alcohol concentration and reaction time. Now, we use it again to assess the quality of the regression line in terms of linearity. If the points in the scatter plot do not approximate a linear course, you should not use the calculated regression line to make predictions. The formula of the Bravais-Pearson correlation coefficient is as follows:

$r_{x,y} = \dfrac{\overline{xy} - \bar{x} \cdot \bar{y}}{\sqrt{\overline{x^2} - \bar{x}^2} \cdot \sqrt{\overline{y^2} - \bar{y}^2}}$
Regarding the calculation of $r_{x,y}$, it is of great advantage that we have already calculated five of the seven required mean values for the regression coefficient b. Only the mean value $\bar{y}$ has to be squared once to obtain $\bar{y}^2$. The final relevant mean $\overline{y^2}$ again requires squaring each observation $y_i$, summing them, and dividing the result by the total number of observations.
We know that the correlation coefficient can take values between −1 and +1, inclusive,
and the proximity of the value to +1 or −1 is indicative of a strong linear relationship. In
the context of this linear regression analysis, we should obtain, at least, a correlation coef-
ficient of 0.5 or −0.5 to speak of an adequate linear relationship.
$\bar{x} = \tfrac{1}{10} \cdot (0.0 + 0.3 + \ldots + 2.5) = 1.17$

$\bar{y} = \tfrac{1}{10} \cdot (590 + 581 + \ldots + 789) = 659.5$

$\bar{x}^2 = 1.17^2 = 1.3689$

$\overline{x^2} = \tfrac{1}{10} \cdot (0.0^2 + 0.3^2 + \ldots + 2.5^2) = 2.001$

$\overline{xy} = \tfrac{1}{10} \cdot (0.0 \cdot 590 + 0.3 \cdot 581 + \ldots + 2.5 \cdot 789) = 805.65$

Only the last two mean values $\bar{y}^2$ and $\overline{y^2}$ are missing. We calculate these first:

$\bar{y}^2 = 659.5^2 = 434{,}940.25$

$\overline{y^2} = \tfrac{1}{10} \cdot (590^2 + 581^2 + \ldots + 789^2) = 438{,}271.3$

With this, the Bravais-Pearson correlation coefficient can now finally be determined:

$r_{x,y} = \dfrac{\overline{xy} - \bar{x} \cdot \bar{y}}{\sqrt{\overline{x^2} - \bar{x}^2} \cdot \sqrt{\overline{y^2} - \bar{y}^2}} = \dfrac{805.65 - 1.17 \cdot 659.5}{\sqrt{2.001 - 1.3689} \cdot \sqrt{438{,}271.3 - 434{,}940.25}} = 0.742$
Coefficient of Determination
The coefficient of determination is probably the most important criterion for evaluating a
regression line. But what is meant by the coefficient of determination? We can see that the
reaction times vary for the 10 participants in the previous example. Accordingly, there is a
dispersion in the observations of the dependent variable. With the help of the regression
analysis, we tried to find the reason for the different reaction times in the alcohol concen-
tration. In other words, we tried to explain the reaction time with the help of the alcohol
concentration. The part of the dispersion of the dependent variable (i.e., reaction time)
that can be attributed to the independent variable (i.e., alcohol concentration) is called
the explained dispersion. The remaining part of the dispersion of the dependent variable – which is not caused by the independent variable but can be attributed to other factors or even errors in the measurement – is called the unexplained dispersion. For instance, the age of the participant or the time of day of the reaction test could play a role. Besides these two examples, many other factors could also explain the variation in reaction time, but they were not considered in the regression model. In addition, the reaction times may not be recorded correctly and there may be measurement errors (Benesch, 2013, pp. 119–120). Both dispersions taken together result in the total dispersion of the dependent variable:

total dispersion = explained dispersion + unexplained dispersion

Explained dispersion: The explained dispersion is the part of the dispersion that is due to the independent variable.
Unexplained dispersion: The unexplained dispersion arises from unaccounted variables and measurement error.
Total dispersion: The total dispersion is composed of the explained and unexplained dispersion.
In summary, the reaction times of the 10 test subjects are all different. Therefore, we have
a basic dispersion, the total dispersion. A part of this dispersion can be explained by the
alcohol concentration, which describes the explained dispersion. The rest of the total dis-
persion, which is not due to the alcohol concentration, forms the unexplained dispersion.
The coefficient of determination reflects the explanatory power of a regression model. With the mathematical abbreviation R², it puts the explained dispersion in relation to the total dispersion:

$R^2 = \dfrac{\text{explained dispersion}}{\text{explained dispersion} + \text{unexplained dispersion}} = \dfrac{\text{explained dispersion}}{\text{total dispersion}}$

Coefficient of determination: The coefficient of determination reflects the explanatory power of a regression model.
This results in the proportion of the dispersion of the dependent variable that can be explained by the independent variable (Benesch, 2013, p. 119). One possible way to calculate the coefficient of determination is to compute the explained and unexplained dispersion according to the above principle. Since this can be quite time-consuming under certain circumstances, we use the simplest way to obtain a result for R²:

$R^2 = r_{x,y}^2$

In our example, $R^2 = 0.742^2 = 0.551$.
The coefficient of determination can take on values between 0 and 1, inclusive, and results in a percentage when multiplied by 100. If, for example, we obtain an R² of 0.7, this means that 70% of the dispersion of the dependent variable is due to the independent variable. This is equivalent to saying that the remaining 30% of the dispersion is explained by other factors or errors. If R² = 0, the dispersion of the dependent variable cannot be explained in any way by the independent variable. If R² = 1, we have a complete explanation, since the scatter of the dependent variable can be explained perfectly by the independent variable. In that case, all points in the scatter plot lie on a straight line, so the correlation coefficient is either +1 or −1 and, consequently, the coefficient of determination becomes 1, meaning 100%. In general, it is desirable for R² to be close to 1. In practice, one is already very satisfied with a coefficient of determination of at least 0.3, meaning 30%.
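To see how the correlation coefficient and the coefficient of determination relate, the mean-based formulas can likewise be sketched in Python. As before, the data lists are hypothetical stand-ins for the partially printed example data; only the relationship R² = r² is taken from the text.

# Sketch: Bravais-Pearson correlation coefficient and R², assuming hypothetical data
import math

alcohol = [0.0, 0.3, 0.5, 0.8, 1.0, 1.3, 1.5, 1.9, 2.2, 2.5]   # hypothetical values
reaction = [590, 581, 620, 635, 655, 670, 690, 701, 740, 789]  # hypothetical values

n = len(alcohol)
x_bar = sum(alcohol) / n
y_bar = sum(reaction) / n
x2_bar = sum(v ** 2 for v in alcohol) / n
y2_bar = sum(v ** 2 for v in reaction) / n
xy_bar = sum(x * y for x, y in zip(alcohol, reaction)) / n

r = (xy_bar - x_bar * y_bar) / math.sqrt((x2_bar - x_bar ** 2) * (y2_bar - y_bar ** 2))
r_squared = r ** 2   # coefficient of determination: share of explained dispersion
print(r, r_squared)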
Standard Error
The last way to evaluate the regression line is to use the standard error. It describes the absolute estimation error when using the regression model. If we wanted to use the established regression line to make predictions, we would only succeed in making an exact forecast if all points were on a straight line and we, thus, had obtained a coefficient of determination of 1. In most cases, we do not reach this complete explanatory power. As a result, we make an error in the forecast of the dependent variable. This average error across all forecasts is described by the standard error, which is represented by $\sigma_{x,y}$ (Bortz & Schuster, 2010, p. 190).
Standard error: The standard error describes the absolute estimation error when using the regression model.
We will not explain the calculation of the standard error here (as we need the unexplained dispersion for that). Rather, we will present its interpretation. In our example,

$\sigma_{x,y} = 43.279$
This means that if we were to use the above regression line to predict reaction
times, we would make an error of 43.279 milliseconds on average. Therefore, we
would deviate on average by 43.279 milliseconds upward or downward from the
true value.
The value for the standard error is an absolute value in the unit of measurement of the dependent variable (here, milliseconds). Whether this standard error is large or small entails an assessment that is both very subjective and often difficult. A solution is the calculation of the relative standard error, represented by $\sigma_0$, which describes the percentage deviation of the estimated value from the actual one. It is calculated by

$\sigma_0 = \dfrac{\sigma_{x,y}}{\bar{y}}$,

and, thus, it only requires dividing the standard error obtained above by the mean value $\bar{y}$ of the dependent variable.

$\sigma_0 = \dfrac{43.279}{659.5} = 0.066$

with $\bar{y} = 659.5$. The average deviation of the reaction times estimated based on the regression line from the actual reaction times is 6.6% ($0.066 \cdot 100$).
Relative standard error: The relative standard error indicates the standard error in percentage form.
The percentage deviation is usually easier to assess than the absolute deviation.
The closer we are to 0%, the better the regression equation is suited for forecasts. In the
present case, due to the low percentage error, we can certainly speak of a good prerequi-
site for using the established equation for forecasts.
Overall, it can be stated that the three criteria for assessing the regression line go hand in
hand. If there is a high correlation (whether positive or negative), the result will be a high
coefficient of determination and, consequently, a low standard error and a low relative
standard error. If the correlation is low (no matter whether positive or negative), the result
will be a low coefficient of determination and, thus, a large standard error and a large rela-
tive standard error.
SUMMARY
Simple linear regression is used to examine the influence of a cardinally scaled independent variable X on a cardinally scaled dependent variable Y. Based on prior knowledge, it must be clear that the independent variable linearly influences the dependent variable and in no case the other way around. If these conditions are met, the linear equation can be constructed based on the least squares method. In particular, the regression coefficient contained therein describes the nature of the influence of X on Y. Since the least squares method finds a regression line for every given set of data, it is important to evaluate it in terms of its quality and explanatory power. While the correlation coefficient determines the degree of linearity, the coefficient of determination gives us information about the explanatory content. Finally, the standard error provides information about the average error one would make when using the regression model for forecasts.
UNIT 5
FUNDAMENTALS OF PROBABILITY THEORY
STUDY GOALS
Introduction
What will the weather be like tomorrow? We usually look at the weather app to check the
chance of rain or to see if the sun comes out after all. For such future events, meaning
those that we do not know in advance whether they will occur or not, we want to deter-
mine their probabilities. In this unit, we will concentrate on the basics of probability
theory to understand how such probabilities come about.
In such situations, it is unknown which of several possible outcomes will occur; the outcomes depend on chance. Such processes are called random experiments or random processes (Bortz & Schuster, 2010, p. 49). All possible outcomes of a random process are summarized in the result set Ω or result space (Nachtigall & Wirtz, 2013, p. 18). We will now look at some examples.
Random experiment: A random experiment deals with processes whose outcomes depend on chance.
Result set: The result set describes the set of all results of a random experiment.
EXAMPLE: OUTCOME SETS IN RANDOM EXPERIMENTS
Rolling a six-sided die

Ω = {1, 2, 3, 4, 5, 6}

Tossing a coin

Ω = {heads, tails}

If a coin is thrown twice, the result set will expand a little further:

Ω = {(heads, heads), (heads, tails), (tails, heads), (tails, tails)}

Both the first and the second time, the coin can land on heads (heads, heads) or on tails (tails, tails). A mix of both is also possible in two different orders (heads, tails or vice versa).

Healing success

Imagine that we are all therapists. You are interested in the success of each patient that you have. This can be measured, for example, by “success” or “no success,” which is reflected in the result set in the same way:

Ω = {success, no success}
If you are interested in certain outcomes within a random experiment, you must deal with a special event. Events describe a subset of the result set and include what you are interested in. They are always denoted by a capital letter (e.g., A, B, or C; Nachtigall & Wirtz, 2013, p. 18). We look at individual events for two of the examples mentioned above.
Event: An event describes a subset of the result set and includes what you are interested in.
In the case of the throwing of a die, we are interested in even numbers. Therefore, the event includes only the numbers 2, 4, and 6 and is read as

A = {2, 4, 6}.

Healing success

A = {success}.
There are three special forms of events: (1) sure, (2) impossible, and (3) elementary. A sure event is one that always occurs. This can only refer to the result set, Ω. No matter how often we roll a die, how often we flip a coin, or how many patients we treat, we know that one of the above-described outcomes (e.g., obtaining a number from 1 to 6 in the case of rolling a die) will always occur.
Sure event: A sure event always occurs.
An impossible event can never occur. For instance, it is impossible to obtain the number 7 or 8 when throwing a standard six-sided die. We will discuss impossible events in more detail later.
Impossible event: An impossible event can never occur.
Finally, an elementary event describes every single result of the result set. In the therapy example, both the outcomes “success” and “no success” are elementary events (Bortz & Schuster, 2010, p. 49).
Elementary event: An elementary event is every single result of the result set.
As we have seen, individual result sets and events are sets of possible situations in the
form of numbers or words. Based on these existing sets, the usual set operations can be
carried out because the linking of individual events in turn results in new events. The most
important set operations will be explained by means of examples.
Ω = {1, 2, 3, 4, 5, 6}.

Accordingly, the numbers from 1 to 6 can be rolled if the die is rolled only once. Now, each of the three players needs one of a certain selection of numbers to be able to win the game:

• Player A wins with the numbers 1, 2, or 4: A = {1, 2, 4}
• Player B wins with the numbers 3, 4, or 5: B = {3, 4, 5}
• Player C wins with the numbers 5 or 6: C = {5, 6}

Thus, for each player, one of the numbers they are interested in is enough to win the game.
The first possible set operation is a complementary event (also called a counter event). Assuming an event A, the complementary event for this is denoted by Ā. It describes the situation in which the opposite of A occurs. Accordingly, it includes everything that is contained in the result set Ω but not in A (Handl & Kuhlenkasper, 2018, p. 181).
Complementary event: A complementary event contains exactly the opposite of the actual event.
Such set operations can be illustrated very well using a Venn diagram (Nachtigall & Wirtz, 2013, p. 19). A Venn diagram consists of a rectangular box. This box contains all elements of the result set. Depending on the number of events considered, they are represented by one or more circles (or, as an alternative, rectangles) in the box. These circles contain the results of the individual events. If you deal with only one event at first, one circle in the rectangular box is enough.
Venn diagram: A Venn diagram is used to illustrate event operations.
The box shown above contains all elements of the result set. Regarding the die example, the box consists of the numbers 1, 2, 3, 4, 5, and 6. The event, here exemplarily labeled A, is represented by the circle and contains those numbers that are of interest for the event A. Possible further events B and C can be represented in a similar way. Consequently, the blue shaded area contains all numbers that are not contained in A but are still in the result set. This, therefore, represents the complementary event Ā.
Figure 20: Complementary Event to Event A in Dice Game
We are optimistic and want to pass the exam in any case. So, the event A is as follows:

A = {passing statistics}

The complementary event to A is the event of not passing the statistics exam:

Ā = {failing statistics}
Now, we bring the two events together and see what operations are possible between them. The first possibility is the union set A ∪ B (read as “A union B”) between the two events A and B. This operation unites all the outcomes that are in A or B. It is also said that at least one of the two events occurs (Bortz & Schuster, 2010, p. 50). If the two events A and B have something in common (i.e., they overlap), this commonality is represented in a Venn diagram by the middle section.
Union set: The union set summarizes all results which are contained in at least one of the events.
Figure 22: Venn Diagram for the Union Set of Players A and B
There are two circles for the two events. They overlap because there is a common feature with the number 4. The two numbers that are only interesting to Player A (1 and 2) are in the part of Circle A that has nothing to do with Circle B. Likewise, the two numbers that are decisive for Player B (3 and 5) are only in the part of Circle B that has no common area with Circle A. The union set now describes the situation in which at least one of the two players wins.
But what does “at least one of the players wins” mean? It means that either only A (with 1 or 2), only B (with 3 or 5) wins, or both win together (with 4). Thus, the union set is A ∪ B = {1, 2, 3, 4, 5}. So, if one of these five numbers is rolled, at least one player will win in every case. For the remaining constellations, the union set is determined similarly:

• A ∪ C = {1, 2, 4, 5, 6}, since A = {1, 2, 4} and C = {5, 6}
• B ∪ C = {3, 4, 5, 6}, since B = {3, 4, 5} and C = {5, 6}
Next, we will look at the situation using a Venn diagram. The entire left circle, A,
describes passing the statistics exam. The part of A without the overlap means
passing only statistics. The right circle, B, represents passing the math exam,
and the part without the overlap again describes passing this exam only. The
overlapping area means that both statistics and math are passed. Everything
around the two circles signals the failure of both exams.
Figure 23: Union Set for Passing at Least One Examination
We now want to determine the union of these two events, which means passing at least one exam: statistics only, math only, or both.
Using the intersection set A ∩ B (read as “A intersected with B”), we can determine the commonality of the two events (Bortz & Schuster, 2010, p. 51). Accordingly, both events under consideration occur simultaneously.
Intersection set: The intersection set contains the commonality of the events.
Figure 24: Venn Diagram for the Intersection Set of A and B
In a Venn diagram, the intersection is exactly where the two events overlap.
RUNNING EXAMPLE: DICE GAME WITH THREE PEOPLE
Consider the two players A (A = {1, 2, 4}) and B (B = {3, 4, 5}) again. For this situation, the Venn diagram with the intersection highlighted in blue looks like this.
The intersection A ∩ B here describes the situation in which both players win the game together. These two players can only win simultaneously if the number 4 is rolled. Consequently, A ∩ B = {4}. For the other two possible constellations, the following intersections result:

• B ∩ C = {5}, since B = {3, 4, 5} and C = {5, 6}
• A ∩ C = { } = ∅ (empty set), since A = {1, 2, 4} and C = {5, 6}

We should take a closer look at the intersection A ∩ C. Since the two players do not have a common number that leads to their victory, the intersection is marked as “empty.” We can mark this either by having no result in the two curly brackets or by using the ∅ sign for the empty set.
Events that have no commonality and, thus, no intersection are called disjoint or incompatible events (Handl & Kuhlenkasper, 2018, p. 181).
Disjoint: Two events are disjoint if they have no commonality.
The last set operation is the difference, which describes the sole occurrence of an event. Considering the two events A and B, if we want to determine the difference A\B (read as “A without B”), we should find out when A but not B occurs (Handl & Kuhlenkasper, 2018, p. 183). In the Venn diagram, this is the part of Circle A that does not include the intersection.
Difference: The difference describes the sole occurrence of an event.
The difference becomes interesting only if the two events under consideration have something in common. If their intersection is empty, then A\B will simply correspond to A and B\A to B. We can see this in the following example, among others.
Basically, Player A can win if 1, 2, or 4 is rolled. But if they want to be the only winner, then the number 4 must not fall because in this case Player B will also win. So, we can state that the difference “A without B” is characterized by A\B = {1, 2}. If 1 or 2 is rolled, Player A will win but not Player B. For all further pairings, the differences can be formed in the same way:

• B\A = {3, 5}, since A = {1, 2, 4} and B = {3, 4, 5}
• A\C = {1, 2, 4} and C\A = {5, 6}, since A = {1, 2, 4} and C = {5, 6} have nothing in common
• B\C = {3, 4} and C\B = {6}, since B = {3, 4, 5} and C = {5, 6}

For A\C as well as for C\A, we can observe that the first mentioned event always occurs completely because both events have no similarities and are, therefore, disjoint.
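The set operations just described can be tried out directly with Python's built-in set type. The sketch below simply re-creates the dice example from this section.

# Sketch of the set operations for the dice game
omega = {1, 2, 3, 4, 5, 6}        # result set of a single die roll
A = {1, 2, 4}                     # Player A wins
B = {3, 4, 5}                     # Player B wins
C = {5, 6}                        # Player C wins

print(omega - A)                  # complementary event to A: {3, 5, 6}
print(A | B)                      # union set A ∪ B: {1, 2, 3, 4, 5}
print(A & B)                      # intersection set A ∩ B: {4}
print(A & C)                      # empty set: A and C are disjoint
print(A - B)                      # difference A\B: {1, 2}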
Determination of Probabilities
The prerequisite for determining probabilities is that the result set of a random process is known and finite. Thus, one must know how many results can occur in a random process. Only because we know a die has six sides can we determine that each side has a 1 in 6 chance of being rolled. If we assume that there are n equally possible outcomes of a random experiment, then the probability of each elementary event is 1/n. Therefore, for a die, each side has an equal probability of 1/6 of being rolled. In general, the probability P of an event A is defined by

P(A) = (number of outcomes in A) / (number of outcomes in Ω)

Probability: A probability describes the chance of occurrence of an event.
Thus, one divides the number of favorable outcomes (contained in A) by the number of possible outcomes (contained in Ω). Alternatively, to abbreviate the probability of the event A, one can write

P(A) = |A| / |Ω|,

where |A| denotes the number of outcomes contained in A. For the dice game, this results in the following winning probabilities:
• A = {1, 2, 4}, |A| = 3, P(A) = 3/6 = 0.5
• B = {3, 4, 5}, |B| = 3, P(B) = 3/6 = 0.5
• C = {5, 6}, |C| = 2, P(C) = 2/6 = 0.33
We see that the calculation of the probability depends only on how many favor-
able and possible outcomes exist. Even if A and B need different numbers to
win, they have the same probability of 50% to win the game. For Player C , only
two numbers lead to victory, which is why their probability of winning is 33%,
lower than that of the other two players.
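Because classical probabilities only count favorable and possible outcomes, the winning chances above can be reproduced with a very small sketch (the events and the result set are taken from the dice example).

# Sketch: classical probability P(event) = |event| / |Ω| for equally likely outcomes
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    return Fraction(len(event), len(omega))

print(prob({1, 2, 4}))   # P(A) = 1/2
print(prob({3, 4, 5}))   # P(B) = 1/2
print(prob({5, 6}))      # P(C) = 1/3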
For the exam example, the following probabilities are given:

P(A) = 0.8 (passing the statistics exam)
P(B) = 0.7 (passing the math exam)
P(A ∩ B) = 0.5 (passing both exams)
Rules for Calculating Probabilities
Different rules apply to probabilities, including the following (Handl & Kuhlenkasper, 2018,
p. 183):
These regularities result in the following calculation rules, which relate to the operations
between events described above. While we have only examined so far, for example, when
a player does not win or when Player A and Player B win together, we now want to calcu-
late the associated probabilities.
Referring to the complementary event described above, we now want to determine the corresponding probability for such an event. The probability of a complementary event Ā and, thus, the counter probability to an event A is calculated by

P(Ā) = 1 − P(A).

If we know the probability for event A, we only have to subtract it from 1 (100%) to get the probability for the complementary event. If, for example, we know that the probability of rain tomorrow is 30%, it is also given that it will not rain with a probability of 70%. Based on the Venn diagram, one can imagine it as follows:
Figure 28: Venn Diagram for Counter Probability
The entire box contains a probability of 1 or 100%. The circle contains the probability P(A) determined for an event A. Everything around it describes the probability for the complementary event P(Ā) = 1 − P(A).
Since both A and B have a 50% probability of winning, the probability of losing
is also 50% in each case. For C , the probability of losing is higher at 67%, since
the probability of winning is only 33%.
RUNNING EXAMPLE: EXAM RESULTS
For passing the exams, we know that the probabilities are P(A) = 0.8 for statistics and P(B) = 0.7 for math. This means that we have the following information:

• probability of not passing the statistics exam: P(Ā) = 1 − 0.8 = 0.2
• probability of not passing the math exam: P(B̄) = 1 − 0.7 = 0.3
Now, we will consider the probability of a union set. We will assume that the two events A and B are not disjoint and, thus, have an intersection. The probability that at least one of the two events occurs and, therefore, the probability of the union set A ∪ B is as follows:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Accordingly, the two probabilities of the individual events A and B are first added. Next, the probability for the intersection is subtracted once.
Figure 29: Venn Diagram for the Probability of the Union Set
We now want to find out the probability that at least one of the two players will win. Remember, this can mean that only A wins (on 1 or 2), only B wins (on 3 or 5), or both win (on 4). If we calculate the probability using the above formula, the following happens:

P(A ∪ B) = 3/6 + 3/6 − 1/6 = 5/6
We start with the probability that Player A will win. This is 3/6, which we determined based on the favorable results 1, 2, and 4. The probability of victory for Player B is then added to it. This is also 3/6, since the numbers 3, 4, and 5 lead to victory for this player. As we see, the number 4 has been considered twice so far, meaning it is contained in the probability P(A) as well as P(B). Since this is always the case for two non-disjoint events, the probability for the intersection (in this case, the probability for 4) must always be subtracted once. In the end, there is a probability of 5/6 = 0.83 or 83% that at least one of the two players will win. This is not that surprising because, with the numbers 1, 2, 3, 4, and 5 (see the Venn diagram), five out of the six numbers lead to this situation.
P(B ∪ C) = 3/6 + 2/6 − 1/6 = 4/6

At this point, the commonality of the two events is the number 5, which is why the probability for the intersection, in the amount of 1/6, must also be subtracted once.

P(A ∪ C) = 3/6 + 2/6 − 0 = 5/6

Because A and C cannot win at the same time, “at least one wins” means that either A or C wins. For this reason, nothing more is subtracted from the sum of the two individual probabilities.
In general, the probability for the union of any two disjoint events A and B is always as follows:

P(A ∪ B) = P(A) + P(B)
P(A ∪ B) = 0.8 + 0.7 − 0.5 = 1

So, we are pleased to see that at least one of the two subjects is passed with a probability of 100%.
The probability for the difference A\B (A shall occur but not B) of two events A and B is calculated by the difference of the probability for the event A minus the probability of the intersection A ∩ B:

P(A\B) = P(A) − P(A ∩ B).

Because we want to find out the probability that only A occurs, we must subtract its possible commonality with the other event, since this commonality would otherwise be counted as well.
Figure 30: Venn Diagram for the Probability of the Difference A\B
For this reason, the probability for Player A to win alone is as follows:

P(A\B) = 3/6 − 1/6 = 2/6

• P(A\C) = 3/6 − 0 = 3/6, since A = {1, 2, 4} and C = {5, 6}, so A always wins alone.
• P(B\C) = 3/6 − 1/6 = 2/6, since B = {3, 4, 5} and C = {5, 6}.
• P(B\A) = 3/6 − 1/6 = 2/6, since A = {1, 2, 4} and B = {3, 4, 5}.
• P(C\A) = 2/6 − 0 = 2/6, since A = {1, 2, 4} and C = {5, 6}, so C always wins alone.
• P(C\B) = 2/6 − 1/6 = 1/6, since B = {3, 4, 5} and C = {5, 6}.
Finally, if two events A and B are independent of each other, the probability of their intersection (i.e., of their joint occurrence) is calculated by

P(A ∩ B) = P(A) · P(B)

In this case, one simply multiplies the two individual probabilities together to obtain the probability of the joint occurrence (Bortz & Schuster, 2010, p. 54). In the die example with two successive rolls, A = {1 on the first roll} and B = {4 on the second roll}, this means that

P(A ∩ B) = 1/6 · 1/6 = 1/36.

In this example, we can deduce that the two events must be independent of each other, since the die always has the same initial state before each roll. In our exam case, it is not obvious whether passing the statistics exam is independent of passing the math exam. This information must be known before we can calculate the probability of the intersection set as just described.
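The calculation rules of this section can be checked with plain arithmetic. The sketch below uses the given exam probabilities (0.8, 0.7, and 0.5 for the intersection) and, separately, two independent die rolls for the multiplication rule.

# Sketch: rules for calculating probabilities
p_stat = 0.8                      # P(A): passing statistics
p_math = 0.7                      # P(B): passing math
p_both = 0.5                      # P(A ∩ B): passing both (given; A and B need not be independent)

p_not_stat = 1 - p_stat                       # complement rule: 0.2
p_at_least_one = p_stat + p_math - p_both     # addition rule: 1.0
p_only_stat = p_stat - p_both                 # difference rule P(A\B): 0.3

# Multiplication rule for independent events (two successive die rolls)
p_one_then_four = (1 / 6) * (1 / 6)           # 1/36

print(p_not_stat, p_at_least_one, p_only_stat, p_one_then_four)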
5.3 Random Variables and Their
Distributions
In the context of descriptive univariate statistics, we dealt with variables or characteristics for which we had concrete data. We, therefore, evaluated individual variables based on a sample by determining, for example, how often the individual values of the variable occur in the sample. In the case of gender, for instance, we can determine based on a sample how many people belong to the female, male, or diverse group. However, we are often not in the fortunate situation of having such a sample available. Nevertheless, we would like to be able to make statements about certain variables. Since we do not always have concrete data, we must make use of probability calculations. With their help, we can determine with which probability the individual expressions of the variables can occur. Variables whose outcomes (values) depend on chance are called random variables (Nachtigall & Wirtz, 2013, p. 56). As in descriptive statistics, a distinction is made between discrete and continuous random variables.
Random variable: A random variable depends on chance for its outcomes.
Discrete Random Variables
They are characterized by the fact that they have a countable set of different values. The
starting point of a discrete random variable is always a random experiment or a random
process, the result set of which must be recorded in the first step.
Ω = {HH, HT, TH, TT}

The coin can land on heads both the first and the second time (HH). Likewise, it can land on tails both times (TT). In addition, it can land on heads the first time and tails the second time (HT) and vice versa (TH).
By considering a random variable X, we are now interested in a very specific variable that ensures that the results of the above result set are converted into real numbers (Handl & Kuhlenkasper, 2018, p. 224):

X: Ω → ℝ

Depending on the definition of the random variable, the four different results of the result set above are changed to numbers. These concrete numbers are marked with the small letter x. The totality of all expressions of a random variable X is referred to as the carrier T(X) of X.
Carrier: The carrier contains all possible expressions of a random variable.

RUNNING EXAMPLE: VALUES OF THE RANDOM VARIABLE WHEN A COIN IS TOSSED TWICE

Focusing on our random variable X, we are now interested in the “number of heads H when a coin is tossed twice.” If, for example, heads is tossed twice in succession, the random variable takes the value “2.” If the coin lands on heads first and then tails or vice versa, the random variable has the value “1.” If the coin lands on tails twice in succession, this means a value of “0” for the random variable:

HH → 2
HT → 1
TH → 1
TT → 0
If we were in a game of chance, it would certainly be of great interest to know the probability of the individual outcomes of the random variable. If we know all possible results of the random process and their probabilities, we can also determine the probabilities for the characteristics of the random variable. This is done by the probability function f_X(x) (Handl & Kuhlenkasper, 2018, p. 225):

f_X(x) = P(X = x) for all x ∈ ℝ

According to the function, we go through each individual expression x ∈ {0, 1, 2} and determine the probability for it. The function is assigned the letter f because it represents the counterpart to the relative frequencies in descriptive statistics. The capital X in the index indicates that it is a random variable, and the small x in the parentheses says that we determine the probability for each expression. The following two conditions must apply to the probability function: (1) There must be no negative probabilities, and (2) the sum of all probabilities must add up to 1.
Probability function: The probability function describes the probability for each individual expression of a random variable.
P(HH) = 0.25
P(HT) = 0.25
P(TH) = 0.25
P(TT) = 0.25

From this, the probabilities for the characteristics of the random variable can be derived:

P(X = 0) = P(TT) = 0.25
P(X = 1) = P(TH) + P(HT) = 0.25 + 0.25 = 0.5
P(X = 2) = P(HH) = 0.25
The probability that the random variable takes the value 0 (no heads) is equivalent to the coin landing on tails twice in succession. This is one of the four outcomes of the result set, which is why we note the probability of 0.25. The probability of getting heads exactly once is higher because this can occur through two possible outcomes of the random process (TH and HT). Consequently, the probability of 0.25 must be considered twice, which is why we arrive at the probability of 0.5. For the value 2 of the random variable, the probability is the same as for the value 0. These probabilities are now combined into a joint probability function:

f_X(x) = P(X = x) =
  0.25 for x = 0
  0.5  for x = 1
  0.25 for x = 2
  0    else

Each line describes one characteristic of the random variable and its probability. It always starts with the smallest value. The last line “0 else” signals that there are no further values. This remark is always included in the last line of the probability function.
Alternatively, the probability function can also be represented in the manner of a frequency table (with which we are already familiar).

x    f_X(x)
0    0.25
1    0.5
2    0.25
Σ    1
What in descriptive statistics is the expression a_j of a variable is now the expression x of a random variable. Likewise, the relative frequency f_j is now replaced by the probability f_X(x).
Graphically, the probability function can be represented with the help of a bar chart. This
is done in a similar way to the representation of relative frequencies in descriptive univari-
ate analysis. For our example, the bar chart looks as follows:
With the help of the probability function, all probabilities can now be calculated. For example, the probability of getting heads more than once is

P(X > 1) = f_X(2) = 0.25

What is the probability that heads comes up at least once but not more than twice? This is

P(1 ≤ X ≤ 2) = f_X(1) + f_X(2) = 0.5 + 0.25 = 0.75
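A discrete probability function can be represented as a simple dictionary, which makes such probability queries straightforward; a short sketch for the coin example:

# Probability function of X = number of heads in two coin tosses
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

p_more_than_once = sum(p for x, p in pmf.items() if x > 1)     # P(X > 1) = 0.25
p_one_to_two = sum(p for x, p in pmf.items() if 1 <= x <= 2)   # P(1 ≤ X ≤ 2) = 0.75
print(p_more_than_once, p_one_to_two)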
Such a random variable can be described by various measures, three of which we will explain. The first measure is the expected value, written as E(X) or the Greek letter μ. It is the counterpart to the mean value of descriptive statistics and can also be calculated in this way (Nachtigall & Wirtz, 2013, p. 60):

$\mu = E(X) = \sum_{x \in T(X)} x \cdot f_X(x)$

Expected value: The expected value describes the expected expression of a random variable.

So, if we have the probability function, we multiply the expression in each row by its probability and finally add up all the products. Please note that we speak here of an expected value and not of a mean value since, due to the uncertainty of the occurrence of certain events, results can only be “expected.” Mean values are based on concrete observations.
For the coin-toss example with the probability function

f_X(x) = P(X = x) =
  0.25 for x = 0
  0.5  for x = 1
  0.25 for x = 2
  0    else

the expected value is

μ = E(X) = 0 · 0.25 + 1 · 0.5 + 2 · 0.25 = 1

It states that, as expected, heads will fall once if you flip a coin twice in a row.
Whether the random variable is strongly dispersed or not is in turn described by the variance and then by the standard deviation. While in the context of descriptive statistics we speak of a sample variance (there is a concrete sample), here only the term “variance” is used. The abbreviation of the variance is either Var(X) or the Greek letter σ². The variance is calculated using the following formula (Handl & Kuhlenkasper, 2018, p. 244):
Variance: The variance describes the dispersion of a random variable.
$\sigma^2 = Var(X) = E(X^2) - [E(X)]^2$

Standard deviation: The standard deviation stands for the expected deviation from the expected value.

The calculation should look familiar to us. We need two expected values. For the back part of the formula, the simple expected value E(X) is relevant, which finally must be squared. For the front part, the expected value of the squared expressions E(X²) is used; here, only the individual expressions are squared in the calculation. If we get a result for the variance in the form of a number, we face the problem that, because of the squaring, it cannot be interpreted. If we take the square root, we get the standard deviation σ:

$\sigma = \sqrt{\sigma^2}$
For the coin-toss example, we again start from the probability function

f_X(x) = P(X = x) =
  0.25 for x = 0
  0.5  for x = 1
  0.25 for x = 2
  0    else

Now, we determine the two expected values, the variance, and the standard deviation:

E(X) = 1 (as calculated above)
E(X²) = 0² · 0.25 + 1² · 0.5 + 2² · 0.25 = 1.5
σ² = 1.5 − 1² = 0.5
σ = √0.5 = 0.707

We can state that, as expected, heads is shown 1 ± 0.707 times when a coin is tossed twice.
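Expected value, variance, and standard deviation follow directly from the probability function; a short sketch:

# Sketch: distribution parameters of the coin-toss random variable
import math

pmf = {0: 0.25, 1: 0.5, 2: 0.25}   # number of heads in two coin tosses

mu = sum(x * p for x, p in pmf.items())          # E(X) = 1.0
e_x2 = sum(x ** 2 * p for x, p in pmf.items())   # E(X²) = 1.5
var = e_x2 - mu ** 2                             # Var(X) = 0.5
sigma = math.sqrt(var)                           # σ ≈ 0.707
print(mu, var, sigma)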
Continuous random variables must be handled in a completely different way than discrete
ones. We are dealing here with random variables that can take on many different forms.
Since calculations by integral calculus are necessary at this point, we will only explain the
absolute basic features of a continuous random variable.
Take a moment to recall what we know from descriptive statistics. For continuous characteristics, we graphically depict the empirical densities (relative frequencies divided by the class width) in a histogram. The area within a class reflects the relative frequency in that class, and the total area reflects the total relative frequency of 1. In this context, too, the probabilities are not directly represented graphically; rather, the corresponding densities are. This is because a continuous random variable can take on so many different expressions that a consideration of individual expressions is pointless. The starting point here is, therefore, a density function f_X(x) (Handl & Kuhlenkasper, 2018, p. 231).
Density function: The density function contains the characteristics and the densities of a continuous random variable.
EXAMPLE: WAITING TIME
As an example, we consider a random variable X that measures the waiting time
in line at the bakery. In the best case, one does not have to wait at all. Thus, the
waiting time is 0 minutes. Since time is limited for most people, the maximum
waiting time that can be tolerated would be 10 minutes. Because any waiting
time (for instance, 3.52 minutes) can occur between 0 and 10 minutes, this ran-
dom variable is called a continuous one. The density function of this random
variable is now given by the following:
f_X(x) =
  0.1 for 0 ≤ x ≤ 10
  0   else
It states in the first line that the random variable assumes a density of 0.1
between 0 and 10 minutes, inclusive. Waiting times smaller than 0 minutes or
larger than 10 minutes are not possible. These are referenced in the second line:
“else 0.” Graphically, the density function takes the following form.
Figure 32: Density Function for a Continuous Random Variable
We can observe the waiting times on the x-axis and the densities on the y-axis.
In the interval from 0 to 10 minutes, a density of 0.1 is constantly entered by a
horizontal line.
This distribution is also called a uniform distribution or rectangular distribution since it has one and the same density for the entire valid range of values and consequently assumes the shape of a rectangle. For ranges smaller than 0 or larger than 10, we find indicated horizontal lines at the height 0.
Uniform distribution: A uniform distribution contains the same density over the entire range of values.
It is important to mention that these densities do not offer any interpretation of the con-
tent. The decisive values are located in the area below this density because the probabili-
ties are to be found there. The total area must always result in a probability of 1 or 100%
since it is obvious that the waiting time with a probability of 100% lies somewhere from 0
to 10 minutes. But how do we arrive at this result mathematically? We see that we have a
rectangle at the top. The area of a rectangle is always calculated by height · width. Since
we have here a height of 0.1 and a width of 10, we get the necessary area:

P(0 ≤ X ≤ 10) = 0.1 · 10 = 1

So, the probability that the waiting time is between 0 and 10 minutes, inclusive, is 1 or 100%. According to this principle, all other areas and, thus, probabilities can be calculated.
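For the rectangular density, such areas are simple products of height and width. The sketch below computes them, assuming the waiting-time density of 0.1 on the interval from 0 to 10 minutes described above.

# Sketch: interval probabilities for the uniform waiting-time distribution on [0, 10]
DENSITY = 0.1

def p_interval(lower, upper):
    """Probability that the waiting time falls into [lower, upper] (clipped to [0, 10])."""
    lower, upper = max(lower, 0), min(upper, 10)
    return max(upper - lower, 0) * DENSITY

print(p_interval(0, 10))   # 1.0 -> total area
print(p_interval(0, 3))    # 0.3 -> at most three minutes
print(p_interval(4, 6))    # 0.2 -> four to six minutes
print(p_interval(6, 10))   # 0.4 -> at least six minutes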
RUNNING EXAMPLE: WAITING TIME
For example, what is the probability that the maximum waiting time is three minutes? The surface we are looking for is now the one outlined in red. If we calculate the area according to the same principle, we still need a height of 0.1 but only a width of 3:

P(X ≤ 3) = 0.1 · 3 = 0.3

Thus, the probability of waiting at most three minutes is 30%. In the same way, the probability of a waiting time of four to six minutes corresponds to a rectangle of height 0.1 and width 2, i.e., 0.1 · 2 = 0.2 or 20%.

Figure 34: Density Function with a Waiting Time of Four to Six Minutes

Finally, we want to know the probability of having to wait at least 6 minutes. So, the total area between 6 and 10 is needed.
Figure 35: Density Function with a Waiting Time of At Least Six Minutes

The area sought has a height of 0.1 and a width of 4, so P(X ≥ 6) = 0.1 · 4 = 0.4 or 40%.
Please note that probabilities at a single point play a subordinate role for continuous random variables because they are equal to 0. Therefore, only probabilities for intervals of the random variable are pertinent. Furthermore, the density function discussed here in the form of a uniform distribution represents only one of many examples of a density function. There are many other forms of density functions, but they generally require integral calculus for the area calculation. The calculation of the expected value and variance also requires integral calculus for continuous random variables, which is why we did not cover this in this unit.
SUMMARY
Many processes in everyday life depend on chance. For unknown outcomes of a process, probabilities are determined to make them more tangible. Building on the result set of a random experiment, we looked at events, the set operations between them, and the rules for calculating their probabilities. Finally, we discussed certain variables that depend on chance, the discrete and continuous random variables. We then calculated probabilities for them and, in the case of discrete random variables, determined the corresponding distribution parameters such as expected value, variance, and standard deviation.
UNIT 6
SPECIAL PROBABILITY DISTRIBUTIONS
STUDY GOALS
Introduction
For certain types of frequently used random variables, there are ready-made distribution
models that simplify working with them considerably. The advantage of these models is
that, for example, probability functions, expected values, and variances can be deter-
mined much more quickly. While there are numerous such models, we will focus on
explaining two discrete and two continuous models in this unit.
A Bernoulli process is about counting something. Two types of random variables can be of
interest (Handl & Kuhlenkasper, 2018, p. 276):
1. The number of successes in n Bernoulli trials
2. The number of failures until the first success
The first variant describes the situation in which there are, for instance, 10 multiple-choice
questions and we are interested in knowing how many of them we will guess correctly.
The second variant of a random variable is about how many multiple-choice questions we
answer incorrectly until we guess one correctly the first time. Basically, we count the failed
attempts before the first success.
The first type of random variable is described by the binomial distribution, and the second
one by the geometric distribution. We will explain these two distributions in the following
sections. The advantage of both is that there is a predefined probability function that can
be used to calculate every probability. Also, the expected value and the variance of the
random variables can be determined with quite simple formulas without much effort.
Binomial Distribution
With the binomial distribution, we are dealing with a random variable that counts the successes in n Bernoulli trials. Thus, for instance, we count how many of the 10 multiple-choice questions we will guess correctly. To work with a binomially distributed random variable, two parameters must always be determined: (1) the number of Bernoulli trials n, and (2) the probability of success p. Individual probabilities as well as the expected value and variance can only be calculated if these two values are known.
Binomially distributed random variable: A binomially distributed random variable counts the number of successes.
Let’s start with the probabilities. We want to be able to calculate any probability. So, if we
have recognized that the random variable X under consideration is a binomially distrib-
uted one, then the big advantage is that there is a function with which we can determine
all probabilities. This probability function takes the following form (Handl & Kuhlenkasper,
2018, p. 276):
$P(X = x) = \dbinom{n}{x} \cdot p^x \cdot (1 - p)^{n - x}$ for x ∈ T(X)
Let’s explain this function now. We want to determine the probability of x successes. To do this, we first determine with the binomial coefficient $\binom{n}{x}$ all the possibilities of achieving x successes in n Bernoulli trials. How we calculate this binomial coefficient will be explained using the example. Next, consider $p^x$: here, the probability of success p is multiplied by itself x times, since we assume exactly x successes. If we have x successes, then we must have n − x failures in n Bernoulli trials. The probability of failure (1 − p) is, thus, multiplied by itself a total of n − x times: $(1 - p)^{n - x}$. As shown in the above equation, all three terms discussed thus far are multiplied together due to the independence of the individual Bernoulli trials. Finally, x ∈ T(X) stands for the fact that any value of the carrier of X can be substituted for x. But what is in the carrier of X? If we consider n Bernoulli trials, there can be at most n successes. In the worst case, there is no success. Consequently, all values between 0 and n, inclusive, can be inserted for x into the probability function above.
Binomial coefficient: The binomial coefficient counts the possibilities of achieving x successes in n trials.
RUNNING EXAMPLE: MULTIPLE-CHOICE EXAM
We consider the above example once more. We are taking an exam with 10 mul-
tiple-choice questions. We have not studied and want to answer each question
by guessing. Each question has four possible answers, of which only one is cor-
rect. All questions can be answered independently. With the random variable X,
we are interested in the number of correctly guessed answers. This is all the
information we have.
Knowing this, we can now set up the probability function, with the help of which
we can calculate any probability:
$P(X = x) = \binom{n}{x} \cdot p^x \cdot (1 - p)^{n - x} = \binom{10}{x} \cdot 0.25^x \cdot (1 - 0.25)^{10 - x} = \binom{10}{x} \cdot 0.25^x \cdot 0.75^{10 - x}$
For example, what is the probability of guessing exactly two questions correctly? We should replace the x in the above function at the three relevant places by “2” for this problem definition:

$P(X = 2) = \binom{10}{2} \cdot 0.25^2 \cdot 0.75^{10 - 2} = 45 \cdot 0.25^2 \cdot 0.75^8 = 0.282$

The binomial coefficient $\binom{10}{2}$ = 45 counts the possibilities of distributing two correct answers over the 10 questions. The probability of success 0.25 is multiplied by itself twice, since we want two correct answers. Finally, with two successes and 10 questions, there must be eight failures, so the probability of failure 0.75 is multiplied by itself eight times. If we now multiply the three terms together, we get a probability of 28.2% for guessing two out of 10 multiple-choice questions correctly.
We proceed according to the same principle; we only replace the x at this point by “7”:

$P(X = 7) = \binom{10}{7} \cdot 0.25^7 \cdot 0.75^{10 - 7}$
Here, it is important to realize that “fewer than two” means that either no question (0) or one question (1) is answered correctly. We, therefore, must add up the probabilities for no correct answers and one correct answer:

$P(X < 2) = P(X = 0) + P(X = 1) = \binom{10}{0} \cdot 0.25^0 \cdot 0.75^{10} + \binom{10}{1} \cdot 0.25^1 \cdot 0.75^{9} = 0.056 + 0.188 = 0.244$
Finally, we would like to be able to determine the expected value and variance of a bino-
mially distributed random variable. Thus, if we are interested in determining the expected
number of successes, we calculate the expected value:
E(X) = n · p
We first determine the dispersion around the expected value by the variance:

Var(X) = n · p · (1 − p)

This gives us the interpretable standard deviation σ by taking the square root (Handl & Kuhlenkasper, 2018, pp. 276–277).
E(X) = n · p = 10 · 0.25 = 2.5

For 10 multiple-choice questions, we can expect 2.5 correct answers. The dispersion around this expected value is first calculated with the variance, Var(X) = 10 · 0.25 · 0.75 = 1.875, and then expressed by the standard deviation:

σ = √1.875 = 1.369
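Because only n and p are needed, the binomial results of this example can be reproduced with a few lines of standard-library Python:

# Sketch: binomial probabilities and parameters for n = 10, p = 0.25
from math import comb, sqrt

n, p = 10, 0.25

def binom_pmf(x):
    """P(X = x) for the binomial distribution with parameters n and p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(round(binom_pmf(2), 3))                           # ≈ 0.282
print(round(binom_pmf(7), 4))                           # P(X = 7)
print(round(binom_pmf(0) + binom_pmf(1), 3))            # P(X < 2) ≈ 0.244
print(n * p, n * p * (1 - p), sqrt(n * p * (1 - p)))    # E(X) = 2.5, Var(X) = 1.875, σ ≈ 1.369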
Geometric Distribution
While the binomial distribution counts how many successes occur within a certain number of random processes, a geometrically distributed random variable X counts the failures until the first success can be recorded. For example, we check the questions one by one and count the incorrectly answered ones until a question is guessed correctly for the first time. Again, the event A is the one that describes the success. In this case, we do not count the successes; rather, success is what we are waiting for (i.e., a correctly guessed question). The probability for the event A is again P(A) = p, which is why the probability for the counter-event Ā (an incorrectly answered multiple-choice question) is again P(Ā) = 1 − p.
Geometrically distributed random variable: A geometrically distributed random variable counts the failures until the first success.
The only parameter that is important for the geometric distribution is the probability of success p. The number of Bernoulli trials is not known here; rather, that is what we want to find out. Thus, the probability function of the geometrically distributed random variable takes a simple form (Handl & Kuhlenkasper, 2018, p. 278):

$P(X = x) = p \cdot (1 - p)^x$ for x ∈ T(X)
The carrier T(X) can take any value from 0 to infinity because, in the best case, we get a success directly on the first trial (so that there are 0 failures), or it can take infinitely long until we get a success for the first time. The probability function above has this form for the following reasons: If we want to determine the probability of x failures, we must multiply the probability of failure (1 − p) by itself x times. For this reason, we write $(1 - p)^x$. We only need to consider the probability of success p once, since we are only waiting for a single success. Because all the trials are independent, the probabilities are combined by multiplication into $p \cdot (1 - p)^x$.

For the multiple-choice example with p = 0.25, the probability function is

$P(X = x) = 0.25 \cdot (1 - 0.25)^x = 0.25 \cdot 0.75^x$
With this, countless probabilities can now be determined. Let us look at two example questions.

Question 1: What is the probability that the third question is the first correctly solved one?

This is equivalent to answering the first two questions incorrectly, so we insert “2” for x:

P(X = 2) = 0.25 · 0.75² = 0.1406

Thus, the probability is 14.06% that the first two multiple-choice questions are solved incorrectly (0.75²) and the third one correctly (0.25).
Question 2: What is the probability that, at the latest, the second question is the first correctly solved one?

This means that either the first question is guessed correctly (so there are 0 misses) or the second one is guessed correctly (so there is 1 miss):

P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.25 · 0.75⁰ + 0.25 · 0.75¹ = 0.25 + 0.1875 = 0.4375
Finally, the expected value and the variance of a geometrically distributed random variable can be easily calculated using two given formulas. For the expected value, we use the following:

$E(X) = \dfrac{1 - p}{p}$

This tells us how many failures to expect before the first success. For the variance, we use this:

$Var(X) = \dfrac{1 - p}{p^2}$

This gives a result that is, once again, not directly interpretable. However, if we take the square root of the variance, we obtain the standard deviation σ and, thus, the average deviation from the expected value (Handl & Kuhlenkasper, 2018, p. 279).
$E(X) = \dfrac{1 - p}{p} = \dfrac{1 - 0.25}{0.25} = 3$

We can, therefore, expect that three questions are answered incorrectly and the fourth question is the first correctly answered one. If we also want to determine the dispersion around this expected value, then we first calculate the variance:

$Var(X) = \dfrac{1 - p}{p^2} = \dfrac{1 - 0.25}{0.25^2} = 12$

This cannot be interpreted directly, as mentioned several times already, but it becomes an interpretable result by taking the square root of 12:

$\sigma = \sqrt{12} = 3.46$

Finally, we can state that, as expected, we will answer 3 ± 3.46 questions incorrectly until we guess the first one correctly.
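The same can be done for the geometric distribution. The sketch below implements the probability function for the number of failures before the first success with p = 0.25.

# Sketch: geometric distribution (failures before the first success), p = 0.25
from math import sqrt

p = 0.25

def geom_pmf(x):
    """P(X = x): x incorrect guesses before the first correct one."""
    return p * (1 - p) ** x

print(round(geom_pmf(2), 4))                                   # ≈ 0.1406 (first success on question 3)
print(round(geom_pmf(0) + geom_pmf(1), 4))                     # ≈ 0.4375 (success within the first two questions)
print((1 - p) / p, (1 - p) / p ** 2, sqrt((1 - p) / p ** 2))   # E(X) = 3, Var(X) = 12, σ ≈ 3.46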
Normal Distribution
Many variables of everyday life are normally distributed, such as IQ scores or exam scores. It is, therefore, important to know what characterizes a normally distributed random variable. Normally distributed random variables are characterized by the fact that a certain value range of the variable occurs with a high probability while other value ranges occur with a rather low probability, i.e., the variable is distributed symmetrically around an expected value.
Normally distributed random variable: A normally distributed random variable is distributed symmetrically around an expected value.
Consider, for example, IQ scores. It is very likely that a person has an IQ between 90 and
110, inclusive. An IQ of less than 90 or more than 110 is very unlikely. The same is often
true about exams. The scores obtained in exams are often in a central range with a high
probability. Let's say that in an exam one can achieve 100 points. The evaluation of the scores, however, often shows that the probability of achieving between 60 and 70 points, inclusive, is very high, while the probability of getting less than 60 or more than 70 points is again very low. If such distributions of random variables are present, they can often be
described by a normal distribution. Basically, we must always know or be given whether a
variable is normally distributed or not. In addition, we must know two parameters:
1. The expected value of the normally distributed random variable, which is denoted by
the Greek letter μ
2. The dispersion of the variable under consideration
This can be either in the form of the variance σ² or the standard deviation σ. The fact that a random variable X is normally distributed with expected value μ and variance σ² is also written as X ~ N(μ, σ²). For the calculations that follow, it is important that we work with the standard deviation σ. If only the variance is given, it is advisable to take the square root at the beginning to obtain σ.
On the x-axis, we find the possible travel times to work. We see here a range from 30 to 50 minutes, since obviously all other times (less than 30 or more than 50 minutes) are improbable. On the y-axis, we find the densities. The actual probabilities are located in the area below the bell. Where we find the expected value at 40 minutes on the x-axis, the density takes its highest point. It then drops evenly to both sides. There is, therefore, a perfectly symmetrical distribution around the expected value. The variance of 4 cannot be read directly from the graph. It is only noticeable in how steeply or flatly the density function falls from its highest point. If you have a small variance and, thus, a small dispersion, the density function falls off very steeply. This is equivalent to the fact that travel times around 40 minutes are very likely.
Figure 36: Representation of the Density Function for the Travel Time
If the variance and, thus, the dispersion were very large, then the density func-
tion would decay much more slowly and give room for much shorter or longer
travel times.
This is the initial situation. We know that the travel time is normally distributed.
We also know that the expected value is 40 minutes, the variance is 4, and,
therefore, the standard deviation is 4 = 2. The goal is now to determine proba-
bilities for certain ranges of random variables. For example, we might be interes-
ted in the probability of taking a maximum of 36 minutes to get to work, or we
might want to determine the probability of taking from 38 to 39 minutes. We
must be aware that this means that we must determine areas below the density
function. The total area below the density function is 1 or 100% because the
probability of needing some travel time to get to work is 100%. It should also be clear that the probability of needing a maximum of 40 minutes is 50%.
Likewise, the probability of needing 40 minutes or more is 50%. Since 40 is
exactly in the middle of a symmetric distribution, there is 50% probability mass
both below and above the expected value of 40. Thus, the expected value of a
normally distributed random variable corresponds exactly to the median.
If we now want to determine such probabilities, we must always follow a very specific path: It would be quite complicated to calculate areas and, hence, probabilities for a density function like the one above. For this reason, we use the standard normal distribution. This is a special form of the normal distribution which always has an expected value of 0 (μ = 0) and a variance and, thus, standard deviation of 1 (σ² = σ = 1). A standard normally distributed random variable is always denoted by Z (Bortz & Schuster, 2010, p. 71).

Standard normal distribution
The standard normal distribution is a special case of the normal distribution.
The great advantage of these standard normally distributed random variables is that there
is a table with all cumulative probabilities for them, from which we can also read out the
relevant probability for a normally distributed random variable. But what do we have to
do for this? If we are looking for the probability that a normally distributed random varia-
ble X takes on at most the value x, then a standardization must be made so that the table
can be used afterward. So, we look for
P(X ≤ x) = FX(x).
We write FX(x) since we are looking for a cumulative probability up to the value x. Now the value x must be standardized so that a value z of the standard normal distribution results:

P(X ≤ x) = FX(x) = Φ((x − μ)/σ) = Φ(z)
Standardization is done by subtracting the expected value μ from the relevant value x and
dividing this difference by the standard deviation σ. The Greek letter Phi Φ is used as a
sign for the distribution function of the standard normal distribution. What is finally deter-
mined in the parenthesis by standardizing is the letter z. The value we get for z must now
be read in the table of the standard normal distribution. This table looks like the following
figure.
z-value 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.00 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.10 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.20 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.30 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.40 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.50 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.60 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.70 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.80 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.90 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.00 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.10 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.20 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.30 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.40 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.50 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.60 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.70 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.80 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.90 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.00 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.10 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.20 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.30 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.40 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.50 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.60 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.70 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.80 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.90 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.00 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
Suppose we obtain Φ(2.53) after standardization. We now find these three relevant digits in the leftmost column and in the top row. The 2 before the decimal point and the 5 as the first number after the decimal point can be found in the row for "2.50." The 3 as the second number after the decimal point can be read in the uppermost row at the position "0.03." If we now go from the row for "2.50" to the column for "0.03," we find the value 0.9943. This is the probability we are looking for. If, for example, we get a 1 for z with Φ(1), then it is important to realize that this is nothing but 1.00.
Accordingly, we go to the row for “1.00” and the column for “0.00” and read the corre-
sponding probability 0.8413. With this knowledge, we can now go through all possible
questions that are relevant regarding calculating probabilities for a normally distributed
random variable.
Question 1: What is the probability that it takes an employee a maximum of 42 minutes to get to work?

The following figure shows which area below the density function is to be determined. It is by no means necessary to make such a sketch to solve the question; we do this merely for illustration. The marked area up to the value 42 is the cumulative probability which must be calculated. We can already estimate that the probability must be greater than 50% because 50% of the probability mass lies below the value 40.
P(X ≤ 42) = FX(42)
= Φ((42 − 40)/2)
= Φ(1)
= 0.8413

Thus, the probability of taking a maximum of 42 minutes to get to work is 84.13%.
Question 2: What is the probability that it takes less than 42 minutes for an
employee to get to work?
Even though the value 42 itself is not included here ("less than"), this makes no difference for continuous random variables. Less than 42 minutes means that one may need 41 minutes, 59 seconds, and some number of milliseconds to get to work. Since the difference is minimal, it is neglected, and the same procedure is used as for Question 1:
P(X < 42) = FX(42)
= Φ((42 − 40)/2)
= Φ(1)
= 0.8413
The probability of taking less than 42 minutes to get to work is also 84.13%.
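Instead of reading Φ(z) from the table, the cumulative probability can be computed directly. The following Python sketch is our own illustration, not part of the course material; it assumes the parameters of the running example (μ = 40, σ = 2) and uses the NormalDist class from Python's standard library:

from statistics import NormalDist

travel_time = NormalDist(mu=40, sigma=2)   # X ~ N(40, 4)

# Probability of needing at most (or, equivalently, less than) 42 minutes
print(round(travel_time.cdf(42), 4))       # 0.8413, matching Phi(1) from the table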
In some cases, it happens that a negative z-value is calculated while standardizing. We will
not find this in the table of the standard normal distribution. There are only positive val-
ues. However, since the standard normal distribution is a random variable symmetrically
distributed around 0, we can convert the negative z-value into a positive one as follows:
Φ(−z) = 1 − Φ(z)
So, if we get a negative z-value, we read the probability in the table at the corresponding
positive place and subtract this probability from 1.
RUNNING EXAMPLE: TRAVEL TIME
Question 3: What is the probability that it takes an employee less than 36
minutes to get to work?
The following figure shows the relevant area that is to be calculated. It is located
to the left of the expected value and is a relatively small area, which is why the
probability must also be correspondingly low, significantly less than 50%.
Figure 38: Density Function for a Travel Time of Less Than 36 Minutes
P(X < 36) = FX(36)
= Φ((36 − 40)/2)
= Φ(−2)
= 1 − Φ(2)
= 1 − 0.9772
= 0.0228
Up to the third line, the calculation is the same as in the previous examples. However, when we obtain −2 as the standardized value, something new happens. We convert Φ(−2) into 1 − Φ(2) to be able to determine a corresponding probability. We then read the probability in the table of the standard normal distribution at the position +2.00 (0.9772) and finally subtract it from 1. We thus obtain a very low probability of 2.28% that less than 36 minutes are needed to get to work.
If we want to determine the probability that a normally distributed random variable X
takes on at least or a greater value than x, then we calculate this via
P(X ≥ x) = 1 − P(X < x) = 1 − FX(x).
The relevant area to be calculated is, therefore, the one to the right of 42 minutes. Here, we can assume that the probability is less than 50% because only the area to the right of 40 minutes corresponds exactly to a probability of 50%.
With the knowledge from Question 1, we do not need to bother ourselves with a
calculation. Because we already know that the probability for a maximum of 42
minutes is 84.13%, we know the probability for the white area below the bell.
The probability for the red area must therefore be 100% − 84.13% = 15.87%.
Nevertheless, we want to prove this result by the calculation described above:
P(X > 42) = 1 − P(X ≤ 42)
= 1 − FX(42)
= 1 − Φ((42 − 40)/2)
= 1 − Φ(1)
= 1 − 0.8413
= 0.1587
So, according to this, we can also conclude that the probability for a travel time
of at least 42 minutes is 15.87%.
Finally, we want to be able to determine the probability that a normally distributed ran-
dom variable X takes on a value between two values x1 and x2:
P(x1 ≤ X ≤ x2) = FX(x2) − FX(x1)
Accordingly, one always first calculates the total probability up to the value x2 (the larger value) and then subtracts the probability up to the value x1 (the smaller value). Afterward, the probability mass that lies between these two values remains.
We are now looking for an area that neither starts at the beginning of the distribution nor ends at its end. Everything that lies between 39 and 42 minutes, inclusive, is now the area, or probability, we are looking for.
Figure 40: Density Function for a Travel Time From 39 to 42 Minutes
P(39 ≤ X ≤ 42) = FX(42) − FX(39)
= Φ((42 − 40)/2) − Φ((39 − 40)/2)
= Φ(1) − Φ(−0.5)
= Φ(1) − (1 − Φ(0.5))
= 0.8413 − (1 − 0.6915)
= 0.5328
The first line describes that the cumulative probability up to 39 must be subtracted from the cumulative probability up to 42. The second line carries out the respective standardization. From the third to the fourth line, it must be considered that with −0.5, a negative z-value is again obtained. This is converted in Line 4 via 1 − Φ(0.5) into a positive one. It is important to put parentheses around 1 − Φ(0.5) so that it is subtracted completely. Finally, we get a probability of 53.28% that it takes an employee between 39 and 42 minutes, inclusive, to get to work.
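The remaining travel-time probabilities from this section can be reproduced in the same way. This is again only an illustrative sketch with the assumed parameters μ = 40 and σ = 2:

from statistics import NormalDist

travel_time = NormalDist(mu=40, sigma=2)

p_less_36 = travel_time.cdf(36)                         # Phi(-2)    -> about 0.0228
p_more_42 = 1 - travel_time.cdf(42)                     # 1 - Phi(1) -> about 0.1587
p_39_to_42 = travel_time.cdf(42) - travel_time.cdf(39)  # about 0.5328

print(round(p_less_36, 4), round(p_more_42, 4), round(p_39_to_42, 4))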
If we let the two limits x1 and x2 move closer and closer together, the resulting probability tends toward 0. As the distance between the two limits decreases, no probability remains for a specific single value. The reason is that a continuous random variable can take on so many different values that each individual value has a probability of practically zero.
Up to now, we have calculated probabilities without exception. A travel time was speci-
fied, and we determined, for example, the probability that a maximum of this travel time
would be required. In the same way, travel times can be calculated for given probabilities.
For instance, if we want to determine a travel time that will not be exceeded with a proba-
bility of p percent, we determine the quantile
xp = μ + zp · σ
where zp is the corresponding quantile of the standard normal distribution (Handl & Kuh-
lenkasper, 2018, p. 296). We recall that p always describes exactly the percentage that lies
below the searched value. If, for example, we are looking for the quantile that is exceeded
with a probability of 30%, then it is equivalent to saying that a probability of 70% is below
the value we are looking for. Thus, x0.7 must be determined.
This is equivalent to saying that we are looking for a travel time that has a prob-
ability of 95% below it. If we want to mark the probability of 95% below the den-
sity function of the normal distribution, the end of the marked area just
describes the quantile we are looking for. The following figure shows the
searched quantile x0.95.
Figure 41: Density Function for a Travel Time That is Not Exceeded on
95% of Days
We calculate the quantile we are looking for as follows:
x0.95 = μ + z0.95 · σ
= 40 + z0.95 · 2
= 40 + 1.6449 · 2
= 43.29

Thus, from the second to the third line, z0.95 is replaced by 1.6449. Overall, the result means that, with a probability of 95%, the travel time does not exceed 43.29 minutes.
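The same quantile can be obtained in code via the inverse distribution function; the following sketch is our own illustration with the assumed parameters μ = 40 and σ = 2:

from statistics import NormalDist

travel_time = NormalDist(mu=40, sigma=2)

# 95% quantile: the travel time that is not exceeded on 95% of days
x_95 = travel_time.inv_cdf(0.95)   # equivalent to mu + z_0.95 * sigma
print(round(x_95, 2))              # about 43.29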
If you are looking for a quantile that is to the left of the expected value (e.g., x0.2 or x0.4), then the following relationship helps you to read off the required quantile of the standard normal distribution:

zp = −z1−p

This is because quantiles for probabilities below 0.5 are usually not tabulated; the table only contains the quantiles that lie to the right of the expected value.
The search is, therefore, for the travel time x0.1, which is undercut with a proba-
bility of 10%. The following figure shows that the searched quantile must be
below 40.
Figure 42: Density Function for the Travel Time That is Exceeded on
90% of the Days
Thus, it makes sense that, when determining the quantile, something is subtrac-
ted from the 40 and not added to it. While the approach with
x0.1 = μ + z0.1 · σ
is initially identical, we notice that we do not find the quantile z0.1 in the table of
quantiles of the standard normal distribution. Now, however, the symmetry
around 0 benefits us again:
zp = −z1−p
z0.1 = −z1−0.1 = −z0.9

We can state that z0.1 is exactly as far away from the center (0) of the standard normal distribution as z0.9, only with a negative sign. The table gives z0.9 = 1.2816, so z0.1 = −1.2816. This means that the sought-after quantile can be calculated via the following:
x0.1 = μ + z0.1 · σ
= 40 − z0.9 · σ
= 40 − 1.2816 · 2
= 37.44
The travel time, which is exceeded with a probability of 90% or fallen short of with a probability of 10%, is 37.44 minutes.
Finally, following the previous determination of quantiles, we want to determine central fluctuation intervals. These are the intervals that are formed centrally around the expected value and contain a certain probability mass. Both limits of the interval must be equally far away from the expected value. We will look at this using an example.

Central fluctuation intervals
These central fluctuation intervals contain a certain probability mass between two limits.
This can be best illustrated using the density function, as can be seen in the fol-
lowing figure. If 90% is within the boundaries of the central interval, then 10%
remains. These are divided equally between the left and right halves of the dis-
tribution outside the interval. This means that the interval must start at the
quantile x0.05 (since 5% are below this limit) and consequently end at x0.95
(since 5% are above this limit). These two quantiles must be determined now, as
we have already learned.
Figure 43: Density Function for the Central Variation Interval with 90%
of All Travel Times
It is always a good idea to start with the upper limit of the interval, here with
x0.95:
x0.95 = μ + z0.95 · σ
= 40 + z0.95 · 2
= 40 + 1.6449 · 2
= 43.29
Because if we determine the associated quantile of the standard normal distri-
bution for the upper interval boundary first, then we only have to give this a
negative sign for the lower boundary:
x0.05 = μ + z0.05 · σ
= 40 − z0.95 · 2
= 40 − 1.6449 · 2
= 36.71
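Both limits of the central 90% fluctuation interval are simply the 5% and 95% quantiles. A short illustrative sketch (our own, with the assumed parameters of the running example) confirms this:

from statistics import NormalDist

travel_time = NormalDist(mu=40, sigma=2)

lower = travel_time.inv_cdf(0.05)   # mu - z_0.95 * sigma -> about 36.71
upper = travel_time.inv_cdf(0.95)   # mu + z_0.95 * sigma -> about 43.29
print(round(lower, 2), round(upper, 2))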
t-Distribution
With the t-distribution, we consider another continuous distribution that is very similar to the normal distribution or standard normal distribution. It becomes important in particular if one deals with statistical test procedures. The t-distribution, just like the standard normal distribution, is distributed symmetrically around the value 0 (μ = 0); its dispersion, however, depends on the sample size and is somewhat larger than 1 for small samples. The t-distribution is, therefore, especially suitable for the case when the samples are particularly small. This is because – unlike the standard normal distribution – it can take into account how heavily populated a sample is. With increasing sample size, the t-distribution becomes very similar to the standard normal distribution until, at some point, the two become congruent. This can be seen in the following figure.

t-distribution
The t-distribution describes a distribution for small samples that is close to the standard normal distribution.
Figure 44: t-Distributions and Standard Normal Distribution
The first figure shows a t-distribution with only four observations. Especially in compari-
son to the second figure in which we have 40 observations, the first one runs somewhat
flatter and consequently with a larger dispersion. This is generally the case with small
samples. Let us imagine a small sample in which there is an outlier. We have already
learned that such outliers can increase a dispersion a lot. Then, the larger the sample size
becomes (see the second figure), the less noticeable outliers become. We see at n = 40
that the distribution is almost identical to that of the standard normal distribution. So, we
can state that as the sample size increases, the t-distribution approaches the standard
normal distribution.
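This convergence can also be seen in the quantiles of the t-distribution. The following sketch is our own illustration and assumes that the third-party SciPy library is installed; it compares the 97.5% quantile of the t-distribution for growing degrees of freedom with the corresponding quantile of the standard normal distribution (1.96):

from scipy import stats

for df in (3, 10, 30, 100, 1000):
    # 97.5% quantile of the t-distribution; approaches 1.96 as df grows
    print(df, round(stats.t.ppf(0.975, df), 3))

print("z:", round(stats.norm.ppf(0.975), 3))   # 1.96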
It might be surprising why we are suddenly talking about a sample here when we have been discussing random variables without any sample at all. This becomes important when we deal with inferential statistics and want to find out whether results obtained on the basis of samples can be transferred to the population. Here, we should just note that the t-distribution is very similar to the standard normal distribution and is particularly suitable for small sample sizes.
SUMMARY
For certain types of random variables, ready-made models exist with
which probabilities, expected values, and variances can be calculated
more easily and quickly than the conventional treatment of random var-
iables. Besides, there are some discrete distribution models, of which
only the binomial distribution and the geometric distribution have been
discussed here. Both are based on the Bernoulli process, and their ran-
dom variables have the task of counting something. While a binomially
distributed random variable counts the successes within a certain num-
ber of Bernoulli trials, the geometrically distributed random variable
counts the failures up to the first success.
Note that of the numerous continuous distribution models, only the nor-
mal distribution and the t-distribution have been discussed here. From
the starting point of a density function, it becomes clear that the sought
probabilities are always in the area below the density function. While
with discrete distributions, probabilities for individual values of the random variable are of interest, with continuous random variables, only ranges of values and their probabilities play a role, since the probability of any single specific value of a continuous random variable approaches zero.
UNIT 7
STATISTICAL ESTIMATION METHODS
STUDY GOALS
Introduction
Imagine that we want to conduct an employee survey on satisfaction with the company
that all of us work for. We already know that all employees of the company form the basic
population. However, if we do not reach all employees during our new survey, we will only
have a sample of that population. Therefore, we will have no choice but to estimate statements about all employees based on the sample. In this unit, we will use point and interval
estimates to show how to estimate results for the population based on a sample.
Let’s assume that we are now interested in the average length of service for all
employees. The dispersion is also important to us.
In order to create a point estimation, we should first determine a concrete value for the mean and the dispersion, which would then be valid for the population. Computationally, there will be no new challenges for us. We only have to be able to calculate a mean value, a sample variance, and a standard deviation.

Point estimation
This refers to the use of a concrete value to estimate the true parameter of the population based on a sample.
If we now want to estimate the average service length of all employees in the company
(i.e., the population), the mean value comes into play as a relevant measure. It is impor-
tant to note that the mean value is assigned a different letter depending on the initial sit-
uation. If we are at the population level, the mean, also referred to as the “expected
value," is abbreviated with the Greek letter μ. At the sample level, we use the familiar abbreviation x̄. Now, if we only have a sample, we must estimate the mean value of the population. We always add a hat (^) above the relevant estimate. Therefore, μ̂ refers to the estimated mean of the population.
At this point, we have only one sample. Consequently, we estimate the mean of the popu-
lation on the basis of this sample by calculating the mean value of the sample:
μ̂ = x̄ = (1/n) · ∑ xi,  summing over i = 1, …, n
This raises an important question: Is the mean value of the sample x̄ really well suited to represent the mean value of the population μ? For this, two quality criteria should be met. On the one hand, the estimator should be unbiased (Handl & Kuhlenkasper, 2018, p. 345). This means that the mean value of the sample corresponds, on average, to the expected value in the population. In principle, the sample mean is considered to be an "expectation-true estimator." Hence, this quality criterion is fulfilled. As well, the estimator should be consistent. A consistent estimator becomes more accurate as we increase the sample size and approaches the expected value of the population. This criterion is also met by our use of the sample mean as an estimator for the expected value of the population (Handl & Kuhlenkasper, 2018, p. 350).

Unbiased
An estimator is unbiased if the calculated parameter of the sample matches that of the population.

Consistent
An estimator is consistent if it matches the true value of the population better and better as the sample size increases.
μ̂ = x̄ = (12 + 8 + 16 + 10 + 6 + 10 + 14 + 12)/8 = 11
Now, we want to obtain an estimate of the dispersion in the form of the variance and
standard deviation of the population. Here, a distinction must also be made between the
level of the population and that of the sample. If we are at the population level, the var-
iance and standard deviation are represented using the Greek letter σ by σ2 and σ, respec-
tively. Here, it is important that we are talking about a variance and not a sample variance.
In the context of a sample, we use the representation that we are familiar with – namely,
s² and s – for the sample variance and standard deviation, respectively. The corresponding estimates of the variance and standard deviation of the population are denoted by σ̂² and σ̂, respectively. If we have only one sample, we estimate these two measures as follows:
σ̂² = s² = n/(n − 1) · (mean of the squared observations − x̄²)

σ̂ = s = √s²
x̄ = (12 + 8 + 16 + 10 + 6 + 10 + 14 + 12)/8 = 11

mean of the squared observations = (12² + 8² + 16² + 10² + 6² + 10² + 14² + 12²)/8 = 130

σ̂² = s² = 8/(8 − 1) · (130 − 11²) = 10.29

σ̂ = s = √10.29 ≈ 3.21
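These point estimates can be reproduced with Python's statistics module, whose variance() and stdev() functions use the divisor n − 1 by default. The sketch below is only an illustration of the calculation above:

import statistics

service_years = [12, 8, 16, 10, 6, 10, 14, 12]

mean_hat = statistics.mean(service_years)      # estimated expected value: 11
var_hat = statistics.variance(service_years)   # sample variance (divisor n - 1): about 10.29
sd_hat = statistics.stdev(service_years)       # estimated standard deviation: about 3.21

print(mean_hat, round(var_hat, 2), round(sd_hat, 2))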
The major disadvantage of point estimators is that you commit yourself to a specific value
for the mean, variance, or standard deviation. Moreover, there is a chance that the true
value of the population may deviate slightly from our value. For this reason, it is common
to fall back on intervals with limits that have a certain probability of containing the true
value of the population.
Intervals consist of a lower limit and an upper limit. This gives us a more reliable estimate of the expected value, since the interval between the two limits allows for many possible values. The "confidence" in "confidence interval" refers to the fact that we can trust that the interval covers or contains the true value in the population with a certain probability. This probability is also called the confidence probability or confidence level. The error probability is the opposite of the confidence probability, and it is written as α. It captures the probability that the true value does not lie within the established confidence interval. Therefore, 1 − α represents the probability that the true value will be in the interval (i.e., the confidence probability).

Confidence interval
A confidence interval specifies a range as an estimator for a parameter of the population.

Confidence probability
The confidence probability, which is also known as the confidence level, indicates the probability that the true parameter lies in the calculated interval.
In the context of this lesson, we will only deal with the establishment of confidence inter-
vals for the expected values of the population. We will leave out intervals for variances or
standard deviations.
So, how do you set up such a confidence interval? We will go through the relevant steps:
1. Determine the confidence probability 1 − α: The most common values for the confidence probability are 0.9 (90%), 0.95 (95%), and, more rarely, 0.99 (99%). Very large confidence probabilities, such as 99%, should be treated with caution: although the resulting intervals include the true expected value with a very high probability, they are very wide and, thus, not very informative.
2. Collect a sample and calculate its mean: At this stage, we need a sample to help cre-
ate our confidence interval. Because we do not want to deal with that, we will take a
given sample. Next, we calculate the mean value of this sample.
3. Mark the area around the mean value: The confidence interval should be set up
around the calculated mean value. Since the distributions for mean values usually
approach a normal distribution, this can also be assumed here.
With these steps in mind, let’s take a look at the following figure.
Figure 45: Confidence Interval for an Expected Value
If we have estimated the mean as μ̂ = x̄ based on the sample of the population, then we can observe that we have now constructed a symmetric interval around this mean. The two limits are marked by two vertical lines; between them, the true value lies with the confidence probability of 1 − α. Finally, if there is a 1 − α probability that the true value lies within the interval, there is also the remaining error probability α that it actually lies outside the interval. Since the interval is symmetric around the mean, α is divided in half (α/2 = 0.5α) between the area below the lower limit of the interval and the area above its upper limit.
Our goal is to calculate the limits of our confidence interval. To do so, we must distinguish
between two cases:
1. Sometimes, we know information about the population. For example, we may know
the variance or standard deviation of the relevant variables at the population level.
2. We may have no known information about the population.
Depending on whether the measures of dispersion in the population are known or not, we
will set up the confidence interval according to a certain pattern (Bortz & Schuster, 2010,
p. 119). Both variants are outlined in the following sections.
Let us assume that we know the variance σ2 or the standard deviation σ of the population.
We can now calculate the confidence interval as follows (Handl & Kuhlenkasper, 2018,
p. 369):
[x̄ − z1−0.5·α · σ/√n ; x̄ + z1−0.5·α · σ/√n]
The calculation to the left of the semicolon gives the lower limit of the interval. Accordingly, the right side of the semicolon gives the upper limit. To calculate either limit, you start from the mean value of the sample x̄ as the estimator of the expected value for the population. For the lower limit, you then subtract something (−), and for the upper limit, you add something (+). Note that what is subtracted and what is added is identical on both sides. This must be the case, given that both limits should lie equally far away from the mean value.
Because we know the variance or standard deviation of the population in this case, we can
next use the relevant quantile of the standard normal distribution. If the variance or stand-
ard deviation is unknown, then we must use a different distribution, which will be dis-
cussed later. Now, let’s have a look at the density function of the standard normal distribu-
tion.
Figure 46: Density Function of the Standard Normal Distribution
We have already seen from the previous figure how the probabilities are distributed over the individual ranges. Thus, we also know that a probability mass of 1 − α/2 = 1 − 0.5α lies below the upper limit of the interval. This is because a mass of α/2 = 0.5α lies to the right of the upper limit. Consequently, we must determine the quantile z1−α/2 = z1−0.5α, which we can use for both the upper limit (with a positive sign) and the lower limit (with a negative sign) due to the symmetry of the standard normal distribution around 0.
Finally, this quantile is multiplied by the quotient of the standard deviation of the population σ and the square root of the sample size n. This quotient σ/√n is also called the standard error. In terms of content, it tells us how much the estimated mean value x̄ deviates from the true value μ.

Standard error
The standard error indicates the deviation of the estimated mean from the true value.
12; 8; 16; 10; 6; 10; 14; 12
Now, we would like to determine the confidence interval within which the true expected value of the average service length lies with a probability of 95%.

• Mean value of the sample x̄. We have already calculated this for the point estimator:

x̄ = (12 + 8 + 16 + 10 + 6 + 10 + 14 + 12)/8 = 11
• Standard deviation σ. We obtain this simply by taking the square root of the known population variance of σ² = 9:

σ = √9 = 3
• Sample size n. Finally, the sample size is given by the number of interviewed
workers in the sample, with n = 8.
Now, we have all the information needed to set up our confidence interval:
[x̄ − z1−0.5·α · σ/√n ; x̄ + z1−0.5·α · σ/√n]
= [11 − 1.96 · 3/√8 ; 11 + 1.96 · 3/√8]
= [8.92 ; 13.08]
With a probability of 95%, we can, therefore, conclude that the true average
service length is between 8.92 and 13.08 years.
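A short illustrative Python sketch (our own, based on the same figures) reproduces this confidence interval; the quantile z0.975 = 1.96 is taken from the standard normal distribution:

from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 11, 3, 8                 # sample mean, known population sd, sample size
alpha = 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{1 - 0.5*alpha}, about 1.96
margin = z * sigma / sqrt(n)

print(round(x_bar - margin, 2), round(x_bar + margin, 2))   # about 8.92 and 13.08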
The construction of a confidence interval is a little different if neither the variance nor the standard deviation of the population is known to us. The basic principle remains the same. There are only two changes: instead of the known standard deviation σ of the population, we use the standard deviation s calculated from the sample, and instead of the quantile of the standard normal distribution, we use the corresponding quantile of the t-distribution.
This results in the following general formula for setting up the confidence interval when
the variance in the population is unknown (Handl & Kuhlenkasper, 2018, p. 371):
[x̄ − tn−1; 1−0.5·α · s/√n ; x̄ + tn−1; 1−0.5·α · s/√n]
Recall that x̄, s, and n should be known to us. However, we should take a closer look at the quantile tn−1; 1−0.5·α. As mentioned elsewhere, the t-distribution is a distribution very similar to the standard normal distribution. It is also distributed around the value 0 but has a dispersion that depends on the sample size. For this reason, when determining the t-value, the sample size is taken into account in addition to the familiar probability 1 − 0.5·α. This is done via the number of degrees of freedom: n − 1 (Nachtigall & Wirtz, 2013, p. 119). We will not go into detail on the meaning of the degrees of freedom here. However, it is important to know that they represent the sample size.
Now, we would like to determine the confidence interval in which the true
expected value of the average service length lies with a probability of 95%.
Area (cumulative probability)
df 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 0.975 0.990 0.995 0.9995
1 0.158 0.325 0.510 0.727 1.000 1.376 1.963 3.078 6.314 12.706 31.821 63.657 636.619
2 0.142 0.289 0.445 0.617 0.816 1.061 1.386 1.886 2.920 4.303 6.965 9.925 31.598
3 0.137 0.277 0.424 0.584 0.765 0.978 1.250 1.638 2.353 3.182 4.541 5.841 12.941
4 0.134 0.271 0.414 0.569 0.741 0.941 1.190 1.533 2.132 2.776 3.747 4.604 8.610
5 0.132 0.267 0.408 0.559 0.727 0.920 1.156 1.476 2.015 2.571 3.365 4.032 6.859
6 0.131 0.265 0.404 0.553 0.718 0.906 1.134 1.440 1.943 2.447 3.143 3.707 5.959
7 0.130 0.263 0.402 0.549 0.711 0.896 1.119 1.415 1.895 2.365 2.998 3.499 5.405
8 0.130 0.262 0.399 0.546 0.706 0.889 1.108 1.397 1.860 2.306 2.896 3.355 5.041
9 0.129 0.261 0.398 0.543 0.703 0.883 1.100 1.383 1.833 2.262 2.821 3.250 4.781
10 0.129 0.260 0.397 0.542 0.700 0.879 1.093 1.372 1.812 2.228 2.764 3.169 4.587
11 0.129 0.260 0.396 0.540 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 4.437
12 0.128 0.259 0.395 0.539 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 4.318
13 0.128 0.259 0.394 0.538 0.694 0.870 1.079 1.350 1.771 2.160 2.650 3.012 4.221
14 0.128 0.258 0.393 0.537 0.692 0.868 1.076 1.345 1.761 2.145 2.624 2.977 4.140
15 0.128 0.258 0.393 0.536 0.691 0.866 1.074 1.341 1.753 2.131 2.602 2.947 4.073
16 0.128 0.258 0.392 0.535 0.690 0.865 1.071 1.337 1.746 2.120 2.583 2.921 4.015
17 0.128 0.257 0.392 0.534 0.689 0.863 1.069 1.333 1.740 2.110 2.567 2.898 3.965
18 0.127 0.257 0.392 0.534 0.688 0.862 1.067 1.330 1.734 2.101 2.552 2.878 3.922
19 0.127 0.257 0.391 0.533 0.688 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.883
20 0.127 0.257 0.391 0.533 0.687 0.860 1.064 1.325 1.725 2.086 2.528 2.845 3.850
21 0.127 0.257 0.391 0.532 0.686 0.859 1.063 1.323 1.721 2.080 2.518 2.831 3.819
22 0.127 0.256 0.390 0.532 0.686 0.858 1.061 1.321 1.717 2.074 2.508 2.819 3.792
23 0.127 0.256 0.390 0.532 0.685 0.858 1.060 1.319 1.714 2.069 2.500 2.807 3.767
24 0.127 0.256 0.390 0.531 0.685 0.857 1.059 1.318 1.711 2.064 2.492 2.797 3.745
25 0.127 0.256 0.390 0.531 0.684 0.856 1.058 1.316 1.708 2.060 2.485 2.787 3.725
26 0.127 0.256 0.390 0.531 0.684 0.856 1.058 1.315 1.706 2.056 2.479 2.779 3.707
27 0.127 0.256 0.389 0.531 0.684 0.855 1.057 1.314 1.703 2.052 2.473 2.771 3.690
28 0.127 0.256 0.389 0.530 0.683 0.855 1.056 1.313 1.701 2.048 2.467 2.763 3.674
29 0.127 0.256 0.389 0.530 0.683 0.854 1.055 1.311 1.699 2.045 2.462 2.756 3.659
30 0.127 0.256 0.389 0.530 0.683 0.854 1.055 1.310 1.697 2.042 2.457 2.750 3.646
40 0.126 0.255 0.388 0.529 0.681 0.851 1.050 1.303 1.684 2.021 2.423 2.704 3.551
60 0.126 0.254 0.387 0.527 0.679 0.848 1.046 1.296 1.671 2.000 2.390 2.660 3.460
120 0.126 0.254 0.386 0.526 0.677 0.845 1.041 1.289 1.658 1.980 2.358 2.617 3.373
z 0.126 0.253 0.385 0.524 0.674 0.842 1.036 1.282 1.645 1.960 2.326 2.576 3.291
Source: Bortz & Schuster (2010, p. 590).
• Mean value of the sample x̄. We have already calculated this for the point estimator as well as the previous confidence interval:

x̄ = (12 + 8 + 16 + 10 + 6 + 10 + 14 + 12)/8 = 11

• Standard deviation of the sample s. This is the square root of the sample variance calculated for the point estimator: s = √10.29 ≈ 3.21.
• Sample size n. Finally, the sample size is given by the number of interviewed workers in the sample, so n = 8.
• Quantile tn−1; 1−0.5·α. With n − 1 = 7 degrees of freedom and a probability of 1 − 0.5 · 0.05 = 0.975, the table gives t7; 0.975 = 2.365.

With this, we have all the information to be able to set up our confidence interval:
[x̄ − tn−1; 1−0.5·α · s/√n ; x̄ + tn−1; 1−0.5·α · s/√n]
= [11 − 2.365 · 3.21/√8 ; 11 + 2.365 · 3.21/√8]
= [8.32 ; 13.68]
We can, therefore, conclude with a probability of 95% that the true average length of service with the company is between 8.32 and 13.68 years, inclusive.
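This interval can be reproduced in code as well. Because the quantile now comes from the t-distribution, the following illustrative sketch assumes that the third-party SciPy library is available:

from math import sqrt
import statistics
from scipy import stats

service_years = [12, 8, 16, 10, 6, 10, 14, 12]
n = len(service_years)
x_bar = statistics.mean(service_years)    # 11
s = statistics.stdev(service_years)       # about 3.21
alpha = 0.05

t = stats.t.ppf(1 - alpha / 2, n - 1)     # t_{7; 0.975}, about 2.365
margin = t * s / sqrt(n)

print(round(x_bar - margin, 2), round(x_bar + margin, 2))   # about 8.32 and 13.68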
Suppose that we want to compare our two confidence intervals. First, we note the follow-
ing with regard to the calculation:
• known variance or standard deviation in the population: use the quantile of the stand-
ard normal distribution (z)
• unknown variance or standard deviation in the population: use the quantile of the t-dis-
tribution and calculate sample variance and standard deviation based on the sample
Note that we have come to different results based on the same sample: with known variance, the interval is [8.92; 13.08]; with unknown variance, it is [8.32; 13.68].
Accordingly, we can state that the interval is wider when we have an unknown variance
than when we have a known variance. This is very typical for two reasons. First, the quan-
tiles of the t-distribution are larger than those of the standard normal distribution, espe-
cially for small samples. Thus, the distance between the two interval boundaries and the
mean becomes larger. Second, the dispersion in the form of s is also often larger when we
have an unknown variance in the population than when we have a known variance. Both
factors ensure that more is subtracted from the mean on one side of the interval and more
is added on the other side. The interval boundaries are, therefore, always further apart
when we’re dealing with unknown variance rather than known variance.
The width of the confidence interval can be influenced in a positive sense. Basically, we
would like to have an interval that is as narrow as possible and based on as high a proba-
bility as possible, since this would mean that we have a very high probability of hitting the
true value within these very narrow limits.
How can this be achieved? One way is to increase the sample size n. The more observa-
tions our sample contains, the more precisely we can calculate the mean and estimate the
interval around it. In other words, the interval becomes narrower. Another way is to
reduce the confidence probability 1 − α. If, for example, we chose a confidence probabil-
ity of 90% instead of 95%, the boundaries of the confidence interval would automatically
move closer together. However, remember that this also reduces the probability that the true value of the population is covered by the interval. This is because if the confidence probability decreases, the error probability must increase.
SUMMARY
In most cases of statistical investigations, we only have samples to work
with. Nevertheless, we still have ways of making statements about pop-
ulations. One way to make such a statement is to estimate the results for
the population based on a sample of it. This can be done using point
estimators, which involves making a concrete estimate for the expected
value, the variance, or the standard deviation.
However, if you do not want to rely on a value for your estimation, then
you can choose a confidence interval instead. A confidence interval
specifies the probability with which the true but unknown parameter is
covered by the interval. Confidence intervals for expected values must
be differentiated into those with known variance in the population and
those with unknown variance in the population. If the variance is known,
a quantile of the standard normal distribution is used to establish the
confidence interval. If it is unknown, the corresponding quantile of the t-
distribution is used instead. It should be noted that a large sample size,
in particular, makes a confidence interval narrower and, thus, more pre-
cise.
UNIT 8
HYPOTHESIS TESTING
STUDY GOALS
Introduction
Suppose that we only have samples to work with but still want to use our results to make
generally valid statements. Recall that we can make use of tools such as point estimators
and confidence intervals for this purpose. However, another option is for us to use hypoth-
esis testing to see if our sample results can be generalized. This will be discussed in the
following sections.
8.1 Methods
All over the world, there are many cases of wage gaps between men and women in the same professions, meaning that men typically earn more money than their female counterparts. This might be one of the reasons why – as is certainly the case in many professions – there are different satisfaction levels among some workers. Suppose that we want to examine this issue in the nursing profession. So, we might come up with the following research questions:

• Do male and female nurses earn different amounts of money?
• Do male nurses earn more money than female nurses?
To further explore our research questions, it would certainly be desirable to ask all nurses – in short, the basic population – about their salaries. However, we are, once again, faced with the common problem that we have to work with a random sample. Often, such a sample is also relatively small. This raises the important question of whether the results obtained from such a sample are generally valid. Inferential statistics is concerned with exactly this question. Namely, is it possible to generalize the sample results to the population, or do the sample results happen to represent a particularity of the sample instead? For example, suppose that we obtain a sample of 50 nurses in which the female nurses earn on average $500 more per month than the male nurses. Crucially, this salary disparity may only be observed in this particular sample and not in the general population. Following Schäfer (2011, pp. 9–14), we must contend with the following questions.

Inferential statistics
You use inferential statistics to test sample results for generality.
Question 1: How can it be feasible to draw conclusions from a sample and apply them
to a population?
For this purpose, the sample must be “representative” of the population. This means that
the ratios in the population, such as the gender distribution, must be reflected as well as
possible in the sample. For instance, if we know that 70% of the nurses in the population
are female, then this should also be the case in the sample. We achieve this by drawing
random samples, and the resulting sample should be as large as possible.
Critically, there is one thing we must be aware of: A sample is always only a section of the
population. In other words, there is a certain probability that the sample does not exactly
reflect the population. If this occurs, the sample results do not actually apply to the gen-
eral population. This leads us to our final two questions.
Question 2: What is the quality of these conclusions? How well can the results of the
sample be generalized to the population?
There are two ways to answer these questions. Our first option is to conduct our analysis
on several samples rather than just one. Therefore, we would need several samples availa-
ble from which a “sampling distribution” can be derived (Sedlmeier & Renkewitz, 2018,
pp. 330–334). This option is rarely applied in practice due to the time required as well as
other financial and scientific reasons. Returning to our example, we would need to con-
duct several studies on “salaries in the nursing sector.” Then, we could use the individual
results to derive a global result. However, for the aforementioned reasons, this is hardly
practicable.
So, let’s move on to the second option: If we have the results of only one sample available,
then we can indicate the probability of our results being wrong, which is common prac-
tice. If, for example, we conclude that men in nursing professions earn on average $200
more per month than women, we should also determine a probability that reflects how
sure we are of this result.
This common approach is put into practice with the help of statistical significance tests, or hypothesis tests. We perform such tests to find out whether sample results can be transferred to the general population and, if so, how well they can be transferred.
If we want to answer a question such as "Do male and female nurses earn different amounts of money?" or "Do male nurses earn more money than female nurses?" and only have a random sample available to us, then we should – as we have just learned – use a statistical test procedure. At the beginning of such a test, the question is first transformed into two completely contrary hypotheses (Bortz & Schuster, 2010, pp. 97–99). The first hypothesis is called the null hypothesis, and it is represented by H0. Applied to our example, if the null hypothesis is true, then whether a nurse is male or female does not cause a difference in their income. Suppose instead that we wanted to test for a correlation between gender and smoking behavior, or whether age has an influence on income. Under the null hypothesis, we would again assume that no such effect exists.

Null hypothesis
The null hypothesis is conservative and proposes that there is no effect. In other words, it always describes the state in which there is no effect.
The opposite situation is described by the alternative hypothesis, which is represented by H1. In the context of this hypothesis, it is always assumed that such an effect exists. Accordingly, we would assume that whether a nurse is male or female has an effect on their income. Alternatively, we would assume that there is a correlation between gender and smoking behavior, or an influence of age on income.

Alternative hypothesis
The alternative hypothesis assumes that there is an effect.
Let’s return to our very first research question: “Do male and female nurses earn different
amounts of money?” With what we just learned in mind, we can create the following pair
of hypotheses:
• H0: Whether a nurse is male or female does not cause differences in income.
• H1: Whether a nurse is male or female causes differences in income.
Note that this is only one example of how to formulate these hypotheses. Under the null
hypothesis, we could also assume that male and female nurses earn the same amount of
money. In such a case, the alternative hypothesis would be that they earn different
amounts.
In most research projects, the alternative hypothesis represents what you want to research or prove. For example, we would probably not ask the research question of whether male and female nurses earn different amounts of money if we did not also assume this to be the case. The next step would be to conduct such a study on as large a sample as possible in order to uncover the effect and, subsequently, be able to counteract it.
However, there are also research questions that aim to prove the null hypothesis. For
example, researchers in the tobacco industry might want to prove the null hypothesis that
smokers and non-smokers have the same health status versus the alternative hypothesis
that smokers may have worse health than non-smokers.
Since there can be many different questions, there can be just as many different types of
hypothesis pairs as a result.
There are some questions that require a test for location parameters. If, for example, we assume that nurses accumulate an average of 10 hours of overtime per week, we would then use a selected sample to test for the average value of 10 hours of overtime as a location parameter. The corresponding hypothesis pair would be as follows:

• H0: Nurses work an average of 10 hours of overtime per week.
• H1: Nurses work an average of more or less than 10 hours of overtime per week.

Test for location parameters
A test for location parameters checks, for example, the presence of a mean value in the population.
Correlation hypothesis

In some situations, we might need to use a correlation hypothesis.

Correlation hypothesis
A correlation hypothesis assumes a relationship exists between two or more variables.

For example, suppose that we want to examine whether there is a relationship between the number of hours worked in the nursing profession and cigarette consumption. Our question might be as follows: "Does the number of hours worked in the nursing profession have an effect on cigarette consumption?" Our hypothesis pair can differ depending on our research question. One formulation is as follows:
• H0: There is no correlation between hours worked and cigarette consumption.
• H1: There is a correlation between hours worked and cigarette consumption.
Another possible formulation is the following:

• H0: The number of hours worked in the nursing profession does not affect cigarette consumption.
• H1: The number of hours worked in the nursing profession affects cigarette consumption.
The word “influence” also falls under the umbrella of terms used in correlation hypothe-
ses. Thus, if we asked ourselves, “Does the number of hours worked have an influence on
cigarette consumption?”, we, just like before, are faced with a correlation hypothesis and
should create the corresponding hypothesis pair.
Difference hypothesis
Another type of hypothesis is a difference hypothesis. A difference hypothesis is useful if you want to investigate the possible differences between two or more groups with respect to a variable. If we consider our original research question, "Do male and female nurses earn different amounts of money?", this leads us to the following pair of difference hypotheses:

• H0: Male and female nurses earn the same amount on average.
• H1: Male and female nurses earn different amounts on average.

Difference hypothesis
A difference hypothesis assumes a difference exists between two or more groups with respect to a variable.
Change hypothesis
There are situations in which a difference hypothesis can also be a change hypothesis. Such hypotheses are often formulated in medical or psychological contexts.

Change hypothesis
A change hypothesis is a difference hypothesis that additionally tests for a change.

Suppose that you form a group of people who all have a certain condition and intend to expose them to a new form of therapy. In such a case, their health status is usually examined at two different points in time: (1) before the therapy and (2) after the therapy. Usually, the research question is then whether their health status has changed as a result of the therapy, which results in the following pair of hypotheses:
• H0: The health status after the therapy is identical to that before the therapy.
• H1: The health status is different after the therapy than before the therapy.
This pair of hypotheses can be used to examine the difference in health status from one
point in time to the next. Specifically, it tests whether this changes for the same group of
individuals. Accordingly, change hypotheses are strongly characterized by the fact that a
single group of objects is observed at several points in time or in different situations.
Directed Versus Undirected Hypotheses
The types of hypotheses just described are all formulated in such a way that the alternative hypothesis does not assume that a specific effect exists in a specific direction. In such a case, we are talking about "undirected" or "two-sided" hypotheses. If we return to our example about nurses working overtime, the number of overtime hours can be below or above 10. Similarly, there can be a positive or negative relationship between hours worked and cigarette consumption. The same is true for the remaining pairs of hypotheses that we discussed. Undirected hypotheses are formulated when previous research does not yet allow for testing in a specific direction.

Undirected hypotheses
Undirected hypotheses do not assume a specific effect direction.

If we have an a priori assumption about the direction of an effect, we can focus on "directed" or "one-sided" hypotheses (Bortz & Schuster, 2010, p. 98).

Directed hypotheses
A directed hypothesis assumes an effect has a specific direction.

So, let's go through the above examples and assume that the effect in each situation has a very specific direction. This allows us to formulate directed hypotheses:

1. Do nurses work "more" than 10 hours of overtime per week on average?
• H0: On average, nurses work “no more” than 10 hours of overtime per week.
• H1: The nurses work on average “more” than 10 hours of overtime per week.
2. Do the hours worked have a "negative" effect on cigarette consumption?
• H0: Hours worked "do not have a negative" effect on cigarette consumption.
• H1: Hours worked have a "negative" effect on cigarette consumption.
3. Do male nurses earn "more" than female nurses on average?
• H0: Male nurses earn on average “at most” as much as female nurses.
• H1: Male nurses earn “more” than female nurses on average.
4. Is the state of health "better" after the therapy than before the therapy?
• H0: The state of health after therapy is “at most” as good as before therapy.
• H1: The state of health is “better” after therapy than before therapy.
With such directed hypotheses, it is important that in the alternative hypothesis, the con-
crete assumption is always formulated in a very specific direction. Accordingly, there will
always be formulations that use terms such as “more than,” “less than,” “positive,” and
“negative” in the alternative hypothesis. In addition to the opposite direction, the null
hypothesis always accounts for equality.
Deciding on a Hypothesis
Once you have formulated a pair of hypotheses as described, the next step is to decide on
one of these two hypotheses. In principle, the decision is always made with respect to the
null hypothesis. One possible result is that the null hypothesis is rejected. This means that
you assume the results obtained by the sample fit the assumption formulated in the alter-
native hypothesis. Accordingly, there seems to be an effect. If the null hypothesis is not
rejected, it can be assumed that the results obtained by the sample do not support the
existence of an effect.
But how do we decide on one of the two hypotheses? Such test procedures can be descri-
bed as “very conservative” (Schäfer, 2011, p. 57). They first assume that (1) the null
hypothesis is the correct hypothesis and (2) the sample data must first convince us other-
wise in order to decide against using the null hypothesis. Accordingly, you must examine
whether the data contained in the sample still justify the null hypothesis or whether it is
better to decide against it. If we determine that the nurses in a sample worked an average
of 11 hours of overtime per week, the question arises as to whether this still fits the null hypothesis, with its expected average of 10 hours of overtime, or whether the deviation of a single hour of overtime is already large enough to decide against the null hypothesis. A
result is “significant” when we manage to reject the null hypothesis and, thus, support the
existence of an effect. The following section explains how this decision occurs during a
statistical test procedure.
No matter which test procedure is carried out and which pair of hypotheses must be tes-
ted, the associated test always consists of five steps. For a better understanding, let’s look
at a concrete example: We would like to find out whether nurses in a specific city work an
average of 10 hours of overtime per week. For this purpose, we asked eight nurses how much overtime they worked in a given week. Our next steps are as follows:
As we have learned in the previous sections, a hypothesis pair is always formulated from
the initial question. In the present example, the null hypothesis that an average of 10
hours of overtime are worked per week will be tested against the alternative hypothesis
that an average of more or less than 10 hours of overtime are worked per week.
When deciding on one of the two hypotheses, we can always make a mistake. As already
discussed, we only have one sample to work with. In this step, the “probability of error” is
determined. This is also referred to as the significance level, or probability of error, which is represented by α (Bortz & Schuster, 2010, pp. 100–101). The significance level describes the probability with which we allow ourselves to make a mistake if we decide against the null hypothesis and, thus, in favor of the alternative hypothesis. The most common significance level is 5%,
which means that there is a 5% probability of committing an error if we decide against the
null hypothesis. At the same time, it means that there is a 95% probability that we have
made the correct decision. A stricter significance level is 1%. Thus, we can be 99% sure
that we have made the right decision to reject the null hypothesis. In comparison, a
weaker significance level is 10%. Such a level is usually chosen when the null hypothesis is
the research hypothesis. In general, careful consideration is required when determining
the significance level. It is crucial to determine this level in this second step and never
change it in order to achieve your desired research results. In our example, we will use the
classic significance level of 5%.
Every statistical test procedure – whether it tests for a location parameter, difference, cor-
relation, or change – involves the calculation of a test statistic. In this step, all relevant information from the sample under consideration is combined into one value (Bortz & Schuster, 2010, p. 101); the exact form of the test statistic depends on the test procedure. This value forms one of two bases for making a decision regarding the two established hypotheses. Since we are testing for a mean value within the framework of a single example, we should apply the z-test or t-test for a sample here. These
tests will be described in more detail later on. In this step, we should calculate the z-statis-
tic or t-statistic based on the available data provided by the eight nurses. Let us assume
that these nurses actually worked an average of 11.5 hours of overtime, resulting in a t
value of 2.2.
The second basis for our decision-making process is either the critical value or the p-value. A critical value is taken from a specific distribution, whereas the p-value is output by statistical programs. Accordingly, there are two possibilities:

1. Using the critical value. Most statistical test procedures are based on certain distributions. For example, when testing for a location parameter such as the mean, we assume that the corresponding variable is normally distributed. Therefore, by calcu-
lating the test variable, we transform the relevant sample data into a standard normal
distribution, which is distributed around the value 0. Here, this value represents the
case where the hypothetical mean value corresponds to the sample mean value.
Therefore, the further the test variable moves away from 0, the less plausible the null
hypothesis seems. The critical value is a cut-off value that marks when the null
hypothesis no longer seems plausible to us. Let’s assume that the critical value in this
case is 1.96. Later on, we’ll clarify how this is determined.
2. Using the p-value. The p-value describes the exceedance probability, meaning that it
indicates the probability that the found sample result – or even a more extreme one –
is valid under the null hypothesis (Benesch, 2013, p. 165). As mentioned previously,
we have assumed that the actual average overtime is 11.5 hours per week. Assuming
that the null hypothesis is initially the correct hypothesis, the p-value indicates the
probability of obtaining such an overtime value or an even larger one. For our exam-
ple, let’s use a p-value of 0.03.
The last step of a statistical test is always to make a decision regarding the null hypothesis.
We can do this using the critical value as well as the p-value:
1. Using the critical value. As a rule, it is necessary to check whether the test variable
exceeds the critical value, and note that there are exceptions depending on the test
procedure and use of directional or non-directional testing. In order to be able to
reject the null hypothesis, the test variable must be greater than the critical value in terms of its magnitude (Bortz & Schuster, 2010, pp. 101–102). Let’s apply our test statistic and critical value, resulting in 2.2 > 1.96. This means that we can reject the null hypothesis.
2. Using the p-value. The second way to make a decision is to combine the significance
level and the p-value. The aim here is to achieve a lower p-value than the significance
level in order to be able to reject the null hypothesis (Schäfer, 2011, p. 59):
Let’s return our example and apply our p-value and α: 0.03 < 0.05. Just like before, we are
able to reject the null hypothesis. Accordingly, we can state that the average of 11.5 hours
obtained from the sample is far “away” enough from the hypothetical average of 10 hours
that we can reject the null hypothesis. Thus, we conclude that nurses work significantly
more overtime hours on average than our hypothesis of 10 hours on average. In this unit,
we will always make such a decision based on the test variable as well as the critical value.
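To make the two decision rules tangible, here is a minimal Python sketch using only the illustrative numbers quoted above for the nurses example (test statistic 2.2, critical value 1.96, p-value 0.03, α = 0.05); the variable names are chosen purely for this illustration.

```python
# Minimal sketch of the two equivalent decision rules from the nurses example.
# The numbers (test statistic 2.2, critical value 1.96, p-value 0.03) are the
# illustrative values quoted in the text, not values computed from real data.

alpha = 0.05           # significance level chosen in the second step
test_statistic = 2.2   # t value reported for the sample of eight nurses
critical_value = 1.96  # cut-off value taken from the reference distribution
p_value = 0.03         # exceedance probability reported for the sample result

# Decision rule 1: compare the (absolute) test statistic with the critical value.
reject_by_critical_value = abs(test_statistic) > critical_value

# Decision rule 2: compare the p-value with the significance level.
reject_by_p_value = p_value < alpha

print(reject_by_critical_value, reject_by_p_value)  # both print True: reject H0
```

Both rules lead to the same decision; they are simply two ways of expressing the same comparison between the sample result and the significance level.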
There are three main factors that can influence a test decision.
Figure 47: Rejection Area During Undirected Testing
We see here – as it is the basis for many tests – the standard normal distribution (Bortz &
Schuster, 2010, pp. 70–74). Under the null hypothesis, we assume that, following our
example, the hypothetical mean of 10 does not differ from the sample mean; the standardized test statistic is then centered around 0. If a deviation from 0 is possible both in an
“upward” (i.e., nurses work more than 10 hours of overtime on average) and “downward”
(i.e., nurses work less than 10 hours of overtime on average) manner due to the undirected
hypotheses, then the rejection range in the form of the significance level is divided equally
between the two sides.
If we use a significance level of 5%, this is split between both sides, with a value of 2.5%
each. In order to reject the null hypothesis, we would have to observe that either the right
or left limit – indicated here with a vertical line – is “exceeded” or “undershot,” respec-
tively. However, if we are testing in a particular direction and want to test, for example,
whether nurses work “more” than 10 hours of overtime on average, then we confine the
rejection area to only one side of the distribution.
Figure 48: Rejection Area During Directed Testing
In this case, the rejection area is on the right side of the distribution, since we assume
there are “more than” 10 hours of overtime. If we now compare the two figures, we can
see which of the two tests makes it easier to reject the null hypothesis. Namely, the path
beyond the critical limit is simply shorter for the directed test than for the two-sided test.
This also makes sense from a content perspective: If we already have a reasonable suspi-
cion in a certain direction, we might have an easier time confirming this direction as well.
The directed test, thus, allows for a faster rejection of the null hypothesis (Schäfer, 2011,
pp. 61–63).
The sample size is another aspect that can either support or hinder the rejection of the
null hypothesis. This is because rejecting the null hypothesis becomes easier as the sam-
ple size increases. Let’s consider our example again and assume that there is an average of
11.5 hours of overtime among the nurses in our sample per week. If we obtained this value from only, say, 20 individuals, we would have to be cautious with our test, since this sample is a very small one. This means that potential outliers can have a large effect and, consequently, distort the result. However, if we have obtained our average of 11.5 overtime hours from a sample of 1,000 nurses, we can be more confident in our test’s calculated mean, as outli-
ers can hardly have an influence on it (Sedlmeier & Renkewitz, 2018, pp. 382–383).
Finally, the choice of the significance level α plays a very crucial role: The higher the α value,
the easier it is to reject the null hypothesis. Let’s look at the previous figure on directional
testing. Imagine that the significance level is not 5% but 10%. This would increase the area
of the rejection region, bringing it closer to the center of 0. Therefore, the path to exceed
the critical limit becomes shorter as well.
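The effect of the significance level on the critical limit can also be checked numerically. The following short Python sketch (assuming SciPy is available) prints the right-sided critical value of the standard normal distribution for three common significance levels; the resulting values match the z row of the quantile table used later in this unit.

```python
# Small sketch: a larger significance level moves the one-sided critical limit
# closer to 0 and, thus, enlarges the rejection region.
from scipy.stats import norm

for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}  ->  right-sided critical value = {norm.ppf(1 - alpha):.3f}")
# alpha = 0.01 -> 2.326, alpha = 0.05 -> 1.645, alpha = 0.10 -> 1.282
```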
Types of Errors in Hypothesis Testing
As we have already discussed, we can never be 100% sure that we have made the right
decision with regard to one of the two hypotheses. If we decide to accept the null hypoth-
esis and it turns out that the null hypothesis is also valid in the population, then we have
made the correct decision. Likewise, if we reject the null hypothesis in order to support
the alternative hypothesis and it turns out that this also applies in the population, then we
have done everything correctly. We find these two cases in the diagonal cells (from the
bottom to the top) in the following figure.
(Figure: the four possible outcomes of a test decision, depending on whether H0 or H1 is actually true in the population.)
On the diagonal in the other direction, we find cells outlining the two situations in which
an error occurs. If we decide on the basis of a sample to reject the null hypothesis but the
null hypothesis is actually true in the population, we have just committed what is known
as a Type I error or an α error, i.e., the erroneous rejection of the null hypothesis (Schäfer, 2011, p. 65). Suppose that we use a sample to conclude that major restructuring measures within a company increase the efficiency of its employees even though this is not the case in reality at all. This is an example of a Type
I error, and it can have serious consequences at one point or another. For instance, if these
restructuring measures necessitate large investments, then the enterprise might make
expenditures that do not actually support its success.
If we make a decision in favor of the null hypothesis based on a sample but the alternative
hypothesis is actually the correct one in the population, then we have committed a Type II
error, also known as a β error, i.e., the erroneous retention of the null hypothesis (Schäfer, 2011, pp. 65–66). Let’s imagine that we are testing an anti-cancer drug as part of a large-scale study. In this context, we test the null hypothesis that the anti-cancer drug has no effect against the alternative hypothesis that it does
have an effect. Now, based on the data from some participants, we conclude that the drug
does not work. Thus, we do not reject the null hypothesis. However, suppose that in the
generality of all cancer patients, the drug does show an effect. Hence, the alternative hypothesis is actually true, and we have committed a Type II error. Namely, we did not attribute an effect to the
drug although it actually has one.
We can see from the two examples that the two errors are of different importance depend-
ing on our question (Sedlmeier & Renkewitz, 2018, p. 384). We must decide case by case
which error weighs more heavily. If it is important for us to keep the Type I error as low as
possible, we should thoughtfully choose α in the described test procedure and set it to, for
example, 1%. Even if we cannot influence the β error directly, we can control it via α, because the two errors are inversely related: the smaller we set α, the larger β becomes, and vice versa. Hence, if we want a low β error, we should choose a comparatively large α.
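The interplay between α, the Type I error, and the Type II error can also be illustrated with a small simulation. The following Python sketch is purely illustrative: it assumes a right-sided one-sample z-test with a hypothetical mean of 10, a known standard deviation of 4, a sample size of 50, and a true mean of 11 under H1; all of these parameter values are chosen freely for demonstration and are not taken from the examples above.

```python
# Illustrative simulation of Type I and Type II error rates for a right-sided
# one-sample z-test (hypothetical mean 10, known standard deviation 4, n = 50).
# All parameter choices here are made up for demonstration purposes.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=1)
mu0, sigma, n, runs = 10.0, 4.0, 50, 20_000

def rejection_rate(true_mean: float, alpha: float) -> float:
    """Share of simulated samples in which H0: mu <= mu0 is rejected."""
    crit = norm.ppf(1 - alpha)                        # right-sided critical value
    samples = rng.normal(true_mean, sigma, size=(runs, n))
    z = np.sqrt(n) * (samples.mean(axis=1) - mu0) / sigma
    return float(np.mean(z > crit))

for alpha in (0.01, 0.05, 0.10):
    type1 = rejection_rate(true_mean=10.0, alpha=alpha)  # H0 is actually true
    power = rejection_rate(true_mean=11.0, alpha=alpha)  # H1 is actually true
    print(f"alpha={alpha:.2f}  Type I rate ~ {type1:.3f}  Type II rate ~ {1 - power:.3f}")
```

The simulated Type I error rate stays close to the chosen α, while the Type II error rate shrinks as α grows, which is exactly the trade-off described above.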
We have already learned in the context of confidence intervals that the variance or stand-
ard deviation of the population plays a decisive role in the way that confidence intervals
are set up. We continue this idea here:
• If we know the variance or standard deviation of the population, we can use the z-test
as well as a quantile of the standard normal distribution for the critical value.
• If we do not know the variance or standard deviation of the population, we must first
determine this ourselves based on a sample and use the t-test with the corresponding t-
distribution for the critical value.
For each individual test, assumptions specifically tailored to the test must always apply.
We first start with the test for an expected value: the z-test, which is used to test for an expected value when the variance in the population is known. Before applying it, we need to make several assumptions:

1. The variable under investigation must be cardinally scaled. This makes sense because
if we want to test for a mean or expected value, then we have to deal with a variable
that consists of a number.
2. The variable under investigation should be normally distributed in the population.
Importantly, this assumption ensures that extreme outliers cannot distort the results.
In the following example, we assume that this assumption is fulfilled. While there are
special test procedures that you can use to test for the presence of the normal distri-
bution, we do not cover these procedures in this section.
3. The data chosen for the test originate from a simple random sample. This assumption
must be fulfilled for any hypothesis test. It is critical that all individuals in the popula-
tion had an equal chance of being included in the sample. Accordingly, the sample
should not contain specifically selected objects.
Process of the z-Test
Suppose that our boss claims that the employees in our company work an average of exactly 7 hours of overtime per week. To check this claim, we ask five employees how many hours of overtime they worked in a given week and receive the following answers:

8; 10; 7; 5; 10
We also know that the variance of overtime hours per week in the population is
4 and overtime is basically normally distributed. Now, the question arises
whether our boss is right about the number of hours. We can formulate our
research question in three different ways:
In any case, since we have a known variance of 4 for the population, we can con-
clude that we must use a z-test.
Two-sided z-test
Suppose that we want to address the following question: “Is the average number of over-
time hours worked by employees per week actually 7, or is it either above or below 7?”
The previous section described the test procedure in general. Now, we will go through the
five steps of the present two-sided test.
The first step is always to set up both the null and alternative hypotheses. This is usually
done in mathematical notation. As we have already learned elsewhere, the Greek letter μ
is used for the expected value. The abbreviation μ0 is, in turn, used for the hypothetical
mean, and it is replaced by a concrete number for each test. In general, for a two-sided
test, we can assume that the expected value is equal to the hypothetical mean μ0 under
the null hypothesis and not equal to μ0 under the alternative hypothesis (i.e., it is either
larger or smaller):
H0 : μ = μ0 versus H1 : μ ≠ μ0
We can also express this differently: The null hypothesis assumes that we are in a popula-
tion in which the expected value is equal to the hypothetical mean value, whereas the
alternative hypothesis assumes a generality in which the expected value is greater or
smaller than the hypothetical mean value.
H0 : μ = 7 versus H1 : μ ≠ 7
The task now is to find out which hypothesis – and, thus, which population – the data from our five examined employees fit best.
RUNNING EXAMPLE: OVERTIME HOURS
In our example, we select a commonly used significance level of α = 0.05. In
terms of content, this means that if the null hypothesis is rejected, we have a 5%
probability of being wrong.
In the case of a z-test, the test variable is z, and it is calculated as follows (Bortz & Schus-
ter, 2010, p. 103):
z = √n · (x̄ − μ0) / σ

n stands for the sample size, x̄ is the mean of the sample, which we have to calculate our-
selves, μ0 is the hypothetical mean, which is given, and σ (sigma) is the standard deviation
known from the population.
8; 10; 7; 5; 10
So, we can state that n = 5. In addition, we know from the case description and
hypotheses that μ0 = 7. Also, we are given the variance of the population as
σ² = 4, which is why the standard deviation can be calculated simply as σ = √4 = 2. Only the mean value must be calculated:

x̄ = (8 + 10 + 7 + 5 + 10) / 5 = 8

z = √n · (x̄ − μ0) / σ = √5 · (8 − 7) / 2 = 1.118
The fact that the result is positive is due to the fact that the mean value of the
sample, which is 8, is greater than the hypothetical mean value, which is 7. If it
were the other way around, we would have a negative result.
Step 4: Defining the critical value
In order to be able to decide whether the test statistic that we just calculated still supports
the null hypothesis or not, we need a reference value. In the case of the z-test, we always
use a quantile of the standard normal distribution. Since testing is done in both directions,
we will end up with both a positive critical value and a negative critical value. We deter-
mine a critical value as follows:
±z1 − 0.5 · α
With the help of α, we can determine the cumulative probability and then find the rele-
vant value in a table of the quantiles of the standard normal distribution.
In our example, with α = 0.05, this gives z1 − 0.5 · 0.05 = z0.975 = 1.96. An excerpt from the table of quantiles of the standard normal distribution:

Cumulative probability p: 0.90 | 0.95 | 0.975 | 0.99 | 0.995
Quantile zp: 1.282 | 1.645 | 1.960 | 2.326 | 2.576
To the right of the central 0 is +1.96, and to the left is −1.96. The null hypothesis
is valid up to these critical limits.
Now, if the test statistic that we calculated in the third step exceeds the critical limit, the
null hypothesis is rejected. The two-sided rule for rejecting the null hypothesis can be
written by placing magnitude bars around the test variable and using the positive critical
value:
reject H0 if |z| > z1 − 0.5 · α
The positive value of the calculated test variable – even if it is actually negative – must,
therefore, be greater than the positive critical value from the table in order for us to be
able to reject the null hypothesis.
In our example, |z| = 1.118 is not greater than 1.96. This means that we cannot reject the null hypothesis. Accordingly, the average number of overtime hours worked per week does not significantly deviate from 7.
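For readers who want to reproduce the calculation, the following minimal Python sketch (assuming SciPy is available) runs through the steps of this two-sided z-test with the example data; it is only a computational check of the numbers above.

```python
# Minimal sketch of the two-sided z-test from the running example:
# data for five employees, hypothetical mean 7, known population variance 4.
import math
from scipy.stats import norm

data = [8, 10, 7, 5, 10]
mu0 = 7.0             # hypothetical mean under H0
sigma = math.sqrt(4)  # known population standard deviation
alpha = 0.05

n = len(data)
x_bar = sum(data) / n

# Step 3: test statistic  z = sqrt(n) * (x_bar - mu0) / sigma
z = math.sqrt(n) * (x_bar - mu0) / sigma

# Step 4: two-sided critical value z_{1 - alpha/2}
z_crit = norm.ppf(1 - alpha / 2)

# Step 5: reject H0 if |z| exceeds the critical value
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {abs(z) > z_crit}")
# Expected: z is about 1.118, the critical value about 1.960, so H0 is not rejected.
```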
Now, let’s address the following question: Is the average number of overtime hours
worked by employees per week more than 7? We, thus, test specifically with the possibility
that there could be more than 7 hours of overtime in mind. This is a directional test and
also, more specifically, a right-sided test. Compared to the previous two-sided test, not
much changes in terms of the steps that we take. We only have to formulate the hypothe-
ses differently and use a different critical value. The rest is analogous to the previous test
procedure. Let’s go through the steps once.
Since we are now testing in a specific direction, this must be accounted for in the hypothe-
ses. If we are performing a right-sided test, we must always assume that under the alterna-
tive hypothesis, the mean value is greater than (>) the hypothetical mean value μ0. Consequently, under the null hypothesis, we assume that the mean value is less than or equal to (≤) the hypothetical mean value μ0:
H0 : μ ≤ μ0 versus H1 : μ > μ0
H0 : μ ≤ 7 versus H1 : μ > 7
Step 2: Determining the significance level
Although we are now using a one-sided test, nothing changes in this step.
The test statistic also remains unchanged:

z = √n · (x̄ − μ0) / σ = 1.118
When setting the critical value, there is now a new feature to keep in mind. Since we are
now only testing in one direction, the rejection region is just on one side of the distribu-
tion – in this case, the right side. Consequently, the rejection range no longer needs to be
halved. Hence, the quantile of the standard normal distribution that we are looking for is
determined as follows:
z1 − α
Once we have arrived at z0.95, we, again, consult a table of the quantiles of the
standard normal distribution. If we look for p = 0.95, the corresponding quan-
tile is 1.6449.
We see that this critical limit is lower than the one that we used for two-sided testing. Con-
sequently, it is easier to reject the null hypothesis. Whether this is successful is addressed
in the next step.
The decision regarding the null hypothesis is almost the same as before. However, we do
not need to put the test variable z in magnitude bars; in a right-sided test, the test variable must be positive (and sufficiently large) in order to reject the null hypothesis:
reject H0 if z > z1 − α
Again, the test variable must be greater than the critical value in order to reject the null
hypothesis.
Since z = 1.118 is not greater than 1.6449, we cannot reject the null hypothesis in this case either. Accordingly, we cannot assume that there are significantly more than 7 overtime hours worked on average by these employees per week.
We also want to test in the other direction: Is the average number of overtime hours
worked by employees less than 7 overtime hours per week? To investigate this, we will test
on the left-hand side, since we hypothesize there are less than 7 hours of overtime on
average per week. Again, we will only need to differ from the general procedure when
making the hypotheses, selecting the critical value, and making our decision.
When setting up the alternative hypothesis, we must now assume that the mean value is smaller than (<) the hypothetical one, μ0. Consequently, under the null hypothesis, the expected mean value is greater than or equal to (≥) the hypothetical mean value μ0:
H0 : μ ≥ μ0 versus H1 : μ < μ0
RUNNING EXAMPLE: OVERTIME HOURS
In the example, the hypothetical value is 7, which is why the hypotheses are as
follows:
H0 : μ ≥ 7 versus H1 : μ < 7
Again, nothing changes in these steps compared with the two-sided and right-sided tests. The test statistic remains

z = √n · (x̄ − μ0) / σ = 1.118.
When setting the critical value, we need to consider the following: According to the alternative hypothesis, we expect to obtain a smaller mean than the hypothesized one. Accordingly, if we look at the test variable above, we can expect that x̄ − μ0 will result in something negative and, thus, the test variable will be negative overall. For this reason, the critical value in left-sided testing is always negative. While it is identical in number to that for right-sided testing, it is given a negative sign:
−z1 − α
RUNNING EXAMPLE: OVERTIME HOURS
With α = 0.05, our critical value results in −z1 − α = −z0.95. If we search for p = 0.95 in a table of the quantiles of the standard normal distribution, we discover that the corresponding quantile is 1.6449. Finally, we assign a negative sign to this value, so the critical value is −1.6449.
When making the decision, we now have two options. We can stay in the negative range
and see if the test variable is smaller than the negative critical value to reject the null
hypothesis:
reject H0 if z < − z1 − α
The alternative here is to take advantage of the symmetry of the standard normal distribu-
tion around 0. We can, therefore, place both the test variable and the critical value in mag-
nitude bars and make the decision in the positive range:
reject H0 if |z| > |−z1 − α|
The test statistic is, therefore, not smaller than the negative critical value, which
is why the null hypothesis cannot be rejected. Recall that the second possible
decision rule uses magnitude bars, which means that we come to the following result:

|z| = 1.118 ≯ 1.6449 = |−z0.95|
The test variable, which was already positive anyway, is not larger than the posi-
tive critical value. The null hypothesis can, therefore, not be rejected, and we
conclude that there are not significantly less than 7 overtime hours worked on
average per week.
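The two directed z-tests can also be checked numerically. The following short Python sketch (again assuming SciPy is available) reuses the test statistic z = 1.118 calculated above and compares it with the one-sided critical value in both directions.

```python
# Minimal sketch of the right- and left-sided z-test decisions for the same
# overtime example (z = 1.118 from the calculation above, alpha = 0.05).
from scipy.stats import norm

z = 1.118
alpha = 0.05
z_crit = norm.ppf(1 - alpha)   # one-sided critical value, about 1.645

reject_right = z > z_crit      # H1: mu > 7
reject_left = z < -z_crit      # H1: mu < 7

print(f"critical value = {z_crit:.4f}")
print(f"right-sided test rejects H0: {reject_right}")  # False
print(f"left-sided test rejects H0: {reject_left}")    # False
```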
If the variance in the population is unknown, we use the t-test instead of the z-test. Compared to the z-test, only three things change:

• the independent calculation of the sample variance and standard deviation based on
the sample,
• the calculation of the test statistic, called the t-statistic, and
• the use of a critical value from the t-distribution.
The rest of the procedure remains completely identical to that of the z-test. Also, all the
assumptions mentioned in the z-test still apply here. We want to use the same example
again, but, this time, suppose that we do not know the variance in the population.
8; 10; 7; 5; 10
The boss, just like before, thinks that there is an average of exactly 7 overtime
hours worked per week.
We will now go through the three possible questions and, thus, get to know the two-sided
as well as the left- and right-sided test procedures. If something should already be known
to us, we will only briefly cover it in this section.
Two-Tailed t-Test
We now want to address the following question: Is the average number of overtime hours
worked by employees per week actually 7, or is it above or below 7?
Step 1: Setting up the hypotheses

As in the two-sided z-test, we test H0 : μ = 7 against H1 : μ ≠ 7, and we again use the significance level α = 0.05. Since we are dealing with a t-test, we will now use a t-statistic as the test variable. Except for the denominator used in its calculation, the t-statistic is exactly the same as the z-statistic (Bortz & Schuster, 2010, p. 118):

t = √n · (x̄ − μ0) / s
To obtain the test statistic, we have to calculate the standard deviation s based on the
sample.
• n = 5
• μ0 = 7
• x̄ = (8 + 10 + 7 + 5 + 10) / 5 = 8

For the sample variance, we also need the mean of the squared observations:

(8² + 10² + 7² + 5² + 10²) / 5 = 338 / 5 = 67.6

s² = 5 / (5 − 1) · (67.6 − 8²) = 4.5

s = √4.5 ≈ 2.12
With this, we have all the information needed to calculate the test statistic:
t = √n · (x̄ − μ0) / s = √5 · (8 − 7) / 2.12 ≈ 1.05
This step is analogous to the one for the z-test, except that we now use the t-distribution.
From what we’ve learned about confidence intervals, we know that the t-distribution
depends on the degrees of freedom (that is, df = n − 1) in addition to the significance
level.
With n = 5 and α = 0.05, the critical value is the quantile t4;0.975 of the t-distribution, which we can read from the following table: t4;0.975 = 2.776.
Table 37: Quantiles of the t-Distribution (2)
Area*
df 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 0.975 0.990 0.995 0.9995
1 0.158 0.325 0.510 0.727 1.000 1.376 1.963 3.078 6.314 12.706 31.821 63.657 636.619
2 0.142 0.289 0.445 0.617 0.816 1.061 1.386 1.886 2.920 4.303 6.965 9.925 31.598
3 0.137 0.277 0.424 0.584 0.765 0.978 1.250 1.638 2.353 3.182 4.541 5.841 12.941
4 0.134 0.271 0.414 0.569 0.741 0.941 1.190 1.533 2.132 2.776 3.747 4.604 8.610
5 0.132 0.267 0.408 0.559 0.727 0.920 1.156 1.476 2.015 2.571 3.365 4.032 6.859
6 0.131 0.265 0.404 0.553 0.718 0.906 1.134 1.440 1.943 2.447 3.143 3.707 5.959
7 0.130 0.263 0.402 0.549 0.711 0.896 1.119 1.415 1.895 2.365 2.998 3.499 5.405
8 0.130 0.262 0.399 0.546 0.706 0.889 1.108 1.397 1.860 2.306 2.896 3.355 5.041
9 0.129 0.261 0.398 0.543 0.703 0.883 1.100 1.383 1.833 2.262 2.821 3.250 4.781
10 0.129 0.260 0.397 0.542 0.700 0.879 1.093 1.372 1.812 2.228 2.764 3.169 4.587
11 0.129 0.260 0.396 0.540 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 4.437
12 0.128 0.259 0.395 0.539 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 4.318
13 0.128 0.259 0.394 0.538 0.694 0.870 1.079 1.350 1.771 2.160 2.650 3.012 4.221
14 0.128 0.258 0.393 0.537 0.692 0.868 1.076 1.345 1.761 2.145 2.624 2.977 4.140
15 0.128 0.258 0.393 0.536 0.691 0.866 1.074 1.341 1.753 2.131 2.602 2.947 4.073
16 0.128 0.258 0.392 0.535 0.690 0.865 1.071 1.337 1.746 2.120 2.583 2.921 4.015
17 0.128 0.257 0.392 0.534 0.689 0.863 1.069 1.333 1.740 2.110 2.567 2.898 3.965
18 0.127 0.257 0.392 0.534 0.688 0.862 1.067 1.330 1.734 2.101 2.552 2.878 3.922
19 0.127 0.257 0.391 0.533 0.688 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.883
20 0.127 0.257 0.391 0.533 0.687 0.860 1.064 1.325 1.725 2.086 2.528 2.845 3.850
21 0.127 0.257 0.391 0.532 0.686 0.859 1.063 1.323 1.721 2.080 2.518 2.831 3.819
22 0.127 0.256 0.390 0.532 0.686 0.858 1.061 1.321 1.717 2.074 2.508 2.819 3.792
23 0.127 0.256 0.390 0.532 0.685 0.858 1.060 1.319 1.714 2.069 2.500 2.807 3.767
24 0.127 0.256 0.390 0.531 0.685 0.857 1.059 1.318 1.711 2.064 2.492 2.797 3.745
25 0.127 0.256 0.390 0.531 0.684 0.856 1.058 1.316 1.708 2.060 2.485 2.787 3.725
26 0.127 0.256 0.390 0.531 0.684 0.856 1.058 1.315 1.706 2.056 2.479 2.779 3.707
27 0.127 0.256 0.389 0.531 0.684 0.855 1.057 1.314 1.703 2.052 2.473 2.771 3.690
28 0.127 0.256 0.389 0.530 0.683 0.855 1.056 1.313 1.701 2.048 2.467 2.763 3.674
29 0.127 0.256 0.389 0.530 0.683 0.854 1.055 1.311 1.699 2.045 2.462 2.756 3.659
30 0.127 0.256 0.389 0.530 0.683 0.854 1.055 1.310 1.697 2.042 2.457 2.750 3.646
40 0.126 0.255 0.388 0.529 0.681 0.851 1.050 1.303 1.684 2.021 2.423 2.704 3.551
60 0.126 0.254 0.387 0.527 0.679 0.848 1.046 1.296 1.671 2.000 2.390 2.660 3.460
120 0.126 0.254 0.386 0.526 0.677 0.845 1.041 1.289 1.658 1.980 2.358 2.617 3.373
z 0.126 0.253 0.385 0.524 0.674 0.842 1.036 1.282 1.645 1.960 2.326 2.576 3.291
Again, if the test variable exceeds the critical limit, the null hypothesis is rejected. We can
write the rule for rejecting the null hypothesis for both sides by placing magnitude bars
around the test variable:
reject H0 if |t| > tn − 1;1 − 0.5 · α
The positive value of the calculated test statistic (even if it is actually negative) must there-
fore be greater than the positive critical value from the table in order to be able to reject
the null hypothesis.
The test value |t| = 1.05 is, therefore, not greater than the critical value of 2.776, which is why the null hypothesis cannot be rejected. Even after this test, it appears that the average number of overtime hours per week does not deviate significantly from 7.
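As a computational cross-check, the following Python sketch (assuming SciPy is available) recalculates the sample standard deviation, the t-statistic, and the critical value for this two-sided t-test, and compares the result with SciPy’s built-in one-sample t-test; the built-in function additionally returns a p-value, which the example above does not use.

```python
# Minimal sketch of the two-sided t-test for the same data, now with the
# population variance treated as unknown (sample standard deviation instead).
import math
from scipy import stats

data = [8, 10, 7, 5, 10]
mu0, alpha = 7.0, 0.05
n = len(data)
x_bar = sum(data) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))  # sample std, about 2.12

t_stat = math.sqrt(n) * (x_bar - mu0) / s        # about 1.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # quantile t_{4; 0.975}, about 2.776
print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, reject H0: {abs(t_stat) > t_crit}")

# Cross-check with SciPy's built-in one-sample t-test (same t value, plus a p-value):
res = stats.ttest_1samp(data, popmean=mu0)
print(f"scipy t = {res.statistic:.3f}, p-value = {res.pvalue:.3f}")
```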
With the knowledge we have gained so far, let’s just go through the directed tests very
briefly. If we want to investigate “more than 7 hours overtime on average per week,” we
perform a right-sided test. If we also don’t know the variance in the population, we must
conduct a t-test. So, we go through the following steps:
1. The hypotheses are formulated as for the right-sided z-test (H0 : μ ≤ 7 vs. H1 : μ > 7).
2. The significance level will be given with a value (α = 0.05).
3. The test statistic is calculated just like it is for a two-sided t-test (t = 1.05).
4. Following the z-test, the critical limit is defined here with the corresponding degrees
of freedom as tn − 1;1 − α. Note that the rejection region α is now only on the right side
of the distribution: t4;0.95 = 2.132
5. The null hypothesis is rejected exactly when t > tn − 1;1 − α. We then observe the fol-
lowing: t = 1.05 ≯ 2.132 = t4;0.95. So, the null hypothesis cannot be rejected.
If we test for “less than 7 hours of overtime on average per week,” again we end up with a
left-sided t-test, since the variance in the population is unknown. Here, too, we would like
to briefly summarize the essential steps:
1. The hypotheses are formulated as for the left-sided z-test (H0 : μ ≥ 7 vs. H1 : μ < 7).
2. The significance level will be given with a value (α = 0.05).
3. The test statistic is calculated just like it is for a two-sided t-test (t = 1.05).
4. Following the z-test, the critical limit is defined here with the corresponding degrees
of freedom as −tn − 1;1 − α. Note that the rejection region α is now only on the left
side of the distribution: −t4;0.95 = − 2.132
5. The null hypothesis is rejected exactly when either t < −tn − 1;1 − α or |t| > |−tn − 1;1 − α| holds. In this case, we note that t = 1.05 ≮ −2.132 = −t4;0.95 and |t| = 1.05 ≯ 2.132 = |−t4;0.95|. This means that the null hypothesis cannot be rejected.
We have now become acquainted with two different test procedures. Both test for an
expected value. We should be able to distinguish between them and decide whether to
use a two-sided, left-sided, or right-sided test.
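Both directed t-tests can also be reproduced with SciPy’s one-sample t-test, this time using the p-value as the decision criterion. The following sketch assumes a reasonably recent SciPy version (the alternative argument was added in version 1.6) and is only meant as a cross-check of the decisions derived above.

```python
# Minimal sketch of the directed t-tests via SciPy's one-sample t-test.
# The `alternative` argument requires SciPy 1.6 or newer.
from scipy import stats

data = [8, 10, 7, 5, 10]
mu0, alpha = 7.0, 0.05

for alternative in ("greater", "less"):           # right-sided, then left-sided
    res = stats.ttest_1samp(data, popmean=mu0, alternative=alternative)
    print(f"H1: mu {alternative} {mu0}: t = {res.statistic:.3f}, "
          f"p = {res.pvalue:.3f}, reject H0: {res.pvalue < alpha}")
# With t of about 1.05 and t4;0.95 = 2.132, neither directed test rejects H0.
```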
SUMMARY
Suppose that we only have a sample but we would like to make gener-
ally valid statements. A hypothesis test or significance test can be used
to check whether the sample results can be generalized to the population. The starting point of each test is a hypothesis pair, with a null and an alternative hypothesis. These hypotheses describe two completely contrasting populations. The purpose of the test is to find out which of these populations the given sample fits best.
BACKMATTER
LIST OF REFERENCES
Bamberg, G., Baur, F., & Krapp, M. (2022). Statistik [Statistics] (19th ed.). De Gruyter Olden-
bourg Verlag.
Bortz, J., & Schuster, C. (2010). Statistik für Human- und Sozialwissenschaftler [Statistics for
human and social scientists] (7th ed.). Springer Verlag.
Fahrmeir, L., Heumann, C., Künstler, R., Pigeot, I., & Tutz, G. (2016). Statistik: Der Weg zur
Datenanalyse [Statistics: The path to data analysis] (8th ed.). Springer Spektrum.
Handl, A., & Kuhlenkasper, T. (2018). Einführung in die Statistik – Theorie und Praxis mit R [Introduction to statistics – Theory and practice with R]. Springer Spektrum.
Schäfer, T. (2011). Statistik II: Inferenzstatistik [Statistics II: Inferential statistics]. VS Verlag.
Sedlmeier, P., & Renkewitz, F. (2018). Forschungsmethoden und Statistik für Psychologen
und Sozialwissenschaftler [Research methods and statistics for psychologists and
social scientists] (3rd ed.). Pearson Verlag.
LIST OF TABLES AND
FIGURES
Figure 1: Variable Classification by Scales of Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Table 14: Frequency Table for Age (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Table 20: Contingency Table With Absolute Frequencies for the Variables of Gender and
Smoking Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Table 21: Contingency Table With Only Marginal Frequencies for the Variables of Gender
and Smoking Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Table 23: Initial Data on the Relationship Between Satisfaction With Care Robots and Satis-
faction With Health Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Table 24: Ranks of the Relationship Between Satisfaction With Care Robots and Satisfac-
tion With Health Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Table 25: Auxiliary Table for the Calculation of the Rank Correlation Coefficient (1) . . . . . 71
Table 26: Auxiliary Table for the Calculation of the Rank Correlation Coefficient (2) . . . . . 71
Table 27: Initial Data for the Relationship Between the Age of the Mothers and Fathers of
Young Patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 9: Scatter Plot for the Correlation Between the Age of the Mothers and Fathers of
Young Patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Figure 11: Scatter Plot for the Correlation Between the Age of the Mothers and Fathers of
Young Patients Divided Into Quadrants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Table 28: Auxiliary Table for the Calculation of the Correlation Coefficient . . . . . . . . . . . . . 78
Figure 13: Scatter Plot With a Correlation Coefficient of -1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Figure 16: Scatter Plot for the Initial Data of the Simple Linear Regression . . . . . . . . . . . . . 90
Figure 18: Scatter Plot for the Initial Data Including Regression Line of the Simple Linear
Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Figure 21: Venn Diagram for the Union Set of A and B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Figure 22: Venn Diagram for the Union Set of Players A and B . . . . . . . . . . . . . . . . . . . . . . 109
Figure 23: Union Set for Passing at Least One Examination . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Figure 24: Venn Diagram for the Intersection Set ofA and B . . . . . . . . . . . . . . . . . . . . . . . . . 111
Figure 29: Venn Diagram for the Probability of the Union Set . . . . . . . . . . . . . . . . . . . . . . . . 118
Figure 30: Venn Diagram for the Probability of the Difference A\B . . . . . . . . . . . . . . . . . . 120
Figure 33: Density Function with a Waiting Time of Maximum Three Minutes . . . . . . . . . 130
Figure 34: Density Function with a Waiting Time of Four to Six Minutes . . . . . . . . . . . . . . . 131
Figure 35: Density Function with a Waiting Time of At Least Six Minutes . . . . . . . . . . . . . . 132
Figure 36: Representation of the Density Function for the Travel Time . . . . . . . . . . . . . . . 144
Figure 37: Density Function for a Travel Time of at Most 42 Minutes . . . . . . . . . . . . . . . . . . 147
Figure 38: Density Function for a Travel Time of Less Than 36 Minutes . . . . . . . . . . . . . . . 149
Figure 39: Density Function for a Travel Time of at Least 42 Minutes . . . . . . . . . . . . . . . . . 150
Figure 40: Density Function for a Travel Time From 39 to 42 Minutes . . . . . . . . . . . . . . . . . 152
Figure 41: Density Function for a Travel Time That is Not Exceeded on 95% of Days . . . . 153
Figure 42: Density Function for the Travel Time That is Exceeded on 90% of the Days . . 155
Figure 43: Density Function for the Central Variation Interval with 90% of All Travel Times
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt
Mailing Address
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
[email protected]
www.iu.org