Lecture04 Prel
Lecture04 Prel
Shifting a variable
Let x1 , x2 , · · · , xN be the observations of a variable x in a population U , let a be a constant
and let
y i = xi + a for all i = 1, 2, · · · , N.
In other words, y is simply the same as x but adding some constant. Let us consider, for
instance, the age (in years) of ten individuals as of December 31, 2018 (= x), and the age (in
years) of the same individuals as of December 31, 2023 (= y) given in Table 9 and illustrated
by dotplots in Figure 14. Thus yi = xi + 5.
x 26 29 31 32 34 37 38 39 40 46
y 31 34 36 37 39 42 43 44 45 51
Table 9: Age of ten individuals as of December 31, 2018 (x) and December 31, 2023 (y)
By looking at Figure 14 we can see the effect that adding a constant to a variable has:
it shifts the observations a units. It also gives an intuition about what will happen with the
different parameters that we have studied so far. Intuitively, all location measures are equally
shifted, whereas variability and shape measure will remain unaffected. It turns out that this
intuition is correct. Table 10 shows the different parameters that we have studied for both x
and y in Table 9.
Type Parameter x y
First quartile 31.25 36.25
Mean 35.2 40.2
Location Median 35.5 40.5
Third quartile 38.75 43.75
Range 20 20
Variability IQR 7.5 7.5
Variance 35.29 35.29
Standard deviation 5.94 5.94
Shape Skewness 0.16 0.16
Table 10: Descriptive statistics of ten individuals in Table 9
22
Scaling a variable
Let us now consider a second situation. Let x1 , x2 , · · · , xN be the observations of a variable
x in a population U , let b be a constant and let
yi = b xi for all i = 1, 2, · · · , N.
In other words, y is simply the same as x but multiplied by a constant. Let us consider, for
instance, the price of ten cell phones in a particular store in Swedish Krona SEK (= x) and in
Czech Koruna CZK (= y) given in Table 11 and illustrated by dotplots in Figure 15. Taking
into account that (today) one Czech Koruna is equivalent to 2.17 Swedish Kronor, we have
yi = 2.17 xi .
x 2000 7000 8500 9800 11500 14500 16000 16500 17500 20500
y 4340 15190 18445 21266 24955 31465 34720 35805 37975 44485
Table 11: Price of ten cell phones in SEK (x) and CZK (y)
Figure 15: Dotplots of the price of ten cell phones Table 11.
By looking at Figure 15 we can see the effect that multiplying a variable by a constant has:
it scales the observations by a factor of b. It also gives an intuition about what will happen
with the different parameters that we have studied so far. Intuitively, all location measures are
equally scaled, variability measures are also scaled by the same factor b, with one exception:
the variance. The variance is scaled by a factor of b2 . The skewness remain unaffected.Table
12 shows the different parameters that we have studied for both x and y in Table 11.
Type Parameter x y
First quartile 8825 19150
Mean 12380 26860
Location Median 13000 28210
Third quartile 16375 35530
Range 18500 40150
Variability IQR 7550 16380
Variance 31 770 000 149 600 000
Standard deviation 5636 12230
Shape Skewness -0.3094 -0.3094
Table 12: Descriptive statistics of the price of ten cell phones in Table 11
23
Let us now consider a third situation in which we combine the two situations above. Let
x1 , x2 , · · · , xN be the observations of a variable x in a population U , let a and b be two
constants and let
yi = b(x + a) for all i = 1, 2, · · · , N.
In other words, y the same as x but adding a constant and then multiplied by another constant.
Let us consider, for instance, the temperatures in a weather station in Sweden measured at
twelve different time points over a year in Fahrenheit (= x) and Celsius (y) given in Table 13
and illustrated by dotplots in Figure 16. Remember that yi = 95 (xi − 32).
x -0.4 19.4 26.6 32.0 44.6 64.4 73.4 71.6 66.2 51.8 30.2 23.0
y -18 -7 -3 0 7 18 23 22 19 11 -1 -5
Table 13: Temperature in a weather station at twelve time points in Fahrenheit (= x) and
Celsius (y)
Figure 16: Dotplots of the temperature at twelve time points in Table 13.
By looking at Figure 16 we can see the effect that adding a constant a and multiplying
by another one b has on a variable x: first, the observations are shifted a units and then
they are scaled by a factor of b. It also gives an intuition about what will happen with the
different parameters that we have studied so far. Intuitively, all location measures are equally
shifted and scaled, variability measures will be scaled by the factor b, with one exception: the
variance. The variance is scaled by a factor of b2 . The skewness remains unaffected. Table 14
shows the different parameters that we have studied for both x and y in Table 13.
Type Parameter x y
First quartile 25.70 -3.50
Mean 41.90 5.50
Location Median 38.30 3.50
Third quartile 64.85 18.25
Range 73.80 41.00
Variability IQR 39.15 21.75
Variance 563.5 173.9
Standard deviation 23.74 13.19
Shape Skewness -0.0984 -0.0984
Table 14: Descriptive statistics of the temperature at twelve time points in Table Table 13
24
We have
i 1 2 3 4 5 6 7 8 9 10
xi 8 15 5 36 40 30 9 21 32 27
zi -1.14 -0.58 -1.38 1.09 1.41 0.61 -1.06 -0.10 0.77 0.38
Table 15: Points of ten students in an exam in Statistics and their z-scores
Another use of z-scores is for comparing observations from different variables, possibly
from different populations. For instance, let us say that the next year, the course of statistics
was taken by eight students. One of the students got 41 points in the exam. In absolute terms,
evidently, this value is higher than 40, which was the maximum score during the previous year,
but it may be that the exam was easier, right? By standardizing both populations we can
establish “how good was each student with respect to their own populations”.
25
Example 31 (Continuation of Example 30). Let U2 be the population of eight students who
took the master course in statistics the next year. The first row of Table 16 shows their points
in the exam. The average number of points during this year was x̄U2 = 35 and the standard
deviation was Sx,U2 = 7.01. Thus, we obtain the z-scores shown in the second row of Table
16.
i 1 2 3 4 5 6 7 8
xi 48 29 26 32 41 37 35 32
zi 1.85 -0.86 -1.28 -0.43 0.86 0.29 0.00 -0.43
Table 16: Points of eight students in an exam in Statistics and their z-scores
The mean during the second year was much larger than during the first year. A result
of 40 points in the exam during the first year is 1.41 standard deviations over the mean, but
a result of 41 during the second year is only 0.86 standard deviations over the mean. Thus,
compared to their own populations, the student with 40 points performed better than the one
with 41 points.
Although z-scores allow for comparing observations between different variables and differ-
ent populations, we should be cautious with the interpretation. Let us take a look back at
Examples 30 and 31. The mean during the first year was 22.3 points, whereas the exam during
the second year was 35 points. We do not know if this is due to the exam being easier or to the
students being better prepared for the exam. All we can say with the z-scores is that “with
respect to their own population” a result of 40 during the first year was more remarkable than
a result of 41 during the second one.
Empirical rule
In large populations, a variable x with mean x̄U and standard deviation Sx,U which is
symmetric, unimodal and bell-shaped will satisfy:
Later in the course we will learn where does this empirical rule come from.
Example 32. Let us consider the population of N = 97 startups in Section 2.1.4. The
mean and standard deviation of the number of employees, x, are x̄U = 4.97 and Sx,U = 2.28,
respectively. Figure 17 shows the dotplot of x. We see that the variable is unimodal but it is
not exactly symmetric. Nevertheless, let us see how well does the empirical rule work in this
case.
26
• approximately 68% of the observations lie in the interval [4.97 − 2.28 , 4.97 + 2.28] =
[2.69 , 7.24]. One can verify that, in fact, 71.1% of the observations lie in this interval;
• approximately 95% of the observations lie in the interval [4.97 − 2 · 2.28 , 4.97 + ·2.28] =
[0.42 , 9.52]. One can verify that 96.9% of the observations lie in this interval;
• almost all of the observations lie in the interval [4.97 − 3 · 2.28 , 4.97 + 3 · 2.28] =
[−1.86 , 11.79]. One can verify that all of the observations lie in this interval.
Note that in this case even when the shape of the variable does not exactly satisfy the condi-
tions for the empirical rule, it still works pretty well.
Bar charts
In a bar chart the categories of the variable of interest are placed along one of the axes (typ-
ically the horizontal axis), with the another axis representing the frequency of each category.
Bar charts are useful for draw attention to the frequency of the categories.
• You can plot either the absolute or the relative frequency. The resulting plot is identical
except for the scale.
• The bars should not touch each other and it is important that all bars have the same
width;
• If the variable is ordinal, the categories must be sorted either in ascending or descending
order.
• If the variable is not ordinal, it is common practice to sort the categories from the most
frequent to the least frequent or vice versa.
• It is important to always label the axes, otherwise a reader may not know what is being
shown in the chart.
27
Absolute Relative
Value frequency frequency
F 51 0.425
E 12 0.100
D 23 0.192
C 18 0.150
B 11 0.092
A 5 0.042
Total 120 1
Table 17: Frequency distribution table of the grades of 120 students in an exam in Statistics
Example 33. Let U be the population of N = 120 students taking a course in statistics. Let
xi be the grade in the exam (A, B, C, D, E, F) for the ith student (i = 1, 2, · · · , N ). Table 17
shows the frequency distribution table of x.
Figure 18 shows a bar plot of the grades in the exam of the N = 120 students.
The bars should start at zero, otherwise the chart will be misleading as the differences
between categories will look bigger than they actually are, thus we would be violating the
area principle. If the intention is, precisely, to draw attention to these differences, the scale
may be changed, but it is important to make this clear to the reader. For example, Figure 19
is a bar plot of the number of employees of a company by sex. By looking at the plot we get
the impression that there are around three times more men than women in the company, but
after a closer look we see that a misleading scale has been chosen. In fact there are 102 men
and 98 women, so the relative difference is not as large as the plot may incorrectly suggest.
Figure 20 shows a bar plot of the same data with the vertical axis starting at zero.
Pie charts
If we want to draw attention to the proportion of elements in each category, then we will
probably use a pie chart to depict the division of a whole into its constituent parts. The circle
(or “pie”) represents the total, and the segments (or “pieces of the pie”) cut from its center
depict shares of that total. The pie chart is constructed so that the area of each segment is
proportional to the corresponding frequency.
Example 34. Figure 21 shows a pie chart of the grades of the 120 students in Example 39 in
a final exam.
28
Figure 19: Bar chart of the the number of men and women in a company.
Figure 20: Bar chart of the the number of men and women in a company.
Figure 21: Pie chart of the grades of 120 students in an exam and an assignment.
It should be noted that the order of the categories is lost in a pie chart, therefore it may
not be the best choice for ordinal variables. In this sense, instead of using a pie chart for the
grades of the students as in the example above, a bar chart may be more adequate.
Dot plots
By now, we should be familiar with dotplots. We have it extensively to illustrate the
parameters that were introduced in Subsection 2.1.
29
In a dotplot the observations are represented as dots (or other symbols, like line segments)
over a number line. If one value occurs multiple times, we just stack them over each other.
Dotplots are simple to create and (hopefully) easy to interpret. However, it is often said
that they are useful for illustrating small to moderate populations. For instance, R docu-
mentation for the function stripchart (which allows for creating dotplots) says “These plots
are a good alternative to boxplots when sample sizes are small” and Wikipedia’s page says
that dotplots “are suitable for small to moderate sized data sets[...] When dealing with larger
datasets[...]dotplots may become too cluttered”.
Example 35. The following are the number of points obtained by the 120 students in Example
33:
10 87 40 20 47 40 40 94 48 15
15 66 66 15 5 18 37 29 92 64
93 70 78 45 59 41 68 42 68 93
28 85 18 63 15 15 86 71 40 32
75 64 37 53 25 76 11 35 63 50
52 63 73 79 13 16 83 74 15 60
81 78 20 80 80 66 82 5 20 79
75 10 68 61 63 63 61 15 50 88
76 33 50 57 70 61 9 0 84 77
15 60 27 94 34 20 75 50 76 34
5 58 42 73 20 36 40 83 58 55
28 55 30 60 73 42 65 69 61 61
Figure 22 represents a dotplot of the number of points obtained by the 120 students in the
final exam.
Histograms
A histogram is a graph that consists of vertical bars constructed on a horizontal line that
is marked off with intervals for the variable being displayed. The intervals correspond to the
classes in a frequency distribution table. It is important that all intervals have the same width,
otherwise the result may be misleading as we would be violating the area principle. If it is
not possible to create intervals with the same width (for instance, if the classes are given), it
should be taken into account that the area of each bar must be proportional to its frequency.
Bars representing two categories that are adjacent should touch each other. As always, it
is important to use labels for the axis.
In order to determine the number and the width of the categories we simply repeat the
recommendations given in Section 2.1.4 when we introduced frequency distribution tables. √
Regarding the number of categories, a rule of thumb (which I often use) is to set K ≈ N
classes. Regarding the width of the classes, it can be defined as rangex,U /K, where rangex,U
is given by (14). However, good sense and some flexibility is needed for obtaining a “nice”
30
presentation. Finally, it is very important to make sure that the categories are inclusive and
nonoverlapping, so that every observation belongs to one and only one category.
Example 36. Figure 23 represents a histogram of the number of points obtained by the 120
students in the final exam.
Box-and-Whisker plot
A box-and-whisker plot is a graph that describes the shape of a variable in terms of five
parameters: the minimum value x(1) , the first quartile (25th percentile) x̆25,U , the median x̆U ,
the third quartile (75th percentile) x̆75,U , and the maximum value x(N ) :
• A box of arbitrary width is drawn from the first to the third quartile. A line is drawn
through the box at the median x̆U .
– one whisker is a line from the first quartile x̆25,U to either the minimum x(1) or
x̆25,U − 1.5 IQRx,U (whichever is larger);
– the other whisker is a line from x̆75,U to either the maximum x(N ) or x̆75,U +
1.5 IQRx,U (whichever is smaller).
• If there are outliers (according to the definition in Subsubsection 2.1.3), they are pre-
sented as individual points.
Example 37. Figure 24 represents a box-and-whisker plot of the number of points obtained
by the 120 students in the final exam.
Figure 24: Box-and-whisker plot of the number of points of 120 students in an exam.
31
3 Describing two categorical variables
In Section 2 we introduced several measures for describing one variable. In this and the
following sections we will introduce some methods for describing two variables simultaneously.
In this section, in particular, we consider the case of two categorical variables.
Contingency tables
The simplest, but usually quite telling, tool for describing two categorical variables is a
contingency table. Contingency tables are also known as cross tables.
Definition 38. Let xi and yi be the values of two categorical variables associated to the ith
individual in the population U (i = 1, 2, · · · , N ), where x takes Kx different categories and y
takes Ky different categories. A contingency table is a matrix-like table that shows, in the cell
(kx , ky ), the frequency of elements taking the kx th category of x and the ky th category of y
simultaneously (for kx = 1, 2, · · · , Kx and ky = 1, 2, · · · , Ky ).
In simple words, a contingency table is a table that shows in each cell the frequency of one
category of one variable x and one category of the second variable y. This is one of the many
situations in which things are simpler than they sound: aqui
Example 39. Let U be the population of N = 120 students taking a course in statistics. Let
xi be the grade in the first assignment (Pass or Fail) and yi be the grade in the exam (A, B,
C, D, E, F) for the ith student (i = 1, 2, · · · , N ). Table 18 shows the values of x and y in the
population of students.
x y x y x y x y x y x y x y x y
Fail F Pass F Pass F Pass F Pass E Pass D Pass C Pass B
Fail F Pass F Pass F Pass F Pass E Pass D Pass C Pass B
Fail F Pass F Pass F Pass F Pass E Pass D Pass C Pass B
Fail F Pass F Pass F Pass F Pass D Pass D Pass C Pass B
Fail F Pass F Pass F Pass F Pass D Pass D Pass C Pass B
Fail F Pass F Pass F Pass F Pass D Pass D Pass C Pass B
Fail F Pass F Pass F Fail E Pass D Pass D Pass C Pass B
Fail F Pass F Pass F Pass E Pass D Pass D Pass C Pass B
Fail F Pass F Pass F Pass E Pass D Pass D Pass C Pass B
Pass F Pass F Pass F Pass E Pass D Pass D Pass C Pass B
Pass F Pass F Pass F Pass E Pass D Pass D Pass C Fail A
Pass F Pass F Pass F Pass E Pass D Fail C Pass C Pass A
Pass F Pass F Pass F Pass E Pass D Pass C Pass C Pass A
Pass F Pass F Pass F Pass E Pass D Pass C Pass C Pass A
Pass F Pass F Pass F Pass E Pass D Pass C Fail B Pass A
Table 18: Results of N = 120 students in an assignment and an exam in statistics
32
Note that we added the totals to the rows and to the columns. These totals are known as
the marginals. They show the univariate distributions of each variable, i.e. the values of each
variable disregarding the other.
Example 40. Now, we divide each entry of the contingency table in Example 39 by N = 120
to obtain the joint relative distribution:
y
A B C D E F Total
x Pass 0.0333 0.0833 0.1417 0.1917 0.0917 0.35 0.8917
Fail 0.0083 0.0083 0.0083 0 0.0083 0.075 0.1083
Total 0.0417 0.0917 0.15 0.1917 0.1 0.425 1.0000
One way for represent graphically the information provided by a contingency table or the
joint relative distribution is through mosaic plots. In a mosaic plot each cell of the table is
represented by a rectangle, whose area is proportional to the value of the cell. Figure 25 shows
a mosaic plot of the grade in the assignment and the exam for the population of 120 students.
Note, for instance, that as the number of students who passed the assignment and got A in
the exam is four times bigger than the number of students who failed the assignment and got
A in the exam, therefore, in the mosaic plot, the former is represented by a rectangle that is
four times bigger than the latter.
Figure 25: Mosaic plot of the grades in a home assignment and an exam of 120 students.
Conditional distributions
Dividing each cell by the row-totals gives the conditional distribution of y given x.
Example 41. In order to obtain the conditional distribution of the grade in the exam y
conditioned on the grade on the assignment x, we divide each entry by the corresponding
row-total:
y
A B C D E F Total
x Pass 0.0374 0.0935 0.1589 0.215 0.1028 0.3925 1.000
Fail 0.0769 0.0769 0.0769 0 0.0769 0.6923 1.000
Total 0.0417 0.0917 0.15 0.1917 0.1 0.425 1.000
33
This table allows us to see some facts that are not so evident from the previous tables, for
instance, among students who fail the assignment, almost 70% fail the exam too; while among
students who pass the assignment, only around 40% fail the exam.
In this case, we are obtaining the distribution of y for each value of x. For instance, in
the first row we are considering only the students who passed the assignment. Among them
3.74% got A in the exam, 9.17% got B and so on. In the second row we are considering only
the students who failed the assignment. Among them 7.69% got B in the exam, 7.69 got B
and so on.
Conditional distributions can be represented graphically through stacked bar charts as
follows. As we are conditioning on x we create Kx bars of length one. Then, each bar is
subdivided according to the conditional frequencies of the categories of y. Figure 26 shows a
stacked bar chart for the distribution of the grades in the exam conditioned on the grade in
the assignment.
Figure 26: Stacked bar chart for the distribution of grades in the exam conditioned on the
grade in the assignment
Dividing each cell by the column-totals gives the conditional distribution of x given y.
Example 42. In order to obtain the conditional distribution of the grade in the assignment
x conditioned on the grade in the exam y, we divide each entry by the corresponding column-
total:
y
A B C D E F Total
x Pass 0.8 0.9091 0.9444 1 0.9167 0.8235 0.8917
Fail 0.2 0.0909 0.0556 0 0.0833 0.1765 0.1083
Total 1.00 1.00 1.00 1.00 1.00 1.00 1.00
This table says, for instance, that considering only the students who got A in the exam, 80%
of them passed the assignment too whereas 20% failed it; considering only the students who
got B in the exam, 91% of them passed the assignment too whereas 9% failed it; etc.
Figure 27 shows a stacked bar chart for the distribution of the grades in the assignment
conditioned on the grade in the exam.
34
Figure 27: Stacked bar chart for the distribution of grades in the assignment conditioned on
the grade in the exam
35