STAE Lecture Notes - LU2
LEARNING OBJECTIVES
• Construct a frequency distribution
• Compute and interpret various components of a frequency distribution
• Construct and interpret a pie chart, bar graph, stem-and-leaf plot, histogram, frequency polygon and
ogive
2.1.2. Grouped frequency table
An ungrouped frequency table is impractical for summarising a variable that takes a large number of different
values. Grouped frequency tables are only constructed for numerical
data. For continuous data and discrete data consisting of many possible values, the values of the variable are
organised into classes or intervals. The grouped frequency table, or grouped frequency distribution, then lists
the intervals and records the number of observations in each interval. The purpose of grouping data is to
highlight the main features of the data and present the information more effectively. Grouping must be done
in such a way that important information is not lost.
The following points should be considered when constructing a grouped frequency table:
• The number of class intervals
o Too few intervals would group together too many values and lead to information loss
o Too many intervals would not give much more information than the original raw data values
• The size (width) of the class intervals
o Depends on the number of class intervals
o All classes must be of equal width
o Width = Upper limit − Lower limit
o The smallest value must be recorded in the first interval and the largest value in the last interval
• All class intervals must be non-overlapping
o The following notation can be used to denote the class intervals for age groups:
▪ [20, 30) and [30, 40)
▪ 20 ≤ x < 30 and 30 ≤ x < 40
There are many algorithms that can be used to find the optimal solution for grouping data together. That is
beyond the scope of this course. Table 5 shows the grouped frequency table for the age variable in Table 1.
Class intervals are structured as decades (note the inclusion and exclusion of the class limits/boundaries).
Age group Frequency
[10, 20) 3
[20, 30) 10
[30, 40) 7
Total 20
Table 5: Age categories
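As a minimal sketch of the grouping step, the following Python snippet counts values into equal-width, non-overlapping classes of the form [lower, upper). The ages used are hypothetical (they are not the Table 1 data), and the function name group_frequencies and its parameters are illustrative assumptions rather than part of the course material.

```python
def group_frequencies(values, start, width, n_classes):
    """Count values into n_classes equal-width classes [lower, upper)."""
    counts = [0] * n_classes
    for x in values:
        index = int((x - start) // width)  # position of the class containing x
        if not 0 <= index < n_classes:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical ages, for illustration only (not the Table 1 data).
ages = [18, 22, 25, 29, 30, 31, 38, 20, 27, 35]

table = group_frequencies(ages, start=10, width=10, n_classes=3)
for lower, upper, f in table:
    print(f"[{lower}, {upper})  {f}")
```

Note that the value 20 is counted in [20, 30) and the value 30 in [30, 40), which is exactly what the inclusive lower limit and exclusive upper limit prescribe.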
2.1.3. Components of a frequency table
A complete frequency table consists of many different components.
Values/Class intervals
List of all possible values of the variable (categorical/discrete) or class intervals (discrete/continuous).
Frequency
The number of observations recorded for each possible outcome (individual value or class interval) of the
variable. Frequency is denoted by f.
Sample size
Total number of observations in the dataset. For frequency tables, Σf = n.
Class width
The class width of class intervals in a grouped frequency table is the difference between the upper and lower
limits of the interval, i.e., Upper limit – Lower limit, irrespective of whether the limits are inclusive or
exclusive.
The complete frequency table for the grouped frequency age variable given in Table 5 is as follows:
Age group Frequency RF RF% CF CRF CRF% MP
[10, 20) 3 0.15 15 3 0.15 15 15
[20, 30) 10 0.50 50 13 0.65 65 25
[30, 40) 7 0.35 35 20 1 100 35
Total 20 1 100
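The columns of the complete table can be reproduced with a short Python sketch. It assumes that RF, CF, CRF and MP stand for relative frequency (f / n), cumulative frequency, cumulative relative frequency (CF / n) and class midpoint; these readings agree with the values shown in the table. The variable names are illustrative.

```python
from itertools import accumulate

# Class intervals [lower, upper) and frequencies from Table 5.
classes = [(10, 20), (20, 30), (30, 40)]
freqs = [3, 10, 7]

n = sum(freqs)                  # sample size: the sum of all frequencies
cum = list(accumulate(freqs))   # CF: running total of the frequencies

rows = []
for (lower, upper), f, cf in zip(classes, freqs, cum):
    rows.append({
        "class": f"[{lower}, {upper})",
        "f": f,
        "RF": f / n,                # assumed: relative frequency
        "RF%": 100 * f / n,
        "CF": cf,
        "CRF": cf / n,              # assumed: cumulative relative frequency
        "CRF%": 100 * cf / n,
        "MP": (lower + upper) / 2,  # assumed: midpoint of the class limits
    })

for row in rows:
    print(row)
```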
Exercise 2.1
1) Construct a complete frequency table for the gender variable in Table 1, listed in order as follows:
Female Female Female Female Female Female Female Female Female Male
Male Male Male Male Male Male Male Male Male Male
2) Construct a complete frequency table for the daily coffee consumption variable in Table 1, listed in order
as follows:
1 1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 5 5 7 8
3) Construct a complete frequency table for the coffee affinity score variable in Table 1, where the first class
is (0, 1], listed in order as follows:
0.1 0.2 0.4 0.4 0.6 0.8 1.0 1.4 1.8 1.9 1.9 2.3 2.4 3.1 3.1 3.4 3.6 4.4 4.6 4.9
Coffee affinity score f RF RF% CF CRF CRF% MP
(0, 1]
(1, 2]
(2, 3]
(3, 4]
(4, 5]
Total
2.2. Contingency Tables
A contingency table, also known as a cross-tabulation, is a special type of frequency table in a matrix format
that is used to examine the relationship between two or more variables (categorical or numerical discrete)
simultaneously. If two variables are summarised in a contingency table, the values of one variable are listed
in the rows of the table and the values of the second variable are listed in the columns of the table. The
intersection of a row and a column is referred to as a cell of the table. The frequencies in the cells of the table
are referred to as observed (or cell or joint) counts (or frequencies). All the cell frequencies add up to n. Cell
frequencies can be expressed as percentages of the sample (table %), percentages within a single column
(column %) or percentages within a single row (row %), depending on the objective of the analysis.
Table 6 shows the contingency table of gender by coffee preference from the data in Table 1. Since gender
has two possible outcomes and coffee preference has two possible outcomes, this yields a 2×2 contingency
table. The additional row and column in the table do not denote possible outcomes of the variable, but show
the total counts for each separate variable, called the marginal counts/totals/frequencies per variable.
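The difference between table %, row % and column % can be sketched in Python as follows. The records are hypothetical and do not reproduce Table 6 or the Table 1 data; the helper names table_pct, row_pct and column_pct are illustrative assumptions.

```python
from collections import Counter

# Hypothetical (gender, coffee preference) records, not the Table 1 data.
records = [
    ("Female", "Filter"), ("Female", "Instant"), ("Female", "Instant"),
    ("Female", "Instant"), ("Male", "Filter"), ("Male", "Filter"),
    ("Male", "Instant"), ("Male", "Instant"), ("Male", "Instant"),
    ("Male", "Instant"),
]

cells = Counter(records)                      # observed (joint) counts
row_totals = Counter(g for g, _ in records)   # marginal counts for gender
col_totals = Counter(c for _, c in records)   # marginal counts for coffee
n = len(records)                              # all cell counts add up to n

def table_pct(row, col):
    """Cell count as a percentage of the whole sample."""
    return 100 * cells[(row, col)] / n

def row_pct(row, col):
    """Cell count as a percentage of its row total."""
    return 100 * cells[(row, col)] / row_totals[row]

def column_pct(row, col):
    """Cell count as a percentage of its column total."""
    return 100 * cells[(row, col)] / col_totals[col]

print(table_pct("Female", "Instant"))
print(row_pct("Female", "Instant"))
print(column_pct("Female", "Instant"))
```

The same cell count gives three different percentages because the denominator changes: the sample size, the row total or the column total.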
Exercise 2.2
1) Complete Table 6
2) What percentage of consumers drink filter coffee? ___________
This is an example of __________________ %
3) Among those who only drink filter coffee, what percentage is male? ___________
This is an example of __________________ %
4) What percentage of females only drink instant coffee? ___________
This is an example of __________________ %
5) What percentage of consumers are females who drink instant coffee? ___________
This is an example of __________________ %
2.3. Shape Of A Distribution
Frequency distributions and graphs summarise data so that important features in the data and the distribution
of the variable across the scale are presented in an effective way. The shape of a distribution is described in
terms of its symmetry and modality and determines the appropriate analyses that can be performed on the data.
2.3.1. Symmetry
A distribution is symmetric if the left side of the distribution mirrors the right side of the distribution. In a
symmetric distribution the mean, median and mode coincide (measures are discussed in detail in Section 3.1).
[Figure: symmetric distribution in which the mean, median and mode coincide]
A distribution is asymmetric or skewed if the left side of the distribution is different from the right side of the
distribution and is either left-skewed or right-skewed. A left-skewed distribution, also termed negatively
skewed, has a longer tail to the left of the distribution. A right-skewed distribution, also termed positively
skewed, has a longer tail to the right of the distribution.
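A rough numerical check of the direction of skewness is to compare the mean with the median, since the mean tends to be pulled towards the longer tail. The Python sketch below applies this common rule of thumb (an assumption, not a formal test) to the daily coffee consumption values listed in Exercise 2.1; the function name skew_direction is illustrative.

```python
from statistics import mean, median

# Daily coffee consumption values as listed in Exercise 2.1.
cups = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 7, 8]

def skew_direction(values):
    """Rule of thumb: the mean lies on the side of the longer tail."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "roughly symmetric"

print(mean(cups), median(cups), skew_direction(cups))
```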
The following two graphs show the shape of the left-skewed and right-skewed distributions and the
relationship between the mean, median and mode for both.
2.3.2. Modality
The modality of a distribution refers to the number of significant peaks in the shape of the distribution, i.e.,
high frequencies associated with a value or a class interval. A unimodal distribution has one peak, a bimodal
distribution has two peaks, and a multimodal distribution has more than two peaks.
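As a simplified sketch, peaks can be counted in Python by looking for classes whose frequency is higher than that of both neighbouring classes. This ignores whether a peak is significant and does not handle ties between adjacent classes, so it is only a first approximation. The function names are illustrative, and the second frequency list is hypothetical.

```python
def count_peaks(freqs):
    """Count classes whose frequency exceeds both neighbouring classes."""
    padded = [0] + list(freqs) + [0]   # treat the outside as frequency zero
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )

def modality(freqs):
    """Name the modality implied by the number of strict peaks."""
    peaks = count_peaks(freqs)
    if peaks == 0:
        return "no clear peak"
    return {1: "unimodal", 2: "bimodal"}.get(peaks, "multimodal")

print(modality([3, 10, 7]))        # frequencies from Table 5
print(modality([2, 8, 3, 9, 1]))   # hypothetical frequencies
```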
Frequency
Filter 7
Instant 13
Total 20
2.4.2. Bar graph
A bar graph uses vertical or horizontal bars to show a comparison between possible outcomes of a categorical
or a discrete numerical variable. The sizes of the bars represent the frequencies, RF or RF%. There are gaps
between the bars as the outcomes of the variable are separated and not measured on a continuum. The
following figure shows the bar graph (with frequencies) of the choice of brand rating, using the ungrouped
frequency table data in Table 3 as input.
Frequency
Not important 5
2 4
3 5
4 4
Very important 2
Total 20
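For a quick look at the data without plotting software, a horizontal bar graph can be sketched as plain text in Python. The frequencies are those in the table above; the function name bar_graph and the choice of "#" as the bar symbol are illustrative assumptions.

```python
# Choice of brand rating frequencies from the table above.
ratings = {"Not important": 5, "2": 4, "3": 5, "4": 4, "Very important": 2}

def bar_graph(freqs, symbol="#"):
    """Return one text line per category; bar length equals the frequency."""
    width = max(len(label) for label in freqs)   # align the category labels
    return [f"{label:<{width}} | {symbol * f} {f}" for label, f in freqs.items()]

lines = bar_graph(ratings)
for line in lines:
    print(line)
```

Each category gets its own separate line, which mirrors the gaps between the bars of a bar graph.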
Stem Leaf
0 5
1 1 3 3
2 0 2
However, if the original values were 0.5, 1.1, 1.3, 1.3, 2.0 and 2.2, the resulting stem-and-leaf plot will be the
same as above, but the unit of measure will change, i.e., leaf unit = 0.1.
The following graph shows the stem-and-leaf plot of age, using the raw data from Table 1 as input. The data
range from 19 to 40. The first digit is used as the stem and the second digit as the leaf. For this data the leaf
unit = 1. The data are listed in order as follows:
19 19 19 21 24 24 25 26 26 28 29 29 30 32 34 35 35 36 37 40
Stem Leaf
1 9 9 9
2 1 4 4 5 6 6 8 9 9
3 0 2 4 5 5 6 7
4 0
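The construction of this plot can be sketched in Python by splitting each age into its tens digit (stem) and units digit (leaf). The function name stem_and_leaf is an illustrative assumption, and the sketch only covers whole numbers with leaf unit = 1.

```python
from collections import defaultdict

# Ages as listed above (raw data from Table 1).
ages = [19, 19, 19, 21, 24, 24, 25, 26, 26, 28, 29, 29,
        30, 32, 34, 35, 35, 36, 37, 40]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(ages)
for stem in range(min(plot), max(plot) + 1):   # empty stems stay visible
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```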
2.4.4. Histogram
The histogram is used to graph a grouped frequency distribution. The class intervals are represented on the
horizontal axis and the frequencies, RF or RF%, are represented on the vertical axis. A vertical bar is
constructed above each class interval. Since the class limits are on a continuum there are no gaps between the
bars. The following graph shows the histogram of age (with frequencies) using the grouped frequency table
data in Table 5 as input.
The following graph shows the frequency polygon of age, using the grouped frequency table data in Table 5,
together with the MP, as input.
The frequency polygon is an alternative to the histogram as both graphs convey the same information, as can
be seen when the two graphs are shown together in a single graph as follows:
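The points joined by a frequency polygon can be tabulated with a short Python sketch. It assumes the polygon is closed on the horizontal axis by adding a zero-frequency class of the same width on either side of the data; the function name polygon_points is illustrative.

```python
# Class intervals [lower, upper) and frequencies from Table 5.
classes = [(10, 20), (20, 30), (30, 40)]
freqs = [3, 10, 7]

def polygon_points(classes, freqs):
    """Return (midpoint, frequency) pairs, padded with zero-frequency ends."""
    width = classes[0][1] - classes[0][0]          # equal class width
    first_mp = (classes[0][0] + classes[0][1]) / 2
    last_mp = (classes[-1][0] + classes[-1][1]) / 2
    points = [(first_mp - width, 0)]               # empty class before the data
    points += [((lo + hi) / 2, f) for (lo, hi), f in zip(classes, freqs)]
    points.append((last_mp + width, 0))            # empty class after the data
    return points

points = polygon_points(classes, freqs)
for mp, f in points:
    print(mp, f)
```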
2.4.6. Ogive
The ogive is a graphical representation of a CF, CRF or CRF%. Cumulative frequencies are represented on
the vertical axis and points are plotted above the upper class limits indicated on the horizontal axis. The first
point coincides with the lower class limit of the first class on the x-axis, with a CF of zero. The following
graph shows the ogive of age, using the grouped frequency table data in Table 5, together with the CF, as input.
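The points plotted on the ogive follow directly from the rule described here and can be sketched in Python. The function name ogive_points is an illustrative assumption.

```python
from itertools import accumulate

# Class intervals [lower, upper) and frequencies from Table 5.
classes = [(10, 20), (20, 30), (30, 40)]
freqs = [3, 10, 7]

def ogive_points(classes, freqs):
    """Return (x, CF) pairs: lower limit of class 1, then each upper limit."""
    points = [(classes[0][0], 0)]   # the ogive starts at a CF of zero
    points += [(upper, cf)
               for (_, upper), cf in zip(classes, accumulate(freqs))]
    return points

points = ogive_points(classes, freqs)
for x, cf in points:
    print(x, cf)
```

Dividing each CF by the sample size gives the corresponding points for a CRF ogive.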
Exercise 2.3
Use the frequency tables constructed in Exercise 2.1 to draw the following graphs:
1) Pie chart for gender (use RF%)
Gender RF%
Male 55
Female 45
Total 100
4) Frequency polygon for coffee affinity (use RF%)
Coffee affinity score RF% MP
(−1, 0] 0 −0.5
(0, 1] 35 0.5
(1, 2] 20 1.5
(2, 3] 10 2.5
(3, 4] 20 3.5
(4, 5] 15 4.5
(5, 6] 0 5.5
Total 100
6) Compare (3), (4) and (5), and comment on the symmetry and modality
Stem Leaf (leaf unit = 0.1)
0 124468
1 04899
2 34
3 1146
4 469
Coffee affinity score CF x-axis
(0, 1] 7
(1, 2] 11
(2, 3] 13
(3, 4] 17
(4, 5] 20
Total