Chapter 1 - Descriptive Statistics - Frequency Distributions
1. Descriptive Statistics
FREQUENCY DISTRIBUTIONS
Agenda
1.1 – Introduction
DATA ANALYSIS AND PROBABILITY
1.1 Introduction
Introduction
Statistical process
Collect data
Infer and decide based on data
sample
Why?
Statistical variables
Numerical
Continuous
Categorical
Numerical: Discrete, Continuous
Exercise Variables
Classify the following variables according to their type:
The second method to classify statistical variables distinguishes between qualitative and quantitative variables.
In qualitative variables there is no meaning for the difference between values. This type of variable includes nominal and ordinal measurement levels.
Nominal data is considered to be the weakest type of data. Numerical identification is used strictly for convenience and does not imply ranking of responses. For example: gender, eye colour and occupation.
Ordinal data indicate the rank ordering of items. Numerical identification indicates order but the difference between values has no meaning. For example: social class (low, medium, high) and product quality ranking (poor, average, good).
Qualitative: Nominal, Ordinal
Quantitative: Interval, Ratio
Exercise Variables
2019:
Jeff Bezos 1 114.0 56 Divorced Amazon
William Gates 2 106.0 64 Married Microsoft
Frequency distributions
CATEGORICAL DATA
A frequency distribution is a table used to organize data.
The first column includes all the possible values of the variable.
The absolute frequency is the number of elements in each category.
The relative frequency is the proportion of elements in each category. It is obtained by dividing the absolute frequency by n.

Example: Favorite airline of a group of 552 individuals

Airline       Absolute frequency   Relative frequency
TAP           178                  0.322
Iberia        45                   0.082
KLM           20                   0.036
Easyjet       200                  0.362
Air France    38                   0.069
Vueling       71                   0.129
Total         552                  1
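The relative frequencies in the airline table can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary name `absolute` and the print layout are illustrative choices, and only the counts come from the example.

```python
# Absolute frequencies from the favorite-airline example (n = 552).
absolute = {
    "TAP": 178,
    "Iberia": 45,
    "KLM": 20,
    "Easyjet": 200,
    "Air France": 38,
    "Vueling": 71,
}

# n is the total number of observations.
n = sum(absolute.values())

# Relative frequency = absolute frequency divided by n.
relative = {airline: count / n for airline, count in absolute.items()}

for airline, count in absolute.items():
    print(f"{airline:<12}{count:>5}{relative[airline]:>8.3f}")
print(f"{'Total':<12}{n:>5}{sum(relative.values()):>8.3f}")
```

The relative frequencies always add up to 1, which is a quick check on the table.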
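Charts for categorical data:
Bar charts – when one wishes to draw attention to the frequency of each category
Pie charts – when one wishes to draw attention to the proportion of frequencies in each category
[Figure: bar chart and pie chart of the favorite-airline frequencies (TAP, Ibéria, KLM, Easyjet, Air France, Vueling).]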
Misleading graphics
https://twitter.com/partidochega/status/1617176950002401281?s=48&t=OqUQ_cFYUTeO5bhMDkFbLA
Misleading graphics
[Figure: chart comparing the EU average ("Média EU") with Portugal; vertical axis from 0 to 3.5.]
Frequency distributions
NUMERICAL DATA – DISCRETE VARIABLES
While dealing with numerical data, it is also possible to compute the cumulative frequencies.
Example: Number of cell phones purchased in the last two years.
N. cell phones   Abs. freq.   Rel. freq.   Abs. cum. freq.   Rel. cum. freq.
Some notation
X_i     Absolute frequency (n_i)   Relative frequency (f_i)   Abs. cumulative frequency (N_i)   Rel. cumulative frequency (F_i)
X_1     n_1                        f_1                        N_1                               F_1
…       …                          …                          …                                 …
X_i     n_i                        f_i                        N_i                               F_i
…       …                          …                          …                                 …
X_k     n_k                        f_k                        N_k                               F_k
Total   n                          1
[Figure: chart of the absolute frequencies (vertical axis from 0 to 70) for the values 0 to 4.]
X_i     n_i   f_i     N_i   F_i
0       18    0.090   18    0.090
1       39    0.195   57    0.285
2       52    0.260   109   0.545
3       67    0.335   176   0.880
4       24    0.120   200   1
Total   200   1
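The cumulative columns of the table (values 0 to 4, n = 200) can be checked with a running sum. This is a minimal sketch in Python; the variable names are illustrative and only the absolute frequencies are taken from the table.

```python
from itertools import accumulate

# Values and absolute frequencies from the table (n = 200).
values = [0, 1, 2, 3, 4]
abs_freq = [18, 39, 52, 67, 24]

n = sum(abs_freq)

# Relative frequency: f_i = n_i / n.
rel_freq = [count / n for count in abs_freq]

# Absolute cumulative frequency: N_i = n_1 + ... + n_i.
abs_cum = list(accumulate(abs_freq))

# Relative cumulative frequency: F_i = N_i / n.
rel_cum = [total / n for total in abs_cum]

for x, ni, fi, Ni, Fi in zip(values, abs_freq, rel_freq, abs_cum, rel_cum):
    print(f"{x:>3}{ni:>5}{fi:>8.3f}{Ni:>6}{Fi:>8.3f}")
```

The last cumulative values are always n and 1, so they work as a consistency check.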
[Figure: chart of relative frequencies for the values 1 to 6; data labels: 43.33%, 30.83%, 13.89%, 6.67%, 2.78%, 2.50%.]
Frequency distributions
NUMERICAL DATA – CONTINUOUS VARIABLES
While dealing with continuous variables¹, it is very frequent to group the data into classes.
By grouping data into classes, we lose its individuality, but we expect gains in terms of interpretation.
¹ It is also common to group data if we are dealing with a discrete variable with many different values.
How should we group this data set?
271 236 294 252 254 263 266 222 262 278 288
262 237 247 282 224 263 267 254 271 278 263
262 288 247 252 264 263 247 225 281 279 238
252 242 248 263 255 294 268 255 272 271 291
263 242 288 252 226 263 269 227 273 281 267
263 244 249 252 256 263 252 261 245 252 294
288 245 251 269 256 264 252 232 275 284 252
263 274 252 252 256 254 269 234 285 275 263
263 246 294 252 231 265 269 235 275 288 294
263 247 252 269 261 266 269 236 276 248 299
Example (Newbold, p. 42): Time to complete a task.
One way of answering the previous question is to use Sturges' rule.
This is a practical rule to find an appropriate number of classes, $k$, and the width of each class, $h$:

$k = I(\log_2 n) + 1$
$h = I\left(\frac{\Delta}{k}\right) + 1,$

where $I(x)$ stands for the integer part of $x$, and $\Delta$ stands for the difference between the maximum and minimum observations in the data set.

For the 110 observations listed above:
$k = I(\log_2 110) + 1 = I(6.78) + 1 = 7$ (7 classes)
$h = I\left(\frac{299 - 222}{7}\right) + 1 = 12$ (width 12)
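The rule is easy to apply in code. The sketch below follows the formula exactly as written on this slide (integer part plus one); the function name `sturges` is a hypothetical helper, and the inputs n = 110, minimum 222 and maximum 299 are taken from the example.

```python
import math


def sturges(n, minimum, maximum):
    """Number of classes k and class width h, as defined on the slide."""
    # k = I(log2 n) + 1, where I(x) is the integer part of x.
    k = int(math.log2(n)) + 1
    # h = I(delta / k) + 1, where delta is the range of the data.
    delta = maximum - minimum
    h = int(delta / k) + 1
    return k, h


k, h = sturges(110, 222, 299)
print(k, h)  # 7 classes of width 12, as in the worked example
```

With k = 7 and h = 12 starting at 222, the classes cover 222 to 306, which contains the maximum observation (299).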
Example (Newbold, p. 42): Time to complete a task. After grouping the values, the individuality of each
observation is lost.
If all the classes have the same width, this requirement […]
The areas of the rectangles are proportional to the frequencies:
$\frac{0.064}{1} = \frac{0.768}{12}, \quad \frac{0.1}{1} = \frac{1.2}{12}, \quad \dots$
[Figure: histogram of the relative frequencies over the classes [222, 234[, [234, 246[, [246, 258[, [258, 270[, [270, 282[, [282, 294[, [294, 306].]
If, for some reason, the first two classes were aggregated, we would obtain a new frequency distribution.

Original distribution:
Classes     f_i
[222,234[   0.064
[234,246[   0.100
[246,258[   0.273
[258,270[   0.291
[270,282[   0.136
[282,294[   0.082
[294,306]   0.055
Total       1

After aggregation:
Classes     f_i
[222,246[   0.164
[246,258[   0.273
[258,270[   0.291
[270,282[   0.136
[282,294[   0.082
[294,306]   0.055
Total       1

Using the new frequencies to build a “histogram” we get:
[Figure: chart with heights equal to the new relative frequencies over the classes [222, 246[, [246, 258[, [258, 270[, [270, 282[, [282, 294[, [294, 306].]
With these heights, the areas of the rectangles are no longer proportional to the frequencies:
$\frac{0.164}{1} \neq \frac{3.936}{13.98}$
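When the classes have different widths, the height of each rectangle must be the frequency divided by the class width, so that the areas remain proportional to the frequencies:
$\frac{f_i}{h_i}$ or $\frac{n_i}{h_i}$
[Figure: histogram with width-adjusted heights over the classes [222, 246[, [246, 258[, [258, 270[, [270, 282[, [282, 294[, [294, 306].]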
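The effect of unequal class widths can be checked numerically. The sketch below uses the aggregated classes and the rounded relative frequencies from the table above; it assumes the correction is height = relative frequency / class width, and all variable names are illustrative.

```python
# Aggregated classes (lower limit, upper limit) and their relative frequencies.
classes = [(222, 246), (246, 258), (258, 270), (270, 282), (282, 294), (294, 306)]
rel_freq = [0.164, 0.273, 0.291, 0.136, 0.082, 0.055]

widths = [upper - lower for lower, upper in classes]

# Naive chart: height = frequency, so area = frequency * width.
naive_areas = [f * w for f, w in zip(rel_freq, widths)]
naive_share = naive_areas[0] / sum(naive_areas)
print(f"first class: frequency {rel_freq[0]:.3f}, share of area {naive_share:.3f}")

# Corrected chart: height = frequency / width, so area = frequency.
heights = [f / w for f, w in zip(rel_freq, widths)]
areas = [h * w for h, w in zip(heights, widths)]
for (lower, upper), h, a in zip(classes, heights, areas):
    print(f"[{lower}, {upper}[  height {h:.5f}  area {a:.3f}")

# The areas sum to the sum of the frequencies (1.001 here, because the
# tabulated frequencies are rounded to three decimals).
print(f"total area {sum(areas):.3f}")
```

In the naive chart the wide first class takes about 28% of the total area while holding only 16.4% of the observations; dividing by the width removes that distortion.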
Property:
Proof:
$\dots = \sum_{i=1}^{k} n_i = n$
Property: Proof:
$\dots = \sum_{i=1}^{k} f_i = 1$
[Figure: two charts over the classes [210, 222[, [222, 234[, [234, 246[, [246, 258[, [258, 270[, [270, 282[, [282, 294[, [294, 306], ]306, 318[; vertical axis from 0 to 0.3.]
Classes     N_i
[222,234[   7
[234,246[   18
[246,258[   48
[258,270[   80
[270,282[   95
[282,294[   104
[294,306]   110
[Figure: chart of the absolute cumulative frequencies; horizontal axis from 222 to 306.]
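The cumulative frequencies above can be recomputed from the raw observations. This is a minimal sketch, assuming the seven classes of width 12 starting at 222 from the example; the names `lower`, `width` and `counts` are illustrative.

```python
from itertools import accumulate

# The 110 observations of the "time to complete a task" example.
data = [
    271, 236, 294, 252, 254, 263, 266, 222, 262, 278, 288,
    262, 237, 247, 282, 224, 263, 267, 254, 271, 278, 263,
    262, 288, 247, 252, 264, 263, 247, 225, 281, 279, 238,
    252, 242, 248, 263, 255, 294, 268, 255, 272, 271, 291,
    263, 242, 288, 252, 226, 263, 269, 227, 273, 281, 267,
    263, 244, 249, 252, 256, 263, 252, 261, 245, 252, 294,
    288, 245, 251, 269, 256, 264, 252, 232, 275, 284, 252,
    263, 274, 252, 252, 256, 254, 269, 234, 285, 275, 263,
    263, 246, 294, 252, 231, 265, 269, 235, 275, 288, 294,
    263, 247, 252, 269, 261, 266, 269, 236, 276, 248, 299,
]

lower, width, k = 222, 12, 7

# Classes are closed on the left and open on the right, except the last
# one, which also includes its upper limit (306).
counts = [0] * k
for x in data:
    index = min((x - lower) // width, k - 1)
    counts[index] += 1

cumulative = list(accumulate(counts))

for i, (ni, Ni) in enumerate(zip(counts, cumulative)):
    start = lower + i * width
    print(f"[{start}, {start + width}  n_i = {ni:>3}  N_i = {Ni:>3}")
```

Dividing each count by 110 gives the relative frequencies used in the histogram examples (0.064, 0.100, 0.273, ...).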
We asked 300 retail employees (who work in downtown Lisbon) about the effective number of hours worked per week. The table below presents the results.

Classes    c_i    n_i   f_i     N_i   F_i
[a, 20[    15     4     i       4     0.013
[20, 35[   27.5   f     0.050   19    o
[35, 40[   e      101   0.337   120   p
[40, b[    45     g     j       m     0.800
[50, 60[   55     50    k       n     q
[c, d]     65     h     l       300   1
Total             300   1

a) Find the missing information. Justify.
b) Plot the histogram of area 1.
c) Clarify the following questions, asked by an amateur:
i. From the above table, can I check the values for all 300 employees?
ii. Where are the employees who answered 40 hours placed?
iii. Where are the employees who answered d hours and one second placed?
iv. How do the observations behave within each class?
Two-dimensional data
While dealing with numerical raw data, the observations are usually listed side by side.
Example: Height (in m) and weight (in kg) of a group of 10 individuals.
[Figure: plot of weight (kg, from 50 to 80) against height (m, from 1.5 to 1.95).]
While dealing with tabulated data, either categorical or numerical, two-dimensional data is usually organized in a cross-table (or contingency table) which lists the number (or percentage) of observations for each possible combination of the values of the two variables.
The values inside a cross-table are the joint absolute frequencies ($n_{ij}$) or the joint relative frequencies ($f_{ij}$) as they refer to counts or percentages, respectively.

Example (Newbold, p. 30): The following cross-table contains data from a survey on health and nutrition of the U.S. population, conducted in 2005. The table contains information on the gender and activity level of a group of 4460 individuals.

              Male   Female
Sedentary     957    1226
Active        340    417
Very active   842    678
Joint frequency distribution of gender and activity level.
By adding the frequencies in each row and column we get the marginal frequencies for each variable.

              Male   Female   Total
Sedentary     957    1226     2183
Active        340    417      757
Very active   842    678      1520
Total         2139   2321     4460

The graphical representation of cross-tables is usually made by a component bar chart or by a cluster bar chart.
[Figure: component bar chart and cluster bar chart of gender by activity level (Sedentary, Active, Very active).]
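The marginal frequencies are just row and column sums of the joint frequencies. A minimal sketch in Python, using the gender and activity-level counts from the example; the nested-dictionary layout is an illustrative choice.

```python
# Joint absolute frequencies n_ij: activity level (rows) by gender (columns).
joint = {
    "Sedentary": {"Male": 957, "Female": 1226},
    "Active": {"Male": 340, "Female": 417},
    "Very active": {"Male": 842, "Female": 678},
}

# Marginal frequencies of activity level: sum across each row.
row_totals = {level: sum(by_gender.values()) for level, by_gender in joint.items()}

# Marginal frequencies of gender: sum down each column.
col_totals = {}
for by_gender in joint.values():
    for gender, count in by_gender.items():
        col_totals[gender] = col_totals.get(gender, 0) + count

n = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(n)
```

Both sets of marginal frequencies add up to the same total n, which is a useful check on the table.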
We can also analyse the conditional frequencies. They are obtained by dividing the joint frequency by the marginal frequency.
For example, using the previous data set, we can answer questions like:
• What is the percentage of men who are sedentary?
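The question above can be answered by dividing a joint frequency by the matching marginal frequency. A minimal sketch in Python with the counts from the gender and activity-level example; the variable names are illustrative.

```python
# Joint absolute frequencies: activity level (rows) by gender (columns).
joint = {
    "Sedentary": {"Male": 957, "Female": 1226},
    "Active": {"Male": 340, "Female": 417},
    "Very active": {"Male": 842, "Female": 678},
}

# Marginal frequency of "Male": total number of men in the sample.
male_total = sum(by_gender["Male"] for by_gender in joint.values())

# Conditional frequency of "Sedentary" given "Male":
# joint frequency divided by the marginal frequency of the condition.
sedentary_given_male = joint["Sedentary"]["Male"] / male_total

print(f"{sedentary_given_male:.1%} of men are sedentary")
```

Note that the condition fixes the denominator: dividing 957 by the sedentary total (2183) instead would answer a different question, namely the percentage of sedentary individuals who are men.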
https://www.gapminder.org/fw/world-health-chart/