
Chapter 1 - Descriptive Statistics - Frequency Distributions

The document covers data analysis and probability, focusing on descriptive statistics, statistical variables, and frequency distributions. It explains the importance of summarizing data through tables and graphs, and classifies statistical variables into categorical and numerical types. Additionally, it discusses methods for sampling, measurement levels, and presents exercises for practical application.


DATA ANALYSIS AND PROBABILITY

1. Descriptive Statistics
FREQUENCY DISTRIBUTIONS

MARIA JOÃO BRAGA | PATRÍCIA RAMOS



Agenda

1.1 – Introduction

1.2 – Statistical variables

1.3 – Frequency distributions

1.4 – Two-dimensional data


1.1 Introduction


Introduction

Descriptive statistics is a branch of statistics that applies several techniques to describe and summarize a set of data. This task, crucial for large volumes of data, materializes in building tables and charts, and in the computation of measures or indicators that represent the information contained in the data.

Tables and graphs help us gain a better understanding of data and provide visual support for improved decision making.

Statistics comprises two branches:
• Descriptive statistics: graphical and numerical procedures to summarize and process data.
• Statistical inference: use of data for forecasts and estimates for decision making.

Statistical process

Define the problem → Collect data → Summarize data → Infer and decide based on data

Example (Newbold, p. 22): Before bringing a new product to market, a manufacturer wants to arrive at some assessment of the level of demand and may undertake a market research survey.
The manufacturer is interested in all potential buyers (the population); however, the survey is more likely to be applied to a subset (a sample).

Why?

How should we select a sample?

The most common procedure for selecting a sample is random sampling.
This procedure selects a set of 𝑛 objects from a
population in such a way that:
• each member of the population has the same
probability of being selected;
• the selection of one member does not influence the
selection of any other member.
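
The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the population is available as a list, and the buyer identifiers, the sample size of 50 and the seed are hypothetical. `random.sample` draws distinct members, so each member has the same chance of being included.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every member is equally likely to be included."""
    rng = random.Random(seed)  # the seed only makes the draw reproducible
    return rng.sample(population, n)


# Hypothetical population: identifiers of 1000 potential buyers.
population = list(range(1, 1001))
sample = simple_random_sample(population, n=50, seed=42)
print(len(sample), "members selected")
```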


1.2 Statistical variables


Statistical variables

Statistical variables are characteristics of interest to be statistically studied, which are associated with the population or with the sample. They are called variables because they vary from element to element in the population or sample under study.
There are two ways of classifying statistical variables:
• by the type and amount of information they contain; or
• by levels of measurement.

Classification of variables by type and amount of information

This method classifies variables into categorical or numerical.
Categorical variables produce responses that belong to groups or categories.
For example: gender, car brand, social class and marital status.
Numerical variables are expressed by numbers and include both discrete and continuous variables.

Variables
• Categorical
• Numerical
  • Discrete
  • Continuous

Discrete variables may take on values inside a finite or countably infinite set.
For example: number of siblings, number of students enrolled in a course and shoe size.

Continuous variables may take on any value inside a real interval.
For example: height, weight, salary and age.

Exercise

Classify the following variables according to their type:
• Blood type.
• Grade in DAP's exam.
• Education degree.
• Nationality.
• Medal awarded to an athlete at the Olympics.
• Score given to a new beverage on a scale 1-10.
• Occupation.
• Production cost of a tablet.
• Number of shares one has from a company.
• Inflation rate.
• Ink amount, in litres, needed to paint a building.
• Course final grade, rounded to units.
• Daily temperature, in Celsius degrees, in Nepal.
• Duration of a trip, in hours and minutes.
• Number of people in a household.

Classification of variables by levels of measurement

The second method to classify statistical variables distinguishes between qualitative and quantitative variables.

In qualitative variables there is no meaning for the difference between values. This type of variable includes nominal and ordinal measurement levels.

Nominal data is considered to be the weakest type of data. Numerical identification is used strictly for convenience and does not imply ranking of responses. For example: gender, eye colour and occupation.

Ordinal data indicate the rank ordering of items. Numerical identification indicates order but the difference between values has no meaning. For example: social class (low, medium, high) and product quality ranking (poor, average, good).

As for quantitative variables, they include interval and ratio measurement levels.

An interval scale indicates rank and distance from an arbitrary zero. For example: temperature and IQ.

Finally, a ratio scale indicates both rank and distance from a natural zero, with ratios between two measures having meaning. For example: weight, salary and age.

Variables
• Qualitative
  • Nominal
  • Ordinal
• Quantitative
  • Interval
  • Ratio

Exercise

Classify the following variables according to their measurement levels:
• Blood type.
• Grade in DAP's exam.
• Education degree.
• Nationality.
• Medal awarded to an athlete at the Olympics.
• Score given to a new beverage on a scale 1-10.
• Occupation.
• Production cost of a tablet.
• Number of shares one has from a company.
• Inflation rate.
• Ink amount, in litres, needed to paint a building.
• Course final grade, rounded to units.
• Daily temperature, in Celsius degrees, in Nepal.
• Duration of a trip, in hours and minutes.
• Number of people in a household.

Exercise (chapter 1, proposed exercise 2)

Consider the following information about the 10 richest American citizens, published in Forbes magazine in 2019:

                   Rank  Net worth  Age  Marital status  Company
Jeff Bezos            1      114.0   56  Divorced        Amazon
William Gates         2      106.0   64  Married         Microsoft
Warren Buffett        3       80.8   89  Married         Berkshire Hathaway
Mark Zuckerberg       4       69.6   35  Married         Facebook
Larry Ellison         5       65.0   75  Divorced        Oracle
Larry Page            6       55.5   46  Married         Google
Sergey Brin           7       53.5   46  Married         Google
Michael Bloomberg     8       53.4   77  Divorced        Bloomberg
Steve Ballmer         9       51.7   63  Married         Microsoft
Jim Walton           10       51.6   71  Married         Walmart

a) In this dataset, how many observations do we have?
b) And how many statistical variables?
c) How do you classify each variable?
d) For each variable, what was the measurement scale used?

1.3 Frequency distributions


CATEGORICAL DATA. NUMERICAL DATA.


Frequency distributions
CATEGORICAL DATA

A frequency distribution is a table used to organize data.
The first column includes all the possible values of the variable.
The absolute frequency is the number of elements in each category.
The relative frequency is the proportion of elements in each category. It is obtained by dividing the absolute frequency by 𝑛.

Example: Favorite airline of a group of 552 individuals

Airline       Absolute frequency   Relative frequency
TAP                          178                0.322
Iberia                        45                0.082
KLM                           20                0.036
Easyjet                      200                0.362
Air France                    38                0.069
Vueling                       71                0.129
Total                        552                1
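
As a rough illustration, the airline table can be rebuilt in Python from raw answers. The list of 552 individual answers is reconstructed here from the absolute frequencies, which is an assumption made only so the sketch is self-contained; `frequency_table` is a hypothetical helper name.

```python
from collections import Counter


def frequency_table(observations):
    """Map each category to (absolute frequency, relative frequency)."""
    counts = Counter(observations)
    n = sum(counts.values())
    return {value: (n_i, n_i / n) for value, n_i in counts.items()}


# Raw answers rebuilt from the absolute frequencies of the example.
airline_counts = {"TAP": 178, "Iberia": 45, "KLM": 20,
                  "Easyjet": 200, "Air France": 38, "Vueling": 71}
observations = [airline for airline, c in airline_counts.items() for _ in range(c)]

table = frequency_table(observations)
for airline, (n_i, f_i) in table.items():
    print(f"{airline:<12}{n_i:>5}{f_i:>8.3f}")
```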


Graphical representation of categorical data

Categorical data is usually represented by two types of charts:

Bar charts: when one wishes to draw attention to the absolute frequency in each category.

Pie charts: when the goal is to give more emphasis to the proportion of frequencies in each category.

[Figure: bar chart of the absolute frequencies (vertical axis from 0 to 250) and pie chart of the relative frequencies for the airline example: TAP 32%, Ibéria 8%, KLM 4%, Easyjet 36%, Air France 7%, Vueling 13%.]

“Although the absolute numbers make it look like the most critical situation is in Lisbon, with 3150 confirmed cases, this data might not be the best indicator to mirror the real impact of the disease.”
https://www.publico.pt/interactivo/como-esta-evoluir-pandemia-covid19-onde-vivo#/

The choice of the chart is crucial!

Misleading graphics

https://twitter.com/partidochega/status/1617176950002401281?s=48&t=OqUQ_cFYUTeO5bhMDkFbLA

Misleading graphics

[Figure: bar chart comparing the EU average (“Média EU”) with Portugal; vertical axis from 0 to 3.5.]

Frequency distributions
NUMERICAL DATA – DISCRETE VARIABLES

While dealing with numerical data, it is also possible to compute the cumulative frequencies.

The absolute cumulative frequency is the number of observations smaller than or equal to a given value of the variable.
The relative cumulative frequency is the proportion of observations smaller than or equal to a given value of the variable.

Example: Number of cell phones purchased in the last two years.

N. cell phones   Abs. freq.   Rel. freq.   Abs. cum. freq.   Rel. cum. freq.
0                        18        0.090                18             0.090
1                        39        0.195                57             0.285
2                        52        0.260               109             0.545
3                        67        0.335               176             0.880
4                        24        0.120               200             1
Total                   200        1
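
The cumulative columns are running sums of the simple frequencies. A minimal Python sketch using the cell phone example (the variable names are illustrative only):

```python
from itertools import accumulate

values = [0, 1, 2, 3, 4]            # number of cell phones purchased
abs_freq = [18, 39, 52, 67, 24]     # absolute frequencies n_i

n = sum(abs_freq)
rel_freq = [n_i / n for n_i in abs_freq]   # f_i = n_i / n
abs_cum = list(accumulate(abs_freq))       # N_i = n_1 + ... + n_i
rel_cum = [N_i / n for N_i in abs_cum]     # F_i = N_i / n

for row in zip(values, abs_freq, rel_freq, abs_cum, rel_cum):
    print("{:>2} {:>4} {:>6.3f} {:>4} {:>6.3f}".format(*row))
```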


Some notation
$X_i$     Absolute frequency $n_i$     Relative frequency $f_i$     Abs. cumulative frequency $N_i$     Rel. cumulative frequency $F_i$
$X_1$     $n_1$                        $f_1$                        $N_1$                               $F_1$
$X_2$     $n_2$                        $f_2$                        $N_2$                               $F_2$
…         …                            …                            …                                   …
$X_i$     $n_i$                        $f_i$                        $N_i$                               $F_i$
…         …                            …                            …                                   …
$X_k$     $n_k$                        $f_k$                        $N_k$                               $F_k$
Total     $n$                          1

$k$ - number of categories (or groups) of $X$

Graphical representation of simple frequencies


[Figure: bar chart; horizontal axis: number of cell phones, 0 to 4; vertical axis from 0 to 80.]

Bar chart of the number of cell phones purchased.

Graphical representation of cumulative frequencies

How many individuals purchased 2 or fewer cell phones?
How many purchased 2.5 or fewer cell phones?
How many purchased 2.8 or fewer cell phones?
How many purchased 3 or fewer cell phones?
How many purchased 3.1 or fewer cell phones?
How many purchased 4.7 or fewer cell phones?

Xi       ni      fi       Ni      Fi
0        18      0.090    18      0.090
1        39      0.195    57      0.285
2        52      0.260    109     0.545
3        67      0.335    176     0.880
4        24      0.120    200     1
Total    200     1
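
For a discrete variable the cumulative frequency is a step function: it only changes at the observed values and stays constant in between. The sketch below, which assumes the cell phone table and uses a hypothetical helper `count_at_most`, evaluates that step function at the points asked above.

```python
from bisect import bisect_right
from itertools import accumulate

values = [0, 1, 2, 3, 4]            # observed values, in increasing order
abs_freq = [18, 39, 52, 67, 24]
abs_cum = list(accumulate(abs_freq))


def count_at_most(x):
    """Number of individuals who purchased x or fewer cell phones."""
    position = bisect_right(values, x)   # how many observed values are <= x
    return abs_cum[position - 1] if position > 0 else 0


for x in (2, 2.5, 2.8, 3, 3.1, 4.7):
    print(x, count_at_most(x))
```

Values such as 2.5 or 2.8 give the same count as 2, because nobody can purchase a fractional number of cell phones.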


Graphical representation of cumulative frequencies

[Figure: chart of the absolute cumulative frequencies of the cell phone example; horizontal axis from -3 to 7, vertical axis from 0 to 250.]

Absolute cumulative frequency distribution

Exercise (chapter 1, proposed exercise 6)

A group of 360 individuals was asked about the number of banks they have an account in. Based on this information, the following bar chart was built:

[Figure: bar chart titled “Number of banks an individual has an account in”; horizontal axis: 1 to 6 banks; bar labels, from tallest to shortest: 43.33%, 30.83%, 13.89%, 6.67%, 2.78% and 2.50%.]

a) Considering this data, state the maximum number of banks that an individual has an account in.
b) Create a table with both the absolute and relative frequencies. Add to your table the cumulative values.
c) Graphically represent the relative cumulative frequency.

Frequency distributions
NUMERICAL DATA – CONTINUOUS VARIABLES

While dealing with continuous variables (1), it is very frequent to group the data into classes.

This is done because the possible values of a continuous variable are infinite and classifying them individually would be useless and uninformative.

By grouping data into classes, we lose its individuality, but we expect gains in terms of interpretation.

(1) It is also common to group data if we are dealing with a discrete variable with many different values.

Example (Newbold, p. 42): The following data set refers to the time (in seconds) a group of 110 employees took to complete a task.

271 236 294 252 254 263 266 222 262 278 288
262 237 247 282 224 263 267 254 271 278 263
262 288 247 252 264 263 247 225 281 279 238
252 242 248 263 255 294 268 255 272 271 291
263 242 288 252 226 263 269 227 273 281 267
263 244 249 252 256 263 252 261 245 252 294
288 245 251 269 256 264 252 232 275 284 252
263 274 252 252 256 254 269 234 285 275 263
263 246 294 252 231 265 269 235 275 288 294
263 247 252 269 261 266 269 236 276 248 299

How should we group this data set?

One way of answering the previous question is using Sturges' rule.
This is a practical rule to find an appropriate number of classes, $k$, and the width of each class, $h$:

$$k = I(\log_2 n) + 1$$

$$h = I\left(\frac{\Delta}{k}\right) + 1,$$

where $I(x)$ stands for the integer part of $x$, and $\Delta$ stands for the difference between the maximum and minimum observations in the data set.

Example (Newbold, p. 42): Time to complete a task.

$$k = I(\log_2 110) + 1 = I(6.78) + 1 = 7 \quad \text{(7 classes)}$$

$$h = I\left(\frac{299 - 222}{7}\right) + 1 = 12 \quad \text{(width 12)}$$
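
The rule is easy to check numerically. The sketch below assumes only the sample size and the extreme observations of the example (110, 222 and 299); the function name `sturges` is illustrative.

```python
import math


def sturges(n, minimum, maximum):
    """Number of classes k and class width h, following the rule stated above."""
    k = math.floor(math.log2(n)) + 1        # k = I(log2 n) + 1
    delta = maximum - minimum               # range of the data
    h = math.floor(delta / k) + 1           # h = I(delta / k) + 1
    return k, h


k, h = sturges(n=110, minimum=222, maximum=299)
print(k, h)
```

With 7 classes of width 12 starting at 222, the classes span 222 to 306, which covers the largest observation (299).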


Example (Newbold, p. 42): Time to complete a task.

Classes      ni     fi       Ni     Fi
[222,234[     7     0.064      7    0.064
[234,246[    11     0.100     18    0.164
[246,258[    30     0.273     48    0.436
[258,270[    32     0.291     80    0.727
[270,282[    15     0.136     95    0.864
[282,294[     9     0.082    104    0.945
[294,306]     6     0.055    110    1.000
Total       110     1

After grouping the values, the individuality of each observation is lost.

We will therefore assume that all the observations are uniformly distributed within each class.

For example, what is the percentage of employees that completed the task in less than 250 seconds?

$$0.164 + 0.273 \times \frac{250 - 246}{258 - 246} = 0.255 = 25.5\%$$
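
The interpolation above can be written as a small function. This is a sketch that assumes the grouped table of the example and the uniform-within-class assumption; `proportion_below` is a hypothetical name, and it works from the absolute frequencies to avoid rounding error.

```python
# Grouped distribution of the example: class lower limits, common width, absolute frequencies.
lower_limits = [222, 234, 246, 258, 270, 282, 294]
width = 12
abs_freq = [7, 11, 30, 32, 15, 9, 6]


def proportion_below(x):
    """Estimated proportion of observations below x, assuming uniformity within classes."""
    n = sum(abs_freq)
    covered = 0.0
    for lower, n_i in zip(lower_limits, abs_freq):
        if x >= lower + width:
            covered += n_i                        # the whole class lies below x
        elif x > lower:
            covered += n_i * (x - lower) / width  # only a fraction of this class
            break
        else:
            break
    return covered / n


print(round(proportion_below(250), 3))
```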


Graphical representation of simple frequencies

The most common chart used to plot numerical continuous data is called a histogram.

Unlike bar charts, the width of the classes of the histogram is not irrelevant: the areas of the rectangles should be proportional to the frequencies.

If all the classes have the same width, this requirement is automatically satisfied; if not, it is necessary to make an adjustment to the heights of the rectangles.

[Figure: histogram of relative frequencies; horizontal axis: classes [222, 234[ to [294, 306]; vertical axis from 0 to 0.35.]

Histogram of relative frequencies

If all the classes have the same width, we have that:

• Area of the 1st rectangle: 12×0.064 = 0.768
• Area of the 2nd rectangle: 12×0.100 = 1.2
• Area of the 3rd rectangle: 12×0.273 = 3.276
• …
• Total area: 12×(0.064 + 0.1 + …) = 12

The areas of the rectangles are proportional to the frequencies:

$$\frac{0.064}{1} = \frac{0.768}{12}, \quad \frac{0.1}{1} = \frac{1.2}{12}, \quad \ldots$$

[Figure: histogram of relative frequencies for the classes [222, 234[ to [294, 306], with bar heights 0.064, 0.1, 0.273, 0.291, 0.136, 0.082 and 0.055; vertical axis from 0 to 0.35.]

If, for some reason, the first two classes were aggregated, we would obtain a new frequency distribution:

Classes      fi          Classes      fi
[222,234[    0.064       [222,246[    0.164
[234,246[    0.100       [246,258[    0.273
[246,258[    0.273       [258,270[    0.291
[258,270[    0.291       [270,282[    0.136
[270,282[    0.136       [282,294[    0.082
[282,294[    0.082       [294,306]    0.055
[294,306]    0.055       Total        1
Total        1

Using the new frequencies to build a “histogram” we get:

[Figure: chart of the relative frequencies of the aggregated classes [222, 246[ to [294, 306]; vertical axis from 0 to 0.35.]

Let us see why this is not a histogram:

• Area of the 1st rectangle: 24×0.164 = 3.936
• Area of the 2nd rectangle: 12×0.273 = 3.276
• Area of the 3rd rectangle: 12×0.291 = 3.492
• …
• Total area: 24×0.164 + 12×(0.273 + ⋯) = 13.98

The areas of the rectangles are not proportional to the frequencies:

$$\frac{0.164}{1} \neq \frac{3.936}{13.98}$$

[Figure: chart of the aggregated classes [222, 246[ to [294, 306], with bar heights 0.164, 0.273, 0.291, 0.136, 0.082 and 0.055; vertical axis from 0 to 0.35.]

The previous chart is not a histogram because it violates its assumptions.

A class with twice the width should have a height equal to half of its frequency to preserve the proportions in the chart.

In general, if we are dealing with classes of different widths, histograms should be built using frequency densities (absolute or relative):

$$\frac{n_i}{h_i} \quad \text{or} \quad \frac{f_i}{h_i}$$

[Figure: histogram of relative frequency densities for the classes [222, 246[ to [294, 306]; vertical axis from 0 to 0.03.]
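
A short Python sketch of the density adjustment, assuming the aggregated classes of the example. The absolute frequencies (18, 30, 32, 15, 9, 6) come from the grouped table with the first two classes merged; the variable names are illustrative.

```python
# Aggregated classes of the example: (lower limit, upper limit, absolute frequency).
classes = [
    (222, 246, 18),
    (246, 258, 30),
    (258, 270, 32),
    (270, 282, 15),
    (282, 294, 9),
    (294, 306, 6),
]

n = sum(n_i for _, _, n_i in classes)

# Bar height = relative frequency density f_i / h_i.
heights = [(n_i / n) / (upper - lower) for lower, upper, n_i in classes]

# Bar area = width x height, which gives back the relative frequency f_i.
areas = [(upper - lower) * height for (lower, upper, _), height in zip(classes, heights)]

print([round(height, 4) for height in heights])
print(round(sum(areas), 6))
```

Each bar area equals the relative frequency of its class, so the wide first class no longer looks inflated and the total area is 1.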


Property:

A histogram plotted with the absolute frequency densities has an area equal to $n$.

Proof:

$$A_{\text{total}} = h_1 \times \frac{n_1}{h_1} + h_2 \times \frac{n_2}{h_2} + \dots + h_k \times \frac{n_k}{h_k} = \sum_{i=1}^{k} h_i \times \frac{n_i}{h_i} = \sum_{i=1}^{k} n_i = n$$

Property:

A histogram plotted with the relative frequency densities has an area equal to 1.

Proof:

$$A_{\text{total}} = h_1 \times \frac{f_1}{h_1} + h_2 \times \frac{f_2}{h_2} + \dots + h_k \times \frac{f_k}{h_k} = \sum_{i=1}^{k} h_i \times \frac{f_i}{h_i} = \sum_{i=1}^{k} f_i = 1$$

Graphical representation of simple frequencies

The frequency polygon is a polygonal line obtained by joining the centres of the tops of the histogram bars and also the centres of the “imaginary” classes created at the start and at the end of the histogram.

[Figure: frequency polygon; horizontal axis: classes from [210, 222[ to ]306, 318[, including the two imaginary classes; vertical axis from 0 to 0.35.]

Graphical representation of simple frequencies

The frequency polygon is particularly useful if we wish to compare two distributions using the same chart.

[Figure: frequency polygons on the same chart; horizontal axis: classes from [210, 222[ to ]306, 318[; vertical axis from 0 to 0.4.]

Graphical representation of cumulative frequencies

The cumulative frequencies of continuous variables are graphically represented by a chart called an ogive.

Classes      Ni
[222,234[      7
[234,246[     18
[246,258[     48
[258,270[     80
[270,282[     95
[282,294[    104
[294,306]    110

[Figure: ogive of the absolute cumulative frequencies; horizontal axis from 222 to 306; vertical axis from 0 to 120.]

Summary - Graphical representation of statistical variables

Variable type            Frequencies               Charts
Categorical              Simple frequencies        Bar chart, pie chart
Numerical, discrete      Simple frequencies        Bar chart, pie chart
Numerical, discrete      Cumulative frequencies    Cumulative frequency distribution
Numerical, continuous    Simple frequencies        Histogram, frequency polygon
Numerical, continuous    Cumulative frequencies    Ogive

Exercise (chapter 1, proposed exercise 11)

We asked 300 retail employees (who work in downtown Lisbon) about the effective number of hours worked per week. The table below presents the results.

Classes     ci      ni     fi       Ni     Fi
[a, 20[     15      4      i        4      0.013
[20, 35[    27.5    f      0.050    19     o
[35, 40[    e       101    0.337    120    p
[40, b[     45      g      j        m      0.800
[50, 60[    55      50     k        n      q
[c, d]      65      h      l        300    1
Total               300    1

a) Find the missing information. Justify.
b) Plot the histogram of area 1.
c) Clarify the following questions, made by an amateur:
   i. From the above table, can I check the values for all the 300 employees?
   ii. Where are the employees who answered 40 hours placed?
   iii. Where are the employees who answered d hours and one second placed?
   iv. How do the observations behave within each class?

1.4 Two-dimensional data


Two-dimensional data

There are situations in which we need to describe relationships between two variables.
In those cases, each element of the data set is evaluated on both variables and is therefore represented by a pair of observations, one for each variable.
For example, an individual can be asked about their age and height. In this case, the individual is represented by a pair of numbers: one corresponding to their age, and the other to their height.

While dealing with numerical raw data, the observations are usually listed side by side.

As for its graphical representation, it is usual to use a scatter plot. This chart uses the pairs of observations as coordinates, and each observation is represented by a point $(x_i, y_i)$, $i = 1, \ldots, n$.

Example: Height (in m) and weight (in kg) of a group of 10 individuals.

[Figure: scatter plot; horizontal axis: Height (m), from 1.5 to 1.95; vertical axis: Weight (kg), from 50 to 95.]

While dealing with tabulated data, either categorical or numerical, two-dimensional data is usually organized in a cross-table (or contingency table), which lists the number (or percentage) of observations for each possible combination of the values of the two variables.
The values inside a cross-table are the joint absolute frequencies ($n_{ij}$) or the joint relative frequencies ($f_{ij}$), as they refer to counts or percentages, respectively.

Example (Newbold, p. 30): The following cross-table contains data from a survey on health and nutrition of the U.S. population, conducted in 2005. The table contains information on the gender and activity level of a group of 4460 individuals.

               Male   Female
Sedentary       957     1226
Active          340      417
Very active     842      678

Joint frequency distribution of gender and activity level.

By adding the frequencies in each row and column we get the marginal frequencies for each variable.

               Male   Female   Total
Sedentary       957     1226    2183
Active          340      417     757
Very active     842      678    1520
Total          2139     2321    4460

The graphical representation of cross-tables is usually made by a component bar chart or by a cluster bar chart.

[Figure: component bar chart and cluster bar chart of the cross-table, with Male and Female on the horizontal axis and the activity levels (Sedentary, Active, Very active) as series.]

We can also analyse the conditional frequencies. They are obtained by dividing the joint frequency by the marginal frequency.

               Male   Female   Total
Sedentary       957     1226    2183
Active          340      417     757
Very active     842      678    1520
Total          2139     2321    4460

For example, using the previous data set, we can answer questions like:

• What is the percentage of men who are sedentary?

$$\frac{957}{2139} = 0.447$$

• What is the percentage of women inside the active group?

$$\frac{417}{757} = 0.551$$
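
A minimal Python sketch of marginal and conditional frequencies, assuming the gender and activity cross-table of the example is stored as a nested dictionary (a representation chosen here only for illustration):

```python
# Joint absolute frequencies n_ij: activity level (rows) by gender (columns).
joint = {
    "Sedentary":   {"Male": 957, "Female": 1226},
    "Active":      {"Male": 340, "Female": 417},
    "Very active": {"Male": 842, "Female": 678},
}

# Marginal frequencies: row totals and column totals.
row_totals = {level: sum(cells.values()) for level, cells in joint.items()}
col_totals = {}
for cells in joint.values():
    for gender, n_ij in cells.items():
        col_totals[gender] = col_totals.get(gender, 0) + n_ij

# Conditional frequency = joint frequency / marginal frequency of the conditioning group.
sedentary_given_male = joint["Sedentary"]["Male"] / col_totals["Male"]
female_given_active = joint["Active"]["Female"] / row_totals["Active"]

print(round(sedentary_given_male, 3), round(female_given_active, 3))
```

Note that the divisor changes with the question: the column total when conditioning on gender, the row total when conditioning on activity level.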


Exercise (chapter 1, proposed exercise 17)

A real estate agency wants to analyze the number of years of experience of their agents (X) and the number of houses they sold last month (Y). The information regarding the 100 agents working in the agency is summarized in the next table.

                  Sales (Y)
Experience (X)    [0, 2]   ]2, 4]   ]4, 6]   ]6, 8]
[0, 2]                 4        6        8        7
]2, 4]                 2        6       10       17
]4, 6]                 3        6        9       12
]6, 8]                 1        2        3        4

Classify each of the following sentences as True or False, quantifying your justification.

a) 10% of the agents have worked in the agency for more than 6 years.
b) 20% of the agents have worked in the agency for more than 4 years and sold more than 6 houses in the last month.
c) Among the agents that sold more than 6 houses in the last month, 5% have worked in the agency for more than 6 years.
d) There is no relationship between the experience and the performance of the agents.

https://www.gapminder.org/fw/world-health-chart/