EA311 Lecture Note One
1 EXPLORATORY DATA ANALYSIS
Learning Objectives
General Concepts of Exploratory (Preliminary) Statistics
Data Variable Types
Statistical Characteristics and Graphical Methods of Presenting
Qualitative Variables
Statistical Characteristics and Graphical Methods of Presenting
Quantitative Variables
Explanation
The original goal of statistics was to collect data about a population based on population samples.
By population we mean a group of all existing components available for observation during
statistical research. For example:
If statistical research is performed on the physical height of 15-year-old girls, the
population will be all girls currently aged 15.
Considering the fact that the number of population members is usually high, the research will
be based on the so-called sample examination where only part of the population is used. The
examined part of the population is called a sample. What's really important is to make a
definite selection that is as representative of the whole group as possible.
There are several ways to achieve it. To avoid omitting some elements of the population,
a so-called random sample is used, in which each element of the population has the same
chance of being selected.
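As a minimal sketch of drawing a simple random sample, the snippet below uses Python's standard `random` module. The population of 1000 labelled elements, the sample size of 50 and the seed are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical population: labels for 1000 girls currently aged 15.
population = [f"girl_{i}" for i in range(1, 1001)]

# A seeded generator makes the illustration reproducible.
rng = random.Random(42)

# sample() draws without replacement, so every element has the same
# chance of being selected and no element is picked twice.
sample = rng.sample(population, k=50)

print(len(sample), "elements selected out of", len(population))
```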
It goes without saying that sample examination can never be as accurate as examining the
whole population. Why do we prefer it then?
2. To avoid damaging samples in destructive testing (some tests like examining
cholesterol in blood etc., lead to the permanent damage of examined elements).
3. Because the whole population is not available.
Now that you know that statistics can describe the whole population based on information
gathered from a population sample we will move on to Exploratory Data Analysis (EDA).
The data we observe will be called variables and their values variants. EDA is
often the first step in revealing information hidden in a large number of variables and their
variants.
Because the way of processing variables depends most on their type, we will now explore
how variables are divided into different categories. The division of variables is shown in the
following diagram.
Variable
  Qualitative (categorical, lexical...)
    Nominal
    Ordinal
    divided by number of variants: Alternative, Plural
  Quantitative (numerical...)
    Finite
    Denumerable
Qualitative variable – its variants are expressed verbally and they split into two
general subgroups according to what relation is between their values:
Ordinal variable – forms a transition between qualitative and quantitative
variables: individual variants can be sorted and compared with one
another (for example: clothing sizes S, M, L, and XL)
Alternative variable – has only two possible options (e.g. sex – male or
female, etc.)
Plural variable – has more than two possible options (e.g. education, name,
eye color, etc.)
Discrete finite variable – it has a finite number of variants (e.g. math grades
1, 2, 3, 4, 5)
Continuous variable – it can take any value from $\mathbb{R}$ or from some subset of it (e.g.
distance between cities, etc.)
Additional clues
Imagine that you have a large statistical group and you face a question of how to best
describe it. Number representations of values are used to “replace” the group elements and
they become the basic attributes of the group. This is what we call statistical characteristics.
In the next chapters we are going to learn how to set up statistical characteristics for
various types of variables and how to represent larger statistical groups.
Nominal variable has different but equivalent variants in one group. The number of these
variants is usually low and that's why the first statistical characteristics we use to describe it
will be its frequency.
If a qualitative variable has k different variants (we denote their frequencies
as n1, n2, ..., nk), then in a statistical group of n values it must be true that:
$n_1 + n_2 + \dots + n_k = \sum_{i=1}^{k} n_i = n$
If you want to express the proportion of the variant frequency on the total number of
occurrences, we use relative frequency to describe the variable.
Relative frequency $p_i$
- is defined as:
$p_i = \frac{n_i}{n}$
alternatively:
$p_i = \frac{n_i}{n} \cdot 100\,\%$
(We use the second formula to express the relative frequency as a percentage).
For relative frequency it must be true that:
$p_1 + p_2 + \dots + p_k = \sum_{i=1}^{k} p_i = 1$
When qualitative variables are processed, it is good to arrange frequency and relative
frequency in the so-called frequency table:
FREQUENCY TABLE
Values x_i   Absolute frequency n_i    Relative frequency p_i
x_1          n_1                       p_1
x_2          n_2                       p_2
...          ...                       ...
x_k          n_k                       p_k
Total        \sum_{i=1}^{k} n_i = n    \sum_{i=1}^{k} p_i = 1
Mode
- is defined as a variant that occurs most frequently
The mode represents a typical element of the group. The mode cannot be determined uniquely if
several variants share the maximum frequency in the statistical group.
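The rule above can be sketched in a few lines of Python. The function name `unique_mode` is hypothetical, and the sketch assumes a non-empty list of variants; it returns `None` when several variants tie for the maximum frequency.

```python
from collections import Counter

def unique_mode(values):
    """Return the most frequent variant, or None when several variants tie."""
    counts = Counter(values)
    top = max(counts.values())
    # Collect every variant that reaches the maximum frequency.
    winners = [variant for variant, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else None

print(unique_mode(["red", "blue", "red", "green"]))
print(unique_mode(["red", "blue"]))
```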
Statistics often uses graphs for better analysis of variables. There are two types of graphs
for analyzing a nominal variable:
Histogram is a standard graph where variants of the variable are represented on one axis and
their frequencies on the other axis. Individual values of the frequency are then displayed as
bars (boxes, vectors, squared logs, cones, etc.)
Examples:
[Figure: five histograms titled "Classification", each showing variants 1 to 4 on the horizontal axis and frequencies from 0 to 20 on the vertical axis]
Pie chart represents relative frequencies of individual variants of a variable. Frequencies are
presented as proportions in a sector of a circle. When we change the angle of the circle, we
can get elliptical, three-dimensional effect.
[Figure: four pie charts titled "Classification" with sectors for variants 1 to 4, labelled with the frequencies 3, 6, 9 and 19]
REMEMBER! Describing the pie chart is necessary. Marking individual sectors by relative
frequencies only without adding their absolute values is not sufficient.
Example: An opinion poll has been carried out about launching high school fees. Its results
are shown on the following chart:
[Pie chart: YES 50%, NO 50%]
Aren’t the results interesting? No matter how true they may be, it is recommended that the
chart be modified as follows:
[Pie chart: YES 1, NO 1]
What is the difference? From the second chart it is obvious that only two people were asked -
the first one said YES and the second one said NO. What can be learned from that? Make
charts in such a way that their interpretation is absolutely clear. If you are presented with a pie
chart without absolute frequencies marked on it, you can ask yourselves whether it is because
of the author’s ignorance or it is a deliberate bias.
An observational study has been undertaken on the use of an intersection. The collected data
are in the table below. The data is made up of colours of cars that pass through the
intersection. Analyze the data and interpret the results in a graphical form.
Solution:
From the table it is obvious that the collected colours are qualitative (lexical) variables, and
because there is no order or comparison between them, we can say they are nominal variables.
For better description we create a frequency table and we determine the mode. We are
going to present the colours of the passing vehicles by a histogram and a pie chart.
FREQUENCY TABLE
Colors of passing cars   Absolute frequency n_i   Relative frequency p_i
red                      5                        5/12 = 0.42
blue                     3                        3/12 = 0.25
white                    1                        1/12 = 0.08
green                    3                        3/12 = 0.25
Total                    12                       1.00
We observed 12 cars total.
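The frequency table for the car colours can be reproduced with a short Python sketch. The raw observations are not listed in these notes, so the list below is rebuilt from the counts in the table (an assumption about the order only, which does not affect the frequencies); the function name `frequency_table` is hypothetical.

```python
from collections import Counter

# Observations rebuilt from the counts in the frequency table.
observations = ["red"] * 5 + ["blue"] * 3 + ["white"] * 1 + ["green"] * 3

def frequency_table(values):
    """Return {variant: (absolute frequency n_i, relative frequency p_i)}."""
    counts = Counter(values)
    n = sum(counts.values())
    return {variant: (n_i, n_i / n) for variant, n_i in counts.items()}

table = frequency_table(observations)
for variant, (n_i, p_i) in table.items():
    print(f"{variant:<6} {n_i:>3} {p_i:.2f}")

# The variant with the highest absolute frequency is the mode.
mode = max(table, key=lambda variant: table[variant][0])
print("mode:", mode)
```

With these counts the relative frequencies sum to 1 and the mode is "red", which agrees with the table.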
[Figure: histogram and pie chart of the colours of passing cars (red 5, blue 3, white 1, green 3)]
Now we are going to have a look at describing ordinal variables. The ordinal variable (just
like the nominal variable) has various verbal variants in the group but these variants can be
sorted, i.e. we can tell which one is "smaller" and which one is "bigger".
For describing ordinal variables we use the same statistical characteristics and graphs as
for nominal variables (frequency, relative frequency, mode, viewed by histogram or pie chart)
plus two other characteristics (cumulative frequency and cumulative relative frequency), which
include information about how the variants are sorted.
E.g. we have a variable called "grade from Statistics" that has the following variants:
"1", "2", "3" or "4" (where 1 is the best and 4 the worst grade). Then, for example, the
cumulative frequency for variant "3" will be equal to the number of students who got grade
"3" or better.
If variants are sorted by their "size" ($x_1 < x_2 < \dots < x_k$) then the following must be true for the cumulative frequency $m_i$:
$m_i = \sum_{j=1}^{i} n_j$
$m_k = n$
The second special characteristic for an ordinal variable is the cumulative relative frequency $F_i$.
- it is the proportion of the group made up of values with the i-th and lower variants. It is
expressed by the following formula:
$F_i = \sum_{j=1}^{i} p_j = \frac{m_i}{n}$
Just as in the case of nominal variables we can present statistical characteristics using
frequency table for ordinal variables. In comparison to the frequency table of nominal
variables it also contains values of cumulative and cumulative relative frequencies.
FREQUENCY TABLE
Values x_i   Absolute frequency n_i    Cumulative frequency m_i       Relative frequency p_i    Relative cumulative frequency F_i
x_1          n_1                       m_1 = n_1                      p_1                       F_1 = p_1
x_2          n_2                       m_2 = m_1 + n_2                p_2                       F_2 = p_1 + p_2 = F_1 + p_2
...          ...                       ...                            ...                       ...
x_k          n_k                       m_k = m_{k-1} + n_k = n        p_k                       F_k = F_{k-1} + p_k = 1
Total        \sum_{i=1}^{k} n_i = n    -----                          \sum_{i=1}^{k} p_i = 1    -----
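The columns of the ordinal frequency table can be computed with running sums. The sketch below uses hypothetical grade counts (they are not taken from these notes) and the standard `itertools.accumulate` function.

```python
from itertools import accumulate

# Hypothetical grade counts, ordered from the best grade "1" to the worst "4".
variants = ["1", "2", "3", "4"]
n_i = [4, 7, 6, 3]

n = sum(n_i)
m_i = list(accumulate(n_i))      # cumulative frequency: m_i = n_1 + ... + n_i
p_i = [c / n for c in n_i]       # relative frequency: p_i = n_i / n
F_i = [m / n for m in m_i]       # cumulative relative frequency: F_i = m_i / n

for row in zip(variants, n_i, m_i, p_i, F_i):
    print("{:<3} {:>3} {:>3} {:.2f} {:.2f}".format(*row))
```

The last cumulative frequency equals n and the last cumulative relative frequency equals 1, as required by the formulas for $m_k$ and $F_k$.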
We briefly mentioned the histogram and the pie chart as good ways of presenting an ordinal
variable. But these graphs don't reflect the sorting of the variants. To achieve that, we need to use
a polygon of cumulative frequencies (also known as an ogive) and a Pareto graph.
Frequency Polygon
- is a line chart. The frequency is placed along the vertical axis and the individual variants of
the variable are placed along the horizontal axis (sorted in ascending order from the "lowest"
to the "highest"). The plotted values are connected by lines.
[Figure: "Frequency polygon for the evaluation grades", variants 1 to 4 on the horizontal axis, frequency from 0 to 20 on the vertical axis]
[Figure: second line chart over variants 1 to 4 with a vertical axis from 0 to 40]
Pareto Graph
- is a bar chart for qualitative variable with the bars arranged by frequency
- variants are on horizontal axis and are sorted from the “highest” importance to the “lowest”
Notice how the growth of the cumulative frequency slows down: its increments shrink as the frequencies of the variants decrease.
The following data represent the t-shirt sizes that a clothes retailer offers for sale:
Solution:
a) The variable is qualitative (lexical) and t-shirt sizes can be sorted, therefore it is an
ordinal variable. For its description you use frequency table for the ordinal variable and you
determine the mode.
FREQUENCY TABLE
Colors of passing cars   Absolute frequency n_i   Relative frequency p_i
red                      5
blue                     3
white                    1
green                    3
Total                    12                       1.00
For graphical representation use histogram, pie graph and cumulative frequency polygon (you
don't create Pareto graph because it is mostly used for technical data).
Graphical output:
[Figure: histogram "Sold t-shirt" over variants S, M, L, XL; pie chart "Sold t-shirt" with S 14%, M 27%, L 27%, XL 32%; line chart over variants S, M, L, XL with a vertical axis from 0 to 25]
b) You get the answer from the value of the relative cumulative frequency for variant L. You
see that 68% of people bought t-shirts of L size and smaller.
1.2.1 Measures of Location and Variability
The most common measure of position is the variable mean. The mean represents average or
typical value of the sample population. The most famous mean of quantitative variable is:
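Arithmetical mean $\bar{x}$
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
1. $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
- the sum of all deviations of the variable values from their arithmetical mean is equal to
zero, which means that the arithmetical mean compensates for mistakes caused by random
errors.
2. $\forall a \in \mathbb{R}: \; \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \;\Rightarrow\; \frac{\sum_{i=1}^{n} (a + x_i)}{n} = a + \bar{x}$
- if the same number is added to all the values of the variable, the arithmetical
mean increases by the same number
3. $\forall b \in \mathbb{R}: \; \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \;\Rightarrow\; \frac{\sum_{i=1}^{n} b x_i}{n} = b \bar{x}$
- if all the variable values are multiplied by the same number, the arithmetical
mean is multiplied by that number as well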
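The three properties can be checked numerically. The sketch below uses arbitrary illustrative values for the data and for the constants a and b; it is a sanity check under those assumptions, not a proof.

```python
from statistics import fmean

x = [2, 4, 4, 7, 13]       # arbitrary illustrative values
a, b = 10, 3               # arbitrary constants
mean_x = fmean(x)

# Property 1: deviations from the mean sum to zero.
deviation_sum = sum(x_i - mean_x for x_i in x)

# Property 2: adding a to every value adds a to the mean.
shifted_mean = fmean([a + x_i for x_i in x])

# Property 3: multiplying every value by b multiplies the mean by b.
scaled_mean = fmean([b * x_i for x_i in x])

print(mean_x, deviation_sum, shifted_mean, scaled_mean)
```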
Arithmetical mean is not always the best way to calculate the mean of the sample population.
For example, if we work with a variable representing relative changes (cost indexes, etc.) we
use the so-called geometrical mean. To calculate the mean when the variable is expressed per unit
of another quantity (for example a speed or another rate), the harmonical mean is often used.
Considering that the mean uses the whole variable values data set, it carries maximum
information about the sample population. On the other hand, it's very sensitive to the so-called
outlying observations (outliers). Outliers are values that are substantially different from the
rest of the values in a group and they can distort the mean to such a degree that it no longer
represents the sample population. We are going to have a closer look at the Outliers later.
Measures of location that are less dependent on the outlying observations are:
Mode x̂
In the case of mode we will differentiate between discrete and continuous quantitative
variable. For discrete variable we define mode as the most frequent value of the
variable (similarly as with the qualitative variable).
But in the case of continuous variable we think of the mode as the value around
which most variable values are concentrated.
For assessment of this value we use the shorth. The shorth is the shortest interval containing at
least 50% of the variable values. In the case of a sample of size $n = 2k$, $k \in \mathbb{N}$ (with an even
number of values), k values lie within the shorth, which is n/2 (50%) of the variable values. In
the case of a sample of size $n = 2k + 1$, $k \in \mathbb{N}$ (with an odd number of values), $k + 1$
values lie within the shorth, which is half a value more than 50% of the variable values (n/2 + 1/2).
From what has been said so far it is clear that the shorth length (top boundary -
bottom boundary) is unique but its location is not.
22 82 27 43 19 47 41 34 34 42 35
Solution:
a) Mean:
In this case we use arithmetical mean:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{22 + 82 + 27 + 43 + 19 + 47 + 41 + 34 + 34 + 42 + 35}{11} \approx 38.7$ years
b) Shorth:
Our sample population has 11 values. 11 is an odd number. 50% of 11 is 5.5 and the
nearest higher natural number is 6 - otherwise: n/2+1/2 = 11/2+1/2 = 12/2 = 6. That
means that 6 values will lie in the Shorth.
The shortest of these intervals will be the shorth (size of the interval = $x_{i+5} - x_i$).
From the table you can see that the shortest interval has the value of 9. There is only one
interval that corresponds to this size and that is ⟨34; 43⟩.
Shorth = ⟨34; 43⟩ and that means that half of the musicians are between 34 and 43 years of
age.
c) Mode:
$\hat{x} = \frac{34 + 43}{2} = 38.5$
Mode = 38.5 years which means that the typical age of the musicians who performed at
the concert was 38.5 years.
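The worked example can be retraced with a short Python sketch. The function name `shorth` is hypothetical; the window size follows the rule stated earlier (k values for n = 2k, k + 1 values for n = 2k + 1), and when several intervals share the minimum length the sketch simply keeps the leftmost one, which is an assumption.

```python
from statistics import fmean

def shorth(values):
    """Return (lower, upper) of the shortest interval holding about half the values."""
    ordered = sorted(values)
    n = len(ordered)
    size = (n + 1) // 2          # k for n = 2k, k + 1 for n = 2k + 1
    # Every window of `size` consecutive ordered values is a candidate interval.
    candidates = [
        (ordered[i + size - 1] - ordered[i], ordered[i], ordered[i + size - 1])
        for i in range(n - size + 1)
    ]
    _, lower, upper = min(candidates)   # ties resolve to the leftmost interval
    return lower, upper

ages = [22, 82, 27, 43, 19, 47, 41, 34, 34, 42, 35]
mean_age = fmean(ages)
lower, upper = shorth(ages)
mode_age = (lower + upper) / 2          # midpoint of the shorth

print(round(mean_age, 1), (lower, upper), mode_age)
```

For the musicians' ages this gives a mean of about 38.7, the shorth ⟨34; 43⟩ and the mode 38.5, matching the solution above.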
Among other characteristics describing quantitative variables are quantiles. Those are used
for more detailed illustration of the distribution of the variable values within the scope of the
population.
Quantiles
Quantiles describe the location of individual values (within the variable scope) and, like the mode, are
resistant to outlying observations. Generally, a quantile is
defined as a value that divides the sample into two parts. The first part contains values
that are smaller than the given quantile and the second part contains values larger than or equal
to the given quantile. The data must be sorted in ascending order from the lowest to the
highest value.
The quantile of variable x that separates the 100p% smaller values from the rest of the sample
(i.e. from 100(1-p)% of the values) will be called the 100p% quantile and marked $x_p$.
In real life you most often come across the following quantiles:
Quartiles
In case of the four-part division the values of the variate corresponding to 25%, 50%,
and 75% of the total distribution are called quartiles.
Lower quartile $x_{0.25}$ = 25% quantile - divides a sample of data in a way that 25%
of the values are smaller than the quartile, i.e. 75% are bigger (or equal)
Median $x_{0.5}$ = 50% quantile - divides a sample of data in a way that 50% of the
values are smaller than the median and 50% of values are bigger (or equal)
Upper quartile $x_{0.75}$ = 75% quantile - divides a sample of data in a way that 75%
of values are smaller than the quartile, i.e. 25% are bigger (or equal)
Example:
Data 6 47 49 15 43 41 7 39 43 41 36
Data in ascending order 6 7 15 36 39 41 41 43 43 47 49
Median 41
Upper quartile 43
Lower quartile 15
The difference between the 1st and 3rd quartile is called the Inter-Quartile Range
(IQR).
$IQR = x_{0.75} - x_{0.25}$
Example:
Data 2 3 4 5 6 6 6 7 7 8 9
Upper quartile 7
Lower quartile 4
IQR 7-4=3
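The two examples can be reproduced with a small quantile function. The sketch assumes the rank rule $z_p = np + 0.5$ that these notes apply in the MAD calculation, rounds a non-integer rank to the nearest position, and averages the two neighbouring values when the rank falls exactly halfway between them; the rounding and averaging conventions are assumptions, and other software may use different ones.

```python
import math

def quantile(values, p):
    """100p % quantile using the 1-based rank z_p = n * p + 0.5 in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    z = n * p + 0.5
    z = min(max(z, 1), n)            # keep the rank inside the data
    lower = math.floor(z)
    if math.isclose(z - lower, 0.5):
        # Assumption: a rank exactly between two positions averages both values.
        return (ordered[lower - 1] + ordered[lower]) / 2
    return ordered[round(z) - 1]

data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]
print(quantile(data, 0.25), quantile(data, 0.5), quantile(data, 0.75))

data2 = [2, 3, 4, 5, 6, 6, 6, 7, 7, 8, 9]
iqr = quantile(data2, 0.75) - quantile(data2, 0.25)
print(iqr)
```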
For example, the 80th percentile is the number that has 80% of values below it and
20% above it. Rather than counting 80% from the bottom, count 20% from the top.
2. The individual values are sequenced so that the smallest value is at the first place
and the highest value is at n-th place (n is the total number of values)
REMEMBER!!!
In the case of a data set with an even number of values the median is not uniquely
defined. Any number between the two middle values (including these values) can be
accepted as the median. Most often it is taken as the midpoint of the two middle values.
We are now going to discuss the relation between quantiles and the cumulative
relative frequency. The value p denotes the cumulative relative frequency of quantile $x_p$,
i.e. the relative frequency of those variable values that are smaller than quantile $x_p$.
Quantile and cumulative relative frequency are inverse concepts.
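We put the sample population in ascending order ($x_1 < x_2 < \dots < x_n$) and we denote
$p(x_i)$ as the relative frequency of the value $x_i$. For the empirical distribution function $F(x)$
it must then be true that:
$F(x) = \begin{cases} 0 & \text{for } x \le x_1 \\ \sum_{i=1}^{j} p(x_i) & \text{for } x_j < x \le x_{j+1},\; 1 \le j \le n-1 \\ 1 & \text{for } x_n < x \end{cases}$
$p(x_i) = \lim_{x \to x_i^{+}} F(x) - F(x_i)$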
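[Figure: step graph of the empirical distribution function F(x), rising from 0 to 1 over x1, x2, x3, ..., xn-1, xn, with the steps p(x2) and p(xn) marked]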
MAD
3. For each value determine absolute value of its deviation from the median
5. Determine the median of the absolute deviations from the median i.e. MAD
Solution:
a) You need to determine the Lower Quartile $x_{0.25}$, the Median $x_{0.5}$ and the Upper Quartile $x_{0.75}$.
First, you order the data by size and assign a sequence number to each value.
Now you can divide the data set into quartiles and mark their variable values accordingly:
i.e. 75% musicians are under 43 (25% of them are 43 years old or older).
IQR = x0.75 – x0.25 = 43 – 27 = 16
c) MAD
If you want to determine this characteristic you must follow its definition (the median of
absolute deviations from the median).
x0.5 = 35
Original data x_i   Ordered data y_i   Absolute deviation |y_i - x_0.5|   Ordered absolute deviations M_i
22                  19                 |19 - 35| = 16                     0
82                  22                 |22 - 35| = 13                     1
27                  27                 |27 - 35| = 8                      1
43                  34                 |34 - 35| = 1                      6
19                  34                 |34 - 35| = 1                      7
47                  35                 |35 - 35| = 0                      8
41                  41                 |41 - 35| = 6                      8
34                  42                 |42 - 35| = 7                      12
34                  43                 |43 - 35| = 8                      13
42                  47                 |47 - 35| = 12                     16
35                  82                 |82 - 35| = 47                     47
MAD = $M_{0.5}$
p = 0.5; n = 11 ⇒ $z_p = 11 \cdot 0.5 + 0.5 = 6$ ⇒ $M_{0.5} = M_6 = 8$
(MAD is the median absolute deviation from the median, i.e. the 6th value of the ordered absolute
deviations from the median)
MAD = 8.
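The same calculation can be written as a minimal Python sketch using `statistics.median`. For an odd number of values it returns the middle value, as in the table above; for an even number it returns the average of the two middle values, which is one of the admissible choices.

```python
from statistics import median

ages = [22, 82, 27, 43, 19, 47, 41, 34, 34, 42, 35]

med = median(ages)                                   # median of the data
abs_deviations = sorted(abs(a - med) for a in ages)  # ordered |y_i - x_0.5|
mad = median(abs_deviations)                         # median of the deviations

print(med, abs_deviations, mad)
```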
d) The last task was to draw the Empirical Distribution Function. Here is its definition:
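$F(x) = \begin{cases} 0 & \text{for } x \le x_1 \\ \sum_{i=1}^{j} p(x_i) & \text{for } x_j < x \le x_{j+1},\; 1 \le j \le n-1 \\ 1 & \text{for } x_n < x \end{cases}$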
- Arrange the variable values as well as their frequencies and relative frequencies in
ascending order and write them down in the table. Then derive the empirical distribution
function from them:
Original data x_i   Ordered data a_i   Absolute frequencies of the ordered values n_i   Relative frequencies of the ordered values p_i   Empirical distribution function F(a_i)
22                  19                 1                                                1/11                                             0
82                  22                 1                                                1/11                                             1/11
27                  27                 1                                                1/11                                             2/11
43                  34                 2                                                2/11                                             3/11
19                  35                 1                                                1/11                                             5/11
47                  41                 1                                                1/11                                             6/11
41                  42                 1                                                1/11                                             7/11
34                  43                 1                                                1/11                                             8/11
34                  47                 1                                                1/11                                             9/11
42                  82                 1                                                1/11                                             10/11
35
By its definition, the empirical distribution function F(x) equals 0 for each x ≤ 19; F(x)
equals 1/11 for all 19 < x ≤ 22; F(x) equals 1/11 + 1/11 for all 22 < x ≤ 27; and so on.
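A minimal sketch of this empirical distribution function follows. It uses the convention of these notes, where F(x) is the relative frequency of values strictly smaller than x; many textbooks and libraries count values smaller than or equal to x instead. The helper name `make_edf` is hypothetical.

```python
from bisect import bisect_left
from fractions import Fraction

def make_edf(values):
    """Build F(x) = share of values strictly smaller than x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_left returns how many ordered values are strictly below x.
        return Fraction(bisect_left(ordered, x), n)

    return F

ages = [22, 82, 27, 43, 19, 47, 41, 34, 34, 42, 35]
F = make_edf(ages)

for a in sorted(set(ages)):
    print(a, F(a))
```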
Means, mode and median (i.e. measures of location) represent imaginary centre of the
variable. However, we are also interested in the distribution of the individual values of the
variable around the centre (i.e. measures of variability).
The following three statistical characteristics allow description of the sample population
variability. Shorth and Inter-Quartile Range are classified as measures of variability.
Sample Variance $s^2$
The sample variance is given by:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
- The sample variance is the sum of all squared deviations from the mean divided by
one less than the sample size
In other words: if all variable values are the same, the sample has zero
variability
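$\forall a \in \mathbb{R}: \; s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},\; y_i = a + x_i \;\Rightarrow\; \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = s^2$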
In other words: if you add the same constant number to all variable values, the
sample variance doesn’t change
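$\forall b \in \mathbb{R}: \; s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},\; y_i = b x_i \;\Rightarrow\; \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = b^2 s^2$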
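The shift and scale properties of the sample variance can be checked numerically. The data and the constants a and b below are arbitrary illustrative values; `statistics.variance` divides by n - 1, matching the definition above.

```python
import math
from statistics import variance

x = [2, 4, 4, 7, 13]       # arbitrary illustrative values
a, b = 10, 3               # arbitrary constants

s2 = variance(x)                             # sample variance of x
s2_shifted = variance([a + v for v in x])    # y_i = a + x_i
s2_scaled = variance([b * v for v in x])     # y_i = b * x_i

print(s2, s2_shifted, s2_scaled)
print(math.isclose(s2_shifted, s2), math.isclose(s2_scaled, b ** 2 * s2))
```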
Standard Deviation s
Another disadvantage of using the sample variance and the standard deviation is that
the variability of variables measured in different units can't be compared. Which variable has bigger
variability - height or weight of an adult? To answer that, the Coefficient of Variation has to be
used.
Coefficient of Variation $V_x$
$V_x = \frac{s}{\bar{x}}$
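As an illustration of why the coefficient of variation allows comparison across units, the sketch below compares hypothetical heights and weights (the numbers are invented for the example) and checks that changing the unit of measurement leaves the coefficient unchanged.

```python
import math
from statistics import mean, stdev

def coefficient_of_variation(values):
    """V_x = s / x_bar, a unit-free measure of variability."""
    return stdev(values) / mean(values)

# Hypothetical measurements of five adults.
heights_cm = [165, 172, 180, 158, 175]
weights_kg = [58, 74, 90, 52, 81]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

# Converting centimetres to metres rescales s and x_bar by the same factor.
cv_height_m = coefficient_of_variation([h / 100 for h in heights_cm])

print(round(cv_height, 3), round(cv_weight, 3), round(cv_height_m, 3))
```

In this made-up sample the weights vary more, relative to their mean, than the heights do.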
Solution:
Based on the calculated means the new technology could be recommended because the
temperature it can withstand is 6 °C higher.
Sample Variance:
Standard Deviation:
New technology:
Sample variance:
Standard deviation:
Sample variance (standard deviation) for the new technology is significantly larger. What is
the possible reason? Look at the graphical representation of the collected data. Critical
temperatures are much more spread out.
[Figure: Critical temperature]
Now we are going to return to exploratory analysis as such. We mentioned outliers. So far we
know that outliers are variable values that are substantially different from the rest of the
values and that this affects the mean. How can these values be identified?
In the statistical practice we are going to come across a few methods that are
capable of identifying outliers. We'll mention three and go through them one by
one.
1. An outlier can be any value $x_i$ that lies more than 1.5 IQR below the lower quartile or above the upper
quartile.
2. An outlier can be any value $x_i$ for which the absolute value of the z-score is
greater than 3.
$z\text{-score}_i = \frac{x_i - \bar{x}}{s}$
$|z\text{-score}_i| > 3 \Rightarrow x_i \text{ is an outlier}$
3. An outlier can be any value $x_i$ for which the absolute value of the median-score
is greater than 3.
$\text{median-score}_i = \frac{x_i - x_{0.5}}{1.483 \cdot MAD}$
Any of the three rules can be used to identify outliers in real-life problems. The z-score
is "less strict" toward outliers than the median-score. This is because the z-score
is based on the mean and the standard deviation, which are strongly influenced by
outlying values. The median-score, in contrast, is based on the median and the
MAD, which are resistant to outliers.
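The three rules can be compared on the musicians' ages used earlier in these notes. The sketch below is illustrative: the function names are hypothetical, the quartiles 27 and 43 are passed in as computed in the worked solution, and the median-score rule assumes a MAD greater than zero.

```python
from statistics import mean, median, stdev

def iqr_outliers(values, q1, q3):
    """Rule 1: values more than 1.5 IQR below q1 or above q3."""
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

def z_score_outliers(values):
    """Rule 2: values whose |z-score| exceeds 3."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > 3]

def median_score_outliers(values):
    """Rule 3: values whose |median-score| exceeds 3 (requires MAD > 0)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) / (1.483 * mad) > 3]

ages = [22, 82, 27, 43, 19, 47, 41, 34, 34, 42, 35]
print(iqr_outliers(ages, q1=27, q3=43))
print(z_score_outliers(ages))
print(median_score_outliers(ages))
```

On this data the age 82 is flagged by the IQR rule and by the median-score but not by the z-score, because the value 82 itself inflates the mean and the standard deviation.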
When you identify a value as an outlier you need to decide its type unless it is
caused by:
If you know the outlier cause and make sure that it will not occur again, it can be
cleared from the process. In other cases you must consider carefully if by getting
rid of an outlier you won’t lose important information about events with low
frequencies.
Other characteristics describing a quantitative variable are skewness and kurtosis. Their
formulas are rather complex therefore specialized software is used for the calculation.
Skewness
Skewness interpretation:
[Figure: three histograms over values 1 to 7 illustrating skewness = 0, skewness > 0 and skewness < 0]
Kurtosis
Kurtosis interpretation:
= 0 ... kurtosis corresponds to the normal distribution
> 0 ... "peaked" distribution of the variable
< 0 ... "flat" distribution of the variable
[Figure: three histograms over values 1 to 7 illustrating kurtosis = 0, kurtosis > 0 and kurtosis < 0]
We have now defined all numerical characteristics of the quantitative variable. Next we are
going to take a look at how they can be interpreted graphically.
Box plot
A box plot is a way of summarizing a data set on an
interval scale. It is often used in exploratory data analysis.
[Figure: box plot with an outlier and Max1 marked]
Notice: A box plot construction begins by marking outliers and then other characteristics
(min1, max1, quartiles and shorth).
Stem and Leaf Plot
As we saw, simplicity is an advantage of the box plot. However, information about specific
values of the variable is missing. The missing numeric values would have to be specifically
marked down onto the graph. The Stem and leaf plot will make up for that limitation.
We have a variable representing the average monthly salary of bank employees in the Czech
Republic.
For example: the first row in the graph represents two values, (6.7 and 6.8) · 10^3 CZK, i.e.
6,700 CZK and 6,800 CZK; the sixth row represents two values too, (12.4 and 12.4) · 10^3
CZK, i.e. two employees have an average monthly pay of 12,400 CZK, etc.
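How values are split into stems and leaves can be sketched as follows. The function name `stem_and_leaf` and the parameter `leaf_unit` are hypothetical; the four salaries are the ones mentioned in the explanation above, with the leaf taken as the hundreds digit.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit):
    """Return {stem: [leaves]}; the leaf is the last kept digit of each value."""
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = round(v / leaf_unit)       # drop digits below the leaf unit
        stem, leaf = divmod(scaled, 10)     # last digit is the leaf
        rows[stem].append(leaf)
    return dict(rows)

salaries_czk = [6700, 6800, 12400, 12400]
plot = stem_and_leaf(salaries_czk, leaf_unit=100)

for stem, leaves in plot.items():
    # stem | leaves | row frequency
    print(f"{stem:>3} | {''.join(map(str, leaves))} | {len(leaves)}")
```

Choosing a different `leaf_unit` changes which place value becomes the leaf, which is exactly the choice of scale left to the observer.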
Finally, you need to keep in mind that there are different ways of constructing a stem and leaf
plot and you need to be aware of one particular problem. Nowhere is it said which place
values of the variable are important and which ones are not. This is left to the observer.
However there is a tip to follow. A long stem with short leaves and a short stem with long
leaves indicate incorrect choice of scale. Look at the picture.
0 | 66788999999 | 11
1 | 000022558 | 9
*10^4
Quiz
3. Which statistical characteristics can be contained in frequency table (for what type of
variable)?
Practical Exercises
Exercise 1: The following data represents car manufacturers' countries of origin. Analyze the data
(frequency, relative frequency, cumulative frequency and cumulative relative frequency, mode) and
interpret it in the graphical form (histogram, pie chart).
Exercise 2: The following data represents customers' waiting time (min) when dealing with the
customer service. Draw box plot and stem and leaf plot.
120 80 100 90
150 5 140 130
100 70 110 100
Exercise 3: A traffic survey was carried out to establish a vehicle count at an intersection. A student
data collector recorded the number of cars waiting in the queue each time the green light came on.
These are his/her outcomes:
3 1 5 3 2 3 5 7 1 2 8 8 1 6 1 8 5 5 8 5 4 7 2 5 6 3 4 2 8 4 4 5 5 4 3 3 4 9 6 2 1
5 2 3 5 3 5 7 2 5 8 2 4 2 4 3 5 64 6 9 3 2 1 2 6 3 5 3 5 3 7 6 3 75 6
Draw box plot, empirical distribution function and calculate the mean, standard deviation, shorth,
mode and inter-quartile range.