EA311 Lecture Note One

The document provides an overview of exploratory data analysis (EDA) in statistics, focusing on variable types, statistical characteristics, and graphical methods for presenting qualitative and quantitative data. It explains the importance of sampling in statistics and categorizes variables into qualitative (nominal and ordinal) and quantitative (discrete and continuous) types. Additionally, it discusses statistical characteristics such as frequency, relative frequency, mode, and various graphical representations like histograms and pie charts for better data analysis.

Uploaded by

tertese7

STATISTICS FOR ENGINEERS

1 EXPLORATORY DATA ANALYSIS

Learning Objectives
General Concepts of Exploratory (Preliminary) Statistics
Data Variable Types
Statistical Characteristics and Graphical Methods of Presenting Qualitative Variables
Statistical Characteristics and Graphical Methods of Presenting Quantitative Variables

Explanation

The original goal of statistics was to collect data about a population based on population samples.
By population we mean the group of all existing components available for observation during
statistical research. For example:

If statistical research is performed on the physical height of 15-year-old girls, the
population will be all girls currently aged 15.
Because the number of population members is usually high, the research is usually
based on a so-called sample examination, in which only part of the population is used. The
examined part of the population is called a sample. What is really important is to make a
selection that is as representative of the whole population as possible.
There are several ways to achieve this. To avoid systematically omitting some elements of the
population, a so-called random sample is used, in which each element of the population has the
same chance of being selected.
It goes without saying that a sample examination can never be as accurate as examining the
whole population. Why do we prefer it then?

1. To save time and minimize costs (especially for large populations).

2. To avoid damaging samples in destructive testing (some tests like examining
cholesterol in blood etc., lead to the permanent damage of examined elements).
3. Because the whole population is not available.

Now that you know that statistics can describe the whole population based on information
gathered from a population sample, we will move on to Exploratory Data Analysis (EDA).
The data we observe will be called variables and their values will be called variants. EDA is
often the first step in revealing information hidden in a large number of variables and their
variants.

Because the way of processing variables depends mostly on their type, we will now explore
how variables are divided into different categories. The division of variables is shown in the
following diagram.

Variable
    Qualitative (categorical, lexical, ...)
        general division:
            Nominal
            Ordinal
        division based on the number of variants:
            Alternative
            Plural
    Quantitative (numerical, ...)
        Discrete
            Finite
            Denumerable
        Continuous

Qualitative variable - its variants are expressed verbally. Qualitative variables split into two
general subgroups according to the relation between their values:

Nominal variable - has equivalent variants: it is impossible either to compare
them or to sort them (for example: sex, nationality, etc.)

Ordinal variable - forms a transition between qualitative and quantitative
variables: individual variants can be sorted and compared with one
another (for example: clothing sizes S, M, L, and XL)

The second way of dividing them is based on number of variants:

Alternative variable – has only two possible options (e.g. sex – male or
female, etc.)

Plural variable – has more than two possible options (e.g. education, name,
eye color, etc.)

Quantitative variable - is expressed numerically and is divided into:

Discrete variable - has a finite or denumerable number of variants

- Discrete finite variable - has a finite number of variants (e.g. math grades
1, 2, 3, 4, 5)

- Discrete denumerable variable - has a denumerable number of variants
(e.g. age (years), height (cm), weight (kg), etc.)

Continuous variable - can take any value from the real numbers or from some subset of
them (e.g. distance between cities, etc.)

Additional clues
Imagine that you have a large statistical group and you face the question of how best to
describe it. Numerical summaries of the values are used to "replace" the individual elements of
the group and they become the basic attributes of the group. These are what we call statistical
characteristics.
In the next chapters we are going to learn how to set up statistical characteristics for
various types of variables and how to represent larger statistical groups.

1.1 Statistical Characteristics of Qualitative Variables


We know that a qualitative variable has two basic types - nominal and ordinal.

1.1.1 Nominal Variables

A nominal variable has different but equivalent variants in one group. The number of these
variants is usually low, and that is why the first statistical characteristic we use to describe it
is its frequency.

Frequency $n_i$ (absolute frequency)

- is defined as the number of occurrences of the i-th variant of the qualitative variable

If a qualitative variable has k different variants (we denote their frequencies
as $n_1, n_2, \dots, n_k$) in a statistical group of n values, it must be true that:

$$n_1 + n_2 + \dots + n_k = \sum_{i=1}^{k} n_i = n$$

If we want to express the proportion of a variant's frequency in the total number of
occurrences, we use the relative frequency to describe the variable.

Relative frequency $p_i$

- is defined as:

$$p_i = \frac{n_i}{n}$$

alternatively:

$$p_i = \frac{n_i}{n} \cdot 100\,\%$$

(We use the second formula to express the relative frequency as a percentage.)
For relative frequencies it must be true that:

$$p_1 + p_2 + \dots + p_k = \sum_{i=1}^{k} p_i = 1$$

When qualitative variables are processed, it is good to arrange the frequencies and relative
frequencies in a so-called frequency table:

FREQUENCY TABLE
Values $x_i$    Absolute frequency $n_i$        Relative frequency $p_i$
$x_1$           $n_1$                           $p_1$
$x_2$           $n_2$                           $p_2$
...             ...                             ...
$x_k$           $n_k$                           $p_k$
Total           $\sum_{i=1}^{k} n_i = n$        $\sum_{i=1}^{k} p_i = 1$

The last characteristic of a nominal variable is the mode.

Mode

- is defined as the variant that occurs most frequently

The mode represents a typical element of the group. The mode cannot be determined uniquely if
several variants share the maximum frequency in the statistical group.

1.1.2 Graphical Methods of Presenting Qualitative Variables

Statistics often uses graphs for better analysis of variables. There are two types of graphs
for analyzing a nominal variable:

Histogram (bar chart)

Pie chart

A histogram is a standard graph in which the variants of the variable are represented on one axis
and their frequencies on the other axis. The individual frequencies are then displayed as
bars (boxes, vectors, squared logs, cones, etc.).

Examples:

[Figure: five bar-chart renderings titled "Classification"; horizontal axis: variants 1 to 4, vertical axis: frequency from 0 to 20.]

Pie chart represents relative frequencies of individual variants of a variable. Frequencies are
presented as proportions in a sector of a circle. When we change the angle of the circle, we
can get elliptical, three-dimensional effect.

[Figure: four pie-chart renderings titled "Classification", with legend entries 1 to 4 and sector labels 3, 6, 9 and 19.]

REMEMBER! Describing the pie chart is necessary. Marking individual sectors by relative
frequencies only without adding their absolute values is not sufficient.

Example: An opinion poll has been carried out about introducing high school fees. Its results
are shown in the following chart:

[Pie chart: YES 50%, NO 50%]

Aren't the results interesting? No matter how true they may be, it is recommended that the
chart be modified as follows:

[Pie chart: YES 1, NO 1]

What is the difference? From the second chart it is obvious that only two people were asked:
the first one said YES and the second one said NO. What can be learned from that? Make
charts in such a way that their interpretation is absolutely clear. If you are presented with a pie
chart without absolute frequencies marked on it, ask yourself whether this is due to
the author's ignorance or to deliberate bias.

Example and Solution

An observational study has been undertaken on the use of an intersection. The collected data
are in the table below. The data is made up of colours of cars that pass through the
intersection. Analyze the data and interpret the results in a graphical form.

red      blue     red      green
blue     red      red      white
green    green    blue     red

Solution:
From the table it is obvious that the collected colours are qualitative (lexical) variables, and
because there is no order or comparison between them, we can say they are nominal variables.
For better description we create a frequency table and we determine the mode. We are
going to present the colours of the passing vehicles by a histogram and a pie chart.

FREQUENCY TABLE
Colours of passing cars    Absolute frequency $n_i$    Relative frequency $p_i$
red                        5                           5/12 = 0.42
blue                       3                           3/12 = 0.25
white                      1                           1/12 = 0.08
green                      3                           3/12 = 0.25
Total                      12                          1.00

We observed 12 cars in total.

Mode = red (i.e. in our sample most cars were red)
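
The frequency table and the mode above can also be reproduced programmatically. The following is only a minimal sketch: the function names `frequency_table` and `modes` are illustrative choices, not part of the lecture, and the data are the twelve observed colours from the table above.

```python
from collections import Counter

# The twelve observed car colours from the example above.
colours = ["red", "blue", "red", "green",
           "blue", "red", "red", "white",
           "green", "green", "blue", "red"]


def frequency_table(values):
    """Return {variant: (absolute frequency n_i, relative frequency p_i)}."""
    counts = Counter(values)
    n = len(values)
    return {variant: (n_i, n_i / n) for variant, n_i in counts.items()}


def modes(values):
    """Return every variant that reaches the maximum frequency.

    A list with more than one element means the mode is not unique.
    """
    counts = Counter(values)
    top = max(counts.values())
    return [variant for variant, n_i in counts.items() if n_i == top]


table = frequency_table(colours)
for variant, (n_i, p_i) in table.items():
    print(f"{variant:<6} {n_i:>2}  {p_i:.2f}")
print("mode:", modes(colours))
```

Returning a list of modes rather than a single value makes the case of several variants sharing the maximum frequency visible instead of hiding it.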

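[Figure: "Colours of passing cars" shown as a histogram (variants red, blue, white, green on the horizontal axis) and as a pie chart with the sectors labelled by absolute frequencies: red 5, blue 3, white 1, green 3.]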

1.1.3 Ordinal Variable

Now we are going to have a look at describing ordinal variables. An ordinal variable (just
like a nominal variable) has various verbal variants in the group, but these variants can be
sorted, i.e. we can tell which one is "smaller" and which one is "bigger".
For describing ordinal variables we use the same statistical characteristics and graphs as
for nominal variables (frequency, relative frequency and mode, displayed by a histogram or pie
chart) plus two other characteristics (cumulative frequency and cumulative relative frequency),
which include information about how the variants are sorted.

Cumulative frequency of the i-th variant $m_i$

- is the number of values of the variable that take a variant less than or equal to
the i-th variant

E.g. we have a variable called "grade from Statistics" that has the following variants:
"1", "2", "3" or "4" (where 1 is the best and 4 the worst grade). Then, for example, the
cumulative frequency for variant "3" is equal to the number of students who got grade
"3" or better.

If the variants are sorted by their "size" ($x_1 < x_2 < \dots < x_k$), then the following must be true:

$$m_i = \sum_{j=1}^{i} n_j$$

So it is self-evident that the cumulative frequency of the k-th ("the highest") variant is equal to
the size of the group n:

$$m_k = n$$

The second special characteristic of an ordinal variable is the cumulative relative frequency.

Cumulative relative frequency of the i-th variant $F_i$

- is the proportion of values in the group that take the i-th or a lower variant. It is
expressed by the following formula:

$$F_i = \sum_{j=1}^{i} p_j$$

This is nothing other than the relative expression of the cumulative frequency:

$$F_i = \frac{m_i}{n}$$

Just as in the case of nominal variables we can present statistical characteristics using
frequency table for ordinal variables. In comparison to the frequency table of nominal
variables it also contains values of cumulative and cumulative relative frequencies.

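FREQUENCY TABLE
Values $x_i$   Absolute frequency $n_i$   Cumulative frequency $m_i$    Relative frequency $p_i$   Relative cumulative frequency $F_i$
$x_1$          $n_1$                      $m_1 = n_1$                   $p_1$                      $F_1 = p_1$
$x_2$          $n_2$                      $m_2 = m_1 + n_2$             $p_2$                      $F_2 = p_1 + p_2 = F_1 + p_2$
...            ...                        ...                           ...                        ...
$x_k$          $n_k$                      $m_k = m_{k-1} + n_k = n$     $p_k$                      $F_k = F_{k-1} + p_k = 1$
Total          $\sum_{i=1}^{k} n_i = n$   -----                         $\sum_{i=1}^{k} p_i = 1$   -----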

1.1.4 Graphical Presentation of Ordinal Variables

We briefly mentioned the histogram and the pie chart as good ways of presenting an ordinal
variable. But these graphs do not reflect the sorting of the variants. To achieve that, we need to
use the frequency polygon, the ogive (cumulative frequency polygon) and the Pareto graph.

Frequency Polygon
- is a line chart. The frequency is placed along the vertical axis and the individual variants of
the variable are placed along the horizontal axis (sorted in ascending order from the "lowest"
to the "highest"). The plotted points are connected by line segments.

[Figure: Frequency polygon for the evaluation grades; horizontal axis: variant (1 to 4), vertical axis: frequency (0 to 20).]

Ogive (Cumulative Frequency Polygon)


- is a frequency polygon of the cumulative frequency or the relative cumulative frequency.
The vertical axis is the cumulative frequency or relative cumulative frequency. The horizontal
axis represents variants. The graph always starts at zero, at the lowest variant, and ends up at
the total frequency (for a cumulative frequency) or 1.00 (for a relative cumulative frequency).

[Figure: Ogive for the evaluation grades; horizontal axis: variant (1 to 4), vertical axis: cumulative frequency (0 to 40).]

Pareto Graph
- is a bar chart for a qualitative variable with the bars arranged by frequency
- variants are on the horizontal axis and are sorted from the "highest" importance to the "lowest"

Notice how the growth of the cumulative frequency slows down: each increase is smaller than the
previous one because the frequencies of the variants decrease.
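
The ordering behind a Pareto graph can be sketched in a few lines. This is only an illustration: the function name `pareto_rows` is hypothetical, and the counts are borrowed from the car-colour example earlier in this chapter rather than from technical data.

```python
from itertools import accumulate

# Absolute frequencies borrowed from the car-colour example (illustrative only).
counts = {"red": 5, "blue": 3, "white": 1, "green": 3}


def pareto_rows(counts):
    """Sort variants by descending frequency and attach cumulative values.

    Returns a list of tuples:
    (variant, frequency, cumulative frequency, cumulative relative frequency).
    """
    # sorted() is stable, so variants with equal frequency keep their input order.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    cumulative = accumulate(n_i for _, n_i in ordered)
    return [(variant, n_i, m_i, m_i / total)
            for (variant, n_i), m_i in zip(ordered, cumulative)]


rows = pareto_rows(counts)
for variant, n_i, m_i, share in rows:
    print(f"{variant:<6} {n_i:>2} {m_i:>3} {share:.2f}")
```

The bars of the Pareto graph follow the second column, while the third or fourth column gives the cumulative line that flattens out towards the right.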

Example and Solution

The following data represent the sizes of t-shirts sold by a clothing retailer:

S, M, L, S, M, L, XL, XL, M, XL, XL, L, M, S, M, L, L, XL, XL, XL, L, M

a) Analyze the data and interpret the results in a graphical form.

b) Determine what percentage of people bought t-shirts of size L at most.

Solution:
a) The variable is qualitative (lexical) and t-shirt sizes can be sorted, therefore it is an
ordinal variable. For its description you use frequency table for the ordinal variable and you
determine the mode.

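FREQUENCY TABLE
T-shirt size   Absolute frequency $n_i$   Cumulative frequency $m_i$   Relative frequency $p_i$   Relative cumulative frequency $F_i$
S              3                          3                            3/22 = 0.14                0.14
M              6                          9                            6/22 = 0.27                0.41
L              6                          15                           6/22 = 0.27                0.68
XL             7                          22                           7/22 = 0.32                1.00
Total          22                         -----                        1.00                       -----

Mode = XL (most people bought t-shirts of size XL)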

For graphical representation use histogram, pie graph and cumulative frequency polygon (you
don't create Pareto graph because it is mostly used for technical data).

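Graphical output:

[Figure: Histogram "Sold t-shirt"; horizontal axis: variant (S, M, L, XL), vertical axis: frequency.]
[Figure: Pie chart "Sold t-shirt"; S 14%, M 27%, L 27%, XL 32%.]
[Figure: Cumulative frequency polygon "Sold t-shirt"; horizontal axis: variant (S, M, L, XL), vertical axis: cumulative frequency (0 to 25).]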

Total sales were 22 t-shirts.

b) You get the answer from the value of the relative cumulative frequency for variant L. You
see that 68% of people bought t-shirts of L size and smaller.
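
The characteristics used in this example can be recomputed directly from the raw data. The sketch below assumes the ordering S < M < L < XL; the function name `ordinal_table` is an illustrative choice.

```python
from collections import Counter
from itertools import accumulate

# Sold t-shirt sizes from the example above.
sizes = ("S, M, L, S, M, L, XL, XL, M, XL, XL, L, M, S, M, L, L, "
         "XL, XL, XL, L, M").split(", ")
order = ["S", "M", "L", "XL"]  # variants sorted from "smallest" to "biggest"


def ordinal_table(values, order):
    """Return rows (variant, n_i, m_i, p_i, F_i) in the given variant order."""
    counts = Counter(values)
    n = len(values)
    freqs = [counts[variant] for variant in order]
    cumulative = list(accumulate(freqs))  # m_i = n_1 + ... + n_i
    return [(variant, n_i, m_i, n_i / n, m_i / n)
            for variant, n_i, m_i in zip(order, freqs, cumulative)]


rows = ordinal_table(sizes, order)
for variant, n_i, m_i, p_i, F_i in rows:
    print(f"{variant:<3} {n_i:>2} {m_i:>3} {p_i:.2f} {F_i:.2f}")
```

The answer to task b) is the value of F_i in the row for L.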

1.2 Statistical Characteristics of Quantitative Variables


To describe quantitative variable, most of the statistical characteristics for ordinal variable
description can be used (frequency, relative frequency, cumulative frequency and cumulative
relative frequency). Apart from those, there are two additional ones:

Measures of location - indicate a typical position of the variable values

and

Measures of variability - indicate the variability (spread) of the values around
their typical position

1.2.1 Measures of Location and Variability

The most common measure of location is the mean of the variable. The mean represents the average
or typical value of the sample population. The best-known mean of a quantitative variable is:

Arithmetical mean $\bar{x}$

It is defined by the following formula:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where: $x_i$ ... are the values of the variable
$n$ ... is the size of the sample population (number of values of the variable)

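Properties of the arithmetical mean:

1. $$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

- the sum of all deviations of the variable values from their arithmetical mean is equal to
zero, which means that the arithmetical mean compensates for mistakes caused by random
errors.

2. $$\forall a \in \mathbb{R}: \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \;\Rightarrow\; \frac{\sum_{i=1}^{n} (a + x_i)}{n} = a + \bar{x}$$

- if the same number is added to all the values of the variable, the arithmetical
mean increases by the same number

3. $$\forall b \in \mathbb{R}: \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \;\Rightarrow\; \frac{\sum_{i=1}^{n} b x_i}{n} = b \bar{x}$$

- if all the variable values are multiplied by the same number, the arithmetical
mean is multiplied by that number as well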
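
The three properties can be checked numerically on a small data set. The sample `x` and the constants `a` and `b` below are arbitrary values chosen only for this illustration.

```python
import math
from statistics import mean

x = [2.0, 4.0, 6.0, 8.0]   # hypothetical sample, not taken from the lecture
a, b = 10.0, 3.0           # arbitrary constants

x_bar = mean(x)

# Property 1: the deviations from the mean sum to zero.
sum_of_deviations = sum(x_i - x_bar for x_i in x)

# Property 2: adding a constant shifts the mean by that constant.
shifted_mean = mean([a + x_i for x_i in x])

# Property 3: multiplying by a constant multiplies the mean by that constant.
scaled_mean = mean([b * x_i for x_i in x])

print(math.isclose(sum_of_deviations, 0.0, abs_tol=1e-12))
print(math.isclose(shifted_mean, a + x_bar))
print(math.isclose(scaled_mean, b * x_bar))
```

Comparisons use `math.isclose` because floating-point sums are generally not exact.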

The arithmetical mean is not always the best way to calculate the mean of the sample population.
For example, if we work with a variable representing relative changes (cost indexes, etc.), we
use the so-called geometrical mean. To calculate the mean when the variable has the form of a
ratio per unit (a rate), the harmonical mean is often used.

Because the mean uses all the values of the variable, it carries the maximum
information about the sample population. On the other hand, it is very sensitive to so-called
outlying observations (outliers). Outliers are values that are substantially different from the
rest of the values in a group, and they can distort the mean to such a degree that it no longer
represents the sample population. We are going to have a closer look at outliers later.

Measures of location that are less dependent on the outlying observations are:

Mode $\hat{x}$

In the case of the mode we differentiate between discrete and continuous quantitative
variables. For a discrete variable we define the mode as the most frequent value of the
variable (similarly to the qualitative variable).

In the case of a continuous variable we think of the mode as the value around
which most variable values are concentrated.
To assess this value we use the shorth. The shorth is the shortest interval containing at
least 50% of the variable values. For a sample of size $n = 2k$, $k \in \mathbb{N}$ (an even
number of values), k values lie within the shorth, which is n/2 (50%) of the variable values. For
a sample of size $n = 2k + 1$, $k \in \mathbb{N}$ (an odd number of values), k + 1
values lie within the shorth, which is n/2 + 1/2 values (slightly more than 50%).

Then the mode $\hat{x}$ can be defined as the centre of the shorth.

From what has been said so far it is clear that the shorth length (top boundary -
bottom boundary) is unique but its location may not be.

If the mode can be determined unambiguously, we talk about a unimodal variable.
When a variable has two modes, we call it bimodal. When there are two or more modes
in a sample, it usually indicates heterogeneity of the variable values. This heterogeneity
can be removed by dividing the sample into several subsamples (for example, a bimodal
variable for a person's height can be divided by sex into two unimodal variables: women's
height and men's height).

Example and Solution


The following data shows ages of musicians who performed at a concert. Age is a continuous
variable. Calculate Mean, Shorth and Mode for the variable.

22 82 27 43 19 47 41 34 34 42 35

Solution:

a) Mean:

In this case we use the arithmetical mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{22 + 82 + 27 + 43 + 19 + 47 + 41 + 34 + 34 + 42 + 35}{11} \approx 38.7 \text{ years}$$

The musicians' average age is 38.7 years.

b) Shorth:

Our sample population has 11 values. 11 is an odd number. 50% of 11 is 5.5 and the
nearest higher natural number is 6; equivalently, n/2 + 1/2 = 11/2 + 1/2 = 12/2 = 6. That
means that 6 values will lie in the shorth.

And what are the next steps?

You sort the values of the variable.

You determine the sizes of all intervals containing 6 consecutive elements $x_i \le x_{i+1} \le \dots \le x_{i+5}$.

The shortest of these intervals is the shorth (size of the interval = $x_{i+5} - x_i$).

Original data    Sorted data    Size of intervals (having 6 elements)
22               19             16 (= 35 - 19)
82               22             19 (= 41 - 22)
27               27             15 (= 42 - 27)
43               34             9 (= 43 - 34)
19               34             13 (= 47 - 34)
47               35             47 (= 82 - 35)
41               41
34               42
34               43
42               47
35               82

From the table you can see that the shortest interval has the size of 9. There is only one
interval of this size, namely $\langle 34; 43 \rangle$.

Shorth = $\langle 34; 43 \rangle$, which means that about half of the musicians (6 of 11) are
between 34 and 43 years of age.

c) Mode:

The mode is defined as the centre of the shorth:

$$\hat{x} = \frac{34 + 43}{2} = 38.5$$

Mode = 38.5 years, which means that the typical age of the musicians who performed at
the concert was 38.5 years.
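
The shorth procedure from this example can be sketched as follows. The function names `shorth` and `mode_from_shorth` are illustrative, and the choice of the first (lowest) interval when several intervals share the minimum size is an assumption of this sketch, since the text only notes that the location of the shorth may not be unique.

```python
def shorth(values):
    """Return (lower, upper) bounds of the shortest interval holding about half the data.

    The interval holds n/2 values for even n and (n + 1)/2 values for odd n.
    Ties are resolved in favour of the lowest interval (assumption of this sketch).
    """
    data = sorted(values)
    n = len(data)
    h = (n + 1) // 2  # number of values inside the shorth
    # Compare the sizes of all windows of h consecutive sorted values.
    size, start = min((data[i + h - 1] - data[i], i) for i in range(n - h + 1))
    return data[start], data[start + h - 1]


def mode_from_shorth(values):
    """Mode of a continuous variable, taken as the centre of the shorth."""
    lower, upper = shorth(values)
    return (lower + upper) / 2


ages = [22, 82, 27, 43, 19, 47, 41, 34, 34, 42, 35]
print(shorth(ages))            # (34, 43)
print(mode_from_shorth(ages))  # 38.5
```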

Among other characteristics describing quantitative variables are quantiles. They are used
for a more detailed illustration of the distribution of the variable values within the range of the
sample population.

Quantiles
Quantiles describe the location of individual values (within the range of the variable) and,
like the mode, are resistant to outlying observations. Generally, a quantile is
defined as a value that divides the sample into two parts: the first part contains the values
that are smaller than the given quantile and the second part contains the values that are larger
than or equal to it. The data must be sorted in ascending order, from the lowest to the
highest value.

The quantile of variable x that separates the 100p % smallest values from the rest of the sample
(i.e. from the remaining 100(1-p) % of values) is called the 100p % quantile and is denoted $x_p$.

In real life you most often come across the following quantiles:

Quartiles
In the case of a four-part division, the values of the variable corresponding to 25%, 50%
and 75% of the total distribution are called quartiles.

Lower quartile $x_{0.25}$ = 25% quantile - divides a sample of data in such a way that 25%
of the values are smaller than the quartile, i.e. 75% are bigger (or equal)

Median $x_{0.5}$ = 50% quantile - divides a sample of data in such a way that 50% of the
values are smaller than the median and 50% of the values are bigger (or equal)

Upper quartile $x_{0.75}$ = 75% quantile - divides a sample of data in such a way that 75%
of the values are smaller than the quartile, i.e. 25% are bigger (or equal)

Example:

Data 6 47 49 15 43 41 7 39 43 41 36
Data in ascending order 6 7 15 36 39 41 41 43 43 47 49
Median 41
Upper quartile 43
Lower quartile 15

The difference between the 3rd and the 1st quartile is called the Inter-Quartile Range
(IQR):

$$IQR = x_{0.75} - x_{0.25}$$

Example:
Data              2 3 4 5 6 6 6 7 7 8 9
Upper quartile    7
Lower quartile    4
IQR               7 - 4 = 3

Deciles – x0.1; x0.2; ... ; x0.9

The deciles divide the data into 10 equal regions.

Percentiles – x0.01; x0.02; …; x0.99

The percentiles divide the data into 100 equal regions.

For example, the 80th percentile is the number that has 80% of values below it and
20% above it. Rather than counting 80% from the bottom, count 20% from the top.

Note: The 50th percentile is the median.

Minimum $x_{min}$ and Maximum $x_{max}$

$x_{min} = x_0$, i.e. 0% of values are less than the minimum

$x_{max} = x_1$, i.e. 100% of values are less than or equal to the maximum

There is the following process to determine quantiles:

1. The sample population needs to be ordered by size

2. The individual values are sequenced so that the smallest value is at the first place
and the highest value is at n-th place (n is the total number of values)

3. The 100p% quantile is equal to the variable value with the sequence number $z_p$, where:

$$z_p = n p + 0.5$$

$z_p$ has to be rounded to an integer!

REMEMBER!
For a data set with an even number of values the median is not uniquely
defined. Any number between the two middle values (including these values) can be
accepted as the median. Most often the midpoint of the two middle values is used.
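
The rank rule $z_p = np + 0.5$ can be sketched in code. The function name `rank_quantile` is illustrative. Two details are assumptions of this sketch because the text does not fix them: a rank ending in .5 is rounded up, and the rank is clamped to the range 1..n so that p = 0 and p = 1 return the minimum and the maximum.

```python
import math


def rank_quantile(values, p):
    """Return the 100p% quantile using the sequence number z_p = n*p + 0.5.

    Assumptions of this sketch: z_p is rounded half up, and the result is
    clamped to 1..n so that p = 0 gives the minimum and p = 1 the maximum.
    """
    data = sorted(values)
    n = len(data)
    z = n * p + 0.5
    rank = math.floor(z + 0.5)      # round half up to the nearest integer
    rank = min(max(rank, 1), n)     # keep the sequence number inside 1..n
    return data[rank - 1]           # sequence numbers start at 1


data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]
print(rank_quantile(data, 0.25))  # lower quartile: 15
print(rank_quantile(data, 0.5))   # median: 41
print(rank_quantile(data, 0.75))  # upper quartile: 43
```

With an even number of values this sketch returns the upper of the two middle values as the median; if the midpoint is preferred, the two middle values have to be averaged separately.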

We are now going to discuss the relation between quantiles and the cumulative
relative frequency. The value p denotes the cumulative relative frequency of the quantile $x_p$,
i.e. the relative frequency of those variable values that are smaller than the quantile $x_p$.
Quantile and cumulative relative frequency are inverse concepts.

The graphical or tabular representation of the ordered variable and the corresponding cumulative
frequencies is known as the distribution function of the cumulative frequency or the
empirical distribution function.

Empirical Distribution Function F(x) for the Quantitative Variable

We put the sample population in ascending order ($x_1 < x_2 < \dots < x_n$) and we denote
by $p(x_i)$ the relative frequency of the value $x_i$. For the empirical distribution function F(x)
it must then be true that:

$$F(x) = \begin{cases} 0 & \text{for } x \le x_1 \\ \sum_{i=1}^{j} p(x_i) & \text{for } x_j < x \le x_{j+1},\ 1 \le j \le n-1 \\ 1 & \text{for } x_n < x \end{cases}$$

The empirical distribution function is a monotonic, non-decreasing function and it
is continuous from the left.

$$p(x_i) = \lim_{x \to x_i^{+}} F(x) - F(x_i)$$

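[Figure: step graph of the empirical distribution function F(x); horizontal axis: x with the points x1, x2, x3, ..., xn-1, xn; vertical axis from 0 to 1; jump heights labelled p(x2), ..., p(xn).]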

MAD

MAD is short for Median Absolute Deviation from the median.

MAD is determined as follows:

1. Order the sample population by size

2. Determine the median of the sample population

3. For each value determine the absolute value of its deviation from the median

4. Put the absolute deviations from the median in ascending order

5. Determine the median of the absolute deviations from the median, i.e. the MAD

Example and Solution


There is the following data set: 22, 82, 27, 43, 19, 47, 41, 34, 34, 42, 35 (the data from the
previous example).
Determine:
a) All quartiles
b) Inter-Quartile Range
c) MAD
d) Draw the Empirical Distribution Function

Solution:

a) You need to determine the Lower Quartile $x_{0.25}$, the Median $x_{0.5}$ and the Upper Quartile $x_{0.75}$.
First, you order the data by size and assign a sequence number to each value.

Original data    Ordered data    Sequence
22               19              1
82               22              2
27               27              3
43               34              4
19               34              5
47               35              6
41               41              7
34               42              8
34               43              9
42               47              10
35               82              11

Now you can divide the data set into quartiles and mark their variable values accordingly:

Lower Quartile $x_{0.25}$: p = 0.25; n = 11; $z_p$ = 11 × 0.25 + 0.5 = 3.25, rounded to 3; $x_{0.25}$ = 27

i.e. 25% of the musicians are under 27 (75% of them are 27 years old or older).

Median $x_{0.5}$: p = 0.5; n = 11; $z_p$ = 11 × 0.5 + 0.5 = 6; $x_{0.5}$ = 35

i.e. half of the musicians are under 35 (50% of them are 35 years old or older).

Upper Quartile $x_{0.75}$: p = 0.75; n = 11; $z_p$ = 11 × 0.75 + 0.5 = 8.75, rounded to 9; $x_{0.75}$ = 43

i.e. 75% of the musicians are under 43 (25% of them are 43 years old or older).

b) Inter-Quartile Range IQR:

IQR = $x_{0.75} - x_{0.25}$ = 43 - 27 = 16

c) MAD

If you want to determine this characteristic you must follow its definition (the median of
absolute deviations from the median).

$x_{0.5}$ = 35

Original data $x_i$    Ordered data $y_i$    Absolute deviations $|y_i - x_{0.5}|$    Ordered absolute deviations $M_i$
22                     19                    16 = |19 - 35|                           0
82                     22                    13 = |22 - 35|                           1
27                     27                    8 = |27 - 35|                            1
43                     34                    1 = |34 - 35|                            6
19                     34                    1 = |34 - 35|                            7
47                     35                    0 = |35 - 35|                            8
41                     41                    6 = |41 - 35|                            8
34                     42                    7 = |42 - 35|                            12
34                     43                    8 = |43 - 35|                            13
42                     47                    12 = |47 - 35|                           16
35                     82                    47 = |82 - 35|                           47

MAD = $M_{0.5}$
p = 0.5; n = 11; $z_p$ = 11 × 0.5 + 0.5 = 6; $M_{0.5}$ = 8

(MAD is the median absolute deviation from the median, i.e. the 6th value of the ordered absolute
deviations from the median)

MAD = 8.
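
The five MAD steps can be sketched as follows. The helper `rank_median` applies the sequence-number rule $z_p = np + 0.5$ with p = 0.5; for an even number of values it picks the upper of the two middle values, which is an assumption of this sketch. Both function names are illustrative.

```python
import math


def rank_median(values):
    """Median taken as the value with sequence number z = n*0.5 + 0.5.

    For even n the rank ends in .5 and is rounded up here (assumption of
    this sketch), i.e. the upper of the two middle values is returned.
    """
    data = sorted(values)                        # order by size
    rank = math.floor(len(data) * 0.5 + 0.5 + 0.5)
    return data[rank - 1]


def mad(values):
    """Median Absolute Deviation from the median."""
    centre = rank_median(values)                        # steps 1 and 2
    deviations = [abs(v - centre) for v in values]      # step 3
    return rank_median(deviations)                      # steps 4 and 5


ages = [22, 82, 27, 43, 19, 47, 41, 34, 34, 42, 35]
print(rank_median(ages))  # 35
print(mad(ages))          # 8
```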

d) The last task was to draw the Empirical Distribution Function. Here is its definition:

$$F(x) = \begin{cases} 0 & \text{for } x \le x_1 \\ \sum_{i=1}^{j} p(x_i) & \text{for } x_j < x \le x_{j+1},\ 1 \le j \le n-1 \\ 1 & \text{for } x_n < x \end{cases}$$

- Arrange the variable values as well as their frequencies and relative frequencies in
ascending order and write them down in the table. Then derive the empirical distribution
function from them:

Original data $x_i$    Ordered values $a_i$    Absolute frequency $n_i$    Relative frequency $p_i$    Empirical distribution function $F(a_i)$
22                     19                      1                           1/11                        0
82                     22                      1                           1/11                        1/11
27                     27                      1                           1/11                        2/11
43                     34                      2                           2/11                        3/11
19                     35                      1                           1/11                        5/11
47                     41                      1                           1/11                        6/11
41                     42                      1                           1/11                        7/11
34                     43                      1                           1/11                        8/11
34                     47                      1                           1/11                        9/11
42                     82                      1                           1/11                        10/11
35

By its definition, the empirical distribution function F(x) equals 0 for each x ≤ 19; F(x)
equals 1/11 for all 19 < x ≤ 22; F(x) equals 1/11 + 1/11 = 2/11 for all 22 < x ≤ 27; and so on.

x       $(-\infty; 19\rangle$   $(19; 22\rangle$   $(22; 27\rangle$   $(27; 34\rangle$   $(34; 35\rangle$
F(x)    0                       1/11               2/11               3/11               5/11

x       $(35; 41\rangle$   $(41; 42\rangle$   $(42; 43\rangle$   $(43; 47\rangle$   $(47; 82\rangle$   $(82; \infty)$
F(x)    6/11               7/11               8/11               9/11               10/11              11/11
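
The table of F(x) can be generated from the data. The sketch below follows the definition used in this text, where F(x) is the relative frequency of values strictly smaller than x; the function name `edf` is illustrative, and `Fraction` is used only to keep the results in the same form (elevenths) as the table.

```python
from fractions import Fraction


def edf(values):
    """Return the empirical distribution function F of the sample.

    F(x) is the proportion of sample values strictly smaller than x,
    which matches the left-continuous definition used in the text.
    """
    data = sorted(values)
    n = len(data)

    def F(x):
        return Fraction(sum(1 for v in data if v < x), n)

    return F


ages = [22, 82, 27, 43, 19, 47, 41, 34, 34, 42, 35]
F = edf(ages)
for a in sorted(set(ages)):
    print(a, F(a))
```

Evaluating F at any point above the maximum returns 1.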

Means, mode and median (i.e. measures of location) represent an imaginary centre of the
variable. However, we are also interested in the distribution of the individual values of the
variable around the centre (i.e. in measures of variability).
The following three statistical characteristics (sample variance, standard deviation and
coefficient of variation) allow us to describe the variability of the sample population. The
shorth and the Inter-Quartile Range are also classified as measures of variability.

[Figure: Empirical distribution function; horizontal axis from -20 to 120, vertical axis from 0.0 to 1.2.]

Sample Variance $s^2$

- is the most common measure of variability

The sample variance is given by:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

- The sample variance is the sum of all squared deviations from the mean divided by
one less than the sample size

General properties of the sample variance are for example:

The sample variance of a constant is zero

In other words: if all variable values are the same, the sample has zero
dispersion

$$\forall a \in \mathbb{R}: \quad s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},\ y_i = a + x_i \;\Rightarrow\; s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = s_x^2$$

In other words: if you add the same constant number to all variable values, the
sample variance doesn't change

$$\forall b \in \mathbb{R}: \quad s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},\ y_i = b x_i \;\Rightarrow\; s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = b^2 s_x^2$$

In other words: if you multiply all variable values by an arbitrary constant
number (b), the sample variance is multiplied by the square of this constant number
($b^2$)

A disadvantage of using the sample variance as a measure of variability is that it is expressed
in squared units of the variable. For example: if the variable represents cash denominated in
EUR, then the sample variance of this variable will be in EUR². That is why we use another
measure of variability called the standard deviation.

Standard Deviation s

- is calculated as the square root of the sample variance: $s = \sqrt{s^2}$

A disadvantage of both the sample variance and the standard deviation is that the
variability of variables measured in different units cannot be compared. Which variable has
bigger variability: the height or the weight of an adult? To answer that, the Coefficient of
Variation has to be used.

Coefficient of Variation $V_x$

- it represents a relative measure of the variability of the variable x and is often
expressed as a percentage
- it is the ratio of the sample standard deviation to the sample mean:

$$V_x = \frac{s}{\bar{x}}$$

Example and Solution


A manufacturer of table glass has developed a less expensive technology for improving
fire-resistant glass. 10 glass table sheets were selected for testing. Half of them were treated by
the new technology while the other half was used for comparison.
Both lots were tested by fire until they cracked. These are the results:

Critical temperature (glass cracked) [°C]

Old technology $x_i$    New technology $y_i$
475                     485
436                     390
495                     520
483                     460
426                     488

Compare both technologies by means of the basic characteristics of exploratory analysis
(mean, variance, etc.).

Solution:

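- First you compare both technologies by the mean:

Mean for the old technology:

$$\bar{x} = \frac{475 + 436 + 495 + 483 + 426}{5} = 463 \text{ °C}$$

Mean for the new technology:

$$\bar{y} = \frac{485 + 390 + 520 + 460 + 488}{5} = 468.6 \text{ °C}$$

Based on the calculated means, the new technology could be recommended because the
temperature it can withstand is about 6 °C higher.

- Now you determine the measures of variability:

Old technology:

Sample variance: $s_x^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1} = \frac{3666}{4} = 916.5$ °C²

Standard deviation: $s_x = \sqrt{916.5} \approx 30.3$ °C

New technology:

Sample variance: $s_y^2 = \frac{\sum_{i=1}^{5} (y_i - \bar{y})^2}{5 - 1} = \frac{9539.2}{4} = 2384.8$ °C²

Standard deviation: $s_y = \sqrt{2384.8} \approx 48.8$ °C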

The sample variance (standard deviation) for the new technology is considerably larger. What is
the possible reason? Look at the graphical representation of the collected data. The critical
temperatures are much more spread out, which means this technology is not fully under control
and its use cannot guarantee higher production quality. In this case the critical temperature can
be either much higher or much lower. For that reason it is recommended that the new technology
be subjected to additional research. These conclusions are based only on exploratory analysis.
Statistics provides us with more exact methods for analysing similar problems (hypothesis
testing).

[Figure: Critical temperature for the old and the new technology; horizontal axis: Technology (Old, New), vertical axis from 300 to 600.]
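
The characteristics compared in this example can be computed with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 denominator, the same as the sample variance defined above. The helper name `summarize` is an illustrative choice.

```python
from statistics import mean, stdev, variance

# Critical temperatures in °C from the table above.
old = [475, 436, 495, 483, 426]
new = [485, 390, 520, 460, 488]


def summarize(values):
    """Return the mean, sample variance, standard deviation and coefficient of variation."""
    x_bar = mean(values)
    s2 = variance(values)   # sample variance (divides by n - 1)
    s = stdev(values)       # square root of the sample variance
    return {"mean": x_bar, "variance": s2, "stdev": s, "cv": s / x_bar}


summary_old = summarize(old)
summary_new = summarize(new)
for name, summary in (("old", summary_old), ("new", summary_new)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

The coefficient of variation is included because it allows the spread of the two lots to be compared relative to their means.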

Now we are going to return to exploratory analysis as such. We mentioned outliers. So far we
know that outliers are variable values that are substantially different from the rest of the
values and that this has an impact on the mean. How can these values be identified?

Identification of the Outliers

In the statistical practice we are going to come across a few methods that are
capable of identifying outliers. We'll mention three and go through them one by
one.

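1. A value x_i can be regarded as an outlier if it lies more than 1.5 IQR below the lower quartile or above the upper quartile:

$$\left(x_i < x_{0.25} - 1.5\,IQR\right) \lor \left(x_i > x_{0.75} + 1.5\,IQR\right) \Rightarrow x_i \text{ is an outlier}$$

2. A value x_i can be regarded as an outlier if the absolute value of its z-score is greater than 3:

$$z\text{-score}_i = \frac{x_i - \bar{x}}{s}$$

$$\left|z\text{-score}_i\right| > 3 \Rightarrow x_i \text{ is an outlier}$$

3. A value x_i can be regarded as an outlier if the absolute value of its median-score is greater than 3:

$$\text{median-score}_i = \frac{x_i - x_{0.5}}{1.483 \cdot MAD}$$

$$\left|\text{median-score}_i\right| > 3 \Rightarrow x_i \text{ is an outlier}$$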

Any of the three rules can be used to identify outliers in real-life problems. The z-score is "less strict" toward outliers than the median-score. This is because the z-score is based on the mean and the standard deviation, which are themselves strongly influenced by outlying values. The median-score, in contrast, is based on the median and MAD, which are resistant to outliers.
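The three rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the notes' own implementation: the quartiles come from `statistics.quantiles` with `method="inclusive"` (other quantile definitions may move the fences slightly), MAD is taken to be the median of the absolute deviations from the median, and the function names are hypothetical. The data are the waiting times from Exercise 2.

```python
from statistics import mean, median, quantiles, stdev


def iqr_outliers(values):
    """Rule 1: values more than 1.5 IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")  # assumed quantile definition
    iqr = q3 - q1
    return [x for x in values if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]


def z_score_outliers(values):
    """Rule 2: values whose |z-score| exceeds 3."""
    m, s = mean(values), stdev(values)
    return [x for x in values if abs((x - m) / s) > 3]


def median_score_outliers(values):
    """Rule 3: values whose |median-score| exceeds 3.

    Assumes MAD = median(|x_i - median|); the rule is undefined if MAD is 0.
    """
    med = median(values)
    mad = median([abs(x - med) for x in values])
    return [x for x in values if abs((x - med) / (1.483 * mad)) > 3]


waiting_times = [120, 80, 100, 90, 150, 5, 140, 130, 100, 70, 110, 100]
print("IQR rule:     ", iqr_outliers(waiting_times))
print("z-score:      ", z_score_outliers(waiting_times))
print("median-score: ", median_score_outliers(waiting_times))
```

On this data set the IQR rule and the median-score both flag the value 5, while the z-score flags nothing, because the value 5 itself inflates the standard deviation. This is the "less strict" behaviour described above.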

When you identify a value as an outlier, you need to decide what caused it. Typical causes are:

- mistakes, typing errors, human error, technology whims, etc.
- faults, results of wrong measurements, etc.

If you know the cause of the outlier and can make sure that it will not occur again, the value can be removed from the data set. In other cases you must consider carefully whether, by getting rid of an outlier, you would lose important information about events with low frequencies.

The other characteristics describing a quantitative variable are skewness and kurtosis. Their formulas are rather complex, so specialized software is usually used for the calculation.

Skewness

- Skewness is defined as asymmetry in the distribution of the variable values. Values on one side of the distribution tend to be further away from the "middle" than values on the other side.

- The following formula is used:

Skewness interpretation:

= 0 ... variable values are distributed symmetrically around the mean
> 0 ... values smaller than the mean are predominant
< 0 ... values larger than the mean are predominant

[Figure: three histograms (categories 1 to 7, frequencies 0 to 60) illustrating skewness = 0, skewness > 0 and skewness < 0]

Kurtosis

- Kurtosis represents concentration of variable values around their mean.

The following formula is used to get its value:


$$\frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4}{s^4} - 3\,\frac{(n-1)^2}{(n-2)(n-3)}$$

Kurtosis interpretation:

= 0 ... kurtosis corresponds to the normal distribution
> 0 ... "peaked" distribution of the variable
< 0 ... "flat" distribution of the variable

[Figure: three histograms (categories 1 to 7) illustrating kurtosis = 0, kurtosis > 0 and kurtosis < 0]
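Both characteristics are easy to compute once the formulas are fixed. The sketch below is hedged: `sample_kurtosis` follows the kurtosis formula given in this section, while the skewness formula did not survive in this copy of the notes, so `sample_skewness` uses the bias-adjusted sample skewness, n / ((n - 1)(n - 2)) * sum(((x_i - mean) / s)^3), purely as an assumption. Both function names are illustrative.

```python
from statistics import mean, stdev


def sample_kurtosis(values):
    """Kurtosis as in the formula of this section (needs at least 4 values)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth_power_sum = sum((x - m) ** 4 for x in values)
    return (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth_power_sum / s**4
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )


def sample_skewness(values):
    """ASSUMPTION: bias-adjusted sample skewness; the notes' own formula is
    not available here (needs at least 3 values)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 1, 2, 10]
print(sample_skewness(symmetric), sample_kurtosis(symmetric))
print(sample_skewness(long_right_tail), sample_kurtosis(long_right_tail))
```

With the assumed formula, the symmetric sample gives a skewness of zero, and the sample with a long right tail (most values below the mean) gives a positive skewness, in line with the interpretation above.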

We have now defined all numerical characteristics of the quantitative variable. Next we are going to take a look at how they can be interpreted graphically.

1.2.2 Graphical Methods of Presenting Quantitative Variables

Box plot

A box plot is a way of summarizing a data set on an interval scale. It is often used in exploratory data analysis. It is a graph that shows the shape of the distribution, its centre point, and variability. The resulting picture consists of the most extreme values in the data set (maximum and minimum), the lower and upper quartiles, and the median.

A box plot is especially helpful for indicating whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set.

[Figure: box plot on a vertical axis from 0 to 60, with the labels Outlier, Max1, Upper Quartile, Median, Shorth, Lower Quartile and Min1]

Notice: A box plot construction begins by marking outliers and then other characteristics
(min1, max1, quartiles and shorth).
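The construction order in the notice (outliers first, then the remaining characteristics) can be sketched as follows. Several things here are assumptions rather than definitions from the notes: outliers are marked with the 1.5 IQR rule, min1 and max1 are taken to be the smallest and largest values that are not outliers, the quartiles use `statistics.quantiles` with `method="inclusive"`, and the shorth is left out. The data are the bank salaries used in the Stem and Leaf Plot section.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the numbers needed to draw a box plot (shorth not included)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")  # assumed quantile definition
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Step 1: mark the outliers
    outliers = sorted(x for x in values if x < low_fence or x > high_fence)
    # Step 2: min1 and max1 are assumed to be the extremes of the remaining values
    inside = [x for x in values if low_fence <= x <= high_fence]
    return {
        "outliers": outliers,
        "min1": min(inside),
        "lower_quartile": q1,
        "median": med,
        "upper_quartile": q3,
        "max1": max(inside),
    }


salaries = [
    10654, 9765, 8675, 12435, 9675, 10343, 18786, 15420, 8675, 7132,
    6732, 6878, 15657, 9754, 9543, 9435, 10647, 12453, 9987, 10342,
]
for name, value in box_plot_summary(salaries).items():
    print(f"{name:>15}: {value}")
```

Under these assumptions the three highest salaries are marked as outliers, and the upper whisker ends at the largest remaining value.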

Stem and Leaf Plot

As we saw, simplicity is an advantage of the box plot. However, information about specific
values of the variable is missing. The missing numeric values would have to be specifically
marked down onto the graph. The stem and leaf plot makes up for that limitation.

We have a variable representing the average monthly salary of bank employees in the Czech Republic.

Average monthly pay [CZK]

10,654   9,765   8,675  12,435   9,675  10,343  18,786  15,420   8,675   7,132
 6,732   6,878  15,657   9,754   9,543   9,435  10,647  12,453   9,987  10,342

Average monthly pay [CZK], data in ascending order

 6,732   6,878   7,132   8,675   8,675   9,435   9,543   9,675   9,754   9,765
 9,987  10,342  10,343  10,647  10,654  12,435  12,453  15,420  15,657  18,786

How do we bring the data onto the graph? The place values that are regarded as “unimportant” are ignored and the data on the higher places are put in order.

We are especially interested in the values from the third (hundreds’) place. The values on the fourth (thousands’) place are written down in ascending order, thus creating a stem. Under the graph we append a stem width that will also act as a coefficient used to multiply the values in the graph.
The second column of the graph, known as leaves, contains the numbers representing the "important" place values. They are written down in the corresponding rows.
The third column is the absolute frequency for each row.

Stem   Leaves   Frequencies
   6   78       2
   7   1        1
   8   66       2
   9   456779   6
  10   3366     4
  12   44       2
  15   46       2
  18   7        1

Stem width: *10^3

For example: the first row in the graph represents two values, (6.7 and 6.8)*10^3 CZK, i.e. 6,700 CZK and 6,800 CZK. The sixth row represents two values too, (12.4 and 12.4)*10^3 CZK, i.e. two employees have an average monthly pay of 12,400 CZK, etc.
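The construction described above can be sketched in Python. This is a minimal illustration that assumes integer data and truncates (rather than rounds) the "unimportant" places, which is what reproduces the plot shown in the notes; the function name `stem_and_leaf` and the parameter `stem_width` are illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values, stem_width=1000):
    """Return rows (stem, leaves, frequency) of a stem and leaf plot.

    The stem is the value divided by stem_width; the leaf is the next
    lower digit. All remaining digits are truncated.
    """
    leaf_unit = stem_width // 10
    rows = defaultdict(str)
    for x in sorted(values):
        stem, rest = divmod(x, stem_width)
        rows[stem] += str(rest // leaf_unit)  # keep only the first digit after the stem
    return [(stem, leaves, len(leaves)) for stem, leaves in sorted(rows.items())]


salaries = [
    10654, 9765, 8675, 12435, 9675, 10343, 18786, 15420, 8675, 7132,
    6732, 6878, 15657, 9754, 9543, 9435, 10647, 12453, 9987, 10342,
]
for stem, leaves, frequency in stem_and_leaf(salaries):
    print(f"{stem:>2} | {leaves:<6} | {frequency}")
print("Stem width: *10^3")
```

Calling `stem_and_leaf(salaries, stem_width=10_000)` collapses the data into only two rows, which illustrates why the choice of stem width matters.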

There are various modifications of this graph. For example, the third column could store cumulative frequencies, with the absolute frequency of the median row shown in parentheses. Above the median row the frequencies are cumulated from the smallest values; below it they are cumulated from the highest values, as seen in the picture.

Stem   Leaves   Cumulative frequencies
   6   78       2
   7   1        3
   8   66       5
   9   456779   (6)
  10   3366     9
  12   44       5
  15   46       3
  18   7        1

Stem width: *10^3
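The cumulative column can be derived from the absolute frequencies alone. The sketch below is an illustration under one assumption: the median row is taken to be the first row at which the frequencies cumulated from the top reach half of the observations. The notes do not say what to do when the median falls between two rows, so that case is not handled specially. The name `cumulative_column` is illustrative.

```python
from itertools import accumulate


def cumulative_column(frequencies):
    """Return the cumulative-frequency labels for a stem and leaf plot."""
    n = sum(frequencies)
    from_top = list(accumulate(frequencies))
    # ASSUMPTION: the median row is the first row where the running total reaches n / 2
    median_row = next(i for i, total in enumerate(from_top) if 2 * total >= n)
    labels = []
    for i, frequency in enumerate(frequencies):
        if i < median_row:
            labels.append(str(from_top[i]))          # cumulated from the smallest values
        elif i == median_row:
            labels.append(f"({frequency})")          # absolute frequency of the median row
        else:
            labels.append(str(n - from_top[i - 1]))  # cumulated from the highest values
    return labels


frequencies = [2, 1, 2, 6, 4, 2, 2, 1]  # third column of the basic plot
print(cumulative_column(frequencies))
```

For the salary data this reproduces the column shown in the picture.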

Finally, you need to keep in mind that there are different ways of constructing a stem and leaf
plot and you need to be aware of one particular problem. Nowhere is it said which place
values of the variable are important and which ones are not. This is left to the observer.
However, there is a tip to follow: a long stem with short leaves, or a short stem with long leaves, indicates an incorrect choice of scale. Look at the picture.

Stem   Leaves        Frequencies
   0   66788999999   11
   1   000022558     9

Stem width: *10^4

Quiz

1. What is exploratory statistics concerned with?

2. Characterize the basic types of variables.

3. Which statistical characteristics can be contained in frequency table (for what type of
variable)?

4. What are the outliers and how do you define them?

5. Which characteristics are sensitive to outliers?


a) Median
b) Arithmetical Mean
c) Upper Quartile

6. How do you depict the qualitative (quantitative) variables?

7. The following box plot represents students’ earnings during holiday.

Mark statements that do not correspond to the displayed reality:


a) A student earned 19 thousand CZK at most.
b) The inter-quartile range is approximately 10 thousand CZK.
c) Half of the students earned less than 11 thousand CZK.
d) The shorth is roughly the interval (5; 15) thousand CZK.

Practical Exercises

Exercise 1: The following data represent car manufacturers’ countries of origin. Analyze the data (frequency, relative frequency, cumulative frequency and cumulative relative frequency, mode) and interpret it in graphical form (histogram, pie chart).

USA          USA          Germany      Czech Rep.
Germany      Germany      Germany      Czech Rep.
Czech Rep.   Czech Rep.   USA          Germany

Exercise 2: The following data represent customers’ waiting time (min) when dealing with the customer service. Draw a box plot and a stem and leaf plot.

120 80 100 90
150 5 140 130
100 70 110 100

Exercise 3: A traffic survey was carried out to establish a vehicle count at an intersection. A student data collector recorded the number of cars waiting in the queue each time the green light came on. These are his/her outcomes:

3 1 5 3 2 3 5 7 1 2 8 8 1 6 1 8 5 5 8 5 4 7 2 5 6 3 4 2 8 4 4 5 5 4 3 3 4 9 6 2 1
5 2 3 5 3 5 7 2 5 8 2 4 2 4 3 5 64 6 9 3 2 1 2 6 3 5 3 5 3 7 6 3 75 6

Draw box plot, empirical distribution function and calculate the mean, standard deviation, shorth,
mode and inter-quartile range.
