Data Presentation and Summary Statistics
Answers: 0, 2, 1, 7, 9, 3, 0, 1, 1, 1, 0, 4, 1, 5, 1, 9, 3, 2, 0, 3, 1, 6, 1,
0, 2, 5, 7, 1, 3, 4, 1, 0, 2, 7, 1, 3, 0, 1, 3, 1
The answers are given as a list, but this is not the most helpful way to present
the information.
The “No. of Sopranos Episodes” varies from person to person and is known as
the variable. It is convenient to denote a variable by a letter such as X. Each
different value of the variable occurs with a particular frequency, i.e. the
number of people in each category. The frequency is always denoted by the
letter f (small f).
X            f
0            7
1           13
2            4
3            6
4 or more   10
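The frequency table can also be built by machine. The following is a minimal Python sketch, not part of the original notes: it copies the list of answers given at the start and counts how often each category occurs. The helper name `category` is an illustrative choice.

```python
from collections import Counter

# Survey answers copied from the list at the start of these notes.
answers = [
    0, 2, 1, 7, 9, 3, 0, 1, 1, 1, 0, 4, 1, 5, 1, 9, 3, 2, 0, 3, 1, 6, 1,
    0, 2, 5, 7, 1, 3, 4, 1, 0, 2, 7, 1, 3, 0, 1, 3, 1,
]


def category(x):
    """Map a raw answer to its table row; 4, 5, 6, ... share one row."""
    return str(x) if x < 4 else "4 or more"


# Count how many people fall into each category (the frequency f).
frequency = Counter(category(x) for x in answers)

print("X          f")
for label in ["0", "1", "2", "3", "4 or more"]:
    print(f"{label:<10} {frequency[label]}")
```

The frequencies should add up to the number of people asked, which is a useful check on any frequency table.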
Preparatory Mathematics: Collecting and Presenting Data (2015)
Discrete Data
Data is said to be discrete when it can only take a certain set of distinct values.
12 14 9 13 11 15 10 12 14 13
12 11 13 12 15 9 14 12 13 10
When the figures are presented in this way, it is difficult to make much sense
of them. A frequency table makes clearer how the values in the set of data are
spread out.
Frequency Tables
In the example above, the variable is the number of defective items during a
day. Let X stand for the variable, i.e. X = the number of defective items on a
day.
The number of times each value of the variable occurs is the frequency of the
value, denoted by f.
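As a rough illustration (a sketch, not part of the original notes), the frequency table for the defective items data can be produced in Python by counting how often each value of X occurs. The variable names are illustrative.

```python
from collections import Counter

# Number of defective items on each of 20 days, copied from the data above.
defective_items = [
    12, 14, 9, 13, 11, 15, 10, 12, 14, 13,
    12, 11, 13, 12, 15, 9, 14, 12, 13, 10,
]

# Each pair is (value of X, frequency f), sorted by the value of X.
frequency_table = sorted(Counter(defective_items).items())

print(" X  f")
for x, f in frequency_table:
    print(f"{x:>2}  {f}")
```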
A Histogram displays the data from a frequency table graphically. Some people
prefer tables and others pictures!
Put the values of the variable on the horizontal axis (or x-axis) and measure the
frequency on the vertical axis. Draw a rectangle above each value with the
height of the rectangle indicating how often the corresponding value occurs.
[Figure: histogram of the defective items data. Horizontal axis: Defective Items in a Day (X), values 9 to 15. Vertical axis: Frequency (f) in Days, from 0 to 6.]
From a Histogram, it is easy to read off which value of the variable occurs most
often. It is simply the value corresponding to the tallest rectangle. This value is
called the mode of the data set. In the above example the mode is 12
defective items – this occurred on more days than any other value.
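A quick way to see the same thing without drawing is sketched below in Python (an illustration, not part of the original notes). It prints a text histogram turned on its side, with one `#` per day, and then picks out the value with the largest frequency. Note that `most_common(1)` returns only one value, so it suits data with a single mode like this set.

```python
from collections import Counter

# Number of defective items on each of 20 days, copied from the data above.
defective_items = [
    12, 14, 9, 13, 11, 15, 10, 12, 14, 13,
    12, 11, 13, 12, 15, 9, 14, 12, 13, 10,
]

frequency = Counter(defective_items)

# Sideways histogram: one row per value of X, one '#' per day.
for x in range(min(defective_items), max(defective_items) + 1):
    print(f"{x:>2} | {'#' * frequency[x]}")

# The mode is the value with the longest bar.
mode_value, mode_frequency = frequency.most_common(1)[0]
print("Mode =", mode_value, "occurring on", mode_frequency, "days")
```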
Data is continuous if it can take any value in some range. A large set of discrete data whose values lie very close to each other is usually treated as continuous data too.
Note: We have already seen one grouped frequency table. In our very first
example (on Sopranos Episodes) we grouped the values 4, 5, 6, and so on into a
single entry in the frequency table: “4 or more”
10.2 35.7 12.1 19.2 17.5 44.21 15.5 22.3 25.6 28.6
21.2 10.2 33.76 34.12 41.65 29.43 21.0 34.47 38.72 32.12
23.49 16.61 17.12 49.16 55.12 39.17 35.63 41.25 37.73 29.15
43.78 25.55 37.76 18.81 49.12 38.71 33.34 56.72 30.0 47.76
42.23 62.21 68.37 11.2 27.89 39.87 52.3 41.23 19.54 28.87
2. Choose a class size so that the number of classes is between 5 and 15. The
size itself of each class should be convenient.
In our example the lowest salary is 10.2 and the highest is 68.37. Classes of
width 10 or 5 would be appropriate. We will take classes of width 10. Our first
class will be salaries from 10 to 20, the second from 20 to 30 and so on (in
€000’s).
The extreme values in a class (such as 10 and 20 in the first class) are known as
class boundaries.
(a) Include the lower class boundary in each class and exclude the upper
OR
(b) Exclude the lower class boundary from each class and include the upper.
If we do not make this choice, then we will not know whether the number 20
belongs to the first class or the second class. We will have similar problems for
30, 40, 50, and so on.
The Tally column is there to speed up your counting. As you read through the data, put a mark in the relevant Tally box. Group the marks in fives as you go: four vertical strokes, with each fifth mark drawn as a horizontal line through them. Why tally first? It means you only read ONCE through the data, and it ensures you neither miss a value nor count it twice.
Now add up your Tally marks to get your frequency column.
Note: A more mathematical notation for “From 10 to under 20” that you might see is “10 ≤ X < 20”. We will stick to the English version!
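The tallying can be sketched in Python as follows. This is an illustration rather than part of the original notes, and it assumes choice (a): the lower class boundary is included and the upper one excluded, matching "From 10 to under 20". The names `CLASS_WIDTH` and `class_lower_boundary` are illustrative.

```python
from collections import Counter

# Salaries (in thousands of euro) of the 50 employees, copied from the data above.
salaries = [
    10.2, 35.7, 12.1, 19.2, 17.5, 44.21, 15.5, 22.3, 25.6, 28.6,
    21.2, 10.2, 33.76, 34.12, 41.65, 29.43, 21.0, 34.47, 38.72, 32.12,
    23.49, 16.61, 17.12, 49.16, 55.12, 39.17, 35.63, 41.25, 37.73, 29.15,
    43.78, 25.55, 37.76, 18.81, 49.12, 38.71, 33.34, 56.72, 30.0, 47.76,
    42.23, 62.21, 68.37, 11.2, 27.89, 39.87, 52.3, 41.23, 19.54, 28.87,
]

CLASS_WIDTH = 10


def class_lower_boundary(value, width=CLASS_WIDTH):
    """Return the lower boundary of the class containing value.

    A value equal to a boundary (e.g. 20) goes into the class that starts
    there, i.e. the lower boundary is included and the upper one excluded.
    """
    return int(value // width) * width


# One pass through the data, like tallying by hand.
class_counts = Counter(class_lower_boundary(s) for s in salaries)

grouped = [
    (lower, lower + CLASS_WIDTH, class_counts[lower])
    for lower in range(10, 70, CLASS_WIDTH)
]

for lower, upper, f in grouped:
    print(f"From {lower} to under {upper}: {f}")
```

As with a hand tally, the class frequencies should add up to 50, the number of salaries.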
Drawing a Histogram for grouped data is much the same as it was for simple
discrete data. Again the variable is plotted on the x-axis with each frequency
represented by a rectangle of the correct height. However, this time we have a
rectangle for each class in the frequency table as opposed to a rectangle for
each individual value of the variable.
[Figure: histogram titled "Company Salaries". Horizontal axis: salary classes, "from 10 to under 20" up to "from 60 to under 70". Vertical axis: Frequency (f), from 0 to 16.]
Practice Example: A group of 25 people are asked for their weight to the
nearest lb. Here are the answers.
145, 143, 161, 156, 159, 159, 154, 153, 167, 155, 151, 146, 148, 160, 134, 143,
155, 157, 142, 171, 146, 163, 161, 153, 172
Histogram:
[Figure: histogram titled "Weights". Horizontal axis: Weights in lbs (X), classes "From 130 to under 140" up to "From 170 to under 180". Vertical axis: Frequency (f), from 0 to 12.]
Examples:
If we are dealing with rolls of a die, the population is the results of every
roll of the die ever done.
If we are dealing with the heights of men in Ireland, the population is the
heights of every man in the country.
If we are dealing with the % carbon content in a piece of steel from a production run, the
population might be all of the pieces produced on one run. It may also be
all pieces which might be produced by that production process into the
future.
We select a sample from the population and work with that instead. In
manufacturing, samples from a production run are usually tested for quality
control purposes. This sampling and testing may be destructive and/or time
consuming, so you obviously don’t want to test everything!
Examples of Samples:
Roll a die 250 times and record the results.
Select 1000 men at random and measure their heights.
Select 10 steel pieces at random from a day’s production run and test their
% carbon content.
The sample is the first 10 steel pieces in the day’s production run.
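Random sampling of this kind can be imitated on a computer. The Python sketch below is an illustration only: the seed value, the production run of 500 pieces, and the piece numbering are all assumptions made up for the example, not figures from the notes.

```python
import random
from collections import Counter

random.seed(2015)  # arbitrary seed so that the run is repeatable

# Sample 1: roll a fair die 250 times and record the results.
rolls = [random.randint(1, 6) for _ in range(250)]
roll_counts = Counter(rolls)
for face in range(1, 7):
    print(f"Face {face}: rolled {roll_counts[face]} times")

# Sample 2: pick 10 steel pieces at random from one day's production run.
# Hypothetical run of 500 pieces, numbered 1 to 500.
piece_ids = list(range(1, 501))
tested_pieces = random.sample(piece_ids, 10)  # no piece is chosen twice
print("Pieces to test:", sorted(tested_pieces))
```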
A summary statistic is a statistic that sums up the data in a sample – it tells you
something about the data as a whole.
Measures of Location
An average is a point within a group of data which is central to the group, and
around which the other values are distributed. It is therefore a measure of
central tendency – a measure which starts to summarise the data by fixing one
point as the centre. The position of the central item fixes the location of the
distribution and averages are therefore sometimes called measures of location.
There are three measures of location that we will discuss: the mode, the
median, and the mean.
Median – “middle” value when all the numbers are arranged in order. (It is
greater than half of the values in the data set and less than half the values
in the data set.)
Mean – “average” value in the familiar sense. The mean is the arithmetic
average: add up the values and divide by however many of them there are.
The mode, the median, and the mean all give an indication of where the data
is situated. The idea in each case is to pick one number that is representative
of the data set as a whole.
Example: For the following set of numbers, calculate the mode, the median,
and the mean.
To calculate the mean we add up the numbers and divide by how many numbers we have:
Mean = $\frac{80}{11}$ = 7.27 (to two decimal places)
Example
Find the mean for the salary data given before:
The salaries (in €000’s) of 50 employees in a large corporation are given below.
10.2 35.7 12.1 19.2 17.5 44.21 15.5 22.3 25.6 28.6
21.2 10.2 33.76 34.12 41.65 29.43 21.0 34.47 38.72 32.12
23.49 16.61 17.12 49.16 55.12 39.17 35.63 41.25 37.73 29.15
43.78 25.55 37.76 18.81 49.12 38.71 33.34 56.72 30.0 47.76
42.23 62.21 68.37 11.2 27.89 39.87 52.3 41.23 19.54 28.87
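Adding up 50 numbers by hand is error-prone, so here is a minimal Python sketch of the same calculation (an illustration, not part of the original notes): add up the values and divide by how many there are.

```python
# Salaries (in thousands of euro) of the 50 employees, copied from the data above.
salaries = [
    10.2, 35.7, 12.1, 19.2, 17.5, 44.21, 15.5, 22.3, 25.6, 28.6,
    21.2, 10.2, 33.76, 34.12, 41.65, 29.43, 21.0, 34.47, 38.72, 32.12,
    23.49, 16.61, 17.12, 49.16, 55.12, 39.17, 35.63, 41.25, 37.73, 29.15,
    43.78, 25.55, 37.76, 18.81, 49.12, 38.71, 33.34, 56.72, 30.0, 47.76,
    42.23, 62.21, 68.37, 11.2, 27.89, 39.87, 52.3, 41.23, 19.54, 28.87,
]

n = len(salaries)             # how many values there are
total = sum(salaries)         # add up the values
mean_salary = total / n       # divide by how many there are

print("n =", n)
print("Mean salary (in thousands of euro) =", round(mean_salary, 2))
```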
Example: Below is the histogram for the Defective Items data. It is easy to pick
out the mode of the data set.
[Figure: histogram of the defective items data. Horizontal axis: Number of Defective Items (X), values 9 to 15. Vertical axis: Frequency (f), from 0 to 6.]
Mode =
Example: Recall that a group of 25 people are asked for their weight to the
nearest lb. Here are the answers.
145, 143, 161, 156, 159, 159, 154, 153, 167, 155, 151, 146, 148, 160, 134, 143,
155, 157, 142, 171, 146, 163, 161, 153, 172
Median ≈
Example
Find the median for the salary data given before:
The salaries (in €000’s) of 50 employees in a large corporation are given below.
10.2 35.7 12.1 19.2 17.5 44.21 15.5 22.3 25.6 28.6
21.2 10.2 33.76 34.12 41.65 29.43 21.0 34.47 38.72 32.12
23.49 16.61 17.12 49.16 55.12 39.17 35.63 41.25 37.73 29.15
43.78 25.55 37.76 18.81 49.12 38.71 33.34 56.72 30.0 47.76
42.23 62.21 68.37 11.2 27.89 39.87 52.3 41.23 19.54 28.87
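Working directly from the raw values (rather than from a grouped table or histogram), the median can be found by sorting the data and taking the middle value. The Python sketch below is an illustration, not part of the original notes; `median_by_hand` is a hypothetical helper name. It uses the usual convention of averaging the two middle values when the number of values is even, and compares the result with Python's built-in `statistics.median`.

```python
import statistics

# Weights (to the nearest lb) of the 25 people, copied from the data above.
weights = [
    145, 143, 161, 156, 159, 159, 154, 153, 167, 155, 151, 146, 148, 160,
    134, 143, 155, 157, 142, 171, 146, 163, 161, 153, 172,
]

# Salaries (in thousands of euro) of the 50 employees, copied from the data above.
salaries = [
    10.2, 35.7, 12.1, 19.2, 17.5, 44.21, 15.5, 22.3, 25.6, 28.6,
    21.2, 10.2, 33.76, 34.12, 41.65, 29.43, 21.0, 34.47, 38.72, 32.12,
    23.49, 16.61, 17.12, 49.16, 55.12, 39.17, 35.63, 41.25, 37.73, 29.15,
    43.78, 25.55, 37.76, 18.81, 49.12, 38.71, 33.34, 56.72, 30.0, 47.76,
    42.23, 62.21, 68.37, 11.2, 27.89, 39.87, 52.3, 41.23, 19.54, 28.87,
]


def median_by_hand(values):
    """Arrange the values in order and return the middle one.

    With an even number of values there is no single middle value,
    so the two middle values are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


median_weight = median_by_hand(weights)
median_salary = median_by_hand(salaries)

print("Median weight:", median_weight, "lb")
print("Median salary (in thousands of euro):", round(median_salary, 2))
print("Agrees with statistics.median:",
      median_weight == statistics.median(weights))
```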
It should be noted that for a reasonably large set of real data that is roughly symmetric, the three measures of location will be more or less the same.
Mean
Advantages
1. Takes all the numbers into account
Disadvantages
1. Affected by large values
When to use the mean - if all of the values in a data set are roughly equal,
the mean is the best number to use as a summary statistic.
Mode
Advantages
1. Simple to calculate
2. Unaffected by large values
3. Useful for non-numerical data as well
Disadvantages
1. May be unrepresentative of the whole data set
2. Might not be unique. For example, consider the list of numbers 1, 2, 2, 2, 4, 4, 7, 7, 7, 9. Both 2 and 7 are modes for that set!
When to use the mode - when nearly all the values in the data set are the
same or for a small data set.
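The point that the mode might not be unique, and that it also works for non-numerical data, can be checked with a short Python sketch (an illustration, not part of the original notes). It uses `statistics.multimode`, available from Python 3.8, which returns every value that shares the highest frequency. The list of colours is made-up data for the example.

```python
from statistics import multimode

# The list of numbers discussed above: 2 and 7 each occur three times.
numbers = [1, 2, 2, 2, 4, 4, 7, 7, 7, 9]
modes = multimode(numbers)
print("Modes:", modes)

# Hypothetical non-numerical data: the mode is still well defined.
colours = ["red", "blue", "red", "green"]
colour_modes = multimode(colours)
print("Most common colour:", colour_modes)
```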
Median
Advantages
1. Unaffected by large values
2. Easy to calculate
Disadvantages
1. May not be representative of the whole data set
When to use the median - when the data set has a few very large or very
small numbers.
Negative Skew
Why is it called negative skew? Because the long "tail" is on the negative side of the peak. People sometimes say it is "skewed to the left" (the long tail is on the left-hand side).
Not Skewed
A Normal Distribution is not skewed. It is perfectly symmetrical, and the mean is exactly at the peak.
Positive Skew
Positive skew is when the long tail is on the positive side of the peak, and some people say it is "skewed to the right".
Calculating Skewness
"Skewness" (the amount of skew) can be calculated; for example, you could use the SKEW() function in Excel. Some other resources:
http://www.statisticshowto.com/skewed-distribution/
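For readers without Excel, the calculation can be sketched in Python. This is an illustration, not part of the original notes. It assumes the adjusted sample-skewness formula that Excel documents for SKEW(), namely n / ((n - 1)(n - 2)) times the sum of ((x - mean) / s) cubed, where s is the sample standard deviation (dividing by n - 1). The function name and the three short data lists are made up for the example.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness (the formula assumed here for Excel's SKEW)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    # Sample standard deviation: divide by n - 1, not n.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("all values are equal, skewness is undefined")
    cubes = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubes


# Hypothetical data sets chosen to show the three cases.
symmetric = [1, 2, 3, 4, 5]          # no long tail on either side
long_right_tail = [1, 1, 1, 2, 10]   # long tail on the positive side
long_left_tail = [-10, -2, -1, -1, -1]  # long tail on the negative side

for name, data in [("symmetric", symmetric),
                   ("long right tail", long_right_tail),
                   ("long left tail", long_left_tail)]:
    print(f"{name}: skewness = {sample_skewness(data):.3f}")
```

The sign of the result matches the descriptions above: positive when the long tail is on the right, negative when it is on the left, and zero (up to rounding) for symmetric data.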
Measures of Dispersion
While the mode, median, and mean help us to summarise a set of data, they
tell us nothing about how spread out the values in the set are.
For both, the mean = 24, but the second is obviously more spread out.
We will look at two ways of describing the amount of spread in a set of data:
1. Range
2. Standard Deviation
Range
The range is just the difference between the lowest and highest numbers in the
data set.
Range = 27-22 = 5
Advantages
1. Easy to calculate
Disadvantages
1. Only uses two numbers from the set
2. A single outlier can distort the range, making it a poor measure of spread when all the other values are close together.
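Both the calculation and the second disadvantage can be illustrated with a short Python sketch (not part of the original notes). It reuses the 25 weights from the earlier practice example; the extra value of 300 lb is a made-up outlier added purely for illustration.

```python
# Weights (to the nearest lb) of the 25 people from the earlier practice example.
weights = [
    145, 143, 161, 156, 159, 159, 154, 153, 167, 155, 151, 146, 148, 160,
    134, 143, 155, 157, 142, 171, 146, 163, 161, 153, 172,
]

# Range = highest value minus lowest value.
data_range = max(weights) - min(weights)
print("Range =", max(weights), "-", min(weights), "=", data_range)

# Hypothetical outlier: one extra value changes the range dramatically,
# even though the other 25 values have not moved.
with_outlier = weights + [300]
range_with_outlier = max(with_outlier) - min(with_outlier)
print("Range with one outlier =", range_with_outlier)
```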
Standard Deviation
The standard deviation is the most important and widely used measure of
dispersion. It measures the average deviation of the numbers in a data set from
the mean of the set.
Mean = $\bar{x}$ =
2. Next subtract the mean from each number in the original set.
177 − = 185 − =
180 − = 187 − =
181 − =
$s^2$ =
5. Finally, to compensate for having squared all of the original deviations, take
the square root of the answer from part 4. This is the standard deviation of
the set of numbers.
Standard deviation $s = \sqrt{s^2}$ =
Note: For a small sample such as this we should really use the “sample standard deviation” formula, which in step 4 above would divide by n – 1 = 4 rather than n = 5. For larger samples, there is no great difference between using n and n – 1, so we will use n and not worry about it.
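The steps can be followed through in a short Python sketch (an illustration, not part of the original notes). The five numbers are read off step 2 above: 177, 180, 181, 185 and 187. Steps 3 and 4 are taken to be "square each deviation" and "average the squared deviations", which is what the note about dividing by n in step 4 and the square root in step 5 imply. The last line shows the n – 1 version mentioned in the note.

```python
import math
import statistics

numbers = [177, 180, 181, 185, 187]
n = len(numbers)

# Step 1: calculate the mean.
mean = sum(numbers) / n

# Step 2: subtract the mean from each number.
deviations = [x - mean for x in numbers]

# Step 3 (as assumed above): square each deviation.
squared_deviations = [d ** 2 for d in deviations]

# Step 4 (as assumed above): average the squared deviations, dividing by n.
variance = sum(squared_deviations) / n

# Step 5: take the square root to get the standard deviation.
s = math.sqrt(variance)

# The "sample standard deviation" divides by n - 1 instead of n.
sample_s = math.sqrt(sum(squared_deviations) / (n - 1))

print("Mean =", mean)
print("Deviations =", deviations)
print("s^2 =", variance)
print("s (dividing by n) =", round(s, 4))
print("s (dividing by n - 1) =", round(sample_s, 4))
```

Python's `statistics.pstdev` and `statistics.stdev` give the n and n – 1 versions respectively, so they can be used to check a hand calculation.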
Example:
Solution:
Steps:
1. First we calculate the mean
2. Next subtract the mean from each number in the original set.
5. Finally, to get the standard deviation of the set of numbers, take the square root of the answer from part 4. This is
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$ (again, use n – 1 if n is small)
$\sum (x - \bar{x})^2 =$
$\sum x =$
$\bar{x} =$ and $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} =$
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$ (again, use n – 1 if n is small)
Note: This is the standard deviation formula when frequencies are not known.
Example: The number of days with an average temperature below 6°C is shown below. Calculate the mean and standard deviation for this data.
Month  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
Days    18   12   11    6    9    4    2    0    2    6   12   17
Solution:
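We have 12 numbers here: 18, 12, 11, 6, 9, 4, 2, 0, 2, 6, 12, 17, so here n = 12.
Standard Deviation $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
$x$     $\bar{x}$     $x - \bar{x}$     $(x - \bar{x})^2$
18
12
11
6
9
4
2
0
2
6
12
17
$\sum (x - \bar{x})^2 =$
$\sum x =$
$\bar{x} =$ and $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} =$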
This method is fine for small data sets and, in the lab, with the use of an Excel spreadsheet we can use it for large data sets also.
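As an alternative to a spreadsheet, the table can be filled in with a few lines of Python. The sketch below is an illustration, not part of the original notes; it follows the formula above, dividing by n, and prints one row per month so the output can be compared with a hand-completed table.

```python
import math

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Number of days with an average temperature below 6 degrees C, from the table above.
days = [18, 12, 11, 6, 9, 4, 2, 0, 2, 6, 12, 17]

n = len(days)
total = sum(days)          # the sum of the x
mean = total / n           # x-bar

# One row of the working table per month: x, x-bar, x - x-bar, (x - x-bar)^2.
rows = [(x, mean, x - mean, (x - mean) ** 2) for x in days]
for month, (x, x_bar, deviation, squared) in zip(months, rows):
    print(f"{month}  {x:>2}  {x_bar:.2f}  {deviation:>6.2f}  {squared:>8.4f}")

sum_of_squares = sum(squared for _, _, _, squared in rows)
s = math.sqrt(sum_of_squares / n)   # dividing by n, as in the formula above

print("Sum of x =", total)
print("Mean =", mean)
print("Sum of squared deviations =", sum_of_squares)
print("Standard deviation s =", round(s, 3))
```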
1. Find the mean, median and mode of the following sets of data:
(b) 127 138 123 121 129 124 122 128 132 124.
2. Find the range and standard deviation for each of the data sets in question
1.
Number of Technicians 0 1 2 3 4 5
Number of Days 3 10 8 7 6 6
PS 1(a)
Count  count  x    meanx    (x-meanx)     (x-meanx)^2
1      11     5    10.5455  -5.54545455   30.75206612
2      10     7    10.5455  -3.54545455   12.57024793
3      9      7    10.5455  -3.54545455   12.57024793
4      8      8    10.5455  -2.54545455   6.479338843
5      7      9    10.5455  -1.54545455   2.388429752
6      6      10   10.5455  -0.54545455   0.297520661
7      5      11   10.5455  0.454545455   0.20661157
8      4      12   10.5455  1.454545455   2.115702479
9      3      12   10.5455  1.454545455   2.115702479
10     2      16   10.5455  5.454545455   29.75206612
11     1      19   10.5455  8.454545455   71.47933884
Sum           116                         170.7272727
(116 is the sum of the x; 170.7272727 is the sum of (x-meanx)^2.)
Mean = 10.54545455
Standard Deviation = 3.939627033
n = 11
Mode = 7 and 12
Median = 10
Range = 19 - 5 = 14
PS 1(b)
Count  count  x    meanx  (x-meanx)  (x-meanx)^2
1      10     121  126.8  -5.8       33.64
2      9      122  126.8  -4.8       23.04
3      8      123  126.8  -3.8       14.44
4      7      124  126.8  -2.8       7.84
5      6      124  126.8  -2.8       7.84
6      5      127  126.8  0.2        0.04
7      4      128  126.8  1.2        1.44
8      3      129  126.8  2.2        4.84
9      2      132  126.8  5.2        27.04
10     1      138  126.8  11.2       125.44
Sum           1268                   245.6
(1268 is the sum of the x; 245.6 is the sum of (x-meanx)^2.)
Mean = 126.8
Standard Deviation = 4.955804677
n = 10
Mode = 124
Median = (124 + 127)/2 = 125.5
Range = 138 - 121 = 17
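The spreadsheet answers for question 1 can be cross-checked with Python's `statistics` module. The sketch below is an illustration, not part of the original notes. The data for part (a) is read off the x column of the PS 1(a) answer sheet, and the standard deviation divides by n, as the answer sheets do. The function name `summarise` is an illustrative choice.

```python
import statistics

# Part (a): values read off the x column of the PS 1(a) answer sheet.
ps1a = [5, 7, 7, 8, 9, 10, 11, 12, 12, 16, 19]
# Part (b): values as listed in question 1(b).
ps1b = [127, 138, 123, 121, 129, 124, 122, 128, 132, 124]


def summarise(values):
    """Return the summary statistics asked for in questions 1 and 2."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),   # every mode, if more than one
        "range": max(values) - min(values),
        "sd": statistics.pstdev(values),         # divides by n, not n - 1
    }


summary_a = summarise(ps1a)
summary_b = summarise(ps1b)

for label, summary in [("PS 1(a)", summary_a), ("PS 1(b)", summary_b)]:
    print(label)
    for name, value in summary.items():
        print(f"  {name}: {value}")
```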