St130: Basic Statistics Week 3: Lecture: School of Computing Information and Mathematical Sciences
WEEK 3: LECTURE
OUTLINE
• Summarize data, using measures of central tendency, such as
mean, median, mode.
• Describe data, using measures of variation, such as range,
variance and standard deviation.
• Identify the position of a data value in a data set, using
various measures of position, such as percentiles, deciles and
quartiles.
• Use the techniques of exploratory data analysis, including
boxplots and the five-number summary, to discover various
aspects of data.
Data Description
INTRODUCTION
• After collection, organization and presentation
of data we now examine the statistical methods
that can be used to describe data.
• The methods include
• measures of central tendency;
• measures of variation;
• measures of position.
MEASURES OF CENTRAL TENDENCY
1. THE MEAN
• The mean (arithmetic mean) is found by
adding all the data values and dividing by the
total number of data values.
Example 1
The mean of 3, 2, 6, 5 and 4 is found by adding
3+2+6+5+4=20 and dividing by 5; hence the
mean of the data is 20/5=4.
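Sample
$\bar{X} = \frac{\sum X}{n}$    $\bar{X} = \frac{\sum fX}{n}$    $\bar{X} = \frac{\sum f X_m}{n}$
Population
$\mu = \frac{\sum X}{N}$    $\mu = \frac{\sum fX}{N}$    $\mu = \frac{\sum f X_m}{N}$
Here $f$ is the frequency and $X_m$ is the class midpoint.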
Rounding rule for the mean: the mean should be rounded
to one more decimal place than in the raw data.
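Example 1 can be checked with a few lines of Python. This is only a sketch of the definition (sum of the values divided by their count); the variable names are illustrative.

```python
# Example 1: the mean is the sum of the values divided by their count.
values = [3, 2, 6, 5, 4]

mean = sum(values) / len(values)
print(mean)  # 4.0
```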
EXAMPLE 2
Rating (X ) Frequency ( f ) fX
1 2 2
2 1 2
3 2 6
4 2 8
5 2 10
6 5 30
7 3 21
8 2 16
9 2 18
10 3 30
Total n = 24 $\sum fX = 143$
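The totals in the table, and the mean they lead to, can be reproduced as follows. This is a minimal sketch that applies the formula mean = (sum of fX) / n to the ratings and frequencies above; the variable names are illustrative, and the result is rounded to one decimal place following the rounding rule.

```python
# Ratings X and their frequencies f, copied from the Example 2 table.
ratings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
freqs = [2, 1, 2, 2, 2, 5, 3, 2, 2, 3]

n = sum(freqs)                                       # total frequency
sum_fx = sum(f * x for x, f in zip(ratings, freqs))  # sum of f * X
mean = sum_fx / n

print(n, sum_fx)       # 24 143
print(round(mean, 1))  # 6.0
```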
EXAMPLE 3
Find the mean of the frequency distribution given below. The
data show the distribution of the weights (in kg) of 50 randomly
selected ST130 students.
Class Limits Frequency
30-39 5
40-49 10
50-59 18
60-69 12
70-79 5
Total 50
SOLUTION
Class Limits $f$ $X_m$ $f \cdot X_m$
30-39 5 34.5 172.5
40-49 10 44.5 445
50-59 18 54.5 981
60-69 12 64.5 774
70-79 5 74.5 372.5
n = 50 $\sum f \cdot X_m = 2745$
$\bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{2745}{50} = 54.9$ kg
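The grouped-data calculation can be sketched in Python as below. It assumes, as in the solution table, that each class is represented by its midpoint (the average of the two class limits); the tuple layout and names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class in Example 3.
classes = [(30, 39, 5), (40, 49, 10), (50, 59, 18), (60, 69, 12), (70, 79, 5)]

n = sum(f for _, _, f in classes)
# The class midpoint X_m is the average of the two class limits.
sum_fxm = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
mean = sum_fxm / n

print(n, sum_fxm, mean)  # 50 2745.0 54.9
```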
2. THE MEDIAN
EXAMPLE 4
The data represent the number of lectures missed per year for
a sample of students selected from the university.
10 13 26 35 15 28 19 24 36 40 46
Find the median.
Solution
Step 1 Arrange the data in order:
10 13 15 19 24 26 28 35 36 40 46
Step 2 Select the middle value. With 11 values, the middle
value is the sixth value, which is 26. Thus the median is
26 lectures.
EXAMPLE 5
The data represent the number of lectures missed per
year for a sample of students selected from the
university.
10 13 26 35 15 28 19 24 36 40
Find the median.
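Solution
Step 1 Arrange the data in order:
10 13 15 19 24 26 28 35 36 40
Step 2 With 10 values there is no single middle value, so the
median is the mean of the two middle values, 24 and 26:
$MD = \frac{24 + 26}{2} = 25$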
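Examples 4 and 5 can be checked with Python's standard `statistics` module, whose `median` function sorts the data and, for an even number of values, averages the two middle ones. The list names are illustrative.

```python
import statistics

# Example 4: 11 values, so the median is the 6th value after sorting.
example_4 = [10, 13, 26, 35, 15, 28, 19, 24, 36, 40, 46]
# Example 5: 10 values, so the median is the mean of the 5th and 6th values.
example_5 = [10, 13, 26, 35, 15, 28, 19, 24, 36, 40]

median_4 = statistics.median(example_4)
median_5 = statistics.median(example_5)
print(median_4)  # 26
print(median_5)  # 25.0
```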
MEDIAN FOR GROUPED DATA
To find the median for the grouped data we can use the
ogive.
[Figure: ogive with the values 17.5 and 51 marked]
So the median is 51.
3. THE MODE
The value that occurs most often in a data set is called
the mode. A data set can have more than one mode or
no mode at all.
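The three cases (one mode, several modes, no mode) can be sketched as below. The helper `modes` and the small data sets are hypothetical, made up for illustration; the helper follows the convention that a data set in which no value repeats has no mode, and it assumes a non-empty list.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:  # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical data sets, for illustration only.
print(modes([8, 9, 9, 14, 8, 8, 10]))  # [8]     one mode
print(modes([1, 2, 2, 3, 3, 4]))       # [2, 3]  two modes (bimodal)
print(modes([5, 7, 9, 11]))            # []      no mode
```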
MODE FOR GROUPED DATA
The mode for grouped data is known as the modal class.
The modal class is the class with the largest frequency.
Note:
In many cases, the different measures of central
tendency may have significantly different
values. One has to be very cautious in using
these measures.
EXAMPLE 7
The annual salaries of the staff of a company are listed below.
Which measure is more reliable: the mean, the median, or the mode?
Staff Salary
Owner 50,000 (outlier)
Manager 20,000
Salesperson 12,000
Technician 9,000
Technician 9,000
Solution
The mean is 20,000, the median is 12,000, and the mode is 9,000.
The median is more reliable than the mean because the mean is
easily affected by outliers.
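The three measures in Example 7 can be verified with the standard `statistics` module. As an extra illustration that is not part of the original example, the sketch also recomputes the mean and median without the owner's salary, to show how much more the outlier moves the mean than the median.

```python
import statistics

salaries = [50_000, 20_000, 12_000, 9_000, 9_000]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
mode = statistics.mode(salaries)
print(f"{mean:,.0f} {median:,.0f} {mode:,.0f}")  # 20,000 12,000 9,000

# Illustration only: drop the owner's salary (the outlier) and recompute.
staff = salaries[1:]
mean_no_outlier = statistics.mean(staff)
median_no_outlier = statistics.median(staff)
print(f"{mean_no_outlier:,.0f} {median_no_outlier:,.0f}")  # 12,500 10,500
```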
DISTRIBUTIONS
Measures of Variation
MEASURES OF VARIATION
The measures of variation (or ‘dispersion’) are
numerical measures of how far the data values
spread out from the centre of the data.
Example 8
I wish to test two brands of outdoor paint to see how long
each will last before fading. The results (in months) are
shown. Find the mean and median of each group.
(Assume population)
BRAND A BRAND B
10 35
60 45
50 30
30 35
40 40
20 25
SOLUTION
BRAND A BRAND B
10 35
60 45
50 30
30 35
40 40
20 25
The mean of both brands of paint is 35 months.
The median of both brands of paint is 35 months.
Since the mean and median are the same, one cannot conclude
which brand of paint lasts longer.
1. Range
2. Variance
3. Standard deviation
1. RANGE
The range is the highest value minus the lowest value in the data
set. It is denoted by the symbol R.
BRAND A BRAND B
10 35
60 45
50 30
30 35
40 40
20 25
Example 9
Find the range of the two brands of paint.
Solution
Brand A: $R = 60 - 10 = 50$
Brand B: $R = 45 - 25 = 20$
It can be concluded that the brand B paint is less
variable, or more consistent, and hence a better choice.
However, the range is not a good measure of variability
because it is easily affected by outliers.
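The range calculation in Example 9 takes one line per brand in Python. A minimal sketch, with illustrative variable names:

```python
# Months before fading for each brand of paint (Example 8 data).
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

# Range R = highest value - lowest value.
range_a = max(brand_a) - min(brand_a)
range_b = max(brand_b) - min(brand_b)
print(range_a, range_b)  # 50 20
```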
2. VARIANCE
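The variance measures spread using the squared deviations of the
data values from the mean. The standard deviation is the square
root of the variance.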
FORMULAS FOR CALCULATING
VARIANCE
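Sample
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$    $s^2 = \frac{\sum f (X_m - \bar{X})^2}{n - 1}$
Population
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$    $\sigma^2 = \frac{\sum f (X_m - \mu)^2}{N}$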
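EXAMPLE 10
Find the variance and standard deviation of each brand of
paint. (Assume population)
BRAND A BRAND B
10 35
60 45
50 30
30 35
40 40
20 25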
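SOLUTION
Brand A (X) $(X - \mu)^2$
10 625
60 625
50 225
30 25
40 25
20 225
$\sum X = 210$    $\sum (X - \mu)^2 = 1750$
$\mu = \frac{\sum X}{N} = \frac{210}{6} = 35$
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} = 291.67$
Therefore the standard deviation is $\sigma = \sqrt{291.67} = 17.08$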
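Brand B (X) $(X - \mu)^2$
35 0
45 100
30 25
35 0
40 25
25 100
$\sum X = 210$    $\sum (X - \mu)^2 = 250$
$\mu = \frac{\sum X}{N} = \frac{210}{6} = 35$
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{250}{6} = 41.67$
$\sigma = \sqrt{41.67} = 6.45$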
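Both solutions can be reproduced with a short sketch of the population formula, variance = (sum of squared deviations from the mean) / N. The helper name `population_variance` is illustrative; the standard library's `statistics.pvariance` and `statistics.pstdev` compute the same quantities.

```python
import math

def population_variance(data):
    """Population variance: mean of the squared deviations from the mean."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

var_a = population_variance(brand_a)
var_b = population_variance(brand_b)
print(round(var_a, 2), round(math.sqrt(var_a), 2))  # 291.67 17.08
print(round(var_b, 2), round(math.sqrt(var_b), 2))  # 41.67 6.45
```

The smaller standard deviation for Brand B agrees with the conclusion drawn from the range: Brand B is the more consistent paint.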
HOMEWORK
The following is the distribution of the number of fish caught by 50
fishermen in a village. Find the variance and standard deviation
using the short-cut formula.
Number of fish Frequency
11-15 12
16-20 14
21-25 13
26-30 11
n=50
Measures of Position
MEASURES OF POSITION
• The measures of position are the numerical
measures to determine the relative position of a
data value in a data set.
• The most commonly used measures of position
are:
• Percentiles
• Deciles
• Quartiles
1. PERCENTILES
Percentiles divide a data set into 100 equal parts.
Each set of observations has 99 percentiles, denoted by
$P_1, P_2, \ldots, P_{99}$.
[Figure: positions of the quartiles Q1, Q2 and Q3]
SOLUTION
Arrange the data from lowest to highest
56, 62, 65, 70, 73, 77, 79, 82, 85, 87, 92, 99
a) 80th percentile, $P_{80}$
$c = \frac{np}{100} = \frac{12(80)}{100} = 9.6$
Since c is not a whole number, round it up to the next whole
number, 10, so the 10th score is the 80th percentile: $P_{80} = 87$.
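The position rule used above can be sketched as a small function. The case shown in the solution (c not a whole number, so round up) comes from the lecture; the handling of a whole-number c, averaging the c-th and (c+1)-th values, is an assumption based on the usual textbook convention. The function name is illustrative, and p is assumed to be an integer from 1 to 99.

```python
def percentile_value(data, p):
    """Value at the p-th percentile using the position c = n * p / 100."""
    ordered = sorted(data)
    # Integer arithmetic avoids floating-point surprises when testing
    # whether c is a whole number.
    whole, remainder = divmod(len(ordered) * p, 100)
    if remainder:
        # c is not whole: round up, i.e. take the (whole + 1)-th value.
        return ordered[whole]
    # Assumed convention: c is whole, so average the c-th and (c+1)-th values.
    return (ordered[whole - 1] + ordered[whole]) / 2

scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]
p80 = percentile_value(scores, 80)
print(p80)  # 87
```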
EXAMPLE
The following are the test scores of 12 students
in a statistics class
70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82
Find the percentile rank for the score 92.
Solution:
Arrange the data from lowest to highest
56, 62, 65, 70, 73, 77, 79, 82, 85, 87, 92, 99
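Percentile rank = $\frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100$
There are 10 scores below 92, so
Percentile rank of 92 $= \frac{10 + 0.5}{12} \times 100 = 87.5$.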
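The percentile rank calculation can be sketched as below: count the values below the score, add 0.5, divide by the number of values and multiply by 100. The function name is illustrative.

```python
def percentile_rank(data, x):
    """Percentile rank of x: (values below x + 0.5) / n * 100."""
    below = sum(1 for value in data if value < x)
    return (below + 0.5) / len(data) * 100

scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]
rank_92 = percentile_rank(scores, 92)
print(rank_92)  # 87.5
```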
PERCENTILE GRAPHS
SOLUTION
Class boundaries Frequency Cumulative frequency Cumulative percent
7.5-12.5 3 3 3/30 x 100 = 10
12.5-17.5 5 8 8/30 x 100 = 26.67
17.5-22.5 15 23 23/30 x 100 = 76.67
22.5-27.5 5 28 28/30 x 100 = 93.33
27.5-32.5 2 30 30/30 x 100 = 100
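The cumulative frequency and cumulative percent columns can be recomputed from the class frequencies. A minimal sketch using `itertools.accumulate` for the running total; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies from the table, in class order.
freqs = [3, 5, 15, 5, 2]

cum_freq = list(accumulate(freqs))  # running total of the frequencies
n = cum_freq[-1]                    # total number of values
cum_pct = [round(cf * 100 / n, 2) for cf in cum_freq]

print(cum_freq)  # [3, 8, 23, 28, 30]
print(cum_pct)   # [10.0, 26.67, 76.67, 93.33, 100.0]
```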
From the percentile graph we estimate that:
a) 50th percentile=20.
b) The percentile rank of 30 is 97.
OUTLIERS
An outlier is an extremely high or an extremely low data
value when compared with the rest of the data values.
A data value less than Q1 – 1.5(IQR) or greater than Q3 +
1.5(IQR) can be considered an outlier, where IQR = Q3 – Q1.
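The outlier check can be sketched as two small functions: one computing the lower fence Q1 - 1.5(IQR) and the upper fence Q3 + 1.5(IQR), and one testing a value against them. The function names and the quartile values Q1 = 8 and Q3 = 16 are illustrative assumptions.

```python
def outlier_fences(q1, q3):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(x, q1, q3):
    """True if x lies below the lower fence or above the upper fence."""
    lower, upper = outlier_fences(q1, q3)
    return x < lower or x > upper

# Illustrative quartiles: Q1 = 8, Q3 = 16, so IQR = 8.
lower, upper = outlier_fences(8, 16)
print(lower, upper)           # -4.0 28.0
print(is_outlier(30, 8, 16))  # True
print(is_outlier(20, 8, 16))  # False
```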
IN EDA, DATA CAN BE ORGANIZED USING A
STEM AND LEAF PLOT.
STEPS FOR CONSTRUCTING A BOXPLOT
Step 1 Arrange data in order.
Step 2 Find the 5-number summary:
1. Minimum value
2. First quartile (Q1)
3. Median (Q2)
4. Third quartile (Q3)
5. Maximum value
Step 3 Draw a horizontal axis with a scale that includes the
maximum and minimum data values.
Step 4 Draw a box with vertical sides through Q1 and Q3
and draw a vertical line through the median.
Step 5 Draw a line from the minimum data value to the
left side of the box and a line from the maximum data
value to the right side of the box.
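Steps 1 and 2 can be sketched in Python as below. The quartile method is an assumption: Q1 and Q3 are taken as the medians of the lower and upper halves of the ordered data (the middle value is left out of both halves when the number of values is odd). Other quartile conventions exist and can give slightly different values. The function name is illustrative, and the data are the 12 test scores used earlier in the lecture.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumed convention: Q1 and Q3 are the medians of the lower and
    upper halves of the ordered data.
    """
    ordered = sorted(data)                      # Step 1
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0],                         # Step 2
            statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper),
            ordered[-1])

scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]
summary = five_number_summary(scores)
print(summary)  # (56, 67.5, 78.0, 86.0, 99)
```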
EXAMPLE
Step 3 Draw a scale for the data on the x axis and make
the boxplot.
[Figure: boxplot on a scale from 0 to 22, with minimum 3, Q1 = 8,
median 11.5, Q3 = 16 and maximum 20]
INFORMATION OBTAINED FROM A
BOXPLOT
• If the median is near the centre of the box or the lines are
about the same length, the distribution is approximately
symmetric.
• If the median is to the left of the centre of the box or the
right line is longer than the left line, the distribution is
positively skewed.
• If the median falls to the right of the centre of the box or the
left line is longer than the right line, the distribution is
negatively skewed.
THE END