
St130: Basic Statistics Week 3: Lecture: School of Computing Information and Mathematical Sciences

This document provides an outline and introduction for a lecture on basic statistics. It discusses measures of central tendency including the mean, median, and mode. It describes how to calculate and understand each of these measures. The document also introduces measures of variation such as range, variance, and standard deviation to describe the spread of data. Examples are provided to demonstrate calculating and interpreting measures of central tendency and variation.

Uploaded by

Shiv Neel

SCHOOL OF COMPUTING, INFORMATION AND MATHEMATICAL SCIENCES

ST130: BASIC STATISTICS

WEEK 3: LECTURE
OUTLINE
• Summarize data, using measures of central tendency, such as
mean, median, mode.
• Describe data, using measures of variation, such as range,
variance and standard deviation. 
• Identify the position of a data value in a data set, using
various measures of position, such as percentiles, deciles and
quartiles.
• Use the techniques of exploratory data analysis, including
boxplots and the five-number summary, to discover various
aspects of data.

Data Description

INTRODUCTION
• After collection, organization and presentation
of data we now examine the statistical methods
that can be used to describe data.
• The methods include
• measures of central tendency;
• measures of variation;
• measures of position.
MEASURES OF CENTRAL TENDENCY

The measures of central tendency are the numerical measures
that locate the centre of the data set.

Types of measures of central tendency
1. Mean
2. Median
3. Mode

1. THE MEAN
• The mean (arithmetic mean) is found by
adding all the data values and dividing by the
total number of data values.
Example 1
The mean of 3, 2, 6, 5 and 4 is found by adding
3+2+6+5+4=20 and dividing by 5; hence the
mean of the data is 20/5=4.

• The symbol $\bar{X}$ represents the sample mean; the
symbol $\mu$ represents the population mean.

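FORMULAS FOR CALCULATING MEAN

Sample
• Raw data: $\bar{X} = \frac{\sum X}{n}$
• Ungrouped frequency distribution: $\bar{X} = \frac{\sum fX}{n}$
• Grouped frequency distribution: $\bar{X} = \frac{\sum fX_m}{n}$

Population
• Raw data: $\mu = \frac{\sum X}{N}$
• Ungrouped frequency distribution: $\mu = \frac{\sum fX}{N}$
• Grouped frequency distribution: $\mu = \frac{\sum fX_m}{N}$
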
Rounding rule for the mean: the mean should be rounded
to one more decimal place than in the raw data.
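
As a quick check of the raw-data formula and the rounding rule, the following Python sketch recomputes Example 1. The variable names are illustrative and are not part of the lecture.

```python
from statistics import mean

# Data from Example 1.
values = [3, 2, 6, 5, 4]

# Raw-data formula: add all the values and divide by how many there are.
sample_mean = sum(values) / len(values)

# The standard library's statistics.mean computes the same quantity.
assert sample_mean == mean(values)

# Rounding rule: the raw data are whole numbers, so report one decimal place.
print(f"mean = {sample_mean:.1f}")  # mean = 4.0
```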
EXAMPLE 2

Rating (X)   Frequency (f)   fX
1            2               2
2            1               2
3            2               6
4            2               8
5            2               10
6            5               30
7            3               21
8            2               16
9            2               18
10           3               30
Total        n = 24          $\sum fX = 143$

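
The slide for Example 2 stops at the column totals. A minimal Python sketch of the ungrouped-frequency formula, using the ratings and frequencies from the table, finishes the calculation; the variable names are assumptions made for illustration.

```python
# Ratings (X) and frequencies (f) from the Example 2 table.
ratings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
frequencies = [2, 1, 2, 2, 2, 5, 3, 2, 2, 3]

# n is the sum of the frequencies; sum_fx is the sum of f * X.
n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, ratings))

# Mean of an ungrouped frequency distribution: sum(fX) / n.
mean_rating = sum_fx / n

print(n, sum_fx)                    # 24 143
print(f"mean = {mean_rating:.1f}")  # mean = 6.0
```

Dividing 143 by 24 gives about 5.96, which the rounding rule reports as 6.0.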
EXAMPLE 3
Find the mean of the frequency distribution given below. The
data is for the distribution of the weights of 50 randomly
selected ST130 students
Class Limits   Frequency
30-39          5
40-49          10
50-59          18
60-69          12
70-79          5
Total          50

SOLUTION
Class Limits   $f$    $X_m$   $f \cdot X_m$
30-39          5      34.5    172.5
40-49          10     44.5    445
50-59          18     54.5    981
60-69          12     64.5    774
70-79          5      74.5    372.5
Total          $n = 50$       $\sum f X_m = 2745$

Use the formula to get the mean.

$\bar{X} = \frac{\sum f X_m}{n} = \frac{2745}{50} = 54.9$ kg

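
The grouped-data calculation can be reproduced with a short Python sketch. It assumes, as the solution table does, that each class is represented by its midpoint (the average of its two class limits); the tuple layout is an illustrative choice.

```python
# (lower limit, upper limit, frequency) for each class in Example 3.
classes = [(30, 39, 5), (40, 49, 10), (50, 59, 18), (60, 69, 12), (70, 79, 5)]

# n is the total frequency.
n = sum(f for _, _, f in classes)

# Each class contributes f * X_m, where X_m is the class midpoint.
sum_f_xm = sum(f * (low + high) / 2 for low, high, f in classes)

grouped_mean = sum_f_xm / n
print(n, sum_f_xm, grouped_mean)  # 50 2745.0 54.9
```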
2. THE MEDIAN

The median is the midpoint of the data array. The
symbol for the median is MD.

Steps to find the median of a data array:
Step 1 Arrange the data in order.
Step 2 Select the middle value.

EXAMPLE 4
The data represent the number of lectures missed per year for
a sample of students selected from the university.
10 13 26 35 15 28 19 24 36 40 46
Find the median.
Solution
Step 1 10 13 15 19 24 26 28 35 36 40 46
Step 2 the middle value is the sixth value, which is 26. Thus
the median is 26 lectures.

EXAMPLE 5
The data represent the number of lectures missed per
year for a sample of students selected from the
university.
10 13 26 35 15 28 19 24 36 40
Find the median.
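Step 1 10 13 15 19 24 26 28 35 36 40
Step 2 There is an even number of values, so the median is
the average of the two middle values, 24 and 26.

$MD = \frac{24 + 26}{2} = 25$
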
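
Examples 4 and 5 differ only in whether the number of values is odd or even. A small Python sketch of the two steps, with a hypothetical helper named `median`, covers both cases.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)   # Step 1: arrange the data in order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:             # odd n: a single middle value exists
        return ordered[middle]
    # even n: average the two values on either side of the midpoint
    return (ordered[middle - 1] + ordered[middle]) / 2

example_4 = [10, 13, 26, 35, 15, 28, 19, 24, 36, 40, 46]
example_5 = [10, 13, 26, 35, 15, 28, 19, 24, 36, 40]

print(median(example_4))  # 26
print(median(example_5))  # 25.0
```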
MEDIAN FOR GROUPED DATA
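To find the median for grouped data we can use the
ogive.

Example
Find the median of the given data.

Class Limits   Freq
30-39          5
40-49          8
50-59          16
60-69          6
Total          35

[Figure: ogive of the data, read at a cumulative frequency of 17.5 (half of n = 35), which corresponds to a value of 51.]

So the median is 51.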
3. THE MODE
The value that occurs most often in a data set is called
the mode. A data set can have more than one mode or
no mode at all.

Steps to find the mode of a data array:
Step 1 Arrange the data in order. However, this is not an
absolute necessity.
Step 2 Select the most common value(s).

MODE FOR GROUPED DATA
The mode for grouped data is known as the modal class.
The modal class is the class with the largest frequency.

Example 6
This data is for the distribution of the weights of 50
randomly selected ST130 students.

Class Limits   Freq
30-39          5
40-49          10
50-59          18
60-69          12
70-79          5
Total          50

So the modal class is 50-59.

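
The idea that a data set may have one mode, several modes, or none can be sketched in Python as below. The `modes` helper and the three raw data lists are hypothetical; treating "every value occurs once" as "no mode" is the convention assumed here. The last lines pick the modal class from the Example 6 table.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] when no value repeats.

    Assumes values is non-empty.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: all values distinct means no mode
    return sorted(v for v, c in counts.items() if c == highest)

# Hypothetical raw data, chosen only to show the three cases.
print(modes([8, 9, 9, 14, 8, 8, 10]))  # [8]
print(modes([1, 2, 2, 3, 3]))          # [2, 3]
print(modes([4, 5, 6]))                # []

# Modal class for the Example 6 weight distribution: the largest frequency.
weights = {"30-39": 5, "40-49": 10, "50-59": 18, "60-69": 12, "70-79": 5}
modal_class = max(weights, key=weights.get)
print(modal_class)  # 50-59
```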
PROPERTIES OF THE MEAN
• Uses all data values.
• Varies less than the median or mode when samples are
taken from the same population.
• Used in computing other statistics, such as the
variance.
• Unique, and usually not one of the data values.
• Cannot be used with open-ended classes.
• Affected by extremely high or low values, called
outliers.

PROPERTIES OF THE MEDIAN

• Gives the midpoint.
• Used when it is necessary to find out whether
the data values fall into the upper half or lower
half of the distribution.
• Can be used for an open-ended distribution.
• Affected less than the mean by extremely high
or extremely low values.

PROPERTIES OF THE MODE
• Used when the most typical case is desired.
• Easiest average to compute.
• Can be used with nominal data.
• Not always unique, and may not exist.

Note:
In many cases, the different measures of central
tendency may have significantly different
values. One has to be very cautious in using
these measures.

EXAMPLE 7
The annual salaries of the staff of a company are listed below. Which
measure is more reliable: the mean, the median, or the mode?

Staff         Salary
Owner         50,000 (outlier)
Manager       20,000
Salesperson   12,000
Technician    9,000
Technician    9,000

Solution
The mean is 20,000, the median is 12,000, and the mode is 9,000.
The median is more reliable than the mean because the mean is
pulled upwards by the outlier (the owner's salary).

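
The three values quoted in Example 7 can be checked with Python's standard `statistics` module. The second half of the sketch is an illustrative extra, not part of the slide: it drops the owner's salary to show how much more the mean moves than the median.

```python
from statistics import mean, median, mode

# Salaries from Example 7 (owner, manager, salesperson, two technicians).
salaries = [50_000, 20_000, 12_000, 9_000, 9_000]

salary_mean = mean(salaries)
salary_median = median(salaries)
salary_mode = mode(salaries)
print(salary_mean, salary_median, salary_mode)

# Illustration only: remove the outlier and compare the shifts.
staff_only = salaries[1:]
mean_shift = salary_mean - mean(staff_only)
median_shift = salary_median - median(staff_only)
print(mean_shift, median_shift)
```

Without the owner, the mean falls by 7,500 while the median falls by only 1,500.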
DISTRIBUTIONS

Measures of Variation

MEASURES OF VARIATION
The measures of variation (or ‘dispersion’) are
the numerical measures that describe how far the data
values spread out around the centre.

Often the measures of central tendency alone
cannot describe the data. Consider the
following example.

Example 8
I wish to test two brands of outdoor paint to see how long
each will last before fading. The results (in months) are
shown. Find the mean and median of each group.
(Assume population)

BRAND A   BRAND B
10        35
60        45
50        30
30        35
40        40
20        25

SOLUTION
BRAND A   BRAND B
10        35
60        45
50        30
30        35
40        40
20        25

• The mean of both brands of paint is 35 months.
• The median of both brands of paint is 35 months.

Since the mean and median are the same, one cannot conclude
which brand of paint lasts longer.

Thus the mean and median are of no use in choosing
between the two brands of paint.
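
A short Python check, using the standard `statistics` module, confirms that both brands share the same mean and median even though the individual values look quite different. The variable names are illustrative.

```python
from statistics import mean, median

# Months before fading, from Example 8.
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

for name, data in (("Brand A", brand_a), ("Brand B", brand_b)):
    # Both measures of central tendency come out at 35 months for each brand.
    print(name, "mean:", mean(data), "median:", median(data))
```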
TYPES OF MEASURES OF VARIATION

1. Range
2. Variance
3. Standard deviation

1. RANGE
The range is the highest value minus the lowest value in the data
set. It is denoted by the symbol R.

Example 9
Find the range of the two brands of paint.

BRAND A   BRAND B
10        35
60        45
50        30
30        35
40        40
20        25

Solution
Brand A: $R = 60 - 10 = 50$
Brand B: $R = 45 - 25 = 20$

It can be concluded that the brand B paint is less
variable, or more consistent, and hence a better choice.
But the range is not a good measure of variability because
it depends only on the two extreme values, so outliers affect it easily.
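
The range calculation, and its sensitivity to a single extreme value, can be sketched in Python as follows. The helper name `data_range` and the extra reading of 100 months are hypothetical and used only for illustration.

```python
def data_range(values):
    """Range R: highest value minus lowest value."""
    return max(values) - min(values)

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

print(data_range(brand_a), data_range(brand_b))  # 50 20

# Hypothetical: one extreme reading added to brand B changes R from 20 to 75.
print(data_range(brand_b + [100]))  # 75
```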

2. VARIANCE
The variance measures spread as the average of the squared
distances of the data values from the mean.

The symbol $S^2$ represents the sample variance; the
symbol $\sigma^2$ represents the population variance.

3. STANDARD DEVIATION
The standard deviation is the square root of the
variance.

The symbol $S$ represents the sample standard
deviation; the symbol $\sigma$ represents the population
standard deviation.

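FORMULAS FOR CALCULATING VARIANCE

Sample
• Raw data: $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
• Grouped (class intervals): $S^2 = \frac{\sum f (X_m - \bar{X})^2}{n - 1}$

Population
• Raw data: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
• Grouped (class intervals): $\sigma^2 = \frac{\sum f (X_m - \mu)^2}{N}$
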
EXAMPLE 10

Find the variance and standard deviation
of the two brands of paint.

BRAND A   BRAND B
10        35
60        45
50        30
30        35
40        40
20        25

SOLUTION

$\mu = \frac{\sum X}{N} = \frac{210}{6} = 35$

Brand A (X)      $(X - \mu)^2$
10               625
60               625
50               225
30               25
40               25
20               225
$\sum X = 210$   $\sum (X - \mu)^2 = 1750$

Use the formula to get the variance.

$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} = 291.67$

Therefore the standard deviation is $\sigma = \sqrt{291.67} = 17.08$.

$\mu = \frac{\sum X}{N} = \frac{210}{6} = 35$

Brand B (X)      $(X - \mu)^2$
35               0
45               100
30               25
35               0
40               25
25               100
$\sum X = 210$   $\sum (X - \mu)^2 = 250$

$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{250}{6} = 41.67$

$\sigma = \sqrt{41.67} = 6.45$

Brand B is less variable, more reliable and a
better choice!
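
The two solutions follow the population formula step by step. A Python sketch of the same steps is given below; the function name `population_variance` is an assumption, and the result is cross-checked against `pvariance` and `pstdev` from the standard `statistics` module.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance

def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

brands = {
    "Brand A": [10, 60, 50, 30, 40, 20],
    "Brand B": [35, 45, 30, 35, 40, 25],
}

for name, data in brands.items():
    variance = population_variance(data)
    sd = sqrt(variance)
    # Cross-check the hand-written formula against the standard library.
    assert isclose(variance, pvariance(data)) and isclose(sd, pstdev(data))
    print(f"{name}: variance = {variance:.2f}, standard deviation = {sd:.2f}")
# Brand A: variance = 291.67, standard deviation = 17.08
# Brand B: variance = 41.67, standard deviation = 6.45
```

For sample data the divisor would be n − 1 instead of N, as in the sample formulas (`statistics.variance` and `statistics.stdev` use that divisor).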
CONCLUSION
• Variance is used to determine the spread or
variability of the data set.
• Variance is useful in comparing two (or more)
data sets to determine which is more consistent,
that is, less variable (the one with the smaller variance).

HOMEWORK
The following is the distribution of the number of fish caught by 50
fishermen in a village. Find the variance and standard deviation
using the short-cut formula.

Score Frequency

11-15 12
16-20 14
21-25 13
26-30 11

n=50

Measures of Position

MEASURES OF POSITION
• The measures of position are the numerical
measures to determine the relative position of a
data value in a data set.
• The most commonly used measures of position
are:
• Percentiles
• Deciles
• Quartiles

1. PERCENTILES
Percentiles divide the data set into 100 equal parts.
Each set of observations has 99 percentiles, denoted by
$P_1, P_2, \ldots, P_{99}$.

• $P_{20}$ is called the 20th percentile, which indicates that
20% of the scores are below $P_{20}$.
• $P_{50}$ is called the 50th percentile, which indicates that
50% of the scores are below $P_{50}$. Note: $P_{50}$ = Median.
2. DECILES
Deciles divide the data set into 10 equal parts.
Each set of observations has 9 deciles, denoted by
$D_1, D_2, \ldots, D_9$.

• $D_1$ is called the 1st decile, which indicates
that 10% of the scores are below $D_1$.
• $D_4$ is called the 4th decile, which indicates that
40% of the scores are below $D_4$.
3. QUARTILES
Quartiles divide the ordered data set into 4 equal parts.
Each set of observations has 3 quartiles, denoted by
$Q_1, Q_2, Q_3$.

• $Q_1$ is called the 1st quartile (or lower quartile), which
indicates that 25% of the scores are below $Q_1$.
• $Q_3$ is called the 3rd quartile (or upper quartile), which
indicates that 75% of the scores are below $Q_3$.
CALCULATING PERCENTILES
Steps to calculate percentiles for raw data:
Step 1: Arrange the data from lowest to highest.
Step 2: Substitute into the formula

$c = \frac{n \cdot p}{100}$

where n = number of data values and p = percentile.
Step 3: If c is not a whole number, round it up to the
next whole number and take the value in that position. If c is
a whole number, use the average of the c-th and (c+1)-st
values.
To calculate quartiles and deciles, convert them to
percentiles and use the same steps.

EXAMPLE
The following are the test scores of 12 students
in a statistics class
70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82

Find the following:
a) 80th percentile.
b) upper quartile.

SOLUTION
Arrange the data from lowest to highest
56, 62, 65, 70, 73, 77, 79, 82, 85, 87, 92, 99

a) 80th percentile = $P_{80}$

$c = \frac{n \cdot p}{100} = \frac{12(80)}{100} = 9.6$

Since c is not a whole number, round up to 10, so the 10th score
is the 80th percentile: $P_{80} = 87$.

b) Upper quartile = $Q_3 = P_{75}$

$c = \frac{n \cdot p}{100} = \frac{12(75)}{100} = 9$

Since c is a whole number, the average of the 9th and 10th scores
is the upper quartile: $Q_3 = \frac{85 + 87}{2} = 86$.

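
The three-step percentile procedure can be sketched in Python as below. The function name `percentile` is an assumption, and the sketch handles whole-number p from 1 to 99 only; it uses integer arithmetic for c so that the "is c a whole number" test is exact. Statistical software often uses other percentile conventions, so results from a library may differ slightly from this rule.

```python
def percentile(values, p):
    """Value of the p-th percentile by the lecture's rule (p = 1, ..., 99)."""
    ordered = sorted(values)                 # Step 1: lowest to highest
    n = len(ordered)
    whole, remainder = divmod(n * p, 100)    # Step 2: c = n*p/100
    if remainder:                            # Step 3: c is not a whole number
        position = whole + 1                 # round c up
        return ordered[position - 1]         # positions are counted from 1
    # Step 3: c is a whole number, so average the c-th and (c+1)-st values.
    return (ordered[whole - 1] + ordered[whole]) / 2

scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]

p80 = percentile(scores, 80)
q3 = percentile(scores, 75)   # upper quartile = 75th percentile
print(p80, q3)  # 87 86.0
```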
CALCULATING PERCENTILE RANK

• The percentile rank of a score indicates what
percent of the data lies below the score.
• The percentile rank of a score X for raw data
is computed by:

$\text{Percentile rank} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100\%$

EXAMPLE
The following are the test scores of 12 students
in a statistics class
70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82
Find the percentile rank for the score 92.

Solution:
Arrange the data from lowest to highest
56, 62, 65, 70, 73, 77, 79, 82, 85, 87, 92, 99
Ten values lie below 92, so
Percentile rank of 92: $\frac{10 + 0.5}{12} \times 100 = 87.5$.

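
The percentile-rank formula translates directly into a few lines of Python. The function name `percentile_rank` is illustrative.

```python
def percentile_rank(values, x):
    """((number of values below x) + 0.5) / (total number of values) * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]

print(percentile_rank(scores, 92))  # 87.5
```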
PERCENTILE GRAPHS

• A percentile graph is an ogive which uses
cumulative percentage instead of
cumulative frequency or cumulative
relative frequency on the y-axis.

• Percentile graphs can be used to
estimate percentiles, deciles, quartiles
and percentile ranks for grouped data.

EXAMPLE

Given the grouped frequency distribution:

Class limits   Frequency
8-12           3
13-17          5
18-22          15
23-27          5
28-32          2

Construct a percentile graph and use it to estimate
a) the 50th percentile.
b) the percentile rank of 30.

SOLUTION
Class boundaries   Frequency   Cumulative frequency   Cumulative percent
7.5-12.5           3           3                      3/30 × 100 = 10
12.5-17.5          5           8                      8/30 × 100 = 26.67
17.5-22.5          15          23                     23/30 × 100 = 76.67
22.5-27.5          5           28                     28/30 × 100 = 93.33
27.5-32.5          2           30                     30/30 × 100 = 100

From the percentile graph we estimate that:
a) 50th percentile=20.
b) The percentile rank of 30 is 97.
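
Reading values off a percentile graph drawn with straight line segments amounts to linear interpolation between the plotted points. The Python sketch below makes that assumption explicit; the helper `interpolate` and the variable names are hypothetical, and the plotted points are (upper class boundary, cumulative percent), starting from 0% at the first lower boundary.

```python
# Class boundaries and frequencies from the example.
boundaries = [7.5, 12.5, 17.5, 22.5, 27.5, 32.5]
frequencies = [3, 5, 15, 5, 2]

# Cumulative percent at each boundary, starting at 0% for the lowest one.
n = sum(frequencies)
cumulative_percent = [0.0]
running_total = 0
for f in frequencies:
    running_total += f
    cumulative_percent.append(running_total / n * 100)

def interpolate(x, xs, ys):
    """Linear interpolation of y at x along the points (xs[i], ys[i])."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            fraction = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + fraction * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the plotted range")

# a) 50th percentile: go from 50% on the y-axis to a data value.
p50 = interpolate(50, cumulative_percent, boundaries)
# b) percentile rank of 30: go from the data value 30 to a cumulative percent.
rank_of_30 = interpolate(30, boundaries, cumulative_percent)

print(round(p50, 1), round(rank_of_30, 1))  # 19.8 96.7
```

Rounded to whole numbers these agree with the graph estimates of 20 and 97.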

OUTLIERS
An outlier is an extremely high or an extremely low data
value when compared with the rest of the data values.
A data value less than $Q_1 - 1.5(IQR)$ or greater than
$Q_3 + 1.5(IQR)$ can be considered an outlier.

How does an outlier occur?
• Measurement or observational error.
• Recording error, i.e. typed incorrectly.
• Occurring naturally by chance.
PROCEDURE FOR IDENTIFYING OUTLIERS
STEP 1 Arrange the data in order and find $Q_1$ and $Q_3$.
STEP 2 Find the interquartile range ($IQR = Q_3 - Q_1$).
STEP 3 Multiply the IQR by 1.5.
STEP 4 Subtract the value obtained in step 3 from $Q_1$
and add the value to $Q_3$.
STEP 5 Check the data set for any data value which is
smaller than $Q_1 - 1.5 \times IQR$ or greater than
$Q_3 + 1.5 \times IQR$.

EXAMPLE
CHECK THE FOLLOWING DATA SET FOR
OUTLIERS.
16, 18, 12, 11, 8, 13, 4, 3, 9, 20.
STEP 1: 3, 4, 8, 9, 11, 12, 13, 16, 18, 20
$Q_1 = 8$; $Q_3 = 16$
STEP 2: $IQR = Q_3 - Q_1 = 16 - 8 = 8$
STEP 3: $8 \times 1.5 = 12$
STEP 4: Subtract the value obtained in step 3 from $Q_1$, and add it to $Q_3$.
$8 - 12 = -4$ and $16 + 12 = 28$.
STEP 5: Check the data set for all values falling outside the
interval $-4$ to $28$. There are none in our data, so there are no outliers.

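
The five steps can be sketched in Python as below. The quartiles are found with the lecture's percentile rule (repeated here as a hypothetical `percentile` helper so the block is self-contained), and the extra value of 50 in the second call is invented purely to show a case where an outlier is flagged.

```python
def percentile(values, p):
    """Lecture's percentile rule for whole-number p from 1 to 99."""
    ordered = sorted(values)
    whole, remainder = divmod(len(ordered) * p, 100)
    if remainder:                       # c not whole: round up, take that value
        return ordered[whole]           # index `whole` is position whole + 1
    return (ordered[whole - 1] + ordered[whole]) / 2

def find_outliers(values):
    """Return (lower fence, upper fence, list of outliers)."""
    q1 = percentile(values, 25)         # STEP 1
    q3 = percentile(values, 75)
    step = 1.5 * (q3 - q1)              # STEPS 2 and 3: 1.5 * IQR
    lower, upper = q1 - step, q3 + step # STEP 4
    outliers = [x for x in values if x < lower or x > upper]  # STEP 5
    return lower, upper, outliers

data = [16, 18, 12, 11, 8, 13, 4, 3, 9, 20]
print(find_outliers(data))           # (-4.0, 28.0, [])

# Hypothetical extra value, for illustration only.
print(find_outliers(data + [50]))    # (-7.0, 33.0, [50])
```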
EDA AND TRADITIONAL STATISTICS
The purpose of traditional statistics is to confirm
various conjectures about the nature of the data. It
starts from a hypothesis, performs an experiment,
and then tests the hypothesis.
The purpose of exploratory data analysis (EDA) is
to examine data to find out what information can be
discovered about the data. It starts instead from the
data and asks what patterns, relationships, or trends
they might hold.

• IN EDA, DATA CAN BE ORGANIZED USING A
STEM AND LEAF PLOT.
• THE MEASURE OF CENTRAL TENDENCY USED
IN EDA IS THE MEDIAN.
• THE MEASURE OF VARIATION USED IN EDA IS
THE INTERQUARTILE RANGE.
• IN EDA THE DATA ARE REPRESENTED
GRAPHICALLY USING A BOXPLOT.
BOXPLOTS
A boxplot is a graph of a data set that uses a box and
whiskers to represent the five-number summary of the data.
The five-number summary consists of the minimum
value, lower quartile, median, upper quartile and
maximum value.
Example: [Figure: boxplot on a scale from 20 to 50, with minimum 23, lower quartile 30, median 33, upper quartile 40 and maximum 45.]
STEPS FOR CONSTRUCTING A BOXPLOT
Step 1 Arrange the data in order.
Step 2 Find the five-number summary:
1. Minimum value
2. First quartile ($Q_1$)
3. Median ($Q_2$)
4. Third quartile ($Q_3$)
5. Maximum value
Step 3 Draw a horizontal axis with a scale that includes the
maximum and minimum data values.
Step 4 Draw a box with vertical sides through $Q_1$ and $Q_3$,
and draw a vertical line through the median.
Step 5 Draw a line from the minimum data value to the
left side of the box and a line from the maximum data
value to the right side of the box.
EXAMPLE

Construct the boxplot for the following data set:
16, 18, 12, 11, 8, 13, 4, 3, 9, 20.

Step 1 Order: 3, 4, 8, 9, 11, 12, 13, 16, 18, 20

Step 2 Minimum value: 3; first quartile $Q_1 = 8$;
median $MD = Q_2 = 11.5$; third quartile $Q_3 = 16$; maximum value: 20

Step 3 Draw a scale for the data on the x axis and make
the boxplot.

[Figure: boxplot on a scale from 0 to 22, with whiskers ending at 3 and 20, a box from 8 to 16, and the median line at 11.5.]

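
The five-number summary in Step 2 can be reproduced with a short Python sketch. It reuses the lecture's percentile rule through a hypothetical `percentile` helper (repeated so the block is self-contained); plotting libraries may compute quartiles by a different convention, so their boxes can differ slightly.

```python
def percentile(values, p):
    """Lecture's percentile rule for whole-number p from 1 to 99."""
    ordered = sorted(values)
    whole, remainder = divmod(len(ordered) * p, 100)
    if remainder:                       # c not whole: round up, take that value
        return ordered[whole]           # index `whole` is position whole + 1
    return (ordered[whole - 1] + ordered[whole]) / 2

def five_number_summary(values):
    """(minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    return (
        ordered[0],
        percentile(ordered, 25),
        percentile(ordered, 50),
        percentile(ordered, 75),
        ordered[-1],
    )

data = [16, 18, 12, 11, 8, 13, 4, 3, 9, 20]
print(five_number_summary(data))  # (3, 8, 11.5, 16, 20)
```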
INFORMATION OBTAINED FROM A
BOXPLOT
• If the median is near the centre of the box or the lines are
about the same length, the distribution is approximately
symmetric.
• If the median is to the left of the centre of the box or the
right line is longer than the left line, the distribution is
positively skewed.
• If the median falls to the right of the centre of the box or the
left line is longer than the right line, the distribution is
negatively skewed.
THE END
