Lecture 3

The document discusses measures of central tendency, which summarize data into a single value, including mode, median, and mean. It explains how to calculate these measures from raw data, frequency tables, or histograms, and outlines their strengths and weaknesses. Additionally, it covers measures of dispersion, such as range, variance, and standard deviation, emphasizing their importance in understanding data variability.

Uploaded by

Peter Parker

MEASURES

OF
CENTRAL
TENDENCY
DESCRIPTIVE
MEASURES
Measures that summarise the data into a
single number are called descriptive
measures.

Descriptive measures can be calculated from a


sample or a population.

A measure calculated from a sample is called a


statistic and that from a population is called a
parameter.
MEASURES OF
CENTRAL TENDENCY
A measure of central tendency is a
descriptive statistic that indicates:
• the average or typical observed value of
a variable in a data set;
• where most of the data lies, or where
most of the data is clustered.
MEASURES OF
CENTRAL TENDENCY
The three most commonly used measures
of central tendency include the:
• Mode
• Median and
• Mean
MEASURES OF
CENTRAL TENDENCY
The measures of central tendency can
be calculated from three different
“starting points”:

• a list of observed values

• a frequency table (or bar graph) or

• a histogram
The Mode
MODE
The mode (or modal value) of a variable in a
data set is the value of the variable that is
observed most frequently in that data
• or, given a continuous frequency curve, is at the point
of greatest density.

Note: the mode is the value that is observed


most frequently, not the frequency itself.

The mode is defined for every type of
measurement scale [i.e., nominal, ordinal,
interval, or ratio].
• However, the mode is used as a measure of central
tendency primarily for nominal scale variables.
MODE
The mode may be ill-defined if we have either:
• a small number of cases; or
• a precisely measured continuous
variable and a finite number of cases;

because in either event it is likely that no value


will be observed more than once in the data

A set of values can have more than one mode


MODE
The mode can be unstable in some
instances:
• small changes in the data can result in large
and erratic changes in the modal value;
• in particular, changes in the coding of
the variable or in the class intervals can
change the modal value.
CALCULATION OF THE
MODE
Example 1.
Given the data on number of problem sets
turned in by students in class.
4 5 5 3 4 5 5
3 1 2 3 5 5
We can construct a frequency table….
NUMBER OF PROBLEM SETS
TURNED IN
Value   Abs. freq.   Rel. freq.   Cum. freq.   De-cum. freq.
0       0            0%           0%           100%
1       1            8%           8%           100%
2       1            8%           15%          92%
3       3            23%          38%          85%
4       2            15%          54%          62%
5       6            46%          100%         46%
Total   13           100%

• The value with the greatest absolute or relative
frequency is the modal value.
• In this case 5 is the modal value.
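The frequency-table approach above can be sketched in a few lines of Python. This is only an illustration using the standard library's `collections.Counter`; the variable names (`problem_sets`, `freq`, `modes`) are chosen here and are not part of the lecture.

```python
from collections import Counter

# Problem sets turned in by each of the 13 students (data from Example 1).
problem_sets = [4, 5, 5, 3, 4, 5, 5, 3, 1, 2, 3, 5, 5]

# Absolute frequency of each observed value.
freq = Counter(problem_sets)
n = len(problem_sets)

# Print value, absolute frequency and relative frequency.
for value in sorted(freq):
    print(value, freq[value], f"{100 * freq[value] / n:.0f}%")

# The mode is every value that attains the greatest frequency
# (kept as a list, because a data set can have more than one mode).
max_count = max(freq.values())
modes = sorted(v for v, c in freq.items() if c == max_count)
print("mode(s):", modes)
```

Returning a list rather than a single number keeps the sketch honest about data sets with several modes.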


MODE
Notice that the modal number of problem sets
turned in is 5,
• although most students (7 of 13) turned in fewer than 5.
So if we recoded the variable to create just two
dichotomous categories:
(a) turned in all 5
(b) did not turn in all 5
• the latter category, i.e. (b), becomes the modal
category.
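A small sketch of this recoding, assuming the same 13 observations as Example 1; the category labels are illustrative names only.

```python
from collections import Counter

problem_sets = [4, 5, 5, 3, 4, 5, 5, 3, 1, 2, 3, 5, 5]

# Recode each observation into one of two categories.
recoded = ["all 5" if x == 5 else "fewer than 5" for x in problem_sets]

# Count the cases in each category and pick the most frequent one.
counts = Counter(recoded)
modal_category = counts.most_common(1)[0][0]
print(counts, modal_category)
```

The modal value changes from 5 to the "fewer than 5" category even though the underlying data are identical, which is the instability described above.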
FREQUENCY
DISTRIBUTION TABLE
MODE FROM THE FREQUENCY TABLE
MODE ON A FREQUENCY CURVE
Given a continuous frequency curve:
• the mode is the value of the variable under the
highest point of the frequency curve
• (the point with the greatest density of observed
values).
STRENGTHS OF THE
MODE
It is easy to understand and calculate
It is not affected by extreme large or small
values (outliers)
It is useful for qualitative data
Can be located graphically
WEAKNESSES OF THE MODE
Its computation is not based on all values as is the
case for the mean

It will not be well defined if the data consists of a small
number of values (it is possible that there can be more
than one modal value)

It is not capable of further mathematical treatment

Sometimes the data may not have a mode at all


The Median
MEDIAN
This is a value that divides the data set of finite
values into two equal parts such that
• the number of values equal to or greater than
the median is equal to the number less than or
equal to the median
CALCULATION OF THE MEDIAN
Given a list of observed values (raw data):

rank order the cases in terms of their observed


values (e.g., from lowest to highest)

identify the value of the case right at the


middle of this rank-ordered list, and
• the value of this case is the median value; or

construct a frequency table and find where the


cumulative frequency crosses the 50% mark
MEDIAN
Unordered data (raw data)
4 5 5 3 4 5 5 3 1 2 3 5 5
Rank ordered data
1 2 3 3 3 4 4 5 5 5 5 5 5

Median value (i.e. the value in the middle of the rank order, here the 7th of 13) = 4


NUMBER OF PROBLEM SETS
TURNED IN.
Value   Abs. freq.   Rel. freq.   Cum. freq.   De-cum. freq.
0       0            0%           0%           100%
1       1            8%           8%           100%
2       1            8%           15%          92%
3       3            23%          38%          85%
4       2            15%          54%          62%
5       6            46%          100%         46%
Total   13           100%

• The median is the value at which the cumulative frequency
crosses the 50% threshold.
• In this case 4 is the median value.
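As a rough sketch, the cumulative-frequency rule can be coded as follows. It assumes an odd number of cases, as in this example; with an even number of cases the cumulative count can land exactly on 50%, and the two middle values would then need to be averaged.

```python
from collections import Counter

problem_sets = [4, 5, 5, 3, 4, 5, 5, 3, 1, 2, 3, 5, 5]
freq = Counter(problem_sets)
n = len(problem_sets)

# Walk up the values, accumulating the count of cases seen so far.
cumulative = 0
median_value = None
for value in sorted(freq):
    cumulative += freq[value]
    # The first value whose cumulative count passes half the cases
    # is the one at which the 50% threshold is crossed.
    if cumulative > n / 2:
        median_value = value
        break

print(median_value)
```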


MEDIAN
When the number of values is even, there is no single
middle value.
• There will be two middle values
In this case the median is the average of the two
middle values.
1 2 3 3 3 4 4 5 5 5 5 5
Median is the average of the two middle values
• In this case median = 4

If we add two more values (two 5s) to the data set:
1 2 3 3 3 4 4 5 5 5 5 5 5 5
The median becomes the average of 4 and 5 = 4.5
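The odd and even cases can be checked with a short function. This is a minimal sketch; the function name `median` and the list names are chosen here for illustration, and the result is compared with Python's built-in `statistics.median`.

```python
import statistics


def median(values):
    """Middle value of the ranked data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: a single middle value exists.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


thirteen = [1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5]
twelve = thirteen[:-1]        # one 5 removed
fourteen = thirteen + [5]     # one 5 added

for data in (thirteen, twelve, fourteen):
    print(len(data), median(data), statistics.median(data))
```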
FREQUENCY
DISTRIBUTION TABLE
(MEDIAN)
MEDIAN FROM THE
FREQUENCY TABLE
MEDIAN ON A FREQUENCY CURVE
On a frequency distribution curve, the median
cuts the area under the curve into two equal
parts.
STRENGTHS OF THE
MEDIAN
It is unique
• There is only one median for a given set of
data.
It is easy to calculate
It is not drastically affected by extreme
values (outliers) as is the case for the
mean.
WEAKNESSES OF THE
MEDIAN
Computation of the median only relies on
the central values and ignores all the other
data.

It is also less amenable to statistical tests


(i.e. compared to the mean)
The mean
MEAN
The mean (or mean value) of a variable in a set
of data is the result of adding up all the
observed values of the variable and dividing by
the number of cases
• (i.e. the “average” ).
CALCULATION OF THE
MEAN
Suppose we have a variable X and a set of cases
numbered 1,2,...,n. Let the observed value of the
variable in each case be designated x1, x2, etc.
Thus:

Mean = (Sum of values) / (Number of observations)
Notation: Let x_1, x_2, ..., x_n be n observations of a variable
x. Then the mean of this variable is
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
CALCULATION OF THE
MEAN
Given the data:
1 2 3 3 3 4 4 5 5 5 5 5 5

Mean = (Sum of values) / (Number of observations)
Mean = (1+2+3+3+3+4+4+5+5+5+5+5+5) / 13 = 50 / 13
Mean = 3.85
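The same arithmetic as a short sketch in Python; the variable names are illustrative.

```python
# The 13 ranked observations from the example.
data = [1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5]

total = sum(data)          # sum of values
n = len(data)              # number of observations
mean = total / n

print(total, n, round(mean, 2))
```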
FREQUENCY DISTRIBUTION
TABLE
MEAN FROM THE
FREQUENCY TABLE
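The content of this slide did not survive in the text. One common way to obtain the mean from a frequency table, sketched here as an assumption rather than as the lecture's own method, is to weight each value by its absolute frequency and divide by the total number of cases. The dictionary below copies the absolute frequencies from the problem-sets table.

```python
# value -> absolute frequency, from the problem-sets frequency table
freq_table = {0: 0, 1: 1, 2: 1, 3: 3, 4: 2, 5: 6}

total_cases = sum(freq_table.values())
weighted_sum = sum(value * count for value, count in freq_table.items())

# Mean = sum of (value x frequency) / total frequency
mean_from_table = weighted_sum / total_cases
print(weighted_sum, total_cases, round(mean_from_table, 2))
```

This gives the same result as averaging the raw list, because each value is simply counted as many times as it occurs.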
MEAN ON A FREQUENCY CURVE
The mean is the “center of gravity” of the
distribution.
• Determine (by “eyeball” approximation) the
value of the variable such that the density
“balances” at that point; this value is the mean.
STRENGTHS OF THE
MEAN
The mean is unique for a given set of data
• There is only one mean.
It is easily understood and easy to
calculate.
It takes into consideration all the values in
the set of data.
WEAKNESS OF THE MEAN
The mean is affected by extreme values
• Because each value in the set of data is included in the
computation.
e.g. family income.
20, 30, 40, and 990
Mean = (20+30+40+990)/4 = 270.
Median = (30+40)/2 = 35.
Here 3 observations out of 4 lie between 20-40.
So, the mean 270 really fails to give a realistic
picture of the major part of the data.
It is influenced by extreme value 990
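The effect of the extreme value can be reproduced with the standard `statistics` module; the list name `incomes` is chosen here for illustration.

```python
import statistics

# Family incomes from the example, including the extreme value 990.
incomes = [20, 30, 40, 990]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
print(mean_income, median_income)

# Dropping the extreme value moves the mean a long way
# but leaves the median close to where it was.
without_outlier = incomes[:-1]
print(statistics.mean(without_outlier), statistics.median(without_outlier))
```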
CHOOSING A MEASURE OF
CENTRAL TENDENCY

Depends on the nature of the distribution.

For continuous variables in a unimodal and


symmetric distribution the mean, median and
mode are identical.

With a skewed distribution the median may be


more useful.

For statistical analyses the mean is the preferred


measure.
MEASURES
OF
DISPERSION
MEASURES OF
DISPERSION
Synonyms: measures of variation, spread and scatter

Dispersion refers to the variability exhibited by a set of


observations

A measure of dispersion describes the amount of variability


present in a set of data

There are different measures of variation that are used in


statistics. Those included in this module are:
• The range, variance, standard deviation and coefficient of
variation.
RANGE
The difference between the largest and
smallest observation.
Example: Numbers below are test scores for a
class.
44 56 58 62 64 64 70 72
Range = 72 – 44 = 28
Reported as 28, or as the pair (44, 72), the range
communicates very little information
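A minimal sketch of the range calculation for these scores; the variable names are illustrative.

```python
# Test scores from the range example.
scores = [44, 56, 58, 62, 64, 64, 70, 72]

smallest, largest = min(scores), max(scores)
data_range = largest - smallest
print(data_range, (smallest, largest))
```

Only the two extreme values enter the calculation, so any change to the six scores in between leaves the range untouched.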
ADVANTAGES AND
DISADVANTAGES OF THE
RANGE
Advantage
it is easy to compute

Disadvantage
It communicates very little information about
the data set
• It only takes into account the largest and
smallest value
• This makes it a poor measure of dispersion
VARIANCE
The variance is a measure of variability which
takes into account the differences between
each observation and the sample mean

It measures the scatter of the values in a set of


data about the mean

When the values lie close to the mean the
dispersion is small, and when they lie far from
the mean it is large
• Hence the logic of measuring the variation of
values from the mean
CALCULATION OF THE VARIANCE
Sample variance = The sum of the squared
deviations, divided by (n – 1).

Mathematical notation: s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
The quantity s² is called the sample estimate of
the variance

Population variance:
Mathematical notation: \sigma^2 = \frac{\sum (x - \mu)^2}{N}
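The two formulas differ only in the divisor. The sketch below applies both to the eight test scores from the range example (reusing that data here is an assumption made for illustration) and compares the results with `statistics.variance` and `statistics.pvariance` from the Python standard library.

```python
import statistics

scores = [44, 56, 58, 62, 64, 64, 70, 72]


def sum_of_squared_deviations(values):
    """Sum of (x - mean)^2 over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def sample_variance(values):
    # Divide by n - 1 when the data are a sample.
    return sum_of_squared_deviations(values) / (len(values) - 1)


def population_variance(values):
    # Divide by N when the data are the whole population.
    return sum_of_squared_deviations(values) / len(values)


print(sample_variance(scores), statistics.variance(scores))
print(population_variance(scores), statistics.pvariance(scores))
```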
POPULATION
VARIANCE
Average of squared deviations of values from the mean
Calculating the variance (population)

Population variance: \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}

Where: \mu = population mean
N = population size
X_i = ith value of the variable
SAMPLE VARIANCE

Sample variance: s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}

Where: \bar{X} = sample mean
X_i = ith value of observation X
n = sample size
ADVANTAGES AND
DISADVANTAGES OF THE
VARIANCE
Advantage
It takes into consideration all the values in
the set of data.
Disadvantage
The units of measure are squared which
may be difficult to communicate
• e.g. variance of weight will be in kg
squared.
STANDARD DEVIATION
The way around the difficulty of the squared units of s² is to use the square root of the
variance as a measure of variability.
The quantity denoted by s is called the sample standard deviation

Thus, if s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
then s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}

The population standard deviation will therefore be denoted as: \sigma = \sqrt{\sigma^2}
where \sigma^2 = \frac{\sum (x - \mu)^2}{N}
STANDARD DEVIATION
OF THE POPULATION
Get the square root of the population
variance to obtain the standard deviation
for the population.
STANDARD DEVIATION
OF THE SAMPLE
The sample standard deviation is obtained by
taking the square root of the sample variance.
Therefore, the sample standard deviation is
given by:
s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
EXAMPLE 1
The data given is of plasma volume
[Table: columns x, (x - \bar{x}), (x - \bar{x})^2; data values not recoverable]

Variance = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}

Mean = 3.0025
Variance = 0.097
Standard dev. = 0.31
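The plasma volume values themselves are not available in the text, so the sketch below only checks the last step of Example 1 (the square root of the reported variance). It then repeats the same step on the eight test scores from the range example, which are reused here purely as illustrative data.

```python
import math
import statistics

# Last step of Example 1: standard deviation = square root of the variance.
reported_variance = 0.097
sd_from_reported = math.sqrt(reported_variance)
print(round(sd_from_reported, 2))

# Same relationship on data we do have (test scores from the range example).
scores = [44, 56, 58, 62, 64, 64, 70, 72]
sd_scores = math.sqrt(statistics.variance(scores))
print(sd_scores, statistics.stdev(scores))
```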
VARIANCE AND S.D.
FROM THE FREQUENCY
TABLE
VARIANCE AND S.D.
FOR GROUPED DATA
RECAP OF FORMULAS
Variance = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
FEATURES OF THE
STANDARD DEVIATION
• It is never negative
• It is 0 only when all data values are the same
number
• The larger the SD, the more the
data varies
• It can increase dramatically with the inclusion
of outliers
• The units (minutes, feet, etc...) are the same as
the units of original values
COEFFICIENT OF VARIATION (CV)
Sometimes we may wish to compare standard
deviations in two groups.
• i.e. we may want to compare the variability in two
groups.
• The two groups may be from two different data
sets

Or may have observations measured in different


units of measure.
• e.g. weight measures in pounds and kg

The groups may also have different means


• e.g. mean weight in children and mean weight in
adults.
COEFFICIENT OF VARIATION (CV)
The coefficient of variation gives a relative
measure of variation rather than the absolute
variation.
• Hence sometimes referred to as the relative
standard deviation.

It expresses the standard deviation as a


percentage of the sample mean
COEFFICIENT OF VARIATION (CV)

CV = \frac{s}{\bar{x}} \times 100\%   (s = standard deviation, \bar{x} = mean)

The CV is independent of the units of
observation (i.e. it is a unitless quantity)
• Because the standard deviation and the mean
are expressed in the same units, the two units
cancel out.
EXAMPLE 2
The data given is from two samples of males aged 25 years and 11
years.

                     Sample 1     Sample 2
Age                  25 years     11 years
Mean weight          145 pounds   80 pounds
Standard deviation   10 pounds    10 pounds

We wish to know which of the weights is more variable.
EXAMPLE 2
If we calculate the CV for the 25 year olds;

C.V. = (10/145) x 100 = 6.9%

CV for the 11 year olds

C.V. = (10/80) x 100 = 12.5%

We can see that variation is higher in the 11


year olds than the 25 year olds.
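The two calculations can be sketched with a small helper; the function name `coefficient_of_variation` is chosen here for illustration.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


# Sample 1: 25 year olds; Sample 2: 11 year olds (weights in pounds).
cv_25 = coefficient_of_variation(sd=10, mean=145)
cv_11 = coefficient_of_variation(sd=10, mean=80)

print(round(cv_25, 1), round(cv_11, 1))
```

Both samples have the same standard deviation, so the difference in CV comes entirely from the difference in means.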
COEFFICIENT OF
VARIATION (CV) FOR
GROUPED DATA
VARIANCE AND S.D.
FOR GROUPED DATA
Measures of Position
PERCENTILES
• Percentiles are values that divide the
ranked data set into 100 (‘per cent’) equal
parts.
The pth percentile of a data set is a value
such that:
• at least p percent of the observations take
on this value or less
• at least (100-p) percent of the
observations take on this value or more.
PERCENTILES

[Figure: ranked data 1 1 3 4 5 5 7 8 9 with the pth percentile marked; p% of the values lie at or below it and (100-p)% lie at or above it]
PERCENTILES
• If approximately n percent of the items in a
distribution are less than the number x;
• then x is the nth percentile of the
distribution, denoted Pn.

• Percentile rank indicates the percentage of
data values that fall below the specified value.
• Symbolized by P1, P2, ...


PERCENTILES
To find the percentile rank (ranked data) of a given
data value x:
Percentile rank = ((number of data values below x + 0.5) / total number of values) x 100

Note: techniques differ, but all give
similar values
EXAMPLE: PERCENTILES

The following are test scores (out of 100) for a


particular Epidemiology class.
44 56 58 62 64 64 70 72
72 72 74 74 75 78 78 79
80 82 82 84 86 87 88 90
92 95 96 96 98 100
Mulenga scored 62 in the test. What was his
percentile rank?
PERCENTILES
Percentile rank = ((number of data values below x + 0.5) / total number of values) x 100
= ((3 + 0.5) / 30) x 100
= 11.67
~12
The score 62 is at the 12th percentile, expressed as
P12

Bwalya has a score of 95 in the test, what was his


percentile rank?
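A minimal sketch of the percentile-rank formula used in this example, applied to both students. The function name `percentile_rank` is illustrative; the formula is the one stated in the lecture (values below x, plus 0.5, over the total, times 100).

```python
scores = [44, 56, 58, 62, 64, 64, 70, 72,
          72, 72, 74, 74, 75, 78, 78, 79,
          80, 82, 82, 84, 86, 87, 88, 90,
          92, 95, 96, 96, 98, 100]


def percentile_rank(values, x):
    """Percentile rank of x: (count below x + 0.5) / total count * 100."""
    below = sum(1 for v in values if v < x)
    return 100 * (below + 0.5) / len(values)


mulenga = percentile_rank(scores, 62)   # 3 scores lie below 62
bwalya = percentile_rank(scores, 95)    # 25 scores lie below 95
print(round(mulenga, 2), round(bwalya, 2))
```

With this formula Bwalya's score of 95 has 25 scores below it, which gives a percentile rank of 85.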
PERCENTILES
EXAMPLE: PERCENTILES
PERCENTILES OF
GROUPED DATA
DECILES
Deciles are values that divide a data set into ten
(approximately) equal parts.
Denoted by D1, D2, …, D9

10% 10% 10% 10% 10% 10% 10% 10% 10% 10%

D1 D2 D3 D4 D5 D6 D7 D8 D9
DECILES AND
QUARTILES
Deciles and quartiles are determined in the
same manner as percentiles, since they
may be expressed as percentiles.
EXAMPLE: DECILES

The following are test scores (out of 100) for a


particular math class.
44 56 58 62 64 64 70 72
72 72 74 74 75 78 78 79
80 82 82 84 86 87 88 90
92 95 96 96 98 100
Find the sixth decile.
EXAMPLE: DECILES
Solution
• The sixth decile is the 60th percentile.
• 60 percent of 30 is (0.6)(30) = 18
• we take the average of the 18th and 19th item = 82 as
the sixth decile.
• D6 = 82
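The position rule used in this solution can be sketched as below. The whole-number case (average the kth and (k+1)th items) follows the worked example; the handling of a fractional position (round up to the next item) is an assumption added here, since the lecture does not state it, and other conventions exist.

```python
import math

scores = [44, 56, 58, 62, 64, 64, 70, 72,
          72, 72, 74, 74, 75, 78, 78, 79,
          80, 82, 82, 84, 86, 87, 88, 90,
          92, 95, 96, 96, 98, 100]


def percentile_value(ordered, p):
    """Value at the p-th percentile of already ranked data (0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(ordered)
    position = p * n / 100
    if position == int(position):
        k = int(position)
        # Whole-number position: average the k-th and (k+1)-th items (1-based).
        return (ordered[k - 1] + ordered[k]) / 2
    # Assumption: for a fractional position, take the next item up.
    return ordered[math.ceil(position) - 1]


d6 = percentile_value(scores, 60)   # the sixth decile is the 60th percentile
print(d6)
```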
QUARTILES
These are values which divide a series of observations, arranged in
ascending order into 4 equal parts. (Thus the 2nd Quartile is the
Median).
The data is ranked and then split into 4 segments with an equal
number of values per segment

The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger

Q2 is the same as the median (50% are smaller, 50% are larger)
Lastly, only 25% of the observations are greater than the third
quartile, Q3.
QUARTILES
For any set of data (ranked in order from
least to greatest):
• The second quartile, Q2, is the median.
• The first quartile, Q1, is the median of all
items below Q2.
• The third quartile, Q3, is the median of
all items above Q2.
QUARTILES
• Quartiles are the three values (Q1, Q2, Q3) that
divide the data set into four (approximately) equal
parts.
Q1, Q2, Q3
divides ranked scores into four equal parts
25% 25% 25% 25%

(minimum)
Q1 Q2 Q3 (maximum)

(median)
INTER QUARTILE
RANGE
The interquartile range shows the spread of
the middle 50% of the data.
Interquartile Range (or IQR): Q3 - Q1
INTERQUARTILE
RANGE
The interquartile range (IQR) is a measure of
variability, based on dividing a data set into
quartiles.

Quartiles divide a rank-ordered data set into four


equal parts. The values that divide each part are
called the first, second, and third quartiles; and
they are denoted by Q1, Q2, and Q3, respectively
EXAMPLE: QUARTILES

The following are test scores (out of 100) for a


particular math class.
44 56 58 62 64 64 70 72
72 72 74 74 75 78 78 79
80 82 82 84 86 87 88 90
92 95 96 96 98 100
• Find the three quartiles.
• And find the interquartile range
EXAMPLE: QUARTILES
Solution
The two middle numbers are 78 and 79 so
Q2 = (78 + 79)/2 = 78.5.
There are 15 numbers above and 15 numbers below
Q2, the middle number for the lower group is
Q1 = 72, and for the upper group is
Q3 = 88.
IQR = Q3 – Q1 = 88 – 72 = 16
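The median-of-halves method used in this solution can be sketched as follows. For an even number of values the data split cleanly into two halves, as here; for an odd number, this sketch leaves the median out of both halves, which is one common convention and an assumption on our part.

```python
import statistics

scores = [44, 56, 58, 62, 64, 64, 70, 72,
          72, 72, 74, 74, 75, 78, 78, 79,
          80, 82, 82, 84, 86, 87, 88, 90,
          92, 95, 96, 96, 98, 100]


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the middle value so it belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


q1, q2, q3 = quartiles(scores)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```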
QUARTILES OF
GROUPED DATA
INTERQUARTILE RANGE
OF GROUPED DATA
BOX AND WHISKER
PLOT
QUARTILES    DECILES
Q1 = P25     D1 = P10
Q2 = P50     D2 = P20
Q3 = P75     D3 = P30
             ...
             D9 = P90
END OF
LECTURE
