Statistics
1. INTRODUCTION
The word statistics has two different meanings (senses), which are discussed below:
(1) Plural Sense (2) Singular Sense
1. Plural Sense: In the plural sense, the word statistics refers to numerical facts and figures
collected in a systematic manner with a definite purpose in any field of study. In this sense,
statistics are aggregates of facts expressed in numerical form. For example,
statistics on industrial production, or statistics on the population growth of a country in different
years.
The word statistics is also used as the plural of the word “statistic”, which refers to a numerical
quantity such as the mean, median or variance calculated from sample values.
For Example: If we select 15 students from a class of 80 students, measure their heights and find
the average height, this average would be a statistic.
2. Singular Sense: In the singular sense, it refers to the science comprising the methods used
in the collection, presentation, analysis and interpretation of numerical data. These methods are
used to draw conclusions about population parameters.
For Example: Suppose we want to study the distribution of weights of students in a
certain college. First of all, we collect the information on the weights, which may be obtained
from the records of the college or from the students directly. The large number
of weight figures is hard to take in at once. In this situation we may arrange the weights in groups
such as “50 Kg to 60 Kg”, “60 Kg to 70 Kg” and so on, and find the number of students falling in each
group. This step is called presentation of data. We may go further and compute
averages and some other measures which help to describe the original data.
Classifications:
Depending on how data are used, statistics is sometimes divided into two main areas or
branches.
a. Descriptive Statistics: is concerned with summary calculations, graphs, charts and tables. It
deals with the collection of data, its presentation in various forms such
as tables, graphs and diagrams, and the finding of averages and other measures which
describe the data. For Example: industrial statistics, population statistics, trade statistics,
etc. Businessmen, for instance, make use of descriptive statistics in presenting their annual
reports, final accounts and bank statements.
b. Inferential Statistics: is a method used to generalize from a sample to a population. For
example, the average income of all families (the population) in Ethiopia can be estimated
from figures obtained from a few hundred (the sample) families. Inferential statistics
deals with techniques used for the analysis of data, making estimates and drawing
conclusions from limited information taken on a sample basis, and testing the reliability of the
estimates.
For Example: Suppose we want to have an idea about the percentage of illiterates in our
country. We take a sample from the population and find the proportion of illiterates in the
sample. This sample proportion with the help of probability enables us to make some inferences
about the population proportion. This study belongs to inferential statistics.
• It is important because statistical data usually arise from samples.
• Statistical techniques based on probability theory are required.
Stages in Statistical Investigation
There are five stages or steps in any statistical investigation.
1. Collection of data: the process of measuring, gathering and assembling the raw data upon which the
statistical investigation is to be based.
Data can be collected in a variety of ways; one of the most common methods is through
the use of surveys. Surveys can also be done in different ways; three of the most
common methods are:
• Telephone survey
• Mailed questionnaire
• Personal interview.
Exercise: discuss the advantage and disadvantage of the above three methods with respect
to each other.
2. Organization of data: Summarization of data in some meaningful way, e.g. in table form.
3. Presentation of the data: The process of re-organization, classification, compilation, and
summarization of data to present it in a meaningful form.
4. Analysis of data: The process of extracting relevant information from the summarized data,
mainly through the use of elementary mathematical operations.
5. Inference of data: The interpretation and further observation of the various statistical measures
obtained through the analysis of the data, using methods by which conclusions are
formed and inferences made.
• Statistical techniques based on probability theory are required.
Definitions of some terms
a. Statistical Population: It is the collection of all possible observations of a specified
characteristic of interest (possessing certain common property) and being under study.
b. Sample: It is a subset of the population, selected using some sampling technique in such a way
that it represents the population.
c. Sampling: The process or method of sample selection from the population.
d. Sample size: The number of elements or observations to be included in the sample.
e. Census: Complete enumeration or observation of the elements of the population, i.e. the
collection of data from every element in a population.
f. Parameter: Characteristic or measure obtained from a population.
g. Statistic: Characteristic or measure obtained from a sample.
h. Variable: It is an item of interest that can take on many different numerical values.
Types of Variables or Data:
1. Qualitative Variables are nonnumeric variables and can't be measured. Examples include
gender, religious affiliation, and state of birth.
2. Quantitative Variables are numerical variables and can be measured. Examples include balance
in checking account, number of children in family. Note that quantitative variables are either discrete
(which can assume only certain values, and there are usually "gaps" between the values, such as the
number of bedrooms in your house) or continuous (which can assume any value within a specific
range, such as the air pressure in a tire.)
Applications, Uses and Limitations of statistics
Applications of statistics:
• In almost all fields of human endeavor.
• Almost all human beings in their daily life are subjected to obtaining numerical facts, e.g. about
prices.
• Applicable in some processes, e.g. the invention of certain drugs, or measuring the extent of environmental pollution.
• In industries, especially in the area of quality control.
Uses of statistics:
The main function of statistics is to enlarge our knowledge of complex phenomena. The
following are some uses of statistics:
1. It presents facts in a definite and precise form.
2. Data reduction.
3. Measuring the magnitude of variations in data.
4. Furnishes a technique of comparison.
5. Estimating unknown population characteristics.
6. Formulating and testing hypotheses.
7. Studying the relationship between two or more variables.
8. Forecasting future events.
Limitations of statistics
As a science statistics has its own limitations. The following are some of the limitations:
• Deals with only quantitative information.
• Deals with only aggregates of facts and not with individual data items.
• Statistical data are only approximately, and not mathematically, correct.
• Statistics can be easily misused and therefore should be used by experts.
Scales of measurement
Proper knowledge about the nature and type of data to be dealt with is essential in order to
specify and apply the proper statistical method for their analysis and inferences. Measurement
scale refers to the property of value assigned to the data based on the properties of order,
distance and fixed zero.
SCALE TYPES
Measurement is the assignment of numbers to objects or events in a systematic fashion. Four
levels of measurement scales are commonly distinguished: nominal, ordinal, interval, and ratio.
Each possesses different properties of measurement systems.
Nominal Scales
Nominal scales are measurement systems that possess none of the three properties stated above.
• Level of measurement which classifies data into mutually exclusive, all inclusive
categories in which no order or ranking can be imposed on the data.
• No arithmetic and relational operation can be applied.
Examples:
o Blood type (A, B, AB, O)
o Political party preference (Republican, Democrat, or Other)
o Sex (Male or Female)
o Marital status (married, single, widowed, divorced)
o Country code
o Regional differentiation of Ethiopia.
Ordinal Scales
Ordinal Scales are measurement systems that possess the property of order, but not the property
of distance. The property of fixed zero is not important if the property of distance is not satisfied.
• Level of measurement which classifies data into categories that can be ranked.
Differences between the ranks are not meaningful.
• Arithmetic operations are not applicable but relational operations are applicable.
• Ordering is the sole property of ordinal scale.
Examples:
o Letter grades (A, B, C, D, F).
o Rating scales (Excellent, Very good, Good, Fair, poor).
o Military status.
Interval Scales
Interval scales are measurement systems that possess the properties of Order and distance, but
not the property of fixed zero.
• Level of measurement which classifies data that can be ranked and differences are
meaningful. However, there is no meaningful zero, so ratios are meaningless.
• All arithmetic operations except division are applicable.
• Relational operations are also possible.
Examples:
o IQ
o Temperature in °F.
Ratio Scales
Ratio scales are measurement systems that possess all three properties: order, distance, and fixed
zero. The added power of a fixed zero allows ratios of numbers to be meaningfully interpreted; i.e.
the ratio of Bekele's height to Martha's height is 1.32, whereas this is not possible with interval
scales.
• Level of measurement which classifies data that can be ranked, differences are meaningful,
and there is a true zero. True ratios exist between the different units of measure.
• All arithmetic and relational operations are applicable.
Examples:
o Weight
o Height
o Number of students
o Age
The following present a list of different attributes and rules for assigning numbers to objects. Try to
classify the different measurement systems into one of the four types of scales.
CHAPTER TWO
METHODS OF DATA COLLECTION AND PRESENTATION
Introduction to Methods of Data Collection
There are two sources of data:
1. Primary Data
• Data measured or collected by the investigator or the user directly from the source.
• Two activities are involved: planning and measuring.
a) Planning:
Identify source and elements of the data.
Decide whether to consider sample or census.
If sampling is preferred, decide on sample size, selection method, etc.
Decide measurement procedure.
Set up the necessary organizational structure.
b) Measuring: there are different options.
Focus Group
Telephone Interview
Mail Questionnaires
Door-to-Door Survey
Mall Intercept
New Product Registration
Personal Interview and
Experiments are some of the sources for collecting the primary
data.
2. Secondary Data
• Data gathered or compiled from published and unpublished sources or files.
• When our source is secondary data check that:
The type and objective of the situations.
The purpose for which the data were collected is compatible with
the present problem.
The nature and classification of the data are appropriate to our problem.
Step 2: Make a table as shown
Step 3: Tally the data.
Step 4: Compute the frequency.
Mark Tally Frequency
60 // 2
62 / 1
63 / 1
65 / 1
70 //// 4
74 / 1
75 // 2
76 / 1
80 /// 3
85 /// 3
90 / 1
Each individual value is presented separately, that is why it is named ungrouped frequency distribution.
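The tally table above can be reproduced with a short Python sketch. The raw marks are not shown in this part of the notes, so the list `marks` below is an assumption rebuilt from the frequency column, and `ungrouped_frequency` is a hypothetical helper name.

```python
from collections import Counter

# Assumed raw data: rebuilt from the tally table (the original order is unknown).
marks = [60, 60, 62, 63, 65, 70, 70, 70, 70, 74,
         75, 75, 76, 80, 80, 80, 85, 85, 85, 90]


def ungrouped_frequency(values):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(values)
    return sorted(counts.items())


table = ungrouped_frequency(marks)
for value, freq in table:
    # One "/" per observation plays the role of the tally column.
    print(f"{value:>4}  {'/' * freq:<5} {freq}")
```

Each distinct mark gets its own row, which is exactly what makes the distribution "ungrouped".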
3) Grouped frequency Distribution:
- When the range of the data is large, the data must be grouped into classes that are more than one unit in
width.
Definitions:
• Grouped Frequency Distribution: a frequency distribution when several numbers are grouped in one
class.
• Class limits: Separate one class in a grouped frequency distribution from another. The limits could
actually appear in the data, and there are gaps between the upper limit of one class and the lower limit of the
next.
• Units of measurement (U): the distance between two possible consecutive measures. It is usually
taken as 1, 0.1, 0.01, 0.001, ….
• Class boundaries: Separate one class in a grouped frequency distribution from another. The
boundaries have one more decimal place than the raw data and therefore do not appear in the data.
There is no gap between the upper boundary of one class and the lower boundary of the next class. The
lower class boundary is found by subtracting U/2 from the corresponding lower class limit and the
upper class boundary is found by adding U/2 to the corresponding upper class limit.
• Class width: the difference between the upper and lower class boundaries of any class. It is also the
difference between the lower limits of any two consecutive classes or the difference between any two
consecutive class marks.
• Class mark (Mid point): the average of the lower and upper class limits or the average of the upper
and lower class boundaries.
• Cumulative frequency: the number of observations less than or equal to (or more than or equal to) a specific value.
10. Find the cumulative frequencies. Depending on what you're trying to accomplish, it may not be
necessary to find the cumulative frequencies.
11. If necessary, find the relative frequencies and/or relative cumulative frequencies.
Example*: The following are weights in pounds of 20 children at a day-care center.
Construct a frequency distribution for this data.
11 29 6 33 14 31 22 27 19 20
18 17 22 38 23 21 26 34 39 27
Solutions:
Step 1: Find the highest and the lowest value: H = 39, L = 6.
Step 2: Find the range: R = Max − Min = 39 − 6 = 33.
Step 3: Select the number of classes desired using Sturges' formula: k = 1 + 3.32 log n
= 1 + 3.32 log(20) ≈ 5.32, which is rounded up to 6.
Step 4: Find the class width: w = R/k = 33/6 = 5.5, which is rounded up to 6.
Step 5: Select the starting point; let it be the minimum observation.
6, 12, 18, 24, 30, 36 are the lower class limits.
Step 6: Find the upper class limits; e.g. the first upper class limit = 12 − U = 12 − 1 = 11.
11, 17, 23, 29, 35, 41 are the upper class limits.
So, combining step 5 and step 6, one can construct the following classes.
Class limits
6 – 11
12 – 17
18 – 23
24 – 29
30 – 35
36 – 41
Step 7: Find the class boundaries.
E.g. for class 1: lower class boundary = 6 − U/2 = 5.5, upper class boundary = 11 + U/2 = 11.5.
• Then continue adding w to both boundaries to obtain the remaining boundaries. By doing so one can
obtain the following classes.
Class boundaries
5.5 – 11.5
11.5 – 17.5
17.5 – 23.5
23.5 – 29.5
29.5 – 35.5
35.5 – 41.5
Step 8: Tally the data.
Step 9: Write the numeric values for the tallies in the frequency column.
Step 10: Find the cumulative frequency.
Step 11: Find the relative frequency and/or relative cumulative frequency.
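The steps above can be sketched in Python for the 20 weights of Example*. This is only an illustration: `grouped_frequency` and the dictionary keys are hypothetical names, and the unit of measurement U = 1 is assumed because the data are whole numbers.

```python
import math

weights = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
           18, 17, 22, 38, 23, 21, 26, 34, 39, 27]


def grouped_frequency(data, unit=1):
    """Build a grouped frequency table following steps 1 to 11 (unit = U)."""
    n = len(data)
    low, high = min(data), max(data)              # step 1
    data_range = high - low                       # step 2
    k = math.ceil(1 + 3.32 * math.log10(n))       # step 3: Sturges' formula, rounded up
    width = math.ceil(data_range / k)             # step 4: class width, rounded up
    rows, cumulative = [], 0
    for j in range(k):
        lower = low + j * width                   # step 5: lower class limit
        upper = lower + width - unit              # step 6: upper class limit
        freq = sum(lower <= x <= upper for x in data)   # steps 8 and 9
        cumulative += freq                        # step 10
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - unit / 2, upper + unit / 2),  # step 7
            "freq": freq,
            "cum_freq": cumulative,
            "rel_freq": freq / n,                 # step 11
        })
    return rows


rows = grouped_frequency(weights)
for row in rows:
    print(row)
```

Summing the `freq` column should give back the number of observations, which is a quick check that no value fell outside the classes.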
Example: The following table gives the details of the number of deaths due to a variety of causes
among residents for a given town. Represent these figures by a suitable diagram.
Pictogram
- In this diagram, we represent data by means of picture symbols. We decide on a
suitable picture to represent a definite number of units in which the variable is measured.
Example: draw a pictogram to represent the following population of a town.
Year 1989 1990 1991 1992
Population 2000 3000 5000 7000
Bar Charts:
- A set of bars (thick lines or narrow rectangles) representing some magnitude over time
or space.
- They are useful for comparing aggregates over time or space.
- Bars can be drawn either vertically or horizontally.
- There are different types of bar charts. The most common being :
• Simple bar chart
• Deviation or two way bar chart
• Broken bar chart
• Component or sub divided bar chart.
• Multiple bar charts.
Simple Bar Chart
-Are used to display data on one variable.
-They are thick lines (narrow rectangles) having the same breadth. The magnitude of a
quantity is represented by the height /length of the bar.
Example: The following data represent sale by product, 1957- 1959 of a given company for
three products A, B, C.
Product   Sales ($) in 1957   Sales ($) in 1958   Sales ($) in 1959
A         12                  14                  18
B         24                  21                  18
C         24                  35                  54
Solutions:
[Figure: simple bar chart titled "Sales by product in 1957"; x-axis: product (A, B, C); y-axis: sales in $.]
[Figure: bar chart of sales in $ by year of production (1957, 1958, 1959), with legend entries for Products A, B and C.]
Multiple Bar charts
- These are used to display data on more than one variable.
- They are used for comparing different variables at the
same time.
Example: Draw a component bar chart to represent the sales
by product from 1957 to 1959.
Solutions:
[Figure: bar chart of sales by year of production (1957, 1958, 1959), with legend entries Product A, Product B and Product C.]
Graphical Presentation of data
- The histogram, frequency polygon and cumulative frequency graph or ogive are most
commonly applied graphical representation for continuous data.
Procedures for constructing statistical graphs:
• Draw and label the X and Y axes.
• Choose a suitable scale for the frequencies or cumulative frequencies and label it on the
Y axes.
• Represent the class boundaries for the histogram or ogive or the mid points for the
frequency polygon on the X axes.
• Plot the points.
• Draw the bars or lines to connect the points.
Histogram
A graph which displays the data by using vertical bars of various height to represent
frequencies. Class boundaries are placed along the horizontal axes. Class marks and class
limits are sometimes used as quantity on the X axes.
Example: Construct a histogram to represent the previous data (example *).
Frequency Polygon:
- A line graph. The frequency is placed along the vertical axis and the class mid points are
placed along the horizontal axis. It is customary to add the next higher and lower class intervals
with a corresponding frequency of zero; this is to make it a complete polygon.
Example: Draw a frequency polygon for the above data (example *).
Ogive (cumulative frequency curve)
- A graph showing the cumulative frequency (less than or more than type) plotted against
upper or lower class boundaries respectively. That is class boundaries are
plotted along the horizontal axis and the corresponding cumulative frequencies are
plotted along the vertical axis. The points are joined by a free hand curve.
Example: Draw an ogive curve(less than type) for the above data. (Example *)
CHAPTER THREE
3. MEASURES OF CENTRAL TENDENCY
Introduction
When we want to make comparison between groups of numbers it is good to have a single
value that is considered to be a good representative of each group. This single value is called
the average of the group. Averages are also called measures of central tendency.
An average which is representative is called a typical average, and an average which is not
representative and has only a theoretical value is called a descriptive average. A typical
average should possess the following:
• It should be rigidly defined.
• It should be based on all observations under investigation.
• It should be affected as little as possible by extreme observations.
• It should be capable of further algebraic treatment.
• It should be affected as little as possible by fluctuations of sampling.
• It should be easy to calculate and simple to understand.
Objectives:
To comprehend the data easily.
To facilitate comparison.
To make further statistical analysis.
The Summation Notation:
• Let X1, X2 ,X3 …XN be a number of measurements where N is the total number of
observation and Xi is ith observation.
• Very often in statistics an algebraic expression of the form X1+X2+X3+...+XN is used in a
formula to compute a statistic. It is tedious to write an expression like this very often,
so mathematicians have developed a shorthand notation to represent a sum of scores,
called the summation notation.
$$\sum_{i=1}^{N} X_i = X_1 + X_2 + \cdots + X_N$$
The expression is read, "the sum of X sub i from i equals 1 to N." It means "add up all the
numbers."
Example: Suppose the following were scores made on the first homework assignment for
five students in the class: 5, 7, 7, 6, and 8. In this example set of five numbers, where N=5, the
summation could be written:
$$\sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 7 + 7 + 6 + 8 = 33$$
The "i=1" in the bottom of the summation notation tells where to begin the sequence of
summation. If the expression were written with "i=3", the summation would start with the
third number in the set. For example:
$$\sum_{i=3}^{N} X_i = X_3 + X_4 + \cdots + X_N$$
In the example set of numbers, this would give the following result:
$$\sum_{i=3}^{N} X_i = X_3 + X_4 + X_5 = 7 + 6 + 8 = 21$$
The "N" in the upper part of the summation notation tells where to end the sequence of
summation. If there were only three scores then the summation and example would be:
$$\sum_{i=1}^{3} X_i = X_1 + X_2 + X_3 = 5 + 7 + 7 = 19$$
Sometimes, if the summation notation is used in an expression and the expression must be
written a number of times, as in a proof, then a shorthand notation for the shorthand
notation is employed. When the summation sign "∑" is used without additional notation, then
"i=1" and "N" are assumed.
For example:
$$\sum X_i = \sum_{i=1}^{N} X_i = X_1 + X_2 + \cdots + X_N$$
PROPERTIES OF SUMMATION
2. $\sum_{i=1}^{N} kX_i = k\sum_{i=1}^{N} X_i$, where k is any constant
4. $\sum_{i=1}^{N} (X_i + Y_i) = \sum_{i=1}^{N} X_i + \sum_{i=1}^{N} Y_i$
Solutions:
a) $\sum_{i=1}^{5} X_i = 5 + 7 + 7 + 6 + 8 = 33$
b) $\sum_{i=1}^{5} Y_i = 6 + 7 + 8 + 7 + 8 = 36$
c) $\sum_{i=1}^{5} 10 = 5 \times 10 = 50$
d) $\sum_{i=1}^{5} (X_i + Y_i) = (5 + 6) + (7 + 7) + (7 + 8) + (6 + 7) + (8 + 8) = 69 = 33 + 36$
e) $\sum_{i=1}^{5} (X_i - Y_i) = (5 - 6) + (7 - 7) + (7 - 8) + (6 - 7) + (8 - 8) = -3 = 33 - 36$
f) $\sum_{i=1}^{5} X_i Y_i = 5 \times 6 + 7 \times 7 + 7 \times 8 + 6 \times 7 + 8 \times 8 = 241$
g) $\sum_{i=1}^{5} X_i^2 = 5^2 + 7^2 + 7^2 + 6^2 + 8^2 = 223$
h) $\left(\sum_{i=1}^{5} X_i\right)\left(\sum_{i=1}^{5} Y_i\right) = 33 \times 36 = 1188$
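The solutions a) to h) can be checked with a few lines of Python. The lists `X` and `Y` are read off the terms written out in the solutions; the dictionary name `results` is only illustrative.

```python
X = [5, 7, 7, 6, 8]
Y = [6, 7, 8, 7, 8]

results = {
    "a": sum(X),                               # sum of X_i
    "b": sum(Y),                               # sum of Y_i
    "c": sum(10 for _ in X),                   # a constant summed N times is N * k
    "d": sum(x + y for x, y in zip(X, Y)),     # sum of (X_i + Y_i)
    "e": sum(x - y for x, y in zip(X, Y)),     # sum of (X_i - Y_i)
    "f": sum(x * y for x, y in zip(X, Y)),     # sum of products X_i * Y_i
    "g": sum(x ** 2 for x in X),               # sum of squares X_i^2
    "h": sum(X) * sum(Y),                      # product of the two sums
}

for label, value in results.items():
    print(label, value)
```

Note that f) and h) differ: the sum of products is not the product of sums.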
Types of measures of central tendency
There are several different measures of central tendency; each has its advantage and
disadvantage.
• The Mean (Arithmetic, Geometric and Harmonic)
• The Mode
• The Median
• Quantiles (Quartiles, Deciles and Percentiles)
The choice among these averages depends upon which best fits the property under discussion.
The Arithmetic Mean
• Is defined as the sum of the magnitude of the items divided by the number of items.
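• The mean of X1, X2, X3, …, Xn is denoted by A.M., m or $\bar{X}$ and is given by:
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
• If X1 occurs f1 times, X2 occurs f2 times, …, and Xk occurs fk times, then the mean is
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$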
Example: Obtain the mean for the following sample of birth weight (kg) of live born infants
at a private hospital. 2, 7, 8, 2, 7, 3, 7
Solution:
Xi      fi    Xifi
2       2     4
3       1     3
7       3     21
8       1     8
Total   7     36
$$\bar{X} = \frac{\sum_{i=1}^{4} f_i X_i}{\sum_{i=1}^{4} f_i} = \frac{36}{7} \approx 5.14$$
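The same calculation can be sketched in Python: `Counter` builds the frequency column from the raw birth weights, and the mean is the sum of the products divided by the sum of the frequencies. Variable names are illustrative.

```python
from collections import Counter

birth_weights = [2, 7, 8, 2, 7, 3, 7]          # sample from the example (kg)

freq = Counter(birth_weights)                   # Xi -> fi
total = sum(x * f for x, f in freq.items())     # sum of Xi * fi
n = sum(freq.values())                          # sum of fi
mean = total / n

print(sorted(freq.items()))
print(total, n, round(mean, 2))
```

The result agrees with dividing the plain sum of the seven observations by 7, as it must.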
Arithmetic Mean for Grouped Data
If data are given in the shape of a continuous frequency distribution, then the mean is
obtained as follows:
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$
where Xi = the class mark of the ith class and fi = the frequency of the ith class.
Example: calculate the mean for the following age distribution.
Class Frequency
6- 10 35
11- 15 23
16- 20 15
21- 25 12
26- 30 9
31- 35 6
Solutions:
• First find the class marks
• Find the product of frequency and class marks
• Find mean using the formula.
Class     fi     Xi    Xifi
6 - 10    35     8     280
11 - 15   23     13    299
16 - 20   15     18    270
21 - 25   12     23    276
26 - 30   9      28    252
31 - 35   6      33    198
Total     100          1575
$$\bar{X} = \frac{\sum_{i=1}^{6} f_i X_i}{\sum_{i=1}^{6} f_i} = \frac{1575}{100} = 15.75$$
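As a cross-check, the three solution steps can be written as a small Python sketch. The tuple layout `(lower limit, upper limit, frequency)` and the function name `grouped_mean` are assumptions made for the illustration.

```python
# (lower class limit, upper class limit, frequency) for the age distribution
age_classes = [(6, 10, 35), (11, 15, 23), (16, 20, 15),
               (21, 25, 12), (26, 30, 9), (31, 35, 6)]


def grouped_mean(classes):
    """Mean of grouped data: sum(fi * Xi) / sum(fi), with Xi the class mark."""
    marks = [(lower + upper) / 2 for lower, upper, _ in classes]   # class marks
    products = [mark * f for mark, (_, _, f) in zip(marks, classes)]
    n = sum(f for _, _, f in classes)
    return marks, sum(products), n, sum(products) / n


class_marks, total, n, mean = grouped_mean(age_classes)
print(class_marks, total, n, mean)
```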
Exercises:
1. Marks of 75 students are summarized in the following frequency distribution:
Marks No. of students
40-44 7
45-49 10
50-54 22
55-59 f4
60-64 f5
65-69 6
70-74 3
If 20% of the students have marks between 55 and 59
i. Find the missing frequencies f4 and f5.
ii. Find the mean.
• If the values in a series or mid values of a class are large enough, coding of values is a good
device to simplify the calculations.
• For raw data suppose we have used the following coding system.
In both cases the true mean is the assumed mean plus the average of the deviations from
the assumed mean.
Suppose the data is given in the shape of continuous frequency distribution with a
constant class size of w then the following coding is appropriate.
A is an assumed mean usually the mean of the class marks (i =1, 2… k).
Example:
1. Suppose the deviations of the observations from an assumed mean of 7 are: 1, -1, -2, -2, 0, -3, -2, 2, 0, -3.
a) Find the true mean
b) Find the original observation.
Solutions:
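A minimal sketch of how this example can be worked, using the statement above that the true mean is the assumed mean plus the average of the deviations. Variable names are illustrative.

```python
assumed_mean = 7
deviations = [1, -1, -2, -2, 0, -3, -2, 2, 0, -3]

# a) true mean = assumed mean + average deviation from the assumed mean
true_mean = assumed_mean + sum(deviations) / len(deviations)

# b) each original observation = assumed mean + its deviation
observations = [assumed_mean + d for d in deviations]

print(true_mean)
print(observations)
```

Averaging the recovered observations directly gives the same mean, which confirms the coding shortcut.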
$$\sum_{i=1}^{N} (X_i - \bar{X}) = 0.$$
2. The sum of the squared deviations of a set of items from their mean is the minimum,
i.e. $$\sum_{i=1}^{N} (X_i - \bar{X})^2 < \sum_{i=1}^{N} (X_i - A)^2, \quad A \neq \bar{X}$$
If a wrong figure has been used when calculating the mean the correct mean can be obtained
without repeating the whole process using:
Weighted Mean
When a proper importance is desired to be given to different data a weighted mean is
appropriate.
Weights are assigned to each item in proportion to its relative importance.
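Let X1, X2, …, Xn be the values of the items of a series and W1, W2, …, Wn their corresponding
weights; then the weighted mean, denoted $\bar{X}_w$, is defined as:
$$\bar{X}_w = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i}$$
$$\bar{X}_w = \frac{\sum_{i=1}^{5} X_i W_i}{\sum_{i=1}^{5} W_i} = \frac{60 \times 1 + 75 \times 2 + 63 \times 1 + 59 \times 3 + 55 \times 3}{1 + 2 + 1 + 3 + 3} = \frac{615}{10} = 61.5$$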
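The weighted mean computation above can be reproduced in Python. The values and weights are read off the worked calculation; what they represent is not stated in this part of the notes, and `weighted_mean` is a hypothetical helper name.

```python
values = [60, 75, 63, 59, 55]     # X_i, taken from the worked calculation
weights = [1, 2, 1, 3, 3]         # W_i, taken from the worked calculation


def weighted_mean(xs, ws):
    """Weighted mean: sum(X_i * W_i) / sum(W_i)."""
    if len(xs) != len(ws):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)


x_w = weighted_mean(values, weights)
print(x_w)
```

With equal weights the function reduces to the ordinary arithmetic mean.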
Taking the logarithms of both sides
$$\log(G.M) = \log\left((X_1 X_2 \cdots X_n)^{1/n}\right) = \frac{1}{n}\log(X_1 X_2 \cdots X_n)$$
$$\Rightarrow \log(G.M) = \frac{1}{n}(\log X_1 + \log X_2 + \cdots + \log X_n)$$
$$\Rightarrow \log(G.M) = \frac{1}{n}\sum_{i=1}^{n} \log X_i$$
⇒ The logarithm of the G.M of a set of observations is the arithmetic mean of their
logarithms.
$$\Rightarrow G.M = \text{Antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log X_i\right)$$
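The logarithmic form lends itself to a short Python sketch. The data `[4, 8, 16]` are a made-up illustration (not from the notes), chosen so that the geometric mean is easy to verify by hand as the cube root of 4 × 8 × 16 = 512.

```python
import math


def geometric_mean(values):
    """G.M = antilog of the arithmetic mean of the logarithms (base 10)."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log          # antilog


sample = [4, 8, 16]                # hypothetical data for illustration
gm = geometric_mean(sample)
print(round(gm, 6))
```

Working with logarithms avoids forming the product of all observations, which can become very large for long series.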
$$H.M = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$$
$$H.M = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{X_i}}, \quad n = \sum_{i=1}^{k} f_i$$
If observations X1, X2, …Xn have weights W1, W2, …Wn respectively, then their harmonic mean
is given by
$$H.M = \frac{\sum_{i=1}^{n} W_i}{\sum_{i=1}^{n} \frac{W_i}{X_i}}$$
This is called the Weighted Harmonic Mean.
Remark: The Harmonic Mean is useful and appropriate in finding average speeds and average
rates.
Example: A cyclist pedals from his house to his college at speed of 10 km/hr and back from the
college to his house at 15 km/hr. Find the average speed.
Solution: Here the distance is constant
The simple H.M is appropriate for this problem.
X1 = 10 km/hr, X2 = 15 km/hr
$$H.M = \frac{2}{\frac{1}{10} + \frac{1}{15}} = 12 \text{ km/hr}$$
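The cyclist example can be checked in Python in two ways: with the simple harmonic mean, and by dividing total distance by total time. The one-way distance of 30 km is an arbitrary assumption; any positive distance gives the same average speed.

```python
def harmonic_mean(values):
    """Simple harmonic mean: n divided by the sum of reciprocals."""
    return len(values) / sum(1 / v for v in values)


speeds = [10, 15]                          # km/hr, out and back
hm = harmonic_mean(speeds)

# Cross-check from first principles with an assumed one-way distance.
distance = 30                              # km (hypothetical)
total_time = distance / 10 + distance / 15 # hours
average_speed = 2 * distance / total_time

print(round(hm, 6), average_speed)
```

The arithmetic mean of the two speeds, 12.5 km/hr, would overstate the average because more time is spent at the slower speed.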
The Mode
- The mode is the value which occurs most frequently in a set of values.
- The mode may not exist, and even if it does exist, it may not be unique.
- In the case of a discrete distribution, the value having the maximum frequency is the modal value.
Examples:
1. Find the mode of 5, 3, 5, 8, 9. Mode = 5
2. Find the mode of 8, 9, 9, 7, 8, 2, and 5. The data are bimodal: 8 and 9
3. Find the mode of 4, 12, 3, 6, and 7. There is no mode for this data.
- The mode of a set of numbers X1, X2, …, Xn is usually denoted by $\hat{X}$.
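The three examples can be reproduced with a small Python sketch. The function `modes` is a hypothetical helper; it follows the convention used in the examples, where data in which every value occurs only once are treated as having no mode.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modal values, or [] when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([5, 3, 5, 8, 9]))          # unimodal
print(modes([8, 9, 9, 7, 8, 2, 5]))    # bimodal
print(modes([4, 12, 3, 6, 7]))         # no mode
```

Returning a list rather than a single number reflects the remark that the mode may not exist and may not be unique.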
Mode for Grouped data
If data are given in the shape of continuous frequency distribution, the mode is defined as:
Lmo = 45
w = 10
Example: Find the median of the following numbers, which consists of white blood counts
taken on admission of all patients entering a small hospital.
a) 6, 5, 2, 8, 9, 4.
b) 2, 1, 8, 3, 5, 8.
Solutions:
a) First order the data: 2, 4, 5, 6, 8, 9. Here n = 6.
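A minimal Python sketch of the usual rule for the median of ungrouped data: order the values, then take the middle value when n is odd or the average of the two middle values when n is even. The function name `median` is illustrative.

```python
def median(values):
    """Median of ungrouped data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the two middle values


median_a = median([6, 5, 2, 8, 9, 4])
median_b = median([2, 1, 8, 3, 5, 8])
print(median_a, median_b)
```

In both parts n = 6 is even, so the median is the average of the 3rd and 4th ordered values.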
Remark: The median class is the class with the smallest cumulative frequency (less than
type) greater than or equal to n/2.
Example: Find the median of the following distribution.
Solutions:
• First find the less than cumulative frequency.
• Identify the median class.
• Find median using formula.
n/2 = 75/2 = 37.5
39 is the first cumulative frequency to be greater than or equal to 37.5
⇒ 50 − 54 is the median class.
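The identification of the median class, and one common way to finish the calculation, can be sketched in Python. Two assumptions are made here and should be read as such: the distribution is taken to be the marks table of 75 students from the earlier exercise, with f4 = 15 (20% of 75) and f5 = 12 (the remainder), and the interpolation formula used is the standard one, median = L + ((n/2 − c)/f)·w, with L the lower boundary of the median class, c the cumulative frequency before it, f its frequency and w its width.

```python
# (lower limit, upper limit, frequency); assumed to be the marks table above
mark_classes = [(40, 44, 7), (45, 49, 10), (50, 54, 22), (55, 59, 15),
                (60, 64, 12), (65, 69, 6), (70, 74, 3)]


def grouped_median(classes, unit=1):
    """Return (median class limits, interpolated median) for grouped data."""
    n = sum(f for _, _, f in classes)
    half = n / 2
    cumulative = 0                                 # cumulative frequency before the class
    for lower, upper, f in classes:
        if cumulative + f >= half:                 # first class reaching n/2
            boundary = lower - unit / 2            # lower class boundary
            width = (upper - lower) + unit         # class width
            return (lower, upper), boundary + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency table")


median_class, median_value = grouped_median(mark_classes)
print(median_class, round(median_value, 2))
```

The cumulative frequencies 7, 17, 39 reproduce the value 39 quoted above as the first one to reach 37.5.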
Advantages and Disadvantages of Median
Advantages:
• Median is a positional average and hence not influenced by extreme observations.
• Can be calculated in the case of open end intervals.
• Median can be located even if the data are incomplete.
Disadvantages:
• It is not a good representative of data if the number of items is small.
• It is not amenable to further algebraic treatment.
• It is susceptible to sampling fluctuations.
Quantiles: When a distribution is arranged in order of magnitude of items, the median is
the value of the middle term. Other measures that depend upon their positions in the
distribution, namely quartiles, deciles, and percentiles, are collectively called quantiles.
Quartiles: Quartiles are measures that divide the frequency distribution in to four equal parts.
- The value of the variables corresponding to these divisions are denoted Q1, Q2, and Q3
often called the first, the second and the third quartile respectively.
- Q1 is a value which has 25% items which are less than or equal to it. Similarly Q2 has
50%items with value less than or equal to it and Q3 has 75% items whose values are less
than or equal to it.
Quartile for Individual Observations (Ungrouped Data):
Quartile for a Frequency Distribution (Discrete Data):
Example: The wheat production (in Kg) of 20 acres is given as: 1120, 1240, 1320, 1040, 1080, 1200,
1440, 1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, and 1885. Find the
quartile deviation and coefficient of quartile deviation.
Solution: After arranging the observations in ascending order, we get
1040, 1080, 1120, 1200, 1240, 1320, 1342, 1360, 1440, 1470, 1600, 1680, 1720, 1730, 1750,
1755, 1785, 1880, 1885, 1960.
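The rest of the calculation can be sketched in Python under stated assumptions, since the quartile formula for ungrouped data is not reproduced in this part of the notes. The sketch assumes the common convention that Qi is the i(n+1)/4-th ordered value with linear interpolation, that the quartile deviation is (Q3 − Q1)/2 and that its coefficient is (Q3 − Q1)/(Q3 + Q1). Other textbooks use slightly different position rules and would give slightly different values.

```python
wheat = [1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680, 1730,
         1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, 1885]


def quartile(data, i):
    """i-th quartile as the i(n+1)/4-th ordered value (assumed convention)."""
    ordered = sorted(data)
    position = i * (len(ordered) + 1) / 4      # 1-based, may be fractional
    whole = int(position)
    fraction = position - whole
    if whole < 1:
        return ordered[0]
    if whole >= len(ordered):
        return ordered[-1]
    # Interpolate between the whole-th and (whole + 1)-th ordered values.
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


q1, q3 = quartile(wheat, 1), quartile(wheat, 3)
quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)
print(q1, q3, quartile_deviation, round(coefficient_qd, 4))
```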
Remark: The quartile class (class containing Qi) is the class with the smallest cumulative
frequency (less than type) greater than or equal to iN/4.
Deciles: Deciles are measures that divide the frequency distribution into ten equal parts.
- The values of the variables corresponding to these divisions are denoted D1, D2, …, D9,
often called the first, the second, …, the ninth decile respectively.
Remark: The decile class (class containing Di) is the class with the smallest cumulative
frequency (less than type) greater than or equal to iN/10.
Percentiles:
- Percentiles are measures that divide the frequency distribution in to hundred equal parts.
- The values of the variables corresponding to these divisions are denoted P1, P2, .. P99
often called the first, the second,…, the ninety-ninth percentile respectively.
- To find Pi (i = 1, 2, …, 99) we count iN/100 of the classes beginning from the lowest class.
- For grouped data: we have the following formula
Remark: The percentile class (class containing Pi) is the class with the smallest cumulative
frequency (less than type) greater than or equal to iN/100.
Example: Considering the following distribution, calculate:
a) All quartiles.
b) The 7th decile.
c) The 90th percentile.
Values Frequency
140- 150 17
150- 160 29
160- 170 42
170- 180 72
180- 190 84
190- 200 107
200- 210 49
210- 220 34
220- 230 31
230- 240 16
240- 250 12
Solutions:
• First find the less than cumulative frequency.
• Use the formula to calculate the required quantile.
Values Frequency Cum.Freq(less than type)
140- 150 17 17
150- 160 29 46
160- 170 42 88
170- 180 72 160
180- 190 84 244
190- 200 107 351
200- 210 49 400
210- 220 34 434
220- 230 31 465
230- 240 16 481
240- 250 12 493
a) Quartiles:
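A hedged sketch of how the requested quantiles can be computed from the table. The grouped-data formula itself is not reproduced in this part of the notes, so the code assumes the standard interpolation rule: locate the class whose less-than cumulative frequency first reaches the target count (N/4·i, N/10·i or N/100·i), then take L + ((target − c)/f)·w, where L is the lower boundary of that class, c the cumulative frequency before it, f its frequency and w its width. The class limits are treated as boundaries because consecutive classes share end points.

```python
# (lower boundary, upper boundary, frequency) from the table above
value_classes = [(140, 150, 17), (150, 160, 29), (160, 170, 42), (170, 180, 72),
                 (180, 190, 84), (190, 200, 107), (200, 210, 49), (210, 220, 34),
                 (220, 230, 31), (230, 240, 16), (240, 250, 12)]


def grouped_quantile(classes, fraction):
    """Quantile of grouped data for a fraction such as 1/4, 7/10 or 90/100."""
    n = sum(f for _, _, f in classes)
    target = fraction * n                      # e.g. iN/4 for the i-th quartile
    cumulative = 0                             # cumulative frequency before the class
    for lower, upper, f in classes:
        if cumulative + f >= target:           # quantile class found
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")


q1 = grouped_quantile(value_classes, 1 / 4)
q2 = grouped_quantile(value_classes, 2 / 4)
q3 = grouped_quantile(value_classes, 3 / 4)
d7 = grouped_quantile(value_classes, 7 / 10)
p90 = grouped_quantile(value_classes, 90 / 100)
print([round(v, 2) for v in (q1, q2, q3, d7, p90)])
```

Under these assumptions Q1 falls in the class 170-180, Q2 and D7 in 190-200, Q3 in 200-210 and P90 in 220-230.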
CHAPTER FOUR
4. Measures of Dispersion (Variation)
Introduction and objectives of measuring Variation
-The scatter or spread of items of a distribution is known as dispersion or variation. In other
words the degree to which numerical data tend to spread about an average value is called
dispersion or variation of the data.
Measures of dispersions are statistical measures which provide ways of
measuring the extent in which data are dispersed or spread out.
Objectives of measuring Variation:
• To judge the reliability of measures of central tendency
• To control variability itself.
• To compare two or more groups of numbers in terms of their variability.
• To make further statistical analysis.
Types of Measures of Dispersion
Various measures of dispersions are in use. The most commonly used measures of
dispersions are:
1) Range and relative range
2) Quartile deviation and coefficient of Quartile deviation
3) Mean deviation and coefficient of Mean deviation
4) Standard deviation and coefficient of variation.
The Range (R)
The range is the largest score minus the smallest score. It is a quick, rough measure of
variability, although when a test is given back to students they very often wish to know the
range of scores. Because the range depends only on the two extreme scores, it may give a
distorted picture of the scores. The following two distributions have the same range, 13, yet
differ greatly in the amount of variability.
Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45
Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45
For this reason, among others, the range is not the most important measure of variability.
$R = L - S$, where L = largest observation and S = smallest observation.
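Range for grouped data:
If the data are given in the form of a continuous frequency distribution, the range is computed as:
$R = UCL_k - LCL_1$, where $UCL_k$ is the upper class limit of the last class and $LCL_1$ is the lower class limit of the first class.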
Relative Range (RR): also sometimes called the coefficient of range, it is given by:
$$RR = \frac{L - S}{L + S} = \frac{R}{L + S}$$
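As a rough illustration, the range and relative range of the two distributions listed above can be computed in Python. The function names `value_range` and `relative_range` are hypothetical helpers introduced only for this sketch.

```python
def value_range(data):
    """Range R = L - S (largest minus smallest observation)."""
    return max(data) - min(data)


def relative_range(data):
    """Relative range RR = (L - S) / (L + S)."""
    return (max(data) - min(data)) / (max(data) + min(data))


dist1 = [32, 35, 36, 36, 37, 38, 40, 42, 42, 43, 43, 45]
dist2 = [32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 45]

for name, data in (("Distribution 1", dist1), ("Distribution 2", dist2)):
    print(name, value_range(data), round(relative_range(data), 4))
```

Both measures depend only on the two extreme values, so the two distributions get identical results even though their spread around the centre is clearly different.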
Example:
1. Find the relative range of the above two distributions. (Exercise!)
2. If the range and relative range of a series are 4 and 0.25 respectively, what is the value of:
a) the smallest observation
b) the largest observation
Solution (2):
$R = 4 \Rightarrow L - S = 4$ .......... (1)
$RR = 0.25 \Rightarrow \frac{R}{L + S} = 0.25 \Rightarrow L + S = 16$ .......... (2)
Solving (1) and (2) simultaneously gives
L = 10 and S = 6.
The Variance
Definition: Variance is defined as the average of the squared differences between
each of the observations in a set of data and the mean. For sample data the
variance is denoted by $S^2$, and the population variance is denoted by $\sigma^2$ (sigma
squared).
Population Variance
If we divide the variation (the sum of squared deviations from the mean) by the number of
values in the population, we get the population variance. This variance is the "average
squared deviation from the mean".
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$$
For the case of a frequency distribution with k classes and $N = \sum f_i$, it is expressed as:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{k} f_i (X_i - \mu)^2$$
Sample Variance
For a sample of size n, the divisor is n - 1, and the following computational formulas can be used:
$$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}$$ for raw data, and
$$S^2 = \frac{\sum_{i=1}^{k} f_i X_i^2 - n\bar{X}^2}{n - 1}$$ for a frequency distribution.
Standard Deviation
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $S = \sqrt{S^2}$
There is a problem with variances. Recall that the deviations were squared. That means that the
units were also squared. To get the units back the same as the original data values, the square
root must be taken.
The following steps are used to calculate the sample variance:
1. Find the arithmetic mean.
2. Find the difference between each observation and the mean.
3. Square these differences.
4. Sum the squared differences.
5. Since the data is a sample, divide the number (from step 4 above) by the number of
observations minus one, i.e., n-1 (where n is equal to the number of observations in the data
set).
Example: Find the variance and standard deviation of the following sample data: 5, 17, 12, 10.
Solution:
$\bar{X} = 11$
Xi              5    10   12   17   Total
(Xi - X̄)^2     36   1    1    36   74
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{74}{3} = 24.67$$
$$S = \sqrt{S^2} = \sqrt{24.67} = 4.97$$
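The five steps listed above translate almost line by line into code. The following Python sketch applies them to the sample 5, 17, 12, 10; the function name `sample_variance` is a hypothetical helper chosen for this illustration.

```python
import math


def sample_variance(data):
    """Sample variance computed by following the five steps one at a time."""
    n = len(data)
    mean = sum(data) / n                      # step 1: arithmetic mean
    deviations = [x - mean for x in data]     # step 2: differences from the mean
    squared = [d ** 2 for d in deviations]    # step 3: square the differences
    total = sum(squared)                      # step 4: sum the squares
    return total / (n - 1)                    # step 5: divide by n - 1


data = [5, 17, 12, 10]
variance = sample_variance(data)
std_dev = math.sqrt(variance)
print(round(variance, 2), round(std_dev, 2))
```

For this sample the sketch reproduces the values 24.67 and 4.97. Python's standard library offers `statistics.variance` and `statistics.stdev` for the same calculation.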
Solution:
$\bar{X} = 55$
Xi (class mark)    42     47    52    57   62    67    72    Total
fi(Xi - X̄)^2      1183   640   198   60   588   864   867   4400
$$S^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^2}{n - 1} = \frac{4400}{74} = 59.46$$
$$S = \sqrt{S^2} = \sqrt{59.46} = 7.71$$
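The grouped calculation can be sketched in Python as well. The class frequencies are not listed with this solution, so the values used below are an assumption: they are inferred by dividing each tabulated f_i(X_i - X̄)² entry by (X_i - 55)², for example 1183 / 169 = 7.

```python
import math

class_marks = [42, 47, 52, 57, 62, 67, 72]
# Assumed frequencies, inferred from the table entries (e.g. 1183 / (42 - 55)**2 = 7).
freqs = [7, 10, 22, 15, 12, 6, 3]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, class_marks)) / n
weighted_sq = [f * (x - mean) ** 2 for f, x in zip(freqs, class_marks)]
variance = sum(weighted_sq) / (n - 1)     # sample variance, divisor n - 1
std_dev = math.sqrt(variance)
print(n, mean, sum(weighted_sq), round(variance, 2), round(std_dev, 2))
```

With these inferred frequencies the sketch gives n = 75, a mean of 55 and a sum of squares of 4400, which agrees with the denominator 74 and the results 59.46 and 7.71 shown in the solution.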
For a normal (bell-shaped) distribution:
• Approximately 68.27% of the data values fall within one standard deviation of the
mean, i.e. within $(\bar{X} - S, \bar{X} + S)$
• Approximately 95.45% of the data values fall within two standard deviations of the
mean, i.e. within $(\bar{X} - 2S, \bar{X} + 2S)$
• Approximately 99.73% of the data values fall within three standard deviations of the
mean, i.e. within $(\bar{X} - 3S, \bar{X} + 3S)$
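These three percentages are the values obtained for a normal distribution, and they can be reproduced with the error function from Python's standard library. This is a small check under that normality assumption; `normal_coverage` is a hypothetical helper name.

```python
import math


def normal_coverage(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * normal_coverage(k), 2))
```

The loop prints the coverage for one, two and three standard deviations, which round to 68.27, 95.45 and 99.73 percent.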
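3. Chebyshev's Theorem: for any set of data, whatever the shape of its distribution, at least $1 - \frac{1}{k^2}$ of the values lie within k standard deviations of the mean, for any k > 1.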
Example: Suppose a distribution has mean 50 and standard deviation 6. What percent of
the numbers are:
a) Between 38 and 62
b) Between 32 and 68
c) Less than 38 or more than 62.
d) Less than 32 or more than 68.
Solutions:
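a) 38 and 62 are at equal distance from the mean, 50, and this distance is 12, i.e. k = 12/6 = 2
standard deviations. By Chebyshev's theorem, at least $1 - \frac{1}{2^2} = 75\%$ of the numbers lie between 38 and 62.
b) 32 and 68 are both 18 = 3 × 6 away from the mean, so k = 3 and at least $1 - \frac{1}{3^2} \approx 88.9\%$ of the numbers lie between 32 and 68.
c) This is the complement of a): at most 25% of the numbers are less than 38 or more than 62.
d) Done similarly to c).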
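A minimal Python sketch of the Chebyshev bound for this example follows. It assumes the interval is symmetric about the mean and lies more than one standard deviation from it; `chebyshev_lower_bound` is a hypothetical helper name.

```python
def chebyshev_lower_bound(mean, std_dev, low, high):
    """Minimum proportion of values in an interval (low, high) symmetric about the mean."""
    k = (high - mean) / std_dev
    if abs((mean - low) / std_dev - k) > 1e-9 or k <= 1:
        raise ValueError("interval must be symmetric about the mean with k > 1")
    return 1 - 1 / k ** 2


inside_a = chebyshev_lower_bound(50, 6, 38, 62)   # k = 2
inside_b = chebyshev_lower_bound(50, 6, 32, 68)   # k = 3

# "At least" percentages inside each interval
print(round(100 * inside_a, 1), round(100 * inside_b, 1))
# "At most" percentages outside each interval (the complements)
print(round(100 * (1 - inside_a), 1), round(100 * (1 - inside_b), 1))
```

The bounds are lower limits for the inside of each interval and upper limits for the outside. They hold for any distribution, so they are weaker than the normal-distribution percentages.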
4. If the standard deviation of $X_1, X_2, \ldots, X_n$ is S, then the standard deviation of
a) $X_1 + k, X_2 + k, \ldots, X_n + k$ will also be S
b) $kX_1, kX_2, \ldots, kX_n$ would be $|k| S$
c) $a + kX_1, a + kX_2, \ldots, a + kX_n$ would be $|k| S$
Exercise: Verify each of the above relationships, considering k and a as constants.
Examples:
1. The mean and standard deviation of n Tetracycline capsules $X_1, X_2, \ldots, X_n$ are known to
be 12 gm and 3 gm respectively. A new set of capsules of another drug is obtained by the
linear transformation $Y_i = 2X_i - 0.5$ (i = 1, 2, …, n). What will be the standard
deviation of the new set of capsules?
2. The mean and the standard deviation of a set of numbers are 500 and 10 respectively.
a. If 10 is added to each of the numbers in the set, what will be the variance and
standard deviation of the new set?
b. If each of the numbers in the set is multiplied by -5, what will be the
variance and standard deviation of the new set?
Solutions:
1. Using c) above, the new standard deviation = |k| S = 2 × 3 = 6 gm.
2. a. They will remain the same: the standard deviation is 10 and the variance is 10² = 100.
b. New standard deviation = |k| S = 5 × 10 = 50, so the new variance is 50² = 2500.
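The three relationships can be checked numerically. The sketch below uses an arbitrary illustrative sample (an assumption, not data from these examples) and Python's `statistics.stdev`, which computes the sample standard deviation with divisor n - 1.

```python
import statistics

data = [5, 17, 12, 10]          # illustrative sample, assumed for the check
s = statistics.stdev(data)      # sample standard deviation (divisor n - 1)

a, k = -0.5, 2
shifted = [x + 10 for x in data]        # add a constant to every value
scaled = [-5 * x for x in data]         # multiply every value by -5
linear = [a + k * x for x in data]      # linear transformation a + kX

print(round(s, 4))
print(round(statistics.stdev(shifted), 4))   # unchanged by the shift
print(round(statistics.stdev(scaled), 4))    # 5 times the original
print(round(statistics.stdev(linear), 4))    # 2 times the original
```

Adding a constant leaves the spread unchanged, while multiplying by k scales it by the absolute value of k, which is why the factor for k = -5 is 5 rather than -5.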
Coefficient of Variation (C.V)
• Is defined as the ratio of the standard deviation to the mean, usually expressed as a percentage:
$$C.V = \frac{S}{\bar{X}} \times 100$$
Example: Temperatures recorded in three cities:
City 1    25   24   23   26   17
City 2    22   21   24   22   20
City 3    32   27   35   24   28
Which city has the most consistent temperature, based on these data?
$$C.V_1 = \frac{S_1}{\bar{X}_1} \times 100 = \frac{3.536}{23} \times 100 = 15.4\%$$
$$C.V_2 = \frac{S_2}{\bar{X}_2} \times 100 = \frac{1.483}{21.8} \times 100 = 6.8\%$$
$$C.V_3 = \frac{S_3}{\bar{X}_3} \times 100 = \frac{4.324}{29.2} \times 100 = 14.8\%$$
Since $C.V_2 < C.V_3 < C.V_1$, city 2 has the most consistent temperature.
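A short Python sketch can compute the coefficient of variation for each city directly from the listed temperatures. It assumes the sample standard deviation (divisor n - 1); `coefficient_of_variation` is a hypothetical helper name.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100


cities = {
    "City 1": [25, 24, 23, 26, 17],
    "City 2": [22, 21, 24, 22, 20],
    "City 3": [32, 27, 35, 24, 28],
}

cv = {name: coefficient_of_variation(temps) for name, temps in cities.items()}
for name, value in cv.items():
    print(name, round(value, 1))
print("Most consistent:", min(cv, key=cv.get))
```

Computed this way, the coefficients are about 15.4%, 6.8% and 14.8%, so the smallest relative spread belongs to City 2.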
Standard Scores (Z-scores)
• If X is a measurement from a distribution with mean $\bar{X}$ and standard deviation S,
then its value in standard units is
$$Z = \frac{X - \bar{X}}{S}$$
With $S_1 = 6$ and $S_2 = 5$, the standard scores are $Z_A = \frac{X_A - \bar{X}_1}{S_1} = 2$ and $Z_B = \frac{X_B - \bar{X}_2}{S_2} = 1$.
Student A performed better relative to his section because the score of student A is two
standard deviations above the mean score of his section, while the score of student B is only
one standard deviation above the mean score of his section.
2. Two groups of people were trained to perform a certain task and tested to find out which
group is faster to learn the task. For the two groups the following information was given:
                      Group 1    Group 2
Mean (minutes)        10.4       11.9
Std. dev. (minutes)   1.2        1.3
Relatively speaking:
a) Which group is more consistent in its performance?
b) Suppose person A from group one takes 9.2 minutes while person B from group
two takes 9.3 minutes. Who was faster in performing the task? Why?
Solutions:
a) Use the coefficient of variation.
$$C.V_1 = \frac{S_1}{\bar{X}_1} \times 100 = \frac{1.2}{10.4} \times 100 = 11.54\%$$
$$C.V_2 = \frac{S_2}{\bar{X}_2} \times 100 = \frac{1.3}{11.9} \times 100 = 10.92\%$$
Since $C.V_2 < C.V_1$, group 2 is more consistent.
b) Calculate the standard scores of A and B.
$$Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{9.2 - 10.4}{1.2} = -1$$
$$Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{9.3 - 11.9}{1.3} = -2$$
Relative to their groups, person B is faster, because the time taken by person B is two standard
deviations shorter than the average time taken by group 2, while the time taken by person A is
only one standard deviation shorter than the average time taken by group 1.
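The same comparison can be sketched in Python using the group means and standard deviations from the solution. The helper names `z_score` and `coefficient_of_variation` are hypothetical and introduced only for this illustration.

```python
def z_score(x, mean, std_dev):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mean) / std_dev


def coefficient_of_variation(mean, std_dev):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


cv1 = coefficient_of_variation(10.4, 1.2)   # group 1
cv2 = coefficient_of_variation(11.9, 1.3)   # group 2
z_a = z_score(9.2, 10.4, 1.2)               # person A, group 1
z_b = z_score(9.3, 11.9, 1.3)               # person B, group 2
print(round(cv1, 2), round(cv2, 2), round(z_a, 2), round(z_b, 2))
```

Because a shorter time is better here, the more negative standard score indicates the relatively faster person.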