Module of Introduction To Statistics
INTRODUCTION TO STATISTICS
Course Code: (ABVM2101)
(3/5 Cr.Hrs/ECTS)
Module Writer:
Getahun G. Woldemariam (MSc)
May, 2020
Woliso, Ethiopia
TABLE OF CONTENTS
INTRODUCTION TO STATISTICS
TABLE OF CONTENTS
MODULE OBJECTIVES
MAJOR COMPONENTS
L1.1 INTRODUCTORY CONCEPTS IN STATISTICS
1.1. Definition and classifications of statistics
7.3. Steps
MODULE OBJECTIVES
The learning task was designed to equip students with the ability to
Identify the importance and application areas of statistics in their
field of study;
Interpret statistical information, reports, charts and figures;
Choose appropriate sampling methods and procedures;
Explain the basic concepts of probability distributions and their
application;
Use estimation and testing methods for prediction and
generalization purposes.
In addition, the learning task attempts to enable students to describe
data collection tools and procedures.
MAJOR COMPONENTS
1. Lectures
NO. Title Hours
L1.1 Introductory concepts in statistics 4
L1.2 Measures of central tendency 5
L1.3 Measures of dispersion 5
L1.4 Probability theories 8
L1.5 Concepts of sampling and their applications 5
L1.6 Estimation and hypothesis testing 8
L1.7 Correlation and simple linear regression 5
2. Problem based learning tasks
Number Case Description
PBL1.1 Students will be provided with socio-economic data and asked to compute various measures such as frequency
distribution, measures of central tendency, measures of dispersion, and measures of shape of distribution.
3. Individual Studies
NO Title of Book, Article or website; or Reader Hrs
S1.1 Agrawal B.L. 1996. Basic statistics, new age international pub. Ltd. New Delhi 7
S1.2 Frank H. and Althoen S.C. 1994. Statistics: concepts, and application. Cambridge university press, UK 7
S1.3 Hooda R.P, 2001. Statistics for business and economics. 2nd , New York 7
S1.4 Johnson , R.A, and Bhata K.G. 1992, statistics principles and methods. New York 7
S1.5 Wayne, W. 1995. Biostatistics : a foundation for analysis in health. 6th ed. New York 7
S1.6 Students read their handouts, notes and any other materials they find helpful to fulfill the objectives of the educational unit 20
4. Practical Activities:
NO Title Practicals (guided) Hours
P1.1 With the help of appropriate software, students will be asked to compute correlation and various tests of hypothesis based on fictitious socio-economic data 3
Demonstrations
D1.1 Students follow the instructor's demonstration of software package application and exercise to master the application 4
Routine training (independent)
R1.1 Experts from statistical offices and other institutions that are known to use data processing activities will be invited to train students and share their practical experiences. 3
Hrs for different educational activities within the task (LT)
L task Total hrs LT
L P T S/WS PA A/PoA IS
LT 1 135 40 10 8 7 8 7 55
Exercise: discuss the advantage and disadvantage of the above three methods with
respect to each other.
2. Organization of data: Summarization of data in some meaningful way, e.g. in table form.
3. Presentation of the data: The process of re-organization, classification, compilation, and
summarization of data to present it in a meaningful form.
4. Analysis of data: The process of extracting relevant information from the summarized
data, mainly through the use of elementary mathematical operation.
5. Inference of data: The interpretation and further observation of the various statistical
measures through the analysis of the data by implementing those methods by which
conclusions are formed and inferences made.
Statistical techniques based on probability theory are required.
1. Qualitative Variables are nonnumeric variables and can't be measured. Examples include
gender, religious affiliation, and state of birth.
2. Quantitative Variables are numerical variables and can be measured. Examples include
balance in checking account, number of children in family. Note that quantitative variables are
either discrete (which can assume only certain values, and there are usually "gaps" between the
values, such as the number of bedrooms in your house) or continuous (which can assume any
value within a specific range, such as the air pressure in a tire.)
1.4.Applications, Uses and Limitations of statistics
Applications of statistics:
In almost all fields of human endeavor.
Almost all human beings in their daily life are exposed to numerical facts,
e.g. about prices.
Applicable in some process e.g. invention of certain drugs, extent of environmental
pollution.
In industries especially in quality control area.
Uses of statistics:
The main function of statistics is to enlarge our knowledge of complex phenomena. The
following are some uses of statistics:
1. It presents facts in a definite and precise form.
2. Data reduction.
3. Measuring the magnitude of variations in data.
4. Furnishes a technique of comparison
5. Estimating unknown population characteristics.
6. Testing and formulating of hypothesis.
7. Studying the relationship between two or more variables.
8. Forecasting future events.
Limitations of statistics
As a science statistics has its own limitations. The following are some of the limitations:
Deals with only quantitative information.
Deals with only aggregate of facts and not with individual data items.
Statistical data are only approximately and not mathematically correct.
Statistics can be easily misused and therefore should be used by experts.
1.5.Scales of measurement
Proper knowledge about the nature and type of data to be dealt with is essential in order to
specify and apply the proper statistical method for their analysis and inferences. Measurement
scale refers to the property of value assigned to the data based on the properties of order,
distance and fixed zero.
In mathematical terms measurement is a functional mapping from the set of objects {Oi} to the
set of real numbers {M(Oi)}.
The goal of measurement systems is to structure the rule for assigning numbers to objects in
such a way that the relationship between the objects is preserved in the numbers assigned to the
objects. The different kinds of relationships preserved are called properties of the measurement
system.
Order
The property of order exists when an object that has more of the attribute than another object,
is given a bigger number by the rule system. This relationship must hold for all objects in the
"real world".
The property of ORDER exists
When for all i, j if Oi > Oj, then M(Oi) > M(Oj).
Distance
The property of distance is concerned with the relationship of differences between objects. If a
measurement system possesses the property of distance it means that the unit of measurement
means the same thing throughout the scale of numbers. That is, an inch is an inch, no matter
where it falls, whether immediately ahead or a mile down the road.
More precisely, an equal difference between two numbers reflects an equal difference in the
"real world" between the objects that were assigned the numbers. In order to define the
property of distance in the mathematical notation, four objects are required: Oi, Oj, Ok, and Ol .
The difference between objects is represented by the "-" sign; Oi - Oj refers to the actual "real
world" difference between object i and object j, while M(Oi) - M(Oj) refers to differences
between numbers.
The property of DISTANCE exists, for all i, j, k, l
If Oi-Oj ≥ Ok- Ol then M(Oi)-M(Oj) ≥ M(Ok)-M( Ol ).
Fixed Zero
A measurement system possesses a rational zero (fixed zero) if an object that has none of the
attribute in question is assigned the number zero by the system of rules. The object does not
need to really exist in the "real world", as it is somewhat difficult to visualize a "man with no
height". The requirement for a rational zero is this: if objects with none of the attribute did
exist would they be given the value zero. Defining O0 as the object with none of the attribute in
question, the definition of a rational zero becomes:
The property of FIXED ZERO exists if M(O0) = 0.
The property of fixed zero is necessary for ratios between numbers to be meaningful.
1.6.SCALE TYPES
Measurement is the assignment of numbers to objects or events in a systematic fashion. Four
levels of measurement scales are commonly distinguished: nominal, ordinal, interval, and ratio
and each possesses different properties of measurement systems.
Nominal Scales
Nominal scales are measurement systems that possess none of the three properties stated
above. Level of measurement which classifies data into mutually exclusive, all inclusive
categories in which no order or ranking can be imposed on the data.
No arithmetic and relational operation can be applied.
Examples:
Political party preference (Republican, Democrat, or Other,)
Sex (Male or Female.)
Marital status(married, single, widow, divorce)
Country code
Regional differentiation of Ethiopia.
Ordinal Scales
Ordinal Scales are measurement systems that possess the property of order, but not the
property of distance. The property of fixed zero is not important if the property of distance is
not satisfied.
Level of measurement which classifies data into categories that can be ranked. Precise differences
between the ranks do not exist. Arithmetic operations are not applicable but relational
operations are applicable. Ordering is the sole property of the ordinal scale.
Examples:
Letter grades (A, B, C, D, F)
Rating scales (Excellent, Very good, Good, Fair, poor)
Military status
Interval Scales
Interval scales are measurement systems that possess the properties of Order and distance, but
not the property of fixed zero. Level of measurement which classifies data that can be ranked
and differences are meaningful. However, there is no meaningful zero, so ratios are
meaningless. All arithmetic operations except division are applicable. Relational operations are
also possible.
Examples:
IQ
Temperature in °F
Ratio Scales
Ratio scales are measurement systems that possess all three properties: order, distance, and
fixed zero. The added power of a fixed zero allows ratios of numbers to be meaningfully
interpreted; i.e. the ratio of Bekele's height to Martha's height is 1.32, whereas this is not
possible with interval scales.
Level of measurement which classifies data that can be ranked, differences are meaningful, and
there is a true zero. True ratios exist between the different units of measure. All arithmetic and
relational operations are applicable.
Examples:
Weight
Height
Number of students
Age
The following present a list of different attributes and rules for assigning numbers to objects.
Try to classify the different measurement systems into one of the four types of scales.
(Exercise)
Your checking account balance as a measure of the amount of money you have in that
account.
Your score on the first statistics test as a measure of your knowledge of statistics.
Your score on an individual intelligence test as a measure of your intelligence.
The distance around your forehead measured with a tape measure as a measure of your
intelligence.
A response to the statement "Abortion is a woman's right" where "Strongly Disagree" =
1, "Disagree" = 2, "No Opinion" = 3, "Agree" = 4, and "Strongly Agree" = 5, as a
measure of attitude toward abortion.
Times for swimmers to complete a 50-meter race
Months of the year Meskerm, Tikimit…
Socioeconomic status of a family when classified as low, middle and upper classes.
Blood type of individuals, A, B, AB and O.
Region numbers of Ethiopia (1, 2, 3 etc.)
The number of students in a college;
The net wages of a group of workers;
the height of the men in the same town;
Having collected and edited the data, the next important step is to organize it, that is, to present
it in a readily comprehensible, condensed form that makes it easier to draw inferences from it. It is
also necessary that the like items be separated from the unlike ones.
The presentation of data is broadly classified in to the following two categories:
Tabular presentation
Diagrammatic and Graphic presentation.
The process of arranging data in to classes or categories according to similarities technically is
called classification.
Classification is a preliminary and it prepares the ground for proper presentation of data.
Definitions:
Raw data: recorded information in its original collected form, whether it is counts or
measurements, is referred to as raw data.
Frequency: is the number of values in a specific class of the distribution.
Frequency distribution: is the organization of raw data in table form using classes and
frequencies.
There are three basic types of frequency distributions
Categorical frequency distribution
Ungrouped frequency distribution
Grouped frequency distribution
There are specific procedures for constructing each type.
2.1.Categorical frequency Distribution:
Used for data that can be placed in specific categories, such as nominal or ordinal data, e.g. marital
status.
Example: a social worker collected the following data on marital status for 25
persons.(M=married, S=single, W=widowed, D=divorced)
M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
Solution:
Since the data are categorical, discrete classes can be used. There are four types of marital status
M, S, D, and W. These types will be used as class for the distribution. We follow procedure to
construct the frequency distribution.
Step 1: Make a table as shown.
Class (1) Tally (2) Frequency (3) Percent (4)
M
S
D
W
Step 2: Tally the data and place the result in column (2).
Step 3: Count the tally and place the result in column (3).
Step 4: Find the percentage of values in each class by using
$$\% = \frac{f}{n} \times 100$$
where f = frequency of the class and n = total number of values.
Percentages are not normally a part of frequency distribution but they can be added since they are
used in certain types diagrammatic such as pie charts.
Step 5: Find the total for column (3) and (4).
Combining all the steps, one can construct the following frequency distribution.
Class (1)   Tally (2)   Frequency (3)   Percent (4)
M           ///// /     6               24
S           ///// //    7               28
D           ///// //    7               28
W           /////       5               20
Total                   25              100
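The tallying and percentage steps above can be checked with a short script. This is a minimal sketch, not part of the original module; the variable names (`responses`, `counts`, `percents`) are illustrative.

```python
from collections import Counter

# Marital-status codes from the example, typed row by row.
responses = (
    "M S D W D "
    "S S M M M "
    "W D S M M "
    "W D D S S "
    "S W W D D"
).split()

counts = Counter(responses)   # Step 3: frequency of each class
n = len(responses)            # total number of values

# Step 4: percentage of each class, % = f / n * 100
percents = {c: 100 * f / n for c, f in counts.items()}

for c in ["M", "S", "D", "W"]:
    print(f"{c}  {counts[c]:2d}  {percents[c]:5.1f}")
print(f"Total {n}  {sum(percents.values()):5.1f}")
```

The frequencies must add up to n and the percentages to 100, which is a quick way to catch a tallying slip.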
80 76 90 85 80
70 60 62 70 85
65 60 63 74 75
76 70 70 80 85
Solution:
74 / 1
75 / 1
76 // 2
80 /// 3
85 /// 3
90 / 1
Each individual value is presented separately, that is why it is named ungrouped frequency
distribution.
Class mark (Mid points): it is the average of the lower and upper class limits or the
average of upper and lower class boundary.
Cumulative frequency: is the number of observations less than/more than or equal to a
specific value.
Cumulative frequency above: it is the total frequency of all values greater than or equal
to the lower class boundary of a given class.
Cumulative frequency below: it is the total frequency of all values less than or equal to
the upper class boundary of a given class.
Cumulative Frequency Distribution (CFD): it is the tabular arrangement of class interval
together with their corresponding cumulative frequencies. It can be more than or less than
type, depending on the type of cumulative frequency used.
Relative frequency (rf): it is the frequency divided by the total frequency.
Relative cumulative frequency (rcf): it is the cumulative frequency divided by the total
frequency.
Guidelines for classes
1. There should be between 5 and 20 classes.
2. The classes must be mutually exclusive. This means that no data value can fall into two
different classes
3. The classes must be all inclusive or exhaustive. This means that all data values must be
included.
4. The classes must be continuous. There are no gaps in a frequency distribution.
5. The classes must be equal in width. The exception here is the first or last class. It is
possible to have a "below ..." or "... and above" class. This is often used with ages.
Steps for constructing Grouped frequency Distribution
1. Find the largest and smallest values
2. Compute the Range(R) = Maximum - Minimum
3. Select the number of classes desired, usually between 5 and 20, or use Sturges' rule
$k = 1 + 3.32 \log_{10} n$, where k is the number of classes desired and n is the total number of
observations.
4. Find the class width by dividing the range by the number of classes and rounding up,
not off: $w = R/k$.
5. Pick a suitable starting point less than or equal to the minimum value. The starting
point is called the lower limit of the first class. Continue to add the class width to this
lower limit to get the rest of the lower limits.
6. To find the upper limit of the first class, subtract U from the lower limit of the second
class, where U is the unit of measurement (the smallest difference between two possible
values, e.g. U = 1 for whole numbers). Then continue to add the class width to this upper
limit to find the rest of the upper limits.
7. Find the boundaries by subtracting U/2 units from the lower limits and adding U/2 units
to the upper limits. The boundaries are also half-way between the upper limit of one
class and the lower limit of the next class. Depending on what you're trying to accomplish,
it may not be necessary to find the boundaries.
8. Tally the data.
9. Find the frequencies.
10. Find the cumulative frequencies. Depending on what you're trying to accomplish, it
may not be necessary to find the cumulative frequencies.
11. If necessary, find the relative frequencies and/or relative cumulative frequencies
Example*:
Construct a frequency distribution for the following data.
11 29 6 33 14 31 22 27 19 20
18 17 22 38 23 21 26 34 39 27
Solutions:
Step 1: Find the highest and the lowest value H=39, L=6
Step 2: Find the range; R=H-L=39-6=33
Step 3: Select the number of classes desired using Sturges formula;
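$k = 1 + 3.32 \log_{10} n = 1 + 3.32 \log_{10}(20) = 5.32 \approx 6$ (rounding up)
Step 4: Find the class width; $w = R/k = 33/6 = 5.5 \approx 6$ (rounding up)
Step 5: Take the minimum value, 6, as the starting point; adding w = 6 repeatedly gives the
lower class limits 6, 12, 18, 24, 30, 36.
Step 6: The upper limit of the first class is 12 - U = 12 - 1 = 11; adding w repeatedly,
11, 17, 23, 29, 35, 41 are the upper class limits.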
So combining step 5 and step 6, one can construct the following classes.
Class limits
6 – 11
12 – 17
18 – 23
24 – 29
30 – 35
36 – 41
Step 7: Find the class boundaries;
E.g. for class 1 Lower class boundary=6-U/2=5.5
Upper class boundary =11+U/2=11.5
Then continue adding w on both boundaries to obtain the rest boundaries. By doing so
one can obtain the following classes.
Class boundary
5.5 – 11.5
11.5 – 17.5
17.5 – 23.5
23.5 – 29.5
29.5 – 35.5
35.5 – 41.5
Step 8: tally the data.
Step 9: Write the numeric values for the tallies in the frequency column.
Step 10: Find cumulative frequency.
Step 11: Find relative frequency or/and relative cumulative frequency.
Class     Class          Class   Tally      Freq.   Cf (less     Cf (more     rf     rcf (less
limit     boundary       mark                       than type)   than type)          than type)
6 – 11    5.5 – 11.5     8.5     //         2       2            20           0.10   0.10
12 – 17   11.5 – 17.5    14.5    //         2       4            18           0.10   0.20
18 – 23   17.5 – 23.5    20.5    ///// //   7       11           16           0.35   0.55
24 – 29   23.5 – 29.5    26.5    ////       4       15           9            0.20   0.75
30 – 35   29.5 – 35.5    32.5    ///        3       18           5            0.15   0.90
36 – 41   35.5 – 41.5    38.5    //         2       20           2            0.10   1.00
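Steps 1 to 11 of the example can be reproduced with a short script. This is a sketch under stated assumptions: the data are whole numbers, so the unit of measurement is taken as U = 1, and the logarithm in Sturges' rule is base 10. All variable names are illustrative.

```python
import math

data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        18, 17, 22, 38, 23, 21, 26, 34, 39, 27]

n = len(data)
U = 1                                        # assumed unit of measurement
R = max(data) - min(data)                    # Step 2: range
k = math.ceil(1 + 3.32 * math.log10(n))      # Step 3: Sturges' rule, rounded up
w = math.ceil(R / k)                         # Step 4: class width, rounded up

# Steps 5-6: class limits, starting from the minimum value
lower_limits = [min(data) + i * w for i in range(k)]
classes = [(lo, lo + w - U) for lo in lower_limits]

# Step 7: class boundaries and class marks
boundaries = [(lo - U / 2, hi + U / 2) for lo, hi in classes]
marks = [(lo + hi) / 2 for lo, hi in classes]

# Steps 8-11: frequencies, cumulative and relative frequencies
freqs = [sum(lo <= x <= hi for x in data) for lo, hi in classes]
cum_less = [sum(freqs[: i + 1]) for i in range(k)]
cum_more = [sum(freqs[i:]) for i in range(k)]
rel = [f / n for f in freqs]

for row in zip(classes, boundaries, marks, freqs, cum_less, cum_more, rel):
    print(row)
```

Note that `math.ceil` leaves an exact integer unchanged, so if R/k happens to be a whole number the width is not increased.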
Pictogram
- In this diagram, we represent data by means of some picture symbols. We decide about a
suitable picture to represent a definite number of units in which the variable is measured.
Example: draw a pictogram to represent the following population of a town.
Year 1989 1990 1991 1992
Bar Charts:
- A set of bars (thick lines or narrow rectangles) representing some magnitude over time
or space.
- They are useful for comparing aggregates over time or space.
- Bars can be drawn either vertically or horizontally.
- There are different types of bar charts. The most common being :
Solutions:
[Figure: bar chart. Vertical axis: Sales in $ (0 to 30); horizontal axis: product (A, B, C).]
-When there is a desire to show how a total (or aggregate) is divided in to its component parts,
we use component bar chart.
-The bars represent total value of a variable with each total broken in to its component parts
and different colours or designs are used for identifications
Example:
Draw a component bar chart to represent the sales by product from 1957 to 1959.
Solutions:
[Figure: component bar chart. Vertical axis: Sales in $ (0 to 100); horizontal axis: Year of production (1957, 1958, 1959); legend: Product A, Product B, Product C.]
Example:
Draw a component bar chart to represent the sales by product from 1957 to 1959.
Solutions:
[Figure: bar chart. Vertical axis: Sales in $ (0 to 60); horizontal axis: Year of production (1957, 1958, 1959); legend: Product A, Product B, Product C.]
The histogram, frequency polygon and cumulative frequency graph or ogive are most
commonly applied graphical representations for continuous data.
Histogram
A graph which displays the data by using vertical bars of various heights to represent
frequencies. Class boundaries are placed along the horizontal axis. Class marks and class
limits are sometimes used as quantities on the X axis.
Frequency Polygon:
- A line graph. The frequency is placed along the vertical axis and class mid points are
placed along the horizontal axis. It is customary to add the next higher and lower class intervals with
a corresponding frequency of zero; this is to make it a complete polygon.
Example: Draw a frequency polygon for the above data (example *).
Solutions:
[Figure: frequency polygon. Horizontal axis: class marks 2.5, 8.5, 14.5, 20.5, 26.5, 32.5, 38.5, 44.5; vertical axis: frequency.]
- A graph showing the cumulative frequency (less than or more than type) plotted against
upper or lower class boundaries respectively. That is class boundaries are plotted along the
horizontal axis and the corresponding cumulative frequencies are plotted along the vertical
axis. The points are joined by a free hand curve.
Example: Draw an ogive curve(less than type) for the above data.(Example *)
Objectives:
To comprehend the data easily.
To facilitate comparison.
To make further statistical analysis.
The symbol $\sum_{i=1}^{N} X_i$ is a mathematical shorthand for X1+X2+X3+...+XN
The expression is read, "the sum of X sub i from i equals 1 to N." It means "add up all the
numbers."
Example: Suppose the following were scores made on the first homework assignment for five
students in the class: 5, 7, 7, 6, and 8. In this example set of five numbers, where N=5, the
summation could be written:
$$\sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 7 + 7 + 6 + 8 = 33$$
The "i=1" in the bottom of the summation notation tells where to begin the sequence of
summation. If the expression were written with "i=3", the summation would start with the third
number in the set. For example:
$$\sum_{i=3}^{N} X_i = X_3 + X_4 + \dots + X_N$$
In the example set of numbers, this would give the following result:
$$\sum_{i=3}^{5} X_i = 7 + 6 + 8 = 21$$
The "N" in the upper part of the summation notation tells where to end the sequence of
summation. If only the first three scores were summed, then the summation and example would be:
$$\sum_{i=1}^{3} X_i = X_1 + X_2 + X_3 = 5 + 7 + 7 = 19$$
Sometimes if the summation notation is used in an expression and the expression must be
written a number of times, as in a proof, then a shorthand notation for the shorthand notation is
employed. When the summation sign "∑" is used without additional notation, then "i=1" and
"N" are assumed.
For example:
$$\sum X = \sum_{i=1}^{N} X_i$$
PROPERTIES OF SUMMATION
1. $\sum_{i=1}^{n} k = nk$, where k is any constant
2. $\sum_{i=1}^{n} kX_i = k\sum_{i=1}^{n} X_i$, where k is any constant
3. $\sum_{i=1}^{n} (a + bX_i) = na + b\sum_{i=1}^{n} X_i$, where a and b are any constants
4. $\sum_{i=1}^{n} (X_i \pm Y_i) = \sum_{i=1}^{n} X_i \pm \sum_{i=1}^{n} Y_i$
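Example: for the following data, find each of the sums a) to h).
i    1  2  3  4  5
Xi   5  7  7  6  8
Yi   6  7  8  7  8
a) $\sum_{i=1}^{5} X_i$
b) $\sum_{i=1}^{5} Y_i$
c) $\sum_{i=1}^{5} 10$
d) $\sum_{i=1}^{5} (X_i + Y_i)$
e) $\sum_{i=1}^{5} (X_i - Y_i)$
f) $\sum_{i=1}^{5} X_i Y_i$
g) $\sum_{i=1}^{5} X_i^2$
h) $\left(\sum_{i=1}^{5} X_i\right)\left(\sum_{i=1}^{5} Y_i\right)$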
Solutions:
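a) $\sum_{i=1}^{5} X_i = 5 + 7 + 7 + 6 + 8 = 33$
b) $\sum_{i=1}^{5} Y_i = 6 + 7 + 8 + 7 + 8 = 36$
c) $\sum_{i=1}^{5} 10 = 5 \times 10 = 50$
d) $\sum_{i=1}^{5} (X_i + Y_i) = (5 + 6) + (7 + 7) + (7 + 8) + (6 + 7) + (8 + 8) = 69 = 33 + 36$
e) $\sum_{i=1}^{5} (X_i - Y_i) = (5 - 6) + (7 - 7) + (7 - 8) + (6 - 7) + (8 - 8) = -3 = 33 - 36$
f) $\sum_{i=1}^{5} X_i Y_i = 5 \times 6 + 7 \times 7 + 7 \times 8 + 6 \times 7 + 8 \times 8 = 241$
g) $\sum_{i=1}^{5} X_i^2 = 5^2 + 7^2 + 7^2 + 6^2 + 8^2 = 223$
h) $\left(\sum_{i=1}^{5} X_i\right)\left(\sum_{i=1}^{5} Y_i\right) = 33 \times 36 = 1188$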
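The eight sums can be verified numerically. The sketch below takes the X and Y values that appear in solutions a) and b); the dictionary name `results` is illustrative.

```python
# Values of X and Y as they appear in solutions a) and b)
X = [5, 7, 7, 6, 8]
Y = [6, 7, 8, 7, 8]

results = {
    "a": sum(X),                                  # sum of X_i
    "b": sum(Y),                                  # sum of Y_i
    "c": sum(10 for _ in X),                      # sum of a constant = n * k
    "d": sum(x + y for x, y in zip(X, Y)),        # sum of (X_i + Y_i)
    "e": sum(x - y for x, y in zip(X, Y)),        # sum of (X_i - Y_i)
    "f": sum(x * y for x, y in zip(X, Y)),        # sum of products
    "g": sum(x ** 2 for x in X),                  # sum of squares
    "h": sum(X) * sum(Y),                         # product of sums
}

for label, value in results.items():
    print(label, value)
```

Comparing f) with h), and g) with the square of a), shows that a sum of products is not the product of the sums.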
There are several different measures of central tendency; each has its advantage and
disadvantage.
The Mean (Arithmetic, Geometric and Harmonic)
The Mode
The Median
Quantiles (Quartiles, Deciles and Percentiles)
The choice among these averages depends on which best fits the property under discussion.
The arithmetic mean is defined as the sum of the magnitudes of the items divided by the number of items.
The mean of X1, X2, X3, …, Xn is denoted by A.M, m or $\bar{X}$ and is given by:
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$
If X1 occurs f1 times, X2 occurs f2 times, …, and Xk occurs fk times, then the mean will be
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$
where k is the number of classes and $\sum_{i=1}^{k} f_i = n$.
$$\bar{X} = \frac{\sum_{i=1}^{4} f_i X_i}{\sum_{i=1}^{4} f_i} = \frac{36}{7} \approx 5.14$$
If data are given in the shape of a continuous frequency distribution, then the mean is obtained
as follows:
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$
where $X_i$ = the class mark of the i-th class and $f_i$ = the frequency of the i-th class.
Class frequency
6- 10 35
11- 15 23
16- 20 15
21- 25 12
26- 30 9
31- 35 6
Solutions:
First find the class marks
Find the product of frequency and class marks
Find mean using the formula.
Class fi Xi Xifi
6- 10 35 8 280
11- 15 23 13 299
16- 20 15 18 270
21- 25 12 23 276
26- 30 9 28 252
31- 35 6 33 198
Total 100 1575
$$\bar{X} = \frac{\sum_{i=1}^{6} f_i X_i}{\sum_{i=1}^{6} f_i} = \frac{1575}{100} = 15.75$$
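The three solution steps (class marks, products, mean) translate directly into code. This is a minimal sketch using the class limits and frequencies from the table above; the variable names are illustrative.

```python
# Class limits and frequencies from the example table
classes = [(6, 10), (11, 15), (16, 20), (21, 25), (26, 30), (31, 35)]
freqs = [35, 23, 15, 12, 9, 6]

# Step 1: class mark = average of lower and upper class limits
marks = [(lo + hi) / 2 for lo, hi in classes]

# Step 2: product of frequency and class mark for each class
products = [f * x for f, x in zip(freqs, marks)]

# Step 3: mean = sum(f_i * X_i) / sum(f_i)
mean = sum(products) / sum(freqs)

print(marks)
print(products)
print(mean)
```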
Exercises:
65-69 6
70-74 3
$\sum_{i=1}^{n} (X_i - \bar{X}) = 0.$
2. The sum of the squared deviations of a set of items from their mean is the minimum, i.e.
$\sum_{i=1}^{n} (X_i - \bar{X})^2 < \sum_{i=1}^{n} (X_i - A)^2$ for any $A \neq \bar{X}$.
the mean of $n_k$ observations, then the mean of all the observations in all groups, often called
the combined mean, is given by:
$$\bar{X}_c = \frac{\bar{X}_1 n_1 + \bar{X}_2 n_2 + \dots + \bar{X}_k n_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} \bar{X}_i n_i}{\sum_{i=1}^{k} n_i}$$
Females: $\bar{X}_1 = 60$, $n_1 = 30$
Males: $\bar{X}_2 = 72$, $n_2 = 70$
$$\bar{X}_c = \frac{\bar{X}_1 n_1 + \bar{X}_2 n_2}{n_1 + n_2} = \frac{\sum_{i=1}^{2} \bar{X}_i n_i}{\sum_{i=1}^{2} n_i}$$
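Substituting the group means and sizes into the combined-mean formula can be sketched as follows. The helper name `combined_mean` is hypothetical, introduced only for illustration.

```python
def combined_mean(means, sizes):
    """Combined mean of several groups: sum(mean_i * n_i) / sum(n_i)."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)

# Females: mean 60, n = 30; males: mean 72, n = 70
xc = combined_mean([60, 72], [30, 70])
print(xc)  # (60*30 + 72*70) / 100
```

The result lies closer to the male mean because that group is larger; a simple average of 60 and 72 would ignore the group sizes.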
4. If a wrong figure has been used when calculating the mean, the correct mean can be
obtained without repeating the whole process using:
$$\text{CorrectMean} = \text{WrongMean} + \frac{\text{CorrectValue} - \text{WrongValue}}{n}$$
where n is the total number of observations.
Solutions:
$$\text{CorrectMean} = \text{WrongMean} + \frac{\text{CorrectValue} - \text{WrongValue}}{n} = 65 + \frac{80 - 40}{10} = 65 + 4 = 69 \text{ kg}$$
k*old mean
Example:
1. The mean of n Tetracycline capsules X1, X2, …, Xn is known to be 12 gm. A new set
of capsules of another drug is obtained by the linear transformation Yi = 2Xi – 0.5
(i = 1, 2, …, n). What will be the mean of the new set of capsules?
Solutions:
$\text{NewMean} = 2 \times \text{OldMean} - 0.5 = 2 \times 12 - 0.5 = 23.5$
Weighted Mean
$$\bar{X}_w = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i}$$
Example:
A student obtained the following percentage in an examination:
English 60, Biology 75, Mathematics 63, Physics 59, and Chemistry 55. Find the
student's weighted arithmetic mean if weights 1, 2, 1, 3, 3 respectively are allotted to the
subjects.
Solutions:
$$\bar{X}_w = \frac{\sum_{i=1}^{5} X_i W_i}{\sum_{i=1}^{5} W_i} = \frac{60 \times 1 + 75 \times 2 + 63 \times 1 + 59 \times 3 + 55 \times 3}{1 + 2 + 1 + 3 + 3} = \frac{615}{10} = 61.5$$
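The weighted mean of the five subjects can be reproduced in a few lines. This is a minimal sketch; the list names are illustrative.

```python
# Percentages and weights in the order: English, Biology, Mathematics, Physics, Chemistry
scores = [60, 75, 63, 59, 55]
weights = [1, 2, 1, 3, 3]

# Weighted mean = sum(X_i * W_i) / sum(W_i)
weighted_mean = sum(x * w for x, w in zip(scores, weights)) / sum(weights)

# Unweighted mean, for comparison
simple_mean = sum(scores) / len(scores)

print(weighted_mean, simple_mean)
```

The weighted mean is lower than the simple mean here because the two subjects with the lowest scores carry the largest weights.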
Merits:
It is based on all observation.
It is suitable for further mathematical treatment.
It is stable average, i.e. it is not affected by fluctuations of sampling to some extent.
It is easy to calculate and simple to understand.
Demerits:
It is affected by extreme observations.
It can not be used in the case of open end classes.
It can not be determined by the method of inspection.
It can not be used when dealing with qualitative characteristics, such as intelligence,
honesty, beauty.
The geometric mean of a set of n observation is the nth root of their product.
The geometric mean of X1, X2 ,X3 …Xn is denoted by G.M and given by:
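$$\text{G.M} = \sqrt[n]{X_1 \times X_2 \times \dots \times X_n}$$
Taking the logarithms of both sides
$$\log(\text{G.M}) = \log\left(\sqrt[n]{X_1 \times X_2 \times \dots \times X_n}\right) = \log\left((X_1 \times X_2 \times \dots \times X_n)^{1/n}\right)$$
$$\log(\text{G.M}) = \frac{1}{n}\log(X_1 \times X_2 \times \dots \times X_n) = \frac{1}{n}(\log X_1 + \log X_2 + \dots + \log X_n)$$
$$\log(\text{G.M}) = \frac{1}{n}\sum_{i=1}^{n} \log X_i$$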
Example:
Solutions:
$$\text{G.M} = \sqrt[n]{X_1 \times X_2 \times \dots \times X_n} = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4$$
Remark: The Geometric Mean is useful and appropriate for finding averages of ratios.
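Both forms of the geometric mean, the n-th root of the product and the logarithmic form derived above, can be compared numerically. The sketch uses the values 2, 4 and 8 from the worked solution; variable names are illustrative.

```python
import math

values = [2, 4, 8]
n = len(values)

# Direct definition: n-th root of the product
gm_root = math.prod(values) ** (1 / n)

# Logarithmic form: log(G.M) = (1/n) * sum(log X_i)
gm_log = math.exp(sum(math.log(x) for x in values) / n)

print(gm_root, gm_log)  # both are 4 up to floating-point rounding
```

For long lists of large numbers the logarithmic form is usually preferable, since the raw product can overflow or lose precision.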
The harmonic mean of X1, X2 , X3 …Xn is denoted by H.M and given by:
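$$\text{H.M} = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$$
This is called the simple harmonic mean.
When the values $X_i$ occur with frequencies $f_i$:
$$\text{H.M} = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{X_i}}, \quad n = \sum_{i=1}^{k} f_i$$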
If observations X1, X2, …Xn have weights W1, W2, …Wn respectively, then their harmonic
mean is given by
$$\text{H.M} = \frac{\sum_{i=1}^{n} W_i}{\sum_{i=1}^{n} \frac{W_i}{X_i}}$$
This is called the weighted harmonic mean.
Remark: The Harmonic Mean is useful and appropriate in finding average speeds and average
rates.
Example: A cyclist pedals from his house to his college at speed of 10 km/hr and back from the
college to his house at 15 km/hr. Find the average speed.
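Because the distance is the same in both directions, the average speed is the simple harmonic mean of the two speeds. The sketch below computes it and cross-checks it against total distance divided by total time; the one-way distance of 30 km is a hypothetical value chosen only for the check, since the result does not depend on it.

```python
speeds = [10, 15]  # km/hr, to the college and back

# Simple harmonic mean: n / sum(1 / X_i)
harmonic_mean = len(speeds) / sum(1 / s for s in speeds)

# Cross-check with a hypothetical one-way distance
distance = 30                                   # km (assumed, any value works)
total_time = sum(distance / s for s in speeds)  # hours
average_speed = (2 * distance) / total_time

arithmetic_mean = sum(speeds) / len(speeds)     # overstates the average speed

print(harmonic_mean, average_speed, arithmetic_mean)
```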
The Mode
Examples:
1. Find the mode of 5, 3, 5, 8, 9
Mode =5
2. Find the mode of 8, 9, 9, 7, 8, 2, and 5.
It is a bimodal Data: 8 and 9
3. Find the mode of 4, 12, 3, 6, and 7.
No mode for this data.
If data are given in the shape of continuous frequency distribution, the mode is defined as:
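$$\hat{X} = L_{mo} + w\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)$$
Where:
$\hat{X}$ = the mode of the distribution
$L_{mo}$ = the lower class boundary of the modal class
$w$ = the size of the modal class
$\Delta_1 = f_{mo} - f_1$
$\Delta_2 = f_{mo} - f_2$
$f_{mo}$ = frequency of the modal class
$f_1$ = frequency of the class preceding the modal class
$f_2$ = frequency of the class following the modal class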
Example: Following is the distribution of the size of certain farms selected at random from a
district. Calculate the mode of the distribution.
Solutions:
$$\hat{X} = 45 + 10\left(\frac{2}{2 + 26}\right) = 45.71$$
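The grouped-data mode formula can be wrapped in a small function. This is a sketch: the function name `grouped_mode` is hypothetical, and because the frequency table for this example is not reproduced here, the three frequencies passed in below are hypothetical values chosen only so that the differences match the solution (Δ1 = 2, Δ2 = 26).

```python
def grouped_mode(L_mo, w, f_mo, f_prev, f_next):
    """Mode of a grouped distribution: L_mo + w * d1 / (d1 + d2)."""
    d1 = f_mo - f_prev   # modal frequency minus preceding class frequency
    d2 = f_mo - f_next   # modal frequency minus following class frequency
    return L_mo + w * d1 / (d1 + d2)

# L_mo = 45 and w = 10 come from the solution above;
# the frequencies 30, 28, 4 are hypothetical (they give d1 = 2, d2 = 26).
mode = grouped_mode(L_mo=45, w=10, f_mo=30, f_prev=28, f_next=4)
print(round(mode, 2))
```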
Merits:
It is not affected by extreme observations.
Easy to calculate and simple to understand.
It can be calculated for distribution with open end class
Demerits:
It is not rigidly defined.
The Median
- In a distribution, median is the value of the variable which divides it in to two equal halves.
- In an ordered series of data median is an observation lying exactly in the middle of the series.
It is the middle most value in the sense that the number of values less than the median is equal to the
number of values greater than it.
- If X1, X2, …, Xn are the observations, then the numbers arranged in ascending order will be X[1],
X[2], …, X[n], where X[i] is the ith smallest value.
X[1]< X[2]< …<X[n]
- Median is denoted by $\tilde{X}$.
Median for ungrouped data
Solutions:
a) First order the data: 2, 4, 5, 6, 8, 9
Here n=6
$$\tilde{X} = \frac{1}{2}\left(X_{[n/2]} + X_{[n/2 + 1]}\right) = \frac{1}{2}\left(X_{[3]} + X_{[4]}\right) = \frac{1}{2}(5 + 6) = 5.5$$
b) Order the data :1, 2, 3, 5, 8
Here n=5
$$\tilde{X} = X_{[(n+1)/2]} = X_{[3]} = 3$$
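The two cases, even and odd n, can be written as one function. This is a minimal sketch; the function name `median` is illustrative, and the inputs are the values from solutions a) and b) given in an arbitrary order to show that sorting comes first.

```python
def median(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    s = sorted(values)          # order the data first
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]           # X[(n+1)/2] in the 1-based notation used above
    return (s[mid - 1] + s[mid]) / 2   # (X[n/2] + X[n/2 + 1]) / 2

median_a = median([6, 5, 2, 8, 9, 4])   # n = 6 (even)
median_b = median([2, 1, 8, 3, 5])      # n = 5 (odd)
print(median_a, median_b)
```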
If data are given in the shape of continuous frequency distribution, the median is defined as:
$$\tilde{X} = L_{med} + \frac{w}{f_{med}}\left(\frac{n}{2} - c\right)$$
Where:
$L_{med}$ = lower class boundary of the median class.
$w$ = the size of the median class.
$n$ = total number of observations.
$c$ = the cumulative frequency (less than type) preceding the median class.
$f_{med}$ = the frequency of the median class.
Remark:
The median class is the class with the smallest cumulative frequency (less than type) greater than or
equal to $n/2$.
Example: Find the median of the following distribution.
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
Solutions:
First find the less than cumulative frequency.
Identify the median class.
Find median using formula.
$\frac{n}{2} = \frac{75}{2} = 37.5$
39 is the first cumulative frequency to be greater than or equal to 37.5, so
50 – 54 is the median class.
$L_{med} = 49.5$, $w = 5$, $n = 75$, $c = 17$, $f_{med} = 22$
$$\tilde{X} = L_{med} + \frac{w}{f_{med}}\left(\frac{n}{2} - c\right) = 49.5 + \frac{5}{22}(37.5 - 17) = 54.16$$
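The same three steps (cumulate, locate the median class, apply the formula) can be followed in a short script. This is a sketch assuming a unit of measurement U = 1, so that boundaries lie 0.5 outside the class limits; variable names are illustrative.

```python
# Class limits and frequencies from the example
limits = [(40, 44), (45, 49), (50, 54), (55, 59), (60, 64), (65, 69), (70, 74)]
freqs = [7, 10, 22, 15, 12, 6, 3]
U = 1                      # assumed unit of measurement

n = sum(freqs)
half = n / 2

c = 0                      # cumulative frequency before the current class
for (lo, hi), f in zip(limits, freqs):
    if c + f >= half:      # first class whose cumulative frequency reaches n/2
        L_med = lo - U / 2         # lower class boundary
        w = hi - lo + U            # class size
        median = L_med + (w / f) * (half - c)
        break
    c += f

print(L_med, w, c, round(median, 2))
```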
Merits:
Median is a positional average and hence not influenced by extreme observations.
Can be calculated in the case of open end intervals.
Median can be located even if the data are incomplete.
Demerits:
It is not a good representative of data if the number of items is small.
It is not amenable to further algebraic treatment.
It is susceptible to sampling fluctuations.
Quantiles
When a distribution is arranged in order of magnitude of items, the median is the value of the middle
term. Other measures that depend on their positions in the distribution, namely quartiles, deciles, and
percentiles, are collectively called quantiles.
Quartiles:
- Quartiles are measures that divide the frequency distribution in to four equal parts.
- The value of the variables corresponding to these divisions are denoted Q1, Q2, and Q3
often called the first, the second and the third quartile respectively.
- Q1 is a value which has 25% items which are less than or equal to it. Similarly Q2 has
50%items with value less than or equal to it and Q3 has 75% items whose values are less
than or equal to it.
- To find Qi (i = 1, 2, 3) we count $iN/4$ of the classes beginning from the lowest class.
- For grouped data we have the following formula:
$$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{iN}{4} - c\right), \quad i = 1, 2, 3$$
Where:
$L_{Q_i}$ = lower class boundary of the quartile class.
$w$ = the size of the quartile class.
$N$ = total number of observations.
$c$ = the cumulative frequency (less than type) preceding the quartile class.
$f_{Q_i}$ = the frequency of the quartile class.
Remark:
The quartile class (class containing Qi) is the class with the smallest cumulative frequency (less
than type) greater than or equal to $iN/4$.
Deciles:
- Deciles are measures that divide the frequency distribution into ten equal parts.
- The values of the variable corresponding to these divisions are denoted D1, D2, ..., D9, often
called the first, the second, ..., the ninth decile respectively.
- To find Di (i = 1, 2, ..., 9) we count $\frac{iN}{10}$ of the observations beginning from the lowest class.
$$D_i = L_{D_i} + \frac{w}{f_{D_i}}\left(\frac{iN}{10} - c\right),\quad i = 1, 2, \ldots, 9$$
Where:
$L_{D_i}$ = lower class boundary of the decile class.
$w$ = the size of the decile class.
$N$ = total number of observations.
$c$ = the cumulative frequency (less than type) preceding the decile class.
$f_{D_i}$ = the frequency of the decile class.
Remark:
The decile class (class containing Di) is the class with the smallest cumulative frequency (less
than type) greater than or equal to $\frac{iN}{10}$.
Percentiles:
- Percentiles are measures that divide the frequency distribution into one hundred equal parts.
- The values of the variable corresponding to these divisions are denoted P1, P2, ..., P99, often
called the first, the second, ..., the ninety-ninth percentile respectively.
- To find Pi (i = 1, 2, ..., 99) we count $\frac{iN}{100}$ of the observations beginning from the lowest class.
- For grouped data, by analogy with the quartile and decile formulas:
$$P_i = L_{P_i} + \frac{w}{f_{P_i}}\left(\frac{iN}{100} - c\right),\quad i = 1, 2, \ldots, 99$$
Remark:
The percentile class (class containing Pi) is the class with the smallest cumulative frequency
(less than type) greater than or equal to $\frac{iN}{100}$.
Example: Considering the following distribution
Calculate:
a) All quartiles.
b) The 7th decile.
c) The 90th percentile.
Values Frequency
140- 150 17
150- 160 29
160- 170 42
170- 180 72
180- 190 84
190- 200 107
200- 210 49
210- 220 34
220- 230 31
230- 240 16
240- 250 12
Solutions:
First find the less than cumulative frequency.
Use the formula to calculate the required quantile.
Values     Frequency   Cum. Freq. (less than type)
140-150    17          17
150-160    29          46
160-170    42          88
170-180    72          160
180-190    84          244
190-200    107         351
200-210    49          400
210-220    34          434
220-230    31          465
230-240    16          481
240-250    12          493
a) Quartiles:
i. Q1
- Determine the class containing the first quartile.
$$\frac{N}{4} = \frac{493}{4} = 123.25$$
170-180 is the class containing the first quartile.
$$L_{Q_1} = 170,\quad w = 10,\quad N = 493,\quad c = 88,\quad f_{Q_1} = 72$$
$$Q_1 = L_{Q_1} + \frac{w}{f_{Q_1}}\left(\frac{N}{4} - c\right) = 170 + \frac{10}{72}(123.25 - 88) = 174.90$$
ii. Q2
- Determine the class containing the second quartile.
$$\frac{2N}{4} = 246.5$$
190-200 is the class containing the second quartile.
$$L_{Q_2} = 190,\quad w = 10,\quad N = 493,\quad c = 244,\quad f_{Q_2} = 107$$
$$Q_2 = L_{Q_2} + \frac{w}{f_{Q_2}}\left(\frac{2N}{4} - c\right) = 190 + \frac{10}{107}(246.5 - 244) = 190.23$$
iii. Q3
- Determine the class containing the third quartile.
$$\frac{3N}{4} = 369.75$$
200-210 is the class containing the third quartile.
$$L_{Q_3} = 200,\quad w = 10,\quad N = 493,\quad c = 351,\quad f_{Q_3} = 49$$
$$Q_3 = L_{Q_3} + \frac{w}{f_{Q_3}}\left(\frac{3N}{4} - c\right) = 200 + \frac{10}{49}(369.75 - 351) = 203.83$$
b) D7
- Determine the class containing the 7th decile.
$$\frac{7N}{10} = 345.1$$
190-200 is the class containing the seventh decile.
$$L_{D_7} = 190,\quad w = 10,\quad N = 493,\quad c = 244,\quad f_{D_7} = 107$$
$$D_7 = L_{D_7} + \frac{w}{f_{D_7}}\left(\frac{7N}{10} - c\right) = 190 + \frac{10}{107}(345.1 - 244) = 199.45$$
c) P90
- Determine the class containing the 90th percentile.
$$\frac{90N}{100} = 443.7$$
220-230 is the class containing the 90th percentile.
$$L_{P_{90}} = 220,\quad w = 10,\quad N = 493,\quad c = 434,\quad f_{P_{90}} = 31$$
$$P_{90} = L_{P_{90}} + \frac{w}{f_{P_{90}}}\left(\frac{90N}{100} - c\right) = 220 + \frac{10}{31}(443.7 - 434) = 223.13$$
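The grouped-data quantile calculations above can be checked with a short Python sketch. The function name `grouped_quantile` and its parameters are illustrative choices, not part of the module; the sketch assumes equal class widths and uses the lower class limits exactly as the worked example does.

```python
from itertools import accumulate

def grouped_quantile(lower_bounds, width, freqs, i, parts):
    """i-th quantile when the distribution is cut into `parts` equal parts.

    Implements L + (w / f) * (i*N/parts - c), where c is the less-than
    cumulative frequency preceding the quantile class.
    """
    n_total = sum(freqs)
    target = i * n_total / parts
    cum = list(accumulate(freqs))  # less-than cumulative frequencies
    # Quantile class: first class whose cumulative frequency is >= target.
    k = next(idx for idx, cf in enumerate(cum) if cf >= target)
    c = cum[k - 1] if k > 0 else 0
    return lower_bounds[k] + width / freqs[k] * (target - c)

# Distribution from the example: classes 140-150, ..., 240-250.
lower = list(range(140, 250, 10))
freq = [17, 29, 42, 72, 84, 107, 49, 34, 31, 16, 12]

q1, q2, q3 = (grouped_quantile(lower, 10, freq, i, 4) for i in (1, 2, 3))
d7 = grouped_quantile(lower, 10, freq, 7, 10)
p90 = grouped_quantile(lower, 10, freq, 90, 100)
print(round(q1, 2), round(q2, 2), round(q3, 2), round(d7, 2), round(p90, 2))
```

The same function covers quartiles, deciles and percentiles because only the number of parts (4, 10 or 100) changes.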
The measures of dispersion which are expressed in terms of the original unit of a series are
termed as absolute measures. Such measures are not suitable for comparing the variability of
two distributions which are expressed in different units of measurement and different average
size. Relative measures of dispersions are a ratio or percentage of a measure of absolute
dispersion to an appropriate measure of central tendency and are thus pure numbers
independent of the units of measurement. For comparing the variability of two distributions
(even if they are measured in the same unit), we compute the relative measure of dispersion
instead of absolute measures of dispersion.
Various measures of dispersions are in use. The most commonly used measures of dispersions
are:
1) Range and relative range
2) Quartile deviation and coefficient of Quartile deviation
3) Mean deviation and coefficient of Mean deviation
4) Standard deviation and coefficient of variation.
4.1.The Range (R)
The range is the largest score minus the smallest score. It is a quick and dirty measure of
variability, although when a test is given back to students they very often wish to know the
range of scores. Because the range is greatly affected by extreme scores, it may give a distorted
picture of the scores. The following two distributions have the same range, 13, yet appear to
differ greatly in the amount of variability.
Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45
Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45
For this reason, among others, the range is not the most important measure of variability.
Merits:
It is rigidly defined.
It is easy to calculate and simple to understand.
Demerits:
It is not based on all observations.
It is highly affected by extreme observations.
It is affected by fluctuation in sampling.
It is not liable to further algebraic treatment.
It cannot be computed in the case of open-end distributions.
It is very sensitive to the size of the sample.
The inter quartile range is the difference between the third and the first quartiles of a set of
items and semi-inter quartile range is half of the inter quartile range.
$$Q.D = \frac{Q_3 - Q_1}{2}$$
Coefficient of Quartile Deviation (C.Q.D)
$$C.Q.D = \frac{(Q_3 - Q_1)/2}{(Q_3 + Q_1)/2} = \frac{2 \cdot Q.D}{Q_3 + Q_1} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
The quartile deviation gives the average amount by which the two quartiles differ from the median.
Example: Compute Q.D and its coefficient for the following distribution.
Values Freq.
140- 150 17
150- 160 29
160- 170 42
170- 180 72
180- 190 84
Remark: Q.D or C.Q.D includes only the middle 50% of the observation.
The mean deviation of a set of items is defined as the arithmetic mean of the values of the
absolute deviations from a given average. Depending up on the type of averages used we have
different mean deviations.
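Mean deviation about the mean:
$$M.D(\bar{X}) = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$$
For the case of frequency distribution it is given as:
$$M.D(\bar{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \bar{X}|}{n}$$
Mean deviation about the median:
$$M.D(\tilde{X}) = \frac{\sum_{i=1}^{n} |X_i - \tilde{X}|}{n}$$
For the case of frequency distribution it is given as:
$$M.D(\tilde{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \tilde{X}|}{n}$$
Steps to calculate M.D($\tilde{X}$):
1. Find the median, $\tilde{X}$.
2. Find the deviations of each reading from $\tilde{X}$.
3. Find the arithmetic mean of the deviations, ignoring sign.
Mean deviation about the mode:
$$M.D(\hat{X}) = \frac{\sum_{i=1}^{n} |X_i - \hat{X}|}{n}$$
For the case of frequency distribution it is given as:
$$M.D(\hat{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \hat{X}|}{n}$$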
Steps to calculate M.D ( X̂ ):
Examples:
1. The following are the numbers of visits made by ten mothers to the local doctor's surgery: 8, 6,
5, 5, 7, 4, 5, 9, 7, 4.
Find the mean deviation about the mean, median and mode.
Solutions:
First calculate the three averages
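$$\bar{X} = 6,\quad \tilde{X} = 5.5,\quad \hat{X} = 5$$
Xi (ordered)   4    4    5    5    5    6    7    7    8    9    Total
|Xi - 6|       2    2    1    1    1    0    1    1    2    3    14
|Xi - 5.5|     1.5  1.5  0.5  0.5  0.5  0.5  1.5  1.5  2.5  3.5  14
|Xi - 5|       1    1    0    0    0    1    2    2    3    4    14
$$M.D(\bar{X}) = \frac{\sum_{i=1}^{10} |X_i - 6|}{10} = \frac{14}{10} = 1.4$$
$$M.D(\tilde{X}) = \frac{\sum_{i=1}^{10} |X_i - 5.5|}{10} = \frac{14}{10} = 1.4$$
$$M.D(\hat{X}) = \frac{\sum_{i=1}^{10} |X_i - 5|}{10} = \frac{14}{10} = 1.4$$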
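As a check on Example 1, the three mean deviations can be computed with Python's standard `statistics` module. This is a minimal sketch; the helper name `mean_deviation` is an illustrative choice.

```python
from statistics import mean, median, mode

def mean_deviation(data, center):
    """Arithmetic mean of the absolute deviations of `data` from `center`."""
    return sum(abs(x - center) for x in data) / len(data)

visits = [8, 6, 5, 5, 7, 4, 5, 9, 7, 4]

md_mean = mean_deviation(visits, mean(visits))      # about the mean (6)
md_median = mean_deviation(visits, median(visits))  # about the median (5.5)
md_mode = mean_deviation(visits, mode(visits))      # about the mode (5)
print(md_mean, md_median, md_mode)
```

For this particular data set all three deviations coincide; in general they differ.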
2. Find mean deviation about mean, median and mode for the following distributions.(exercise)
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
$$C.M.D = \frac{M.D}{\text{Average about which deviations are taken}}$$
$$C.M.D(\bar{X}) = \frac{M.D(\bar{X})}{\bar{X}}$$
$$C.M.D(\tilde{X}) = \frac{M.D(\tilde{X})}{\tilde{X}}$$
$$C.M.D(\hat{X}) = \frac{M.D(\hat{X})}{\hat{X}}$$
Example: calculate the C.M.D about the mean, median and mode for the data in example 1
above.
Solutions:
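$$C.M.D = \frac{M.D}{\text{Average about which deviations are taken}}$$
$$C.M.D(\bar{X}) = \frac{M.D(\bar{X})}{\bar{X}} = \frac{1.4}{6} = 0.233$$
$$C.M.D(\tilde{X}) = \frac{M.D(\tilde{X})}{\tilde{X}} = \frac{1.4}{5.5} = 0.255$$
$$C.M.D(\hat{X}) = \frac{M.D(\hat{X})}{\hat{X}} = \frac{1.4}{5} = 0.28$$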
The Variance
Population Variance
If we divide the variation (the sum of squared deviations from the mean) by the number of values
in the population, we get the population variance. This variance is the "average squared deviation from the mean".
$$\text{Population variance} = \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$$
Sample Variance
One would expect the sample variance to simply be the population variance with the
population mean replaced by the sample mean. However, one of the major uses of a statistic is
to estimate the corresponding parameter, and that formula tends to underestimate the
population variance. To counteract this, the sum of the squares of the deviations is
divided by one less than the sample size.
$$\text{Sample variance} = S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
Equivalent computational forms:
$$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1},\ \text{for raw data.}$$
$$S^2 = \frac{\sum_{i=1}^{k} f_i X_i^2 - n\bar{X}^2}{n-1},\ \text{for frequency distribution.}$$
4.2.Standard Deviation
There is a problem with variances. Recall that the deviations were squared. That means that the
units were also squared. To get the units back the same as the original data values, the square
root must be taken.
$$\text{Population standard deviation} = \sigma = \sqrt{\sigma^2}$$
$$\text{Sample standard deviation} = S = \sqrt{S^2}$$
Examples: Find the variance and standard deviation of the following sample data
1. 5, 17, 12, 10.
2. The data is given in the form of frequency distribution.
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
Solutions:
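1. $\bar{X} = 11$
Xi              5    10   12   17   Total
(Xi - X̄)^2     36   1    1    36   74
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{74}{3} = 24.67$$
$$S = \sqrt{S^2} = \sqrt{24.67} = 4.97$$
2. $\bar{X} = 55$
Xi (C.M)        42    47   52   57   62   67   72   Total
fi              7     10   22   15   12   6    3    75
(Xi - X̄)^2     169   64   9    4    49   144  289
fi(Xi - X̄)^2   1183  640  198  60   588  864  867  4400
$$S^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^2}{n-1} = \frac{4400}{74} = 59.46$$
$$S = \sqrt{S^2} = \sqrt{59.46} = 7.71$$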
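Both variance examples can be reproduced with one small function. The sketch below is illustrative: `sample_variance` is a hypothetical helper that treats raw data as a frequency distribution in which every frequency equals 1, and it uses the class marks 42, 47, ..., 72 for the grouped case.

```python
from math import sqrt

def sample_variance(values, freqs=None):
    """Sample variance with divisor n - 1; `freqs` is optional for grouped data."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    x_bar = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - x_bar) ** 2 for x, f in zip(values, freqs)) / (n - 1)

# Example 1: raw data.
raw_var = sample_variance([5, 17, 12, 10])
raw_sd = sqrt(raw_var)

# Example 2: class marks with their frequencies.
grouped_var = sample_variance([42, 47, 52, 57, 62, 67, 72], [7, 10, 22, 15, 12, 6, 3])
grouped_sd = sqrt(grouped_var)

print(round(raw_var, 2), round(raw_sd, 2), round(grouped_var, 2), round(grouped_sd, 2))
```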
1. $$\frac{\sum (X_i - \bar{X})^2}{n-1} < \frac{\sum (X_i - A)^2}{n-1},\quad A \neq \bar{X}$$
2. For normal (symmetric) distribution the following holds.
Approximately 68.27% of the data values fall within one standard deviation of the mean,
i.e. within $(\bar{X} - S,\ \bar{X} + S)$.
Approximately 95.45% of the data values fall within two standard deviations of the mean,
i.e. within $(\bar{X} - 2S,\ \bar{X} + 2S)$.
Approximately 99.73% of the data values fall within three standard deviations of the mean,
i.e. within $(\bar{X} - 3S,\ \bar{X} + 3S)$.
3. Chebyshev's Theorem
For any data set, no matter what the pattern of variation, the proportion of the values that fall
within k standard deviations of the mean, i.e. within $(\bar{X} - kS,\ \bar{X} + kS)$, will be at least $1 - \frac{1}{k^2}$,
where k is a number greater than 1; i.e. the proportion of items falling beyond k standard
deviations of the mean is at most $\frac{1}{k^2}$.
Example: Suppose a distribution has mean 50 and standard deviation 6. What percent of the
numbers are?
a) Between 38 and 62
b) Between 32 and 68
c) Less than 38 or more than 62.
d) Less than 32 or more than 68.
Solutions:
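a) 38 and 62 are at equal distance from the mean, 50, and this distance is 12.
$$kS = 12 \Rightarrow k = \frac{12}{S} = \frac{12}{6} = 2$$
Applying the above theorem, at least $\left(1 - \frac{1}{k^2}\right) \times 100\% = 75\%$ of the numbers lie
between 38 and 62.
b) Similarly, $k = \frac{18}{6} = 3$, so at least $\left(1 - \frac{1}{9}\right) \times 100\% \approx 88.9\%$ of the numbers lie between 32 and 68.
c) It is just the complement of a), i.e. at most $\frac{1}{k^2} \times 100\% = 25\%$ of the numbers are less
than 38 or more than 62.
d) Similarly, it is the complement of b), i.e. at most about 11.1% of the numbers are less than 32 or more than 68.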
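The Chebyshev bounds in this example can be sketched in Python as follows. The function name `chebyshev_lower_bound` is an illustrative choice, and the sketch only handles intervals that are symmetric about the mean, as in the example.

```python
def chebyshev_lower_bound(mean, sd, low, high):
    """Minimum proportion of values inside (low, high) by Chebyshev's theorem.

    The interval must be symmetric about the mean, with k > 1.
    """
    k = (high - mean) / sd
    if abs((mean - low) / sd - k) > 1e-12 or k <= 1:
        raise ValueError("interval must be symmetric about the mean with k > 1")
    return 1 - 1 / k ** 2

inside_a = chebyshev_lower_bound(50, 6, 38, 62)   # part a), k = 2
outside_a = 1 - inside_a                          # part c), complement of a)
inside_b = chebyshev_lower_bound(50, 6, 32, 68)   # part b), k = 3
outside_b = 1 - inside_b                          # part d), complement of b)
print(inside_a, outside_a, round(inside_b, 4), round(outside_b, 4))
```

The values are bounds, not exact proportions: the theorem guarantees "at least" for the inside proportion and "at most" for the outside proportion.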
Exercise: The scores on a special test of knowledge of wood refinishing have a mean of
53 and a standard deviation of 6. Find the range of values in which at least 75% of the scores will
lie.
Exercise: Verify each of the above relationships, considering k and a as constants.
Examples:
known to be 12 gm and 3 gm respectively. New set of capsules of another drug are obtained
by the linear transformation Yi = 2Xi – 0.5 ( i = 1, 2, …, n ) then what will be the standard
deviation of the new set of capsules.
2. The mean and the standard deviation of a set of numbers are respectively 500 and 10.
a) If 10 are added to each of the numbers in the set, then what will be the variance and
standard deviation of the new set?
b) If each of the numbers in the set are multiplied by -5, then what will be the variance and
standard deviation of the new set?
Solutions:
The coefficient of variation (C.V) is defined as the ratio of the standard deviation to the mean, usually expressed as a percentage.
$$C.V = \frac{S}{\bar{X}} \times 100$$
The distribution having less C.V is said to be less variable or more consistent.
Example: An analysis of the monthly wages paid (in Birr) to workers in two firms A and B
belonging to the same industry gives the following results
Solutions:
Calculate coefficient of variation for both firms.
$$C.V_A = \frac{S_A}{\bar{X}_A} \times 100 = \frac{10}{52.5} \times 100 = 19.05\%$$
$$C.V_B = \frac{S_B}{\bar{X}_B} \times 100 = \frac{11}{47.5} \times 100 = 23.16\%$$
Since C.VA < C.VB, in firm B there is greater variability in individual wages.
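The comparison of the two firms can be written as a short Python sketch. The helper name `coefficient_of_variation` is illustrative; the means and standard deviations are the ones used in the solution above.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_a = coefficient_of_variation(10, 52.5)  # firm A
cv_b = coefficient_of_variation(11, 47.5)  # firm B

# The firm with the larger C.V has the greater relative variability in wages.
more_variable = "A" if cv_a > cv_b else "B"
print(round(cv_a, 2), round(cv_b, 2), more_variable)
```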
City 1 25 24 23 26 17
City2 22 21 24 22 20
City3 32 27 35 24 28
Which city has the most consistent temperature, based on these data?
$$Z = \frac{X - \mu}{\sigma},\ \text{for population.}$$
$$Z = \frac{X - \bar{X}}{S},\ \text{for sample.}$$
Z gives the deviations from the mean in units of standard deviation.
Z gives the number of standard deviations by which a particular observation lies above or below
the mean.
It is used to compare two observations coming from different groups.
Examples:
1. Two sections were given introduction to statistics examinations. The following information
was given.
Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking,
who performed better?
Solutions:
Calculate the standard score of both students.
$$Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{90 - 78}{6} = 2$$
$$Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{95 - 90}{5} = 1$$
Student A performed better relative to his section because the score of student A is two
standard deviations above the mean score of his section while, the score of student B is only
one standard deviation above the mean score of his section.
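The standard-score comparison can be sketched in Python. The helper name `z_score` is illustrative; the section means (78 and 90) and standard deviations (6 and 5) are the values used in the solution above.

```python
def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_a = z_score(90, 78, 6)  # student A, section 1
z_b = z_score(95, 90, 5)  # student B, section 2

# The larger standard score indicates the better performance relative to the section.
better = "A" if z_a > z_b else "B"
print(z_a, z_b, better)
```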
2. Two groups of people were trained to perform a certain task and tested to find out which
group is faster to learn the task. For the two groups the following information was given:
Value Group one Group two
Relatively speaking:
a) Which group is more consistent in its performance
b) Suppose a person A from group one take 9.2 minutes while person B
from Group two take 9.3 minutes, who was faster in performing the
task? Why?
Solutions:
$$Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{9.2 - 10.4}{1.2} = -1$$
$$Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{9.3 - 11.9}{1.3} = -2$$
Person B is faster relative to the group because the time taken by person B is two standard deviations shorter than
the average time taken by group 2, while the time taken by person A is only one standard
deviation shorter than the average time taken by group 1.
L1.5. ELEMENTARY PROBABILITY
$A = \{1, 3, 5\}$
$B = \{2, 4, 6\}$
$C = \{\}$ or $\emptyset$, the empty set or impossible event
Remark: If S (sample space) has n members then there are exactly $2^n$ subsets or events.
6. Equally Likely Events: Events which have the same chance of occurring.
7. Complement of an Event: the complement of an event A means non-occurrence of A and is
denoted by $A'$, or $A^c$, or $\bar{A}$. It contains those points of the sample space which don't belong
to A.
8. Elementary Event: an event having only a single element or sample point.
9. Mutually Exclusive Events: Two events which cannot happen at the same time.
10. Independent Events: Two events are independent if the occurrence of one does not affect
the probability of the other occurring.
11. Dependent Events: Two events are dependent if the first event affects the outcome or
occurrence of the second event in a way the probability is changed.
Solution
a) S={1,2,3,4,5,6}
b) S={(HH),(HT),(TH),(TT)}
c) S={t /t≥0}
Sample space can be
Countable ( finite or infinite)
Uncountable.
Counting Rules
In order to calculate probabilities, we have to know
The number of elements of an event
The number of elements of the sample space.
That is in order to judge what is probable, we have to know what is possible.
In order to determine the number of outcomes, one can use several rules of counting.
5 4 3 2
Permutation
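2. The number of permutations of n distinct objects taken r at a time is
$$_nP_r = \frac{n!}{(n - r)!}$$
3. The number of permutations of n objects in which $k_1$ are alike, $k_2$ are alike, etc. is
$$\frac{n!}{k_1! \, k_2! \cdots k_n!}$$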
Example:
1. Suppose we have the letters A, B, C, D.
a) How many permutations are there taking all the four?
b) How many permutations are there if two letters are used at a time?
2. How many different permutations can be made from the letters in the word
"CORRECTION"?
Solutions:
1. a) Here n = 4; there are four distinct objects, so there are 4! = 24 permutations.
b) Here n = 4, r = 2, so there are
$$_4P_2 = \frac{4!}{(4 - 2)!} = \frac{24}{2} = 12 \text{ permutations.}$$
2. Here n = 10, of which 2 are C, 2 are O, 2 are R, 1 E, 1 T, 1 I, 1 N:
$$k_1 = 2,\ k_2 = 2,\ k_3 = 2,\ k_4 = k_5 = k_6 = k_7 = 1$$
Using the 3rd rule of permutation, there are
$$\frac{10!}{2! \, 2! \, 2! \, 1! \, 1! \, 1! \, 1!} = 453600 \text{ permutations.}$$
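The permutation rules used in these solutions map directly onto Python's `math` module (`math.perm` requires Python 3.8 or later). The helper `multiset_permutations` is an illustrative name for the third rule.

```python
from collections import Counter
from math import factorial, perm

def multiset_permutations(word):
    """n! / (k1! * k2! * ...), where each k counts a repeated letter."""
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)
    return total

all_four = factorial(4)                           # A, B, C, D taken all at once
two_at_a_time = perm(4, 2)                        # 4P2
correction = multiset_permutations("CORRECTION")  # C, O and R each appear twice
print(all_four, two_at_a_time, correction)
```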
Combination
Permutations          Combinations
AB  BA  CA  DA        AB  BC
AC  BC  CB  DB        AC  BD
AD  BD  CD  DC        AD  DC
Note that in permutation AB is different from BA. But in combination AB is the same as BA.
Combination Rule
The number of combinations of r objects selected from n objects is denoted by $_nC_r$ or $\binom{n}{r}$
and is given by the formula:
$$\binom{n}{r} = \frac{n!}{(n - r)! \, r!}$$
Examples:
1. In how many ways can a committee of 5 people be chosen out of 9 people?
Solutions:
$$n = 9,\quad r = 5$$
$$\binom{n}{r} = \frac{n!}{(n - r)! \, r!} = \frac{9!}{4! \, 5!} = 126 \text{ ways}$$
2. Among 15 clocks there are two defectives .In how many ways can an inspector chose three
of the clocks for inspection so that:
a) There is no restriction.
b) None of the defective clock is included.
c) Only one of the defective clocks is included.
d) Two of the defective clock is included.
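Solutions: n = 15, of which 2 are defective and 13 are non-defective; and r = 3.
a) If there is no restriction, select three clocks from 15 clocks, and this can be done in:
$$n = 15,\quad r = 3$$
$$\binom{n}{r} = \frac{n!}{(n - r)! \, r!} = \frac{15!}{12! \, 3!} = 455 \text{ ways}$$
b) None of the defective clocks is included.
This is equivalent to zero defective and three non-defective, which can be done in:
$$\binom{2}{0} \binom{13}{3} = 286 \text{ ways.}$$
c) Only one of the defective clocks is included.
This is equivalent to one defective and two non-defective, which can be done in:
$$\binom{2}{1} \binom{13}{2} = 156 \text{ ways.}$$
d) Two of the defective clocks are included.
This is equivalent to two defective and one non-defective, which can be done in:
$$\binom{2}{2} \binom{13}{1} = 13 \text{ ways.}$$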
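The combination counts in both examples can be verified with `math.comb` (Python 3.8 or later). The variable names below are illustrative.

```python
from math import comb

# Example 1: committee of 5 chosen from 9 people.
committee = comb(9, 5)

# Example 2: 3 clocks chosen from 15, of which 2 are defective and 13 are not.
no_restriction = comb(15, 3)
none_defective = comb(2, 0) * comb(13, 3)
one_defective = comb(2, 1) * comb(13, 2)
two_defective = comb(2, 2) * comb(13, 1)

print(committee, no_restriction, none_defective, one_defective, two_defective)
```

As a consistency check, the three restricted cases are mutually exclusive and exhaustive, so their counts add up to the unrestricted count.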
Approaches to measuring Probability
There are 3 different conceptual approaches to the study of probability theory. These are:
The classical approach.
The frequentist approach.
The subjective approach.
$$S = \{1, 2, 3, 4, 5, 6\},\quad N = n(S) = 6$$
a) Let A be the event of number 4.
$$A = \{4\},\quad N_A = n(A) = 1,\quad P(A) = \frac{n(A)}{n(S)} = \frac{1}{6}$$
b) Let A be the event of odd numbers.
$$A = \{1, 3, 5\},\quad N_A = n(A) = 3,\quad P(A) = \frac{n(A)}{n(S)} = \frac{3}{6} = 0.5$$
c) Let A be the event of even numbers.
$$A = \{2, 4, 6\},\quad N_A = n(A) = 3,\quad P(A) = \frac{n(A)}{n(S)} = \frac{3}{6} = 0.5$$
d) Let A be the event of number 8.
$$A = \{\},\quad N_A = n(A) = 0,\quad P(A) = \frac{n(A)}{n(S)} = \frac{0}{6} = 0$$
2. A box of 80 candles consists of 30 defective and 50 non-defective candles. If 10 of these
candles are selected at random, what is the probability that
a) All will be defective.
b) 6 will be non-defective.
c) All will be non-defective.
Solutions:
$$\text{Total selection} = \binom{80}{10} = N = n(S)$$
a) Let A be the event that all will be defective.
$$\text{Total ways in which A occurs} = \binom{30}{10} \binom{50}{0} = N_A = n(A)$$
$$P(A) = \frac{n(A)}{n(S)} = \frac{\binom{30}{10} \binom{50}{0}}{\binom{80}{10}} = 0.00001825$$
b) Let A be the event that 6 will be non-defective.
$$\text{Total ways in which A occurs} = \binom{30}{4} \binom{50}{6} = N_A = n(A)$$
$$P(A) = \frac{n(A)}{n(S)} = \frac{\binom{30}{4} \binom{50}{6}}{\binom{80}{10}} \approx 0.2645$$
c) Let A be the event that all will be non-defective.
$$\text{Total ways in which A occurs} = \binom{30}{0} \binom{50}{10} = N_A = n(A)$$
$$P(A) = \frac{n(A)}{n(S)} = \frac{\binom{30}{0} \binom{50}{10}}{\binom{80}{10}} = 0.00624$$
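The three candle probabilities follow one pattern: choose k defective candles from 30 and the remaining 10 - k from the 50 non-defective ones, then divide by the number of ways to choose 10 from 80. A minimal Python sketch, with `prob_defective_count` as an illustrative helper name:

```python
from math import comb

def prob_defective_count(k, draws=10, defective=30, good=50):
    """Classical probability of exactly k defective candles among `draws` selected."""
    favourable = comb(defective, k) * comb(good, draws - k)
    return favourable / comb(defective + good, draws)

p_all_defective = prob_defective_count(10)  # part a)
p_six_good = prob_defective_count(4)        # part b): 6 non-defective means 4 defective
p_all_good = prob_defective_count(0)        # part c)
print(p_all_defective, p_six_good, p_all_good)
```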
Short coming of the classical approach:
This approach is not applicable when:
- The total number of outcomes is infinite.
- Outcomes are not equally likely.
2. Records show that 60 out of 100,000 bulbs produced are defective. What is the
probability that a newly produced bulb is defective?
Solution: Let A be the event that the newly produced bulb is defective.
$$P(A) = \lim_{N \to \infty} \frac{N_A}{N} \approx \frac{60}{100{,}000} = 0.0006$$
Subjective approach - this is a type of probability based on the beliefs of the person making the
probability assessment. Subjective probability assessments are often found when events occur
only once or at most a few times.
The disadvantage of subjective probability is that two or more persons facing the same
evidence or problem may arrive at different probabilities. That is, for the same problem there may be
different decisions.
Conditional probability and Independency
Conditional Events: If the occurrence of one event affects the probability of occurrence of the
other event, then the two events are conditional or dependent events.
Example: Suppose we have two red and three white balls in a bag.
1. Draw a ball with replacement
Since the first drawn ball is replaced for a second draw, it doesn't affect the second
draw. For this reason A and B are independent. Then if we let
A = the event that the first draw is red, $p(A) = \frac{2}{5}$
B = the event that the second draw is red, $p(B) = \frac{2}{5}$
2. Draw a ball without replacement
This is conditional because the first drawn ball is not replaced for the second draw,
so it does affect the second draw. If we let
A = the event that the first draw is red, $p(A) = \frac{2}{5}$
B = the event that the second draw is red, $p(B) = ?$
then the probability that the second draw is red given that the first draw is red is $p(B \mid A) = \frac{1}{4}$.
The conditional probability of an event A given that B has already occurred, denoted by
$p(A \mid B)$, is
$$p(A \mid B) = \frac{p(A \cap B)}{p(B)},\quad p(B) \neq 0$$
Remark: (1) $p(A' \mid B) = 1 - p(A \mid B)$ (2) $p(B' \mid A) = 1 - p(B \mid A)$
Examples 1. For a student enrolling as a freshman at a certain university, the probability is 0.25
that he/she will get a scholarship and 0.75 that he/she will graduate. The probability is 0.2 that
he/she will get a scholarship and will also graduate. What is the probability that a student who
gets a scholarship will graduate?
Exercise: A lot consists of 20 defective and 80 non-defective items from which two items are
chosen without replacement. Events A & B are defined as A = the first item chosen is
defective, B = the second item chosen is defective
a) What is the probability that both items are defective?
b) What is the probability that the second item is defective?
Note: for any two events A and B the following relation holds.
$$p(B) = p(B \mid A) \, p(A) + p(B \mid A') \, p(A')$$
Probability of Independent Events
Here $p(A \mid B) = p(A)$ and $p(B \mid A) = p(B)$.
Example: A box contains four black and six white balls. What is the probability of getting two
black balls in drawing one after the other under the following conditions?
a. The first ball drawn is not replaced
b. The first ball drawn is replaced
Solution: Let A = first drawn ball is black
B = second drawn ball is black
Required: $p(A \cap B)$
a. $p(A \cap B) = p(B \mid A) \, p(A) = \left(\frac{3}{9}\right)\left(\frac{4}{10}\right) = \frac{2}{15}$
b. $p(A \cap B) = p(A) \, p(B) = \left(\frac{4}{10}\right)\left(\frac{4}{10}\right) = \frac{4}{25}$
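The two cases of this example can be checked with exact fractions. This is a minimal sketch using Python's `fractions` module; the variable names are illustrative.

```python
from fractions import Fraction

black, white = 4, 6
total = black + white

p_a = Fraction(black, total)                  # first ball is black
p_b_given_a = Fraction(black - 1, total - 1)  # second is black, first not replaced

# a. Without replacement: p(A and B) = p(B | A) * p(A)
without_replacement = p_b_given_a * p_a

# b. With replacement the draws are independent: p(A and B) = p(A) * p(B)
with_replacement = p_a * p_a

print(without_replacement, with_replacement)
```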
Inference is the process of making interpretations or conclusions from sample data for the
totality of the population.
It is only the sample data that is ready for inference.
In statistics there are two ways through which inference can be made.
Statistical estimation
Statistical hypothesis testing.
[Figure: diagram relating population, sample, numerical data, analyzed data, and inference]
Data analysis is the process of extracting relevant information from the summarized data.
Statistical Estimation
This is one way of making inference about the population parameter where the investigator
does not have any prior notion about values or characteristics of the population parameter.
There are two types of estimation.
1) Point Estimation
It is a procedure that results in a single value as an estimate for a parameter.
2) Interval estimation
It is the procedure that results in an interval of values as an estimate for a parameter, that
is, an interval that contains the likely values of the parameter. It deals with identifying the upper
and lower limits of a parameter. The limits themselves are random variables.
Definitions
Confidence Interval: An interval estimate with a specific level of confidence
Confidence Level: The percent of the time that the true value will lie in the interval
estimate given.
Consistent Estimator: An estimator which gets closer to the value of the parameter as the
sample size increases.
Degrees of Freedom: The number of data values which are allowed to vary once a statistic
has been determined.
Estimator: A sample statistic which is used to estimate a population parameter. It must be
unbiased, consistent, and relatively efficient.
Estimate: Any one of the different possible values which an estimator can assume.
Interval Estimate: A range of values used to estimate a parameter.
Point Estimate: A single value used to estimate a parameter.
Relatively Efficient Estimator: The estimator for a parameter with the smallest variance.
Unbiased Estimator: An estimator whose expected value is the value of the parameter
being estimated.
Point Estimation
Another term for statistic is point estimate, since we are estimating the parameter value. A
point estimator is the mathematical way we compute the point estimate. For instance, the sum of
$x_i$ over n is the point estimator used to compute the estimate of the population mean, $\mu$. That
is, $\bar{X} = \frac{\sum x_i}{n}$ is a point estimator of the population mean.
Confidence interval estimation of the population mean
Although $\bar{X}$ possesses nearly all the qualities of a good estimator, because of sampling error,
we know that it's not likely that our sample statistic will be equal to the population parameter,
but instead will fall into an interval of values. We will have to be satisfied knowing that the
statistic is "close to" the parameter. That leads to the obvious question, what is "close"?
We can phrase the latter question differently: How confident can we be that the value of the
statistic falls within a certain "distance" of the parameter? Or, what is the probability that the
parameter's value is within a certain range of the statistic's value? This range is the confidence
interval.
The confidence level is the probability that the value of the parameter falls within the range
specified by the confidence interval surrounding the statistic.
There are different cases to be considered to construct confidence intervals.
Case 1: If sample size is large or if the population is normal with known variance
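Recall the Central Limit Theorem, which applies to the sampling distribution of the mean of a
sample. Consider samples of size n drawn from a population whose mean is $\mu$ and standard
deviation is $\sigma$, with replacement and order important. The population can have any frequency
distribution. The sampling distribution of $\bar{X}$ will have a mean $\mu_{\bar{X}} = \mu$ and a standard
deviation $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$, and approaches a normal distribution as n gets large. This allows us to
use the normal distribution curve for computing confidence intervals.
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \text{ has a normal distribution with mean} = 0 \text{ and variance} = 1$$
$$\Rightarrow \mu = \bar{X} \pm Z \frac{\sigma}{\sqrt{n}} = \bar{X} \pm \varepsilon, \text{ where } \varepsilon = Z \frac{\sigma}{\sqrt{n}} \text{ is a measure of error.}$$
- For the interval estimator to be good the error should be small. How can it be made small?
By making n large
Small variability
Taking Z small
- To obtain the value of Z, we have to attach this to a theory of chance. That is, there is an area of size
$1 - \alpha$ such that
$$P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = 1 - \alpha$$
Where $\alpha$ is the probability that the parameter lies outside the interval, and
$Z_{\alpha/2}$ stands for the standard normal variable to the right of which
$\alpha/2$ probability lies, i.e. $P(Z > Z_{\alpha/2}) = \alpha/2$.
$$\Rightarrow P\left(-Z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < Z_{\alpha/2}\right) = 1 - \alpha$$
$$\Rightarrow P\left(\bar{X} - Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
So $\left(\bar{X} - Z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right)$ is a $100(1 - \alpha)\%$ confidence interval for $\mu$.
But usually $\sigma^2$ is not known; in that case we estimate it by its point estimator $S^2$.
Here are the Z values corresponding to the most commonly used confidence levels.
100(1 - α)%   α      α/2     Z_{α/2}
90            0.10   0.05    1.645
95            0.05   0.025   1.96
99            0.01   0.005   2.58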
The unit of measurement of the confidence interval is the standard error. This is just the
standard deviation of the sampling distribution of the statistic.
Examples:
1. From a normal sample of size 25 a mean of 32 was found .Given that the population
standard deviation is 4.2. Find
a) A 95% confidence interval for the population mean.
b) A 99% confidence interval for the population mean.
Solution:
a)
b)
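A minimal Python sketch of the Case 1 interval, $\bar{X} \pm Z_{\alpha/2} \, \sigma / \sqrt{n}$, applied to the numbers of Example 1 (n = 25, $\bar{X}$ = 32, $\sigma$ = 4.2). The function name `z_confidence_interval` is an illustrative choice; it obtains $Z_{\alpha/2}$ from `statistics.NormalDist` (Python 3.8 or later) instead of a printed table, so the limits may differ in the last decimal place from values computed with a rounded table entry.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean when the population sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}
    margin = z * sigma / sqrt(n)             # measure of error
    return x_bar - margin, x_bar + margin

ci_95 = z_confidence_interval(32, 4.2, 25, 0.95)
ci_99 = z_confidence_interval(32, 4.2, 25, 0.99)
print([round(v, 2) for v in ci_95], [round(v, 2) for v in ci_99])
```

The 99% interval is wider than the 95% interval, reflecting the larger value of Z needed for the higher confidence level.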
Test statistic: a statistic whose value depends on sample data and which is used to evaluate a
statistical hypothesis.
There are two types of hypothesis:
Null hypothesis:
- It is the hypothesis to be tested.
- It is the hypothesis of equality or the hypothesis of no difference.
- Usually denoted by H0.
Alternative hypothesis:
- It is the hypothesis available when the null hypothesis has to be rejected.
- It is the hypothesis of difference.
- Usually denoted by H1 or Ha.
In practice we set $\alpha$ at some value and design a test that minimizes $\beta$. This is because a type I
error is often considered to be more serious, and therefore more important to avoid, than a type
II error.
Suppose the assumed or hypothesized value of $\mu$ is denoted by $\mu_0$; then one can formulate two-sided (1)
and one-sided (2 and 3) hypotheses as follows:
1. $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$
2. $H_0: \mu = \mu_0$ vs $H_1: \mu > \mu_0$
3. $H_0: \mu = \mu_0$ vs $H_1: \mu < \mu_0$
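Case 2: When sampling is from a normal distribution with $\sigma^2$ unknown and small sample size
- The relevant test statistic is
$$t_{cal} = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t \text{ with } n - 1 \text{ degrees of freedom.}$$
- After specifying $\alpha$ we have the following regions on the student t-distribution
corresponding to the above three hypotheses.
H0 Reject H0 if Accept H0 if Inconclusive if
$$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}},\ \text{if } \sigma^2 \text{ is known.}$$
$$Z_{cal} = \frac{\bar{X} - \mu_0}{S / \sqrt{n}},\ \text{if } \sigma^2 \text{ is unknown.}$$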
- The decision rule is the same as case I.
Examples:
1. Test the hypothesis that the average content of containers of a certain lubricant is 10 liters if
the contents of a random sample of 10 containers are 10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4,
10.3, and 9.8 liters. Use the 0.01 level of significance and assume that the distribution of contents
is normal.
Solution:
Step 1: State the hypotheses: $H_0: \mu = 10$ vs $H_1: \mu \neq 10$
Step 2: Select the level of significance, $\alpha = 0.01$ (given)
Step 3: Select an appropriate test statistic
The t-statistic is appropriate because the population variance is not known and the sample size is
also small.
Step 4: Identify the critical region.
Here we have two critical regions since we have a two-tailed hypothesis.
Step 5: Computations:
$$\bar{X} = 10.06,\quad S = 0.25$$
$$t_{cal} = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} = \frac{10.06 - 10}{0.25 / \sqrt{10}} = 0.76$$
Step 6: Decision
Accept H0 , since tcal is in the acceptance region.
Step 7: Conclusion
At the 1% level of significance, we have no evidence to say that the average content of
containers of the given lubricant is different from 10 liters, based on the given sample data.
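The computations of this example can be sketched with Python's standard library. Two assumptions are made here: the critical value 3.250 is the tabulated $t_{0.005}$ with 9 degrees of freedom from a standard t-table (the standard library has no t-distribution), and S is kept unrounded, so $t_{cal}$ comes out near 0.77 and differs slightly from the 0.76 obtained with S rounded to 0.25.

```python
from math import sqrt
from statistics import mean, stdev

contents = [10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, 9.8]
mu_0 = 10          # hypothesized mean
t_critical = 3.250  # assumed table value: t_{0.005} with n - 1 = 9 d.f.

n = len(contents)
x_bar = mean(contents)
s = stdev(contents)  # sample standard deviation (divisor n - 1)
t_cal = (x_bar - mu_0) / (s / sqrt(n))

# Two-tailed test: reject H0 only if |t_cal| exceeds the critical value.
reject_h0 = abs(t_cal) > t_critical
print(round(x_bar, 2), round(s, 2), round(t_cal, 2), reject_h0)
```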
2. The mean life time of a sample of 16 fluorescent light bulbs produced by a company is computed
to be 1570 hours. The population standard deviation is 120 hours. Suppose the hypothesized value
for the population mean is 1600 hours. Can we conclude that the life time of light bulbs is
decreasing?
(Use $\alpha = 0.05$ and assume the normality of the population)
Solution:
$H_0: \mu = 1600$ vs $H_1: \mu < 1600$
Step 2: Select the level of significance, $\alpha = 0.05$ (given)
Step 3: Select an appropriate test statistic
The Z-statistic is appropriate because the population variance is known.
Step 4: Identify the critical region.
$$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} = \frac{1570 - 1600}{120 / \sqrt{16}} = -1.0$$
Step 6: Decision
Accept H0, since Zcal is in the acceptance region.
Step 7: Conclusion
At the 5% level of significance, we have no evidence to say that the life time of light bulbs is
decreasing, based on the given sample data.
Exercise: It is known in a pharmacological experiment that rats fed with a particular diet over a
certain period gain an average of 40 gms in weight. A new diet was tried on a sample of 20 rats
yielding a weight gain of 43 gms with variance 7 gms. Test the hypothesis that the new diet is an
improvement assuming normality.
Test of Association
                        B
A       B1    B2    ...   Bj    ...   Bc    Total
A1      O11   O12   ...   O1j   ...   O1c   R1
A2      O21   O22   ...   O2j   ...   O2c   R2
...
Ai      Oi1   Oi2   ...   Oij   ...   Oic   Ri
...
Ar      Or1   Or2   ...   Orj   ...   Orc   Rr
Total   C1    C2    ...   Cj    ...   Cc    n
- The chi-square test is used to test the hypothesis of independence of two attributes.
For instance, we may be interested in:
Whether the presence or absence of hypertension is independent of
smoking habit or not.
Whether the size of the family is independent of the level of education
attained by the mothers.
Whether there is association between father and son regarding boldness.
Examples:
1. A geneticist took a random sample of 300 men to study whether there is association between
father and son regarding boldness. He obtained the following results.
            Son
Father    Bold   Not
Bold      85     59
Not       65     91
Using $\alpha = 5\%$, test whether there is association between father and son regarding boldness.
Solution:
Solution:
$H_0$: There is no association between the size of the family and the level of
education attained by fathers.
$H_1$: not $H_0$.
- First calculate the row and column totals
$$R_1 = 83,\ R_2 = 117,\ C_1 = 45,\ C_2 = 96,\ C_3 = 59$$
- Then calculate the expected frequencies ($e_{ij}$'s)
$$e_{ij} = \frac{R_i \times C_j}{n}$$
$$e_{11} = 18.675,\ e_{12} = 39.84,\ e_{13} = 24.485$$
$$e_{21} = 26.325,\ e_{22} = 56.16,\ e_{23} = 34.515$$
- Obtain the calculated value of the chi-square.
$$\chi^2_{cal} = \sum_{i=1}^{2} \sum_{j=1}^{3} \left[\frac{(O_{ij} - e_{ij})^2}{e_{ij}}\right]$$
$$= \frac{(14 - 18.675)^2}{18.675} + \frac{(37 - 39.84)^2}{39.84} + \ldots + \frac{(27 - 34.515)^2}{34.515} = 6.3$$
- Obtain the tabulated value of chi-square
$$\alpha = 0.05$$
$$\text{Degrees of freedom} = (r - 1)(c - 1) = 1 \times 2 = 2$$
$$\chi^2_{0.05}(2) = 5.99 \text{ from table.}$$
- The decision is to reject $H_0$ since $\chi^2_{cal} > \chi^2_{0.05}(2)$
Conclusion: At 5% level of significance we have evidence to say there is association between
the size of the family and the level of education attained by fathers, based on this sample data.
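The same steps (totals, expected frequencies, calculated chi-square, comparison with the table value) can be sketched in Python for the father and son table of Example 1. This is only an illustrative sketch: the function name `chi_square_statistic` is hypothetical, and the critical value 3.841 is the standard table value for α = 0.05 with 1 degree of freedom, typed in by hand rather than computed.

```python
# Observed frequencies from Example 1
# (rows: father bold / not, columns: son bold / not).
observed = [[85, 59],
            [65, 91]]


def chi_square_statistic(table):
    """Return (calculated chi-square, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


chi2_cal, df = chi_square_statistic(observed)

# Tabulated chi-square value for alpha = 0.05 and 1 degree of freedom.
CRITICAL_5PCT_DF1 = 3.841

print(round(chi2_cal, 3), df)        # 9.028 1
print(chi2_cal > CRITICAL_5PCT_DF1)  # True, so H0 (no association) is rejected
```

Under these assumptions the calculated value exceeds the tabulated one, so at the 5% level the sample suggests an association between father and son.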
7.2. Correlation Analysis
Correlation analysis deals with measuring the closeness of the relationship that is
described by the regression equation.
We say there is correlation if the two series of items vary together, directly or inversely.
When higher values of X are associated with higher values of Y and lower values of X
are associated with lower values of Y, then the correlation is said to be positive or
direct.
Examples:
- Income and expenditure
- Number of hours spent in studying and the score obtained
- Height and weight
- Distance covered and fuel consumed by car.
When higher values of X are associated with lower values of Y and lower values of X
are associated with higher values of Y, then the correlation is said to be negative or
inverse.
Examples:
- Demand and supply
- Income and the proportion of income spent on food.
The correlation between X and Y, as measured by the correlation coefficient r defined below, may be one of the following:
1. Perfect positive (r = 1)
2. Positive (r between 0 and 1)
3. No correlation (r = 0)
4. Negative (r between -1 and 0)
5. Perfect negative (r = -1)
The presence of correlation between two variables may be due to three reasons:
1. One variable being the cause of the other. The cause is called the "subject" or
"independent" variable, while the effect is called the "dependent" variable.
2. Both variables being the result of a common cause. That is, the correlation that
exists between two variables is due to their being related to some third force.
Example:
Let X1 = ESLCE result
Y1 = rate of surviving in the University
Y2 = the rate of getting a scholarship.
X1 and Y1, and likewise X1 and Y2, have high positive correlation. Y1 and Y2 are also
positively correlated, although they are not directly related; they are related to each other
only via X1.
3. Chance. The correlation arises purely by coincidence, with no real relationship between
the variables.
Examples:
- Price of teff in Addis Ababa and grade of students in USA.
- Weight of individuals in Ethiopia and income of individuals in Kenya.
Therefore, when interpreting a correlation coefficient, it is necessary to consider whether any
real relationship is likely to exist between the variables under study.
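The correlation coefficient between X and Y, denoted by r, is given by
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
and the short cut formula is
$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2]\,[n\sum Y^2 - (\sum Y)^2]}}$
or, equivalently,
$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{[\sum X^2 - n\bar{X}^2]\,[\sum Y^2 - n\bar{Y}^2]}}$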
Remark: r always lies between -1 and 1 inclusive, and it is symmetric (the correlation between X and Y equals that between Y and X).
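Interpretation of r
1. Perfect positive linear relationship (if r = 1)
2. Some positive linear relationship (if r is between 0 and 1)
3. No linear relationship (if r = 0)
4. Some negative linear relationship (if r is between -1 and 0)
5. Perfect negative linear relationship (if r = -1)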
Examples:
1. Calculate the simple correlation between mid semester and final exam scores of 10 students
(both out of 50)
$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{[\sum X^2 - n\bar{X}^2]\,[\sum Y^2 - n\bar{Y}^2]}}$
$= \frac{10331 - 10(31.2)(32.9)}{\sqrt{(9920 - 10(973.4))\,(11003 - 10(1082.4))}}$
$= \frac{66.2}{182.5} = 0.363$
This means mid semester exam and final exam scores have a slightly positive correlation.
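As a cross-check, the short cut formula for r can be sketched in Python using the summary values of this example. The function name `pearson_r_from_sums` is hypothetical, and the totals ΣX = 312 and ΣY = 329 are not quoted in the text; they are assumed here as n times the stated means (10 × 31.2 and 10 × 32.9).

```python
import math

# Summary values for the 10 students in the worked example.
n = 10
sum_x, sum_y = 312, 329        # assumed: 10 * 31.2 and 10 * 32.9
sum_x2, sum_y2 = 9920, 11003   # sum of X^2 and sum of Y^2
sum_xy = 10331                 # sum of XY


def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Short cut formula: r computed from sums only, no deviations needed."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 3))  # 0.363
```

The result agrees with the value obtained by hand from the means-based form of the formula.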
Exercise: The following data were collected from a certain household on the monthly income
(X) and consumption (Y) for the past 10 months. Compute the simple correlation coefficient.
X: 650 654 720 456 536 853 735 650 536 666
Y: 450 523 235 398 500 632 500 635 450 360
The above formula and procedure are applicable only to quantitative data. When we have
qualitative data such as efficiency, honesty, or intelligence, we calculate what is called
Spearman's rank correlation coefficient as follows:
7.3. Steps
i. Rank the different items in X and Y.
ii. Find the difference of the ranks in each pair, and denote them by Di.
iii. Use the following formula:
$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)}$
Where $r_s$ = coefficient of rank correlation
D = the difference between paired ranks
n = the number of pairs
Example:
Aster and Almaz were asked to rank 7 different types of lipsticks. See if there is correlation
between the tastes of the ladies.
Lipstick types A B C D E F G
Aster 2 1 4 3 5 7 6
Almaz 1 3 2 4 5 6 7
Solution:
X (R1)   Y (R2)   D = R1 - R2   D^2
2        1         1            1
1        3        -2            4
4        2         2            4
3        4        -1            1
5        5         0            0
7        6         1            1
6        7        -1            1
Total                           12
$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(12)}{7(48)} = 0.786$
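The rank correlation steps can be sketched in Python with the rankings given by Aster and Almaz. The function name `spearman_rs` is hypothetical, and the sketch assumes the inputs are already ranks with no ties, as in this example.

```python
# Ranks given to lipstick types A to G.
aster = [2, 1, 4, 3, 5, 7, 6]
almaz = [1, 3, 2, 4, 5, 6, 7]


def spearman_rs(ranks_x, ranks_y):
    """Spearman's rank correlation for untied ranks: 1 - 6*sum(D^2) / (n(n^2 - 1))."""
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


rs = spearman_rs(aster, almaz)
print(round(rs, 3))  # 0.786
```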
- The linear model is:
$Y = \alpha + \beta X + \varepsilon$
Where: Y = dependent variable
X = independent variable
α = regression constant
β = regression slope
ε = random disturbance term
$Y \sim N(\alpha + \beta X, \sigma^2)$
$\varepsilon \sim N(0, \sigma^2)$
The model is estimated from sample data by the fitted line $\hat{Y} = a + bX$, where a is a
constant which gives the value of Y when X = 0; it is called the Y-intercept. b is a constant
indicating the slope of the regression line, and it gives a measure of the change
in Y for a unit change in X. It is also called the regression coefficient of Y on X.
$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$
$a = \bar{Y} - b\bar{X}$
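The formulas for b and a can be sketched in Python as follows. For illustration only, the sketch reuses the monthly income (X) and consumption (Y) figures from the correlation exercise earlier in this section; treating consumption as the dependent variable is an assumption made here, and the function name `fit_line` is hypothetical.

```python
# Monthly income (X) and consumption (Y) from the correlation exercise.
income = [650, 654, 720, 456, 536, 853, 735, 650, 536, 666]
consumption = [450, 523, 235, 398, 500, 632, 500, 635, 450, 360]


def fit_line(x, y):
    """Return (a, b) for the fitted line Y-hat = a + b*X using the short cut formulas."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(income, consumption)
print(f"Y-hat = {a:.4f} + {b:.4f} X")
```

Because a is defined as the mean of Y minus b times the mean of X, the fitted line always passes through the point of means, which gives a quick way to check a hand calculation.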
Example 1: The following data shows the score of 12 students for Accounting and Statistics
examinations.
Accounting (X)   Statistics (Y)   X^2   Y^2   XY
a)
The coefficient of correlation (r) has a value of 0.92. This indicates that the two variables are
positively correlated (Y increases as X increases).
b)
$\hat{Y} = 7.0194 + 0.9560X$
For X = 85: $\hat{Y} = 7.0194 + 0.9560(85) = 88.28$
Exercise: A car rental agency is interested in studying the relationship between the distance
driven in kilometer (Y) and the maintenance cost for their cars (X in birr). The following
summarized information is given based on samples of size 5.
$\sum_{i=1}^{5} X_i^2 = 147{,}000{,}000, \quad \sum_{i=1}^{5} Y_i^2 = 314$
- To know how far the regression equation has been able to explain the variation in Y, we use a
measure called the coefficient of determination ($r^2$), i.e.
$r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$
Where r = the simple correlation coefficient.
- $r^2$ gives the proportion of the variation in Y explained by the regression of Y on X.
- $1 - r^2$ gives the unexplained proportion and is called the coefficient of indetermination.
Example: For the above problem (Example 1): r = 0.9194
$r^2 = 0.8453$, so 84.53% of the variation in Y is explained and only 15.47% remains
unexplained; the latter is accounted for by the random term.
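o Covariance of X and Y measures the co-variability of X and Y together. It is denoted by
$S_{XY}$ and given by
$S_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} = \frac{\sum XY - n\bar{X}\bar{Y}}{n - 1}$
i. $r = \frac{S_{XY}}{S_X S_Y}$, $r^2 = \frac{S_{XY}^2}{S_X^2 S_Y^2}$
ii. $r = \frac{b S_X}{S_Y}$, $b = \frac{r S_Y}{S_X}$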
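The relationships between the covariance, r and b can be checked numerically. The sketch below again borrows the income (X) and consumption (Y) figures from the correlation exercise purely as sample input; the helper name `sample_cov` is hypothetical, and the standard deviations are taken as sample values with divisor n - 1, matching the covariance formula.

```python
import math

# Monthly income (X) and consumption (Y) from the correlation exercise.
income = [650, 654, 720, 456, 536, 853, 735, 650, 536, 666]
consumption = [450, 523, 235, 398, 500, 632, 500, 635, 450, 360]


def sample_cov(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


s_xy = sample_cov(income, consumption)
s_x = math.sqrt(sample_cov(income, income))            # standard deviation of X
s_y = math.sqrt(sample_cov(consumption, consumption))  # standard deviation of Y

r = s_xy / (s_x * s_y)  # relation (i)
b = s_xy / s_x ** 2     # slope of Y on X; the (n - 1) divisors cancel

print(math.isclose(b, r * s_y / s_x))  # True, relation (ii)
print(math.isclose(r, b * s_x / s_y))  # True
```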
o When we fit the regression of X on Y, we interchange X and Y in all formulas, i.e. we
fit
$\hat{X} = a_1 + b_1 Y$
$b_1 = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2}$
$a_1 = \bar{X} - b_1\bar{Y}$, $r = \frac{b_1 S_Y}{S_X}$
Here X is dependent and Y is independent.
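7.5. Choice of Dependent and Independent variable
Let $b_{YX}$ denote the slope of the regression of Y on X and $b_{XY}$ the slope of the regression of X on Y.
Then $r = \frac{b_{YX} S_X}{S_Y} = \frac{b_{XY} S_Y}{S_X}$, so that $r^2 = b_{YX} \times b_{XY}$
- Moreover, $b_{YX}$ and $b_{XY}$ are completely different numerically as well as conceptually.
1. If the correlation is perfect positive, i.e. r = 1, then the b values are reciprocals of each
other.
2. The two regression lines pass through the common point $(\bar{X}, \bar{Y})$.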
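Example: The regression lines between height (X) in inches and weight (Y) in lbs of male
students are:
$4Y - 15X + 530 = 0$ and
$20X - 3Y - 975 = 0$
Determine which is the regression of Y on X and which is the regression of X on Y.
Solution
We will assume one of the equations to be the regression of X on Y and the other to be the
regression of Y on X, and calculate r.
$4Y - 15X + 530 = 0 \Rightarrow X = \frac{530}{15} + \frac{4}{15}Y \Rightarrow b_{XY} = \frac{4}{15}$
$20X - 3Y - 975 = 0 \Rightarrow Y = -\frac{975}{3} + \frac{20}{3}X \Rightarrow b_{YX} = \frac{20}{3}$
$r^2 = b_{XY} \times b_{YX} = \left(\frac{4}{15}\right)\left(\frac{20}{3}\right) = 1.78 > 1$
This is impossible (contradiction). Hence our assumption is not correct. Thus
$4Y - 15X + 530 = 0$ is the regression of Y on X
$20X - 3Y - 975 = 0$ is the regression of X on Y
To verify:
$4Y - 15X + 530 = 0 \Rightarrow Y = -\frac{530}{4} + \frac{15}{4}X \Rightarrow b_{YX} = \frac{15}{4}$
$20X - 3Y - 975 = 0 \Rightarrow X = \frac{975}{20} + \frac{3}{20}Y \Rightarrow b_{XY} = \frac{3}{20}$
$r^2 = b_{YX} \times b_{XY} = \left(\frac{15}{4}\right)\left(\frac{3}{20}\right) = \frac{9}{16} \in (0, 1)$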
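The trial-and-check argument above can be sketched in Python with exact fractions. This is a sketch under stated assumptions: the minus signs of the two equations are taken to be 4Y - 15X + 530 = 0 and 20X - 3Y - 975 = 0 (the combination consistent with the positive slopes 4/15, 20/3, 15/4 and 3/20 used in the solution), and the function names are hypothetical.

```python
from fractions import Fraction

# Each line is stored as (p, q, c) for the equation p*X + q*Y + c = 0.
line1 = (Fraction(-15), Fraction(4), Fraction(530))   # 4Y - 15X + 530 = 0
line2 = (Fraction(20), Fraction(-3), Fraction(-975))  # 20X - 3Y - 975 = 0


def slope_y_on_x(line):
    """Slope when the line is solved for Y (treated as regression of Y on X)."""
    p, q, _ = line
    return -p / q


def slope_x_on_y(line):
    """Slope when the line is solved for X (treated as regression of X on Y)."""
    p, q, _ = line
    return -q / p


def identify(line_a, line_b):
    """Return (y_on_x, x_on_y, r_squared), keeping the assignment with 0 <= r^2 <= 1."""
    r2 = slope_y_on_x(line_a) * slope_x_on_y(line_b)
    if 0 <= r2 <= 1:
        return line_a, line_b, r2
    return line_b, line_a, slope_y_on_x(line_b) * slope_x_on_y(line_a)


y_on_x, x_on_y, r2 = identify(line1, line2)
print(y_on_x == line1, r2)  # True 9/16
```

Using `Fraction` keeps the products exact, so the comparison with 1 is not affected by rounding.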