Introduction To Statistics Hand Out 2022 Alebachew A.
Email: [email protected]/[email protected]
Chapter One
Introduction
Definition of terms
Data: facts or figures from which conclusions can be drawn. Data are the numerical results of
any scientific measurement. Any value that is expressed in numbers is called data.
Population: the totality of all elements under study.
Sample: a portion or part of the population taken so that some generalization about the
population can be made. It is a subset of the population which is assumed to be
representative of the population.
Definition of Statistics
Statistics can be defined in two senses: plural (as Statistical Data) and singular (as Statistical
Methods).
Plural sense: Statistics are collections of facts (figures). This meaning of the word is widely used
when reference is made to facts and figures on sales, employment or unemployment, accidents,
weather, deaths, education, etc. Eg: Sales Statistics, Labor Statistics, Employment Statistics,
etc. In this sense the word Statistics serves simply as data. But not all numerical data are
statistics.
Singular sense: Statistics is the science that deals with the methods of collection,
organization, presentation, analysis and interpretation of data. It refers to the subject area that is
concerned with extracting relevant information from available data with the aim of making sound
decisions. According to this meaning, statistics is concerned with the development and
application of methods and techniques for collecting, organizing, presenting, analyzing and
interpreting statistical data.
1. Collection of Data: This is the first stage in any statistical investigation and involves the
process of obtaining (gathering) a set of related measurements or counts to meet
predetermined objectives. The data collected may be primary data (data collected directly by
the investigator) or secondary data (data obtained from intermediate sources such
as newspapers, journals, official records, etc.).
2. Organization of Data: It is usually not possible to derive any conclusion about the main
features of the data from direct inspection of the observations. The second purpose of
statistics is describing the properties of the data in a summary form.
This stage of statistical investigation helps to have a clear understanding of the information
gathered and includes editing (correcting), classifying and tabulating the collected data in a
systematic manner.
Thus the first step in the organization of data is editing. It means correcting (adjusting)
omissions, inconsistencies, irrelevant answers and wrong computations in the collected data.
The second step of the organization of data is classification, that is, arranging the collected
data according to some common characteristics. The last step of the organization of data is
presenting the classified data in tabular form, using rows and columns (tabulation).
3. Presentation of Data: The purpose of data presentation is to have an overview of what the data
actually look like, and to facilitate statistical analysis. Data presentation can be done using
Graphs and Diagrams, which are easy to remember and facilitate comparison.
4. Analysis of Data: The analysis of data is the extraction of summarized and comprehensive
numerical description in order to reach conclusions or provide answers to a problem. The
problem may require simple or sophisticated mathematical expressions.
5. Interpretation of Data: This is the last stage of statistical investigation. Interpretation
involves drawing conclusions from the data collected and analyzed in order to make decisions.
Classification of Statistics
Based on the scope of the decision, statistics can be classified into two: Descriptive and
Inferential Statistics.
Descriptive Statistics refers to the procedures used to organize and summarize masses of data.
It is concerned with describing or summarizing the most important features of the data. It deals
only with the characteristics of the collected data without going beyond them. That is, this part deals with
only describing the data collected without going any further: that is, without attempting to
infer (conclude) anything that goes beyond the data themselves. The methodology of descriptive
statistics includes the methods of organizing (classification, tabulation, Frequency Distributions)
and presenting (Graphical and Diagrammatic Presentation) data and calculations of certain
indicators of data like Measures of Central Tendency and Measures of Dispersion (Variation)
which summarize some important features of the data.
Inferential (Inductive) Statistics includes the methods used to find out something about a
population, based on the sample. It is concerned with drawing statistically valid conclusions
about the characteristics of the population based on information obtained from a sample. In this
form of statistical analysis, descriptive statistics is linked with probability theory in order to
generalize the results of the sample to the population. Performing hypothesis testing, determining
relationships between variables and making predictions are also part of inferential statistics.
c. Of the students enrolled in Haramaya University in this year 74% are male and 26% are
female.
d. The chance of winning the Ethiopian National Lottery in any day is 1 out of 167000.
Applications of Statistics
In this modern time, statistical information plays a very important role in a wide range of fields.
Today, statistics is applied in almost all fields of human endeavor.
Uses of Statistics
Condenses and summarizes masses of data and presents facts in numerical and definite form
Facilitates comparison: statistical devices such as averages, percentages, ratios, etc. are
used for this purpose.
Formulating and testing hypotheses: For instance, hypotheses such as whether a new medicine is
effective in curing a disease, or whether there is an association between variables, can be tested
using statistical tools.
Determining the relationship between two or more variables.
Forecasting: Statistical methods help in studying past data and predicting future trends.
Limitations of Statistics
Variable
A variable is a characteristic or an attribute that can assume different values. Eg: Height, Family size,
Gender
Based on the values that variables assume, variables can be classified as
1. Qualitative variables: do not assume numeric values. Eg: Gender
2. Quantitative variables: assume numeric values. These variables are numeric in nature.
Eg: Height, Family size
Discrete variable: takes whole number values and consists of distinct recognizable
individual elements that can be counted. It is a variable that assumes a finite or countable
number of possible values. These values are obtained by counting (0, 1, 2. . .).
Eg: Family size, Number of children in a family, number of cars at the traffic light
Continuous variable: takes any value including decimals. Such a variable can theoretically
assume an infinite number of possible values. These values are obtained by measuring.
Eg: Height, Weight, Time, and Temperature
Generally the values of a variable can be obtained either by counting for discrete variables, by
measuring for continuous variables or by making categories for qualitative variables.
Ex: Classify each of the following as Qualitative or Quantitative, and if it is quantitative classify
it as Discrete or Continuous.
to find the average shirt numbers (or the average shirt number is nothing) because the numbers
on the shirts are simply codes but it is possible to obtain the average test score.
1. Nominal Scales of variables are those qualitative variables which show the category of
individuals. They reflect classification into categories (names of groups) where there is no
particular order or qualitative difference between the labels. Numbers may be assigned to the
variables simply for coding purposes. It is not possible to compare individuals based on the
numbers assigned to them. The only mathematical operation permissible on these variables is
counting.
These variables:
Have mutually exclusive (non-overlapping) and exhaustive categories.
No ranking or order between (among) the values of the variable.
Eg: Gender, Religion, ID No, Ethnicity, Color
2. Ordinal Scales of variables are also qualitative variables, but their values can be ordered
and ranked. Ranking and counting are the only mathematical operations that can be done on the
values of the variables. But there is no precise difference between the values (categories) of
the variable.
Eg: Academic qualifications (B.Sc., M.Sc., Ph.D.), Grade Scores (A, B, C, D, F), Strength
(very weak, weak, strong, very strong), Health status (very sick, sick, cured), Economic
status (lower class, middle class, higher class group).
3. Interval Scales of variables are those quantitative variables for which a value of zero
does not show absence of the characteristic, i.e. there is no true zero. Zero indicates a
low value rather than none. There is a precise difference between the units of measurement (levels).
Eg: Temperature: 0°C does not mean there is no temperature; it means it is very cold.
4. Ratio Scales of variables are those quantitative variables for which a value of
zero shows absence of the characteristic, i.e. there is a true zero.
Eg: Height, Weight, Income, Amount of yield, Expenditure, Consumption.
All mathematical operations are allowed on the values of the variables.
Data Types
Based on the source, data can be classified into two: Primary Data and Secondary Data.
Primary data are data collected for the first time either through direct observation or by
enquiring individuals. It refers to the data collected either by or under the direct supervision
and instruction of the researcher.
Secondary data are data obtained from published or unpublished sources like newspapers,
journals, official records, etc.
Based on the role of time, data can be classified as Cross-sectional and Time series.
Cross-sectional Data: is a set of observations taken at a point of time.
Time series Data: is a set of observations collected for a sequence of time usually at equal
intervals.
The first and foremost task in statistical investigation is data collection. Before data collection,
four important points should be considered. These are the purpose of data collection (why we
need to collect data), the data to be collected (what kind of data to be collected), the source of
data (where we can get the data) and the methods of data collection (how can we collect this
data). These steps are called the why, what, where and how of the data collection.
Primary data are collected from primary sources and secondary data from secondary sources.
Primary data can be collected through experimental methods in laboratory in natural sciences
and through survey method in social sciences.
The survey methods of data collection are personal interview, telephone interview, mailed
questionnaire and personal observation.
Observational Method: This method involves monitoring of an ongoing activity and direct
recording of data. It avoids incompleteness of data. However, it is rarely used as it is not
possible to plan when the events will happen.
Personal Interview: a trained interviewer asks a series of questions and records responses on
a specially designed form called a questionnaire. In this approach the enumerator is with the
respondent and explains any points which are not clear to the respondent. In this approach
the quality of the data is affected by both the design of the questionnaire and the quality of the
interviewer. It has the advantage of obtaining information in depth from the person being
interviewed, since we can clarify the questions, and it avoids
incomplete and disordered responses.
Disadvantage:
It is costlier than other methods, since it requires training of interviewers and transportation
costs.
The respondent may not tell us the real information for sensitive questions, since there is face
to face interaction. Eg: When asked about salary, a respondent whose salary is very small might
report a wrong figure out of embarrassment.
Telephone Interview: This method involves contacting the respondent on telephone and
collecting information. It is faster to collect information. The absence of telephone lines
makes this approach less usable. It also cannot be used for rural surveys.
Advantage: It is less costly, since it requires fewer interviewers and the cost of calling
is less than the cost of transportation. The respondent may give his/her opinion candidly since there
is no face to face interaction. Because of this, the data we get through this method are more
realistic than those from the personal interview.
Disadvantage: this method is not applicable in developing countries because of the lack of access
to telephone. The respondent might not be in his/her house or may not respond to the call, and in
the meantime the interviewer might get bored. There is a high chance of getting incomplete
response, since the connection can be interrupted.
Mailed Questionnaire: the researcher sends the questionnaire to the respondents; the
respondents complete the form and send it back to the researcher. Costs are low. The
responses are free from biases of the interviewer and respondents can have more time to give
well thought answers. But it is applicable only for educated persons, and it suffers from
non-response, partial response and low return rates.
Disadvantage: the respondents might give inappropriate answers to questions; since no
one is there with them, they may misunderstand a question and answer it incorrectly.
Chapter Two
Data Organization and Presentation
In order to describe situations, draw conclusions or make inferences about the population, or even to
describe the sample, the collected data must be organized in some meaningful way. The most
convenient way of organizing data is to construct a frequency distribution.
A frequency distribution (FD) is the organization of raw data in table form, using classes and frequencies.
There are three types of frequency distributions; categorical, ungrouped (discrete or frequency
array) and grouped (continuous) frequency distributions.
Categorical FD:-a FD in which the data is qualitative i.e. either nominal or ordinal. Each
category of the variable represents a single class and the number of times each category repeats
represents the frequency of that class (category).
Eg 1:-The blood type of 24 students is given below
A B B AB O A
O O B AB B A
B B O A O AB
A O O O AB O
Class(Blood type) Frequency(number of students)
A 5
B 6
AB 4
O 9
Total 24
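The tally in the table above can be reproduced programmatically. The following is a minimal Python sketch, assuming the 24 blood types are typed in as a list; the variable names are illustrative and not part of the handout.

```python
from collections import Counter

# Blood types of the 24 students, row by row as listed in Eg 1.
blood_types = [
    "A", "B", "B", "AB", "O", "A",
    "O", "O", "B", "AB", "B", "A",
    "B", "B", "O", "A", "O", "AB",
    "A", "O", "O", "O", "AB", "O",
]

# Counter maps each category (class) to the number of times it occurs.
frequency = Counter(blood_types)

for blood_type in ["A", "B", "AB", "O"]:
    print(blood_type, frequency[blood_type])
print("Total", sum(frequency.values()))
```

Each distinct category becomes one class, and its count is the class frequency, exactly as in the definition of a categorical FD.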
Eg 2: Construct FD for the following letter grade of 25 students
A B C C C
C B B A D
A C C A B
F C C A B
Ungrouped FD (Frequency Array):- A FD of numerical data (quantitative) in which each value
of a variable represents a single class (i.e. the values of the variable are not grouped) and the
number of times each value repeats represents the frequency of that class.
Eg:-Number of children for 21 families.
2 3 5 4 3 3 2
3 1 0 4 3 2 2
1 1 1 4 2 2 2
Class(Number of children) Frequency(Number of families)
0 1
1 4
2 7
3 5
4 3
5 1
Total 21
Grouped (Continuous) FD: - A FD of numerical data in which several values of a variable are
grouped into one class. The number of observations belonging to the class is the frequency of the
class.
Eg:-Consider age group and number of persons.
Class Limits Class Boundaries Frequency
(Age in years) (Age in years) (number of persons)
1-25 0.5-25.5 20
26-50 25.5-50.5 15
51-75 50.5-75.5 25
76-100 75.5-100.5 10
Total 70
Class Limits:-The lowest and highest values that can be included in a class are called Class
Limits. The lowest values are called Lower Class Limits and the highest values are called Upper
Class Limits. Class limit for the first class 1-25; Lower class limit 1; Upper class limit 25
Class Boundaries:-are class limits adjusted so that there is no gap between the UCL of one class and
the LCL of the next class. The lowest values are called Lower Class Boundaries and the
highest values are called Upper Class Boundaries. Class boundary for the first class 0.5-25.5;
Lower class boundary 0.5; Upper class boundary 25.5
Class Width (Class Size):-the difference between UCB and LCB of a class. It is also the
difference between the lower limits of two consecutive classes or it is the difference between
upper limits of two consecutive classes.
$W = UCB - LCB$ or $W = LCL_i - LCL_{i-1}$ or $W = UCL_i - UCL_{i-1}$
For the above Eg W=25.5-0.5=25 or W=26-1=25 or W=50-25=25
Class Mark (Class Mid point):-is the half way between the class limits or the class boundaries.
$CM = \frac{LCL + UCL}{2}$ or $CM = \frac{LCB + UCB}{2}$. Note that $W = CM_i - CM_{i-1}$.
Class Limits Class Boundaries Class Mark Frequency
1-25 0.5-25.5 13 20
26-50 25.5-50.5 38 15
51-75 50.5-75.5 63 25
76-100 75.5-100.5 88 10
Total 70
Relative frequency: - is the ratio of class frequency to the total frequency (total number of
observations).
Percentage frequency: - Relative frequency ×100%
Class Limits Class Boundaries Class Mark Frequency Relative frequency Percentage frequency
1-25 0.5-25.5 13 20 20/70 28.57
26-50 25.5-50.5 38 15 15/70 21.43
51-75 50.5-75.5 63 25 25/70 35.71
76-100 75.5-100.5 88 10 10/70 14.29
Total 70 70/70=1 100
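As a quick check of the two definitions, the relative and percentage frequencies of the age table can be computed as in the sketch below. The dictionary layout and the rounding to two decimals are assumptions made for illustration.

```python
# Class frequencies from the age table (class limits -> number of persons).
frequencies = {"1-25": 20, "26-50": 15, "51-75": 25, "76-100": 10}

total = sum(frequencies.values())

# Relative frequency = class frequency / total frequency.
relative = {cls: f / total for cls, f in frequencies.items()}

# Percentage frequency = relative frequency x 100, rounded here to 2 decimals.
percentage = {cls: round(100 * rf, 2) for cls, rf in relative.items()}

for cls in frequencies:
    print(cls, frequencies[cls], f"{frequencies[cls]}/{total}", percentage[cls])
```

The relative frequencies always sum to 1, so the percentage frequencies sum to 100 up to rounding.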
Cumulative frequency: is the sum of frequencies (total number of observations) below or above
a certain value.
Less than Cumulative Frequency: is the total number of values of a variable below a certain
UCB.
More than Cumulative Frequency: - is the total number of values of a variable above a certain
LCB.
Class Class Class Frequency Less than More than
Limits Boundaries Mark Cum. Freq. Cum. Freq.
1-25 0.5-25.5 13 20 20 10+25+15+20=70
26-50 25.5-50.5 38 15 20+15=35 10+25+15=50
51-75 50.5-75.5 63 25 20+15+25=60 10+25=35
76-100 75.5-100.5 88 10 20+15+25+10=70 10
Total 70
Construction of Grouped Frequency Distribution
1. Arrange the data in an array form (increasing or decreasing order).
2. Find the Unit of Measurement (U). U is the smallest difference between any two distinct
values of the data.
3. Find the Range(R): R is the maximum numerical difference in the data set, i.e. the difference
between the largest and the smallest values of the variable.
4. Determine the number of classes (K) using Sturges' Rule.
$K = 1 + 3.322 \log N$, where N is the total number of observations and log is the base-10 logarithm.
5. Specify the class width (W): $W = \frac{R}{K}$
6. Put the smallest value of the data set as the LCL of the first class. To obtain the LCL of the
second class add the class width W to the LCL of the first class. Continue adding until you
get K classes. Let X be the smallest observation; $LCL_1 = X$; $LCL_i = LCL_{i-1} + W$ for i = 2, 3, …, K.
7. Obtain the UCLs of the FD by adding W-U to the corresponding LCLs.
$UCL_i = LCL_i + (W - U)$ for i = 1, 2, …, K.
8. Generate the class boundaries. $LCB_i = LCL_i - \frac{1}{2}U$ and $UCB_i = UCL_i + \frac{1}{2}U$ for i = 1, 2, …, K.
Example 1: Marks of 50 students (out of 40). Construct a
grouped frequency distribution.
16 21 26 24 11 17 25 26 13 27 24 26 3 27 23 24 15 22 22 12 22 29 18 22 28 25 7
17 22 28 19 23 23 22 3 19 13 31 23 28 24 9 20 33 30 23 20 8 21 24
Solution:
1. The array form of the data (increasing order)
3 3 7 8 9 11 12 13 13 15 16 17 17 18 19 19 20 20 21 21 22 22 22 22 22 22
23 23 23 23 23 24 24 24 24 24 25 25 26 26 26 27 27 28 28 28 29 30 31 33.
2. U=9-8=1
3. R=L-S=33-3=30
4. K=1+3.322logN=1+3.322log50=6.64≈7
5. W=R/K=30/6.64=4.5≈5
6. W-U=5-1=4
CL CB CM F RF PF
3-7 2.5-7.5 5 3 3/50=0.06 6
8-12 7.5-12.5 10 4 4/50=0.08 8
13-17 12.5-17.5 15 6 6/50=0.12 12
18-22 17.5-22.5 20 13 13/50=0.26 26
23-27 22.5-27.5 25 17 17/50=0.34 34
28-32 27.5-32.5 30 6 6/50=0.12 12
33-37 32.5-37.5 35 1 1/50=0.02 2
Total 50 1 100
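The construction steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `grouped_fd` is hypothetical, the logarithm is taken to base 10, K is rounded up, and W is rounded up to the next whole number to mirror the rounding 4.5≈5 used in the solution.

```python
import math

def grouped_fd(data):
    """Build a grouped FD; returns (LCL, UCL, LCB, UCB, class mark, frequency) rows."""
    values = sorted(data)                       # step 1: array form
    distinct = sorted(set(values))
    # Step 2: unit of measurement = smallest gap between distinct values.
    u = min(b - a for a, b in zip(distinct, distinct[1:]))
    r = values[-1] - values[0]                  # step 3: range
    k_raw = 1 + 3.322 * math.log10(len(values)) # step 4: Sturges' Rule
    k = math.ceil(k_raw)
    w = math.ceil(r / k_raw)                    # step 5: width, rounded up (assumption)
    rows = []
    for i in range(k):
        lcl = values[0] + i * w                 # step 6: lower class limits
        ucl = lcl + (w - u)                     # step 7: upper class limits
        freq = sum(lcl <= x <= ucl for x in values)
        # Step 8: boundaries lie half a unit outside the limits.
        rows.append((lcl, ucl, lcl - u / 2, ucl + u / 2, (lcl + ucl) / 2, freq))
    return rows

marks = [16, 21, 26, 24, 11, 17, 25, 26, 13, 27, 24, 26, 3, 27, 23, 24, 15,
         22, 22, 12, 22, 29, 18, 22, 28, 25, 7, 17, 22, 28, 19, 23, 23, 22,
         3, 19, 13, 31, 23, 28, 24, 9, 20, 33, 30, 23, 20, 8, 21, 24]

rows = grouped_fd(marks)
for lcl, ucl, lcb, ucb, cm, f in rows:
    print(f"{lcl}-{ucl}  {lcb}-{ucb}  CM={cm}  F={f}  RF={f / len(marks):.2f}")
```

Run on the 50 marks, the sketch yields the same seven classes and frequencies as the table above.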
CB f Class LCF Class MCF
2.5-7.5 3 <7.5 3 >2.5 50
7.5-12.5 4 <12.5 7 >7.5 47
12.5-17.5 6 <17.5 13 >12.5 43
17.5-22.5 13 <22.5 26 >17.5 37
22.5-27.5 17 <27.5 43 >22.5 24
27.5-32.5 6 <32.5 49 >27.5 7
32.5-37.5 1 <37.5 50 >32.5 1
50
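The less than and more than cumulative frequencies in this table are running totals taken from the top and from the bottom respectively. A short sketch, assuming the class frequencies are entered as a list in class order:

```python
from itertools import accumulate

# Class frequencies of the marks example, in increasing class order.
frequencies = [3, 4, 6, 13, 17, 6, 1]
upper_boundaries = [7.5, 12.5, 17.5, 22.5, 27.5, 32.5, 37.5]
lower_boundaries = [2.5, 7.5, 12.5, 17.5, 22.5, 27.5, 32.5]

# Less than CF: running total from the first class downwards.
less_than_cf = list(accumulate(frequencies))

# More than CF: running total from the last class upwards.
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

for ucb, lcf, lcb, mcf in zip(upper_boundaries, less_than_cf,
                              lower_boundaries, more_than_cf):
    print(f"<{ucb}: {lcf}    >{lcb}: {mcf}")
```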
Exercise: In a survey the age of 44 women at marriage was reported as follows. Construct the
appropriate FD for this data. 24 25 27 26 22 23 24 25 24 23 26 28 24 25 23 24 25 25
25 22 27 28 27 24 25 24 25 28 26 25 24 28 24 25 25 24 25 24 26 27 27 25 28 26
Classes should be
Complete and non-overlapping: Complete- it should include all the data set. Non-
overlapping- no data should belong to two classes.
Clear and properly set: The W and K should be calculated properly and W should be the
same for all classes.
Standardized: A class should follow logical and chronological (increasing) order.
The number of classes should be between 5 and 20, i.e. 5≤K≤20. K depends on N: the
larger the N, the larger the K. But we need to condense the data set with minimum loss of
information into easily manageable classes.
Continuous: Even if there are no values in a class the class must be included in the
frequency distribution.
Advantages and disadvantages of frequency distributions
a. Advantages
It condenses a large mass of data in to a comparatively small table.
It attracts the attention of even a layman and gives him an insight into the nature of the
distribution.
It helps for further statistical analysis, like central tendency, scatter, and symmetry … of the
data.
b. Disadvantages
In grouped frequency distributions, the identity of the observations is lost. We know only
the number of observations in a class and do not know what the values are.
Because the selection of the class width and the lower class limit of the first class are to a
certain extent arbitrary, different frequency distributions may be constructed for the same
data and hence may give contradictory impressions.
Data Presentation
i. Graphs
1. Histogram: A graph in which the classes are marked on the X axis (horizontal axis) and the
frequencies are marked along the Y axis (vertical axis).
The height of each bar represents the class frequencies and the width of the bar represents
the class width.
The bars are drawn adjacent to each other.
2. Frequency Polygon: A graph that consists of line segments connecting the intersection of
the class marks and the frequencies.
Can be constructed from Histogram by joining the mid-points of each bar.
3. Frequency curve: is a smooth free hand curve of frequency polygon.
ii. Diagrams
1. Bar Diagram:-It is the simplest and most commonly used diagrammatic representation of a
frequency distribution. It is appropriate for presenting Qualitative Data (nominal/ordinal). It uses
a series of separated and equally spaced bars in which the width of the bars is constant and
the height of each bar corresponds to the frequency of the category. The bars are separated by
a constant distance.
1.1 Simple Bar Diagram: is a diagram in which categories of a variable are marked on the X
axis and the frequencies of the categories are marked on the Y axis.
It is applicable for discrete variables, that is, for data given according to some period, places and
timings. These periods and timings are represented on the base line (X-axis) at regular interval
and the corresponding frequencies are represented on the Y-axis.
The width of the rectangle represents nothing (it is meaningless), but it should be equal for
all rectangles.
Each rectangle is separated by an equal space.
It can also represent some magnitude (on the Y axis) over time, space, groups, etc. (on the
X axis).
Example1:
Marital Status Number of individuals
Single 100
Married 70
Divorced 30
Total 200
[Figure: simple bar diagram titled "Marital Status". X axis: Marital Status (Single, Married, Divorced); Y axis: Frequency, scaled from 0 to 100.]
Example2:
Year 1983 1984 1985 1986 1987
1.2 Component Bar Diagram: is used when there is a desire to show how a total or aggregate is
divided into its component parts. The bars represent the total value of a variable with each total
broken into its component parts, and different colors are used for identification. In such
diagrams, a bar is subdivided into parts in proportion to the size of each subdivision. These
subdivided rectangles are shaded differently by lines, dots and colors so that it is
easy to compare the components.
Sometimes the volumes of different attributes may be greatly different. For making meaningful
comparisons, the components of the attributes are reduced to percentages. In that case each
attribute will have 100 as its maximum volume. This sort of component bar diagram is known as
percentage bar-diagram.
Each rectangle represents total value of a variable and is broken into its component parts.
Example:
Marital Status Male Female Total
Single 90 10 100
Married 30 40 70
Divorced 1 29 30
[Figure: component bar diagram for the table above; bars: Single, Married, Divorced.]
1.3 Multiple Bars Diagram: used to display data on more than one variable. In the multiple bars
diagram two or more sets of inter-related data are interpreted. Example:
Year Coffee Butter Sugar Total
1997 120 127 75
1998 25 98 87
1999 100 120 75
2000 198 98 60
[Figure: multiple bars diagram; legend: Coffee, Butter, Sugar, Total.]
Example2:
Items of Expenditure Family A Family B
Education 65 23
1.4 Deviation Bar Diagram: used when the data contain both positive and negative values, such as
data on net profit, net expense, percent change, etc. Example:
Commodity Net profit
Soap 80
Sugar -95
Coffee 125
[Figure: deviation bar diagram of net profit; bars: Soap, Sugar, Coffee.]
2. Pie chart: - A pie chart is popularly used in practice to show the percentage breakdown of data. A
pie chart is a circle representing a set of data by dividing the circle into sectors proportional
to the number of items in the categories; equivalently, it is a circle representing the total, cut
into slices in proportion to the size of the parts that make up the total. It gives the
proportional sizes of different data groups as slices of a pie or a circle. Example:
Marital Status Number of individuals Percentage Degree
Single 100 50 180
Married 70 35 126
Divorced 30 15 54
Total 200 100 360
[Figure: pie chart with sectors Single, Married, Divorced.]
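The Percentage and Degree columns of the pie chart table follow from the counts alone: each sector receives the same share of 100% and of 360° as its category has of the total. A minimal sketch, with illustrative variable names:

```python
# Number of individuals per marital status, from the pie chart example.
counts = {"Single": 100, "Married": 70, "Divorced": 30}
total = sum(counts.values())

# Each sector gets the same share of 100% and of 360 degrees
# as its category has of the total.
sectors = {
    status: (100 * n / total, 360 * n / total)
    for status, n in counts.items()
}

for status, (percentage, degrees) in sectors.items():
    print(status, counts[status], percentage, degrees)
```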
Chapter Three
Measures of Central Tendency
Usually the collected data are not suitable for drawing conclusions about the mass from which they have
been taken. Even though the data are somewhat summarized once they are organized into
frequency distributions and presented using graphs and diagrams, we still cannot easily make
inferences about the data since there are many groups. Hence, organizing data into a FD is not
sufficient; there is a need for further condensation. In particular, when we want to compare two or
more distributions, we may reduce each distribution to one number that represents it.
A single value which can be considered typical or representative of a
set of observations, and around which the observations can be considered as centered, is called an
'Average' (or average value or center of location). Since such typical values tend to lie
centrally within a set of observations when arranged according to magnitude, averages are
called Measures of Central Tendency.
Objectives of MCT
1. To condense a mass of data into one single value. That is, to get a single value which is best
representative of the data (that describes the characteristics of the entire data). Measures of
central tendency, by condensing masses of data into one single value, enable us to get an idea of
the entire data. Thus one value can represent thousands of data values or even more.
2. To facilitate comparison. Statistical devices like averages, percentages and ratios are used for
this purpose. Measures of central tendency, by condensing masses of data into one single value,
facilitate comparison. Eg: to compare two classes A and B, instead of comparing each
student's result, which is infeasible, we can compare the average marks of the two classes.
There are many types of measures of central tendency, each possessing particular properties and
each being typical in some unique way. The most frequently encountered ones are
I. Computed averages: Mean (Arithmetic Mean, Geometric Mean and Harmonic Mean)
II. Positional averages: Median; Quantiles (Quartiles, Deciles, Percentiles)
III. Mode
Desirable properties of good Measures of central tendency
A measure of central tendency is good or satisfactory if it possesses the following characteristics.
1. It should be calculated based on all observations.
2. It should not be affected by extreme values. It should be as close to the maximum number of
observed values as possible.
3. It should be defined rigidly which means it should have a definite value (it should be unique).
4. It should always exist.
5. It should be easy to understand and calculate. It should not be subject to complicated and tedious
calculations, though the advent of electronic calculators and computers has made such calculations possible.
6. It should be capable of further algebraic treatment. By algebraic treatment, we mean that the
measures should be used further in the formulation of other formulae or it should be used for
further statistical analysis.
Mean
Arithmetic Mean
Simple Arithmetic Mean:-is the sum of all observations divided by the total number of
observations. For a sample of n observations $X_1, X_2, \dots, X_n$ the sample mean is denoted by $\bar{X}$
(X-bar) and calculated as follows.
$\bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}$
For a frequency array (ungrouped FD),
$\bar{X} = \frac{\sum fX}{\sum f} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_K X_K}{f_1 + f_2 + \dots + f_K}$
For grouped FD, X represents class mark.
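To illustrate the frequency-weighted formula, the sketch below applies it to two tables from Chapter Two: the number of children in 21 families (a frequency array) and the age groups (a grouped FD, where the class marks stand in for X). The helper name `fd_mean` is hypothetical.

```python
def fd_mean(values, frequencies):
    """Mean of a frequency distribution: sum(f * X) / sum(f)."""
    return sum(f * x for x, f in zip(values, frequencies)) / sum(frequencies)

# Ungrouped FD: number of children (X) and number of families (f).
children_mean = fd_mean([0, 1, 2, 3, 4, 5], [1, 4, 7, 5, 3, 1])

# Grouped FD: class marks of the age groups (X) and number of persons (f).
age_mean = fd_mean([13, 38, 63, 88], [20, 15, 25, 10])

print(round(children_mean, 2))  # 50/21
print(round(age_mean, 2))       # 3285/70
```

For a grouped FD the result is only an approximation of the mean of the raw data, since every observation in a class is treated as if it were equal to the class mark.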
Combined Mean
If there are p different groups (having the same unit of measurement) with means $\bar{X}_1, \bar{X}_2, \dots, \bar{X}_p$
and number of observations $n_1, n_2, \dots, n_p$ respectively, then the mean of all the groups, i.e. the
combined mean, is given by
$\bar{X}_C = \frac{\sum n\bar{X}}{\sum n} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + \dots + n_p \bar{X}_p}{n_1 + n_2 + \dots + n_p}$
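The combined mean is a mean of the group means weighted by group size. The sketch below uses made-up numbers (two groups of 20 and 30 observations with means 10 and 15), chosen only for illustration; the function name `combined_mean` is hypothetical.

```python
def combined_mean(sizes, means):
    """Combined mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    if len(sizes) != len(means):
        raise ValueError("sizes and means must have the same length")
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# Hypothetical groups: (20 * 10 + 30 * 15) / (20 + 30) = 650 / 50
result = combined_mean([20, 30], [10, 15])
print(result)  # 13.0
```

Note that the result (13) differs from the plain average of the two group means (12.5), because the larger group carries more weight.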
Properties of AM
The algebraic sum of the deviations of each value from the arithmetic mean is zero. That
is, $\sum (X - \bar{X}) = 0$.
The sum of the squares of the deviations from the mean is less than the sum of the
squares of the deviations about any other value. That is, $\sum (X - \bar{X})^2 \le \sum (X - A)^2$, $A \ne \bar{X}$.
If a constant C is added to or subtracted from each value in a distribution, then the new
mean will be $\bar{X}_{new} = \bar{X}_{old} \pm C$ respectively.
If each value of a distribution is multiplied by a constant C, the new mean will be the
original mean multiplied by C.
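These properties can be checked numerically. The sketch below uses a small made-up data set (2, 4, 6, 8, 10) and a constant C = 10, both chosen purely for illustration.

```python
data = [2, 4, 6, 8, 10]   # hypothetical observations
c = 10                    # hypothetical constant

def mean(values):
    return sum(values) / len(values)

def sum_sq_dev(values, a):
    """Sum of squared deviations of the values about the point a."""
    return sum((x - a) ** 2 for x in values)

x_bar = mean(data)

# Property 1: deviations from the mean sum to zero.
deviation_sum = sum(x - x_bar for x in data)

# Property 2: squared deviations are smallest about the mean.
about_mean = sum_sq_dev(data, x_bar)
about_other = sum_sq_dev(data, 5)      # any A different from the mean

# Properties 3 and 4: adding C shifts the mean by C; multiplying scales it by C.
shifted_mean = mean([x + c for x in data])
scaled_mean = mean([x * c for x in data])

print(x_bar, deviation_sum, about_mean, about_other, shifted_mean, scaled_mean)
```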
Ex:
1. Find the arithmetic mean of A) 1, 2, 3, 4, 5. B) 1, 2, 3, 4, 100. Is there a great difference
between the mean of A and that of B?
2. A teacher attaches weights of 2 to the Quiz, 3 to the Mid-term and 5 to the Final exam. If a student gets 90, 50 and
60 for Quiz, Mid-term and Final exam respectively, what is his/her average academic
performance?
3. The mean weight of 50 women workers in a factory is 48 kg. The mean weight of 75 men
working in the same factory is 58 kg. Find the mean weight of all workers in the factory.
4. The mean of 200 items was found to be 40. Later on it was discovered that two items were
wrongly read as 92 and 8 instead of 192 and 88 respectively. Find the correct mean.
5. The mean salary of 100 laborers working in a factory , running in two shifts of 40 and 60
workers respectively is birr 380. The mean salary of the 40 laborers working in the morning
shift is birr 350. Find the mean salary of the 60 laborers working in the evening shift.
Geometric Mean
Geometric mean is the nth root of the product of the n values. $GM = \sqrt[n]{\prod X} = \sqrt[n]{X_1 X_2 \cdots X_n}$
But this formula is used if n is small. If it is large, it is difficult to calculate the nth root. Thus to
facilitate the computation, we make use of logarithms. $GM = \text{Antilog}\left(\frac{1}{n} \sum \log X\right)$
For ungrouped FD, $GM = \text{Antilog}\left(\frac{1}{\sum f} \sum f \log X\right)$
For grouped FD, X represents class mark.
If the variable values are measured as ratios, proportions or percentages and some values are
larger in magnitude and others are small, then the geometric mean is a better representative of
the data than the simple average. In a “geometric series”, the most meaningful average is the
geometric mean. The arithmetic mean is very biased toward the large numbers in the series.
The geometric mean is important in determining the average rate of growth, percentages, ratios
and proportions.
The disadvantage of GM is that it cannot be calculated if one or more observations are zero or
negative. It is also affected by extreme values but not to the extent of AM.
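The logarithm form of the geometric mean can be checked against the direct nth-root form. The sketch below is only illustrative: the function name `geometric_mean` and the growth factors are assumptions, and base-10 logarithms are used to match the "Antilog" notation.

```python
import math


def geometric_mean(values):
    """GM = antilog((1/n) * sum(log X)); every value must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    n = len(values)
    return 10 ** (sum(math.log10(v) for v in values) / n)


# Made-up growth factors for yearly increases of 10%, 20% and 50%.
factors = [1.10, 1.20, 1.50]
gm = geometric_mean(factors)
direct = math.prod(factors) ** (1 / len(factors))  # nth root of the product
avg_rate = (gm - 1) * 100  # average percentage increase per year
print(round(gm, 4), round(avg_rate, 2))
```

When averaging percentage changes, the values fed to the geometric mean are the growth factors (1 + rate), not the rates themselves; the average rate is recovered by subtracting 1 at the end.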
Ex:
1. Find the geometric mean of A) 1, 2, 3, 4, 5. B) 1, 2, 3, 4, 100. Is there a great difference
between the GM of A and that of B?
2. The price of a commodity increased by 5% from 1989 to 1990, 8% from 1990 to 1991 and by
77% from 1991 to 1992. Find the average price increase.
3. A machine depreciated by 10% each in the first two years and by 40% in the third year. Find
out the average rate of depreciation.
4. Decadal percentage growth of population in country A is given below. Find the average rate
growth.
Year 1921 1931 1941 1951 1961 1971 1981
Harmonic Mean
Harmonic Mean is another specialized average which is useful in averaging variables expressed
as rate per unit of time, such as speed, number of units produced per day. It is the reciprocal of
the arithmetic mean of the reciprocals of the numbers.
$HM = \frac{n}{\sum \frac{1}{X}} = \frac{n}{\frac{1}{X_1} + \frac{1}{X_2} + \dots + \frac{1}{X_n}}$
Harmonic mean is not affected by extreme values. But it cannot be calculated when one or more
observations are zero.
For n positive observations AM ≥ GM ≥ HM
For two positive observations $GM = \sqrt{AM \times HM}$
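The two relations between the means can be verified numerically. This is a small sketch using Python's standard `statistics` module; the two observations are made up for illustration.

```python
import math
import statistics

data = [2, 8]  # illustrative pair of positive observations

am = statistics.mean(data)            # (2 + 8) / 2 = 5
gm = math.sqrt(data[0] * data[1])     # sqrt(16) = 4.0
hm = statistics.harmonic_mean(data)   # 2 / (1/2 + 1/8) = 3.2

print(am, gm, hm)
print(am >= gm >= hm)                 # ordering AM >= GM >= HM
print(math.isclose(gm ** 2, am * hm)) # GM^2 = AM * HM for two values
```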
EX:-
1. Find the harmonic mean of A) 1, 2, 3, 4, 5. B) 1, 2, 3, 4, 100. Is there a great difference
between the HM of A and that of B?
2. A driver traveled 400 km per day for three days at a speed of 60, 50 and 40 kilometers per
hour. Find the average speed of the driver.
3. A student reads the first 100 pages of a book at a rate of 5 pages per hour, the next 100 pages
at a rate of 8 pages per hour. What is the student‟s average reading speed?
4. Suppose a train moves 100 km with a speed of 40 km per hour, then 150 km with a speed of
50 km per hour and the next 135 km with a speed of 45 km per hour. Calculate the average
speed of the train.
5. In a factory a mechanic takes 15 days to fabricate a machine, the second mechanic takes 18
days, the third takes 30 days and the fourth takes 90 days. Find the average number of days
taken by the workers to fabricate the machine.
6. Suppose a train moves 5 hours at a speed of 40 km per hour, then 3 hours at a speed of 50 km
per hour and the next 5 hours with a speed of 45 km per hour. Calculate the average speed of
the train.
Median ( $\tilde{X}$ )
Median is the halfway point in a data set. It divides a data set into two equal parts such that half
of the numbers have a value less than the median and half have values greater than the
median. Graphically, the median is the intersection of the less than and more than cumulative
frequency curves.
The median of a set of n observations X1, X2, …, Xn arranged in ascending order of magnitude is
the middle value if n is odd or the arithmetic mean of the two middle values if n is even. That is
If n is odd, $\tilde{X} = \left(\frac{n+1}{2}\right)^{th}$ value
If n is even, $\tilde{X} = \frac{\left(\frac{n}{2}\right)^{th} \text{ value} + \left(\frac{n}{2}+1\right)^{th} \text{ value}}{2}$
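The odd/even rule for ungrouped data translates directly into code. The sketch below is illustrative; the function name `median` and the sample values are assumptions.

```python
def median(values):
    """Middle value of the sorted data (odd n) or mean of the two middle values (even n)."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                   # the ((n+1)/2)th value, counting from 1
    return (xs[mid - 1] + xs[mid]) / 2   # mean of the (n/2)th and (n/2 + 1)th values


print(median([5, 1, 3]))      # sorted: 1, 3, 5 -> 3
print(median([4, 1, 3, 2]))   # sorted: 1, 2, 3, 4 -> (2 + 3) / 2 = 2.5
```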
Median for continuous grouped data: for grouped frequency distributions the median is given by the
formula $\tilde{X} = L_{\tilde{X}} + \left(\frac{\frac{n}{2} - F_{\tilde{X}-1}}{f_{\tilde{X}}}\right) w$
Where n=∑f= sum of frequencies
$L_{\tilde{X}}$ is the LCB of the median class.
$F_{\tilde{X}-1}$ is the less than cumulative frequency just before the median class.
$f_{\tilde{X}}$ is the frequency of the median class.
w is the class width.
First obtain the less than cumulative frequencies. From the cumulative frequencies select the
smallest one that is greater than or equal to $\frac{n}{2}$. The median class is the class corresponding to
this cumulative frequency, i.e. the class that contains the $\left(\frac{n}{2}\right)^{th}$ value.
Median is not influenced by extreme values. It can be calculated for FD with open-ended classes,
and it can even be located if the data are incomplete.
Other positional measures: The median of a set of data divides a given data set into two equal
parts; there are also measures that divide a given data set in to more than two equal parts. These
measures are collectively known as Quantiles. Quantiles include quartiles, deciles and
percentiles.
Quartiles: are values that divide a dataset into four equal parts. These values are denoted by Q1,
Q2 and Q3 such that 25% of the data fall below Q1, 50% below Q2 and 75% below Q3.
Deciles: are values that divide the data into ten equal parts. These values are denoted by D1, D2,
…, D9 such that 10% of the data fall below D1, 20% below D2, …, 90% below D9.
Percentiles: are values that divide a dataset into 100 equal parts. These values are denoted by P1,
P2, …, P99.
Methods of calculation
a. Ungrouped (individual) series: Arrange the values in ascending order. Then
Quartiles: Let Qi be the ith quartile (i=1,2,3), then
$Q_i = \left(\frac{i(n+1)}{4}\right)^{th}$ value
Deciles: Let Di be the ith decile (i=1,2,…,9)
$D_i = \left(\frac{i(n+1)}{10}\right)^{th}$ value
Percentiles: Let Pi be the ith percentile (i=1,2,…,99)
$P_i = \left(\frac{i(n+1)}{100}\right)^{th}$ value
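b. Grouped (continuous) data:
Quartiles: $Q_i = L_{Q_i} + \left(\frac{\frac{in}{4} - F_{Q_i-1}}{f_{Q_i}}\right) w$, i=1, 2, 3.
Deciles: $D_i = L_{D_i} + \left(\frac{\frac{in}{10} - F_{D_i-1}}{f_{D_i}}\right) w$, i=1, 2, …, 9.
Percentiles: $P_i = L_{P_i} + \left(\frac{\frac{in}{100} - F_{P_i-1}}{f_{P_i}}\right) w$, i=1, 2, …, 99.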
Where n=∑f= sum of frequencies
L is the LCB of the ith(quartile, decile and percentile) class.
F is the less than cumulative frequency just before the ith(quartile, decile and percentile)
class.
f is frequency of the ith(quartile, decile and percentile) class .
w is the class width.
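The grouped-data formulas for quartiles, deciles and percentiles share one pattern, so a single function can cover them all. The sketch below is an illustration under stated assumptions: the function name `grouped_quantile`, the equal class width, and the frequency distribution are all made up for the example.

```python
def grouped_quantile(lower_boundaries, width, freqs, fraction):
    """Quantile for grouped data: L + ((fraction*n - F) / f) * w.

    fraction is i/4 for quartiles, i/10 for deciles and i/100 for percentiles.
    """
    n = sum(freqs)
    target = fraction * n
    cum_before = 0  # less than cumulative frequency before the current class
    for lcb, f in zip(lower_boundaries, freqs):
        if f > 0 and cum_before + f >= target:
            return lcb + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("fraction must lie between 0 and 1")


# Illustrative FD: class boundaries 2-4, 4-6, 6-8, 8-10 with frequencies 2, 5, 4, 7.
lcbs, w, f = [2, 4, 6, 8], 2, [2, 5, 4, 7]
q1 = grouped_quantile(lcbs, w, f, 1 / 4)   # 4 + (4.5 - 2) / 5 * 2 = 5.0
q2 = grouped_quantile(lcbs, w, f, 2 / 4)   # 6 + (9 - 7) / 4 * 2 = 7.0 (the median)
q3 = grouped_quantile(lcbs, w, f, 3 / 4)   # 8 + (13.5 - 11) / 7 * 2
print(q1, q2, round(q3, 3))
```

Calling the function with fraction 0.5, 5/10 or 50/100 gives the same value, which is the grouped median.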
Relationship between median, quartiles, deciles and percentiles
$\tilde{X} = Q_2 = D_5 = P_{50}$
$Q_i = P_{25i}$
$D_i = P_{10i}$
Mode ( X̂ )
The mode, denoted by $\hat{X}$, is the most frequently occurring value in a set of observations or it is
the value with the highest frequency. A data set may have one mode (uni-modal), two modes (bi-
modal), more than two modes (multi-modal) or no mode at all (i.e. when all observations are
equally frequent).
Ungrouped (individual series): Arrange the data in ascending order and take the value
appearing most frequently (the most frequent value).
Grouped (continuous) series: In a frequency distribution, the mode is located in the class with
highest frequency and that class is the modal class.
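Then the formula for mode is $\hat{X} = L_{\hat{X}} + \left(\frac{f_{\hat{X}} - f_{\hat{X}-1}}{(f_{\hat{X}} - f_{\hat{X}-1}) + (f_{\hat{X}} - f_{\hat{X}+1})}\right) w$
where $L_{\hat{X}}$ is the LCB of the modal class, $f_{\hat{X}}$ is the frequency of the modal class, $f_{\hat{X}-1}$ and $f_{\hat{X}+1}$ are
the frequencies of the classes just before and just after the modal class, and w is the class width.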
Mode is not affected by extreme values and can be calculated for open-ended classes. But it
often does not exist and its value may not be unique.
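The grouped mode formula can be sketched as follows. The function name `grouped_mode` and the frequency distribution are illustrative assumptions; treating a missing neighbouring class as having frequency 0 is also an assumption of this sketch, not something stated in the text.

```python
def grouped_mode(lower_boundaries, width, freqs):
    """Mode for grouped data: L + (d1 / (d1 + d2)) * w, using the modal class."""
    k = max(range(len(freqs)), key=freqs.__getitem__)      # index of the modal class
    f_modal = freqs[k]
    f_prev = freqs[k - 1] if k > 0 else 0                  # assumption: 0 if no class before
    f_next = freqs[k + 1] if k < len(freqs) - 1 else 0     # assumption: 0 if no class after
    d1 = f_modal - f_prev
    d2 = f_modal - f_next
    if d1 + d2 == 0:
        raise ValueError("mode is not defined: neighbouring frequencies equal the modal one")
    return lower_boundaries[k] + d1 / (d1 + d2) * width


# Illustrative FD: classes 0-10, 10-20, 20-30 with frequencies 3, 8, 5.
mode = grouped_mode([0, 10, 20], 10, [3, 8, 5])
print(mode)  # 10 + (5 / (5 + 3)) * 10 = 16.25
```

If several classes share the highest frequency, this sketch picks the first one, which reflects the remark that the mode may not be unique.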
EXERCISES
1. The life of eighty condensers obtained in a life testing experiment has been presented as
follows and the mean life of a condenser was 5.39 years.
Life of condensers (in years): Less than 1, Less than 2, Less than 3, Less than 4, Less than 5,
Less than 6, Less than 7, Less than 8, Less than 9, Less than 10
a. Find the median and modal life of condensers and interpret them.
b. Find the values of all quartiles.
c. Compute the 5th decile, 25th percentile, 50th percentile and the 75th percentile and interpret
the results.
3. The mean annual salary of all employees in a company is 2500. The mean salary of male and
female is 2700 and 1700 respectively. Find the percentage of males and females employed in
the company.
4. Given the following FD.
Mid-price of a commodity 15 25 35 45 55
a. If 75% of the items were sold in birr 45 or less and most items were sold in birr 34, find the
missing frequencies.
b. If 25% of the items were sold in less than or equal to birr 45 and most items were sold in
birr 34, find the missing frequencies.
Chapter Four
Measures of Variation (Dispersion)
In the third chapter, we concentrated on a central value (measures of central tendency), which
gives an idea of the whole mass that is a complete set of values. However the information so
obtained is neither exhaustive nor comprehensive, as the mean does not lead us to know whether
the observations are close to each other or far apart. Median is a positional average and has
nothing to do with the variability of the observations in a data set. Mode is the most frequently
occurring value, independent of the other values in the set. This leads us to conclude that a measure of
central tendency is not enough to have a clear idea about the data unless all observations are the
same. Moreover two or more data sets may have the same mean and/or median but they may be
quite different. So measures of central tendency (MCT) alone do not provide enough information
about the nature of the data.
To illustrate this let us consider the following four data sets: the price of a certain commodity in
four cities in five different months.
City   Price in each of the five months
A      30 30 30 30 30
B      28 29 31 30 32
C      15 5 55 45 30
D      3 5 37 30 75
Now if we calculate the mean and median for each city, we will come up with the value
30. This value implies that the price of the commodity in the four cities A, B, C and D is, on
average, the same.
But by inspection, it is apparent that the price of the commodity in the cities differs remarkably
from one another. For city A, it is right, for city B it is more or less acceptable, but for cities C and D it is
not realistic to say the price of the commodity is 30. This means that just by looking at the
average we cannot talk about the data set confidently. So, along with the average values
(measures of central tendency), we have to study the scatteredness or dispersion of the data.
Dispersion or variation may be defined as the extent of scatteredness of values around the
measures of central tendency. Thus a measure of dispersion tells us the extent to which the values
of a variable vary about the measure of central tendency.
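The four-city illustration can be reproduced with a short script. It uses the prices from the table above and Python's standard `statistics` module; the range (largest minus smallest value) is used here only as a quick indicator of spread.

```python
import statistics

# Prices of the commodity in the four cities over the five months.
prices = {
    "A": [30, 30, 30, 30, 30],
    "B": [28, 29, 31, 30, 32],
    "C": [15, 5, 55, 45, 30],
    "D": [3, 5, 37, 30, 75],
}

summary = {}
for city, xs in prices.items():
    mean = statistics.mean(xs)
    med = statistics.median(xs)
    spread = max(xs) - min(xs)  # range: largest value minus smallest value
    summary[city] = (mean, med, spread)
    print(city, mean, med, spread)
```

All four cities share the same mean and median of 30, yet the ranges are 0, 4, 50 and 72, which is exactly the information the averages hide.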
Purposes of Measures of Dispersion
To have an idea about the reliability of the measure of central tendency. If the degree of
scatteredness is large, an average is less reliable. If the value of the dispersion is small, it
indicates that a central value is a good representative of all the values in the data set.
To compare two or more sets of data with regard to their variability. Two or more data
sets can be compared by calculating the same measure of dispersion having the same unit of
measurement. A set with a smaller value possesses less variability or is more uniform (or more
consistent).
To provide information about the structure of the data. A value of a measure of dispersion
gives an idea about the spread of the observations. Further, one can surmise about the limits
of the expansion of the values in the data set.
To pave way to the use of other statistical measures. Measures of dispersion, especially
variance and standard deviation, lead to many statistical techniques like correlation,
regression, analysis of variance.
Absolute measures of variation: A measure of variation is said to be in absolute form when it
shows the actual amount of variation of an item from a measure of central tendency and is
expressed in the concrete units in which the data have been expressed.
Relative measure of variation: It is the quotient obtained by dividing the absolute measure by a
quantity in respect to which absolute deviation has been computed. Relative measure of variation
is a pure number and used for making comparisons between different distributions.
Absolute Measures Relative Measures
Range Coefficient of Range
Quartile Deviation Coefficient of Quartile Deviation
Mean Deviation Coefficient of Mean Deviation
Variance and Standard Deviation Coefficient of Variation
Standard Scores
Before giving the details of these measures of dispersion, it is worthwhile to point out that a
measure of dispersion (variation) is to be judged on the basis of all those properties of good
measure of central tendency.
Range: It is the simplest and crudest measure of dispersion. Range is defined as the difference
between the largest and the smallest values in the data.
Ungrouped Data: R=L-S
Grouped Data: R=UCLlast-LCLfirst
Coefficient of Range (CR)
For raw data: $CR = \frac{L - S}{L + S}$
Range hardly satisfies any property of good measure of dispersion as it is based on two extreme
values only, ignoring the others. It is not liable to further algebraic treatment.
Variance
The Variance and Standard Deviation are the most superior and widely used measures of
dispersions and both measure the average dispersion of the observations around the mean.
For a population containing N elements, the population variance ($\sigma^2$) is calculated by using the
formula $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$, where µ is the population mean.
Thus the other disadvantage of variance is that the variation of the data is exaggerated because the
deviation (difference) of each value from the mean is squared. Also it gives more weight to the
extreme values as compared to those which are near to the mean value.
Combined Variance
If the data are a sample and the distribution is normal or bell-shaped (or close to it), then the
following conclusions can be reached:
Approximately 68% of the scores in the sample fall within one standard deviation of the mean,
i.e. $\bar{X} \pm S$ will include approximately 68% of the data.
Approximately 95% of the scores in the sample fall within two standard deviations of the
mean, i.e. $\bar{X} \pm 2S$ will include approximately 95% of the data.
Approximately 99.7% of the scores in the sample fall within three standard deviations of the
mean, i.e. $\bar{X} \pm 3S$ will include approximately 99.73% of the data.
Even if standard deviation is better than variance, there is however one difficulty with it. If there
are two or more distributions of different variables (having different units of measurement), their
variability cannot be compared by comparing the values of the standard deviation.
1. Coefficient of Variation (CV)
All absolute measures of dispersion have units. If two or more distributions differ in their units
of measurement, their variability cannot be compared by any of the absolute measures given
before. Also, the size of these measures of dispersion depends upon the size of the values. That
is, if the size of the values is larger, the value of the absolute measures will also be larger. Hence,
in situations where either the two or more data sets have different units of measurement, or their
means differ sufficiently in size, absolute measures fail to be appropriate.
It is a relative measure of standard deviation. The coefficient of variation is the ratio of the
standard deviation to the mean and it is expressed as a percentage.
$CV = \frac{\sigma}{\mu} \times 100\%$ for a population; $CV = \frac{S}{\bar{X}} \times 100\%$ for a sample
It is used for comparing the variability of two or more distributions. The distribution having less
CV is said to be less variable or more consistent or more uniform.
Since absolute measures depend on the units of measurement of the data, they fail to be
appropriate for comparing two or more groups if
1. The groups have different units of measurement.
2. The size of the data between the groups is not the same.
When either of these two conditions happens we have to use relative measures of variation. CV
is a unit less measure of variation and also takes into account the size of the means of the
distributions.
EX: Given: Data Set A: 2 Meters, 4 Meters, 6 Meters
Data Set B: 1000 Liters, 800 Liters, 900 Liters
Compare the variability of the two data sets using standard deviation and coefficient of variation.
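A possible way to carry out this comparison in code is sketched below. It assumes the data sets are samples, so `statistics.stdev` (divisor n − 1) is used; the function name `coefficient_of_variation` is illustrative.

```python
import statistics


def coefficient_of_variation(sample):
    """CV = (S / X-bar) * 100%, with S the sample standard deviation (divisor n - 1)."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100


set_a = [2, 4, 6]            # meters
set_b = [1000, 800, 900]     # liters

sd_a, sd_b = statistics.stdev(set_a), statistics.stdev(set_b)
cv_a, cv_b = coefficient_of_variation(set_a), coefficient_of_variation(set_b)
print(sd_a, sd_b)                      # standard deviations in meters and in liters
print(round(cv_a, 2), round(cv_b, 2))  # unit-free percentages
```

The standard deviations (2 meters versus 100 liters) cannot be compared directly because the units differ, whereas the CVs (50% versus about 11.11%) indicate that Data Set B is relatively less variable.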
EXERCISES
1. Find the range, variance, standard deviation and coefficient of variation for the following FD.
class 2-4 4-6 6-8 8-10
frequency 2 5 4 7
2. Two persons participated in five shooting competition and were able to hit the target
correctly out of fifteen shots as given below.
Competitor A 6 12 12 10 7
Competitor B 12 15 7 7 4
0 27 17
1 9 9
2 8 6
3 5 5
4 4 3
Chapter Five
Elementary Probability
Probability is
A quantitative (numerical) measure of uncertainty.
A measure of the strength of belief in the occurrence of something (an event).
A measure of the degree of chance of an uncertain event.
As a general concept, probability is the measure of a chance that something will occur. It is a
numerical measure with a value between 0 (0%) and 1 (100%) where the probability of 0
indicates that the given event cannot occur and a probability of 1(100%) assures certainty of such
an occurrence.
Basic Definitions
Set: is a collection of elements or objects of interest.
Empty set (denoted by Ф or {})
A set containing no element.
Universal set (denoted by S)
A set containing all possible elements.
Complement (Not)
The complement of a set A is A′: a set containing all elements of S that are not in A.
Intersection (And) (A ∩ B)
A set containing all elements that are in both A and B.
Union (Or) (A ∪ B)
A set containing all elements in A or B or both.
Mutually exclusive or disjoint sets
Sets having no element in common, i.e. sets whose intersection is the empty set.
A ∩ B = Ф.
Definition of terms
1. Experiment: - It is an activity or a trial that leads to well-defined results called outcomes,
but it is uncertain which outcome will occur. Hence, a probability experiment is identified
by two properties. First each experiment has several possible outcomes (two or more) and all
these outcomes are known in advance and, second none of these outcomes can be predicted
with certainty. For example for the experiment of tossing a fair coin, we cannot be certain
whether the outcome will be a head.
2. Outcome: - is a result of a single trial (experiment).
3. Sample Space: - is a collection of all possible outcomes of an experiment.
4. Event: - is an outcome or a set of outcomes (having common characteristics) of an
experiment. For example getting exactly one head in a trial of tossing two coins simultaneously
would be an event, E={HT,TH}. Getting an even number in rolling a die, E={2,4,6}.
Eg:
Experiment Sample Space
1. Tossing a coin S={Head (H), Tail (T)}
2. Tossing two coins S={ HH,HT,TH,TT }
3. Rolling a die S={1,2,3,4,5,6}
4. Selecting an item from a production lot S={Defective, Non-defective}
5. Introducing a new product S={Success, Failure}
Probability Theories
Definition of probability
There are three types of definitions of probabilities.
A. Classical (Mathematical) Probability: Suppose there are N possible outcomes in the
sample space S of an experiment. Out of these N outcomes, only n are favorable to the event
E, then the probability that the event E will occur is $P(E) = \frac{n(E)}{n(S)} = \frac{n}{N}$.
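The classical definition can be illustrated by listing a sample space and counting favorable outcomes. The sketch below uses two dice; the helper name `classical_probability` and the chosen events are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling two dice: all ordered pairs (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))


def classical_probability(event):
    """P(E) = n(E) / n(S), where event is a rule that accepts or rejects an outcome."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))


p_sum_7 = classical_probability(lambda o: o[0] + o[1] == 7)   # 6 favourable pairs
p_double = classical_probability(lambda o: o[0] == o[1])      # 6 favourable pairs
print(len(sample_space), p_sum_7, p_double)
```

Exact fractions are used instead of decimals so that results such as 6/36 reduce cleanly to 1/6.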
Ex
1. What is the probability of getting number 6 in rolling a die?
2. What is the probability of getting two heads in tossing two coins?
3. A family plans to have three children. Describe the sample space for all possible
gender combinations. What is the probability that the family will have two boys?
4. A die is rolled. What is the probability of getting
i. An odd number.
ii. Number greater than 3.
5. Two dice are rolled. Describe the sample space. What is the probability of getting
i. A sum of 10 or more.
ii. A pair which at least one number is 3.
iii. A sum of 8, 9 or 10.
iv. One number less than 4.
The difference between classical and empirical probability is that the former uses sample
space to determine the numerical probability while the latter is based on frequency
distribution.
Ex: Given the following frequency distribution.
Grade A B C D F
No of students 10 20 50 15 5
2. If the probability that an event E will occur is P(E), then the probability that this event will
not occur is P(E′), where P(E′)=1-P(E).
3. The sum of the probabilities of each outcome in the sample space S is 1 i.e. ∑Pi=1.
Eg: Rolling a die.
Outcome 1 2 3 4 5 6
∑Pi=1/6+1/6+1/6+1/6+1/6+1/6=1
4. If there are two events E1 and E2, the probability that at least one of these events will occur
is the sum of the probability that each event will occur minus the probability that both events
will occur at the same time (simultaneously).
P(E1 ∪ E2) = P(E1) + P(E2) - P(E1 ∩ E2)
Ex: A part time student is taking two courses, namely Economics and Statistics. The probability
that the student will pass economics course is 0.60 and the probability of passing statistics course
is 0.70. The probability that the student will pass both courses is 0.50. Find the probability that
the student:
a. Will pass at least one course.
b. Will fail both courses.
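The addition rule and the complement rule applied to this example can be checked with exact fractions. The variable names below are illustrative.

```python
from fractions import Fraction

p_econ = Fraction(60, 100)   # P(pass Economics)
p_stat = Fraction(70, 100)   # P(pass Statistics)
p_both = Fraction(50, 100)   # P(pass both courses)

# Addition rule: P(E1 or E2) = P(E1) + P(E2) - P(E1 and E2)
p_at_least_one = p_econ + p_stat - p_both

# Failing both is the complement of passing at least one course.
p_fail_both = 1 - p_at_least_one

print(p_at_least_one, p_fail_both)  # 4/5 and 1/5, i.e. 0.80 and 0.20
```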
Counting Techniques
Counting techniques are mathematical models which are used to determine the number of
possible ways of arranging or ordering objects. They are used to find a solution to fix the size of
the sample space that is extremely large. Eg: What is the size of the sample space if a coin is
tossed large number of times say 20 times or more?
Ex:
1. In how many different ways can a secretary, a president and a manager be selected from 5
persons?
2. A committee of three persons is to be selected from 5 persons. In how many different ways
can this be done?
3. A committee of 5 persons must be selected from 5 men and 8 women. How many ways can
the selection be done if there are at least 3 women in the committee. What is the probability
that the committee consists of at least 3 women?
4. An urn contains 6 white, 4 red and 9 black balls. If three balls are drawn at random, what is
the probability that
i. 2 of the balls drawn are white.
ii. 1 is of each color.
iii. None is red.
iv. At least one is white.
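Counting problems of this kind can be explored with Python's `math.perm` and `math.comb` (available from Python 3.8). The numbers below are illustrative: choosing 3 out of 5 objects with and without regard to order, and the size of the sample space for 20 coin tosses.

```python
import math

# Ordered selections (permutations): 3 distinct positions filled from 5 persons.
ordered = math.perm(5, 3)       # 5 * 4 * 3 = 60

# Unordered selections (combinations): a group of 3 chosen from 5 persons.
unordered = math.comb(5, 3)     # 60 / 3! = 10

# Each coin toss has 2 outcomes, so 20 tosses give 2**20 possible sequences.
coin_sample_space = 2 ** 20

print(ordered, unordered, coin_sample_space)
```

The difference between the two counts is whether order matters: each unordered group of 3 corresponds to 3! = 6 ordered arrangements.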
Definitions:
Two events are said to be mutually exclusive if they cannot occur at the same time or if the
occurrence of one prevents the occurrence of the other; as a result P(A ∩ B) = 0. Two events are said
to be independent if the occurrence of one does not affect the occurrence of the other, in which case
P(A ∩ B) = P(A) P(B).
Ex:
1. What is the probability of getting a head or a tail in tossing a coin?
2. A coin is tossed and a die is rolled. Find the probability of getting a head on the coin and number
4 on the die. What is the probability of getting a head on the coin or number 4 on the die?
Conditional Probability
When the outcome or occurrence of an event affects the outcome or occurrence of another event,
the two events are said to be dependent (conditional).
If two events, A and B, are dependent on each other, the probability of event A occurring
knowing that event B has already occurred is said to be the conditional probability of A given
that event B has already occurred, $P(A/B) = \frac{P(A \cap B)}{P(B)}$, and the probability of event B occurring
knowing that event A has already occurred is said to be the conditional probability of B given
that event A has already occurred, $P(B/A) = \frac{P(A \cap B)}{P(A)}$.
=> P(A ∩ B) = P(A) P(B/A) and P(A ∩ B) = P(B) P(A/B).
Ex:
1. A package contains 12 resistors, 3 of which are defective. If 4 are selected, find the
probability of getting
a. No defective resistor.
b. One defective resistor.
c. Three defective resistors.
2. An urn contains 6 white and 3 black balls. Three balls are drawn. What is the probability
that all the drawn balls will be black
A. If the selection is done with replacement.
B. If the selection is without replacement.
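The urn example shows the difference between independent draws (with replacement) and dependent draws (without replacement). A minimal sketch with exact fractions follows; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

white, black = 6, 3
total = white + black

# With replacement: the draws are independent, so probabilities simply multiply.
p_with = Fraction(black, total) ** 3

# Without replacement: multiply conditional probabilities, P(A and B) = P(A) P(B/A).
p_without = (Fraction(black, total)
             * Fraction(black - 1, total - 1)
             * Fraction(black - 2, total - 2))

# Cross-check by counting: ways to choose 3 black balls out of all 3-ball selections.
p_count = Fraction(comb(black, 3), comb(total, 3))

print(p_with, p_without, p_count)  # 1/27, 1/84, 1/84
```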
Exercises
1. A certain travel club has 1000 members. 60% of these members are males. 45% of these
members pay by credit card when they travel including 175 females. If a member is selected
from the travel club at random, what is the probability that :
a. The member is a female.
b. The member is a female and pays in cash.
c. The member is a male or a credit card user.
d. The member pays cash if we know that the member is a female.
e. Are the sex of the member and the mode of payment statistically independent events?
2. An integer is chosen at random from the first 200 numbers. Find the probability that the
chosen integer is divisible by 6 or 8.
Chapter Six
Probability Distributions
Random Variable is a variable whose values are determined by chance or with some
probability.
Discrete random variable is a random variable that assumes only certain clearly separated
values or whole numbers.
Continuous random variable assumes any values between two specific values.
Probability Distribution is a listing of all possible values of a random variable together with
their corresponding probabilities.
Ex.
1. Show that the following functions are pdfs.
a. $f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$
b. $f(x) = \begin{cases} e^{-x}, & x \ge 0 \\ 0, & \text{otherwise} \end{cases}$
3. Find the mean of the following probability distribution. $f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$
Let X be the number of successes. Then X follows a binomial distribution with parameters n,
number of experiments performed and p, probability of success, and write as X~Bin(n,p).
The probability of getting exactly x successes in n trials is given by
$P(X = x) = \binom{n}{x} p^x q^{n-x}$, x = 0, 1, 2, …, n.
Where p is the probability of success
q=1-p is the probability of failure
n is number of trials
x is number of successes.
This is called the Binomial Distribution.
The mean of a binomial distribution is E(X) =np and variance is V(X) =npq (S.d=sqrt (V(X))).
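The binomial formula, its mean and its variance can be evaluated exactly with fractions. The sketch below assumes an illustrative case of a fair coin tossed 10 times; the function name `binomial_pmf` is not from the handout.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * q**(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, Fraction(1, 2)   # illustrative: 10 tosses of a fair coin

p_exactly_3 = binomial_pmf(3, n, p)                          # 120 / 1024
p_at_most_3 = sum(binomial_pmf(x, n, p) for x in range(4))   # x = 0, 1, 2, 3
p_at_least_3 = 1 - sum(binomial_pmf(x, n, p) for x in range(3))

mean = n * p                 # E(X) = np
variance = n * p * (1 - p)   # V(X) = npq

print(p_exactly_3, p_at_most_3, p_at_least_3, mean, variance)
```

"At least" probabilities are usually easiest through the complement, as done for `p_at_least_3`.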
Ex
1. Suppose a coin is tossed 10 times. What is the probability of getting
a. Exactly 3 heads
b. At most 3 heads
c. At least 3 heads
d. More than 3 heads
e. No head
f. Find the average and variance of the number of heads.
2. The probability of a man kicking into the goal is 2/3. If a person kicks 5 times, what is the
probability of scoring
a. At least one goal.
b. At most 3 goals.
c. Find the average, variance and standard deviation of the number of goals.
The Poisson Distribution
Properties
1. The probability of success, p, is very small.
2. The experiment is performed indefinitely (n is very large).
3. The average number of events per unit of time (λ) is known.
Thus, the random variable X (number of successes) has a Poisson distribution with parameter λ,
X~Poisson (λ) and the probability of getting x successes is given by
$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}$, x = 0, 1, 2, ….
If X is a Poisson random variable then E(X) = λ and V(X) = λ.
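The Poisson probabilities can be computed directly from the formula. The sketch below assumes an illustrative average of 3 events per unit; `poisson_pmf` and `lam` (standing for the Poisson parameter, the mean number of events) are hypothetical names.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x!, where lam is the mean number of events."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


lam = 3  # illustrative average number of events per unit

p_none = poisson_pmf(0, lam)                            # exp(-3)
p_more_than_one = 1 - p_none - poisson_pmf(1, lam)      # 1 - P(0) - P(1)

print(round(p_none, 4), round(p_more_than_one, 4))
```

Because X has no upper limit, "more than" probabilities are obtained through the complement rather than by summing an infinite number of terms.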
Ex:
1. On average a typist commits 3 errors per page. Find the probability that she will make
a. No mistake.
b. More than one mistake.
2. Customers arrive at a photocopying machine at an average rate of two every 10 minutes. What
is the probability that there will be
a. No arrivals during any period of ten minutes.
b. Exactly one arrival during this time period.
c. More than two arrivals during this time period.
The Normal Distribution
The normal distribution is a continuous probability distribution and plays a very important and
pivotal role in statistical theory and practice, particularly in the area of statistical inference and
statistical quality control. Its importance is due to the fact that in practice, the experimental
results, very often seem to follow the normal distribution or bell shaped curve. The normal curve
is symmetrical and is defined by its mean µ and its standard deviation σ. The normal curve is not
just one curve but a family of curves. Just as the equation for a circle describes the family of
circles, some small and some large, the equation for the normal curve describes a family of such
curves which may differ only with regard to the values of µ and σ, but have the same
characteristics.
Characteristics of the Normal Curve
1. The normal curve is symmetrical about the mean. This means that the number of units in the
data below the mean is the same as the number of units above the mean. This means the
mean and median have the same value.
2. The height of the normal curve is maximum at the mean value. Thus, the mean and mode
coincide. This means that the normal distribution has the same value of mean, median and
mode.
3. The curve declines as we go in either direction from the mean, but never touches the base (X-
axis) so that the tails of the curve on both sides extend indefinitely.
4. The corresponding deciles, quartiles and percentiles are equi-distant from the mean.
The height of the normal curve Y at any value of the random variable X is given by
$Y = f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$, $-\infty < x < \infty$, and we write $X \sim N(\mu, \sigma^2)$
Where µ is the mean of the distribution
σ is the standard deviation of the distribution
Standard Normal Distribution
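If $X \sim N(\mu, \sigma^2)$, then $Z = \frac{X - \mu}{\sigma}$ is called the standard normal variate with mean zero
and variance one, written as $Z \sim N(0,1)$, and $f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}$, $-\infty < z < \infty$.
Therefore, $P(a < X < b) = \int_a^b f(x)\,dx = \int_a^b \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = P\left(\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right) = \int_{(a-\mu)/\sigma}^{(b-\mu)/\sigma} f(z)\,dz$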
Ex:
The IQ score of students is normally distributed with a mean of 120 and variance 400. What is
the probability that a student will have an IQ.
a) Between 100 and 130.
b) Below 150.
c) Above 140.
d) Between 140 and 150.
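The standardization step can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8+), which plays the role of a standard normal table here. Only part (a) of the example is shown, and the helper name `prob_between` is an assumption.

```python
from statistics import NormalDist

mu, sigma = 120, 20            # variance 400, so the standard deviation is 20
z_dist = NormalDist()          # standard normal distribution, N(0, 1)


def prob_between(a, b):
    """P(a < X < b) = P((a - mu)/sigma < Z < (b - mu)/sigma)."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    return z_dist.cdf(z_b) - z_dist.cdf(z_a)


p_a = prob_between(100, 130)   # Z between -1.0 and 0.5
print(round(p_a, 4))

# Cross-check without standardizing, using N(120, 20^2) directly.
direct = NormalDist(mu, sigma).cdf(130) - NormalDist(mu, sigma).cdf(100)
print(round(direct, 4))
```

Both routes give the same probability, which is the point of the transformation to Z: one table for N(0, 1) is enough for every normal distribution.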
Chapter Seven
7. Statistical Inference
The most important objective of statistical analysis is to draw inferences about the population
using sample information. The process of generalizing from sample to the population is known
as Statistical Inference.
A summary measure that describes any given characteristic of the population is known as
Parameter. Eg: population mean (µ), population variance (σ²), population standard deviation (σ),
population proportion (P), population moments (µr) are parameters.
The summary measure that describes the characteristic of the sample is known as Statistic.
Eg: sample mean ($\bar{X}$), sample variance (S²), sample standard deviation (S), sample proportion
(p), sample moments ( X r ) are Statistics.
Statistical inference generally takes one of the two forms, namely, estimation of the population
parameter and testing of hypothesis.
For the purpose of general discussion, a population parameter is denoted by θ and the
corresponding statistic by $\hat{\theta}$. As already stated the parameter θ is unknown. The value of the
statistic $\hat{\theta}$ is computed from the random sample taken from the population.
The statistic $\hat{\theta}$ intended for estimating a parameter θ is called an Estimator of θ. The specific
numerical value of an estimator calculated from the sample is called the Estimate.
The process of obtaining an estimate of the unknown value of a parameter by a statistic is called
Estimation. There are two types of estimations. One is the point estimation and the other is
interval estimation.
7.1 Point Estimation
It is the process of obtaining a single sample value (point estimate) that is used to estimate the
desired population parameter. The estimator is known as a point estimator.
Eg: X̄ is a point estimator of µ. S is a point estimator of σ.
The best estimator should be highly reliable and have such desirable properties as unbiasedness,
consistency, efficiency and sufficiency. These criteria are described as follows:
1. Unbiasedness: An estimator is a random variable since it is always a function of the sample
values. A sample statistic is said to be an unbiased estimator if its expected value
equals the population parameter being estimated, i.e. E(θ̂) = θ.
2. Consistency: It refers to the effect of sample size on the accuracy of the estimator. A statistic
is said to be a consistent estimator of the population parameter if it approaches the parameter
as the sample size increases, i.e. θ̂ → θ as n → N.
3. Efficiency: An estimator is considered to be efficient if its value remains stable from sample
to sample. The best estimator would be the one which would have the least variance from
sample to sample. Of the three point estimators of central tendency, namely the mean,
median and mode, the mean is considered to have the least variance and hence to be a better estimator.
4. Sufficiency: An estimator is said to be sufficient if it uses all the information about the
population parameter contained in the sample. For example, the statistic mean uses all the
sample values in its computation while median and mode do not. Hence the mean is a better
estimator in this sense.
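The unbiasedness and efficiency criteria can be illustrated by simulation. The sketch below repeatedly draws samples from a normal population and compares the sample mean with the sample median. The population values, sample size and number of repetitions are hypothetical choices made only for this illustration.

```python
import random
from statistics import mean, median, pvariance

random.seed(0)  # fixed seed so the run is repeatable

MU, SIGMA = 50.0, 10.0   # hypothetical population mean and standard deviation
N, REPS = 25, 2000       # sample size and number of repeated samples

sample_means, sample_medians = [], []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    sample_means.append(mean(sample))
    sample_medians.append(median(sample))

# Unbiasedness: the average of the estimates should be close to MU.
print("average of sample means  :", round(mean(sample_means), 2))
print("average of sample medians:", round(mean(sample_medians), 2))

# Efficiency: the estimator with the smaller sampling variance is more efficient.
var_mean = pvariance(sample_means)
var_median = pvariance(sample_medians)
print("variance of sample means  :", round(var_mean, 2))
print("variance of sample medians:", round(var_median, 2))
```

For a normal population both estimators centre on µ, but the sample mean varies less from sample to sample than the sample median, which is what the efficiency criterion describes.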
7.2 Interval Estimation for the Population Mean (µ)
If the probability of rejecting a true hypothesis is given, it is denoted by α and is called the level
of significance. The (1-α)100% confidence interval for µ is
$$\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = (L, U)$$
where $Z_{\alpha/2}\,\sigma/\sqrt{n}$ is the maximum error of the estimate (the maximum difference between the point
estimate of a parameter and the actual value of the parameter).
Interpretation of the confidence Interval
a. If all possible samples of size n were drawn, then on average (1-α)100% of these samples
would include the population mean within the interval around their sample means bounded
by L and U.
b. If we took a random sample of size n from a given population, the probability is 1-α that the
population mean would lie within the interval (L, U) around the sample mean.
c. If a random sample of size n was taken from the population, we can be (1-α)100% confident
in our assertion that the population mean would lie around the sample mean in the interval
bounded by the values L and U.
Ex: - Haramaya University wishes to estimate the average age of students who graduate with
B.Sc. degree. A random sample of 625 graduating students showed that the average age was 24
with a standard deviation of 5 years. Construct the 95% confidence interval for the true average
age of all such graduating students at the university and interpret it.
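A minimal sketch of the calculation for this example follows. It assumes that, because n = 625 is large, the sample standard deviation can stand in for the unknown σ in the interval X̄ ± Zα/2·σ/√n. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, xbar, s = 625, 24.0, 5.0   # sample size, sample mean, sample standard deviation
confidence = 0.95
alpha = 1 - confidence

# Z_{alpha/2}: the value with upper-tail area alpha/2 under N(0, 1), about 1.96.
z = NormalDist().inv_cdf(1 - alpha / 2)

# Because n is large, s is used in place of the unknown sigma (an assumption).
margin = z * s / sqrt(n)          # maximum error of the estimate
lower, upper = xbar - margin, xbar + margin

print(f"{confidence:.0%} CI for the mean age: ({lower:.2f}, {upper:.2f})")
```

The interval is approximately (23.61, 24.39) years, so we can be 95% confident that the true average age of graduating students lies between about 23.6 and 24.4 years.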
7.3 Hypothesis Testing
A statistical hypothesis is a conjecture (an assumption) about a population parameter which may
or may not be true. Hypothesis testing is a statistical procedure that leads to a decision
about whether such an assumption about the population parameter is correct, using data
obtained from the sample.
In hypothesis testing, the researcher must define the population under study, state the particular
hypothesis that will be checked, give the significance level, select a sample from the population,
perform the calculations required for the statistical test and reach a conclusion.
As already stated, a statistical hypothesis may or may not be true. For each situation, there
are two types of statistical hypotheses.
1. Null Hypothesis (H0):- is a statistical hypothesis that states there is no difference between a
parameter and a specific value or hypothesized value. H0:µ=µ0 where µ is the population
mean and µ0 is the hypothesized mean
2. Alternative Hypothesis (H1):- is a statistical hypothesis that states there exists a difference
between a parameter and a specific value or hypothesized value.
H1: µ≠µ0 or H1: µ<µ0 or H1: µ>µ0
Errors in Hypothesis Testing
1. Type I error: is the error committed when one rejects a null hypothesis which is actually true.
2. Type II error: is the error committed when one fails to reject a null hypothesis which is actually
false.
The maximum probability of committing a type I error is called the level of significance and is
denoted by α (alpha).
Hypothesis testing for the population mean
1. State both hypotheses: the null and the alternative. The hypotheses may take any one
of the following three forms.
H0: µ=µ0              H0: µ=µ0              H0: µ=µ0
H1: µ≠µ0              H1: µ<µ0              H1: µ>µ0
Two-tailed test       Left-tailed test      Right-tailed test
2. Determine the level of significance α and obtain the tabulated (critical) value. For a two-tailed
test the critical value is Zα/2 (tα/2), for a left-tailed test -Zα (-tα) and for a right-tailed test Zα (tα).
3. Use the appropriate test statistic.
Use the t statistic if n is small: $t = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t(n-1)$.
If n is large, use the Z statistic: $Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$.
4. Define the critical (rejection) region.
5. If the value of the test statistic falls in the critical region (rejection region), reject the null
hypothesis; otherwise do not reject it.
6. Make a decision.
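The steps above can be sketched as a small function for the Z statistic case (population standard deviation known or n large). The function name `z_test_mean`, its parameters and the data in the usage line are hypothetical and serve only as illustration. The t statistic case is not covered because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """Z test for H0: mu = mu0. tail is "two", "left" or "right".

    Returns (z, critical_value, reject_h0).
    """
    std_normal = NormalDist()
    z = (xbar - mu0) / (sigma / sqrt(n))               # step 3: test statistic
    if tail == "two":
        critical = std_normal.inv_cdf(1 - alpha / 2)   # Z_{alpha/2}
        reject = abs(z) > critical
    elif tail == "left":
        critical = -std_normal.inv_cdf(1 - alpha)      # -Z_alpha
        reject = z < critical
    elif tail == "right":
        critical = std_normal.inv_cdf(1 - alpha)       # Z_alpha
        reject = z > critical
    else:
        raise ValueError("tail must be 'two', 'left' or 'right'")
    return z, critical, reject

# Hypothetical data: sample mean 52 from n = 64, sigma = 8, testing H0: mu = 50.
z, critical, reject = z_test_mean(xbar=52, mu0=50, sigma=8, n=64, alpha=0.05, tail="two")
print(f"z = {z:.2f}, critical = ±{critical:.2f}, reject H0: {reject}")
```

Here z = 2.00 exceeds the two-tailed critical value of about 1.96, so H0 would be rejected at the 5% level for these hypothetical data.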
EX:
1. A researcher reports that the average salary of veterinarians is more than $42000. A sample of
30 veterinarians has a mean salary of $43260. Test the report's claim. Assume the population
standard deviation is $5230.
2. A national magazine claims that the average college student watches less television than the
general public. The national average is 29.4 hours per week, with a standard deviation of 2
hours. A random sample of 25 college students has a mean of 27 hours. Test the claim.
Assume normality.
3. A merchant believes that the average age of customers who purchase a certain brand of wear
is 13 years. A random sample of 35 customers had an average age of 15.6 years. At
α=0.01, should this conjecture be rejected? The standard deviation of the population is 1 year.
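One possible way to check the three exercises numerically is sketched below. Exercises 1 and 2 do not state a level of significance, so α = 0.05 is assumed here. All three use the Z statistic because the population standard deviation is given. The helper name `z_stat` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def z_stat(xbar, mu0, sigma, n):
    """Test statistic Z = (xbar - mu0) / (sigma / sqrt(n))."""
    return (xbar - mu0) / (sigma / sqrt(n))

# Exercise 1: H0: mu = 42000 vs H1: mu > 42000 (right-tailed), alpha = 0.05 assumed.
z1 = z_stat(43260, 42000, 5230, 30)
reject1 = z1 > std_normal.inv_cdf(1 - 0.05)

# Exercise 2: H0: mu = 29.4 vs H1: mu < 29.4 (left-tailed), alpha = 0.05 assumed.
z2 = z_stat(27, 29.4, 2, 25)
reject2 = z2 < -std_normal.inv_cdf(1 - 0.05)

# Exercise 3: H0: mu = 13 vs H1: mu != 13 (two-tailed), alpha = 0.01 as given.
z3 = z_stat(15.6, 13, 1, 35)
reject3 = abs(z3) > std_normal.inv_cdf(1 - 0.01 / 2)

for k, (z, reject) in enumerate([(z1, reject1), (z2, reject2), (z3, reject3)], start=1):
    print(f"Exercise {k}: z = {z:.2f}, reject H0: {reject}")
```

Under these assumptions the test statistics are about 1.32, -6.00 and 15.38. H0 is not rejected in exercise 1, and it is rejected in exercises 2 and 3.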
Chapter Eight
Simple Linear Regression and Correlation
In the previous chapters we have been dealing with a single variable. In this chapter we will deal
with bivariate data, i.e. data involving two variables. In this section we will deal with the
problem of predicting the average value of one variable in terms of known values of the other
variable(s).
Regression may be defined as the estimation or prediction of the unknown value of one variable
from the known values of one or more variables. The variable whose values are to be estimated
or predicted is known as the dependent or explained variable, while the variables which are used in
determining the value of the dependent variable are called independent or predictor variables.
The regression study that involves only two variables is called simple regression and the
regression analysis that studies more than two variables is called multiple regression. If the
relationship between the two variables can be described by a straight line then the regression is
known as linear regression otherwise it is called non-linear.
The regression analysis involving only two variables and having a linear relationship is
called Simple Linear Regression. This linear relationship between the two variables is
represented by a straight line.
Regression Line (Line of Regression): is the line that gives the best estimate of one variable for
any given value of another variable. The regression line which is used to predict the values of Y
for any given value of X is called the regression line of Y on X. Similarly, the regression line which is
used to predict the values of X for any given value of Y is called the regression line of X on Y.
Regression Equation: is a mathematical equation that defines the relationship between two
variables.
Regression of Y on X: Model: Y = α + βX + ε
Where Y is the dependent variable
X is the independent variable
α is the constant term (intercept)
β is the slope (change in Y for a unit change in X)
ε is the error term
To estimate the regression coefficients (α and β), the procedure is to minimize the sum of the
squares of the errors. Let the estimated model be Ŷ = a + bX. Then, from sample data the values
of a (estimate of α) and b (estimate of β) can be obtained as follows:
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} \quad\text{and}\quad a = \bar{Y} - b\bar{X}.$$
Interpretation of the slope (b)
1. If b is positive, there is a direct relationship between the two variables.
2. If b is zero, there is no linear relationship between the two variables.
3. If b is negative, there is an indirect relationship between the two variables.
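The estimates a and b can be computed directly from the sums in the formulas above. The following is a minimal sketch; the function name `fit_line` and the data are hypothetical and were chosen so that the points lie exactly on the line Y = 1 + 2X.

```python
def fit_line(xs, ys):
    """Return (a, b) for the fitted line Y-hat = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                 # intercept: Y-bar - b * X-bar
    return a, b

# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(f"Y-hat = {a:.2f} + {b:.2f} X")
```

Since b = 2 is positive, the fitted line indicates a direct relationship: Y increases by 2 units for each unit increase in X.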
Correlation
Most of the variables in economics and business show a relationship. For example, price and
supply, income and expenditure, advertising expenditure and sales. Thus in order to know the
degree or direction of such a relationship between variables, correlation analysis is important.
Correlation is a mathematical tool designed for measuring the degree of the relationship
(degree of association) between the variables. Correlation that involves only two variables is
called simple correlation and that which involves more than two variables is called multiple
correlation.
Covariance is a measure of the joint variation in two variables, i.e. it measures the way in which
the values of the two variables vary together. If the covariance is zero, there is no linear
relationship between the two variables. If it is negative, there is an indirect linear relationship
between them. If the covariance is positive, there is a direct linear relationship between the
variables.
Pearson's coefficient of correlation (r)
Pearson's coefficient of correlation (r) is used to measure the strength of the linear relationship
between two variables.
The population correlation coefficient is denoted by ρ and the sample correlation coefficient is
denoted by r.
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$
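The sample correlation coefficient can be computed from the same sums used for the regression slope. The sketch below is illustrative only; the function name `pearson_r` and the paired data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r from paired observations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired data with a direct (positive) linear relationship.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```

The value of r always lies between -1 and 1. Its sign matches the sign of the slope b, because both formulas share the same numerator.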
Sales (Y) 10 11 13 15 16 19 14
a. Estimate the regression equation of sales on supply.
b. Interpret the estimated coefficients (the slope and intercept).
c. Calculate the correlation coefficient between supply and sales, and interpret it.
d. Find the coefficient of determination and interpret it.
e. Predict the amount of sales of the commodity if the supply amount is 80.
2. The following summary results are obtained from price and demand of a commodity
∑price=30 ∑demand=40 ∑(price)(demand)=214
∑(price)²=220 ∑(demand)²=340 n=5
a. Identify the dependent and independent variable.
b. Estimate the regression equation.
c. Interpret the estimated coefficients.
d. Calculate the correlation coefficient between price and demand, and interpret it.
e. Find the coefficient of determination and interpret it.
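A possible numerical check for exercise 2 is sketched below, working directly from the summary sums. It assumes that price is the independent variable (X) and demand the dependent variable (Y). The variable names are illustrative.

```python
from math import sqrt

# Summary results from exercise 2; X = price and Y = demand is an assumption.
n = 5
sum_x, sum_y = 30, 40      # sum of price, sum of demand
sum_xy = 214               # sum of price * demand
sum_x2, sum_y2 = 220, 340  # sum of price^2, sum of demand^2

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
a = sum_y / n - b * sum_x / n                                  # intercept
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r_squared = r ** 2   # coefficient of determination

print(f"demand-hat = {a:.2f} + ({b:.2f}) * price")
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

Under this assumption b = -0.65 and a = 11.9, so estimated demand falls by 0.65 units for each unit increase in price. The correlation is about -0.92 and the coefficient of determination about 0.845, meaning roughly 84.5% of the variation in demand is explained by price.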
3. Given n=25, X̄=3.95, Ȳ=2.03, Sx²=85.35, Sy²=98.75, Sxy=90
a. Fit the regression equation Y on X.
b. Interpret the estimated coefficients.
c. Calculate the correlation coefficient and interpret it.
d. Find the coefficient of determination and interpret it.