Introductory Statistics Notes
Numbers play an essential role in statistics: they provide the raw material of statistics. This material
must be processed to be useful, just as crude oil must be refined into petrol before it can be used
by an automobile engine. The study of statistics involves methods of refining numerical and non-
numerical information into useful forms.
Numbers can represent quantities and values of commodities produced and sold, prices of
products, inventories, assets and liabilities, raw materials, customers, income and expenses,
records of birth and death rates, the number of passengers travelling during a year by road, rail, ship
etc., the value of imports and exports of different commodities to and from different countries, the
number of students in various courses in a university, agricultural production, or the size of the
working population in a certain village, ward, district, region or country.
Whenever numbers are collected and compiled, regardless of what they represent, they become
statistics. In other words, the term statistics is considered synonymous with ways and means of
representing and handling data, making inferences logically and drawing conclusions.
Statistical Methods:
The large volume of numerical information gives rise to the need for systematic methods which
can be used to organize, present, analyze and interpret the information effectively. Statistical
methods are primarily developed to meet this need.
The methods by which statistical data are analyzed are called statistical methods. Statistical
methods are applicable to a very large number of fields such as Economics, Sociology, Management,
Agriculture, etc.
Statistical methods are used by governmental bodies, private firms and research agencies as
an indispensable aid in forecasting, controlling and exploring.
Note: Statistics is usually not studied for its own sake; rather it is widely employed as a tool, and a
highly valuable one, in the analysis of problems in the natural, physical and social sciences.
Statistics:
Refers to the collection, presentation, analysis and utilization of numerical data to make
inferences and reach decisions in the face of uncertainty in economics, business and other social
and physical sciences.
It is the body of procedures and techniques used to collect, present and analyze data on
which to base decisions in the face of uncertainty or incomplete information.
ABUSES OF STATISTICS
Most of the time, samples are used to infer something about the population. If an experiment is
conducted but the results are interpreted wrongly, the resulting information is meaningless.
Things which can lead to wrong information are:
Sample size: a sample that is badly chosen, whether very small or very large, can lead to wrong conclusions.
Summarization: graphs whose meaning is unclear, e.g. percentages that are not appropriate
to what is being represented.
OBJECTIVES OF STATISTICS
LIMITATIONS OF STATISTICS
Statistics is not suited to the study of qualitative phenomena such as honesty,
poverty and culture.
Statistics deals with aggregates of objects and not with individual figures.
There is no certainty in statistics, as statistical laws are not exact (approximations
are dominant).
USES OF STATISTICS
Definitions of terms.
Descriptive statistics summarizes a body of data with one or two pieces of information that
characterize the whole data. It also refers to the presentation of a body of data in the form of
tables, charts, graphs, and other forms of graphic display.
Inferential statistics tries to infer information about a population by using information gathered
by sampling. Or refers to the drawing of generalizations about the properties of the whole from the
specific or a sample drawn from the population, it involves inductive reasoning (This is to be
contrasted with deductive reasoning which ascribes properties to the specific starting with whole).
Applied statistics is the application of statistical methods and techniques to the problems and
facts of real life as they exist, for example quality control, sample surveys and quantitative
analysis for business decisions.
Population: The complete set of data elements is termed the population. The meaning of the term
varies widely with its application. Examples could be any of the following: animals; primates; human
beings; homo sapiens; high school students attending the Math & Science Center.
Sample: A sample is a portion of a population selected for further analysis.
Parameter: A parameter is a characteristic of the whole population.
DATA TYPES
Data are the facts and figures collected, analyzed and summarized for presentation and
interpretation. All the data collected in a particular study are referred to as the data set for the
study. Data can also be classified as either qualitative or quantitative;
Qualitative data.
Data are nonnumeric. {Poor, Fair, Good, Better, Best}, colors (ignoring any physical causes), and
types of material {straw, sticks, bricks} are examples of qualitative data.
Qualitative data include labels or names used to identify an attribute of each element. Qualitative
data use either the nominal or ordinal scale of measurement and may be nonnumeric or numeric.
The statistical analysis appropriate for a particular variable depends upon whether the variable is
qualitative or quantitative. If the variable is qualitative the statistical analysis is rather limited. We
can summarize qualitative data by counting the number of observations in each qualitative
category or by computing the proportion of the observations in each category. However, even
when the qualitative data use a numeric code, arithmetic operations such as addition, subtraction,
multiplication and division do not provide meaningful results.
Qualitative data are often termed categorical data. Some books use the terms individual and
variable to reference the objects and characteristics described by a set of data. They also stress
the importance of exact definitions of these variables, including what units they are recorded in.
The reason the data were collected is also important.
Quantitative data.
Quantitative data are numeric values that indicate how much or how many; they are obtained
using either the interval or ratio scale of measurement. Because quantitative data use numeric
values, arithmetic operations such as addition, subtraction, multiplication and division provide
meaningful results for a quantitative variable. For example, the data may be added and then
divided by the number of observations to compute the average value. This average is usually
meaningful and easily interpreted.
Discrete data are numeric data that have a finite number of possible values.
• A classic example of discrete data is a finite subset of the counting numbers, {1, 2,
3, 4, 5} perhaps corresponding to {strongly Disagree... Strongly Agree}.
• When data represent counts, they are discrete. An example might be how many
students were absent on a given day. Counts are usually considered exact and
integer. Consider, however, if three tardies make an absence, then aren't two
tardies equal to 0.67 absences?
Continuous data have infinite possibilities: 1.4, 1.41, 1.414, 1.4142, 1.41421, …
The real numbers are continuous with no gaps or interruptions. Physically measurable quantities
such as length, volume, time and mass are generally considered continuous. At the physical
(microscopic) level, especially for mass, this may not be true, but for normal life situations it is a
valid assumption.
The structure and nature of data will greatly affect our choice of analysis method. By structure we
are referring to the fact that, for example, the data might be pairs of measurements. Consider the
legend of Galileo dropping weights from the leaning tower of Pisa. The times for each item would
be paired with the mass (and surface area) of the item. Something which Galileo clearly did was
measure the time it took a pendulum to swing with various amplitudes. Galileo Galilei is
considered a founder of the experimental method.
Sources of data
Data can be obtained from two major sources:
Primary source.
Secondary source.
SCALES OF MEASUREMENT
The search for answers to research questions calls for collection of data. Data are facts, figures
and other relevant materials collected, analyzed and summarized for presentation and
interpretation.
Data collection requires one of the following scales of measurement; nominal, ordinal, interval, or
ratio. The scale of measurement determines the amount of information contained in the data and
indicates the most appropriate data summarization and statistical analysis.
Nominal;
When the data for a variable consist of labels or names used to identify an attribute of the element,
the scale of measurement is considered a nominal scale. In cases where the scale of
measurement is nominal, a numeric code as well as nonnumeric labels may be used. For
instance, to facilitate data collection and entry of the data into a computer database, we might
use a numeric code by assigning numbers to the labels.
Ordinal Scale;
The scale of measurement for a variable is called an ordinal scale if the data exhibit the properties
of nominal data and the order or rank of the data is meaningful. For instance; the data recorded
as excellent indicate the best, followed by good and then poor. The scale of measurement is
ordinal. Note that the ordinal data can also be recorded using numeric code.
Interval Scale;
The scale of measurement for a variable becomes an interval scale if the data show the
properties of ordinal data and the interval between values is expressed in terms of a fixed unit of
measure. Interval data are always numeric. For example, three students with SAT aptitude test
scores of 1120, 1050 and 970 can be ranked or ordered from best performance to poorest
performance, and in addition the differences between the scores are meaningful.
Ratio Scale
A variable is ratio scaled if the data have all the properties of interval data and the ratio of two
values is meaningful. Variables such as distance, height, weight, and time use the ratio scale of
measurement. This scale requires that a zero value be included to indicate that nothing exists for
the variable at zero point. For example, consider the cost of an automobile. A zero value for the
cost would indicate that the automobile has no cost and is free.
Observation plays a major role in formulating and testing hypotheses in the social sciences.
Behavioral scientists observe interactions in small groups; anthropologists observe simple
societies and small communities; political scientists observe the behavior of political leaders and
political institutions. A researcher silently watching a city council, trade union committee,
quality circle, departmental meeting or conference of politicians picks up hints that
help him to formulate new hypotheses.
Reliability entails consistency and freedom from measurement error. This is usually assessed in
terms of the extent to which two or more independent observers agree in their ratings of the same
event.
Validity refers to the extent to which the recorded observations accurately reflect the construct
they are intended to measure. Validity is assessed by examining how well the observation agrees
with alternative measures of the same construct.
Observation is defined as a systematic viewing of a specific phenomenon in its proper setting for
the specific purpose of gathering data for a particular study. Observation as a method includes
both 'seeing' and 'hearing', and it is accompanied by perceiving as well.
Advantages of observation:
No interview is required, although one can be done as a follow-up on the observed
information.
It can be highly accurate.
The information obtained relates to what is currently happening.
INTERVIEWING.
Interviewing may be used either as a main method or as a supplementary one in studies, and it is
the only suitable method for gathering information from illiterate or less educated respondents. It is
useful for collecting a wide range of data, from factual demographic data to highly personal and
intimate information relating to a person's opinions, attitudes, values, beliefs, past experience
and future intentions.
Interviews can add flesh to statistical information; they enable the investigator to grasp the behavioral
context of the data furnished by the respondents. They permit the investigator to seek clarifications
and bring to the forefront those questions that, for one reason or another, respondents do not want to
answer.
Types of interview
The interviews are classified into structured (directive) and unstructured (non-directive) interviews;
Structured interview. This is an interview made with a detailed standardized schedule. The
same questions are put to all the respondents and in the same order, each question is asked in
the same way in each interview promoting measurement reliability. This interview is used for large
scale formalized surveys.
The data from structured interviews are easy to compare, recording and coding the data do not
pose any problem, and greater precision is achieved. Finally, attention is not diverted to extraneous,
irrelevant and time-consuming conversation.
Unstructured interview. In this interview, interviewer encourages the respondent to talk freely
about a given topic with minimum of prompt or guidance. The interviewer avoids channeling the
interview directions, he/she develops a permissive atmosphere and the questions are not
standardized and not ordered in a particular way.
This type of interviewing is more useful in case studies than in surveys. It is particularly useful in
exploratory research where the lines of investigation are not clearly defined. It is also useful for
gathering information on sensitive topics such as divorce, social discrimination, class conflict and
the generation gap. It provides the opportunity to explore the various aspects of the problem in an
unrestricted manner.
Disadvantages of personal interview:
It is time consuming, especially when the sample is large and recall upon the respondents is
necessary.
Telephone interview:
MAIL SURVEY.
This method involves sending questionnaires to the respondents with a request to complete them
and return them by post. This can be used in case of educated respondents only. The mail
questionnaire should be simple so that the respondents can easily understand the questions and
answer them. It should preferably contain mostly closed-end and multiple-choice questions so that
it could be completed within a few minutes.
Procedure;
The researcher should prepare a mailing list of the selected respondents by collecting the
addresses from the telephone directory or the association or the organization to which
they belong.
A covering letter should accompany each copy of the questionnaire. It must explain to the
respondent the purpose of the study and the importance of his participation to the progress of the
organization.
EXPERIMENTATION.
Experimentation is a research process used to study the causal relationships between variables.
It aims at studying the effect of independent variables on a dependent variable by keeping the
other independent variables constant through some type of control mechanism.
The fundamental weakness of any non-experimental study is its inability to specify causes and
effects. It can show only correlations between variables, but correlation alone never proves
causation. The experiment is the only method which can show the effect of an independent
variable on a dependent variable. In experimentation, the researcher can manipulate the independent
variable and measure its effect on the dependent variable. For example, the effect of various
types of promotional strategies on the sales of a given product can be studied by using different
advertising media such as TV, radio and newspapers.
QUESTIONNAIRE DESIGN
A questionnaire is an instrument or tool of data collection which contains a set of questions
logically related to a problem under study, aimed at eliciting responses from the respondents.
Construction of questionnaire.
Construction of a questionnaire is not a matter of simply listing questions that come from the
researcher's mind. It is a rational process involving much time, effort, and thought:
a) Data need determination: A mailed questionnaire is an instrument for gathering data for a
specific study, so its construction should flow logically from the data required for the given study.
The data required for a specific study can be determined by a deep analysis of the research
objectives, the investigative questions relating to each of the research objectives, the hypotheses
and the operational definitions of the concepts used in them.
This will help to identify gaps and duplications in the instrument and enable the designer to
make appropriate additions, corrections and deletions.
Construction of dummy tables is very useful way to visualize how data can be organized and
summarized. A dummy table contains all the elements of a real table except that the cells of
the table are empty.
Cross-tabulation tables are usually presented with cell frequencies converted into percentages
based on either row or column totals. If a dependent variable is cross- tabulated with an
independent variable, the percentages should be calculated so that they add up to 100% for
each category of the independent variable.
e) Evaluation of the draft instrument: The draft instrument is evaluated with the help of other
qualified persons; each question is examined for its relevance, appropriateness, clarity, and validity.
The logical and psychological order of the questions, their clarity and content, the length of the
instrument and other aspects of its structure should be considered.
f) Pre-testing: The revised draft must be pre-tested in order to identify the weaknesses of the
instrument and to make the required further revisions to rectify them.
SAMPLING TECHNIQUES (METHODS).
The objective of statistics is to make inferences about a population from information contained in
a sample. This same objective motivates the discussion of sampling techniques or methods. In
most cases the inference is done in form of an estimate of a population parameter, such as a
mean, total or proportion with a bound on the error of estimation.
The objective of sampling is to estimate population parameters such as the mean or the total from
information contained in a sample. The experimenter controls the quantity of information
contained in the sample by the number of sampling units he or she includes in the sample and by
the method used to select the sample data.
Each observation or item taken from the population contains a certain amount of information
about the population parameter or parameters of interest. Since information costs money, the
experimenter must determine how much information he/she should buy. Too little information
prevents the experimenter from making good estimates, while too much information results in
a waste of money. The quantity of information obtained in the sample depends on the items
sampled and on the amount of variation in the data. Sampling techniques (methods) were
established to obtain representative samples which produce valid and reliable information.
Sampling: is the selection of some units to represent the entire population from which the units
were drawn. A sample consists of one or more elements selected from a population.
Sampling unit: the element or set of elements considered for selection at some stage of
sampling.
Sampling frame: the actual list of sampling units from which the sample, or some stage of the
sample, is selected.
Sample design: a set of rules or procedures that specify how a sample is to be selected.
Sampling error: the degree of error to be expected for a given sample design, or the difference
between a sample statistic and the corresponding population parameter.
TYPES OF SAMPLING
Probability sampling
Non-probability sampling
Probability sampling,
These sampling techniques are used when resources (money and time) are limited, the objective
of the study is to make generalizations, and a greater degree of accuracy in the estimation of
population parameters is required. Each element or sampling unit in the population under study
has a known, non-zero chance of being included in the sample.
The classical formulation of a statistical estimation problem requires that randomness be built into
the sampling design so that properties of the estimators can be assessed probabilistically. With
proper randomness in the sampling, one can make statements such as "Our estimate is unbiased
and we are 95% confident that our estimate will be within 2 percentage points of the true proportion".
Sample designs that are based on planned randomness are called probability samples.
1) Simple Random Sampling.
Example 1: Auditors study simple random samples of accounts in order to check for compliance
with audit controls set up by the firm or to verify the actual dollar value of the accounts.
Example 2: Marketing research often involves a simple random sample of potential users of a product.
The researcher may want to estimate the proportion of potential buyers who prefer a certain color
of car or flavor of food.
2) Systematic Sampling.
This is a modification of simple random sampling which is ordinarily less time-consuming and
easier to implement. The estimated number of elements in the large population is divided by the
desired sample size, yielding a sampling interval k. The sample is then drawn by listing the population
elements in an arbitrary order and selecting every kth case, starting with a randomly selected
number between 1 and k.
Example 1: Suppose it is desired to select a sample of 20 students from a list of 300 students.
Divide the total population of 300 by 20; the quotient is 15. Then select a number at random
between 1 and 15 using the lottery method or a table of random numbers. Suppose the selected
number is 9; then the students numbered 9, 24, 39, …, 279, 294 are selected as the sample.
Example 2: Consider the problem of sampling rural health centers, where the sampling frame is a
list of rural health centers arranged alphabetically by health center name. If your desired sample
size is 285 rural health centers drawn from a universe of 2000 rural health centers, the sampling
interval is 7. You would then choose a randomly selected number between 1 and 7 as your start.
If your random number is 3, the first unit selected would be the 3rd rural health center listed in the
sampling frame, the second would be the 10th, and so forth until the sampling frame is exhausted.
Systematic sampling is useful when the units in your sampling frame are not numbered, when the
elements are not numbered serially or when the sampling frame consists of very long lists. Other
possible areas of application are sampling of students in a class, houses in a street, telephone
directory, customers of a bank, assembly line output in a factory or members of association.
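As a rough illustration of the mechanics described above, the sketch below computes systematic-sample indices in Python (the function name and defaults are ours, not from the notes):

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """Return 1-based indices of a systematic sample drawn from an
    ordered frame of `population_size` units."""
    rng = random.Random(seed)
    k = population_size // sample_size      # sampling interval
    start = rng.randint(1, k)               # random start between 1 and k
    return list(range(start, population_size + 1, k))

# With 300 students and a sample of 20 the interval is 15; a random
# start of 9 reproduces the example: 9, 24, 39, ..., 294.
indices = systematic_sample(300, 20)
print(len(indices), indices[:5])
```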
3) Stratified Sampling,
For the purpose of maximizing the amount of information for a given cost, the population under
study is sub-divided into homogeneous groups or strata, and from each stratum a random sample
is drawn. For example, university students may be divided on the basis of discipline, and each
discipline group may again be divided into juniors and seniors; the employees of a business
undertaking may be divided into managers and non-managers, and each of those two groups may
be sub-divided into salary-grade strata.
This sampling technique involves drawing a sample from each stratum in proportion to the stratum's
share of the total population. It gives proper representation to each stratum
and its statistical efficiency is generally higher. For example, suppose the final year MBA students of the
Management Faculty of a University consist of the following specialization groups:
(Table: specialization stream, number of students, proportion of each stream, and the resulting sample size.)
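A minimal sketch of proportional allocation, using hypothetical stream sizes (the numbers below are illustrative, not the lost table's):

```python
def proportional_allocation(strata_sizes, total_sample):
    """Allocate a total sample across strata in proportion to stratum size,
    using largest-remainder rounding so the parts sum to total_sample."""
    N = sum(strata_sizes.values())
    raw = {s: total_sample * n / N for s, n in strata_sizes.items()}
    alloc = {s: int(r) for s, r in raw.items()}
    left = total_sample - sum(alloc.values())
    for s in sorted(raw, key=lambda s: raw[s] - alloc[s], reverse=True)[:left]:
        alloc[s] += 1
    return alloc

streams = {"Marketing": 40, "Finance": 30, "HR": 20, "Operations": 10}
print(proportional_allocation(streams, 20))
# {'Marketing': 8, 'Finance': 6, 'HR': 4, 'Operations': 2}
```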
4) Cluster sampling
Where the population elements are scattered over a wide area and a list of population elements is
not readily available, the use of simple or stratified random sampling method would be too
expensive and time-consuming. In such cases cluster sampling is usually adopted.
Cluster sampling means random selection of sampling units consisting of population elements.
Each such sampling unit is a cluster of population elements. Then from each selected sampling
unit a sample of population elements is drawn by either simple random selection or stratified
random selection.
Cluster sampling is used when it is not possible to get an adequate sampling frame for the
individuals you wish to study, or when a simple random technique would result in a list of
individuals so dispersed that it would be too costly to visit each one. The disadvantage of a cluster
sample is that it increases sampling error and requires a larger sample size for reliable estimates
of population characteristics. If the cost of the larger sample size outweighs the savings from
clustering, clustering should not be used. The cluster may be an institution, a geographical area
or any other appropriate group, depending on the nature of the survey.
Example; suppose a researcher wants to select a random sample of 1000 households out of
40,000 estimated households in a city for a survey. A direct sample of individual households
would be difficult to select because a list of households does not exist and would be costly to
prepare. Instead he can select a random sample of a few blocks/wards. The number of blocks to
be selected depends upon the average number of estimated households per block.
This sampling is appropriate where the population is scattered over a wide geographical area and
no frame or list is available for sampling. It is also used when a survey has to be made within a
limited time and cost budget.
Sometimes, when populations are extremely complex, it is necessary to go beyond two stages in
cluster sampling. For example, in the absence of a list of households for your survey of AIDS widows,
one might have to begin with a random sample of villages, and on arrival at each village make a
list of households and draw a random selection of households to visit. When you arrive at a
household you would randomly select a woman to interview, or interview all eligible women.
Non-probability sampling
This technique does not adopt the theory of probability and does not give a representative sample
of the population. It is distinguished from probability sampling by the fact that subjective
judgments play a role in selecting the sampling elements.
The broad categories of non-probability sampling are convenience sampling, purposive (judgment)
sampling and quota sampling.
Convenience sampling. Example: this method is used for simple purposes such as testing ideas
or gaining a rough impression about a subject of interest.
Purposive (judgment) sampling. The chance that a particular case is selected for the sample
depends on the subjective judgment of the researcher. For example, a researcher may deliberately
choose industrial undertakings in which quality circles are believed to be functioning successfully
and undertakings in which quality circles are believed to be a total failure.
Application; The method is appropriate when what is important is the typicality and specific
relevance of the sampling units to the study and not their overall representativeness to the
population.
Quota sampling. Application: quota sampling is used in studies like marketing surveys, opinion
polls and readership surveys which do not aim at precision but at getting some crude results quickly.
ORGANIZATION OF DATA
Data organization is an intermediary stage of work between data collection and data analysis. The
completed instruments of data collection such as interview schedules, questionnaires and
observation schedules contain vast mass of data. They cannot straightaway provide answers to
research questions, they need to be classified and summarized in order to make them amenable
to analysis.
Data collected from published sources are generally in organized form compared to those that
come from primary sources. However, the large mass of figures collected from a survey frequently
needs organization. Data organization consists of a number of closely related operations:
editing, classification, coding and tabulation.
Editing;
Editing is the process of checking the returns from the survey to detect errors, omissions,
inconsistencies, irrelevant answers and wrong computations so that they may be corrected or
adjusted. Why editing?
During the stress of interviewing the interviewer cannot always record responses completely and
legibly. Therefore after each interview is over he should review the schedule to complete
abbreviated responses, rewrite illegible responses and correct omissions.
The returns (schedules or questionnaires) received from the respondents have to be scrutinized
patiently and carefully to detect errors caused by careless recording by the field workers, or
inconsistent or factually wrong information given by the respondents.
Classification
The edited data are arranged according to some characteristic possessed by the items
constituting the data, and coded. The responses are classified into meaningful categories so as
to bring out their essential pattern and reduce the many responses to a small number of
categories containing the critical information needed for analysis.
Objectives of Classification
To condense the mass of data: statistical data collected during the course of an
investigation are so varied that it is not possible to appreciate, even after careful study,
the real significance of the figures unless they are properly reduced to a small number of
groups or classes. For example, data collected during a population census can be classified
according to sex, age, marital status, education, occupation etc.
To enable grasp of data; The figures are easily arranged in a few classes or
categories so that the like go with the like.
To prepare the data for tabulation; Only classified data can be presented into
tabular form. Classification thus provides a basis for tabulation and further statistical
processing.
To study the relationships; Relationship between variables can be established
only after the various characteristics of the data have been known, which is possible only
through classification and tabulation.
To facilitate comparison: classification enables comparisons between variables.
For example, household data classified on the basis of age, religion, education,
income, expenditure etc. can be used for drawing comparisons.
Coding
Coding means assigning numerals or other symbols to the categories or responses. For each
question a coding scheme is designed on the basis of the concerned categories. The coding
schemes with their assigned symbols together with specific coding instructions may be assembled
in a book.
Example (coding sheet layout): Question No. | Variable/Observation | Response categories | Code
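A small hypothetical coding scheme along these lines, sketched in Python (the variables and categories are invented for illustration):

```python
# Hypothetical coding scheme: each response category maps to a numeral
coding_scheme = {
    "sex": {"Male": 1, "Female": 2},
    "education": {"Primary": 1, "Secondary": 2, "Diploma": 3, "Degree": 4},
}

def code_response(variable, response):
    """Translate a raw response into its assigned numeric code."""
    return coding_scheme[variable][response]

print(code_response("education", "Diploma"))  # -> 3
```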
Tabulation
Tabulation is the process of summarizing raw data and displaying them on compact statistical
tables for further analysis. It involves counting of the number of cases falling into each of several
categories. Tabulation can be done by hand or mechanical or electronic devices. The choice
depends upon the size and type of study, cost considerations, time pressure and availability of
tabulating machines or computers.
Objectives of tabulation:
To clarify the object of investigation.
To simplify complex data.
To depict trend.
To economize space.
To clarify the characteristics of data.
To facilitate comparison.
To detect errors and omissions in the data.
To facilitate statistical processing.
To help reference.
DATA PRESENTATION
After data collection, the next step is to present the data in some suitable form. The need for proper
presentation arises because statistical data in their raw form almost defy comprehension.
When data are presented in an easy-to-read form, the reader can acquire knowledge in a much
shorter period of time, and statistical analysis is also facilitated.
MEASURES OF CENTRAL TENDENCY
There are two main objectives of the study of measures of central tendency.
Arithmetic Mean
This is a central characteristic of the given mass of data which is obtained by adding all the
observations and dividing the total by the number of observations.
Steps:
i. Add together all the values of the variable X and obtain the total, i.e. Σxᵢ = x₁ + x₂ + x₃ + … + xₙ.
ii. Divide this total by the number of observations N:
AM = Σxᵢ / N, where the xᵢ are the observations.
Example: The table below presents the salaries of 12 employees of a certain company (in 000 Tshs).
32 32 45 80 66 66 54 45 32 71 111 14
Solution:
Σxᵢ = 32 + 32 + 45 + 80 + 66 + 66 + 54 + 45 + 32 + 71 + 111 + 14 = 648
Mean salary = Σxᵢ / N = 648 / 12 = 54
The mean salary of the 12 employees of that particular company is 54,000 Tshs.
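A minimal check of this computation in Python:

```python
salaries = [32, 32, 45, 80, 66, 66, 54, 45, 32, 71, 111, 14]  # in 000 Tshs
mean = sum(salaries) / len(salaries)
print(mean)  # 54.0, i.e. a mean salary of 54,000 Tshs
```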
Steps (grouped data):
i. Find the mid-point xᵢ of each class.
ii. Multiply each mid-point by the frequency fᵢ of its class.
iii. Add the products together to obtain Σfᵢxᵢ.
iv. Divide the total obtained in part (iii) above by the total number of observations N = Σfᵢ:
Arithmetic Mean = Σfᵢxᵢ / Σfᵢ = (f₁x₁ + f₂x₂ + … + fₙxₙ) / N
Example: In a certain industry the number of employees (in thousands) in 1970 was grouped by
age as follows:
Age 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64
No. of employee 66 65 56 50 42 37 35 30 24 22
Solution:
Age | No. of employees (f) | Mid-point (x) | fx
15-19 | 66 | 17 | 1122
20-24 | 65 | 22 | 1430
25-29 | 56 | 27 | 1512
30-34 | 50 | 32 | 1600
35-39 | 42 | 37 | 1554
40-44 | 37 | 42 | 1554
45-49 | 35 | 47 | 1645
50-54 | 30 | 52 | 1560
55-59 | 24 | 57 | 1368
60-64 | 22 | 62 | 1364
Total | 427 | | 14709

Arithmetic Mean = Σfx / Σf = 14709 / 427 = 34.447 ≈ 34.45
The mean age of the 427 employees in the particular industry is approximately 34.45 years.
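The same grouped-mean computation, sketched in Python:

```python
# (mid-point, frequency) pairs from the age table above
ages = [(17, 66), (22, 65), (27, 56), (32, 50), (37, 42),
        (42, 37), (47, 35), (52, 30), (57, 24), (62, 22)]

total_fx = sum(x * f for x, f in ages)   # 14709
total_f = sum(f for _, f in ages)        # 427
print(round(total_fx / total_f, 2))      # 34.45 years
```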
Merits and limitation of Arithmetic mean
Merits
It is affected by the value of every item in the series and is therefore considered
to be more representative of the distribution.
It is a widely used method because of its mathematical properties, since it lends itself
easily to further mathematical treatment such as the computation of the standard deviation,
coefficient of skewness, etc.
It is a calculated value and not based on position in the series.
It is the simplest average to understand and the easiest to compute. Neither
arraying the data, as required for calculating the median, nor grouping the data, as required for
calculating the mode, is needed when calculating the mean.
Limitations
i. The value of the mean can be affected by extreme items, i.e. very small and very large items
compared to the other items.
ii. It cannot be determined by inspection, nor can it be located graphically.
iii. It cannot be used in the study of qualitative phenomena which are not capable of
numerical measurement, e.g. intelligence, beauty, honesty, etc.
iv. In a distribution with open-end classes, the value of the mean cannot be computed without
making an assumption regarding the size of the class interval of the open-end classes.
MEDIAN
Median is the measure of central tendency which appears in the middle of an ordered sequence
of values that is half of the observations in a set of data are lower than it and half of the
observations are greater than it.
Note: while the arithmetic mean is calculated from the values of every observation in a series, the
median is a positional average and its value is such that an equal number of observations lie on
either side of it. The median is the central value of the distribution, the value that divides the
distribution into two equal parts.
Computation of median value involves two basic steps; location of the middle value/item and
finding out its value.
Example 1: Determine the median value of the following items: 25 20 15 45 18 7 10 38 12.
Solution;
Arrange the observations in ascending order; that is 7 10 12 15 18 20 25 38 45.
Median = the (n + 1)/2 th item = (9 + 1)/2 = 5th item, which is 18.
Example 2; Determine the median value of the following items
7 10 12 15 18 20 25 38 45 64.
Median = the (n + 1)/2 th item = (10 + 1)/2 = 5.5th item. The median lies between the 5th and 6th
items, and is the average of 18 and 20 = (18 + 20)/2 = 19.
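Both cases can be checked with a short Python sketch:

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    # odd n: the middle item; even n: average of the two middle items
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

print(median([25, 20, 15, 45, 18, 7, 10, 38, 12]))       # 18
print(median([7, 10, 12, 15, 18, 20, 25, 38, 45, 64]))   # 19.0
```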
Calculation of median for discrete series
Steps
Arrange the data in ascending or descending order of magnitude.
Find out the cumulative frequencies.
Apply the formula: Median = size of the (N + 1)/2 th item.
Look at the cumulative frequency column and find the total which is either equal to
(N + 1)/2 or the next higher to it, and determine the value of the variable corresponding to
it. This gives the value of the median.
Example: Locate the median of the following income data.
Income (Tshs): 1200 1800 5000 2500 3000 1600 3500
No. of persons: 12 16 2 10 3 15 7
Solution (arranged in ascending order of income):
Income (Tshs) | f | Cumulative frequency
1200 | 12 | 12
1600 | 15 | 27
1800 | 16 | 43
2500 | 10 | 53
3000 | 3 | 56
3500 | 7 | 63
5000 | 2 | 65

Median = size of the (65 + 1)/2 = 33rd item, which falls within the cumulative frequency 43; the
median income is therefore 1800 Tshs.
Calculation of median for continuous series
Median = L + ((N/2 − c.f.)/f) × i
where:
L is the lower limit (real lower boundary) of the median class, i.e. the class in which the middle
item of the distribution lies;
c.f. is the cumulative frequency of the class preceding the median class, or simply the sum
of the frequencies of all classes lower than the median class;
f is the frequency of the median class;
i is the class size.
Example: Calculate the median of the following distribution of the weights of 200 apples.
Weight (gm): 410-419 420-429 430-439 440-449 450-459 460-469 470-479
No. of apples: 14 20 42 54 45 18 7
Solution:
Since we are given inclusive class intervals, we convert them to exclusive ones by deducting
0.5 from the lower limits and adding 0.5 to the upper limits.
Class boundaries | f | Cumulative frequency
409.5-419.5 | 14 | 14
419.5-429.5 | 20 | 34
429.5-439.5 | 42 | 76
439.5-449.5 | 54 | 130
449.5-459.5 | 45 | 175
459.5-469.5 | 18 | 193
469.5-479.5 | 7 | 200
Median item = N/2 = 200/2 = 100th item.
The median lies in the class 439.5-449.5, therefore L = 439.5, i = 449.5 − 439.5 = 10, c.f. = 76,
f = 54 and N = 200.
Median = 439.5 + ((100 − 76)/54) × 10 = 443.94
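A sketch of the same interpolation in Python (the function name is ours):

```python
def grouped_median(classes):
    """classes: (lower_boundary, upper_boundary, frequency) triples in order.
    Applies Median = L + (N/2 - cf)/f * i inside the median class."""
    N = sum(f for _, _, f in classes)
    cf = 0
    for lower, upper, f in classes:
        if cf + f >= N / 2:
            return lower + (N / 2 - cf) / f * (upper - lower)
        cf += f

apples = [(409.5, 419.5, 14), (419.5, 429.5, 20), (429.5, 439.5, 42),
          (439.5, 449.5, 54), (449.5, 459.5, 45), (459.5, 469.5, 18),
          (469.5, 479.5, 7)]
print(round(grouped_median(apples), 2))  # 443.94
```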
Merits and limitations of median
Merits
Limitations of median
i. Since it is a positional average, its value is not determined by each and every observation.
ii. For calculating median it is necessary to arrange the data; other averages do not need any
arrangement.
iii. The median, in some cases, cannot be computed exactly as the mean. When the number
of items included in the series of data is even, the median is determined approximately as
the mid-point of the two middle items.
MODE
The mode is the measure of central tendency given by the value which appears most frequently in
the given set of observations; it is the value of the distribution which occurs the largest number of
times.
In a discrete series the mode can quite often be determined just by inspection, i.e. by looking for
the value of the variable around which the items are most heavily concentrated.
Example; A farmer records the number of eggs he collects from his hens over a 150 day period.
The frequency distribution of the eggs collected is below.
Number of eggs 15 16 17 18 19 20 21 22 23 24 25
Frequency 1 2 4 6 9 13 18 22 35 30 10
Solution;
From the above data we can clearly say that the modal number of eggs collected per day is 23
because the value 23 has occurred the maximum number of times.
For continuous data the value of the mode is calculated using the formula below:
Mode = L + (Δ₁/(Δ₁ + Δ₂)) × i
where:
L is the lower limit (real lower boundary) of the modal class;
Δ₁ is the difference between the frequency of the modal class and the frequency of the
pre-modal class;
Δ₂ is the difference between the frequency of the modal class and the frequency of the
post-modal class;
i is the class size.
Example: Find the value of the mode from the data given below.
Weight (lbs) 93-97 98-102 103-107 108-112 113-117 118-122 123-127 128-132
No. of students 2 5 12 17 14 6 3 1
Solution:
By inspection the modal class is 108-112, and the real limits of this class are 107.5-112.5.
Δ₁ = 17 − 12 = 5, Δ₂ = 17 − 14 = 3
Mode = 107.5 + (5/(5 + 3)) × 5 = 107.5 + 25/8 = 110.625
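A Python sketch of the grouped-mode formula (a rough illustration; it assumes a single modal class):

```python
def grouped_mode(classes):
    """classes: (lower_boundary, upper_boundary, frequency) triples in order.
    Applies Mode = L + d1/(d1 + d2) * i around the modal class."""
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))                        # modal class position
    d1 = freqs[m] - (freqs[m - 1] if m > 0 else 0)
    d2 = freqs[m] - (freqs[m + 1] if m + 1 < len(freqs) else 0)
    lower, upper, _ = classes[m]
    return lower + d1 / (d1 + d2) * (upper - lower)

weights = [(92.5, 97.5, 2), (97.5, 102.5, 5), (102.5, 107.5, 12),
           (107.5, 112.5, 17), (112.5, 117.5, 14), (117.5, 122.5, 6),
           (122.5, 127.5, 3), (127.5, 132.5, 1)]
print(grouped_mode(weights))  # 110.625
```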
Merits of mode
Limitation of mode
i. The value of mode is not based on each and every item in the series.
ii. It is not capable of further mathematical treatment.
iii. Mode may be unrepresentative in many cases.
iv. The value of mode cannot always be determined. In some cases we may have a bimodal
series.
NOTE: Arithmetic mean, median and mode each of these statistical measures describe the
typical characteristic or tendency of a group in a slightly different way.
MEASURE OF DISPERSION
The measures of central tendency give us one single value that represents the entire data. But
the average alone cannot adequately describe a set of observations, unless all observations are
the same. It is necessary to study variability or dispersion of the observations. In two or more
distributions the central value may be the same but still there can be wide disparities in the
formation of distribution. While an average discovers the representative values, dispersion finds
out how individual values fall apart, on an average, from the representative value. The average
was derived from the actual values but dispersion is known by averaging the deviation in
individual values from some representative value and therefore called an average of the second
order.
Definitions
Dispersion is the degree of the scatter or variation of the variable about the central
value.
Dispersion is the degree to which numerical data tends to spread about an average
value.
Objectives of measures of dispersion
To serve as a basis for the control of variability: when dispersion is small, the
average is typical in the sense that it closely represents the individual values, and it is
reliable in the sense that it is a good estimate of the average in the corresponding
universe; when dispersion is large, the opposite is true.
To judge the reliability of the measures of central tendency: Measures of dispersion
are the only means to test the representative character of an average.
To facilitate comparison: measures of dispersion enable a comparison to be made
of two or more series with regard to their variability. A high degree of variation means
little uniformity or consistency, whereas a low degree of variation means great
uniformity or consistency.
To facilitate the use of other statistical measure. Many powerful analytical tools in statistics
such as correlation analysis, regression analysis, the testing of hypothesis, ANOVA,
statistical quality control are based on the measures of variation of one kind or another.
THE RANGE
The range is the simplest measure of dispersion. It is defined as the difference between the
value of the largest item and the value of the smallest item.
Mathematically:
Range = L − S, where L is the largest value and S is the smallest value.
Limitation of range
Range is not based on each and every item in the distribution.
It is subject to fluctuations of considerable magnitude from sample to sample.
Range cannot tell us about the character of the distribution within the two extreme
observations
Uses of range
Quality control
Fluctuation in the share prices.
Weather forecast.
Interquartile Range: a measure of variation for a set of data, defined as the difference between
the first and third quartiles. The quartiles are closely related to the median, which divides the
distribution into two equal parts.
Example: Find the interquartile range of the following data: 4 9 12 10 15 16 30 4 7 2.
Solution: IQR = Q₃ − Q₁.
For grouped data the quartiles are found by interpolation. Consider the following frequency
distribution:
Class | f | Cumulative frequency
10-19 | 3 | 3
20-29 | 8 | 11
30-39 | 15 | 26
40-49 | 5 | 31
50-59 | 5 | 36

Q₁ = L(Q₁) + ((n/4 − c.f.)/f) × i = 19.5 + ((9 − 3)/8) × 10 = 27

Q₃ = L(Q₃) + ((3n/4 − c.f.)/f) × i = 39.5 + ((27 − 26)/5) × 10 = 41.5

Interquartile range: IQR = Q₃ − Q₁ = 41.5 − 27 = 14.5
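A Python sketch of grouped-quartile interpolation along the same lines (the function name is ours):

```python
def grouped_quantile(classes, q):
    """q-th quantile (0 < q < 1) of grouped data by linear interpolation;
    classes: (lower_boundary, upper_boundary, frequency) triples in order."""
    n = sum(f for _, _, f in classes)
    target, cf = q * n, 0
    for lower, upper, f in classes:
        if cf + f >= target:
            return lower + (target - cf) / f * (upper - lower)
        cf += f

data = [(9.5, 19.5, 3), (19.5, 29.5, 8), (29.5, 39.5, 15),
        (39.5, 49.5, 5), (49.5, 59.5, 5)]
q1 = grouped_quantile(data, 0.25)   # 27.0
q3 = grouped_quantile(data, 0.75)   # 41.5
print(q3 - q1)                      # 14.5
```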
Advantages
1. It is not affected by extreme values.
2. It accounts for the spread of the middle values of the data, which is the most significant part of the data.
Disadvantages
Quartile Deviation
This is a measure of the variation of a set of data, defined to be half the distance between the first
and the third quartiles.
Uses: it is frequently used for skewed distributions.
QD = (Q₃ − Q₁)/2
Where
Q1=the first (lower) quartile.
Q3=the third (upper) quartile.
Advantages:
It is easy to understand
It is a better measure of dispersion than standard deviation for badly skewed
distribution
Unlike the range it is not affected by the extreme values.
Disadvantages:
MEAN DEVIATION
Uses: It is frequently used in quality control, statistical forecasting and in financial analysis.
M.D. = Σf|d| / N
Steps:
i. Calculate the average of the series.
ii. Take the deviations of the items from the mean, ignoring signs, and denote them |d|.
iii. Divide the total of these deviations by the number of observations. This gives the
value of the mean deviation.
Example: Calculate the mean deviation about the mean for the following data.
X: 5 15 25 35 45 55 65
F: 8 12 10 8 3 2 7
Solution (assumed mean A = 35):
x | f | d = x − 35 | fd | |d| = |x − 29| | f|d|
5 | 8 | −30 | −240 | 24 | 192
15 | 12 | −20 | −240 | 14 | 168
25 | 10 | −10 | −100 | 4 | 40
35 | 8 | 0 | 0 | 6 | 48
45 | 3 | +10 | +30 | 16 | 48
55 | 2 | +20 | +40 | 26 | 52
65 | 7 | +30 | +210 | 36 | 252
Total | N = 50 | | Σfd = −300 | | Σf|d| = 800

x̄ = A + Σfd/N = 35 + (−300)/50 = 35 − 6 = 29

M.D. = Σf|d| / N = 800/50 = 16
For grouped data the same formula applies: M.D. = Σf|d| / N.
Example: Calculate the mean deviation from the mean for the following data; also calculate the
coefficient of mean deviation.
Marks: 0-10 10-20 20-30 30-40 40-50 50-60 60-70
No. of students: 6 5 8 15 7 6 3
Solution (assumed mean A = 35, class size i = 10):
Marks | Mid-point (x) | f | u = (x − 35)/10 | fu | |d| = |x − 33.4| | f|d|
0-10 | 5 | 6 | −3 | −18 | 28.4 | 170.4
10-20 | 15 | 5 | −2 | −10 | 18.4 | 92.0
20-30 | 25 | 8 | −1 | −8 | 8.4 | 67.2
30-40 | 35 | 15 | 0 | 0 | 1.6 | 24.0
40-50 | 45 | 7 | +1 | +7 | 11.6 | 81.2
50-60 | 55 | 6 | +2 | +12 | 21.6 | 129.6
60-70 | 65 | 3 | +3 | +9 | 31.6 | 94.8
Total | | N = 50 | | Σfu = −8 | | Σf|d| = 659.2

x̄ = A + (Σfu/N) × i = 35 + (−8/50) × 10 = 33.4

M.D. = Σf|d| / N = 659.2/50 = 13.184

Coefficient of mean deviation = M.D./x̄ = 13.184/33.4 ≈ 0.395
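A quick Python check of the grouped mean deviation (computing the mean directly rather than via the assumed mean):

```python
# (mid-point, frequency) pairs from the marks table above
rows = [(5, 6), (15, 5), (25, 8), (35, 15), (45, 7), (55, 6), (65, 3)]

N = sum(f for _, f in rows)
mean = sum(x * f for x, f in rows) / N               # 33.4
md = sum(f * abs(x - mean) for x, f in rows) / N     # 13.184
print(mean, md, round(md / mean, 3))                 # 33.4 13.184 0.395
```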
THE STANDARD DEVIATION.
Ungrouped data.
The procedure for computing the standard deviation from ungrouped data is given below:
I. Calculate the mean x̄ of the series.
II. Take the deviation of each observation from the mean, (x − x̄).
III. Square the deviations.
IV. Add the squared deviations to obtain Σ(x − x̄)².
V. Find the variance (σ²) by dividing the sum obtained in (IV) above by the number of
observations:
Var = Σ(x − x̄)² / n
VI. Extract the square root of the variance to find the standard deviation:
S.D. = √(Σ(x − x̄)² / n) ………… (i)
Note that this method is convenient mainly when the mean is a whole number. The standard
deviation can also be computed directly, without first finding deviations from the mean, or by using
an assumed mean; see the formulas below.
Direct method:
S.D. = √(Σx²/n − (Σx/n)²)
Assumed mean method (d = x − A):
S.D. = √(Σd²/n − (Σd/n)²)
Example: Find the standard deviation of the following values.
8 9 15 23 5 11 19 8 10 12
Solution:
x̄ = Σx/n = 120/10 = 12
Values (x) | Deviation d = x − x̄ | d² = (x − x̄)²
8 | 8 − 12 = −4 | 16
9 | 9 − 12 = −3 | 9
15 | 15 − 12 = +3 | 9
23 | 23 − 12 = +11 | 121
5 | 5 − 12 = −7 | 49
11 | 11 − 12 = −1 | 1
19 | 19 − 12 = +7 | 49
8 | 8 − 12 = −4 | 16
10 | 10 − 12 = −2 | 4
12 | 12 − 12 = 0 | 0
Σx = 120 | | Σ(x − x̄)² = 274

S.D. = √(Σ(x − x̄)²/n) = √(274/10) ≈ 5.2345
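The same computation as a Python sketch:

```python
values = [8, 9, 15, 23, 5, 11, 19, 8, 10, 12]

n = len(values)
mean = sum(values) / n                               # 12.0
variance = sum((x - mean) ** 2 for x in values) / n  # 274/10 = 27.4
print(round(variance ** 0.5, 4))                     # 5.2345
```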
Grouped data.
1. Direct method:
S.D. = √(Σfx²/n − (Σfx/n)²) ………… (1)
2. Assumed mean method (d = x − A):
S.D. = √(Σfd²/n − (Σfd/n)²) ………… (2)
Then substitute the values into the formula to obtain the value of the standard deviation.
3. Step-deviation method:
S.D. = i × √(Σfu²/n − (Σfu/n)²) ………… (3)
where u is the deviation from the assumed mean in class-interval units, i.e. u = (X − A)/i,
'i' is the class size and X is the mid-point of the class interval.
Then substitute the values into the formula to obtain the value of the standard deviation.
Example: Find the standard deviation of the following data.
Class interval: 0-10 10-20 20-30 30-40 40-50 50-60 60-70
Frequency: 6 14 10 8 1 3 8
Solution (step-deviation method, assumed mean A = 35, class size i = 10):
Class interval | Mid-point (X) | f | u = (X − 35)/10 | fu | u² | fu²
0-10 | 5 | 6 | −3 | −18 | 9 | 54
10-20 | 15 | 14 | −2 | −28 | 4 | 56
20-30 | 25 | 10 | −1 | −10 | 1 | 10
30-40 | 35 | 8 | 0 | 0 | 0 | 0
40-50 | 45 | 1 | +1 | +1 | 1 | 1
50-60 | 55 | 3 | +2 | +6 | 4 | 12
60-70 | 65 | 8 | +3 | +24 | 9 | 72
Total | | n = 50 | | Σfu = −25 | | Σfu² = 205

S.D. = i × √(Σfu²/n − (Σfu/n)²) = 10 × √(205/50 − (−25/50)²) = 10 × √3.85 ≈ 19.62
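A Python sketch of the step-deviation computation above:

```python
# (mid-point, frequency) pairs; assumed mean A = 35, class size i = 10
rows = [(5, 6), (15, 14), (25, 10), (35, 8), (45, 1), (55, 3), (65, 8)]
A, i = 35, 10

n = sum(f for _, f in rows)
su = sum(f * (x - A) // i for x, f in rows)            # Σfu  = -25
su2 = sum(f * ((x - A) // i) ** 2 for x, f in rows)    # Σfu² = 205
sd = i * (su2 / n - (su / n) ** 2) ** 0.5
print(round(sd, 2))  # 19.62
```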
Merits
ii. It gives more weight to extreme items and less to those which are near the mean.
SKEWNESS
Measures of central tendency give us an estimate of the representative value of a series,
measures of dispersion give an indication of the extent to which the items cluster around or
scatter away from the central value, and skewness is a measure that refers to the extent of
symmetry or asymmetry in a distribution. Skewness describes the shape of a distribution.
Definitions
Skewness is lack of symmetry. When a frequency distribution with skewness is plotted on a chart,
the items tend to be dispersed more on one side of the mean than on the other.
A distribution is said to be skewed when the mean and median fall at different
points in the distribution, and the balance (centre of gravity) is shifted to one side or the other,
to the left or right.
Measures of skewness tell us the direction and the extent of skewness. In a
symmetrical distribution the mean, median and mode are identical; the more the mean
moves away from the mode, the larger the asymmetry, or skewness.
Objectives of skewness
ii. It helps in knowing whether the distribution is normal. Many statistical measures, such as the
standard error of the mean, are based on the assumption of a normal distribution.
iii. The empirical relations of mean, median and mode are based on a moderately skewed
distribution. The measure of skewness will reveal to what extent such empirical
relationship holds good.
Skewness can be positive or negative. Positively skewed distributions are common in practice,
e.g. the number of children per family, the age at which women marry and the distribution of
wages in a firm.
NOTE:
The median generally lies between the mode and the mean, and the following relation is
approximately satisfied: Mean − Mode ≈ 3(Mean − Median)
MEASURES OF SKEWNESS
Skewness = (Mean − Mode) / Standard deviation
Generally skewness can take any value between −3 and +3. Using the approximation
Mean − Mode ≈ 3(Mean − Median), the skewness becomes:
Skewness = 3(Mean − Median) / Standard deviation
COEFFICIENT OF VARIATION
The relative measure of dispersion based on the standard deviation is:
Coefficient of standard deviation = S.D. / x̄
100 times the coefficient of dispersion based on the standard deviation is called the coefficient of
variation, abbreviated C.V.:
C.V. = (S.D. / x̄) × 100
The coefficient of variation is equal to the standard deviation divided by the arithmetic mean, multiplied by 100.
Coefficient of variation is widely used for comparing the variability, homogeneity, stability,
uniformity and consistency of two series.
The series having the greater C.V. is said to be more variable than the other.
The series having the smaller C.V. is said to be less variable, or more consistent, more
uniform, more stable or more homogeneous, than the other.
Example: Two workers on the same job show the following results over a long period of time.
| Worker A | Worker B
Mean time of completing the job (minutes) | 30 | 25
Standard deviation (minutes) | 6 | 4
a) Which worker appears to be more consistent in the time he requires to complete the job?
b) Which worker appears to be faster in completing the job?
For worker A:
C.V. = (S.D./x̄) × 100 = (6/30) × 100 = 20%
For worker B:
C.V. = (S.D./x̄) × 100 = (4/25) × 100 = 16%
Since the C.V. of worker B is less than that of worker A, worker B appears to be more consistent
in the time he requires to complete the job. Worker B also appears to be faster, since his mean
completion time (25 minutes) is less than worker A's (30 minutes).
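A one-function Python check:

```python
def coefficient_of_variation(mean, sd):
    """C.V. = (S.D. / mean) * 100, a unit-free measure of relative spread."""
    return sd / mean * 100

print(coefficient_of_variation(30, 6))  # 20.0  (worker A)
print(coefficient_of_variation(25, 4))  # 16.0  (worker B)
```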
To develop a model of a phenomenon (e.g. the law of demand), statisticians make heavy use of a
statistical technique known as regression analysis. The purpose of this chapter is to introduce the
basics of regression analysis in terms of the simplest possible linear regression model, namely
the two-variable model.
Regression analysis is concerned with the study of the relationship between one variable called
the explained or dependent variable and one or more other variables called independent or
explanatory variables.
Example; studying the relationship between the quantity demanded of a commodity in terms of
the price of that commodity, income of the consumer, and prices of other commodities competing
with this commodity.
We may also be interested in finding out how the sales of a product (e.g. automobiles) are related
to the advertising expenditure incurred on that product.
To estimate the mean or average value of the dependent variable given the values of the
independent variables.
To test hypotheses about the nature of the dependence, hypotheses suggested by the
underlying economic theory. For example, one may want to test the hypothesis that a
demand curve has unitary price elasticity, i.e. that if the price of the commodity goes up
by 1% the quantity demanded on average goes down by 1%, assuming all other factors
affecting demand are held constant.
To predict or forecast the mean value of the dependent variable, given the values of the
independent variable(s).
METHOD OF LEAST SQUARES
This is a technique for fitting the best straight line to a sample of XY observations. It estimates
the population regression function by choosing b0 and b1, the estimators of B0 and B1, so that the
residual sum of squares Σeᵢ² is as small as possible. The least squares principle states
that "the sample regression function should be fitted in such a way that the sum of the squared
distances between the actual Y and the Y obtained from the SRF is the smallest one".
For the line Y = b0 + b1X, the normal equations are:
Σy = n·b0 + b1·Σx ………………… (i)
Σxy = b0·Σx + b1·Σx² ……………… (ii)
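A minimal least-squares sketch in Python, applied to the income/expenditure data of Example 1 below (the function name is ours; the printed coefficients are approximate):

```python
def least_squares(x, y):
    """Fit y = b0 + b1*x by solving the normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = (sy - b1 * sx) / n
    return b0, b1

income = [20, 30, 33, 40, 15, 13, 26, 38, 35, 43]   # X
expenditure = [7, 9, 8, 11, 5, 4, 8, 10, 9, 10]     # Y
b0, b1 = least_squares(income, expenditure)
print(round(b0, 2), round(b1, 2))  # roughly 2.17 and 0.20
```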
Example 1;
A random sample of ten families had the following income and food expenditure (000Tsh per
week).
Families A B C D E F G H I J
income 20 30 33 40 15 13 26 38 35 43
expenditure 7 9 8 11 5 4 8 10 9 10
Example 2;
The following table gives the quantities of cotton (in tons) bought in each year from 1961-1970
and the corresponding prices (0000Tsh).
quantity 770 785 790 795 800 805 810 820 840 850
price 18 16 15 15 12 10 10 7 9 6
After the estimation of the parameters and the determination of the least squares regression line,
we need to know how good the fit of this line is to the sample observations of Y and X; that is, we
need to measure the dispersion of the observations around the regression line. This knowledge is
essential because the closer the observations lie to the line, the better the goodness of fit, i.e. the
better the explanation of the variations of Y by the changes in the explanatory variables. The
measure of goodness of fit is the square of the correlation coefficient, r², which shows the
percentage of the total variation of the dependent variable that can be explained by the
independent variable X.
By fitting the estimated regression line (Y) we try to obtain the explanation of the variations of the
dependent variable Y produced by the changes of the explanatory variable X. However the fact
that the observations deviate from the estimated line shows that the regression line explains only
a part of the total variation of the dependent variable.
R² = 1 − Σe²/Σy², where e denotes the residuals and y the deviations of Y from its mean.
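Reusing the least_squares sketch above, R² can be checked as one minus the ratio of the residual sum of squares to the total sum of squares:

```python
x = [20, 30, 33, 40, 15, 13, 26, 38, 35, 43]
y = [7, 9, 8, 11, 5, 4, 8, 10, 9, 10]

b0, b1 = least_squares(x, y)            # from the earlier sketch
y_bar = sum(y) / len(y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
sst = sum((yi - y_bar) ** 2 for yi in y)                       # total SS
print(round(1 - sse / sst, 3))          # roughly 0.904
```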
Correlation Analysis
There are various methods of measuring the relationships existing between economic variables.
The simplest are correlation and regression analysis. It should be noted that correlation analysis
has serious limitations and throws little light on the nature of the relationship existing between
variables; it will make us familiar with the correlation coefficient which is an essential statistic of
regression analysis.
Correlation is defined as the degree of relationship existing between two or more variables.
Correlation may be linear when all points on scatter diagram seem to cluster near a straight line or
nonlinear when all points seem to lie near a curve.
Positive correlation;
Two variables are said to be positively correlated if they tend to change together in the same
direction, that is, if they tend to increase or decrease in the same direction.
Example; positive correlation is postulated by economic theory for the quantity of a commodity
supplied and its price. When the price increases the quantity supplied increases, and conversely.
Negative correlation;
Two variables are said to be negatively correlated if they tend to change in opposite directions,
that is, if one variable increases the other decreases.
Example: in the theory of demand, the quantity of a commodity demanded and its price are
negatively related; when the price increases the demand for the commodity decreases, and when
the price decreases the demand rises.
Zero correlation;
Two variables are uncorrelated when they tend to change with no connection to each other. The
scatter diagram will show dispersion of points all over the surface of the xy plane.
Example; the height of the inhabitants of a country and the production of steel, or between the
weight of students and the color of their hair.
+ve sign indicates that the variables are moving in the same direction, that is when one variable is
moving the other will also be moving in the same direction with the same or almost the same
magnitude. Example, income and quantity demanded
-ve sign indicates that the variables are moving in opposite direction, that is when one variable
goes up the other will be going down by the same or almost the same magnitude. Example, price
and quantity demanded.
Economic theory and business studies show relationship between variables like price and
quantity demanded, advertising expenditure and sales. The correlation analysis helps to
derive precisely the degree and direction of such relationship.
The usefulness of this technique is that the average relationship in a series can be
summed up in a single value called the coefficient of correlation.
Definition:
Correlation analysis refers to techniques used in measuring the closeness of the
relationship between variables.
Correlation analysis is a statistical device which helps us in studying the covariation of two or
more variables.
Graphical method
The scatter diagram is a simple and attractive method of diagrammatic representation of a
bivariate distribution for ascertaining the nature of the correlation between the variables.
The coefficient of correlation can be computed as:
r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² × Σ(y − ȳ)²] ………… (1)
r = (nΣxy − Σx·Σy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)] ………… (2)
If r = +1, there is a perfect positive relationship (perfect positive correlation) between the two
variables.
If r = −1, there is a perfect negative relationship (perfect negative correlation) between the two
variables.
If r = 0, there is no relationship between the two variables (absence of correlation).
If r is close to +1, there is a strong positive relationship (strong positive correlation).
If r is close to −1, there is a strong negative relationship (strong negative correlation).
If r is close to zero, there is a weak or insignificant relationship (weak correlation).
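Formula (2) translated into a Python sketch, again using the income/expenditure data (the printed values are approximate):

```python
def pearson_r(x, y):
    """Coefficient of correlation via formula (2)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
    return num / den

income = [20, 30, 33, 40, 15, 13, 26, 38, 35, 43]
expenditure = [7, 9, 8, 11, 5, 4, 8, 10, 9, 10]
r = pearson_r(income, expenditure)
print(round(r, 3), round(r ** 2, 3))  # roughly 0.951 and 0.904
```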
Correlation and Causation
Correlation analysis helps us to determine the degree of relationship between two or more
variables; it does not tell us anything about cause-and-effect relationships.
In simple linear regression the coefficient of determination (R²) is equal to r², where r is the
coefficient of correlation. If the value of r = 0.8, R² will be 0.64, which means that 64 per cent of the
variation in the dependent variable has been explained by the independent variable.
The coefficient of determination (R²) can also be defined as the ratio of the explained variation to
the total variation:
R² = Explained Variation / Total Variation
If R² = 0, X does not explain Y at all.
Note that high values of R² imply a good fit, while low values of R² imply a poor fit.
REGRESSION ANALYSIS
Regression is a statistical technique used to study the relationship between measurable variables.
Regression analysis attempts to establish the nature of the relationship between variables and
thereby provides a mechanism for prediction or forecasting. Using regression analysis we are in
a position to estimate or predict the unknown values of one variable from known values of
another variable.
The variable which is used to predict the variable of interest is called the independent or
explanatory variable, and the variable we are trying to predict is called the dependent or
explained variable. The independent (explanatory) variable is denoted by X and the dependent
(explained) variable by Y. The analysis used is called simple linear regression analysis: simple
because there is only one predictor (independent) variable, and linear because of the assumed
linear relationship between the dependent variable (Y) and the independent variable (X). Linear
means the equation is a straight line of the form Y = a + bX.
In simple linear regression we develop the relationship of one dependent variable against
one independent variable. The relationship can be expressed mathematically as
Yi = α + βXi + εi, where
α is the intercept (its value is the point at which the regression line crosses the Y-axis),
β is the slope of the relationship (it represents the change in Y for a unit change in the X
variable),
X is the independent variable, and
ε is a random error.
Random error: an unobservable quantity introduced into the model to account for the failure of the
observed values to fall exactly on a single straight line.
Properties of the error term (ε): the εi are distributed normally with zero mean and known
variance σ²; mathematically, εi ~ N(0, σ²).
Assumptions of Regression
There is a linear relationship between the independent variable and the dependent variable.
The Xi's are non-random variables observed with negligible error.
The εi's are random variables with zero mean and constant variance: εi ~ N(0, σ²).
The errors are uncorrelated: E(εi εj) = 0 for i ≠ j.
Since we cannot fit the line exactly, we use the method of least squares for estimating the
parameters α and β, to give the estimates "a" and "b" respectively:
Ŷ = a + bXi ……………………………………………………………(*)
Equation (*) above is the equation of the line which provides estimates of the dependent
variable when values of the independent variable are inserted into the equation.
We do not have εi in the fitted model because the fitted line gives the estimated mean of Y
for a given X, E(Y|X), exactly.
The least squares estimators are those values "a" of α and "b" of β that minimize a
quantity known as the residual sum of squares, Σεi², with respect to "a" and "b", where
εi = Yi − (a + bXi). Since "a" and "b" are constants used to describe the average relationship
that exists between the two variables, they can be obtained by the method of least squares as:
a = (1/n)(Σyi − bΣxi) = ȳ − b x̄

b = [Σxiyi − n x̄ ȳ] / [Σxi² − n x̄²] = [nΣxiyi − (Σxi)(Σyi)] / [nΣxi² − (Σxi)²]
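These two estimators can be computed directly. A minimal sketch in Python, with hypothetical
(x, y) data used only to illustrate the steps:

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# b = [nΣxy − ΣxΣy] / [nΣx² − (Σx)²]
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# a = ȳ − b·x̄
a = sum_y / n - b * sum_x / n

print(f"fitted line: Y = {a:.2f} + {b:.2f}X")  # Y = 0.15 + 1.95X here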
Correlation and regression analysis are constructed under different assumptions; they furnish
different types of information, and it is not always clear which measure should be used in a
given problem situation.
3. There may be nonsense correlation between two variables which is purely due to chance
and has no practical relevance, such as an increase in income and an increase in the weight
of a group of people. However, there is nothing like nonsense regression.
4. The correlation coefficient is independent of change of origin and scale, while regression
coefficients are independent of change of origin but not of scale.
ELEMENTARY PROBABILITY THEORY
Probability Distributions
Probability distribution of a random variable is a listing of the values of random variable with the
corresponding probability associated with each value of the random variable. When probability
values are assigned to all possible values of a random variable either by listing or by a
mathematical function, the result is a probability distribution.
Example; consider rolling a fair die once, and let X be the number that shows. Each of the six
outcomes has probability 1/6:

X:     1    2    3    4    5    6
P(X): 1/6  1/6  1/6  1/6  1/6  1/6

Here P(X = x) = 1/6 for x = 1, 2, 3, …, 6, and the probabilities sum to
1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1.
Whenever possible we try to express probability distribution by means of a formula which enables
us to calculate the probabilities associated with various numerical descriptions of the outcomes
with the usual functional notations.
Random variable
A variable whose value is determined by the outcomes of a random experiment that are not under
the control of the observer is called a random variable. A random variable is a function defined over
the sample space of an experiment and generally assumes different values, each with a definite
probability associated with it.
Example; in three tosses of a coin the number of heads obtained is a random variable which can
take any one of the four values 0, 1, 2, or 3, depending on the outcome of the tosses.
Discrete random variable; A variable which takes on the integral values such as 0, 1, 2, 3….
Example; number of defective items in a sample, printing mistakes in each page of a book and
telephone calls received by telephone switchboard of a firm.
Continuous random variable; a variable which takes on all values within a certain interval. Its
probabilities are determined by a mathematical function and are typically portrayed by a probability
density function or probability curve. Example; height, weight (mass), time, volume and
temperature.
Example 1; a continuous distribution has probability density function given by f(x) = x(2 − x),
0 ≤ x ≤ 2. Find:
i. Mean
ii. Variance
iii. Mode
iv. Median
v. Check if the above function is p.d.f
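A sketch of parts (i), (ii) and (v) by symbolic integration, assuming the SymPy library is available.
Note that, as written, the integral of x(2 − x) over [0, 2] is 4/3 rather than 1, which is exactly what
the check in part (v) detects; multiplying by the constant 3/4 makes f a proper p.d.f.:

import sympy as sp

x = sp.symbols('x')
f = x * (2 - x)                    # the given function on [0, 2]

print(sp.integrate(f, (x, 0, 2)))  # (v) total area = 4/3, not 1, so f as
                                   # given needs a normalizing constant

g = sp.Rational(3, 4) * f          # normalized density (3/4)x(2 − x)
mean = sp.integrate(x * g, (x, 0, 2))               # (i) mean = 1
var = sp.integrate((x - mean) ** 2 * g, (x, 0, 2))  # (ii) variance = 1/5
print(mean, var)                   # mode and median are also 1 by symmetry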
The probability distribution of a random variable which takes on integral values such as 0, 1, 2,
3… together with their associated probabilities is a discrete probability distribution. The set of the 6
outcomes in rolling a die and their associated probabilities is an example of a discrete probability
distribution. The sum of the probabilities associated with all the values that the discrete random
variable can assume is always equal to 1.
a) Binomial distribution
This is used to find the probability of X number of occurrences or success of an event P (x) , in n
trials of the same experiment when there are only two possible and mutually exclusive outcomes,
the n trials are independent and probability of occurrence or success p remain constant in each
trial.
The binomial distribution assumes that:
i. The experiment consists of a fixed number of trials, n.
ii. The trials must be independent.
iii. Each trial has two mutually exclusive outcomes, success and failure.
iv. The probability of success remains constant from trial to trial.
v. The number of trials should not be more than 30.
Mean = np
Variance = npq
P(x) = nCx p^x (1 − p)^(n−x), x = 0, 1, 2, …, n
where q = 1 − p = probability of failure.
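A minimal sketch of this formula in Python, using only the standard library (the numbers match
exercise (i) below: exactly 3 winners in 8 races with p = 0.4):

from math import comb

def binomial_pmf(x, n, p):
    # P(x) = nCx · p^x · (1 − p)^(n − x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(binomial_pmf(3, 8, 0.4))  # about 0.2787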
Example 1;
The mean of a binomial distribution is 40 and standard deviation is 6. Calculate n,p and q.
Example 2;
Assume that on an average one telephone number out of fifteen is busy. What is probability that if
6 randomly selected telephone numbers are called?
Exercise;
i. A certain system of betting on horses produces winners 40% of the time. What is the
probability that the system produces exactly 3 winners out of 8 on a race day?
ii. A biased die is thrown 30 times and the number of 6s seen is 8. If the die is thrown 12
times, find the probability that a 6 will occur exactly twice.
iii. Toss a fair coin 12 times. Find the probability of obtaining at least 9 heads in the 12 tosses.
iv. A coin is biased so that a head is twice as likely to occur as a tail. If the coin is tossed four
times, what is the probability of getting exactly 2 tails?
v. A manufacturer claims that at most 10% of his product is defective. To test this claim 18
units are inspected and his claim is accepted if among these 18 units at most 2 are
defective. Find the probability that the manufacturer's claim will be accepted if the actual
probability that a unit is defective is 0.05 and 0.15.
b) Poisson Distribution
It is used to determine the probability of a designated number of successes per unit of time when
the events or successes are independent and average number of successes per unit of time
remains constant.
Poisson distribution is a limiting case of the binomial distribution when the number of trials n is
very large while the probability of success p is very small, with the mean np = λ remaining
constant. Its probability function is
P(x) = λ^x e^(−λ) / x!, x = 0, 1, 2, …
where λ is the average number of successes per unit of time.
EXERCISE
1. A manufacturer of cotter pins knows that 5% of his product is defective. If he sells cotter
pins in boxes of 100 and guarantees that not more than 10 pins will be defective, what is
the approximate probability that a box will fail to meet the guaranteed quality?
2. Assuming that the chance of a traffic accident in a day in a street of Mwanza is 0.001, on
how many days out of a total of 1000 days can we expect:
i. No accident
ii. More than three accidents, if there are 1000 such streets in the whole city?
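A sketch of exercise 2 in Python, under one common reading of the problem: across the 1000
streets the Poisson parameter is λ = np = 1000 × 0.001 = 1 accident per day.

from math import exp, factorial

def poisson_pmf(x, lam):
    # P(x) = λ^x e^(−λ) / x!
    return lam ** x * exp(-lam) / factorial(x)

lam = 1000 * 0.001                 # accidents per day across 1000 streets
p0 = poisson_pmf(0, lam)           # P(no accident) = e^(-1), about 0.3679
p_gt3 = 1 - sum(poisson_pmf(k, lam) for k in range(4))

print(round(1000 * p0))            # about 368 days with no accident
print(round(1000 * p_gt3))         # about 19 days with more than 3 accidents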
Normal Distribution;
The normal distribution is the most important continuous distribution in statistics. Many measured
quantities in the natural sciences follow a normal distribution for example heights, masses, ages,
random errors, IQ scores, examination results.
The probability distribution of a random variable which is determined by the range of all possible
values within defined interval limits, together with their associated probabilities, is known as a
continuous probability distribution. The probability distribution of a continuous random variable is often called
a probability density function or simply probability function. It is given by a smooth curve such that
the total area under the curve is 1.
The normal distribution is a continuous probability function that is bell-shaped, symmetrical about
the mean and extends indefinitely in both directions but most of the area clustered around the
mean.
To standardize a random variable X, subtract µ and divide by σ, so the standard normal variable is
given by Z = (X − µ)/σ. If the random variable X has a normal distribution with mean µ and variance
σ² then the random variable Z has a standard normal distribution with mean 0 and variance 1.
Example 1; If Z ~ N(0, 1), find from tables (a) P(Z > 1.377), (b) P(Z < 1.377), (c) P(Z > −1.377),
(d) P(Z < −1.377).
Example 2; If Z ~ N(0, 1), find,
Example 3; If Z ~ N(0, 1), find the value of a if (a) P(Z > a) = 0.3802, (b) P(Z < a) = 0.9693,
(c) P(|Z| < a) = 0.9.
Example 4; The random variable X ~ N(300, 25). Find (a) P(X > 305), (b) P(X < 291).
Example 5; The random variable is such that X ~ N(50, 8). Find (a) P(48 < X < 54),
(b) P(46 < X < 49), (c) P(|X − 50| < √8).
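Example 4 can be checked numerically with the standard library's statistics.NormalDist
(Python 3.8+). Here X ~ N(300, 25) is read as mean 300 and variance 25, so σ = 5:

from statistics import NormalDist

X = NormalDist(mu=300, sigma=5)

print(1 - X.cdf(305))  # (a) P(X > 305): z = (305 − 300)/5 = 1, about 0.1587
print(X.cdf(291))      # (b) P(X < 291): z = -1.8, about 0.0359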
Example 6;
i. The time taken by a milkman to deliver milk to the high street is normally distributed
with mean 12 minutes and standard deviation 2 minutes. He delivers milk every day.
Estimate the number of days during the year when he takes (a) longer than 17 minutes
(b) less than 10 minutes (c) between 9 and 13 minutes.
ii. A certain type of cabbage has a mass which is normally distributed with mean 1kg and
standard deviation 0.15kg. In a lorry load of 800 of these cabbages, estimate how
many will have mass (a) greater than 0.79kg, (b) less than 1.13kg, (c) between 0.85kg
and 1.15kg, (d) between 0.75kg and 1.29kg.
iii. The marks of 500 candidates in an examination are normally distributed with mean of
45 marks and a standard deviation of 20 marks.
a. Given that the pass mark is 41, estimate the number of candidates who passed
the examination.
b. If 5% of the candidates obtained a distinction by scoring x marks or more,
estimate the value of x.
c. Estimate the interquartile range of the distribution.
Hypothesis Testing
General Description
There are two types of statistical inferences: estimation of population parameters and hypothesis
testing. Hypothesis testing is one of the most important tools of application of statistics to real life
problems. Most often, decisions are required to be made concerning populations on the basis of
sample information. Statistical tests are used in arriving at these decisions.
There are five ingredients to any statistical test: the null hypothesis, the alternative hypothesis,
the test statistic, the rejection region, and the conclusion. Each is described below.
In attempting to reach a decision, it is useful to make an educated guess or assumption about the
population involved, such as the type of distribution.
Statistical Hypotheses: They are defined as assertion or conjecture about the parameter or
parameters of a population, for example the mean or the variance of a normal population. They
may also concern the type, nature or probability distribution of the population.
Statistical hypotheses are based on the concept of proof by contradiction. For example, say we
want to show that the mean of a population has changed; we proceed by assuming it has not.
Null Hypothesis: It is a hypothesis which states that there is no difference between the
procedures and is denoted by H0. For the above example the corresponding H0 would be that
there has been no increase or decrease in the mean. Always the null hypothesis is tested, i.e., we
want to either accept or reject the null hypothesis because we have information only for the null
hypothesis.
Alternative Hypothesis : It is a hypothesis which states that there is a difference between the
procedures and is denoted by HA.
Test Statistic: It is the random variable X whose value is tested to arrive at a decision. The
Central Limit Theorem states that for large sample sizes (n > 30) drawn randomly from a
population, the distribution of the means of those samples will approximate normality, even when
the data in the parent population are not distributed normally. A z statistic is usually used for large
sample sizes (n > 30), but often large samples are not easy to obtain, in which case a t statistic is
used, computed with the sample standard deviation, s. The t curves are bell shaped and
distributed around t = 0. The exact shape of a given t-curve depends on the degrees of freedom.
In the case of performing multiple comparisons by one way ANOVA, the F-statistic is normally
used. It is defined as the ratio of the mean square due to the variability between groups to the
mean square due to the variability within groups. The critical value of F is read off from tables of
the F-distribution, knowing the Type-I error rate and the degrees of freedom.
Rejection Region: It is the part of the sample space (critical region) where the null hypothesis H0
is rejected. Rejection of H0 implies only that the observed difference between the sample statistic
and the mean of the sampling distribution did not occur by chance alone.
Conclusion: If the test statistic falls in the rejection/critical region, H0 is rejected, else H0 is
accepted.
Types of Tests
Tests of hypothesis can be carried out on one or two samples. One sample tests are used to test
hypotheses about a single population parameter (for example, whether the mean µ equals a
specified value), while two sample tests compare the parameters of two populations (µ1 and µ2).
Two sample tests can further be classified as unpaired or paired two sample tests. While in
unpaired two sample tests the sample data are not related, in paired two sample tests the sample
data are paired according to some identifiable characteristic. For example, when testing
hypothesis about the effect of a treatment on (say) a landfill, we would like to pair the data taken
at different points before and after implementation of the treatment.
One tailed test: Here the alternate hypothesis HA is one-sided and we test whether the test
statistic falls in the critical region on only one side of the distribution.
1. One sample test: For example, we are measuring the concentration of a lake and we need
to know if the mean concentration of the lake is greater than a specified value of 10 mg/L.
Hence, H0: µ ≤ 10 mg/L, vs, HA: µ > 10 mg/L.
2. Two sample test: In Table 1, cases 2 and 3 are illustrations of two sample, one tailed tests.
In case 2 we want to test whether the population mean of the first sample is less than
that of the second sample.
Hence, H0: µ1 ≥ µ2, vs, HA: µ1 < µ2.
Two tailed test: Here the alternate hypothesis HA is formulated to test for difference in either
direction, i.e., for either an increase or a decrease in the random variable. Hence the test statistic
is tested for occurrence within either of the two critical regions on the two extremes of the
distribution.
1. One sample test: For the lake example we need to know if the mean concentration of the
lake is the same as or different from a specified value of 10 mg/L.
Hence, H0: µ = 10 mg/L, vs, HA: µ ≠ 10 mg/L.
2. Two sample test: In Table 1, case 1 is an illustration of a two sample two tailed test. In
case 1 we test whether the population mean of the first sample (µ1) is the same as that of
the second sample (µ2).
Hence H0: µ1 = µ2, vs, HA: µ1 ≠ µ2.
Given the same level of significance the two tailed test is more conservative, i.e., it is more
rigorous than the one-tailed test because the rejection point is farther out in the tail. It is more
difficult to reject H0 with a two-tailed test than with a one-tailed test.
Error
When using probability to decide whether a statistical test provides evidence for or against our
predictions, there is always a chance of drawing the wrong conclusions. Even when choosing a
probability level of 95%, there is always a 5% chance that one rejects the null hypothesis when it
is in fact true. This is called a Type I error, and its probability is denoted by α.
It is possible to err in the opposite way if one fails to reject the null hypothesis when it is, in fact,
incorrect. This is called a Type II error, and its probability is denoted by β.
A related concept is power, which is the probability of rejecting the null hypothesis when it is
actually false. Power is simply 1 minus the Type II error rate, and is usually expressed as 1 − β.
When choosing the probability level of a test, it is possible to control the risk of committing a Type
I error. This also affects the Type II error, since the two are inversely related: as one increases,
the other decreases.
There is little control over the risk of committing a Type II error, because it also depends on the
actual difference being evaluated, which is usually unknown, and on the actual distribution of the
population.
The consequences of these different types of error are very different. For example, if one tests for
the significant presence of a pollutant, incorrectly deciding that a site is polluted (Type I error) will
cause a waste of resources and energy cleaning up a site that does not need it. On the other
hand, failure to determine presence of pollution (Type II error) can lead to environmental
deterioration or health problems in the nearby community.
Select the test statistic and determine its value from the sample data. This value is
called the observed value of the test statistic. Remember that a t statistic is usually
appropriate for a small number of samples; for a larger number of samples, a z
statistic can work well if the data are normally distributed.
Compare the observed value of the statistic to the critical value obtained for the
chosen level of significance.
Make a decision.
Practical Examples
An aquaculture farm takes water from a stream and returns it after it has circulated through the
fish tanks. The owner thinks that, since the water circulates rather quickly through the tanks, there
is little organic matter in the effluent. To find out if this is true, he takes some samples of the water
at the intake and other samples downstream of the outlet and tests for Biochemical Oxygen Demand
(BOD). If BOD increases, it can be said that the effluent contains more organic matter than the
stream can handle.
The data for this problem are given in the following table:
Upstream Downstream
6.782 9.063
5.809 8.381
6.849 8.660
6.879 8.405
7.014 9.248
7.321 8.735
5.986 9.772
6.628 8.545
6.822 8.063
6.448 8.001
1. A is the set of samples taken at the intake; and B is the set of samples taken downstream.
o H0: µB ≤ µA
o HA: µB > µA
2.
3. The observed t value is calculated
4. The critical t value is obtained according to the degrees of freedom
Upstream Downstream
Observations 10 10
Degrees of freedom 18
t stat -8.9941
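The tabulated values can be reproduced with a pooled (equal-variance) two sample t statistic, as
the 18 degrees of freedom suggest. A minimal sketch in Python using the data above:

from math import sqrt
from statistics import mean, variance

upstream = [6.782, 5.809, 6.849, 6.879, 7.014,
            7.321, 5.986, 6.628, 6.822, 6.448]
downstream = [9.063, 8.381, 8.660, 8.405, 9.248,
              8.735, 9.772, 8.545, 8.063, 8.001]

n1, n2 = len(upstream), len(downstream)
# pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * variance(upstream) +
       (n2 - 1) * variance(downstream)) / (n1 + n2 - 2)
t = (mean(upstream) - mean(downstream)) / sqrt(sp2 * (1 / n1 + 1 / n2))

print(n1 + n2 - 2)  # 18 degrees of freedom
print(round(t, 4))  # -8.9941, matching the table above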
B) Two tailed Test
Let us assume that an induced bioremediation process is being conducted at a contaminated site.
The researcher has obtained good cleanup rates by injecting a mixture of nutrients into the soil in
order to maintain an abundant microbial community. Someone suggests using a cheaper mixture.
The researcher tries one patch of land with the new mixture, and compares the degradation rates
to those obtained from a patch treated with the expensive one to see if he can get the same
degradation rates.
The data for this problem are shown in the following table (first column: the patch treated with
the cheap mixture, A; second column: the patch treated with the expensive mixture, B):
7.1031 9.6662
6.4085 10.1320
8.8819 9.0624
7.0094 8.8136
4.6715 9.2345
6.6135 9.9949
6.5877 9.4299
6.2849 8.8012
6.6789 9.9249
6.5542 8.1739
1) A is treated with the cheap nutrient; and B is treated with the expensive one.
H0: µA = µB
HA: µA ≠ µB
Assuming that variances from the two sets are unequal, we obtain the following t test:
Observations 10 10
Degrees of freedom 15
t Stat -6.9691
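Because unequal variances are assumed, this is Welch's form of the t test. A minimal sketch in
Python that reproduces the tabulated values; the assignment of the two data columns to A
(cheap) and B (expensive) follows step 1:

from math import sqrt
from statistics import mean, variance

A = [7.1031, 6.4085, 8.8819, 7.0094, 4.6715,
     6.6135, 6.5877, 6.2849, 6.6789, 6.5542]
B = [9.6662, 10.1320, 9.0624, 8.8136, 9.2345,
     9.9949, 9.4299, 8.8012, 9.9249, 8.1739]

va, vb = variance(A) / len(A), variance(B) / len(B)
t = (mean(A) - mean(B)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (len(A) - 1) + vb ** 2 / (len(B) - 1))

print(round(t, 4))  # -6.9691, matching the table above
print(round(df))    # about 15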
Although hypothesis tests are a very useful tool in general, they are sometimes not appropriate in
the environmental field. The following cases illustrate some of the limitations of this type of test:
A) Multiple Comparisons
z and t tests are very useful when comparing two population means. However, when it comes to
comparing several population means at the same time, this method is not very appropriate.
Suppose we are interested in comparing pollutant concentrations from three different wells with
means µ1, µ2 and µ3. We could test the following hypothesis:
H0: µ1 = µ2 = µ3
HA: not all means are equal
We would need to conduct three different hypothesis tests, which are shown here:
H0: µ1 = µ2    H0: µ2 = µ3    H0: µ1 = µ3
HA: µ1 ≠ µ2    HA: µ2 ≠ µ3    HA: µ1 ≠ µ3
For each test, there is always the possibility of committing an error. Since we are conducting three
such tests, the overall error probability would exceed the acceptable range, and we could not
feel very confident about the final conclusion. Table 8 shows the probability of committing a
Type I error when multiple t tests are conducted; each k value represents the number of
populations to be compared.
Table 8. Probability of committing a type I error by using multiple t tests to seek differences
between all pairs of k means, for levels of significance of 0.20, 0.10, 0.05, 0.02, 0.01 and 0.001
used in the individual t tests and for various numbers of means (k). (The body of the table is not
reproduced here.)
Note: The particular values were derived from a table by Pearson (1942) by assuming equal
population variances and large samples.
A better method for comparing several population means is an analysis of variance, abbreviated
as ANOVA.
The ANOVA test is based on the variability between the sample means. This variability is measured
in relation to the variability of the data values within the samples. These two variances are
compared by means of the F ratio test.
If there is a large variability between the sample means, this suggests that not all the population
means are equal. When the variability between the samples means is large compared to the
variability within the samples, it can be concluded that not all the population means are equal.
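A minimal sketch of the F ratio computation for a one way ANOVA; the three groups below are
hypothetical, used only to show the steps:

from statistics import mean

groups = [[4.1, 5.0, 4.6, 4.8],
          [6.2, 5.8, 6.5, 6.1],
          [5.1, 4.9, 5.3, 5.0]]

k = len(groups)
n = sum(len(g) for g in groups)
grand = mean(v for g in groups for v in g)

# variability between the sample means, and within the samples
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

F = (ss_between / (k - 1)) / (ss_within / (n - k))
print(F)  # compare with F tables at (k - 1, n - k) degrees of freedom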
B) Multiple Constituents
In example 1, we were only testing for BOD, so only one t test was necessary. If we had been
trying to trace more than one pollutant, which is usually the case, we would have to carry out a
different test for each pollutant in order to determine whether the effluent was similar to the
receiving stream. Then we would have the same problem we encountered with multiple
comparisons.
The tests used in the testing of hypothesis, viz., t-tests and ANOVA have some fundamental
assumptions that need to be met, for the test to work properly and yield good results. The main
assumptions for the t-test and ANOVA are listed below.
1. The samples are drawn randomly from a population in which the data are normally
distributed.
2. In the case of a two sample t-test, σ1² = σ2². Therefore it is assumed that s1² and s2² both
estimate a common population variance, σ². This assumption is called the homogeneity of
variances.
3. In the case of a two sample t-test, the measurements in sample 1 are independent of
those in sample 2.
Like the t-test, analysis of variance is based on a model that requires certain assumptions. Three
primary assumptions of ANOVA are that:
1. Each group is obtained randomly, with each observation independent of all other
observations and the groups independent of each other.
2. The samples represent populations in which the data are normally distributed.
3. σ1² = σ2² = σ3² = … = σk². The assumption of homogeneity of variances is similar to the
discussion above under the t-test. The group variances are assumed to be estimates of a
common population variance, σ².
In actual experimental or sampling situations, the underlying populations are not likely to be
exactly normally distributed with exactly equal variances. Both the t-test and ANOVA are quite
robust and yield reliable results when some of the assumptions are not met. For example, if n1 =
n2 = ... = nk, ANOVA tends to be especially robust with respect to the assumption of homogeneity
of variances. As the number of groups tested, k, increases there is a greater effect on the value
of the F-statistic. It is also seen that a reasonable departure from the assumption of population
normality does not have a serious effect on the reliability of the F-statistic or the t-statistic. It is
essential, however, that the assumption of independence be met. The analysis is not robust for
non-independent measurements. These factors are to be taken into consideration while testing
hypotheses.
ESTIMATION THEORY.
Estimation theory is a branch of statistics and signal processing that deals with estimating the
values of parameters based on measured/empirical data. The parameters describe the physical
scenario or object that answers a question posed by the estimator. The entire purpose of
estimation theory is to arrive at an estimator, preferably an implementable one that could
actually be used; the estimator takes the measured data as input and produces an estimate of the
parameters.
It is also preferable to derive an estimator that exhibits optimality. An optimal estimator would
indicate that all available information in the measured data has been extracted.
In statistics an estimator is a function of the observable sample data that is used to estimate
unknown population parameters. An estimate is the result from the actual application of the
function to a particular set of data.
Estimation is the process of inferring or estimating a population parameter from the corresponding
statistic of a sample drawn from the population. To be valid estimation must be based on a
representative sample.
Point estimate. The estimate is a single value/number generated from a representative
sample that yields the best estimate of the population parameter.
Interval estimate. Refers to a range of values, together with the probability (or confidence level)
that the interval includes the unknown population parameter; that is, the estimate is a range of
scores/values.
Properties of estimators
NONPARAMETRIC TESTS
Introduction;
Inferential statistical methods are divided into parametric methods and nonparametric methods.
Parametric methods deal with the study of the sampling distributions of sample statistics like the
mean x̄ and variance s². These are studied with the aim of testing some hypothesis concerning
population parameters, such as the mean µ or variance σ², for the variable phenomenon being
studied.
Nonparametric techniques;
Non parametric tests are used even with nominal or ordinal data and do not make assumptions
about values of the population parameters or about the shape of sampled population. As such
non-parametric tests are considered as distribution free statistical test techniques in which the
focus of inference is whether or not a given sample characterizes specified population. However,
sample observations for non-parametric testing must be drawn at random so that resulting errors
are uncorrelated.
d) Specify the level of significance, usually 0.05.
e) Specify the decision condition: reject or accept.
f) Compare the experimental value with the theoretical value of its sampling distribution.
g) Present findings and conclusions about the null hypothesis being tested.
Non-parametric techniques are broadly divided into: sign, Rank-sum and randomness
tests.
SIGN TESTS
Sign tests are employed whenever data are capable of being categorized into two groups:
bad versus good, high versus low, positive versus negative, or plus versus minus. In short
sign tests are applied whenever the set of data (nominal, ordinal, interval, ratio) have been
signed so that we always deal with “sign” of the observations rather than their values. The
test statistic that is considered in the non-parametric technique is the number of signs or
counts or frequencies. Thus the sampling distribution of the number of positive or negative
signs that are obtained in a random trial is studied and inference on the null hypothesis
reached upon.
RANDOMNESS TESTS
In all the statistical tests discussed above, there is one major assumption: that the
study sample is random. How is the randomness of a sample ascertained? That is, given a
sequence of events or observations, how can we check that they constitute a random
phenomenon? The answer to such questions is based on the number of times there is a
change in the sequence of events or observations. A change in the sequence of events or
observations is called a run.
Definition:
A run is a succession of identical events or attributes that may be represented by letters or
other symbols, and followed by different succession of events or attributes or no event at
all.
Thus, a run is a change of sequence of identical events. The runs test studies a sequence of
events or observations where each element in the sequence may assume one of two
outcomes such as success versus failure, good versus bad, large versus small, non-
defective versus defective, or below versus above a level of some specified attribute
index. The runs test is thus applied to a sequence of n1 successes and n2 failures. In a
manufacturing process, for instance, it is of great interest to know the extent to which
sequences of defective and/or non-defective items are randomly produced. Non-randomness
of the sequence of defective items implies lack of process control.
The hypothesis being tested is that the sample is random; its alternative is that the sample is
non-random. The number of runs, or identical successions of events or attributes, of the
sampled observations or events tests randomness of the sample. If the null hypothesis is
true, the number of runs R has the following asymptotic parametric characteristics.
The expected value of R is E(R) = 2n1n2/(n1 + n2) + 1 and the variance of R is

σR² = 2n1n2(2n1n2 − n1 − n2) / [(n1 + n2)²(n1 + n2 − 1)].

Consequently, the test statistic Z = (R − E(R))/σR is approximately standard normal for
large samples.
Example 1;
Consider the following arrangement of healthy (H) and diseased (D) pine trees observed
next to each other in a given order by a Forestry Service crew. Is there statistical evidence
to show that the observed distribution of pine trees constitutes a random phenomenon?
Solution;
4. Decision rule; for small samples the critical value of R is obtained from statistical
tables of the probability distribution of the total number of runs R, such that
P(R ≤ R0) = α. For large samples the statistic Z has a standard normal distribution.
5. Conclusion: The null hypothesis is rejected implying that distribution of pine trees is
non-random. There is a real systematic factor to the observed distribution of pine
trees observed such as tree disease.
In the above example, a dividing line between attributes was subjective. There are
cases when median or some other test value index is used to classify the two
outcomes in a runs test, for instance, below median versus above median. Thus,
we may use event A for an outcome that is below the median and event B for an
outcome that is above median. The sequence of these designated events now
becomes the subject of analysis.
Solution:
The test statistic is based on R, the number of runs, i.e., the changes of pulse rate from
below to above the median.
Let A represent an event that pulse rate is below median and B represent the event
that pulse rate is above median.
The given data on pulse rate are represented as a sequence of events A and B as
follows: ABBAAAABBBABB. The number of runs in this sequence of events A and B is
R = 6, counted as follows: A, BB, AAAA, BBB, A, BB. Given that n1 = 7 and n2 = 6,
then:
Conclusion: The null hypothesis is not rejected, implying that the pulse rate data are
random; there is no systematic factor surrounding the pulse rate of the astronauts.
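A minimal sketch of this computation in Python. (Counting the letters in the given sequence
yields 6 As and 7 Bs rather than the 7 and 6 stated above, but E(R), σR and hence Z are
symmetric in n1 and n2, so the result is unchanged.)

from math import sqrt

seq = "ABBAAAABBBABB"
R = 1 + sum(1 for i in range(1, len(seq)) if seq[i] != seq[i - 1])
n1, n2 = seq.count("A"), seq.count("B")

ER = 2 * n1 * n2 / (n1 + n2) + 1
varR = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
        / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
Z = (R - ER) / sqrt(varR)

print(R)            # 6 runs: A, BB, AAAA, BBB, A, BB
print(round(Z, 2))  # about -0.85, inside -1.96 < Z < 1.96: H0 not rejected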
RANK-SUM TESTS
Rank-sum nonparametric tests rely on ranking of the sample observations and are
therefore applicable to data that are measured at ordinal level or above. The statistic to
be studied is the rank (rather than the value) of the sampled observations. Thus, the
focus in rank-sum tests is the sampling distribution of the sum of ranks. The analytical
framework for using rank-sum nonparametric tests varies from one-way ANOVA
(for completely randomized designs) to two-way ANOVA (for randomized block
designs). The null hypothesis that is being tested also varies with the research
design used.
Given the context of the research designs that are typically associated with
before-after, with-without, or two treatments being applied to homogeneous experimental
units, the focus of the Wilcoxon test is to answer questions related to the following;
Are there significant differences in the subjects‟ response before and after the
treatment?
Are there significant differences in the subjects‟ response with and without
treatment?
Are two treatments statistically different?
The statistic that is used to test the null hypothesis is the sum of ranks of the differences of
the paired observations, which are ranked while retaining the +ve/−ve sign identity. If the
null hypothesis is true, T = min(R+, R−), where R+ and R− are the sums of the ranks of the
positive and the negative differences respectively, has the following asymptotic parametric
characteristics: E(T) = n(n + 1)/4 and σT² = n(n + 1)(2n + 1)/24, where n is the number of
non-zero paired differences.
Henceforth, the test statistic for the Wilcoxon nonparametric test is

Z = (T − E(T)) / σT.
The acceptance region for the null hypothesis is −1.96 < Z < 1.96; the decision rule is: reject H0 if
|Z| > 1.96. However, for small samples, critical values of T are tabulated for each relevant level of
significance and sample size n.
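A minimal sketch of the signed-rank computation in Python. The before/after pairs are
hypothetical and chosen so that no two absolute differences tie (tied and zero differences need
extra handling not shown here):

from math import sqrt

before = [72, 80, 65, 90, 75, 84, 70, 88]
after = [70, 85, 61, 93, 69, 83, 79, 80]

# signed differences, ranked by absolute size (smallest = rank 1)
d = sorted((a - b for a, b in zip(after, before) if a != b), key=abs)
r_plus = sum(i + 1 for i, v in enumerate(d) if v > 0)
r_minus = sum(i + 1 for i, v in enumerate(d) if v < 0)
T, n = min(r_plus, r_minus), len(d)

ET = n * (n + 1) / 4
sdT = sqrt(n * (n + 1) * (2 * n + 1) / 24)
Z = (T - ET) / sdT

print(T, round(Z, 2))  # T = 16, Z about -0.28: H0 not rejected at 5% level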
In a way the Kruskal-Wallis test is an alternative to the one-way analysis of variance for completely
randomized research designs. The test is based on the combined ranking procedure while retaining
the sample identity, as in the Mann-Whitney test. However, the test statistic in the Kruskal-Wallis
test procedure is the variation of ranks, which is computed as:
procedure is the variation of ranks, which is computed as;
k
R2I
3n 1. where RI some of ranks assigned to observations in the i th
12
H
nn 1 i ni
k
sample, ni =size of the ith sample, and n n
i
n1 n2 n3 ... nk
64
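A minimal sketch of the H statistic in Python; the three samples are hypothetical and contain no
tied values (ties require averaged ranks, not handled here):

samples = [[12, 15, 11], [18, 22, 19, 20], [14, 16, 13]]

pooled = sorted(v for s in samples for v in s)
rank = {v: i + 1 for i, v in enumerate(pooled)}  # rank 1 = smallest value

n = len(pooled)
# H = [12/(n(n+1))] Σ (Ri²/ni) − 3(n+1)
H = 12 / (n * (n + 1)) * sum(
    sum(rank[v] for v in s) ** 2 / len(s) for s in samples) - 3 * (n + 1)

print(H)  # 7.0 here; compare with chi-square tables, k − 1 degrees of freedom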