Introductory Statistics Notes


INTRODUCTION TO STATISTICS

Numbers play an essential role in statistics: they provide the raw material of statistics. These materials must be processed to be useful, just as crude oil must be refined into petrol before it can be used by an automobile engine. The study of statistics involves methods of refining numerical and non-numerical information into useful forms.

Numbers can represent quantities and values of commodities produced and sold, prices of products, inventories, assets and liabilities, raw materials, customers, income and expenses, records of birth and death rates, the number of passengers who traveled during a year by road, rail, ship, etc., the value of imports of different commodities to and from different countries, the number of students in various courses at a university, agricultural production, or the size of the working population in a certain village, ward, district, region or country.

Whenever numbers are collected and compiled, regardless of what they represent, they become statistics. In other words, the term statistics is considered synonymous with ways and means of representing and handling data, making inferences logically and drawing conclusions.

Statistical Methods:

The large volume of numerical information gives rise to the need for systematic methods which
can be used to organize, present, analyze and interpret the information effectively. Statistical
methods are primarily developed to meet this need.

The methods by which statistical data are analyzed are called statistical methods. Statistical methods are applicable to a very large number of fields such as Economics, Sociology, Management, Agriculture, etc.

Statistical methods are used by governmental bodies, private firms and research agencies as an indispensable aid in forecasting, controlling and exploring.

Note: Statistics is usually not studied for its own sake; rather, it is widely employed as a tool, and a highly valuable one, in the analysis of problems in the natural, physical and social sciences.

Statistics:

The collection of methods used in planning an experiment and analyzing data in order to draw accurate conclusions.

Refers to the collection, presentation, analysis and utilization of numerical data to make inferences and reach decisions in the face of uncertainty in economics, business and other social and physical sciences.

Is the body of procedures and techniques used to collect, present and analyze data on which to base decisions in the face of uncertainty or incomplete information.

ABUSES OF STATISTICS

Most of the time, samples are used to infer something about the population. If an experiment is carried out but the results are interpreted wrongly, then the information is meaningless. Things which can lead to wrong information are:

Sample size: If the sample is inappropriately large or small, it can lead to wrong information.

Summarization: Graphs which are unclear about what they mean, e.g., percentages which are not appropriate to what is being represented.

OBJECTIVES OF STATISTICS

To present facts in a definite form.
To simplify and classify a large mass of facts.
To furnish methods of comparison.
To settle questions of a general economic or political nature and enable management, researchers, government, industry, etc. to determine policies.

LIMITATION OF STATISTICS

Statistics is not suited to the study of qualitative phenomena such as honesty, poverty and culture.
Statistics deals with aggregates of objects and not with individual figures.
There is no certainty in statistics, as statistical laws are not exact (they hold only approximately).

USES OF STATISTICS

To inform the public.
To provide comparisons.
To influence decision making.
To simplify and classify a large mass of facts.
To predict future outcomes.
To justify claims.
To establish relationships/associations between variables.

Definitions of terms.

Descriptive statistics generally characterizes or describes a set of data elements by graphically displaying the information or describing its central tendencies and how it is distributed. Alternatively, descriptive statistics summarizes a body of data with one or two pieces of information that characterize the whole data. It also refers to the presentation of a body of data in the form of tables, charts, graphs, and other forms of graphic display.
Inferential statistics tries to infer information about a population by using information gathered by sampling. It refers to the drawing of generalizations about the properties of the whole from a specific sample drawn from the population; it involves inductive reasoning. (This is to be contrasted with deductive reasoning, which ascribes properties to the specific starting from the whole.)
Applied statistics is the application of statistical methods and techniques to the problems and facts of real life as they exist, for example quality control, sample surveys and quantitative analysis for business decisions.
Population: The complete set of data elements is termed the population. The meaning of the term will vary widely with its application. Examples, each narrower than the last, could be: animals; primates; human beings (Homo sapiens); high school students; students attending the Math & Science Center.
Sample: A sample is a portion of a population selected for further analysis.
Parameter: A parameter is a characteristic of the whole population.

Statistic: A statistic is a summary, typically numerical, of a variable over a sample.

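To make the parameter/statistic distinction concrete, here is a minimal Python sketch; the small population of ages is hypothetical, invented purely for illustration.

```python
# A parameter summarizes the whole population; a statistic summarizes a sample.
import random

population = [23, 19, 31, 25, 28, 22, 35, 27, 30, 24]  # all N population values

parameter = sum(population) / len(population)   # population mean: a parameter

random.seed(1)                                  # fixed seed for reproducibility
sample = random.sample(population, 4)           # a sample of n = 4 elements
statistic = sum(sample) / len(sample)           # sample mean: a statistic

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```
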
DATA TYPES
Data are the facts and figures collected, analyzed and summarized for presentation and
interpretation. All the data collected in a particular study are referred to as the data set for the
study. Data can also be classified as either qualitative or quantitative.

Qualitative data.
Data are nonnumeric. {Poor, Fair, Good, Better, Best}, colors (ignoring any physical causes), and
types of material {straw, sticks, bricks} are examples of qualitative data.

Qualitative data include labels or names used to identify an attribute of each element. Qualitative
data use either the nominal or ordinal scale of measurement and may be nonnumeric or numeric.
The statistical analysis appropriate for a particular variable depends upon whether the variable is
qualitative or quantitative. If the variable is qualitative the statistical analysis is rather limited. We
can summarize qualitative data by counting the number of observations in each qualitative
category or by computing the proportion of the observations in each category. However, even
when the qualitative data use a numeric code, arithmetic operations such as addition, subtraction,
multiplication and division do not provide meaningful results.

Qualitative data are often termed categorical data. Some books use the terms individual and variable to refer to the objects and characteristics described by a set of data. They also stress the importance of exact definitions of these variables, including the units in which they are recorded. The reason the data were collected is also important.

Quantitative data.
Quantitative data are numeric values that indicate how much or how many; they are obtained using either the interval or ratio scale of measurement. Because quantitative data are numeric, arithmetic operations such as addition, subtraction, multiplication and division provide meaningful results. For example, for a quantitative variable the data may be added and then divided by the number of observations to compute the average value. This average is usually meaningful and easily interpreted.

Categories of quantitative data

Quantitative data are classified as either discrete or continuous.

Discrete data are numeric data that have a finite number of possible values.
• A classic example of discrete data is a finite subset of the counting numbers, {1, 2, 3, 4, 5}, perhaps corresponding to {Strongly Disagree, ..., Strongly Agree}.
• When data represent counts, they are discrete. An example might be how many students were absent on a given day. Counts are usually considered exact and integer. Consider, however: if three tardies make an absence, then aren't two tardies equal to 0.67 absences?

Continuous data have infinite possibilities: 1.4, 1.41, 1.414, 1.4142, 1.41421...
The real numbers are continuous, with no gaps or interruptions. Physically measurable quantities of length, volume, time, mass, etc. are generally considered continuous. At the physical (microscopic) level, especially for mass, this may not be true, but for normal life situations it is a valid assumption.
The structure and nature of data will greatly affect our choice of analysis method. By structure we
are referring to the fact that, for example, the data might be pairs of measurements. Consider the
legend of Galileo dropping weights from the leaning tower of Pisa. The times for each item would
be paired with the mass (and surface area) of the item. Something which Galileo clearly did was
measure the time it took a pendulum to swing with various amplitudes. Galileo Galilei is
considered a founder of the experimental method.

Sources of data
Data can be obtained from two major sources:
Primary source.
Secondary source.

SCALES OF MEASUREMENT
The search for answers to research questions calls for the collection of data. Data are facts, figures
and other relevant materials collected, analyzed and summarized for presentation and
interpretation.

Data collection requires one of the following scales of measurement; nominal, ordinal, interval, or
ratio. The scale of measurement determines the amount of information contained in the data and
indicates the most appropriate data summarization and statistical analysis.

Nominal;
When the data for a variable consist of labels or names used to identify an attribute of the elements, the scale of measurement is considered a nominal scale. In cases where the scale of measurement is nominal, a numeric code as well as nonnumeric labels may be used. For instance, to facilitate data collection and entry of the data into a computer database, we might use a numeric code by assigning numbers to the labels.

Ordinal Scale;
The scale of measurement for a variable is called an ordinal scale if the data exhibit the properties of nominal data and, in addition, the order or rank of the data is meaningful. For instance, data recorded as excellent indicate the best, followed by good and then poor; the scale of measurement is ordinal. Note that ordinal data can also be recorded using a numeric code.

Interval Scale;
The scale of measurement for a variable becomes an interval scale if the data show the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure. Interval data are always numeric. For example, three students with SAT aptitude test scores of 1120, 1050 and 970 can be ranked or ordered from best performance to poorest performance. In addition, the differences between the scores are meaningful.

Ratio Scale
A variable is ratio scaled if the data have all the properties of interval data and the ratio of two values is meaningful. Variables such as distance, height, weight and time use the ratio scale of measurement. This scale requires that a zero value be included to indicate that nothing exists for the variable at the zero point. For example, consider the cost of an automobile: a zero value for the cost would indicate that the automobile has no cost, that is, it is free.

METHODS OF DATA COLLECTION


OBSERVATION.
Observation is a classical method of scientific enquiry. The body of knowledge of various natural
and physical sciences such as biology, physiology, astronomy, plant ecology etc. has been built
upon centuries of systematic observation.

Observation plays a major role in formulating and testing hypothesis in social sciences.
Behavioral scientists observe interactions in small groups; anthropologists observe simple
societies, and small communities; political scientists observe the behavior of political leaders and
political institutions. A researcher silently watching a city council, trade union committee, quality circle, departmental meeting or conference of politicians picks up hints that help him to formulate new hypotheses.

Observation becomes scientific when it serves a formulated research purpose, is planned deliberately, is recorded systematically, and is subjected to checks and controls on validity and reliability.

Reliability entails consistency and freedom from measurement error. This is usually assessed in
terms of the extent to which two or more independent observers agree in their ratings of the same
event.

Validity refers to the extent to which the recorded observations accurately reflect the construct
they are intended to measure. Validity is assessed by examining how well the observation agrees
with alternative measures of the same construct.

Observation is defined as the systematic viewing of a specific phenomenon in its proper setting for the specific purpose of gathering data for a particular study. Observation as a method includes both 'seeing' and 'hearing', and is accompanied by perceiving as well.

Advantages of observation method:

No interview is required, although one can be done as a follow-up on the observed information.
It can be highly accurate.
The information obtained relates to what is currently happening.

Disadvantages of observation method

It is limited in application, i.e., the information obtained is very limited.
It tells what happened but not why; it does not go into motives, attitudes or opinions.
It is expensive.

Things to Consider When Using Observation Method;

What should be observed?
How should the observation be recorded?
How can the accuracy of observation be ensured?

INTERVIEWING.
Interviewing may be used either as a main method or as a supplementary one in studies. It is the only suitable method for gathering information from illiterate or less educated respondents. It is useful for collecting a wide range of data, from factual demographic data to highly personal and intimate information relating to a person's opinions, attitudes, values, beliefs, past experience and future intentions.

An interview can add flesh to statistical information; it enables the investigator to grasp the behavioral context of the data furnished by the respondents. It permits the investigator to seek clarification and brings to the forefront those questions that, for one reason or another, the respondents do not want to answer.

An interview can be defined as a two-way systematic conversation between an investigator and an informant, initiated for obtaining information relevant to a specific study. It involves not only conversation but also learning from the respondent's gestures, facial expressions and pauses, and his environment.

Types of interview
The interviews are classified into structured (directive) and unstructured (non-directive) interviews;

Structured interview. This is an interview made with a detailed standardized schedule. The same questions are put to all the respondents in the same order, and each question is asked in the same way in each interview, promoting measurement reliability. This type of interview is used for large-scale formalized surveys.

The data from structured interviews are easy to compare, recording and coding the data do not pose any problem, and greater precision is achieved. Finally, attention is not diverted to extraneous, irrelevant and time-consuming conversation.

Unstructured interview. In this interview the interviewer encourages the respondent to talk freely about a given topic with a minimum of prompting or guidance. The interviewer avoids channeling the interview in particular directions; he/she develops a permissive atmosphere, and the questions are not standardized and not ordered in a particular way.

This interviewing is more useful in case studies than in surveys. It is particularly useful in exploratory research where the lines of investigation are not clearly defined. It is also useful for
gathering information on sensitive topics such as divorce, social discrimination, class conflict,
generation gap. It provides opportunity to explore the various aspects of the problem in an
unrestricted manner.

Advantages of Personal interview:

1. More information, and that too in greater depth, can be obtained.
2. The interviewer by his/her own skill can overcome the resistance of the respondents.
3. There is greater flexibility under this method, as the opportunity to restructure questions is always there.
4. Personal information can be obtained easily under this method.
5. The interviewer can control which person will answer the question.

Disadvantages of personal interview

It is time consuming, especially when the sample is large and recall upon the respondents is necessary.

There are high chances of introducing errors when interviewing.


It is administratively difficult and expensive especially when respondents are
geographically scattered.
Interviewing at times may also introduce systematic errors.
The presence of interviewer on the spot may over-stimulate the respondent,
sometimes even to the extent that he/she may give imaginary information just to make
interview interesting.
Under the interview method the organization required for selecting, training and
supervising the field-staff is more complex with formidable problems.

Telephone interview:

This method of collecting information consists of contacting respondents by telephone. It is not a very widely used method, but it plays an important part in industrial surveys, particularly in developed countries.

Advantages of telephone interview

More interviews can be conducted in a given period of time, as no time is lost in traveling and locating respondents.
It is more flexible in comparison to the mailing method.
Recall is easy; callbacks are simple and economical.
The interviewer can explain requirements more easily.
No field staff is required.

Disadvantages of telephone interview

It is difficult to employ visual aids.
Questions have to be short and to the point; probes are difficult to handle.
Little time is given to the respondents for considered answers; the interview period is not likely to exceed five minutes in most cases.
Surveys are restricted to respondents who have telephone facilities.
The possibility of interviewer bias is relatively high.

MAIL SURVEY.
This method involves sending questionnaires to the respondents with a request to complete and return them by post. It can be used only in the case of educated respondents. The mail questionnaire should be simple, so that the respondents can easily understand the questions and answer them. It should preferably contain mostly closed-end and multiple-choice questions, so that it can be completed within a few minutes.

Distinctive features of mail survey;

The questionnaire is self-administered by the respondents themselves and the responses are recorded by them.
Communication is carried out only in writing, and this requires more cooperation from the respondents than does verbal communication.

Procedure;

 The researcher should prepare a mailing list of the selected respondents by collecting the addresses from the telephone directory or from the association or organization to which they belong.
 A covering letter should accompany each copy of the questionnaire. It must explain to the respondent the purpose of the study and the importance of the study to the progress of the organization.

Advantages of Mail survey


1. It is relatively cheaper than other methods, as it does not involve traveling or interviews.
2. It is suitable for widely scattered respondents.
3. If carefully structured, questionnaires administered in this way may give more accurate and adequate results than the other methods.

EXPERIMENTATION.
Experimentation is a research process used to study the causal relationships between variables.
It aims at studying the effect of independent variables on a dependent variable by keeping the
other independent variables constant through some type of control mechanism.

The fundamental weakness of any non-experimental study is its inability to specify causes and effects. It can show only correlations between variables, but correlation alone never proves causation. The experiment is the only method which can show the effect of an independent variable on a dependent variable. In experimentation, the researcher can manipulate the independent
variable and measure its effect on the dependent variable. For example, the effect of various
types of promotional strategies on the sale of a given product can be studied by using different
advertising media such as TV, radio and newspapers.

Advantages of experimentation

It is reliable, as the researcher gets first-hand information.

Disadvantages of experimentation

Experiments need qualified researchers, which means cost.
They may be expensive if costly equipment is required.
They may be time consuming if a reasonable number of trials is required.

QUESTIONNAIRE DESIGN
A questionnaire is an instrument or tool of data collection which contains a set of questions logically related to the problem under study, aimed at eliciting responses from the respondents.

Construction of questionnaire.

Construction of a questionnaire is not a matter of simply listing questions that come to the researcher's mind. It is a rational process involving much time, effort and thought:

a) Data need determination: A mailed questionnaire is an instrument for gathering data for a specific study, so its construction should flow logically from the data required for the given study. The data required for a specific study can be determined by a deep analysis of the research objectives, the investigative questions relating to each of the research objectives, the hypotheses and the operational definitions of the concepts used in them.

b) Preparation of Dummy Tables: We are concerned with adequate coverage of the information required for the study and also with securing the information in the most usable form. The best way to ensure these requirements is to develop dummy tables in which to display the data to be gathered. The adequacy of the dummy tables for describing the possible distributions or relationships related to the problem or hypothesis should be examined.

This will help to identify gaps and duplications in the instrument and enable the designer to
make appropriate additions, corrections and deletions.

Construction of dummy tables is a very useful way to visualize how the data can be organized and summarized. A dummy table contains all the elements of a real table except that its cells are empty.

Cross-tabulation tables are usually presented with cell frequencies converted into percentages based on either row or column totals. If a dependent variable is cross-tabulated with an independent variable, the percentages should be calculated so that they add up to 100% for each category of the independent variable.

c) Determination of the respondents: Determining the respondents' level of education, knowledge or specialized knowledge relating to the problem under study is important. The choice of words and concepts depends upon the level of the respondents' knowledge.
d) Questionnaire drafting: After determining the data required for the study, a broad outline of the tool must be drafted, listing the various broad categories of data. For example, the outline of a questionnaire for a survey of consumer behavior towards color television may consist of sections such as identification data, brand awareness, brand choice, purchase decision, brand loyalty, post-purchase behavior and personal information. The sequence of these groupings must be arranged in a logical order; for example, brand choice cannot precede brand awareness, because a choice is made out of the brands known.
Drafting a good questionnaire is a highly specialized job and requires great care, skill, wisdom, efficiency and experience. The questionnaire should be designed with great care and caution so that all the relevant and essential information for the inquiry may be collected without any difficulty, ambiguity or vagueness. The following are tips for designing a questionnaire:
The size of the questionnaire should be as small as possible, e.g. 20-25 questions.
Questions should be simple, clear and unambiguous.
The questionnaire should be brief.
Questions should be arranged in a logical order, so that they fall into a logical sequence.
Questions of a sensitive and personal nature should be avoided.
Questions should be capable of brief answers.
The get-up (layout and appearance) of the questionnaire should be attractive.
The mode of tabulation and analysis should be kept in mind at the design stage.

e) Evaluation of the draft instrument: The draft instrument is evaluated with the help of other qualified persons; each question is examined for its relevance, appropriateness, clarity and validity. The logical and psychological order of the questions, their clarity and content, the length of the instrument and other aspects of its structure should be considered.

f) Pre-testing: The revised draft must be pre-tested in order to identify the weaknesses of the
instrument and to make the required further revisions to rectify them.

g) Specification of instructions: The instructions regarding the mode of answering should be specified at the top of the first page. The tool itself should be carefully and clearly laid out, using bold type and capitals to emphasize particular words and instructions. It should be neatly printed on quality paper, as a primary consideration in design.

SAMPLING TECHNIQUES (METHODS).

The objective of statistics is to make inferences about a population from the information contained in a sample. This same objective motivates the discussion of sampling techniques or methods. In most cases the inference is made in the form of an estimate of a population parameter, such as a mean, total or proportion, with a bound on the error of estimation.

The objective of sampling is to estimate population parameters such as the mean or the total from
information contained in a sample. The experimenter controls the quantity of information
contained in the sample by the number of sampling units he or she includes in the sample and by
the method used to select the sample data.

Each observation or item taken from the population contains a certain amount of information about the population parameter or parameters of interest. Since information costs money, the experimenter must determine how much information he/she should buy. Too little information prevents the experimenter from making good estimates, while too much information results in a waste of money. The quantity of information obtained in the sample depends on the items sampled and on the amount of variation in the data. Sampling techniques (methods) were established to obtain representative samples which produce valid and reliable information.

Definition of terms used in sampling

Sampling: Is the selection of some units to represent the entire population from which the units were drawn. A sample consists of one or more elements selected from a population.

Sampling unit: The element or set of elements considered for selection at some stage of sampling.

Sampling frame: Is the actual list of sampling units from which the sample, or some stage of the sample, is selected.

Sample design: Is a set of rules or procedures that specify how a sample is to be selected.

Sample size: Is the number of elements in the obtained sample.

Sampling error: Is the degree of error to be expected for a given sample design, or the difference between a sample statistic and the corresponding population parameter.

TYPES OF SAMPLING

Probability sampling
Non-probability sampling

Probability sampling

These sampling techniques are used when the resources (money and time) are limited, the objective of the study is to make generalizations, and a greater degree of accuracy in the estimation of population parameters is required. Each element or sampling unit in the population under study has a known chance of being included in the sample.

The classical formulation of a statistical estimation problem requires that randomness be built into the sampling design so that the properties of the estimators can be assessed probabilistically. With proper randomness in the sampling, one can make statements such as "Our estimate is unbiased and we are 95% confident that our estimate will be within 2 percentage points of the true proportion".
Sample designs that are based on planned randomness are called probability samples.

TYPES OF PROBABILITY SAMPLING:


Simple random sampling
Systematic sampling
Stratified sampling
Cluster sampling
Multi-stage cluster sampling
1) Simple random sampling

A sampling technique where a sample of size n is drawn randomly from a homogeneous population of size N in such a way that every element in the population has the same chance of being selected. For example, students studying in the fifth standard in a boys' school form a homogeneous group as regards level of education and age group.

Example 1; Auditors study simple random samples of accounts in order to check for compliance
with audit controls set up by the firm or to verify the actual dollar value of the accounts.

Example 2: Marketing research often involves a simple random sample of potential users of a product.
The researcher may want to estimate the proportion of potential buyers who prefer a certain color
of a car or flavor of food.

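A minimal Python sketch of simple random sampling; the frame of 300 numbered accounts is hypothetical. random.sample draws without replacement, giving every element the same chance of selection.

```python
import random

frame = list(range(1, 301))        # hypothetical frame: accounts 1..300 (N = 300)
random.seed(42)                    # fixed seed so the sketch is reproducible
sample = random.sample(frame, 20)  # simple random sample of size n = 20
print(sorted(sample))
```
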
2) Systematic Sampling.

This is a modification of simple random sampling which is ordinarily less time-consuming and easier to implement. The estimated number of elements in the large population is divided by the desired sample size, yielding a sampling interval k. The sample is then drawn by listing the population elements in an arbitrary order and selecting every kth case, starting with a randomly selected number between 1 and k.

Example 1: Suppose it is desired to select a sample of 20 students from a list of 300 students. Divide the total population of 300 by 20; the quotient is 15. Then select a number at random between 1 and 15 using the lottery method or a table of random numbers. Suppose the selected number is 9; then the students numbered 9, 24, 39, ..., 294 are selected as the sample.

Example 2: Consider the problem of sampling rural health centers. If your desired sample size is 285 rural health centers drawn from a universe of 2000 rural health centers, the sampling interval is 7. The sampling frame would be a list of rural health centers arranged alphabetically by health center name. You would then choose a randomly selected number between 1 and 7 as your start. If your random number is 3, the first unit selected would be the 3rd rural health center listed in the sampling frame, the second would be the 10th, and so forth until the sampling frame is exhausted.

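A minimal Python sketch of this procedure, using the numbers from Example 1 (N = 300, n = 20, so k = 15):

```python
import random

N, n = 300, 20
k = N // n                               # sampling interval, here 15
random.seed(7)
start = random.randint(1, k)             # random start between 1 and k
selected = list(range(start, N + 1, k))  # start, start + k, start + 2k, ...
print(f"start = {start}, sample = {selected}")
```
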
Systematic sampling is useful when the units in your sampling frame are not numbered, when the elements are not numbered serially, or when the sampling frame consists of very long lists. Other possible areas of application are sampling of students in a class, houses in a street, a telephone directory, customers of a bank, assembly line output in a factory or members of an association.

3) Stratified Sampling

For the purpose of maximizing the amount of information for a given cost, the population under study is sub-divided into homogeneous groups or strata, and from each stratum a random sample is drawn. For example, university students may be divided on the basis of discipline, and each discipline group may again be divided into juniors and seniors; the employees of a business undertaking may be divided into managers and non-managers, and each of those two groups may be sub-divided into salary-grade strata.

Proportionate stratified sampling.

This sampling technique involves drawing a sample from each stratum in proportion to the latter's share of the total population. This method gives proper representation to each stratum, and its statistical efficiency is generally higher. For example, suppose the final-year MBA students of the Management Faculty of a university consist of the following specialization groups:

The researcher wants to draw an overall sample of 30. The allocation is then:

Specialization stream   Number of students   Proportion of each stream   Sample size
Production              40                   0.4                         30 x 0.4 = 12
Finance                 20                   0.2                         30 x 0.2 = 6
Marketing               30                   0.3                         30 x 0.3 = 9
Rural Development       10                   0.1                         30 x 0.1 = 3
Total                   100                  1.0                         30

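A minimal Python sketch of this proportional allocation, reproducing the table above:

```python
strata = {"Production": 40, "Finance": 20, "Marketing": 30, "Rural Development": 10}
N = sum(strata.values())    # total population, here 100
n = 30                      # overall sample size

for name, size in strata.items():
    proportion = size / N                # stratum's share of the population
    print(f"{name}: proportion {proportion:.1f} -> sample size {n * proportion:.0f}")
```
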
4) Cluster sampling

Where the population elements are scattered over a wide area and a list of population elements is
not readily available, the use of simple or stratified random sampling method would be too
expensive and time-consuming. In such cases cluster sampling is usually adopted.

Cluster sampling means random selection of sampling units consisting of population elements.
Each such sampling unit is a cluster of population elements. Then from each selected sampling
unit a sample of population elements is drawn by either simple random selection or stratified
random selection.

Cluster sampling is used when it is not possible to get an adequate sampling frame for the
individuals you wish to study or when a simple random technique would result in a list of
individuals so dispersed that it would be too costly to visit each one. The disadvantage of a cluster
sample is that it increases sampling error and requires a larger sample size for reliable estimates
of population characteristics. If the cost of the larger sample size outweighs the costs associated with unclustered sampling, clustering should not be used. The cluster may be an institution,
geographical area or any other appropriate group depending on the nature of survey.

Example; suppose a researcher wants to select a random sample of 1000 households out of
40,000 estimated households in a city for a survey. A direct sample of individual households
would be difficult to select because a list of households does not exist and would be costly to
prepare. Instead he can select a random sample of a few blocks/wards. The number of blocks to
be selected depends upon the average number of estimated households per block.

5) Multi-stage cluster sampling

This sampling is appropriate where the population is scattered over a wide geographical area and
no frame or list is available for sampling. It is also used when a survey has to be made within a
limited time and cost budget.

Sometimes, when populations are extremely complex, it is necessary to go beyond two stages in cluster sampling. For example, in the absence of a list of households for a survey of AIDS widows, one might have to begin with a random sample of villages and, on arriving at each village, make a list of households and draw a random selection of households to visit. On arriving at a household, one would randomly select a woman to interview, or interview all eligible women.

Non-probability sampling

This technique does not adopt the theory of probability and does not give a representative sample
of the population. It is distinguished from probability sampling by the fact that subjective
judgments play a role in selecting the sampling elements.

The broad categories of non-probability sampling are:

i. Convenience or Accidental sampling
ii. Purposive or Judgmental sampling
iii. Quota sampling

i. Convenience or Accidental sampling

Convenience samples are selected from whatever cases happen to be available at a given time or place. This method is also known as accidental sampling because the respondents whom the researcher meets accidentally are included in the sample.

Application: this method is used for simple purposes such as testing ideas or gaining rough impressions about a subject of interest.

ii. Purposive or Judgmental sampling

This method produces purposive samples, which consist of units deliberately selected to provide specific information about a population, units which we judge to be the most appropriate ones for the given study. The deliberate selection of sample units that conform to some pre-determined criteria is also known as judgmental sampling.

The chance that a particular case is selected for the sample depends on the subjective judgment
of the researcher. For example, a researcher may deliberately choose industrial undertakings in
which quality circles are believed to be functioning successfully and undertakings in which quality
circles are believed to be a total failure.

Application: The method is appropriate when what is important is the typicality and specific relevance of the sampling units to the study, and not their overall representativeness of the population.

iii. Quota sampling

This sampling method involves the selection of quota groups of accessible sampling units by traits such as sex, age, social class and so on. When the population is known to consist of various categories by sex, age, religion or social class in specific proportions, each investigator may be given an assignment of quota groups specified by the pre-determined traits in specific proportions. He/she can then select accessible persons belonging to those quota groups in the area assigned to him/her.

Application: Quota sampling is used in studies like marketing surveys, opinion polls and readership surveys which do not aim at precision but at getting some crude results quickly.

ORGANIZATION OF DATA

Data organization is an intermediary stage of work between data collection and data analysis. The completed instruments of data collection, such as interview schedules, questionnaires and observation schedules, contain a vast mass of data. They cannot straightaway provide answers to research questions; they need to be classified and summarized in order to make them amenable to analysis.

Data collected from published sources are generally in organized form compared to those coming from primary sources. However, the large mass of figures collected from a survey frequently needs organization. Data organization consists of a number of closely related operations, such as editing, classification, coding and tabulation.

Editing;

Editing is the process of checking the returns from the survey in order to detect errors, omissions, inconsistencies, irrelevant answers and wrong computations, so that they may be corrected or adjusted. Why editing?

During the stress of interviewing the interviewer cannot always record responses completely and legibly. Therefore, after each interview is over, he should review the schedule to complete abbreviated responses, rewrite illegible responses and correct omissions.

The returns (schedules or questionnaires) received from the respondents have to be scrutinized patiently and carefully to detect errors caused by careless recording by the field workers, or inconsistent or factually wrong information given by the respondents.

Classification

The edited data are arranged according to some characteristics possessed by the items comprising the data, and coded. The responses are classified into meaningful categories so as to bring out their essential pattern and to reduce the many responses to a few appropriate categories containing the critical information needed for analysis.

Example: suppose the responses to a question on occupation in a survey consist of items such as business executive, electrician, farm laborer, college teacher, carpenter, accounts assistant, share broker, driver, lawyer, medical practitioner, barber and goldsmith. These data are not amenable to analysis unless an appropriate scheme of classification is used. One way to classify them is into the following categories: Professional and Managerial, Clerical, Service and Skilled Labor, and Unskilled Labor.

Objectives of Classification

To condense the mass of data: Statistical data collected during the course of an investigation are so varied that it is not possible to appreciate, even after careful study, the real significance of the figures unless they are properly grouped into small groups or classes. For example, data collected during a population census can be classified according to sex, age, marital status, education, occupation, etc.
To enable the data to be grasped: The figures are arranged in a few classes or categories so that like goes with like.
To prepare the data for tabulation: Only classified data can be presented in tabular form. Classification thus provides a basis for tabulation and further statistical processing.
To study relationships: Relationships between variables can be established only after the various characteristics of the data are known, which is possible only through classification and tabulation.
To facilitate comparison: Classification enables comparisons between variables. For example, household data classified on the basis of age, religion, education, income, expenditure, etc. can be used for drawing comparisons.

Coding
Coding means assigning numerals or other symbols to the categories of responses. For each question a coding scheme is designed on the basis of the concerned categories. The coding schemes, with their assigned symbols, together with specific coding instructions, may be assembled in a code book.

Example:

Question No.   Variable/Observation   Response categories   Code
5.1            Occupation             Salaried              1
                                      Professional          2
                                      Business              3
                                      Retired               4
                                      House wife            5
                                      Others                6

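A minimal Python sketch of a coding scheme as a lookup table, mirroring the occupation codes above; the list of responses is hypothetical.

```python
occupation_codes = {
    "Salaried": 1, "Professional": 2, "Business": 3,
    "Retired": 4, "House wife": 5, "Others": 6,
}

responses = ["Business", "Salaried", "Retired", "Business"]  # hypothetical returns
coded = [occupation_codes[r] for r in responses]             # assign numerals
print(coded)  # [3, 1, 4, 3]
```
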
Tabulation
Tabulation is the process of summarizing raw data and displaying them in compact statistical tables for further analysis. It involves counting the number of cases falling into each of several categories. Tabulation can be done by hand or by mechanical or electronic devices. The choice
depends upon the size and type of study, cost considerations, time pressure and availability of
tabulating machines or computers.

The major objectives of tabulation

To clarify the object of investigation.
To simplify complex data.
To depict trend.
To economize space.
To clarify the characteristics of data.
To facilitate comparison.
To detect errors and omissions in the data.
To facilitate statistical processing.
To help reference.

DATA PRESENTATION
After data collection, the next step is to present the data in some suitable form. The need for proper presentation arises from the fact that statistical data in their raw form almost defy comprehension. When data are presented in an easy-to-read form, they can help the reader acquire knowledge in a much shorter period of time and also facilitate statistical analysis.

A statistical table is a presentation of numbers in a logical arrangement, with some brief explanations to show what they are.
A statistical chart or graph is a pictorial presentation of data.

MEASURES OF CENTRAL TENDENCY

Measures of central tendency show the tendency of data to cluster around some central value. The following are the important measures of central tendency: arithmetic mean, mode, median, harmonic mean and geometric mean.

There are two main objectives of the study of measures of central tendency:

To get a single value that describes the characteristics of the entire group.
To facilitate comparison. Reducing the mass of data to one single figure enables comparisons to be made. For example, the average sales of December may be compared with the sales figures of previous months.

Arithmetic Mean

This is a central characteristic of the given mass of data which is obtained by adding all the
observations and dividing the total by the number of observations.

Arithmetic mean –Discrete data (Ungrouped Data)

Steps
i. Add together all the values of the variable X to obtain the total, i.e. Σ xi (summing over i = 1, ..., n).
ii. Divide the total by the number of observations.

AM = (x1 + x2 + x3 + ... + xn) / N = (Σ xi) / N, where the xi are the observations.
Example: The table below presents the salaries of 12 employees of a certain company (in '000 shs).

32 32 45 80 66 66 54 45 32 71 111 14

Solution:

Mean salary = (Σ xi) / N = (32 + 32 + 45 + 80 + 66 + 66 + 54 + 45 + 32 + 71 + 111 + 14) / 12 = 648 / 12 = 54

The mean salary of the 12 employees of that particular company is 54,000 Tshs.

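A minimal Python sketch verifying the example:

```python
# Arithmetic mean of the 12 salaries (in '000 shs).
salaries = [32, 32, 45, 80, 66, 66, 54, 45, 32, 71, 111, 14]
mean = sum(salaries) / len(salaries)
print(mean)  # 54.0, i.e. 54,000 shs
```
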
The case of grouped data;

Steps:

i. Find the class mark of each class: x = (lower limit + upper limit) / 2.
ii. Multiply each class mark x by the corresponding frequency f to get f·x.
iii. Sum the products to obtain Σ f·x.
iv. Divide the total obtained in step iii by the total number of observations.

Arithmetic Mean = (Σ fi·xi) / (Σ fi) = (f1x1 + f2x2 + ... + fnxn) / N

Example: In a certain industry the numbers of employees (in thousands) in 1970 were grouped by age as follows:

Age 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64

No. of employees 66 65 56 50 42 37 35 30 24 22

Solution:

Age      No. of employees (f)   Class mark (x)   f·x
15-19    66                     17               1122
20-24    65                     22               1430
25-29    56                     27               1512
30-34    50                     32               1600
35-39    42                     37               1554
40-44    37                     42               1554
45-49    35                     47               1645
50-54    30                     52               1560
55-59    24                     57               1368
60-64    22                     62               1364
Total    427                                     14709

Arithmetic Mean = (Σ f·x) / (Σ f) = 14709 / 427 = 34.447 ≈ 34.45
The mean age of the employees in this particular industry is approximately 34.45 years.

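A minimal Python sketch of the grouped-data mean, reproducing the table above: each class mark is the midpoint of the class limits, and the mean is the ratio of the sums.

```python
# (lower limit, upper limit, frequency) for each age class
classes = [(15, 19, 66), (20, 24, 65), (25, 29, 56), (30, 34, 50), (35, 39, 42),
           (40, 44, 37), (45, 49, 35), (50, 54, 30), (55, 59, 24), (60, 64, 22)]

total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)  # sum of f * x
total_f = sum(f for _, _, f in classes)                     # N = 427
print(round(total_fx / total_f, 2))                         # 34.45
```
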
Merits and limitations of the arithmetic mean

Merits

It is affected by the value of every item in the series and is therefore considered to be more representative of the distribution.
It is a widely used method because of its mathematical properties, since it lends itself easily to further mathematical treatment, such as the computation of the standard deviation, coefficient of skewness, etc.
It is a calculated value and not based on position in the series.
It is the simplest average to understand and the easiest to compute. Neither the arraying of the data required for calculating the median nor the grouping of data required for calculating the mode is needed while calculating the mean.

Limitation of arithmetic mean

i. The value of the mean can be affected by extreme items, i.e. very small and very large items compared to the other items.
ii. It can neither be determined by inspection nor located graphically.
iii. It cannot be used in the study of qualitative phenomena which are not capable of numerical measurement, e.g. intelligence, beauty, honesty, etc.
iv. In a distribution with open-end classes the value of the mean cannot be computed without making assumptions regarding the size of the class interval of the open-end classes.

MEDIAN

The median is the measure of central tendency which appears in the middle of an ordered sequence of values; that is, half of the observations in a set of data are lower than it and half of the observations are greater than it.

Note: Whereas the arithmetic mean is calculated from the values of every observation in a series, the median is a positional average and its value is such that an equal number of observations lie on either side of it. The median is the central value of the distribution, the value that divides the distribution into two equal parts.

Computation of the median value involves two basic steps: locating the middle value/item and finding out its value.

Calculation of median for individual observations

Steps
i. Arrange the data in ascending or descending order of magnitude. (Both arrangements would give you the same answer.)
ii. If you have an odd number of values, the median is the (n + 1)/2 th item; if you have an even number of observations, the median is the average of the two middle values.

Example 1: Obtain the value of the median of the following data:

25 20 15 45 18 7 10 38 12.

Solution;
Arrange the observations in ascending order; that is 7 10 12 15 18 20 25 38 45.

Median = the (n + 1)/2 th item = (9 + 1)/2 = 5th item, which is 18.
Example 2: Determine the median value of the following items:

7 10 12 15 18 20 25 38 45 64.

Median = the (n + 1)/2 th item = (10 + 1)/2 = 5.5th item. The median value lies between the 5th and 6th items and is the average of 18 and 20: (18 + 20)/2 = 19.
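A minimal Python sketch of the median for individual observations, covering both the odd case (Example 1) and the even case (Example 2):

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                # odd n: the (n + 1)/2-th item
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the middle two

print(median([25, 20, 15, 45, 18, 7, 10, 38, 12]))      # 18
print(median([7, 10, 12, 15, 18, 20, 25, 38, 45, 64]))  # 19.0
```
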
Calculation of median for a discrete series
Steps
Arrange the data in ascending or descending order of magnitude.
Find the cumulative frequencies.
Apply the formula: Median = the value of the (N + 1)/2 th item.
Look at the cumulative frequency column and find the total which is either equal to (N + 1)/2 or the next higher to it, and determine the value of the variable corresponding to it. This gives you the value of the median.

Example: From the following data, find the value of the median.

Income (USD): 1200 1800 5000 2500 3000 1600 3500

No. of persons: 12 16 2 10 3 15 7

Solution:

Income (USD) in ascending order   No. of persons (f)   Cumulative frequency (c.f.)

1200 12 12

1600 15 27

1800 16 43

2500 10 53

3000 3 56

3500 7 63

5000 2 65

The median is the value of the (N + 1)/2 th item = (65 + 1)/2 = 33rd item.

All items from the 28th up to the 43rd have the value 1800; thus the median value is USD 1800.

Calculation of median for a continuous series

In a continuous series the median is calculated using the formula

Median = L + ((N/2 − c.f.) / f) × i

where L is the lower limit of the median class, i.e. the class in which the middle item of the distribution lies;

c.f. is the cumulative frequency of the class preceding the median class, or simply the sum of the frequencies of all classes lower than the median class;

f is the frequency of the median class;

i is the class interval of the median class.


Example: Calculate the median of the following data.

Weight (gms) 410-419 420-429 430-439 440-449 450-459 460-469 470-479

No. of apples 14 20 42 54 45 18 7

Solution:

Since we are given inclusive class intervals, we should convert them to exclusive ones by deducting 0.5 from the lower limits and adding 0.5 to the upper limits.

Weight (in gms) No. of apples (f) Cumulative frequencies (c.f)

409.5-419.5 14 14

419.5-429.5 20 34

429.5-439.5 42 76

439.5-449.5 54 130

449.5-459.5 45 175

459.5-469.5 18 193

469.5-479.5 7 200

Median = the N/2 th item = 200/2 = 100th item.

The median lies in the class 439.5–449.5, therefore L = 439.5, i = 449.5 − 439.5 = 10, c.f. = 76, f = 54 and N = 200.

Median = 439.5 + ((100 − 76) / 54) × 10 = 443.94
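A minimal Python sketch of the grouped-median formula, using the real (exclusive) class limits from the apples example:

```python
# (lower real limit, upper real limit, frequency) for each weight class
classes = [(409.5, 419.5, 14), (419.5, 429.5, 20), (429.5, 439.5, 42),
           (439.5, 449.5, 54), (449.5, 459.5, 45), (459.5, 469.5, 18),
           (469.5, 479.5, 7)]

N = sum(f for _, _, f in classes)   # 200
cf = 0
for L, U, f in classes:
    if cf + f >= N / 2:             # first class whose cumulative frequency reaches N/2
        print(round(L + (N / 2 - cf) / f * (U - L), 2))  # 443.94
        break
    cf += f
```
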
Merits and limitations of median

Merits

The median is easy to calculate and is readily understood. In some cases it can be located merely by inspection.
It is not at all affected by items at the extremes.
It can be computed for distributions which have open-ended classes.
It is the most appropriate average in dealing with qualitative data.
The value of median can be determined (located) graphically.

Limitations of median

i. Since it is a positional average, its value is not determined by each and every observation.
ii. For calculating the median it is necessary to arrange the data; other averages do not need any arrangement.
iii. The median, in some cases, cannot be computed as exactly as the mean. When the number of items in the series is even, the median is determined approximately as the mid-point of the two middle items.

MODE

The mode is the measure of central tendency given by the value which appears most frequently in the given set of observations, i.e. the value of the distribution which occurs the greatest number of times.

Calculation of mode – discrete series

In a discrete series the mode can quite often be determined just by inspection, i.e. by looking for the value of the variable around which the items are most heavily concentrated.

Example: A farmer records the number of eggs he collects from his hens over a 150-day period. The frequency distribution of the eggs collected is below.

Number of eggs 15 16 17 18 19 20 21 22 23 24 25

Frequency 1 2 4 6 9 13 18 22 35 30 10

What is the modal number of eggs collected per day?

Solution;

From the above data we can clearly say that the modal number of eggs collected per day is 23
because the value 23 has occurred the maximum number of times.

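A minimal Python sketch of the discrete mode, reproducing the example:

```python
eggs = {15: 1, 16: 2, 17: 4, 18: 6, 19: 9, 20: 13,
        21: 18, 22: 22, 23: 35, 24: 30, 25: 10}  # value: frequency
mode = max(eggs, key=eggs.get)  # the value with the highest frequency
print(mode)                     # 23
```
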
Calculation of mode – continuous series

For continuous data the value of the mode is calculated using the formula below:

Mode = L + (Δ1 / (Δ1 + Δ2)) × i

where L is the lower limit of the modal class;

Δ1 is the difference between the frequency of the modal class and the frequency of the pre-modal class;

Δ2 is the difference between the frequency of the modal class and the frequency of the post-modal class;

i is the class interval of the modal class.

Example: Find the value of the mode from the data given below.

Weight (lbs) 93-97 98-102 103-107 108-112 113-117 118-122 123-127 128-132

No. of students 2 5 12 17 14 6 3 1

Solution:

By inspection the modal class is 108-112; the real limits of this class are 107.5-112.5.

L = 107.5, Δ1 = 17 − 12 = 5, Δ2 = 17 − 14 = 3, i = 112.5 − 107.5 = 5

Mode = 107.5 + (5 / (5 + 3)) × 5 = 107.5 + 25/8 = 110.625

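A minimal Python sketch of the grouped-mode formula, using the real limits of the modal class 108-112 from the example above:

```python
freqs = [2, 5, 12, 17, 14, 6, 3, 1]  # class frequencies
m = freqs.index(max(freqs))          # index of the modal class (here 3)
L, i = 107.5, 5.0                    # real lower limit and class width
d1 = freqs[m] - freqs[m - 1]         # modal freq minus pre-modal freq  -> 5
d2 = freqs[m] - freqs[m + 1]         # modal freq minus post-modal freq -> 3
print(L + d1 / (d1 + d2) * i)        # 110.625
```
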
Merits and Limitations of mode

Merits of mode

It can be determined even in open-end distributions without ascertaining the class limits.
It can be used to describe qualitative phenomena. For example, if we want to compare consumer preferences for different types of products, say soap or toothpaste, we can compare the modal preferences expressed by different groups of people.
The value of the mode can be determined graphically.
It is not unduly affected by extreme values.

Limitation of mode

i. The value of mode is not based on each and every item in the series.
ii. It is not capable of further mathematical treatment.
iii. Mode may be unrepresentative in many cases.
iv. The value of mode cannot always be determined. In some cases we may have a bimodal
series.

NOTE: The arithmetic mean, median and mode each describe the typical characteristic or tendency of a group in a slightly different way.

MEASURES OF DISPERSION

The measures of central tendency give us one single value that represents the entire data. But the average alone cannot adequately describe a set of observations unless all the observations are the same. It is therefore necessary to study the variability or dispersion of the observations. In two or more distributions the central value may be the same, but there can still be wide disparities in the formation of the distributions. While an average identifies the representative value, dispersion shows how far individual values fall, on average, from that representative value. The average is derived from the actual values, but dispersion is found by averaging the deviations of the individual values from some representative value, and is therefore called an average of the second order.

Definitions

Dispersion is the degree of scatter or variation of the variable about a central value.
Dispersion is the degree to which numerical data tend to spread about an average value.

Objectives of measures of dispersion

To serve as a basis for the control of variability: When dispersion is small, the average is typical in the sense that it closely represents the individual values, and it is reliable in the sense that it is a good estimate of the corresponding population average; when dispersion is large, the opposite is true.
To judge the reliability of the measures of central tendency: Measures of dispersion are the only means to test the representative character of an average.
To facilitate comparison: Measures of dispersion enable a comparison to be made of two or more series with regard to their variability. A high degree of variation means little uniformity or consistency, whereas a low degree of variation means great uniformity or consistency.
To facilitate the use of other statistical measures: Many powerful analytical tools in statistics, such as correlation analysis, regression analysis, the testing of hypotheses, ANOVA and statistical quality control, are based on measures of variation of one kind or another.

PROPERTIES OF A GOOD MEASURE OF DISPERSION


A good measure of dispersion should possess, as far as possible, the following properties:
1. It should be based on each and every item of the distribution.
2. It should be amenable to further algebraic treatment.
3. It should have sampling stability.
4. It should not be unduly affected by extreme items.

IMPORTANT MEASURES OF DISPERSION


The Range.
The Interquartile range and the quartile deviation.
The mean deviation or the average deviation.
The standard deviation.
Lorenz curve.

THE RANGE
The range is the simplest method of studying dispersion. It is defined as the difference between the value of the largest item and the value of the smallest item.

Mathematically,

Range = L − S

where L = largest item and S = smallest item.

Merits and limitations of range

Merits of range
i. The range is the simplest to understand and the easiest to compute.
ii. It takes minimum time to calculate the value of the range. If one is interested in getting a quick rather than a very accurate picture of variability, one may compute the range.

Limitations of range
The range is not based on each and every item of the distribution.
It is subject to fluctuations of considerable magnitude from sample to sample.
The range cannot tell us anything about the character of the distribution within the two extreme observations.

Uses of range
Quality control
Fluctuation in the share prices.
Weather forecast.

THE INTERQUARTILE RANGE AND THE QUARTILE DEVIATION.

Interquartile Range: A measure of variation for a set of data, defined to be the difference between
the first and third quartiles. Quartiles are closely related to the median, which divides the distribution into
two equal parts.

Interquartile range of raw data

Example: Find the interquartile range of the following data

4 9 12 10 15 16 30 4 7 2 18

Solution: IQR = Q3 − Q1

Arrange the data in an array:

2 4 4 7 9 10 12 15 16 18 30

Position of Q1 = (n + 1)/4 = (11 + 1)/4 = 3rd item, so Q1 = 4

Position of Q3 = 3(n + 1)/4 = 3(11 + 1)/4 = 36/4 = 9th item, so Q3 = 16

Therefore, IQR = Q3 − Q1 = 16 − 4 = 12
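
The same calculation can be sketched in Python (a rough illustration of the position rule used above; note that library routines such as numpy.percentile use different interpolation rules and may give slightly different answers):

data = [4, 9, 12, 10, 15, 16, 30, 4, 7, 2, 18]

arranged = sorted(data)               # 2 4 4 7 9 10 12 15 16 18 30
n = len(arranged)

# 1-based positions: Q1 at (n+1)/4, Q3 at 3(n+1)/4
q1 = arranged[(n + 1) // 4 - 1]       # 3rd item -> 4
q3 = arranged[3 * (n + 1) // 4 - 1]   # 9th item -> 16

print("IQR =", q3 - q1)               # 16 - 4 = 12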

Interquartile range of frequency distribution


Example: Find the interquartile range for a data given below
Class Freq C.f

10-19 3 3

20-29 8 11

30-39 15 26

40-49 5 31

50-59 5 36

Q1 = L_Q1 + ((n/4 − cf)/f) × i

Where n = 36, i = 10, cf = 3, f = 8, L_Q1 = 19.5

Q1 = 19.5 + ((36/4 − 3)/8) × 10 = 19.5 + 7.5 = 27

Q3 = L_Q3 + ((3n/4 − cf)/f) × i

Where 3n/4 = 3 × 36/4 = 27, i = 10, cf = 26, f = 5, L_Q3 = 39.5

Q3 = 39.5 + ((27 − 26)/5) × 10 = 39.5 + 2 = 41.5

Interquartile range

IQR = Q3 − Q1 = 41.5 − 27 = 14.5

Advantages

1. It is not affected by extreme values.
2. It accounts for the spread of the middle values of the data, which is the most significant part of the data.

Disadvantages

Fails to take into account all the data.

Gives no indication of the degree of clustering.

Quartile Deviation
Is a measure of variation for a set of data, defined to be half the distance between the first
and the third quartiles.
Uses: It is frequently used in skewed distributions.

QD = (Q3 − Q1) / 2
Where
Q1=the first (lower) quartile.
Q3=the third (upper) quartile.

Advantages:
It is easy to understand
It is a better measure of dispersion than the standard deviation for badly skewed
distributions.
Unlike the range, it is not affected by extreme values.

Disadvantages:

Fails to take into account all the values of the data.


Gives no real indication of the degree of clustering.

THE MEAN DEVIATION OR THE AVERAGE DEVIATION.


The mean deviation is the arithmetic mean of the deviations of the individual values from the
average of the given data. The averages frequently used in computing the mean deviation are the
mean, mode and median, but the mean is generally preferred because it is mathematically a sounder
measure of central tendency.

Uses: It is frequently used in quality control, statistical forecasting and in financial analysis.

Computation of Mean Deviation (M.D)-Discrete series

M.D = Σf|d| / N

Steps
i. Calculate the average (mean) of the series.
ii. Take the deviations of the items from the mean ignoring signs and denote them |d|.
iii. Multiply these deviations by the respective frequencies and obtain Σf|d|.
iv. Divide the total obtained in step (iii) by the number of observations. This gives
the value of the mean deviation.
Example: Calculate the mean deviation about the mean for the following data.

X: 5 15 25 35 45 55 65
F: 8 12 10 8 3 2 7
Let the assumed mean (A) be 35.

Solution

x       f       d′ = x − 35     fd′     |d| = |x − 29|  f|d|
5       8       −30             −240    24              192
15      12      −20             −240    14              168
25      10      −10             −100    4               40
35      8       0               0       6               48
45      3       +10             +30     16              48
55      2       +20             +40     26              52
65      7       +30             +210    36              252
Total   N = 50                  Σfd′ = −300             Σf|d| = 800

x̄ = A + Σfd′/N, where A = 35, Σfd′ = −300, N = 50

x̄ = 35 + (−300/50) = 35 − 6 = 29

M.D = Σf|d|/N = 800/50 = 16
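
The calculation above can be reproduced with a small Python sketch:

x = [5, 15, 25, 35, 45, 55, 65]
f = [8, 12, 10, 8, 3, 2, 7]

N = sum(f)                                          # 50
mean = sum(fi * xi for fi, xi in zip(f, x)) / N     # 29.0

# M.D = sum(f * |x - mean|) / N
md = sum(fi * abs(xi - mean) for fi, xi in zip(f, x)) / N
print("mean =", mean, ", M.D =", md)                # mean = 29.0 , M.D = 16.0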

Calculation of Mean deviation –Continuous series


The calculation is the same as the M.D for a discrete series; the only difference is that here we have to
obtain the mid-points of the various classes and take the deviations of these mid-points from the mode,
median or mean.

M.D = Σf|d| / N
Example: Calculate the mean deviation from the mean for the following data. Also calculate the
coefficient of mean deviation.

Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70

No. of student 6 5 8 15 7 6 3

Solution;
Let the assumed mean (A) = 35 and u = (x − 35)/10.

Marks   Mid-point (x)   f       u       fu      |d| = |x − 33.4|        f|d|
0-10    5               6       −3      −18     28.4                    170.4
10-20   15              5       −2      −10     18.4                    92.0
20-30   25              8       −1      −8      8.4                     67.2
30-40   35              15      0       0       1.6                     24.0
40-50   45              7       +1      +7      11.6                    81.2
50-60   55              6       +2      +12     21.6                    129.6
60-70   65              3       +3      +9      31.6                    94.8
Total                   N = 50          Σfu = −8                        Σf|d| = 659.2

x̄ = A + (Σfu/N) × i, where i = class size. A = 35, Σfu = −8, N = 50, i = 10

x̄ = 35 + (−8/50) × 10 = 33.4

M.D = Σf|d|/N = 659.2/50 = 13.184

Coefficient of M.D = M.D / x̄ = 13.184/33.4 ≈ 0.395

Merits and limitation of Mean deviation.


Merits
Mean deviation is easier to compute than the other calculated measures of
dispersion.
It is based upon all the items in the series.
It is less affected by extreme items than the standard deviation.

Limitation of mean deviation.


i. Mean deviation is the arithmetic mean of the absolute values of the deviations. It ignores
the positive and negative signs of the deviations. This weakness creates the demand for a
more reliable measure of dispersion.
ii. It is not amenable to further algebraic treatment.
iii. It cannot be computed for distributions with open-end classes.

THE STANDARD DEVIATION.

Computation of Standard Deviation

Ungrouped data.

The procedure for computing the standard deviation from ungrouped data is given below;

I. Obtain the arithmetic mean (x̄) of the given data.

II. Obtain the deviation of each value from the arithmetic mean, i.e. d = x − x̄.

III. Square each deviation to make it positive, i.e. d² = (x − x̄)².

IV. Obtain the sum of the squared deviations, i.e. Σd² = Σ(x − x̄)².

V. Find the variance (σ²) by dividing the sum obtained in (IV) above by the number of
observations:

Var = Σ(x − x̄)² / n

VI. Extract the square root of the variance to find the standard deviation:

S.D = √( Σ(x − x̄)² / n )        (i)

Note that this method is convenient only when the mean is a whole number.

The standard deviation can also be computed directly without first finding the mean. See the formulae below.

Direct method:

S.D = √( Σx²/n − (Σx/n)² )

Assumed mean method:

S.D = √( Σd²/n − (Σd/n)² )

Where d is the deviation from the assumed mean, i.e. d = X − A.

Example

Calculate the standard deviation from the following set of observations:

8 9 15 23 5 11 19 8 10 12

Solution

x̄ = Σx/n = 120/10 = 12

Values (x)      Deviation d = x − x̄    d² = (x − x̄)²
8               8 − 12 = −4             16
9               9 − 12 = −3             9
15              15 − 12 = +3            9
23              23 − 12 = +11           121
5               5 − 12 = −7             49
11              11 − 12 = −1            1
19              19 − 12 = +7            49
8               8 − 12 = −4             16
10              10 − 12 = −2            4
12              12 − 12 = 0             0
Σx = 120                                Σ(x − x̄)² = 274

S.D = √( Σ(x − x̄)²/n ) = √(274/10) ≈ 5.2345
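
For comparison, a minimal Python sketch of the same computation (a population standard deviation, dividing by n as in formula (i)):

import math

data = [8, 9, 15, 23, 5, 11, 19, 8, 10, 12]

n = len(data)
mean = sum(data) / n                                  # 12.0
variance = sum((x - mean) ** 2 for x in data) / n     # 274 / 10 = 27.4
print("S.D =", round(math.sqrt(variance), 4))         # 5.2345

The standard library function statistics.pstdev(data) returns the same population standard deviation.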

Standard deviation for Grouped data.

Standard deviation can be computed by using the following formulae

1. Direct method:

S.D = √( Σfx²/n − (Σfx/n)² )        (1)

2. Assumed mean method:

S.D = √( Σfd²/n − (Σfd/n)² )        (2)

Where d is the deviation from the assumed mean (d = X − A).

Steps for equation (2):

 Find the mid-points of the various classes.
 Take the deviations of these mid-points from the assumed mean (A) and denote these
deviations by d.
 Multiply these deviations by the respective frequencies and obtain Σfd.
 Obtain the squares of the deviations, d².
 Multiply the squared deviations by the respective frequencies and obtain Σfd².

Then substitute the values in the above formula to obtain the value of the standard deviation.

3. Step deviation method (coding method)

The use of this formula simplifies calculation:

S.D = i × √( Σfu²/n − (Σfu/n)² )        (3)

Where u is the deviation from the assumed mean in class-size units, i.e. u = (X − A)/i, "i" is the class size and X is the mid-point of the class interval.

Steps for equation (3):

 Find the mid-points of the various classes.
 Take the deviations of these mid-points from the assumed mean (A), divide by the common
factor i, and denote this column by u, i.e. u = (X − A)/i.
 Multiply the frequencies of each class by their deviations and obtain Σfu.
 Square the deviations and multiply them by the respective frequencies of each class to obtain Σfu².

Then substitute the values in the above formula to obtain the value of the standard deviation.

Example: Find the standard deviation of the following data.

Class interval 0-10 10-20 20-30 30-40 40-50 50-60 60-70

Frequency 6 14 10 8 1 3 8

Solution: Use formula (3) above.

Let the assumed mean (A) = 35 and u = (X − 35)/10.

Class interval  Mid-point X     f       u       fu      u²      fu²
0-10            5               6       −3      −18     9       54
10-20           15              14      −2      −28     4       56
20-30           25              10      −1      −10     1       10
30-40           35              8       0       0       0       0
40-50           45              1       +1      +1      1       1
50-60           55              3       +2      +6      4       12
60-70           65              8       +3      +24     9       72
Total                           n = 50          Σfu = −25       Σfu² = 205

S.D = i × √( Σfu²/n − (Σfu/n)² ) = 10 × √( 205/50 − (−25/50)² ) = 10 × √3.85 ≈ 19.62

Therefore the S.D for the above data is 19.62.
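
A small Python sketch of the step deviation method for this example (class mid-points and frequencies taken from the table above):

import math

midpoints = [5, 15, 25, 35, 45, 55, 65]
freq = [6, 14, 10, 8, 1, 3, 8]
A, i = 35, 10                                          # assumed mean and class size

n = sum(freq)                                          # 50
u = [(x - A) / i for x in midpoints]                   # -3, -2, -1, 0, +1, +2, +3
sum_fu = sum(f * ui for f, ui in zip(freq, u))         # -25
sum_fu2 = sum(f * ui ** 2 for f, ui in zip(freq, u))   # 205

sd = i * math.sqrt(sum_fu2 / n - (sum_fu / n) ** 2)
print("S.D =", round(sd, 2))                           # 19.62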

Merits and limitations of standard deviation.

Merits

 The standard deviation is the best measure of dispersion because it is based on
every item of the distribution.
 For comparing the variability of two or more distributions the coefficient of variation is
considered the most appropriate measure, and it is based on the mean and standard deviation.
 It is used in further statistical work, e.g. computing skewness, correlation, etc.

Limitations of standard deviation

i. It is more difficult to compute than the other measures of dispersion.

ii. It gives more weight to extreme items and less to those which are near the mean.

SKEWNESS
Measures of central tendency give us an estimate of the representative value of a series, measures
of dispersion give an indication of the extent to which the items cluster around or scatter away
from the central value, and skewness is a measure that refers to the extent of symmetry or
asymmetry in a distribution. Skewness describes the shape of a distribution.

Definitions
Skewness is lack of symmetry. When a frequency distribution is plotted on a chart,
skewness present in the items tends to be dispersed more on one side of the mean than
on the other.
A distribution is said to be skewed when the mean and median fall at different
points in the distribution and the balance (centre of gravity) is shifted to one side or the other,
to the left or right.
Measures of skewness tell us the direction and the extent of skewness. In a
symmetrical distribution the mean, median and mode are identical. The more the mean
moves away from the mode, the larger the asymmetry, or skewness.

Objective of skewness

i. It helps in finding out the nature and the degree of concentration.

ii. It helps in knowing if the distribution is normal. Many statistical measures, such as the standard error
of the mean, are based on the assumption of a normal distribution.

iii. The empirical relations of mean, median and mode are based on a moderately skewed
distribution. The measure of skewness will reveal to what extent such empirical
relationship holds good.

Skewness can be

Positively skewed distributions.


Negatively skewed distributions.

Positively skewed distributions


In this distribution the long tail indicates the presence of extreme values at the positive end
of the distribution. This pulls the mean to the right. This type of distribution is said to be
positively skewed.

E.g. the number of children per family, the age at which women marry, and the distribution of
wages in a firm.

Negatively skewed distributions


In a negatively skewed distribution the mean is pulled in the negative direction. This type of
distribution can occur when considering, for example, reaction times for an experiment or
daily maximum temperatures for a month in summer.

NOTE:

The median generally lies between the mode and the mean, and the following relation is approximately satisfied:

Mean − Mode ≈ 3(Mean − Median)

MEASURE OF SKEWNESS

Pearson’s coefficient of skewness


We need to know two things to describe the skewness of a distribution. One is the direction of the
skew, positive or negative, and the other is a measure of the degree of skewness, i.e. how far the
distribution is pulled in one direction. The degree of skewness is given by Pearson's
coefficient of skewness:

Pearson's coefficient of skewness = (mean − mode) / standard deviation

If mean > median > mode, then the skew is positive.

If mean < median < mode, then the skew is negative.

If mean = median = mode, then the skew is zero and the distribution is symmetrical.

Generally the coefficient can take any value between −3 and +3, and using the approximation
mean − mode = 3(mean − median), the skewness becomes:

Skewness = 3(mean − median) / standard deviation
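
As a small illustration, the sketch below (plain Python, with hypothetical summary values) applies the median form of the coefficient:

mean, median, sd = 33.4, 34.0, 19.62     # hypothetical summary statistics

# Pearson's coefficient (median form): 3(mean - median) / S.D
skewness = 3 * (mean - median) / sd
print(round(skewness, 3))                # -0.092, a slight negative skew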

COEFFICIENT OF VARIATION (C.V)

A relative measure of dispersion based on the standard deviation is given by

Coefficient of standard deviation = S.D / x̄

100 times the coefficient of dispersion based on the standard deviation is called the coefficient of
variation, abbreviated C.V:

C.V = (S.D / x̄) × 100

That is, the coefficient of variation is equal to the standard deviation divided by the arithmetic mean, multiplied by 100.

Coefficient of variation is widely used for comparing the variability, homogeneity, stability,
uniformity and consistency of two series.

The series having greater C.V is said to be more variable than the other.
The series having less C.V is said to be less variable or more consistent, more
uniform, more stable or more homogeneous than the other.

Example. Two workers on the same job show the following results over a long period of time

Worker A Worker B

Mean time for completing the job (minutes) 30 25

Standard deviation (in minutes) 6 4

a) Which worker appears to be more consistent in the time he requires to complete the job?
b) Which worker appears to be faster in the completion the job?

Solution for part (a);

For worker A:

C.V = (S.D / x̄) × 100 = (6/30) × 100 = 20%

For worker B:

C.V = (S.D / x̄) × 100 = (4/25) × 100 = 16%

Since the C.V of worker B is less than that of worker A, worker B appears to be
more consistent in the time he requires to complete the job.

Solution for part (b);

Worker B takes on average 25 minutes as against 30 minutes taken by worker A. Therefore,
worker B appears to be faster in completing the job.
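
The comparison in this example can be written as a short Python sketch:

def cv(sd, mean):
    # C.V = (S.D / mean) * 100, expressed as a percentage
    return sd / mean * 100

print("Worker A:", cv(6, 30), "%")   # 20.0 %
print("Worker B:", cv(4, 25), "%")   # 16.0 %  -> more consistent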

SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS

SIMPLE LINEAR REGRESSION

To develop a model of a phenomenon (e.g. the law of demand), statisticians make heavy use of a
statistical technique known as regression analysis. The purpose of this chapter is to introduce the
basics of regression analysis in terms of the simplest possible linear regression model, namely
the two-variable model.

The meaning of regression

Regression analysis is concerned with the study of the relationship between one variable called
the explained or dependent variable and one or more other variables called independent or
explanatory variables.

Example; studying the relationship between the quantity demanded of a commodity in terms of
the price of that commodity, income of the consumer, and prices of other commodities competing
with this commodity.

We may be also be interested in finding out how sales of a product (e.g. automobiles) are related
to advertising expenditure incurred on that product.

The objectives of regression analysis are;

 To estimate the mean or average value of the dependent variable given the values of the
independent variables.
 To test hypotheses about the nature of the dependence, hypotheses suggested by the
underlying economic theory. For example, one may want to test the hypothesis that the
demand curve has unitary price elasticity, i.e. that if the price of the
commodity goes up by 1% the quantity demanded on average goes down by 1%,
assuming all other factors affecting demand are held constant.
 To predict or forecast the mean value of the dependent variable, given the values of the
independent variable(s).

Assumptions underlying the regression analysis

 The explanatory variable(s) X is uncorrelated with disturbance term ui.

 The expected or mean value of the disturbance term ui is zero.

 The variance of each ui is constant or homoscedastic.

 There is no correlation between two error terms. This is the assumption of no


autocorrelation.

The ordinary least-squares method

This is a technique for fitting the best straight line to a sample of xy observations. The population
regression function is estimated by choosing b0 and b1, the estimators of B0 and B1, so that the
residual sum of squares is as small as possible. The least squares principle states
that "the sample regression function should be fitted in such a way that the sum of the squared
distances between the actual Y and the Y obtained from that SRF is the smallest one".

Let y be the dependent variable and x the independent variable,

Y = b0 + b1X

Summing over the sample observations gives the first normal equation:

∑y = nb0 + b1∑x ………………………….i

Multiplying Y = b0 + b1X through by x and summing gives the second normal equation:

xy = b0x + b1x²

∑xy = b0∑x + b1∑x² ……………………..ii

Example 1;

A random sample of ten families had the following income and food expenditure (000Tsh per
week).

Families A B C D E F G H I J

income 20 30 33 40 15 13 26 38 35 43

expenditure 7 9 8 11 5 4 8 10 9 10

Estimate the regression line of food expenditure on income and interpret your results.

Example 2;

The following table gives the quantities of cotton (in tons) bought in each year from 1961-1970
and the corresponding prices (0000Tsh).

quantity 770 785 790 795 800 805 810 820 840 850

price 18 16 15 15 12 10 10 7 9 6

i. Estimate the linear demand function for the cotton.

ii. Calculate the price elasticity of demand.

iii. Forecast the demand at the mean price of the demand.

iv. Forecast the demand at P=20.

COEFFICIENT OF DETERMINATION (R2)

After the estimation of the parameters and determination of the least squares regression, there is
a need to know how good is the fit of this line to the sample observations of Y and X, that is to say
we need to measure the dispersion of observations around the regression line. This knowledge is
essential because the closer the observations to the line the better the goodness of fit, that is the
better is the explanation on the variations of Y by the changes in the explanatory variables. The
measure of goodness of fit is the square of the correlation coefficient, r², which shows the
percentage of the total variation of the dependent variable that can be explained by the
independent variable X.

By fitting the estimated regression line (Y) we try to obtain the explanation of the variations of the
dependent variable Y produced by the changes of the explanatory variable X. However the fact
that the observations deviate from the estimated line shows that the regression line explains only
a part of the total variation of the dependent variable.

R² = 1 − Σe² / Σy²

Correlation Analysis

There are various methods of measuring the relationships existing between economic variables.
The simplest are correlation and regression analysis. It should be noted that correlation analysis
has serious limitations and throws little light on the nature of the relationship existing between
variables; it will make us familiar with the correlation coefficient which is an essential statistic of
regression analysis.

Correlation is defined as the degree of relationship existing between two or more variables.
Correlation may be linear when all points on scatter diagram seem to cluster near a straight line or
nonlinear when all points seem to lie near a curve.

Positive correlation;

Two variables are said to be positively correlated if they tend to change together in the same
direction, that is, if they tend to increase or decrease in the same direction.

Example; positive correlation is postulated by economic theory for the quantity of a commodity
supplied and its price. When the price increases the quantity supplied increases, and conversely.

Negative correlation;

Two variables are said to be negatively correlated if they tend to change in opposite directions,
that is, if when one variable increases the other decreases.

Example; in the theory of demand, the quantity of a commodity demanded and its price are negatively
related. When the price increases the demand for the commodity decreases, and when the price decreases
the demand for the commodity increases.

Zero correlation;

Two variables are uncorrelated when they tend to change with no connection to each other. The
scatter diagram will show dispersion of points all over the surface of the xy plane.

Example; the height of the inhabitants of a country and the production of steel, or between the
weight of students and the color of their hair.

CORRELATION ANALYSIS


If two quantities vary in such a way that movements in one are accompanied by movements in the
other, these quantities are correlated. For example, there exists some relationship between the age
of husband and age of wife, the price of a commodity and the amount demanded, increase in rainfall up to
a point and the production of rice, blood pressure and age, income and expenditure, sales and
advertising strategies.

The degree of relationship/association between the variables under consideration is measured
through correlation analysis. The measure of correlation, called the correlation coefficient or
correlation index, summarizes in one figure the direction and degree of correlation. The
association between variables/quantities is expressed by the coefficient of correlation, which ranges
between +1 and −1; the direction of change is indicated by the −ve and +ve signs.

+ve sign indicates that the variables are moving in the same direction, that is when one variable is
moving the other will also be moving in the same direction with the same or almost the same
magnitude. Example, income and quantity demanded

-ve sign indicates that the variables are moving in opposite direction, that is when one variable
goes up the other will be going down by the same or almost the same magnitude. Example, price
and quantity demanded.

Application of correlation analysis

 Economic theory and business studies show relationship between variables like price and
quantity demanded, advertising expenditure and sales. The correlation analysis helps to
derive precisely the degree and direction of such relationship.

 The usefulness of this technique is that the average relationship in a series can be
summed up in a single value called the coefficient of correlation.

 It also helps to contribute to the understanding of economic behavior, that is fluctuation of


economic variables and it reduces the range of uncertainty of prediction.

Definition:
Correlation analysis refers to a technique used in measuring the closeness of the
relationship between variables.
Correlation analysis is a statistical device which helps us in studying the covariation of two or
more variables.

Methods of studying correlation


 Graphical method
 Algebraic method

Graphical method
The scatter diagram is a simple and attractive method of diagrammatic representation of a
bivariate distribution for ascertaining the nature of the correlation between the variables.

Computation of correlation coefficient


Correlation coefficient, denoted by "r", lies between +1 and −1 inclusive, i.e. −1 ≤ r ≤ 1.

Correlation coefficient is computed using the formula below

r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² × Σ(y − ȳ)² )        (1)

Equation (1) can be simplified to the formula below:

r = (nΣxy − ΣxΣy) / √( (nΣx² − (Σx)²) × (nΣy² − (Σy)²) )        (2)
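
A short Python sketch of formula (2), applied to the income and food expenditure data of Example 1 in the regression section:

import math

x = [20, 30, 33, 40, 15, 13, 26, 38, 35, 43]   # income
y = [7, 9, 8, 11, 5, 4, 8, 10, 9, 10]          # food expenditure
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
sy2 = sum(b * b for b in y)

# r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))
r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
print("r =", round(r, 4))   # about 0.95: a strong positive correlation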

Statistical Interpretation of Correlation coefficient


As the correlation coefficient lies between +1 and −1, the following interpretations can be made.

If r = +1, there is a perfect positive relationship (positive correlation) between the two
variables.
If r = −1, there is a perfect negative relationship (negative correlation) between the two
variables.
If r = 0, there is no relationship between the two variables (absence of correlation).
If r is near +1, there is a strong or significant positive relationship (strong positive
correlation).
If r is nearly equal to zero, there is a weak or insignificant relationship (weak
correlation).
Correlation and Causation

Correlation analysis helps us to determine the degree of relationship between two or more
variables; it does not tell us anything about cause-and-effect relationships.

COEFFICIENT OF DETERMINATION (R2 )

The coefficient of determination shows the percentage of the variation in the values of the dependent variable
that can be explained by the variation in the values of the independent variable. The coefficient of
determination is denoted by R² and it lies between 0 and 1 inclusive. Mathematically,
0 ≤ R² ≤ 1

In simple linear regression the coefficient of determination (R²) is equal to r², where "r" is the coefficient
of correlation. If the value of r = 0.8, R² will be 0.64, and it would mean that 64 per cent of the
variation in the dependent variable has been explained by the independent variable.

The coefficient of determination (R2) can also be defined as the ratio of the explained variance to
the total variance.

R² = Explained Variation / Total Variation

1 − R² is called the coefficient of non-determination.

 If R² = 0, X does not explain Y.

 If R² = 1, X perfectly explains Y.

Note that high values of R² imply a good fit, while low values of R² imply a poor fit.

REGRESSION ANALYSIS

Regression is a statistical technique used to test the relationship between measurable variables.

Regression analysis attempts to establish the nature of the relationship between variables and
thereby provide a mechanism for prediction or forecasting. Using regression analysis we are in
a position to estimate or predict the unknown values of one variable from known values of
another variable.

The variable which is used to predict the variable of interest is called the independent variable or
explanatory variable, and the variable we are trying to predict is called the dependent variable or
explained variable. The independent or explanatory variable is denoted by "X" and the
dependent or explained variable is denoted by "Y". The analysis used is called simple
linear regression analysis: simple because there is only one predictor/independent variable, and
linear because of the assumed linear relationship between the independent variable (X) and the dependent
variable (Y). Linear means the equation is a straight line of the form Y = a + bX.

The linear regression Analysis is further classified into two categories

Simple linear regression.

Multiple linear regression.

SIMPLE LINEAR REGRESSION

In simple linear regression we develop the relationship between one dependent variable and
one independent variable. The relationship can be expressed mathematically as

Yi = α + βXi + εi

Where Y is the dependent variable,

α is the intercept (its value is the point at which the regression line crosses the Y-axis),

β is the slope of the relationship (it represents the change in Y for a unit change in the X
variable),

X is the independent variable,

ε is a random error.

Random error: an unobservable quantity introduced into the model to account for the failure of the
observed values to fall exactly on a single straight line.

Properties of error term (ε)

i. The errors are independent, in the sense that the error of one observation has no
effect on the error of another observation.

ii. The errors are distributed normally with zero mean and constant variance. Mathematically,

εi ~ N(0, σ²)

Assumptions of Regression

There is a linear relationship between the independent variable and the dependent variable.

The Xi's are non-random variables observed with negligible error.

The εi's are random variables with zero mean and constant variance.

The normality assumption is imposed on the εi's. Mathematically,

εi ~ N(0, σ²)

The εi's are uncorrelated. Mathematically,

E(εi εj) = 0 for i ≠ j

THE LEAST SQUARE ESTIMATION

Since we cannot fit the line exactly, we use the method of least squares to estimate the
parameters α and β, giving the estimates "a" and "b" respectively.

When we use the estimates in the model we get the fitted model, given by

Ŷ = a + bxi ……………………………………………………………(*)

Equation (*) above is the equation of the line which provides estimates of the dependent
variable when values of the independent variable are inserted into the equation.

Where Ŷ is the estimated dependent variable,

"a" is the estimator of α,

"b" is the estimator of β,

"xi" is the independent variable.

We do not have εi in the fitted model because E(Y|X) is now an exact value.

The least squares estimators are those values "a" of α and "b" of β that minimize the
residual sum of squares ∑εi², where εi = Yi − (a + bXi). Since "a" and "b"
are constants that describe the average relationship between the two variables, they
can be obtained by the method of least squares as

a = (1/n)(∑yi − b∑xi) = ȳ − b x̄

b = (n∑xiyi − ∑xi∑yi) / (n∑xi² − (∑xi)²)
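
A minimal Python sketch of these estimators, applied to the income and food expenditure data of Example 1 (income as X, expenditure as Y):

x = [20, 30, 33, 40, 15, 13, 26, 38, 35, 43]   # income
y = [7, 9, 8, 11, 5, 4, 8, 10, 9, 10]          # food expenditure
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)

# b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2);  a = ybar - b*xbar
b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
a = sy / n - b * sx / n

print("fitted line: Y = %.3f + %.3f X" % (a, b))   # roughly Y = 2.17 + 0.20 X

A slope of about 0.20 would suggest that each additional unit of weekly income is associated with roughly 0.20 more units of food expenditure.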

INTERPRETATIONS OF REGRESSION COEFFICIENTS

The intercept "a" gives the estimated value of Y when X = 0, and the slope "b" gives the estimated change in Y for a unit change in X.

DIFFERENCE BETWEEN CORRELATION AND REGRESSION ANALYSIS

Correlation and regression analysis are constructed under different assumptions; they furnish
different types of information, and it is not always clear which measure should be used in a
given problem situation.

The following are the points of difference between the two.

1. Whereas the correlation coefficient is a measure of the degree of covariability between X and Y, the objective
of regression analysis is to study the nature of the relationship between the variables so that
we may be able to predict the value of one on the basis of the other. The closer the
relationship between the two variables, the greater the confidence that may be placed in the
estimates.

2. In correlation analysis rxy is a measure of the direction and degree of the linear relationship
between two variables X and Y; rxy and ryx are symmetric (rxy = ryx), i.e. it is immaterial which
of X and Y is the dependent variable and which is the independent variable. In regression analysis
the regression coefficients are not symmetric, i.e. bxy ≠ byx, and
hence it definitely makes a difference which variable is dependent and which is
independent.

3. There may be nonsense correlation between two variables which is purely due to chance
and has no practical relevance, such as an increase in income and an increase in weight of a
group of people. However, there is nothing like nonsense regression.

4. The correlation coefficient is independent of change of origin and scale, but the regression coefficients are independent of change of origin, not of scale.

40
ELEMENTARY PROBABILITY THEORY

Probability Distributions

Probability distribution of a random variable is a listing of the values of random variable with the
corresponding probability associated with each value of the random variable. When probability
values are assigned to all possible values of a random variable either by listing or by a
mathematical function, the result is a probability distribution.

Example;

Outcome of a rolled die:   1     2     3     4     5     6
Probability:               1/6   1/6   1/6   1/6   1/6   1/6

The sum of the probabilities of all possible outcomes must be equal to 1:

P(x = 1, 2, 3, ..., 6) = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1

Whenever possible we try to express probability distribution by means of a formula which enables
us to calculate the probabilities associated with various numerical descriptions of the outcomes
with the usual functional notations.

In general, if a random variable X assumes the values x1, x2, x3, ..., xn with probabilities
P(x1), P(x2), P(x3), ..., P(xn) respectively, then the probability distribution of X is given by the
set of pairs (xi, P(xi)), with P(x1) + P(x2) + P(x3) + ... + P(xn) = 1.

Random variable

A variable whose value is determined by the outcome of a random experiment, not under
the control of the observer, is called a random variable. A random variable is a function defined over
the sample space of an experiment and generally assumes different values, with a definite
probability associated with each value.

Example; in three tosses of a coin the number of heads obtained is a random variable which can
take any one of the four values 0, 1, 2 or 3.

Discrete random variable; A variable which takes on the integral values such as 0, 1, 2, 3….

Example; number of defective items in a sample, printing mistakes in each page of a book and
telephone calls received by telephone switchboard of a firm.

Example 1; a random variable x has the following probability function:

X 0 1 2 3 4 5 6 7

P(X) 0 K 2K 2K 3K K2 2K2 7K2+K

i. Find the value K


ii. Evaluate P(X<6), P (X≥6), and P (0<X<5).

Example 2; a random variable X has the following probability distribution:

X 0 1 2 3 4 5 6 8 9

P(X) a 3a 5a 7a 9a 11a 13a 15a 17a

i. Determine the value of a


ii. Find P(X<3), P (X≥3), P(0<X<5).
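
To illustrate how such problems are solved, the probabilities in Example 1 must sum to 1, giving 0 + K + 2K + 2K + 3K + K² + 2K² + 7K² + K = 10K² + 9K = 1, a quadratic in K. A short Python sketch:

import math

# Solve 10K^2 + 9K - 1 = 0 for the positive root
K = (-9 + math.sqrt(9 ** 2 + 4 * 10 * 1)) / (2 * 10)
print("K =", K)                        # 0.1

# For instance P(X < 6) = 0 + K + 2K + 2K + 3K + K^2 = 8K + K^2
print("P(X<6) =", 8 * K + K ** 2)      # 0.81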

Continuous random variable; A variable which takes on all values within a certain interval, whose
probabilities are determined by a mathematical function and are typically portrayed by a probability
density function or probability curve. Examples: height, weight (mass), time, volume and
temperature.

The probability density function is denoted by f(x) and abbreviated p.d.f. For a continuous random
variable defined on [x1, x2]:

i. f(x) is a p.d.f if the total area under the curve is 1, i.e. ∫ from x1 to x2 of f(x) dx = 1

ii. Mean = ∫ from x1 to x2 of x f(x) dx

iii. Variance = ∫ from x1 to x2 of x² f(x) dx − (Mean)²

iv. The median m satisfies ∫ from x1 to m of f(x) dx = ∫ from m to x2 of f(x) dx = 1/2

Example 1; in a continuous distribution whose probability density function is given by f(x) =x (2-x),
0≤x≤2. Find

i. Mean
ii. Variance
iii. Mode
iv. Median
v. Check if the above function is p.d.f

Example 2; the diameter of an electric cable, say X, is assumed to be a continuous random
variable with probability density function f(x) = 6x(1 − x), 0 ≤ x ≤ 1. Find:

i. Mean
ii. Variance
iii. Mode
iv. Median
v. Check if the above function is p.d.f
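
For Example 2, a quick symbolic check (assuming the SymPy library is available) verifies the p.d.f condition and computes the mean and variance of f(x) = 6x(1 − x) on [0, 1]:

import sympy as sp

x = sp.symbols('x')
f = 6 * x * (1 - x)                                  # candidate p.d.f on [0, 1]

total = sp.integrate(f, (x, 0, 1))                   # 1, so f is a valid p.d.f
mean = sp.integrate(x * f, (x, 0, 1))                # 1/2
var = sp.integrate(x**2 * f, (x, 0, 1)) - mean**2    # 1/20

print(total, mean, var)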

Discrete probability distributions;

The probability distribution of a random variable which takes on integral values such as 0, 1, 2,
3, … together with their associated probabilities is a discrete probability distribution. The set of the 6
outcomes in rolling a die and their associated probabilities is an example of a discrete probability
distribution. The sum of the probabilities associated with all the values that the discrete random
variable can assume is always equal to 1.

a) Binomial distribution
This is used to find the probability of x occurrences or successes of an event, P(x), in n
trials of the same experiment when there are only two possible and mutually exclusive outcomes,
the n trials are independent, and the probability of occurrence or success, p, remains constant in each
trial.

Properties of binomial distribution;

i. The number of trials must be fixed and finite

ii. The trials must be independent.
iii. Each trial has two mutually exclusive outcomes: success and failure.
iv. The probability of success remains constant from trial to trial.
v. The number of trials should not be more than 30.

Mean and variance of binomial distribution;

Mean (μ) = number of trials × probability of success

Mean=np

Variance =number of trials x probability of success x probability failure.

Variance= npq

P(x) = nCx p^x (1 − p)^(n − x)

Where, p=probability of success

1-p=probability of failure

μ = mean = np and σ² = variance = np(1 − p).

Example 1;

The mean of a binomial distribution is 40 and standard deviation is 6. Calculate n,p and q.

Example 2;

Assume that on average one telephone number out of fifteen is busy. If 6 randomly selected
telephone numbers are called, what is the probability that:

i. Not more than 3 will be busy


ii. At least 3 of them will be busy.

Exercise;

i. A certain system of betting on horses produces winners 40% of the time. What is the
probability that the system produces exactly 3 winners out of 8 on a race day?
ii. A biased die is thrown 30 times and the number of 6 seen is 8. If the die is thrown for 12
times, find the probability that 6 will occur exactly twice.
iii. Toss a fair coin 12 times. Find the probability of obtaining at least 9 heads in 12 times.
iv. A coin is biased so that a head is twice as likely to occur as a tail. If the coin is tossed four
times. What is the probability of getting exactly 2 tails?
v. A manufacturer claims that at most 10% of his product is defective. To test this claim 18
units are inspected and his claim is accepted if among these 18 units at most 2 are
defective. Find the probability that the manufacturer‟s claim will be accepted if the actual
probability that a unit is defective is 0.05 and 0.15.
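
For instance, exercise (i) above can be checked with a few lines of Python using the binomial formula (math.comb gives nCx):

from math import comb

n, p, k = 8, 0.4, 3

# P(X = k) = nCk * p^k * (1-p)^(n-k)
prob = comb(n, k) * p ** k * (1 - p) ** (n - k)
print("P(X = 3) =", round(prob, 4))   # about 0.2787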

b) Poisson Distribution
It is used to determine the probability of a designated number of successes per unit of time when
the events or successes are independent and average number of successes per unit of time
remains constant.

Poisson distribution is a limiting case of the binomial distribution under the following conditions;

i. The number of trials is indefinitely large.

ii. The constant probability of success for each trial is indefinitely small.
iii. The mean is the same as the variance.

Example; Events which might follow a Poisson distribution are

 Number of flaws in a given length of material


 Number of accidents on a particular stretch of road in one day
 Number of accidents in a factory in one week
 Telephone calls made to switchboard in a given minute
 Insurance claims made to a company in a given time
 Particles emitted by a radioactive source in a given time

The Poisson distribution is given by

P(x) = (λ^x e^(−λ)) / x!

Where

x = designated number of successes

P(x) = probability of x successes

λ = average number of successes per unit of time

e = base of the natural logarithm system

Example 1; A car hire firm has two cars which it hires out day to day. The number of demands for
a car on each day is distributed as a Poisson variate with mean 1.5. Calculate the proportion of
days on which,

i. Neither car is used.


ii. Some demand is refused.
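
Example 1 can be worked through with a short Python sketch (λ = 1.5, using the Poisson formula above; "some demand is refused" means more than 2 demands in a day, since the firm has only two cars):

from math import exp, factorial

lam = 1.5

def poisson(x, lam):
    # P(x) = lam^x * e^(-lam) / x!
    return lam ** x * exp(-lam) / factorial(x)

p_neither = poisson(0, lam)                                             # about 0.2231
p_refused = 1 - (poisson(0, lam) + poisson(1, lam) + poisson(2, lam))   # about 0.1912

print(round(p_neither, 4), round(p_refused, 4))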

EXERCISE

1. A manufacturer of cotter pins knows that 5% of his product is defective. If he sells cotter
pins in boxes of 100 and guarantees that not more than 10 pins will be defective. What is
the approximate probability that a box will fail to meet the guaranteed quality?
2. Assuming that the chance of a traffic accident in a day in a street of Mwanza is 0.001 on
how many days out of a total of 1000 days can we expect:
i. No accident
ii. More than three accidents, if there are 1000 such streets in the whole city?

3. A hospital switchboard receives an average of 4 emergency calls in a 10-minute interval.
What is the probability that (i) there are at most 2 emergency calls in a 10-minute
interval, (ii) there are exactly 3 emergency calls in a 10-minute interval?

Normal Distribution;

The normal distribution is the most important continuous distribution in statistics. Many measured
quantities in the natural sciences follow a normal distribution for example heights, masses, ages,
random errors, IQ scores, examination results.

Continuous probability distribution;

The probability distribution of a random variable which is determined over the range of all possible values
within a defined interval, together with their associated probabilities, is known as a continuous
probability distribution. The probability distribution of a continuous random variable is often called
a probability density function or simply a probability function. It is given by a smooth curve such that
the total area under the curve is 1.

The normal distribution is a continuous probability function that is bell-shaped, symmetrical about
the mean and extends indefinitely in both directions but most of the area clustered around the
mean.

Probability density function of a normal variable

A continuous random variable X having probability density function

f(x) = (1 / (σ√(2π))) e^( −(x − μ)² / (2σ²) ),   −∞ < x < ∞

is said to have a normal distribution with mean μ and variance σ², where μ and σ² are the
parameters of the distribution.

If X is distributed in this way we write X ~ N(μ, σ²), that is, the random variable X is
normal with mean μ and variance σ².

Properties of the probability density function

 The distribution is bell-shaped and symmetrical about x = μ.
 The maximum value of f(x) occurs when x = μ and is given by f(μ) = 1 / (σ√(2π)).
 The curve extends indefinitely in both directions, but most of the area is clustered around the mean.

The standard normal distribution

To standardize the random variable X, subtract μ and divide by σ, so the standard normal variable is
given by Z = (X − μ) / σ. If the random variable X has a normal distribution with mean μ and variance
σ², then the random variable Z has a standard normal distribution with mean 0 and variance 1.

The use of standard normal tables

Example 1; If Z ~ N(0, 1), find from tables (a) P(Z > 1.377), (b) P(Z < 1.377), (c) P(Z > −1.377), (d) P(Z < −1.377).

Example 2; If Z ~ N(0, 1), find

(a) P(0.345 < Z < 1.751), (b) P(−2.696 < Z < 1.865), (c) P(−1.4 < Z < −0.6), (d) P(|Z| < 1.433), i.e. P(−1.433 < Z < 1.433).

Example 3; If Z ~ N(0, 1), find the value of a if (a) P(Z > a) = 0.3802, (b) P(Z < a) = 0.9693, (c) P(|Z| < a) = 0.9.

Example 4; The random variable X ~ N(300, 25). Find (a) P(X > 305), (b) P(X < 291).

Example 5; The random variable X is such that X ~ N(50, 8). Find (a) P(48 < X < 54), (b) P(46 < X < 49), (c) P(|X − 50| < √8).

Example 6;

i. The time taken by a milkman to deliver milk to the high street is normally distributed
with mean 12 minutes and standard deviation 2 minutes. He delivers milk every day.
Estimate the number of days during the year when he takes (a) longer than 17 minutes
(b) less than 10 minutes (c) between 9 and 13 minutes.

ii. A certain type of cabbage has a mass which is normally distributed with mean 1kg and
standard deviation 0.15kg. In a lorry load of 800 of these cabbages, estimate how
many will have mass (a) greater than 0.79kg, (b) less than 1.13kg, (c) between 0.85kg
and 1.15kg, (d) between 0.75kg and 1.29kg.
iii. The marks of 500 candidates in an examination are normally distributed with mean of
45 marks and a standard deviation of 20 marks.
a. Given that the pass mark is 41, estimate the number of candidates who passed
the examination.
b. If 5% of the candidates obtained a distinction by scoring x marks or more,
estimate the value of x.
c. Estimate the interquartile range of the distribution.
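
Parts of Example 6(i) can be checked with Python's statistics.NormalDist (delivery time modelled as N(12, 2²) minutes, with 365 delivery days assumed):

from statistics import NormalDist

T = NormalDist(mu=12, sigma=2)   # delivery time in minutes
days = 365

p_long = 1 - T.cdf(17)            # P(T > 17)
p_short = T.cdf(10)               # P(T < 10)
p_mid = T.cdf(13) - T.cdf(9)      # P(9 < T < 13)

for label, p in [("> 17 min", p_long), ("< 10 min", p_short), ("9-13 min", p_mid)]:
    print(label, round(p, 4), "->", round(days * p), "days")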

Hypothesis Testing

General Description

There are two types of statistical inferences: estimation of population parameters and hypothesis
testing. Hypothesis testing is one of the most important tools of application of statistics to real life
problems. Most often, decisions are required to be made concerning populations on the basis of
sample information. Statistical tests are used in arriving at these decisions.

There are five ingredients to any statistical test:

(a) Null Hypothesis


(b) Alternate Hypothesis
(c) Test Statistic
(d) Rejection/Critical Region
(e) Conclusion

In attempting to reach a decision, it is useful to make an educated guess or assumption about the
population involved, such as the type of distribution.

Statistical Hypotheses: They are defined as assertion or conjecture about the parameter or
parameters of a population, for example the mean or the variance of a normal population. They
may also concern the type, nature or probability distribution of the population.

Statistical hypotheses are based on the concept of proof by contradiction. For example, say we
suspect that a new procedure has increased or decreased the mean of a population; we then test the assumption that the mean has not changed.
Null Hypothesis: It is a hypothesis which states that there is no difference between the
procedures and is denoted by H0. For the above example the corresponding H0 would be that
there has been no increase or decrease in the mean. Always the null hypothesis is tested, i.e., we
want to either accept or reject the null hypothesis because we have information only for the null
hypothesis.

Alternative Hypothesis : It is a hypothesis which states that there is a difference between the
procedures and is denoted by HA.

Table 1. Various types of H0 and HA

Case    Null Hypothesis H0      Alternate Hypothesis HA

1       μ1 = μ2                 μ1 ≠ μ2
2       μ1 ≥ μ2                 μ1 < μ2
3       μ1 ≤ μ2                 μ1 > μ2
Test Statistic: It is the random variable X whose value is tested to arrive at a decision. The
Central Limit Theorem states that for large sample sizes (n > 30) drawn randomly from a
population, the distribution of the means of those samples will approximate normality, even when
the data in the parent population are not distributed normally. A z statistic is usually used for large
sample sizes (n > 30), but often large samples are not easy to obtain, in which case the t-statistic
is used instead, with the unknown population standard deviation estimated by the sample
standard deviation, s. The t curves are bell shaped and distributed around t = 0. The exact shape
of a given t-curve depends on the degrees of freedom. In the case of performing multiple
comparisons by one way ANOVA, the F-statistic is normally used. It is defined as the ratio of the
mean square due to the variability between groups to the mean square due to the variability within
groups. The critical value of F is read off from tables of the F-distribution, knowing the Type I error
level and the degrees of freedom.
Rejection Region: It is the part of the sample space (critical region) where the null hypothesis H0
is rejected. The size of this region is the probability, i.e. the level of significance, of the test statistic falling
in the critical region when H0 is true: the probability of
the value of the random variable falling in the critical region. Also it should be noted that the term
"statistically significant" means
only that the observed difference between the sample statistic and the mean of the sampling
distribution did not occur by chance alone.

Conclusion: If the test statistic falls in the rejection/critical region, H0 is rejected, else H0 is
accepted.

Types of Tests

Tests of hypothesis can be carried out on one or two samples. One sample tests are used to test
the value of a single population parameter (e.g. μ) against a hypothesized value, while two sample tests
compare the parameters of two populations (μ1 and μ2).

Two sample tests can further be classified as unpaired or paired two sample tests. While in
unpaired two sample tests the sample data are not related, in paired two sample tests the sample
data are paired according to some identifiable characteristic. For example, when testing
hypothesis about the effect of a treatment on (say) a landfill, we would like to pair the data taken
at different points before and after implementation of the treatment.

Both one sample and two sample tests can be classified as :

One tailed test: Here the alternate hypothesis HA is one-sided and we test whether the test
statistic falls in the critical region on only one side of the distribution.

1. One sample test: For example, we are measuring the concentration of a lake and we need
to know if the mean concentration of the lake is greater than a specified value of 10 mg/L.
Hence, H0: μ ≤ 10 mg/L, vs, HA: μ > 10 mg/L.
2. Two sample test: In Table 1, cases 2 and 3 are illustrations of two sample, one tailed tests.
In case 2 we want to test whether the population mean of the first sample is less than
that of the second sample.
Hence, H0: μ1 ≥ μ2, vs, HA: μ1 < μ2.

Two tailed test: Here the alternate hypothesis HA is formulated to test for difference in either
direction, i.e., for either an increase or a decrease in the random variable. Hence the test statistic
is tested for occurrence within either of the two critical regions on the two extremes of the
distribution.

1. One sample test: For the lake example we need to know if the mean concentration of the
lake is the same as or different from a specified value of 10 mg/L.
Hence, H0: μ = 10 mg/L, vs, HA: μ ≠ 10 mg/L.
2. Two sample test: In Table 1, case 1 is an illustration of a two sample two tailed test. In
this case we test whether the population mean of the first sample (μ1) is the same as
that of the second sample (μ2).

Hence H0: μ1 = μ2, vs, HA: μ1 ≠ μ2.

Given the same level of significance the two tailed test is more conservative, i.e., it is more
rigorous than the one-tailed test because the rejection point is farther out in the tail. It is more
difficult to reject H0 with a two-tailed test than with a one-tailed test.


Error

When using probability to decide whether a statistical test provides evidence for or against our
predictions, there is always a chance of drawing the wrong conclusions. Even when choosing a
probability level of 95%, there is always a 5% chance that one rejects the null hypothesis when it
is in fact true. This is called Type I error, represented by the Greek letter α.

It is possible to err in the opposite way if one fails to reject the null hypothesis when it is, in fact,
incorrect. This is called Type II error, represented by the Greek letter β. These possibilities are
represented in the following chart.

Table 2. Types of error

Type of decision    H0 true                     H0 false

Reject H0           Type I error (α)            Correct decision (1 − β)

Accept H0           Correct decision (1 − α)    Type II error (β)

A related concept is power, which is the probability of rejecting the null hypothesis when it is
actually false. Power is simply 1 minus the Type II error rate, and is usually expressed as 1 − β.

When choosing the probability level of a test, it is possible to control the risk of committing a Type
I error by choosing the level of significance, α. This also affects Type II error, since the two are
inversely related: as one increases, the other decreases.

There is little control over the risk of committing a Type II error, because it also depends on the actual
difference being evaluated, which is usually unknown; the size of β varies according to the actual
distribution of the population.

The consequences of these different types of error are very different. For example, if one tests for
the significant presence of a pollutant, incorrectly deciding that a site is polluted (Type I error) will
cause a waste of resources and energy cleaning up a site that does not need it. On the other
hand, failure to determine presence of pollution (Type II error) can lead to environmental
deterioration or health problems in the nearby community.

Steps in Hypothesis Testing

Identify the null hypothesis H0 and the alternate hypothesis HA.

Choose the level of significance, α, taking into account the consequences of both types of errors.

Select the test statistic and determine its value from the sample data. This value is
called the observed value of the test statistic. Remember that a t statistic is usually
appropriate for a small number of samples; for larger numbers of samples, a z
statistic can work well if data are normally distributed.

Compare the observed value of the statistic to the critical value obtained for the
chosen level of significance.

Make a decision:

If the test statistic falls in the critical region: reject H0 in favour of HA.

If the test statistic does not fall in the critical region: conclude that there is not enough evidence to
reject H0.

Practical Examples

A) One tailed Test

An aquaculture farm takes water from a stream and returns it after it has circulated through the
fish tanks. The owner thinks that, since the water circulates rather quickly through the tanks, there
is little organic matter in the effluent. To find out if this is true, he takes some samples of the water
at the intake and other samples downstream the outlet and tests for Biochemical Oxygen Demand
(BOD). If BOD increases, it can be said that the effluent contains more organic matter than the
stream can handle.

The data for this problem are given in the following table:

Table 3. BOD in the stream

One tailed t-test :

Upstream Downstream

6.782 9.063

5.809 8.381

6.849 8.660

6.879 8.405

7.014 9.248

7.321 8.735

5.986 9.772

6.628 8.545

6.822 8.063

6.448 8.001

1. A is the set of samples taken at the intake; and B is the set of samples taken downstream.
   H0: μB ≤ μA
   HA: μB > μA
2. The level of significance is chosen (here α = 0.05).
3. The observed t value is calculated.
4. The critical t value is obtained according to the degrees of freedom and α.
The resulting t test values are shown in this table:

Table 4. t-Test : Two-Sample Assuming Equal Variances

Upstream Downstream

Mean 6.6539 8.6874

Variance 0.2124 0.2988

Observations 10 10

Pooled Variance 0.2556

Hypothesized Mean Difference 0

Degrees of freedom 18

t stat -8.9941

P(T<t) one-tail 2.22 x 10-08

t Critical one-tail 1.7341

P(T<t) two-tail 4.45 x 10-08

t Critical two-tail 2.1009

5) Make a decision.... Is the effluent polluting the stream?
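
This pooled-variance two-sample t-test can be sketched in plain Python with the data from Table 3; it should reproduce the t statistic of about −8.99 reported in Table 4:

import math

upstream = [6.782, 5.809, 6.849, 6.879, 7.014, 7.321, 5.986, 6.628, 6.822, 6.448]
downstream = [9.063, 8.381, 8.660, 8.405, 9.248, 8.735, 9.772, 8.545, 8.063, 8.001]

def mean(v):
    return sum(v) / len(v)

def sample_var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

n1, n2 = len(upstream), len(downstream)

# Pooled variance, then t = (x1bar - x2bar) / sqrt(s2p * (1/n1 + 1/n2))
s2p = ((n1 - 1) * sample_var(upstream) + (n2 - 1) * sample_var(downstream)) / (n1 + n2 - 2)
t = (mean(upstream) - mean(downstream)) / math.sqrt(s2p * (1 / n1 + 1 / n2))

print("t =", round(t, 4), " df =", n1 + n2 - 2)   # t is about -8.99, df = 18

Since |t| far exceeds the one-tail critical value of 1.7341, H0 would be rejected, suggesting the effluent does add organic matter to the stream.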

B) Two tailed Test

Let us assume that an induced bioremediation process is being conducted at a contaminated site.
The researcher has obtained good cleanup rates by injecting a mixture of nutrients into the soil in
order to maintain an abundant microbial community. Someone suggests using a cheaper mixture.
The researcher tries one patch of land with the new mixture, and compares the degradation rates
to those obtained from a patch treated with the expensive one to see if he can get the same
degradation rates.

The data for this problem are shown in the following table:

Table 5. Degradation rates on treatment with different nutrients.

Two tailed t-test

Cheap Nutrient Expensive Nutrient

7.1031 9.6662

6.4085 10.1320

8.8819 9.0624

7.0094 8.8136

4.6715 9.2345

6.6135 9.9949

6.5877 9.4299

6.2849 8.8012

6.6789 9.9249

6.5542 8.1739

1) A is treated with the cheap nutrient; and B is treated with the expensive one.

H0: μA = μB

HA: μA ≠ μB

2) The level of significance is chosen (here α = 0.05).

3) The observed t value is calculated.

4) The critical t value must be obtained according to the degrees of freedom.

Assuming that variances from the two sets are unequal, we obtain the following t test:

Table 6. t-test: Two sample Assuming Unequal Variances

Cheap Nutrient Expensive Nutrient

Mean 6.6794 9.3233

Variance 1.0476 0.3917

Observations 10 10

Hypothesized Mean Difference 0

Degrees of freedom 15

t Stat -6.9691

P(T<t) one-tail 2.25 x10-6

t Critical one-tail 1.7531

P(T<t) two-tail 4.51 x 10-6

t critical two tail 2.1315

5) Make a decision... Was the expensive nutrient actually necessary?

Limitations for Environmental Sampling

Although hypothesis tests are a very useful tool in general, they are sometimes not appropriate in
the environmental field. The following cases illustrate some of the limitations of this type of test:

A) Multiple Comparisons

z and t tests are very useful when comparing two population means. However, when it comes to
comparing several population means at the same time, this method is not very appropriate.

Suppose we are interested in comparing pollutant concentrations from three different wells, with
means μ1, μ2 and μ3. We could test the following hypothesis:

H0: μ1 = μ2 = μ3
HA: not all means are equal

We would need to conduct three different hypothesis tests, which are shown here:

Table 7. Hypothesis tests needed for testing three different populations

H0: μ1 = μ2         H0: μ2 = μ3         H0: μ1 = μ3
HA: μ1 ≠ μ2         HA: μ2 ≠ μ3         HA: μ1 ≠ μ3

For each test, there is always the possibility of committing an error. Since we are conducting three
such tests, the overall error probability would exceed the acceptable ranges, and we could not
feel very confident about the final conclusion. Table 8 shows the overall probability of committing a
Type I error when multiple t tests are conducted. Assume that each k value represents the number of
populations to be compared.

Table 8. Probability of committing a type I error by using multiple t tests to seek differences
between all pairs of k means.

Level of
Significance used
in the t tests 0.20 0.10 0.05 0.02 0.01 0.001
Number of
means (k)

2 0.20 0.10 0.05 0.02 0.01 0.001

3 0.41 0.23 0.13 0.05 0.03 0.003

4 0.58 0.36 0.21 0.09 0.05 0.006

5 0.71 0.47 0.23 0.13 0.07 0.009

10 0.96 0.83 0.63 0.37 0.23 0.034

20 1.00 0.98 0.92 0.71 0.52 0.109

∞ 1.00 1.00 1.00 1.00 1.00 1.00

Note: The particular values were derived from a table by Pearson (1942) by assuming equal
population variances and large samples.

A better method for comparing several population means is an analysis of variance, abbreviated
as ANOVA.

The ANOVA test is based on the variability between the sample means. This variability is measured in
relation to the variability of the data values within the samples. These two variances are compared
by means of the F-ratio test.

If there is a large variability between the sample means, this suggests that not all the population
means are equal. When the variability between the samples means is large compared to the
variability within the samples, it can be concluded that not all the population means are equal.

B) Multiple Constituents

In example 1, we were only testing for BOD, so only one t test was necessary. If we had been
trying to trace more than one pollutant, which is usually the case, we would have to carry out
a different test for each pollutant in order to determine if the effluent was similar to the receiving
stream. Then we would have the same problem we encountered with multiple comparisons, with k
representing the number of pollutants instead of the number of populations.

C) Difficulty in meeting assumptions

The tests used in the testing of hypothesis, viz., t-tests and ANOVA have some fundamental
assumptions that need to be met, for the test to work properly and yield good results. The main
assumptions for the t-test and ANOVA are listed below.

The primary assumptions underlying the t-test are:

1. The samples are drawn randomly from a population in which the data are
normally distributed.
2. In the case of a two sample t-test, σ1² = σ2². Therefore it is assumed that s1² and s2² are both
estimates of a common population variance σ². This assumption is called the homogeneity of
variances.
3. In the case of a two sample t-test, the measurements in sample 1 are independent of
those in sample 2.

Like the t-test, analysis of variance is based on a model that requires certain assumptions. Three
primary assumptions of ANOVA are that:

1. Each group is obtained randomly, with each observation independent of all other
observations and the groups independent of each other.
2. The samples represent populations in which the data are normally distributed.
3. σ1² = σ2² = σ3² = ... = σk². The assumption of homogeneity of variances is similar to the
discussion above under the t-test. The group variances are all assumed to be estimates of a
common population variance σ².

In actual experimental or sampling situations, the underlying populations are not likely to be
exactly normally distributed with exactly equal variances. Both the t-test and ANOVA are quite
robust and yield reliable results even when some of the assumptions are not fully met. For
example, if n₁ = n₂ = ... = nₖ, ANOVA tends to be especially robust with respect to the assumption
of homogeneity of variances; as the number of groups tested, k, increases, departures from this
assumption have a greater effect on the value of the F-statistic. A reasonable departure from the
assumption of population normality also does not have a serious effect on the reliability of the
F-statistic or the t-statistic. It is essential, however, that the assumption of independence be met:
the analysis is not robust for non-independent measurements. These factors are to be taken into
consideration while testing hypotheses.
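Where there is doubt, the assumptions can be probed directly before testing. The sketch below, on hypothetical samples, applies the standard SciPy routines: the Shapiro-Wilk test for normality and Levene's test for homogeneity of variances.

```python
# A sketch of pre-checks for the t-test/ANOVA assumptions; data are hypothetical.
from scipy import stats

sample_1 = [4.2, 4.8, 5.1, 4.6, 4.9, 5.2]
sample_2 = [5.0, 5.6, 5.4, 5.9, 5.3, 5.1]

# Shapiro-Wilk: small p-values suggest a departure from normality.
for i, s in enumerate([sample_1, sample_2], start=1):
    w, p = stats.shapiro(s)
    print(f"Sample {i}: Shapiro-Wilk p = {p:.3f}")

# Levene: small p-values suggest unequal variances.
lev_stat, lev_p = stats.levene(sample_1, sample_2)
print(f"Levene p = {lev_p:.3f}")
```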

ESTIMATION THEORY.

Estimation theory is a branch of statistics and signal processing that deals with estimating the
values of parameters based on measured/empirical data. The parameters describe the physical
scenario or object that answers a question posed by the estimator. The entire purpose of
estimation theory is to arrive at an estimator, preferably an implementable one that could actually
be used; the estimator takes the measured data as input and produces an estimate of the
parameters.

It is also preferable to derive an estimator that exhibits optimality: an optimal estimator extracts
all the information available in the measured data.

In statistics, an estimator is a function of the observable sample data that is used to estimate
unknown population parameters. An estimate is the result of applying that function to a particular
set of data.

Estimation is the process of inferring or estimating a population parameter from the corresponding
statistic of a sample drawn from the population. To be valid, estimation must be based on a
representative sample.

To estimate a parameter of interest (a population mean, a proportion, a difference between
population means, a ratio of two population standard deviations), the usual procedure is as
follows:

 Select a random sample from the population of interest.
 Calculate the point estimate of the parameter.
 Calculate a measure of its variability, often a confidence interval, and associate it with the
estimate.
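A minimal sketch of this procedure for a population mean is given below, assuming a small hypothetical sample; it produces the point estimate and a 95% confidence interval based on the t distribution.

```python
# Point estimate and confidence interval for a mean; the sample is hypothetical.
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.6, 12.0, 11.5, 12.3, 12.2, 11.9])

point_estimate = sample.mean()                    # point estimate of the mean
se = sample.std(ddof=1) / np.sqrt(len(sample))    # standard error of the mean

# 95% confidence interval using the t distribution with n-1 degrees of freedom.
ci = stats.t.interval(0.95, len(sample) - 1, loc=point_estimate, scale=se)

print(f"Point estimate: {point_estimate:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```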

There are two types of estimates.

Point estimate: a single value or number, generated from a representative sample, that yields the
best single estimate of the population parameter.

Interval estimate: a range of values, together with the probability (confidence level) that the
interval includes the unknown population parameter.

An estimator is a rule/function/formula/procedure that tells how to go about estimating a
population quantity (e.g. the population mean).

An estimate is the numerical value taken by an estimator.

Properties of estimators

 Consistency. A consistent estimator converges in probability to the quantity being
estimated as the sample size grows.
 Unbiasedness. An estimator is unbiased if the mean (expected value) of its sampling
distribution equals the parameter it is estimating.
 Efficiency. Efficient estimators are those that have the lowest possible variance among all
unbiased estimators.
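These properties can be illustrated by simulation. The sketch below assumes a made-up normal population with true mean 10: the average of many sample means stays near the true mean (unbiasedness), while the variance of the estimates shrinks as n grows (consistency).

```python
# Simulating the sample mean as an estimator; the population is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0

for n in (10, 100, 1000):
    # 5000 independent samples of size n; take the mean of each sample.
    estimates = rng.normal(loc=true_mean, scale=2.0, size=(5000, n)).mean(axis=1)
    print(f"n={n:5d}: mean of estimates = {estimates.mean():.3f}, "
          f"variance of estimates = {estimates.var():.4f}")
```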

NONPARAMETRIC TESTS

Introduction:

Inferential statistical methods are divided into parametric methods and nonparametric methods.

Parametric methods deal with the study of the sampling distributions of sample statistics such as
the mean x̄ and variance s². These are studied with the aim of testing hypotheses concerning
population parameters, such as the mean μ or variance σ², of the phenomenon being studied.

Nonparametric techniques:

Nonparametric tests can be used even with nominal or ordinal data and do not make assumptions
about the values of the population parameters or about the shape of the sampled population. As
such, nonparametric tests are considered distribution-free statistical techniques, in which the
focus of inference is whether or not a given sample characterizes a specified population. However,
sample observations for nonparametric testing must be drawn at random so that the resulting
errors are uncorrelated.

The general steps in conducting a statistical test are:

a) Formulate the null and alternative hypotheses.
b) Identify the test technique, usually z, t, chi-square, F, or binomial.
c) Identify the test statistic and its sampling distribution.
d) Specify the level of significance, usually 0.05.
e) Specify the decision condition: reject or do not reject.
f) Compare the experimental value of the test statistic with the theoretical value from its
sampling distribution.
g) Present findings and conclusions about the null hypothesis being tested.

Nonparametric techniques are broadly divided into sign, rank-sum, and randomness tests.

SIGN TESTS

Sign tests are employed whenever data can be categorized into two groups:
bad versus good, high versus low, positive versus negative, or plus versus minus. In short,
sign tests are applied whenever the data (nominal, ordinal, interval, or ratio) have been
signed, so that we deal with the "sign" of the observations rather than their values. The
test statistic considered in this nonparametric technique is the number of signs, or
counts, or frequencies. Thus the sampling distribution of the number of positive or negative
signs obtained in a random trial is studied, and an inference on the null hypothesis is
reached.

RANDOMNESS TESTS

In all the statistical tests discussed above, there is one major assumption: that the
study sample is random. How is the randomness of a sample ascertained? That is, given a
sequence of events or observations, how can we check that they constitute a random
phenomenon? The answer to such questions is based on the number of times there is a
change in the sequence of the events or observations. Each unbroken succession of
identical events or observations is called a run.

Definition:
A run is a succession of identical events or attributes, which may be represented by letters
or other symbols, followed by a different succession of events or attributes or by no event
at all.

Thus, a run ends with a change in the sequence of identical events. The runs test studies a
sequence of events or observations in which each element may assume one of two
outcomes, such as success versus failure, good versus bad, large versus small, non-
defective versus defective, or below versus above some specified attribute index. The runs
test is thus applied to a sequence of n₁ successes and n₂ failures. In a manufacturing
process, for instance, it is of great interest to know the extent to which sequences of
defective and non-defective items are randomly produced. Non-randomness of the
sequence of defective items implies lack of process control.

The hypothesis being tested is that the sample is random; its alternative is that the sample
is non-random. The number of runs, or identical successions of events or attributes in the
sampled observations, tests the randomness of the sample. If the null hypothesis is true,
the number of runs R has the following asymptotic parametric characteristics:

The expected value of R is

E(R) = 2n₁n₂/(n₁ + n₂) + 1

and the variance of R is

σ²_R = [2n₁n₂(2n₁n₂ − n₁ − n₂)] / [(n₁ + n₂)²(n₁ + n₂ − 1)].

Consequently, the test statistic is Z = (R − E(R))/σ_R, and smaller values of |Z| indicate
that the sequence of events or observations in the sample constitutes a random
phenomenon. For small samples, the probability distribution of the total number of runs R
in samples of sizes n₁ and n₂ is tabulated, and a critical value R₀ can be obtained such
that P(R ≤ R₀) = α.

Example 1:

Consider the following arrangement of healthy (H) and diseased (D) pine trees observed
next to each other, in the given order, by a Forestry Service crew. Is there statistical
evidence to show that the observed distribution of pine trees constitutes a random
phenomenon?

Solution:

1. H₀: The distribution of pine trees is random.
   H₁: The distribution of pine trees is non-random.

2. The appropriate nonparametric test statistic is the number of runs R, whose expected
   value is E(R) = 14.33 with standard deviation σ_R = 2.38, giving Z = −3.08.

3. The level of significance is α = 0.05.

4. Decision rule: for small samples, the critical value of R is obtained from statistical
   tables of the probability distribution of the total number of runs R such that
   P(R ≤ R₀) = α. For large samples, the statistic Z has a standard normal distribution,
   so H₀ is rejected when |Z| > 1.96.

5. Conclusion: The null hypothesis is rejected, implying that the distribution of pine trees is
   non-random. There is a real systematic factor behind the observed distribution of the
   pine trees, such as tree disease.

The median test for randomness

In the example above, the dividing line between attributes was subjective. In other
cases the median, or some other test value index, is used to classify the two
outcomes in a runs test, for instance below the median versus above the median.
Thus, we may use event A for an outcome that is below the median and event B for
an outcome that is above the median. The sequence of these designated events
then becomes the subject of analysis.

Example 2: The following data pertain to increases in the pulse rate of astronauts. Is the
sequence random?

26, 20, 17, 22, 23, 21, 25, 30, 14, 12, 21, 19, 25, 14, 18.

Solution:

H₀: The distribution of pulse rates is random.
H₁: The distribution of pulse rates is non-random.

The level of significance is given as 5%.

The test statistic is based on R, the number of runs of pulse rates below and above
the median.

Computation: The median pulse rate is 21.

Let A represent the event that a pulse rate is below the median and B the event that
a pulse rate is above the median; the two observations equal to the median (21) are
dropped, leaving 13 observations.

The data are then represented as the sequence of events BAABBBBAAABAA. The
number of runs in this sequence is R = 6, counted as follows: B, AA, BBBB, AAA, B,
AA. Given that n₁ = 7 (below the median) and n₂ = 6 (above), then:

E(R) = 7.46 and σ²_R = 2.94, so σ_R = 1.71 and Z = (6 − 7.46)/1.71 = −0.85.

Conclusion: Since |Z| < 1.96, the null hypothesis is not rejected, implying that the
pulse rate data are random; there is no systematic factor behind the pulse rates of
the astronauts.
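The sketch below reproduces this median runs test on the pulse-rate data, computing R, E(R), the variance of R, and Z directly from the formulas given earlier.

```python
# Median runs test on the astronaut pulse-rate data from Example 2.
import math

pulse = [26, 20, 17, 22, 23, 21, 25, 30, 14, 12, 21, 19, 25, 14, 18]
median = sorted(pulse)[len(pulse) // 2]            # 21 (8th of 15 sorted values)

# Label each value A (below median) or B (above); drop values equal to the median.
seq = ['A' if x < median else 'B' for x in pulse if x != median]

runs = 1 + sum(seq[i] != seq[i - 1] for i in range(1, len(seq)))
n1, n2 = seq.count('A'), seq.count('B')            # 7 and 6

e_r = 2 * n1 * n2 / (n1 + n2) + 1
var_r = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
z = (runs - e_r) / math.sqrt(var_r)

print(f"R = {runs}, E(R) = {e_r:.2f}, var(R) = {var_r:.2f}, Z = {z:.2f}")
# |Z| < 1.96, so H0 (randomness) is not rejected.
```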

RANK-SUM TESTS

Rank-sum nonparametric tests rely on ranking the sample observations and are
therefore applicable to data measured at the ordinal level or above. The statistic to
be studied is the rank (rather than the value) of the sampled observations. Thus, the
focus in rank-sum tests is the sampling distribution of the sum of ranks. The analytical
framework for rank-sum nonparametric tests varies from one-way ANOVA
(for completely randomized designs) to two-way ANOVA (for randomized block
designs). The null hypothesis being tested also varies with the research
design used.

a) Wilcoxon test for two paired (correlated) samples

This statistical test is typically used to assess the extent to which two samples come from
identical populations. The focus of the test is on the analysis of data generated under before-
after or with-without research designs, hence the terms paired and correlated samples.
Note that the analysis of such research designs can also be done by the binomial test discussed
under sign tests. Thus, the Wilcoxon test provides an alternative to the binomial test for data
measured at the ordinal level or above.

H₀: The two samples come from identical populations.
H₁: The two samples come from different populations.

Given the context of the research designs typically associated with before-after, with-without,
or two treatments applied to homogeneous experimental units, the focus of the Wilcoxon test
is to answer questions such as:

 Are there significant differences in the subjects' responses before and after the
treatment?
 Are there significant differences in the subjects' responses with and without the
treatment?
 Are the two treatments statistically different?

The statistic used to test the null hypothesis is the sum of the ranks of the differences of
the paired observations, which are ranked while retaining their positive/negative sign
identity. If the null hypothesis is true, T = min(R⁺, R⁻) has the following asymptotic
parametric characteristics:

mean E(T) = n(n + 1)/4 and variance V(T) = n(n + 1)(2n + 1)/24.

Hence, the test statistic for the Wilcoxon nonparametric test is

Z = (T − E(T))/σ_T.

The acceptance region for the null hypothesis at the 5% level is −1.96 < Z < 1.96; the decision
rule is to reject H₀ if |Z| > 1.96. For small samples, however, critical values of T are tabulated for
each relevant level of significance and sample size n.
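A minimal sketch of the Wilcoxon test on hypothetical paired before/after data follows; scipy.stats.wilcoxon reports the T statistic (the smaller of the two signed-rank sums) together with a p-value, using an exact distribution for small samples.

```python
# Wilcoxon signed-rank test for paired samples; the data are hypothetical.
from scipy.stats import wilcoxon

before = [72, 65, 80, 74, 68, 77, 70, 66, 79, 73]
after  = [75, 70, 78, 79, 72, 81, 74, 70, 83, 76]

# wilcoxon ranks the paired differences, retaining their signs,
# and returns T = min(R+, R-) with a p-value.
t_stat, p_value = wilcoxon(before, after)
print(f"T = {t_stat}, p = {p_value:.4f}")
# Reject H0 (identical populations) when p falls below the significance level.
```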

b) Kruskal-Wallis test

The Kruskal-Wallis test is used to ascertain the extent to which k independent random samples
come from identical populations.

H₀: The k samples come from identical populations.
H₁: The k samples come from different populations.

In a way, the test is an alternative to the one-way analysis of variance for completely randomized
research designs. The test is based on a combined ranking procedure that retains sample
identity, as in the Mann-Whitney test. However, the test statistic in the Kruskal-Wallis procedure
is the variation of the ranks, which is computed as:

H = [12/(n(n + 1))] Σᵢ (Rᵢ²/nᵢ) − 3(n + 1),

where Rᵢ = the sum of the ranks assigned to the observations in the i-th sample, nᵢ = the size of
the i-th sample, and n = Σ nᵢ = n₁ + n₂ + ... + nₖ.

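A minimal sketch of the Kruskal-Wallis test on three hypothetical independent samples is shown below; scipy.stats.kruskal computes the H statistic defined above (with a correction for tied ranks) and a chi-square based p-value.

```python
# Kruskal-Wallis test for k = 3 independent samples; the data are hypothetical.
from scipy.stats import kruskal

group_1 = [27, 31, 29, 35, 33]
group_2 = [40, 38, 44, 41, 39]
group_3 = [30, 34, 32, 36, 28]

# kruskal ranks all observations jointly, retaining sample identity,
# and computes the H statistic with a tie correction.
h_stat, p_value = kruskal(group_1, group_2, group_3)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
# Small p-values indicate that not all k populations are identical.
```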