MCOM Statistical Analysis Notes
INTRODUCTION TO BUSINESS RESEARCH
STRUCTURE
1.0 Objectives
1.1 Introduction
1.2 Meaning of Research
1.3 Meaning of Science
1.4 Knowledge and Science
1.5 Inductive and Deductive Logic
1.6 Significance of Research in Business
1.7 Types of Research
1.8 Methods of Research
1.8.1 Survey Method
1.8.2 Observation Method
1.8.3 Case Method
1.8.4 Experimental Method
1.8.5 Historical Method
1.8.6 Comparative Method
1.9 Difficulties in Business Research
1.10 Business Research Process
1.11 Let Us Sum Up
1.12 Key Words
1.13 Answers to Self Assessment Exercises
1.14 Terminal Questions
1.15 Further Reading
1.0 OBJECTIVES
After studying this unit, you should be able to:
• explain the meaning of research,
• differentiate between Science and Knowledge,
• distinguish between inductive and deductive logic,
• discuss the need for research in business,
• classify research into different types,
• narrate different methods of research,
• list the difficulties in business research, and
• explain the business research process and its role in decision making.
1.1 INTRODUCTION
Research is a part of any systematic knowledge. It has occupied the realm of
human understanding in some form or other since time immemorial. The thirst
for new areas of knowledge and the human urge to find solutions to problems
have developed in human beings a faculty for search, research and re-research.
Research has now become an integral part of all areas of human activity.
Geektonight Notes
Research and Data Collection

Research in common parlance refers to a search for knowledge. It is an
endeavour to discover answers to problems (of an intellectual and practical
nature) through the application of scientific methods. Research, thus, is
essentially a systematic inquiry seeking facts (truths) through objective,
verifiable methods, in order to discover the relationships among them and to
deduce from them broad conclusions. It is thus a method of critical thinking.
Any type of organisation in the globalised environment needs a systematic
supply of information, coupled with tools of analysis, for making sound
decisions involving minimum risk. In this Unit, we will discuss at length the
need and significance of research, types and methods of research, and the
research process.
L.V. Redman and A.V.H. Mory, in their book The Romance of Research,
defined research as “a systematized effort to gain new knowledge”.

“A careful investigation or inquiry specially through search for new facts in any
branch of knowledge” (Advanced Learner’s Dictionary of Current English).
• Concepts mean the terms designating things, and perceptions about them, of
which science tries to make sense. Examples: velocity, acceleration, wealth,
income.
• Symbols may be signs such as +, –, ÷, ×, x̄, σ, Σ, etc.
• Manipulation of a ball or a vaccine means asking: when the ball is kept on
different degrees of incline, how and at what speed does it move? When the
vaccine is used, not used, used at different intervals, or used in different
quantities (doses), what are the effects?
ii) Manipulation is for the purpose of generalizing
The purpose of research is to arrive at generalization, i.e., at statements of
generality, so that prediction becomes easy. The generalization or conclusion of
an enquiry tells us what to expect of a class of things under a class of conditions.
Examples: The debt repayment capacity of farmers decreases during
drought years.
When price increases, demand falls.
Advertisement has a favourable impact on sales.
iii) The purpose of research (or generalization) is to extend, correct or
verify knowledge
Generalization has in turn certain effects on the established corpus or body of
knowledge. It may extend or enlarge the boundaries of existing knowledge by
removing inconsistencies if any. It may correct the existing knowledge by
pointing out errors if any. It may invalidate or discard the existing knowledge
which is also no small achievement. It may verify and confirm the existing
knowledge which also gives added strength to the existing knowledge. It may
also point out the gaps in the existing corpus of knowledge requiring attempts to
bridge these gaps.
At one time the word science was used to denote all systematic studies or
organized bodies of knowledge. Let us see some definitions.
– “Science means a branch of (accumulated) knowledge.” In this sense it refers
to a particular field or branch of knowledge such as Physics, Chemistry or
Economics.
– “The systematized knowledge about things or events in nature is called
Science”.
– “Science is popularly defined as an accumulation of systematic knowledge”
(Goode & Hatt).
In these definitions the words ‘systematic’ and ‘knowledge’ are very important.
‘Knowledge’ refers to the goal of science, while ‘systematic’ refers to the
method used to reach that goal. Nowadays the stress is on the method rather
than on the knowledge. See the following definitions:
– Knowledge not of things but of their relations.
– Science is a process which makes knowledge.
– It is the approach rather than the content that is the test of science.
– Science is a way of investigation.
– Science is a way of looking at the World.
– “The unity of all sciences consists alone in its methods, not in its material”
(Karl Pearson).
From the above definitions two broad views emerge: (a) Science as organized
or accumulated knowledge; (b) Science as a method or process leading to
knowledge. View (a) is a STATIC view, whereas (b) is a DYNAMIC view.
The view that Science is a method rather than a field of specific subject matter
is the more popular one.
Knowing has an external reference, which may be called a fact. A fact is
anything that exists or can be conceived of. A fact is neither true nor false; it
is what it is. What we claim to know is belief or judgement. But not every
belief can be equated with knowledge, because some of our beliefs, even those
that seem true, may turn out to be false on verification. Knowledge, therefore,
is a matter of degree. However, knowledge need not always be private or
individual. Private knowledge may be transformed into public knowledge by the
application of certain scientific and common-sense procedures.
We have shown that knowledge requires explanations, and these come from
Science. Knowledge and Science are not necessarily synonymous. Science
implies knowledge, but the converse is not true. Therefore, we can say that “all
Sciences are knowledge, but all knowledge is not science”. Scientific knowledge
is unified, organized and systematic, while ordinary knowledge is a jumble of
isolated and disconnected facts. Science applies special means and methods to
render knowledge true and exact, whereas ordinary knowledge rests on
observations which are not methodical. Scientific knowledge and ordinary
knowledge differ not in kind but only in degree: scientific knowledge is more
specialized, exact and organized than ordinary knowledge.
4) What is a fact?
..................................................................................................................
..................................................................................................................
..................................................................................................................
Deduction, on the other hand, is a way of making a particular inference from
a generalization. Deduction is a movement of knowledge from a general rule to
a particular case. For example, ‘All men are mortal’ is a general rule. Ranjit is
a man. Therefore, from the general rule it can be deduced that Ranjit is also
mortal. Similarly, ‘All M.Com. degree holders are eligible for a Ph.D. in
Commerce’ is a general statement. Praneeth is an M.Com. degree holder.
Therefore, it can be deduced that Praneeth is eligible for a Ph.D. in Commerce.
Empirical studies have a great potential, for they lead to inductions and
deductions. Research enables one to develop theories and principles, on the one
hand, and to arrive at generalizations on the other. Both are aids to acquisition
of knowledge.
i) Industrial and economic activities have assumed huge dimensions. The size of
modern business organizations indicates that managerial and administrative
decisions can affect vast quantities of capital and a large number of people.
Trial-and-error methods are not appreciated, as mistakes can be tremendously
costly. Decisions must be quick but accurate, timely and objective, i.e. based
on facts and realities. Against this backdrop, business decisions nowadays are
mostly influenced by research and research findings. Thus, research helps in
quick and objective decision making.
ii) Research, being a fact-finding process, significantly influences business
decisions. The business management is interested in choosing that course of
action which is most effective in attaining the goals of the organization.
Research not only provides facts and figures to support business decisions but
also enables the business to choose one which is best.
iii) A considerable number of business problems are now given quantitative
treatment with some degree of success with the help of operations research.
Research into management problems may result in certain conclusions by
means of logical analysis which the decision maker may use for his action or
solution.
iv) Research plays a significant role in the identification of a new project, project
feasibility and project implementation.
v) Research helps the management to discharge its managerial functions of
planning, forecasting, coordinating, motivating, controlling and evaluation
effectively.
vi) Research facilitates the process of thinking, analysing, evaluating and
interpreting the business environment and various business situations and
alternatives, so as to be helpful in the formulation of business policy and
strategy.
vii) Research and Development (R&D) helps discovery and invention.
Developing new products or modifying the existing products, discovering new
uses, new markets etc., is a continuous process in business.
viii) The role of research in functional areas like production, finance, human
resource management and marketing needs no special emphasis. Research not
only establishes relationships between different variables within each of these
functional areas, but also between the functional areas themselves.
ix) Research is a must in the production area. Product development, new and
better ways of producing goods, invention of new technologies, cost reduction,
improving product quality, work simplification, performance improvement,
process improvement etc., are some of the prominent areas of research in the
production area.
x) The purchase/material department uses research to frame alternative suitable
policies regarding where to buy, when to buy, how much to buy, and at what
price to buy.
xi) Closely linked with the production function is the marketing function. Market
research and marketing research provide a major part of the marketing
information which influences inventory and production levels. Marketing
research studies include problems and opportunities in the market, product
preference, sales forecasting, advertising effectiveness, product distribution,
after-sales service, etc.
xii) In the area of financial management, maintaining liquidity and profitability
through proper funds management and assets management is essential.
Optimum capital mix, matching of fund inflows and outflows, cash flow
forecasting, cost control, pricing, etc., require some sort of research and
analysis. Financial institutions (banking and non-banking) have also found it
essential to set up research divisions for collecting and analysing data, both for
internal purposes and for making in-depth studies of the economic conditions of
business and people.
xiii) In the area of human resource management, personnel policies have to be
guided by research. An individual’s motivation to work is associated with his
needs and their satisfaction. An effective human resource manager is one who
can identify the needs of his workforce and formulate personnel policies to
satisfy them, so that employees are motivated to contribute their best to the
attainment of organizational goals. Job design, job analysis, job assignment,
scheduling of work breaks, etc., have to be based on investigation and analysis.
xiv) Finally, research in business is a must for continuously updating its
attitudes, approaches, products, goals, methods and machinery in accordance
with the changing environment in which it operates.
It should be remembered that good research uses a number of types, methods
and techniques. Hence rigid classification is impossible. The following is only
an attempt to classify research into different types.
a) Life and physical sciences such as Botany, Zoology, Physics and Chemistry.
b) Social Sciences such as Political Science, Public Administration, Economics,
Sociology, Commerce and Management.
Research in these fields is also broadly referred to as life and physical science
research and social science research. Business education covers both
Commerce and Management, which are part of Social sciences. Business
research is a broad term which covers many areas.
Business Research
testing, sales analysis, market surveys, test marketing, consumer behaviour
studies, marketing information systems, etc.
a) One-time or single-period research, e.g. one year or a point of time. Most
sample studies and diagnostic studies are of this type.
b) Longitudinal research, e.g. several years or several time periods (a time
series analysis), such as industrial development during the five-year plans in
India.
viii) According to the Purpose of the Study
What is the purpose/aim/objective of the study? Is it to describe, analyze,
evaluate or explore? Accordingly, the studies are known as descriptive,
analytical, evaluative or exploratory studies.
5) List the various types of studies according to the purpose of the study.
..................................................................................................................
..................................................................................................................
..................................................................................................................
1) Survey Method
2) Observation Method
3) Case Method
4) Experimental Method
5) Historical Method
6) Comparative Method
i) It involves not only seeing and viewing, but also hearing and perceiving.
ii) It is both a physical and a mental activity. The observing eye catches many
things which are sighted, but attention is also focused on data that are relevant
to the problem under study.
iii) It captures the natural social context in which the person’s behaviour occurs.
iv) Observation is selective: the investigator does not observe everything but
selects the range of things to be observed depending upon the nature, scope
and objectives of the study.
v) Observation is not casual but purposive: it is made for the purpose of noting
things relevant to the study.
vi) The investigator first of all observes the phenomenon and then gathers and
accumulates data.
Case Study is one of the popular research methods. A case study aims at
studying every thing about something rather than something about everything. It
examines complex factors involved in a given situation so as to identify causal
factors operating in it. The case study describes a case in terms of its
peculiarities and typical or extreme features. It also helps to secure a fund of
information about the unit under study. It is a most valuable method of study
for diagnostic and therapeutic purposes.
Sociology. Experimentation is a research process used to observe cause-and-effect
relationships under controlled conditions. In other words, it aims at studying
the effect of an independent variable on a dependent variable, keeping the
other relevant variables constant through some type of control. In
experimentation, the researcher can manipulate the independent variable and
measure its effect on the dependent variable. The main features of the
experimental method are:
The contrast between the field experiment and the laboratory experiment is not
sharp; the difference is a matter of degree. The laboratory experiment has a
maximum of control, whereas the field experiment must operate with less
control.
In historical research, primary and also secondary sources of data can be used.
A primary source is the original repository of a historical datum, such as an
original record kept of an important occasion, an eyewitness description of an
event, inscriptions on copper plates or stones, monuments and relics,
photographs, minutes of organization meetings, or documents. A secondary
source is an account or record of a historical event or circumstance that is one
or more steps removed from the original repository. If, for example, instead of
the minutes of the meeting of an organization one uses a newspaper account of
the meeting, that account is a secondary source.
For historical data, only authentic sources should be depended upon, and their
authenticity should be tested by checking and cross-checking the data from as
many sources as possible. It is often of considerable interest to use time series
data for assessing progress or for evaluating the impact of policies and
initiatives. This can be meaningfully done with the help of historical data.
The origin and the development of human beings, their customs, their
institutions, their innovations and the stages of their evolution have to be traced
and established. The scientific method by which such developments are traced
is known as the Genetic method and also as the Evolutionary method. The
science which appears to have been the first to employ the Evolutionary
method is comparative philology. It is employed to “compare” the different
languages in existence, to trace the history of their evolution in the light of such
similarities and differences as the comparisons disclosed. Darwin’s famous work
“Origin of Species” is the classic application of the Evolutionary method in
comparative anatomy.
into every scientific method. Classification requires careful comparison, and
every other method of science depends upon a precise comparison of
phenomena and the circumstances of their occurrence. All methods are,
therefore, “comparative” in a wider sense.
xi) Poor library facilities at many places, because of which researchers have to
spend much of their time and energy in tracing out the relevant material and
information.
xii) Many researchers in our country also face the difficulty of inadequate
computing and secretarial assistance, because of which they have to take more
time to complete their studies.
xiv) Social research, especially managerial research, relates to human beings
and their behaviour. The observations, the data collected and the conclusions
drawn must be valid. There is also the problem of conceptualization of these
aspects.
xv) Another difficulty in the research arena is that there is no code of conduct for
the researchers. There is need for developing a code of conduct for
researchers to educate them about ethical aspects of research, maintaining
confidentiality of information etc.
In spite of all these difficulties and problems, a business enterprise cannot avoid
research, especially in the fast changing world. To survive in the market an
enterprise has to continuously update itself, it has to change its attitudes,
approaches, products, technology, etc., through continuous research.
5) List out five important difficulties faced by business researchers in India.
..................................................................................................................
..................................................................................................................
..................................................................................................................
Specifically, aspects (i) to (iv) are covered in unit-2, aspects (v) to (viii)
are covered in units 3,4 and 5, processing and presentation aspects of
(ix) are discussed in units 6 & 7, and analytical tools and techniques of
data analysis of (ix) are elaborated in units 8 to 17, interpretation
aspects of (x) are discussed in unit 18 and reporting aspects in unit 19.
Therefore, the above aspects are not elaborated in this unit.
Empirical studies have a great potential, for they lead to inductions and
deductions. Induction is the process of reasoning to arrive at generalizations
from particular facts. Deduction is a way of making a particular inference from
a generalization.
Research can be classified into different types for the sake of better
understanding. Several bases can be used for this classification, such as branch
of knowledge, nature of data, coverage, application, place of research, research
methods used and time frame, and the research is named accordingly.
The research has to provide answers to the research questions raised. For this
the problem has to be investigated and relevant data has to be gathered. The
procedures adopted for obtaining the data and information are described as
methods of research. There are six methods viz., Survey, Observation, Case,
Experimental, Historical and Comparative methods.
The business researcher in India has to face certain difficulties such as lack of
scientific research training, paucity of competent researchers and research
supervisors, non-encouragement of research by business organizations, small
business organizations are not able to afford R & D departments, lack of
scientific orientation in business management, insufficient interaction between
industry and university, funding problems, poor library facilities, delayed
availability of published data etc.
7) What are the bases used for classifying research into different types?
8) List the various methods of research.
9) Distinguish between qualitative and quantitative data.
10) What are the stages in the business research process?
B. Essay Type Questions:
1) Define the concept of research and analyze its characteristics.
2) Define the term Science and distinguish it from knowledge.
3) Explain the significance of business research.
4) Write an essay on various types of research.
5) What do you mean by a method of research? Briefly explain different
methods of research.
6) Explain the significance of research in various functional areas of business.
7) What is Survey Research? How is it different from Observation Research?
8) Write short note on:
a) Case Research
b) Experimental Research
c) Historical Research
d) Comparative Method of research
9) What are the difficulties faced by researchers of business in India?
10) What is meant by the business research process? What are the various
stages/aspects involved in the research process?
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
2.0 OBJECTIVES
After studying this unit, you should be able to:
2.1 INTRODUCTION
In unit 1, we have discussed the meaning and significance of business research,
types of research, methods of conducting research, and the business research
process. There we have shown that the research process begins with the
raising of a problem, leading to the gathering of data, their analysis and
interpretation and finally ends with the writing of the report. In this unit, we
propose to give complete coverage of the selection and specification of the
research problem, the formulation of research objectives/hypotheses, and the
design of the action plan of research. We will now dwell in detail on these
aspects, along with the associated features which are interwoven with research
problem and hypothesis formulation and testing.
Let us now discuss some considerations for the selection of a research problem.
The topic of study may be selected by an individual researcher having
intellectual or scientific interests. The researcher may be interested in exploring
some general subject matter about which relatively little is known, purely out
of scientific curiosity. A person may also be interested in a phenomenon which
has already been studied in the past, but where conditions now appear
different and therefore require further examination. A person may also be
interested in a field in which there is a highly developed theoretical system, but
where there is a need to retest the old theory on the basis of new facts, so as
to test its validity in the changed circumstances.
2) Day-to-Day Problems: A research problem can come from the day-to-day
experience of the researcher. Everyday problems constantly present something
new and worthy of investigation, and it depends on the keenness of
observation and sharpness of intellect of the researcher to knit his daily
experience into a research problem. For example, a person who travels in city
buses every day finds it a problem to get into or out of the bus, and a queue
system (that is the answer to the problem) facilitates boarding and alighting
comfortably.
The topic or problem which the researcher selects among the many possibilities
should meet certain requirements. Every problem selected for research must
satisfy the following criteria.
1) The topic selected should be original or at least less explored. The purpose
of research is to fill the gaps in existing knowledge or to discover new facts
and not to repeat already known facts. Therefore, a preliminary survey of the
existing literature in the proposed area of research should be carried out to find
out the possibility of making an original contribution. Knowledge about previous
research will serve at least three purposes.
a) It will enable the researcher to identify his specific problem for research.
b) It will eliminate the possibility of unnecessary duplication of effort, and
c) It will give him valuable information on the merits and limitations of various
research techniques which have been used in the past.
2) It should be of significance and socially relevant and useful.
3) It should be interesting to the researcher and should fit into his aptitude.
The research problem should define the goal of the researcher in clear terms.
It means that along with the problem, the objective of the proposal should
adequately be spelled out. Without a clear cut idea of the goal to be reached,
research activities would be meaningless.
It should be remembered that there must be at least two means available to the
research consumer. If he/she has no choice of means, he/she cannot have a
problem.
The selection of a topic for research is only half a step forward. This general
topic does not help a researcher to see what data are relevant to his/her
purpose. What methods would he/she employ in securing them? And how
should these be organized? Before he/she can consider all these aspects, he/she
has to formulate a specific problem by making its various components (as
explained above) explicit.
1) What do you want to know? (What is the problem? What are the questions
to be answered?)
3) How do you want to answer or solve it? (What methodology do we want to
adopt to solve it?)
activity. We have to identify the goal or goals to be achieved, and they must
be specified in order to give direction to the research study. Hence, the
formulation of research objectives is equally important. Once the research
objectives are stated, the entire research activity will be geared to achieving
those objectives.
For example, we intend to examine the working of a Regulated Agricultural
Market in a town to know whether it is fulfilling the objectives for which it has
been set up. For this study, we will gather all the relevant information/data such
as arrivals of different commodities, sources and uses of funds, facilities
provided in the market, users’ opinions, etc. Similarly, if we are clear about
what we want from the research exercise, then the rest of the things, such as
identifying sources of data, instruments of data collection and tools for
analyzing the data, will depend upon the objectives. However, the objectives of
the study must be clear, specific and definite.
2.4 HYPOTHESIS
We know that research begins with a problem or a felt need or difficulty. The
purpose of research is to find a solution to the difficulty. It is desirable that the
researcher should propose a set of suggested solutions or explanations of the
difficulty which the research proposes to solve. Such tentative solutions
formulated as a proposition are called hypotheses. The suggested solutions
formulated as hypotheses may or may not be the real solutions to the problem.
Whether they are or not is the task of research to test and establish.
“It is a proposition which can be put to a test to determine its validity” (Goode
and Hatt).
vi) Statistical Hypothesis: Statistical hypotheses are statements derived from a
sample. They are quantitative in nature and numerically measurable. For
example: the market share of product X is 70%; the average life of a tube
light is 2,000 hours.
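A statistical hypothesis of this kind can be put to a formal test once sample data are in hand. The sketch below is illustrative only: the eight lifetime figures are invented for the example, not taken from the text. It computes a one-sample t statistic for the claim that the average tube light life is 2,000 hours, using only the sample mean and standard deviation.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic for testing H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return (mean - mu0) / (sd / math.sqrt(n))

# Hypothetical lifetimes (in hours) of 8 sampled tube lights
lifetimes = [1980, 2050, 1910, 2105, 1995, 1875, 2040, 1965]

t = one_sample_t(lifetimes, 2000)
print(f"sample mean = {statistics.fmean(lifetimes):.1f} hours, t = {t:.3f}")
# With 7 degrees of freedom, |t| must exceed about 2.365 to reject H0 at the
# 5% level; here |t| is well below that, so the 2,000-hour claim survives.
```

The critical value (2.365 for 7 degrees of freedom at the 5% level) comes from standard t-tables; in practice a statistics library would also report the p-value directly.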
2.4.3 Criteria for Workable Hypothesis
A hypothesis controls and directs the research study. When a problem is felt,
we require the hypothesis to explain it. Generally, there is more than one
hypothesis which aims at explaining the same fact. But all of them cannot be
equally good. Therefore, how can we judge a hypothesis to be true or false,
good or bad? Agreement with facts is the sole and sufficient test of a true
hypothesis. Therefore, certain conditions can be laid down for distinguishing a
good hypothesis from bad ones. The formal conditions laid down by thinkers
provide the criteria for judging a hypothesis as good or valid. These conditions
are as follows:
(viii) A hypothesis should be related to available techniques: If tools and
techniques are not available, we cannot test the hypothesis. Therefore, the
hypothesis should be formulated only after due thought has been given to the
methods and techniques that can be used to measure the concepts and
variables related to the hypothesis.
2.4.4 Stages in Hypothesis
There are four stages. The first stage is the feeling of a problem: the
observation and analysis of the researcher reveal certain facts, and these facts
pose a problem. The second stage is the formulation of a hypothesis or
hypotheses: a tentative supposition or guess is made to explain the facts which
call for an explanation. At this stage some past experience is necessary to pick
up the significant aspects of the observed facts; without previous knowledge,
the investigation becomes difficult, if not impossible. The third stage is the
deductive development of the hypothesis: the researcher uses the hypothesis as
a premise and draws a conclusion from it. And the last stage is
the verification or testing of hypothesis. This consists in finding whether the
conclusion drawn at the third stage is really true. Verification consists in finding
whether the hypothesis agrees with the facts. If the hypothesis stands the test
of verification, it is accepted as an explanation of the problem. But if the
hypothesis does not stand the test of verification, the researcher has to search
for further solutions.
To explain the above stages, let us consider a simple example. Suppose you
have started from your home for college on your scooter. A little while later
the engine of your scooter suddenly stops. What can be the reason? Why has
it stopped? From your past experience, you guess that such problems generally
arise due to either petrol or the spark plug. You then deduce that the cause
could be: (i) the petrol knob is not on, (ii) there is no petrol in the tank, or
(iii) the spark plug has to be cleaned. You then verify these one after another
to solve the problem. First, see whether the petrol knob is on. If it is not,
switch it on and start the scooter. If it is already on, see whether there is
petrol by opening the lid of the petrol tank. If the tank is empty, go to the
nearby petrol bunk to fill it. If there is petrol in the tank, that is not the
reason, so you verify the spark plug. You clean the plug and fit it, and the
scooter starts. That means the problem was with the spark plug. You have
identified it and got the answer: your problem is solved.
When the hypothesis has been framed in the research study, it must be verified
as true or false. Verifiability is one of the important conditions of a good
hypothesis. Verification of hypothesis means testing of the truth of the
hypothesis in the light of facts. If the hypothesis agrees with the facts, it is said
to be true and may be accepted as the explanation of the facts. But if it does
not agree it is said to be false. Such a false hypothesis is either totally rejected
or modified. Verification is of two types viz., Direct verification and Indirect
verification.
Indirect verification is a process in which certain possible consequences are
deduced from the hypothesis and they are then verified directly. Two steps are
involved in indirect verification: (i) deductive development of the hypothesis, by
which certain consequences are predicted, and (ii) finding whether the predicted
consequences follow. If the predicted consequences come true, the hypothesis is
said to be indirectly verified. Verification may thus be done directly, indirectly,
or through logical methods.
If a clear scientific hypothesis has been formulated, half of the research work
is already done. The advantages/utility of having a hypothesis are summarized
below:
..................................................................................................................
..................................................................................................................
..................................................................................................................
The research has to be geared to the available time, energy and money, and to
the availability of data. There is no such thing as a single correct design.
A research design represents a compromise dictated by the many practical
considerations that go into research.
i) It provides the researcher with a blue print for studying research questions.
ii) It dictates boundaries of research activity and enables the investigator to
channel his energies in a specific direction.
iii) It enables the investigator to anticipate potential problems in the implementation
of the study.
iv) The common function of designs is to assist the investigator in providing
answers to various kinds of research questions.
A study design includes a number of component parts which are interdependent
and which demand a series of decisions regarding the definitions, methods,
techniques, procedures, time, cost and administration aspects.
1) Need for the Study: Explain the need for and importance of this study and its
relevance.
2) Review of Previous Studies: Review the previous works done on this topic,
understand what they did, identify gaps and make a case for this study and justify it.
3) Statement of Problem: State the research problem in clear terms and give
a title to the study.
4) Objectives of Study: What is the purpose of this study? What are the
objectives you want to achieve by this study? The statement of objectives should
not be vague. They must be specific and focussed.
5) Formulation of Hypothesis: Conceive possible outcomes or answers to the
research questions and formulate them into hypotheses so that they can be
tested.
6) Operational Definitions: If the study is using uncommon concepts or
unfamiliar tools or using even the familiar tools and concepts in a specific sense,
they must be specified and defined.
7) Scope of the Study: It is important to define the scope of the study,
because the scope decides what is within its purview and what is outside.
8) Sources of Data: Data sources are divided into primary sources (field
sources) and secondary sources (documentary sources). Data from a primary
source are called primary data, and data from a secondary source are called
secondary data. Hence, the researcher has to decide whether to collect data
from primary sources, secondary sources, or both. (This will be discussed in
detail in Unit-3.)
9) Method of Collection: After deciding the sources for data collection, the
researcher has to determine the methods to be employed for data
collection, primarily, either census method or sampling method. This decision
may depend on the nature, purpose, scope of the research and also time
factor and financial resources.
10) Tools & Techniques: The tools and techniques to be used for collecting
data such as observation, interview, survey, schedule, questionnaire, etc.,
have to be decided and prepared.
11) Sampling Design: If it is a sample study, the sampling techniques, the size
of sample, the way samples are to be drawn etc., are to be decided.
12) Data Analysis: How are you going to process and analyze the data and
information collected? What simple or advanced statistical techniques are
going to be used for analysis and testing of hypothesis, so that necessary
care can be taken at the collection stage.
13) Presentation of the Results of Study: How are you going to present the
results of the study? How many chapters? What is the chapter scheme?
The chapters, their purpose, their titles have to be outlined. It is known as
chapterisation.
14) Time Estimates: What is the time available for this study? Is it limited or
unlimited? Generally, it is a time-bound study. The available or permitted
time must be apportioned among the different activities, and the activities
carried out within the specified time. For example: preparation of research
design, one month; preparation of questionnaire, one month; data collection,
two months; analysis of data, two months; drafting of the report, two
months; and so on.
15) Financial Budget: The design should also take into consideration the
various costs involved and the sources available to meet them, covering
expenditures like salaries (if any), printing and stationery, postage and
telephone, computer and secretarial assistance, etc.
16) Administration of the Enquiry: How is the whole thing to be executed?
Who does what and when? All these activities have to be organized
systematically, research personnel have to be identified and trained. They
must be entrusted with the tasks, the various activities are to be
coordinated and the whole project must be completed as per schedule.
Research designs provide guidelines for investigative activity and not necessarily
hard and fast rules that must remain unbroken. As the study progresses, new
aspects, new conditions and new connecting links come to light and it is
necessary to change the plan / design as circumstances demand. A universal
characteristic of any research plan is its flexibility.
Depending upon the method of research, the designs are also known as survey
design, case study design, observation design and experimental design.
The difference between a pilot study and a pre-test is that the former is a
full-fledged miniature study of a research problem, whereas the latter is a trial
test of a specific aspect of the study, such as a questionnaire.
..................................................................................................................
..................................................................................................................
..................................................................................................................
Having specified the problem, the next step is to formulate the objectives of
research so as to give direction to the study. The researcher should also
propose a set of suggested solutions to the problem under study. Such tentative
solutions formulated are called hypotheses. The hypotheses are of various types
such as explanatory hypothesis, descriptive hypothesis, analogical hypothesis,
working hypothesis, null hypothesis and statistical hypothesis. A good hypothesis
must be empirically verifiable, should be relevant, must have explanatory power,
must be as far as possible within the established knowledge, and must be
simple, clear and definite. There are four stages in a hypothesis: (a) feeling a
problem, (b) formulating the hypothesis, (c) deductive development of the
hypothesis, and (d) verification/testing of the hypothesis. Verification can be
done either directly or indirectly or through logical methods. Testing is done by
using statistical methods.
Having selected the problem and formulated the objectives and hypothesis, the
researcher has to prepare a blue print or plan of action, usually called a
research design. The design/study plan includes a number of components which
are interdependent and which demand a series of decisions regarding definitions,
scope, methods, techniques, procedures, instruments, time, place, expenditure and
administration aspects.
If the problem selected for research is not a familiar one, a pilot study may be
conducted to acquire knowledge about the subject matter and the various issues
involved. Then, for collection of data, instruments and/or scales have to be
constructed, which have to be pre-tested before finally accepting them for use.
5) What is meant by hypothesis? Explain the criteria for a workable
hypothesis.
6) What are the different stages in a hypothesis? How do you verify /
test a hypothesis?
7) What is a research design? Explain the functions of a research design.
8) Define a research design and explain its contents.
9) What are the various components of a research design?
10) Distinguish between a pilot study and a pre-test. Also explain the need for
a pilot study and pre-testing.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
3.0 OBJECTIVES
On the completion of this unit, you should be able to:
l discuss the necessity and usefulness of data collection,
l explain and distinguish between primary data and secondary data,
l explain the sources of secondary data and its merits and demerits,
l describe different methods of collecting primary data and their merits and
demerits,
l examine the choice of a suitable method, and
l examine the reliability, suitability and adequacy of secondary data.
3.1 INTRODUCTION
In Unit 2, we discussed the selection of a research problem and the
formulation of a research design. A research design is a blue print which
directs the plan of action to complete the research work. As we have
mentioned earlier, the collection of data is an important part of the research
process. The quality and credibility of the results derived from the application
of research methodology depend upon relevant, accurate and adequate data.
In this unit, we shall study the various sources of data and the methods of
collecting primary and secondary data, with their merits and limitations, and
also the choice of a suitable method for data collection.
The quality of the research results depends upon the reliability of the data.
Suppose you are the Director of your company. Your Board of Directors has
asked you to find out why the profit of the company has decreased over the
last two years. Your Board wants you to present facts and figures. What are
you going to do?
The first and foremost task is to collect the relevant information to analyse the
above mentioned problem. Information collected from various sources for a
specific purpose, which can be expressed in quantitative form, is called data.
The rational decision maker seeks to evaluate information in order to select the
course of action that maximizes objectives. For decision making, the input data
must be appropriate, and this depends on the appropriateness of the method
chosen for data collection. The application of a statistical technique is possible
when the questions are answerable in quantitative terms, for instance, the cost
of production and profit of the company measured in rupees, or the age of the
workers in the company measured in years. Therefore, the first step in
statistical activities is to gather data. The data may be classified as primary
and secondary data. Let us now discuss these two kinds of data in detail.
From the above discussion, we can understand that the difference between
primary and secondary data is only one of degree: data which is primary in
the hands of one becomes secondary in the hands of another.
This category of secondary data source may also be termed as Paper Source.
The main sources of documentary data can be broadly classified into two
categories:
a) Published Sources
There are various national and international institutions, semi-official reports of
various committees and commissions and private publications which collect and
publish statistical data relating to industry, trade, commerce, health etc. These
publications of various organisations are useful sources of secondary data.
These are as follows:
The secondary data is also available through electronic media (through Internet).
You can download data from such sources by entering web sites like
google.com; yahoo.com; msn.com; etc., and typing your subject for which the
information is needed.
You can also find secondary data on electronic sources like CDs, and the
following online journals:
Now you have learnt that secondary data are available in documents, either
published or unpublished, and in electronic sources. However, you have to take
precautions while using secondary data in research. Let us discuss them in
detail.
From the above discussion, we can understand that there are many published
and unpublished sources from which a researcher can get secondary data.
However, the researcher must be cautious in using this type of data, because
such data may be full of errors due to bias, inadequate sample size, errors of
definition, etc. Bowley observed that it is never safe to take published or
unpublished statistics at their face value without knowing their meaning and
limitations. Hence, before using secondary data, you must examine the
following points.
Merits
1) Secondary data is much more economical and quicker to collect than primary
data, as we need not spend time and money on designing and printing data
collection forms (questionnaire/schedule), appointing enumerators, editing and
tabulating data etc.
2) It is impossible for an individual or a small institution to collect primary data
on subjects such as population census, imports and exports of different
countries, national income data, etc.; such data, however, can be obtained from
secondary sources.
Limitations
1) Using secondary data is risky because it may not be suitable, reliable or
adequate, and it is often difficult to find data which exactly fit the needs of the
present investigation.
2) It is difficult to judge whether the secondary data is sufficiently accurate or not
for our investigation.
3) Secondary data may not be available for some investigations. For example,
bargaining strategies in live products marketing, impact of T.V. advertisements
on viewers, opinion polls on a specific subject, etc. In such situations we have to
collect primary data.
Self Assessment Exercise B
1) Write names of five web sources of secondary data which have not been
included in the above table.
....................................................................................................................
....................................................................................................................
....................................................................................................................
2) Explain the merits and limitations of using secondary data.
....................................................................................................................
....................................................................................................................
....................................................................................................................
3) What precautions must a researcher take before using the secondary data?
....................................................................................................................
....................................................................................................................
....................................................................................................................
Data collected originally for a specific purpose is called primary data. There
are several methods of collecting primary data, such as observation, interview
through reporters, questionnaires and schedules. Let us study them in detail.
The Concise Oxford Dictionary defines observation as, ‘accurate watching and
noting of phenomena as they occur in nature with regard to cause and effect
or mutual relations’. Thus observation is not only a systematic watching but it
also involves listening and reading, coupled with consideration of the seen
phenomena. It involves three processes. They are: sensation, attention or
concentration and perception.
Merits
1) This is the most suitable method when the informants are unable or reluctant to
provide information.
2) This method provides deeper insights into the problem and generally the data is
accurate and quicker to process. Therefore, this is useful for intensive study
rather than extensive study.
Limitations
Despite the above merits, this method suffers from the following limitations:
1) In many situations, the researcher cannot predict when the events will occur. So
when an event occurs there may not be a ready observer to observe the event.
2) Participants may be aware of the observer and as a result may alter their
behaviour.
3) Observer, because of personal biases and lack of training, may not record
specifically what he/she observes.
4) This method cannot be used extensively if the inquiry is large and spread over a
wide area.
3.5.2 Interview Method
Interview is one of the most powerful tools and most widely used method for
primary data collection in business research. In our daily routine we see
interviews on T.V. channels on various topics related to social, business, sports,
budget etc. In the words of C. William Emory, 'personal interviewing is a
two-way purposeful conversation initiated by an interviewer to obtain
information that is relevant to some research purpose'. Thus an interview is basically a
meeting between two persons to obtain information related to the proposed
study. The person who is interviewing is called the interviewer and the person
who is being interviewed is called the informant. It is to be noted that the
research data/information collected through this method is not just a simple
conversation between the investigator and the informant: the glances, gestures,
facial expressions, level of speech, etc., are all part of the process. Through
this method, the researcher can collect varied types of data intensively and
extensively.
Interviews may also be structured or unstructured. In a structured interview,
set questions are asked and the responses are recorded in a standardised form.
This is useful in large-scale interviews where a number of investigators are
assigned the job of interviewing, and it helps minimise the bias of the
interviewer. This technique is also known as a formal interview. In an
unstructured interview, the investigator does not have a set of questions but
only a number of key points around which to build the interview. Normally,
such interviews are conducted in the case of an explorative survey, where the
researcher is not completely sure about the type of data he/she will collect. It
is also known as an informal interview. Generally, this method is used as a
supplementary method of data collection in conducting research in business
areas.
Merits
The major merits of this method are as follows:
3) The informant's reactions to questions can be properly studied.
4) The researcher can adjust the language of communication to the standard of
the informant, so as to obtain personal information from informants which is
helpful in interpreting the results.
Limitations
The limitations of this method are as follows:
1) The chance of the subjective factors or the views of the investigator may come
in either consciously or unconsciously.
2) The interviewers must be properly trained, otherwise the entire work may be
spoiled.
3) It is a relatively expensive and time-consuming method of data collection
especially when the number of persons to be interviewed is large and they are
spread over a wide area.
4) It cannot be used when the field of enquiry is large (large sample).
Precautions: While using this method, the following precautions should be
taken:
Merits
1) This method is cheap and economical for extensive investigations.
2) It gives results easily and promptly.
3) It can cover a wide area under investigation.
Limitations
1) The data obtained may not be reliable.
2) It gives approximate and rough results.
3) It is unsuited where a high degree of accuracy is desired.
4) As the agent/reporter or correspondent uses his own judgement, his personal
bias may affect the accuracy of the information sent.
3.5.4 Questionnaire and Schedule Methods
Questionnaire and schedule methods are the popular and common methods for
collecting primary data in business research. Both the methods comprise a list
of questions arranged in a sequence pertaining to the investigation. Let us study
these methods in detail one after another.
i) Questionnaire Method
Merits
1) You can use this method in cases where informants are spread over a vast
geographical area.
2) Respondents can take their own time to answer the questions. So the researcher
can obtain original data by this method.
3) This is a cheap method because its mailing cost is less than the cost of personal
visits.
4) This method is free from bias of the investigator as the information is given by
the respondents themselves.
5) Large samples can be covered and thus the results can be more reliable and
dependable.
Limitations
1) Respondents may not return the filled-in questionnaires, or they may delay
replying to the questionnaires.
2) This method is useful only when the respondents are educated and co-operative.
3) Once the questionnaire has been despatched, the investigator cannot modify the
questionnaire.
4) It cannot be ensured whether the respondents are truly representative.
ii) Schedule Method
Merits
1) It is a useful method in case the informants are illiterates.
2) The researcher can overcome the problem of non-response as the enumerators
go personally to obtain the information.
3) It is very useful in extensive studies and can obtain more reliable data.
Limitations
1) It is a very expensive and time-consuming method as enumerators are paid
persons and also have to be trained.
2) Since the enumerator is present, the respondents may not respond to some
personal questions.
3) Reliability depends upon the sincerity and commitment of the enumerators in
data collection.
The success of data collection through the questionnaire method or schedule
method depends on how the questionnaire has been designed.
l Choose the appropriate type of questions. Generally there are five kinds of
Specimen Questionnaire
The following specimen questionnaire incorporates most of the qualities which
we have discussed above. It relates to ‘Computer User Survey’.
The information collected from various sources for a specific purpose is
called data. Statistical data may be either primary data or secondary data. Data
which is collected originally for a specific purpose is called primary data. The
data which is already collected and processed by some one else and is being
used now in the present study, is called secondary data. Secondary data can be
obtained either from published sources or unpublished sources. It should be used
if it is reliable, suitable and adequate, otherwise it may result in misleading
conclusions. It has its own merits and demerits. There are several problems in
the collection of primary data. These are: tools and techniques of data
collection, degree of accuracy, designing the questionnaire, selection and training
of enumerators, problem of tackling non-responses and other administrative
aspects.
Several methods are used for collection of primary data. These are: observation,
interview, questionnaire and schedule methods. Every method has its own merits
and demerits. Hence, no method is suitable in all situations. A suitable method
can be selected as per the needs of the investigator, depending on the
objective, nature and scope of the enquiry, and the availability of funds and time.
toothpaste, he or she must make a survey and collect data on the opinions of
the consumers. This is called primary data. The data obtained from published
and unpublished sources is called secondary data.
B. 1) l http://www.bis.org.in
l http://www.business-today.com
l http://www.businessonlineindia.com
l http://www.indiacofee.org
l http://www.dgft.nic.in
IV. i) No, ii) No
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 4 SAMPLING
STRUCTURE
4.0 Objectives
4.1 Introduction
4.2 Census and Sample
4.3 Why Sampling?
4.4 Essentials of a Good Sample
4.5 Methods of Sampling
4.5.1 Random Sampling Methods
4.5.2 Non-Random Sampling Methods
4.6 Sample Size
4.7 Sampling and Non-Sampling Errors
4.7.1 Sampling Errors
4.7.2 Non-Sampling Errors
4.7.3 Control of Errors
4.8 Let Us Sum Up
4.9 Key Words
4.10 Answers to Self Assessment Exercises
4.11 Terminal Questions
4.12 Further Reading
4.0 OBJECTIVES
After studying this Unit, you should be able to:
4.1 INTRODUCTION
In the previous Unit 3, we have studied the types of data (primary and
secondary data) and various methods and techniques of collecting the primary
data. The desired data may be collected by selecting either census method or
sampling method.
In this Unit, we shall discuss the basics of sampling, particularly how to get a
sample that is representative of a population. It covers different methods of
drawing samples which can save a lot of time, money and manpower in a
variety of situations. These include random sampling methods, such as simple
random sampling, stratified sampling, systematic sampling, multistage sampling
and cluster sampling, and non-random sampling methods, viz., convenience
sampling, judgement sampling and quota sampling. The advantages and
disadvantages of sampling and census are covered. How to determine the
sample size for a given population is also discussed.
2) The information obtained on the basis of census data is more reliable and
accurate. It is an adopted method of collecting data on exceptional matters like
child labour, distribution by sex, educational level of the people, etc.
3) If we are conducting a survey for the first time we can have a census instead of
sample survey. The information based on this census method becomes a base
for future studies. Similarly, some of the studies of special importance like
population data are obtained only through census.
Good examples of this occur in quality control. For example, to test the
quality of a bulb, to determine whether it is defective, it must be destroyed.
To obtain a census of the quality of a lorry load of bulbs, you would have to
destroy all of them, which is contrary to the purpose served by quality-control
testing. In this case, only a sample should be used to assess the quality of
the bulbs. Another example is the blood test of a patient.
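The logic of destructive testing by sample can be sketched in a few lines of Python. This is a minimal illustration with made-up numbers (a hypothetical lot of 10,000 bulbs with roughly a 5% defect rate), not a description of any real quality-control procedure: only the sampled bulbs are "destroyed", yet their defect proportion estimates the whole lot's quality.

```python
import random

def estimate_defect_rate(lot, sample_size, seed=42):
    """Destructively test only a random sample of bulbs and use the
    sample's defect proportion as an estimate for the whole lot."""
    rng = random.Random(seed)
    tested = rng.sample(lot, sample_size)   # these bulbs are destroyed
    defective = sum(1 for bulb in tested if bulb == "defective")
    return defective / sample_size

# Hypothetical lorry load: 10,000 bulbs, of which about 5% are defective.
rng = random.Random(0)
lot = ["defective" if rng.random() < 0.05 else "good" for _ in range(10_000)]

estimate = estimate_defect_rate(lot, sample_size=500)
print(f"Estimated defect rate from testing 500 bulbs: {estimate:.1%}")
```

Here 9,500 bulbs remain saleable, while a census would have destroyed the entire lot.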
The disadvantages of sampling are few, but the researcher must be cautious.
These are risk, lack of representativeness and insufficient sample size, each of
which can cause errors. If the researcher does not pay attention to these flaws,
they may invalidate the results.
1) Risk: Using a sample from a population and drawing inferences about the
entire population involves risk. In other words the risk results from dealing with
a part of a population. If the risk is not acceptable in seeking a solution to a
problem then a census must be conducted.
2) Lack of representativeness: Determining the representativeness of the
sample is the researcher's greatest problem. By definition, a 'sample' is a
representative part of an entire population. It is necessary to obtain a sample
that meets the requirement of representativeness, otherwise the sample will be
biased. The inferences drawn from non-representative samples will be
misleading and potentially dangerous.
3) Insufficient sample size: The other significant problem in sampling is to
determine the size of the sample. The size of the sample for a valid sample
depends on several factors such as extent of risk that the researcher is willing to
accept and the characteristics of the population itself.
1) A sample must represent a true picture of the population from which it is drawn.
2) A sample must be unbiased by the sampling procedure.
3) A sample must be taken at random so that every member of the population of
data has an equal chance of selection.
4) A sample must be sufficiently large but as economical as possible.
5) A sample must be accurate and complete. It should not leave any information
incomplete and should include all the respondents, units or items included in the
sample.
6) Adequate sample size must be taken considering the degree of precision
required in the results of inquiry.
It has been scientifically shown that if we increase the sample size we come
closer to the characteristics of the population. Ultimately, if we cover each and
every unit of the population, the characteristics of the sample will equal the
characteristics of the population. That is why in a census there is no sampling
error. Thus, generally speaking, the larger the sample size, the smaller the
sampling error.
The statistical meaning of bias is error. The sample must be error free to make
it an unbiased sample. In practice, it is impossible to achieve an error free
sample even using unbiased sampling methods. However, we can minimize the
error by employing appropriate sampling methods.
The various sampling methods can be classified into two categories. These are
random sampling methods and non-random sampling methods. Let us discuss
them in detail.
1. Simple Random Sampling: The best way to choose a simple random sample is
to use a random number table. A random sampling method should meet the
following criteria:
a) Every member of the population must have an equal chance of inclusion in the
sample.
b) The selection of one member is not affected by the selection of previous
members.
The random numbers are a collection of digits generated through a probabilistic
mechanism. The random numbers have the following properties:
i) The probability that each digit (0, 1, 2, 3, 4, 5, 6, 7, 8, or 9) will appear
at any place is the same, that is, 1/10.
ii) The occurrence of any two digits in any two places is independent of each
other.
Each member of a population is assigned a unique number. The members of
the population chosen for the sample will be those whose numbers are identical
to the ones extracted from the random number table in succession until the
desired sample size is reached. An example of a random number table is given
below.
1 2 3 4 5 6 7 8 9 10
1 96268 11860 83699 38631 90045 69696 48572 05917 51905 10052
2 03550 59144 59468 37984 77892 89766 86489 46619 50236 91136
3 22188 81205 99699 84260 19693 36701 43233 62719 53117 71153
4 63759 61429 14043 44095 84746 22018 19014 76781 61086 90216
5 55006 17765 15013 77707 54317 48862 53823 52905 70754 68212
6 81972 45644 12600 01951 72166 52682 37598 11955 73018 23528
7 06344 50136 33122 31794 86723 58037 36065 32190 31367 96007
8 92363 99784 94169 03652 80824 33407 40837 97749 18361 72666
9 96083 16943 89916 55159 62184 86206 09764 20244 88388 98675
10 92993 10747 08985 44999 35785 65036 05933 77378 92339 96151
11 95083 70292 50394 61947 65591 09774 16216 63561 59751 78771
12 77308 60721 96057 86031 83148 34970 30892 53489 44999 18021
13 11913 49624 28519 27311 61586 28576 43092 69971 44220 80410
14 70648 47484 05095 92335 55299 27161 64486 71307 85883 69610
15 92771 99203 37786 81142 44271 36433 31726 74879 89384 76886
16 78816 20975 13043 55921 82774 62745 48338 88348 61211 88074
17 79934 35392 56097 87613 94627 63622 08110 16611 88599 02890
18 64698 83376 87527 36897 17215 74339 69856 43622 22567 11518
19 44212 12995 03581 37618 94851 63020 65348 55857 91742 79508
20 89292 00204 00579 70630 37136 50922 83387 15014 51838 81760
21 08692 87237 87879 01629 72184 33853 95144 67943 19345 03469
22 67927 76855 50702 78555 97442 78809 40575 79714 06201 34576
23 62167 94213 52971 85794 68067 78814 40103 70759 92129 46716
24 45828 45441 74220 84157 23241 49332 23646 09390 13031 51569
25 01164 35307 26526 80335 58090 85871 07205 31749 40571 51755
26 29283 31581 04359 45538 41435 61103 32428 94042 39971 63678
27 19868 49978 81699 84904 50163 22652 07845 71308 00859 87984
28 14292 93587 55960 23159 07370 65065 06580 46285 07884 83928
29 77410 52135 29495 23032 83242 89938 40516 27252 55565 64714
30 36580 06921 35675 81645 60479 71035 99380 59759 42161 93440
31 07780 18093 31258 78156 07871 20369 53977 08534 39433 57216
32 07548 08454 36674 46255 80541 42903 37366 21164 97516 66181
33 22023 60448 69344 44260 90570 01632 21002 24413 04671 05665
34 20827 37210 57797 34660 32510 71558 78228 42304 77197 79168
35 47802 79270 48805 59480 88092 11441 96016 76091 51823 94442
36 76730 86591 18978 25479 77684 88439 34112 26052 57112 91653
37 26439 02903 20935 76297 15290 84688 74002 09467 41111 19194
38 32927 83426 07848 59372 44422 53372 27823 25417 27150 21750
39 51484 05286 77103 47284 00578 88774 15293 50740 07932 87633
40 45142 96804 92834 26886 70002 96643 36008 02239 93563 66429
To select a random sample using the simple random sampling method we should
follow the steps given below:
v) Choose the direction in which you want to read the numbers (from left to
right, or right to left, or down or up).
vi) Select the first ‘n’ numbers whose X digits are between 0 and N − 1. If
N = 100 then X would be 2 (units numbered 00 to 99); if N is a four-digit
number then X would be 3, and so on.
viii) If you reach the end of the table before obtaining ‘n’ numbers, pick
another starting point, read in a different direction, use the
first X digits instead of the last X digits, and continue until the desired
sample is selected.
Example: Suppose you have a list of 80 students and want to select a sample
of 20 students using simple random sampling method. First assign each student
a number from 00 to 79. To draw a sample of 20 students using random
number table, you need to find 20 two-digit numbers in the range 00 to 79. You
can begin anywhere and go in any direction. For example, start from the 6th
row and 1st column of the random number table given in this Unit. Read the
last two digits of each number. If a number is within the range (00 to 79),
include it in the sample; otherwise skip it and read the next number in the
chosen direction. If a number has already been selected, omit it. In this
example, starting from the 6th row and 1st column and moving from left to
right, the following numbers are considered to select the 20 numbers for the
sample.
81972 45644 12600 01951 72166 52682 37598 11955 73018 23528
06344 50136 33122 31794 86723 58037 36065 32190 31367 96007
92363 99784 94169 03652 80824 33407 40837 97749 18361 72666
The bold faced digits in the one’s and ten’s place value indicate the selected
numbers for the sample. Therefore, the following are the 20 numbers chosen as
sample.
72 44 00 51 66 55 18 28
36 22 23 37 65 67 07 63
69 52 24 49
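The selection procedure of this worked example can be sketched in a few lines of Python (a minimal illustration; the three rows of five-digit numbers are the ones shown above):

```python
# Read the last two digits of each five-digit random number, keep those
# in the range 00-79, skip repeats, and stop once 20 numbers are drawn.
rows = [
    "81972 45644 12600 01951 72166 52682 37598 11955 73018 23528",
    "06344 50136 33122 31794 86723 58037 36065 32190 31367 96007",
    "92363 99784 94169 03652 80824 33407 40837 97749 18361 72666",
]

sample, seen = [], set()
for row in rows:
    for number in row.split():
        code = int(number[-2:])              # last two digits
        if code <= 79 and code not in seen:  # within 00-79 and not a repeat
            seen.add(code)
            sample.append(code)
        if len(sample) == 20:
            break
    if len(sample) == 20:
        break

print(sample)
```

Running the sketch reproduces the 20 numbers listed above (72, 44, 00, 51, ... 49).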
Advantages
i) The simple random sample requires less knowledge about the characteristics of
the population.
ii) Since the sample is selected at random, giving each member of the population
an equal chance of being selected, the sample can be called an unbiased
sample. Bias due to human preferences and influences is eliminated.
iii) Assessment of the accuracy of the results is possible through sampling error
estimation.
iv) It is a simple and practical sampling method provided population size is not large.
Limitations
i) If the population size is large, a great deal of time must be spent listing and
numbering the members of the population.
ii) A simple random sample will not adequately represent many population
characteristics unless the sample is very large. That is, if the researcher is
interested in choosing a sample on the basis of the distribution in the population
of gender, age, social status, a simple random sample needs to be very large to
ensure all these distributions are representative of the population. To obtain a
representative sample across multiple population attributes we should use
stratified random sampling.
2. Systematic Sampling: In systematic sampling the sample units are selected
from the population at equal intervals in terms of time, space or order. The
selection of a sample using systematic sampling method is very simple. From a
population of ‘N’ units, a sample of ‘n’ units may be selected by following the
steps given below:
i) Arrange all the units in the population in an order by giving serial numbers
from 1 to N.
ii) Determine the sampling interval by dividing the population size by the
sample size. That is, K = N/n.
iii) Select the first sample unit at random from the first sampling interval (1 to
K).
iv) Select the subsequent sample units at equal regular intervals.
For example, we want to have a sample of 100 units from a population of 1000
units. First arrange the population units in some serial order by giving numbers
from 1 to 1000. The sample interval size is K=1000/100=10. Select the first
sample unit at random from the first 10 units (i.e., from 1 to 10). Suppose the
first sample unit selected is 5; then the subsequent sample units are 15, 25,
35, ..., 995. Thus, in systematic sampling the first sample unit is selected
at random and this sample unit in turn determines the subsequent sample units
that are to be selected.
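The four steps above can be sketched as a small Python helper (the function and parameter names are ours; the example values are those in the text):

```python
import random

def systematic_sample(N, n, start=None):
    """Select n of N serially numbered units (1..N) at a fixed
    interval K = N // n, starting at a random point in 1..K."""
    k = N // n                          # step ii: sampling interval
    if start is None:
        start = random.randint(1, k)    # step iii: random start in 1..K
    return [start + i * k for i in range(n)]  # step iv: equal intervals

# The Unit's example: N = 1000, n = 100, first unit drawn at random is 5
units = systematic_sample(1000, 100, start=5)
print(units[:4], "...", units[-1])      # [5, 15, 25, 35] ... 995
```

Once the first unit is fixed, every subsequent unit is determined, which is exactly the property discussed under the limitations below.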
Advantages
i) The main advantage of using systematic sample is that it is more expeditious to
collect a sample systematically since the time taken and work involved are less
than in simple random sampling. For example, it is frequently used in exit polls
and in surveys of store customers.
ii) This method can be used even when no formal list of the population units is
available. For example, suppose if we are interested in knowing the opinion of
consumers on improving the services offered by a store we may simply choose
every kth (say, 6th) consumer visiting a store, provided that we know how many
consumers visit the store daily (say, 1,000 consumers visit and we want to
have 100 consumers as the sample size).
Limitations
i) If there is periodicity in the occurrence of elements of a population, the selection
of sample using systematic sample could give a highly un-representative sample.
For example, suppose the sales of a consumer store are arranged
chronologically and, using systematic sampling, we select the sales for the 1st of
every month. The 1st day of a month cannot be representative of the whole
month. Thus, in systematic sampling there is a danger of order bias.
ii) Every unit of the population does not have an equal chance of being selected
and the selection of units for the sample depends on the initial unit selection.
Regardless of how we select the first unit of the sample, subsequent units are
automatically determined, lacking complete randomness.
3. Stratified Random Sampling: The stratified sampling method is used when
the population is heterogeneous rather than homogeneous. A heterogeneous
population is composed of unlike elements such as male/female, rural/urban,
literate/illiterate, high income/low income groups, etc. In such cases, use of
simple random sampling may not always provide a representative sample of the
population. In stratified sampling, we divide the population into relatively
homogenous groups called strata. Then we select a sample using simple
random sampling from each stratum. There are two approaches to decide the
sample size from each stratum, namely, proportional stratified sample and
disproportional stratified sample. With either approach, the stratified sampling
guarantees that every unit in the population has a chance of being selected. We
will now discuss these two approaches of selecting samples.
i) Proportional Stratified Sample: If the number of sampling units drawn
from each stratum is in proportion to the corresponding stratum population size,
we say the sample is proportional stratified sample. For example, let us say
we want to draw a stratified random sample from a heterogeneous population
(on some characteristics) consisting of rural/urban and male/female respondents.
So we have to create four homogeneous subgroups, called strata, as follows:

                Urban                        Rural
          Male      Female             Male      Female

To ensure each stratum in the sample represents the corresponding stratum
in the population, each stratum must appear in the sample in the same
proportion as it appears in the population. Let us
assume that we know (or can estimate) the population distribution as follows:
65% male, 35% female; and 30% urban, 70% rural. Now we can determine
the approximate proportions of our four strata in the population as shown below.

                    Urban                                    Rural
          Male               Female               Male               Female
  0.30 × 0.65 = 0.195   0.30 × 0.35 = 0.105   0.70 × 0.65 = 0.455   0.70 × 0.35 = 0.245
Multiplying these proportions by the desired total sample size gives the
sample size required from each stratum. Suppose we require 1,000 samples; then
the required sample in each stratum is as follows:

   Urban male:    0.195 × 1,000 = 195
   Urban female:  0.105 × 1,000 = 105
   Rural male:    0.455 × 1,000 = 455
   Rural female:  0.245 × 1,000 = 245
   Total: 1,000
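The proportional allocation above can be reproduced in a few lines of Python (a sketch; the shares and the total of 1,000 are taken from the example):

```python
# Proportional stratified allocation: each stratum's share of the
# population times the total sample size.
shares = {
    ("urban", "male"):   0.30 * 0.65,   # 0.195
    ("urban", "female"): 0.30 * 0.35,   # 0.105
    ("rural", "male"):   0.70 * 0.65,   # 0.455
    ("rural", "female"): 0.70 * 0.35,   # 0.245
}

total_sample = 1000
allocation = {s: round(p * total_sample) for s, p in shares.items()}
print(allocation)                  # 195, 105, 455 and 245 respectively
print(sum(allocation.values()))    # 1000
```

Because the shares sum to 1, the stratum samples sum back to the required total of 1,000.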
Advantages
a) Since samples are drawn from each of the strata of the population,
stratified sampling is more representative and thus more accurately reflects the
characteristics of the population from which they are chosen.
b) It is more precise and to a great extent avoids bias.
c) Since sample size can be less in this method, it saves a lot of time, money and
other resources for data collection.
Limitations
a) Stratified sampling requires a detailed knowledge of the distribution of attributes
or characteristics of interest in the population to determine the homogeneous
groups that lie within it. If we cannot accurately identify the homogeneous
groups, it is better to use simple random sample since improper stratification can
lead to serious errors.
b) Preparing a stratified list is a difficult task as the lists may not be readily
available.
4. Cluster Sampling: In cluster sampling we divide the population into groups
with heterogeneous characteristics, called clusters, and then select a sample of
clusters using simple random sampling. We assume that each of the clusters is
representative of the population as a whole. This sampling is widely used for
geographical studies of many issues. For example, if we are interested in finding
the attitudes of consumers residing in Delhi towards a new product of a
company, the whole city of Delhi can be divided into 20 blocks. Assuming that
each of these blocks represents the attitudes of the consumers of Delhi as a
whole, we might use cluster sampling, treating each block as a cluster. We will
then select a sample of 2 or 3 clusters and obtain the information from
consumers covering all of them. The principles that are basic to the cluster
sampling are as follows:
i) The differences or variability within a cluster should be as large as possible.
As far as possible the variability within each cluster should be the same as
that of the population.
ii) The variability between clusters should be as small as possible. Once the
clusters are selected, all the units in the selected clusters are covered for
obtaining data.
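The Delhi example can be sketched as follows (a hypothetical illustration; the block and consumer names, the seed, and the block size of 50 are invented for the sketch):

```python
import random

# 20 blocks act as clusters; a simple random sample of 3 blocks is
# drawn, and then every consumer in the chosen blocks is surveyed.
random.seed(1)  # fixed seed so the illustration is reproducible
blocks = {f"block_{i}": [f"consumer_{i}_{j}" for j in range(50)]
          for i in range(1, 21)}

chosen = random.sample(sorted(blocks), k=3)           # random sample of clusters
respondents = [c for b in chosen for c in blocks[b]]  # all units in each cluster
print(chosen, len(respondents))                       # 3 blocks, 150 respondents
```

Note that randomness operates at the cluster level only; once a block is chosen, all of its units are covered, which is what distinguishes cluster sampling from stratified sampling.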
Advantages
a) The cluster sampling provides significant gains in data collection costs, since
traveling costs are smaller.
b) Since the researcher need not cover all the clusters and only a sample of
clusters are covered, it becomes a more practical method which facilitates
fieldwork.
Limitations
a) The cluster sampling method is less precise than sampling of units from the
whole population since the latter is expected to provide a better cross-section of
the population than the former, due to the usual tendency of units in a cluster to
be homogeneous.
b) The sampling efficiency of cluster sampling is likely to decrease with the
decrease in cluster size or increase in number of clusters.
The above advantages or limitations of cluster sampling suggest that, in practical
situations where sampling efficiency is less important but the cost is of greater
significance, the cluster sampling method is extensively used. If the division of
clusters is based on the geographic sub-divisions, it is known as area sampling.
In cluster sampling instead of covering all the units in each cluster we can
resort to sub-sampling as two-stage sampling. Here, the clusters are termed as
primary units and the units within the selected clusters are taken as secondary
units.
5. Multistage Sampling: We have already covered two-stage sampling. Multistage
sampling is a generalisation of two-stage sampling. As the name suggests,
multi stage sampling is carried out in different stages. In each stage
progressively smaller (population) geographic areas will be randomly selected.
A political pollster interested in assembly elections in Uttar Pradesh may first
divide the state into different assembly units and a sample of assembly
constituencies may be selected in the first stage. In the second stage, each of
the sampled assembly constituents are divided into a number of segments and a
second stage sampled assembly segments may be selected. In the third stage
within each sampled assembly segment either all the house-holds or a sample
random of households would be interviewed. In this sampling method, it is
possible to take as many stages as are necessary to achieve a representative
sample. Each stage results in a reduction of sample size.
In multistage sampling, a suitable sampling method is used at each stage. As
many stages are used as are needed to arrive at a sample of the desired
sampling units.
Advantages
a) Multistage sampling provides cost gains by reducing data collection costs.
b) Multistage sampling is more flexible and allows us to use different sampling
procedures in different stages of sampling.
c) If the population is spread over a very wide geographical area, multistage
sampling is the only sampling method available in a number of practical
situations.
Limitations
a) If the sampling units selected at different stages are not representative
multistage sampling becomes less precise and efficient.
4.5.2 Non-Random Sampling Methods
The non-random sampling methods are also often called non-probability sampling
methods. In a non-random sampling method the probability of any particular unit
of the population being chosen is unknown. Here the method of selection of
sampling units is quite arbitrary as the researchers rely heavily on personal
judgment. Non-random sampling methods usually do not produce samples that
are representative of the general population from which they are drawn. The
greatest error occurs when the researcher attempts to generalise the results on
the basis of a sample to the entire population. Such an error is insidious
because it is not at all obvious from merely looking at the data, or even from
looking at the sample. The easiest way to recognise whether a sample is
representative or not is to determine whether the sample is selected randomly
or not. Nevertheless, there are occasions where non-random samples are best
suited for the researcher’s purpose. The various non-random sampling methods
commonly used are:
1) Convenience Sampling;
2) Judgement Sampling; and
3) Quota Sampling.
Let us discuss these methods in detail.
1) Convenience Sampling: Convenience sampling refers to the method of
obtaining a sample that is most conveniently available to the researcher. For
example, if we are interested in finding the overtime wage paid to employees
working in call centres, it may be convenient and economical to sample
employees of call centres in a nearby area. Also, on various issues of public
interest like budget, election, price rise etc., the television channels often present
on-the-street interviews with people to reflect public opinion. It may be
cautioned that the generalisation of results based on convenience sampling
beyond that particular sample may not be appropriate. Convenience samples are
best used for exploratory research when additional research will be
subsequently conducted with a random sample. Convenience sampling is also
useful in testing the questionnaires designed on a pilot basis. Convenience
sampling is extensively used in marketing studies.
2) Judgement Sampling: Judgement sampling method is also known as purposive
sampling. In this method of sampling the selection of sample is based on the
researcher’s judgment about some appropriate characteristic required of the
sample units. For example, the calculation of consumer price index is based on a
judgment sample of a basket of consumer items, and other related commodities
and services which are expected to reflect a representative sample of items
consumed by the people. The prices of these items are collected from selected
cities which are viewed as typical cities with demographic profiles matching the
national profile. In business judgment sampling is often used to measure the
performance of salesmen/saleswomen. The salesmen/saleswomen are grouped
into high, medium or low performers based on certain specified qualities. The
sales manager may then classify each salesman/saleswoman working under
him/her into the group in which, in his/her opinion, he/she falls. Judgment sampling
is also often used in forecasting election results. We may often wonder how a
pollster can predict an election based on only 2% to 3% of votes covered. It is
needless to say the method is biased and does not have any scientific basis.
However, in the absence of any representative data, one may resort to this kind
of non-random sampling.
3) Quota Sampling: The quota sampling method is commonly used in marketing
research studies. The samples are selected on the basis of some parameters
such as age, sex, geographical region, education, income, occupation etc, in
order to make them as representative samples. The investigators, then, are
assigned fixed quotas of the sample meeting these population characteristics.
The purpose of quota sampling is to ensure that various sub-groups of the
population are represented on pertinent sample characteristics to the extent that
the investigator desires. The stratified random sampling also has this objective
but should not be confused with quota sampling. In the stratified sampling
method the researcher selects a random sample from each group of the
population, where as, in quota sampling, the interviewer has a quota fixed for
him/her to achieve. For example, if a city has 10 market centres, a soft drink
company may decide to interview 50 consumers from each of these 10 market
centres to elicit information on their products. It is entirely left to the
investigator whom he/she will interview at each of the market centres and the
time of interview. The interview may take place in the morning, mid day, or
evening or it may be in the winter or summer.
Quota sampling has the advantage that the sample conforms to the selected
characteristics of the population that the researcher desires. The cost and
time involved in collecting the data are also greatly reduced. However, quota
sampling has many limitations, as given below:
a) In quota sampling the respondents are selected according to the convenience of
the field investigator rather than on a random basis. This kind of selection of
sample may be biased. Suppose in our example of soft drinks, after the sample
is taken it was found that most of the respondents belong to the lower income
group, then the purpose of conducting the survey is defeated and the
results may not reflect the actual situation.
b) If the number of parameters on the basis of which the quotas are fixed is
large, it becomes difficult for the researcher to fix the quota for each sub-group.
c) The field workers have the tendency to cover the quota by going to those places
where the respondents may be willing to provide information and avoid those
with unwilling respondents. For example, the investigators may avoid places
where high income group respondents stay and cover only low income group
areas.
1) Suppose there are 900 families residing in a colony. You are asked to select a
sample of families using simple random sampling for knowing the average
income. The families are identified with serial numbers 001 to 900.
i) Select a random sample using the following random table.
29283 31581 04359 45538 41435 61103 32428 94042 39971 63678
19868 49978 81699 84904 50163 22652 07845 71308 00859 87984
14292 93587 55960 23159 07370 65065 06580 46285 07884 83928
77410 52135 29495 23032 83242 89938 40516 27252 55565 64714
36580 06921 35675 81645 60479 71035 99380 59759 42161 93440
ii) While selecting the random sample in the above example, what are the
random numbers you have rejected and why?
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
Once the researcher determines the desired degree of precision and confidence
level, there are several formulas he/she can use to determine the sample size
and interpretation of results depending on the plan of the study. Here we will
discuss three of them.
3) If the researcher plans the results in a variety of ways or if he/she has difficulty
in estimating the proportion or standard deviation of the attribute of interest, the
following formula may be more useful.
n = (N × Z² × 0.25) / [d² × (N − 1) + Z² × 0.25]

where N is the population size, Z is the standard normal value for the chosen
confidence level, and d is the desired precision. For example, with N = 1,000,
Z = 1.96 and d = 0.05:

n = (1000 × 1.96² × 0.25) / [(0.05² × 999) + (1.96² × 0.25)] = 277.7, or say 280
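This formula can be wrapped in a small Python function (a sketch; the function and parameter names are ours, and 0.25 is the conservative maximum of p(1 − p)):

```python
import math

def sample_size(N, z=1.96, d=0.05):
    """Sample size for a finite population N at confidence multiplier z
    and precision d, using the conservative value p(1 - p) = 0.25."""
    return (N * z**2 * 0.25) / (d**2 * (N - 1) + z**2 * 0.25)

n = sample_size(1000)
print(round(n, 1), "->", math.ceil(n))   # 277.7 -> 278 (the text says "say 280")
```

For very large N the result approaches Z² × 0.25 / d² (about 384 at the 95% level with d = 0.05), which is why sample sizes level off for big populations.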
The principal sources of sampling errors are the sampling method applied, and
the sample size. This is due to the fact that only a part of the population is
covered in the sample. The magnitude of the sampling error varies from one
sampling method to the other, even for the same sample size. For example, the
sampling error associated with simple random sampling will be greater than
stratified random sampling if the population is heterogeneous in nature.
Intuitively, we know that the larger the sample the more accurate the research.
In fact, the sampling error varies with samples of different sizes. Increasing the
sample size decreases the sampling error.
The following figure gives an approximate relationship between sample size and
sampling error. Study the figure carefully.
Fig. 4.1: Relationship between sample size and sampling error (sampling error
falls from large to small as sample size grows from small to large).
In the above two sections we have identified the most significant sources of
errors. It is not possible to eliminate completely the sources of errors.
However, the researcher’s objective and effort should be to minimise these
sources of errors as much as possible. There are ways of reducing the errors.
Some of these are:
(a) designing and executing a good questionnaire; (b) selection of appropriate
sampling method; (c) adequate sample size; (d) employing trained investigators
to collect the data; and (e) care in editing, coding and entering the data into the
computer. You have already learned the above ways of controlling the errors
in Unit 3 and in this Unit.
There are two broad categories of sampling methods. These are: (a) random
sampling methods, and (b) non-random sampling methods. The random sampling
methods are based on the chance of including the units of population in a
sample.
Some of the sampling methods covered in this Unit are: (a) simple random
sampling, (b) systematic random sampling, (c) stratified random sampling,
(d) cluster sampling, and (e) multistage sampling. With an appropriate sampling
plan and selection of random sampling method the sampling error can be
minimised. The non-random sampling methods include: (a) convenience sampling,
(b) judgment sampling, and (c) Quota sampling. These methods may be
convenient for the researcher to apply. However, these methods may not provide a
representative sample of the population, and there are no scientific ways to
check the sampling errors.
There are two major sources of errors in survey research. These are:
(a) sampling errors, and (b) non-sampling errors. Sampling errors arise
because the sample may not be representative of the
population. Two major sources of non-sampling errors are: (a) non-
response on the part of respondent and/or respondent’s bias in providing correct
information, and (b) administrative errors like design and implementation of
questionnaire, investigators’ bias, and data processing errors.
(a) designing a good questionnaire, (b) selection of appropriate sampling method,
(c) adequate sample size, (d) employing trained investigators and, (e) care in
data processing.
2) Cluster sampling
3) Stratified sampling
4) a) true
b) false, it is convenience sampling
C. 1) The required sample size is 370
2) Decreases
3) Sampling method applied
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
4.12 FURTHER READING
The following text books may be used for more indepth study on the topics
dealt with in this unit.
Gupta, C.B. and Vijay Gupta, An Introduction to Statistical Methods, Vikas
Publishing House Pvt. Ltd., New Delhi.
Kothari, C.R. (2004) Research Methodology: Methods and Techniques, New Age
International (P) Ltd., New Delhi.
Levin, R.I. and D.S. Rubin (1999) Statistics for Management, Prentice-Hall of
India, New Delhi.
Mustafi, C.K. (1981) Statistical Methods in Managerial Decisions, Macmillan,
New Delhi.
5.0 OBJECTIVES
After studying this unit, you should be able to:
l explain the concepts of measurement and scaling,
l discuss four levels of measurement scales,
l classify and discuss different scaling techniques, and
l select an appropriate attitude measurement scale for your research problem.
5.1 INTRODUCTION
As we discussed earlier, the data consists of quantitative variables like price,
income, sales etc., and qualitative variables like knowledge, performance,
character etc. The qualitative information must be converted into numerical
form for further analysis. This is possible through measurement and scaling
techniques. A common feature of survey based research is to have
respondent’s feelings, attitudes, opinions, etc. in some measurable form. For
example, a bank manager may be interested in knowing the opinion of the
customers about the services provided by the bank. Similarly, a fast food
company having a network in a city may be interested in assessing the quality
and service provided by them. As a researcher you may be interested in
knowing the attitude of the people towards the government announcement of a
metro rail in Delhi. In this unit we will discuss the issues related to
measurement, different levels of measurement scales, various types of scaling
techniques and also selection of an appropriate scaling technique.
In measurement, the researcher must address three basic issues:
a) What is to be measured?
b) Who is to be measured?
c) The choices available in data collection techniques
The first issue that the researcher must consider is ‘what is to be measured’?
The definition of the problem, based on our judgments or prior research
indicates the concept to be investigated. For example, we may be interested in
measuring the performance of a fast food company. We may require a precise
definition of the concept on how it will be measured. Also, there may be more
than one way that we can measure a particular concept. For example, in
measuring the performance of a fast food company we may use a number of
measures to indicate the performance of the company. We may use sales
volume in terms of value of sales or number of customers or spread of
network of the company as measures of performance. Further, the
measurement of concepts requires assigning numbers to the attitudes, feelings or
opinions. The key question here is: on what basis do we assign
numbers to the concept? For example, suppose the task is to measure the
agreement of customers of a fast food company with the statement that the food
served by the company is tasty. We create five categories: (1) strongly agree,
(2) agree, (3) undecided, (4) disagree, (5) strongly disagree, and then measure
each respondent's response. If a respondent states ‘disagree’ with the
statement that ‘the food is tasty’, the measurement is 4.
The second issue is ‘who is to be measured’: the characteristics of the
respondents have a bearing on the choice of measurement. The measurement
procedure must be designed keeping in mind the characteristics of the
respondents under consideration.
The third issue in measurement is the choice of the data collection techniques.
In Unit 3, you have already learnt various methods of data collection. Normally,
questionnaires are used for measuring attitudes, opinions or feelings.
a) Nominal Scale is the crudest among all measurement scales but it is also the
simplest scale. In this scale the different scores on a measurement simply
indicate different categories. The nominal scale does not express any values or
relationships between variables. For example, labelling men as ‘1’ and women
as ‘2’ which is the most common way of labelling gender for data recording
purpose does not mean women are ‘twice something or other’ than men. Nor it
suggests that men are somehow ‘better’ than women. Another example of
nominal scale is to classify the respondent’s income into three groups: the
highest income as group 1. The middle income as group 2, and the low-income
as group 3. The nominal scale is often referred to as a categorical scale. The
assigned numbers have no arithmetic properties and act only as labels. The only
statistical operation that can be performed on nominal scales is a frequency
count. We cannot determine an average except mode.
In designing and developing a questionnaire, it is important that the response
categories must include all possible responses. In order to have an exhaustive
number of responses, you might have to include a category such as ‘others’,
‘uncertain’, ‘don’t know’, or ‘can’t remember’ so that the respondents will not
distort their information by forcing their responses in one of the categories
provided. Also, you should ensure that the categories provided are mutually
exclusive, so that they do not overlap or duplicate each other in any way.
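The point that only a frequency count (and hence the mode) is valid on nominal data can be sketched as follows; the gender codes and the sample are hypothetical:

```python
from collections import Counter

# Hypothetical nominal data: 1 = male, 2 = female (the codes are labels only)
gender_codes = [1, 2, 2, 1, 2, 1, 2, 2, 1, 2]

counts = Counter(gender_codes)        # frequency count: the valid operation
mode = counts.most_common(1)[0][0]    # mode: the only meaningful "average"

print(counts)   # Counter({2: 6, 1: 4})
print(mode)     # 2
```

Computing a mean of these codes would be meaningless, since the numbers carry no arithmetic properties.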
b) Ordinal Scale involves the ranking of items along the continuum of the
characteristic being scaled. In this scale, the items are classified according to
whether they have more or less of a characteristic. For example, you may wish
to ask TV viewers to rank the TV channels according to their preference.
The main characteristic of the ordinal scale is that the categories have a logical
or ordered relationship. This type of scale permits the measurement of degrees
of difference (that is, ‘more’ or ‘less’) but not the specific amount of
difference (that is, how much ‘more’ or ‘less’). This scale is very common
in marketing, satisfaction and attitudinal research.
Another example is that a fast food home delivery shop may wish to ask its
customers:
How would you rate the service of our staff?
(1) Excellent • (2) Very Good • (3) Good • (4) Poor • (5) Worst •
c) Interval Scale is a scale in which numbers are used to rank attributes such
that numerically equal distances on the scale represent equal distances in the
characteristic being measured. An interval scale contains all the information of
an ordinal scale, but it also allows one to compare the difference/distance
between attributes. For example, the difference between ‘1’ and ‘2’ is equal to
the difference between ‘3’ and ‘4’, and the difference between ‘2’ and ‘4’ is
twice the difference between ‘1’ and ‘2’. However, in an interval scale the
zero point is arbitrary and is not a true zero. This, of course, has implications
for the type of data manipulation and analysis we can carry out on data
collected in this form. It is possible to add or subtract a constant to all of the
scale values without affecting the form of the scale, but one cannot
meaningfully multiply or divide the values. Measuring temperature is an
example of an interval scale: 0°C does not mean that there is no temperature,
but is only a relative point on the Centigrade scale. Because there is no
absolute zero point, the interval scale does not allow the conclusion that 40°C
is twice as hot as 20°C.
Interval scales may be either in numeric or semantic format.
Interval scales allow the calculation of averages like the mean, median and
mode, and of measures of dispersion like the range and standard deviation.
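These properties can be illustrated with a short sketch using hypothetical Celsius readings: shifting the scale by a constant leaves its form (and dispersion) unchanged, while ratios of scale values are not preserved.

```python
import statistics

# Hypothetical temperatures in degrees Celsius (interval scale: arbitrary zero)
temps_c = [10, 20, 30, 40]

# Averages and dispersion are meaningful on interval data
mean_c = statistics.mean(temps_c)    # 25

# Adding a constant merely shifts the scale without changing its form,
# so dispersion is unaffected
shifted = [t + 273 for t in temps_c]
assert statistics.stdev(shifted) == statistics.stdev(temps_c)

# Ratios are NOT preserved: 40/20 = 2 on this scale, but the same two
# readings on the shifted scale give 313/293, roughly 1.07 -- so
# "twice as hot" is meaningless on an interval scale.
print(temps_c[3] / temps_c[1], shifted[3] / shifted[1])
```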
d) Ratio Scale is the highest level of measurement scales. This has the properties
of an interval scale together with a fixed (absolute) zero point. The absolute zero
point allows us to construct a meaningful ratio. Examples of ratio scales include
weights, lengths and times. In marketing research, most counts are ratio
scales. For example, the number of customers at a bank’s ATM in the last
three months is on a ratio scale, because you can meaningfully compare it with
the count for the previous three months. Ratio scales permit the researcher to compare both
differences in scores and relative magnitude of scores. For example, the
difference between 10 and 15 minutes is the same as the difference between 25
and 30 minutes and 30 minutes is twice as long as 15 minutes. Most financial
research that deals with rupee values utilizes ratio scales. However, for most
behavioural research, interval scales are typically the highest form of
measurement. Most statistical data analysis procedures do not distinguish
between the interval and ratio properties of the measurement scales and it is
sufficient to say that all the statistical operations that can be performed on
interval scale can also be performed on ratio scales.
Now you must be wondering why you should know the level of measurement.
Knowing the level of measurement helps you to decide on how to interpret the
data. For example, when you know that a measure is nominal then you know
that the numerical values are just short codes for longer textual names. Also,
knowing the level of measurement helps you to decide what statistical analysis is
appropriate on the values that were assigned. For example, if you know that a
measure is nominal, then you would not compute the mean of the data values or
perform a t-test on the data (the t-test will be discussed in Unit 16 of the course).
It is important to recognise that there is a hierarchy implied in the levels of
measurement. At lower levels of measurement, assumptions tend to be less
restrictive and data analyses tend to be less sensitive. At each level up the
hierarchy, the current level includes all the qualities of the one below it and adds
something new. In general, it is desirable to have a higher level of measurement
(that is, interval or ratio) rather than a lower one (that is, nominal or ordinal).
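The hierarchy can be sketched as a simple lookup of permissible statistics per level; the table below is an illustrative summary under the usual conventions, not an exhaustive one:

```python
# Illustrative summary of the hierarchy: each level permits every
# statistic of the level below it and adds something new.
PERMISSIBLE = {
    "nominal":  {"frequency count", "mode"},
    "ordinal":  {"frequency count", "mode", "median"},
    "interval": {"frequency count", "mode", "median", "mean",
                 "standard deviation"},
    "ratio":    {"frequency count", "mode", "median", "mean",
                 "standard deviation", "ratio comparison"},
}

def allowed(level: str, statistic: str) -> bool:
    """Return True if the statistic is meaningful at the given level."""
    return statistic in PERMISSIBLE[level]

# The hierarchy is nested: lower levels are subsets of higher ones
assert (PERMISSIBLE["nominal"] <= PERMISSIBLE["ordinal"]
        <= PERMISSIBLE["interval"] <= PERMISSIBLE["ratio"])
assert not allowed("nominal", "mean")   # e.g. no mean on nominal data
```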
1) The main difference between interval scale and the ratio scale in terms of their
properties is:
...................................................................................................................
....................................................................................................................
....................................................................................................................
2) Why should the researcher know the level of measurement?
....................................................................................................................
....................................................................................................................
...................................................................................................................
The comparative scales can further be divided into the following four types of
scaling techniques: (a) Paired Comparison Scale, (b) Rank Order Scale, (c)
Constant Sum Scale, and (d) Q-sort Scale.
A √ in a particular box means that the brand in that column was preferred
over the brand in the corresponding row. In the above recording, Coke was
preferred over Sprite and over Limca, so Coke was preferred 2 times.
Similarly, Pepsi was preferred over Coke, over Sprite and over Limca, so
Pepsi was preferred 3 times. Thus, the number of times a brand was preferred
is obtained by summing the √s in each column.
The following table gives paired comparison data (assumed) for four brands
of cold drinks.
Table 5.2
Paired comparison is useful when the number of brands are limited, since it
requires direct comparison and overt choice. One of the disadvantages of paired
comparison scales is that a violation of the assumption of transitivity may
occur. For example, in our example (Table 5.1) the respondent preferred Coke
2 times, Pepsi 3 times, Sprite 1 time, and Limca 0 times. That means,
preference-wise, Pepsi > Coke, Coke > Sprite, and Sprite > Limca. Transitivity
requires that if A > B and B > C, then A > C; a violation occurs when the
respondent nevertheless prefers C over A. Also, the order in which the objects
are presented may bias the results. The number of items/brands for comparison
should not be too many. As the number of items increases, the number of
comparisons increases geometrically. If the number of comparisons is too large,
the respondents may become fatigued and no longer be able to carefully
discriminate among them. The other limitation of paired comparison is that this
scale has little resemblance to the market situation, which involves selection
from multiple alternatives. Also, respondents may prefer one item over certain
others, but they may not like it in an absolute sense.
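The tick-counting procedure can be sketched as follows; the winners dictionary below is hypothetical, chosen to mirror the preference counts assumed in the Table 5.1 discussion:

```python
from itertools import combinations

brands = ["Coke", "Pepsi", "Sprite", "Limca"]

# With n brands there are n(n-1)/2 pairs: 4 brands give 6 comparisons
pairs = list(combinations(brands, 2))
assert len(pairs) == len(brands) * (len(brands) - 1) // 2

# Hypothetical winner of each pairwise comparison
winners = {("Coke", "Pepsi"): "Pepsi",  ("Coke", "Sprite"): "Coke",
           ("Coke", "Limca"): "Coke",   ("Pepsi", "Sprite"): "Pepsi",
           ("Pepsi", "Limca"): "Pepsi", ("Sprite", "Limca"): "Sprite"}

# Preference count = number of comparisons each brand won
counts = {b: sum(1 for w in winners.values() if w == b) for b in brands}
print(counts)   # {'Coke': 2, 'Pepsi': 3, 'Sprite': 1, 'Limca': 0}
```

The geometric growth of pairs is also visible here: 7 brands would already require 21 comparisons.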
b) Rank Order Scale: In this scale, the respondents are presented with several
items simultaneously and asked to rank them according to some criterion.

Table 5.3: Preference of cold drink brands using rank order scaling
Instructions: Rank the following brands of cold drinks in order of
preference. Begin by picking out the one brand you like most and assign it
the number 1. Then find the second most preferred brand and assign it the
number 2. Continue this procedure until you have ranked all the brands of
cold drinks in order of preference. The least preferred brand should be
assigned a rank of 4. Also remember that no two brands should receive the
same rank.
Format:
Brand Rank
(a) Coke 3
(b) Pepsi 1
(c) Limca 2
(d) Sprite 4
Like paired comparison, the rank order scale is also comparative in nature. The
resultant data are ordinal. This method is more realistic in obtaining
responses and it yields better results when direct comparisons are
required between the given objects. The major disadvantage of this technique is
that only ordinal data can be generated.
c) Constant Sum Scale: In this scale, the respondents are asked to allocate a
constant sum of units such as points, rupees, or chips among a set of stimulus
objects with respect to some criterion. For example, you may wish to determine
how important the attributes of price, fragrance, packaging, cleaning power, and
lather of a detergent are to consumers. Respondents might be asked to divide a
constant sum to indicate the relative importance of the attributes using the
following format.
Table 5.4: Importance of detergent attributes using a constant sum scale
If an attribute is assigned a higher number of points, it indicates that the
attribute is more important. From the above Table, the price of the detergent is
the most important attribute for the consumers, followed by cleaning power
and packaging. Fragrance and lather are the two attributes that the consumers
cared about the least, and equally so. The advantage of this technique is
saving time. However, there are two main disadvantages. The respondents may
allocate more or fewer points than the number specified. The second problem is
rounding-off error if too few attributes are used; on the other hand, the use of
a large number of attributes may be too taxing on the respondent, causing
confusion and fatigue.
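A minimal sketch of tabulating constant sum data (all figures hypothetical): each respondent's allocation must total the fixed sum, and the attribute averages then indicate relative importance.

```python
# Hypothetical constant-sum allocations: three respondents each divide
# 100 points among five detergent attributes
allocations = {
    "price":          [50, 40, 45],
    "cleaning power": [30, 35, 30],
    "packaging":      [10, 15, 15],
    "fragrance":      [5, 5, 5],
    "lather":         [5, 5, 5],
}
n_respondents = 3

# Validity check: every respondent's points must total the fixed sum
for i in range(n_respondents):
    assert sum(points[i] for points in allocations.values()) == 100

# The average points per attribute indicate its relative importance
importance = {attr: sum(pts) / len(pts) for attr, pts in allocations.items()}
most_important = max(importance, key=importance.get)
print(most_important, importance[most_important])   # price 45.0
```

A real survey would also need the first check at data-entry time, to catch respondents who allocate more or fewer points than specified.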
d) Q-Sort Scale: This is a comparative scale that uses a rank order procedure to
sort objects based on similarity with respect to some criterion. The important
characteristic of this methodology is that it is more important to make
comparisons among different responses of a respondent than the responses
between different respondents. Therefore, it is a comparative method of scaling
rather than an absolute rating scale. In this method the respondent is given a
large number of statements describing the characteristics of a product, or a
large number of brands of a product. For example, you may wish to determine
the preference from among a large number of magazines. The following format
shown in Table 5.5 may be given to a respondent to obtain the preferences.
Table 5.5: Preference of Magazines Using Q-Sort Scale Procedure
Note that the number of responses to be sorted should be neither fewer than
60 nor more than 140. A reasonable range is 60 to 90 responses, which results
in a normal or quasi-normal distribution. This method is faster and less tedious
than paired comparison measures. It also forces the subject to conform to
quotas at each point of the scale so as to yield a quasi-normal distribution. The utility of Q-
sort in marketing research is to derive clusters of individuals who display similar
preferences, thus representing unique market segments.
Question: How would you rate the TV advertisement as a guide for buying?
Scale Type A:
Strongly agree ——————————————— Strongly disagree

Scale Type B:
Strongly disagree ——————————————— Strongly agree

Scale Type C:
Strongly agree  10  9  8  7  6  5  4  3  2  1  0  Strongly disagree

Scale Type D:
Strongly disagree  0  1  2  3  4  5  6  7  8  9  10  Strongly agree
When scale types A and B are used, the respondent’s score is determined
either by dividing the line into as many categories as desired and assigning the
respondent a score based on the category into which his/her mark falls, or by
measuring the distance, in millimetres, centimetres, or inches, from either end
of the scale. Whichever of the above continuous scales is used, the results are
normally analysed as interval scaled.
The itemised rating scales can be in the form of: (a) graphic, (b) verbal, or
(c) numeric, as shown below:
Some rating scales may have only two response categories such as : agree and
disagree. Inclusion of more response categories provides the respondent more
flexibility in the rating task. Consider the following questions:
1. How often do you visit the supermarket located in your area of residence?
2. In your case how important is the price of brand X shoes when you buy them?
Each of the above category scales is a more sensitive measure than a scale
with only two responses, since it provides more information.
Table 5.6: Some common words for categories used in Itemised Rating scales

Quality: Excellent / Good / Not decided / Poor / Worst
Quality: Very good / Good / Neither good nor bad / Fair / Poor
Importance: Very important / Fairly important / Neutral / Not so important / Not at all important
Interest: Very interested / Somewhat interested / Neither interested nor disinterested / Somewhat uninterested / Not very interested
Satisfaction: Completely satisfied / Somewhat satisfied / Neither satisfied nor dissatisfied / Somewhat dissatisfied / Completely dissatisfied
Frequency: All of the time / Very often / Often / Sometimes / Hardly ever
Frequency: Very often / Often / Sometimes / Rarely / Never
Truth: Very true / Somewhat true / Not very true / Not at all true
Purchase Interest: Definitely will buy / Probably will buy / Probably will not buy / Definitely will not buy
Level of Agreement: Strongly agree / Somewhat agree / Neither agree nor disagree / Somewhat disagree / Strongly disagree
Dependability: Completely dependable / Somewhat dependable / Not very dependable / Not at all dependable
Style: Very stylish / Somewhat stylish / Not very stylish / Completely unstylish
Cost: Extremely expensive / Expensive / Neither expensive nor inexpensive / Slightly inexpensive / Very inexpensive
Ease of use: Very easy to use / Somewhat easy to use / Not very easy to use / Difficult to use
Modernity: Very modern / Somewhat modern / Neither modern nor old-fashioned / Somewhat old-fashioned / Very old-fashioned
Alert: Very alert / Alert / Not alert / Not at all alert
In this section we will discuss three itemised rating scales, namely (a) Likert
Scale, (b) Semantic Differential Scale, and (c) Stapel Scale.
Each respondent is asked to circle his/her opinion on a score against each
statement. The final score for the respondent on the scale is the sum of his/her
ratings for all the items. The very purpose of the Likert scale is to ensure that
the final items evoke a wide response and discriminate among those with
positive and negative attitudes. Items that are poor (because they lack clarity
or elicit mixed response patterns) are deleted from the final statement list. This
makes it possible to discriminate between high positive scores and high
negative scores. However, many business researchers do not follow this
procedure, and then you may not be in a position to distinguish between high
positive scores and high negative scores
is difficult to know what a single summated score means. Many patterns of
response to the various statements can produce the same total score. The other
disadvantage of Likert Scale is that it takes longer time to complete than other
itemised rating scales because respondents have to read each statement.
Despite the above disadvantages, this scale has several advantages. It is easy to
construct, administer and use.
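The summated scoring described above can be sketched as follows; the statements, the ratings, and the choice of which item is negatively worded are all hypothetical:

```python
# 5-point scale: 1 = strongly agree ... 5 = strongly disagree.
# Negatively worded statements are reverse-coded before summing, so that
# a low total consistently indicates a favourable attitude.

responses = {                  # one respondent's ratings (hypothetical)
    "food is tasty": 2,        # positively worded
    "service is slow": 4,      # negatively worded
    "prices are fair": 1,      # positively worded
}
negative_items = {"service is slow"}

def summated_score(resp, negative, points=5):
    """Sum the ratings, reverse-coding negatively worded items (1<->5, 2<->4)."""
    total = 0
    for item, rating in resp.items():
        total += (points + 1 - rating) if item in negative else rating
    return total

print(summated_score(responses, negative_items))   # 2 + (6 - 4) + 1 = 5
```

This also illustrates the disadvantage noted above: many different response patterns can produce the same total of 5.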
b) Semantic Differential Scale: This is a seven point rating scale with end points
associated with bipolar labels (such as good and bad, complex and simple) that
have semantic meaning. The Semantic Differential scale is used for a variety of
purposes. It can be used to find whether a respondent has a positive or negative
attitude towards an object. It has been widely used in comparing brands,
products and company images. It has also been used to develop advertising and
promotion strategies and in a new product development study. Look at the
following Table, for examples of Semantic Differential Scale.
Modern — — — — — — — Old-fashioned
Good — — — — — — — Bad
Clean — — — — — — — Dirty
Important — — — — — — — Unimportant
Expensive — — — — — — — Inexpensive
Useful — — — — — — — Useless
Strong — — — — — — — Weak
Quick — — — — — — — Slow
In the Semantic Differential scale only extremes have names. The extreme
points represent the bipolar adjectives with the central category representing the
neutral position. The in-between categories have blank spaces. A weight is
assigned to each position on the scale. The weights can be, for example, +3,
+2, +1, 0, –1, –2, –3 or 7, 6, 5, 4, 3, 2, 1. The following is an example of a Semantic
Differential Scale to study the experience of using a particular brand of body
lotion.
In my experience, the use of body lotion of Brand-X was:
+3 +2 +1 0 –1 –2 –3
Useful Useless
Attractive Unattractive
Passive Active
Beneficial Harmful
Interesting Boring
Dull Sharp
Pleasant Unpleasant
Cold Hot
Good Bad
Likable Unlikable
In the Semantic Differential scale, the phrases used to describe the object form
a basis for attitude formation in the form of positive and negative phrases. The
negative phrase is sometimes put on the left side of the scale and sometimes on
the right side. This is done to prevent a respondent with a positive attitude from
simply checking the left side and a respondent with a negative attitude checking
on the right side without reading the description of the words.
The respondents are asked to check the individual cells depending on the
attitude. Then one could arrive at the average scores for comparisons of
different objects. The following Figure shows the experiences of 100
consumers on 3 brands of body lotion.
+3 +2 +1 0 –1 –2 –3
Useful Useless
Attractive Unattractive
Passive Active
Beneficial Harmful
Interesting Boring
Dull Sharp
Pleasant Unpleasant
Cold Hot
Good Bad
Likable Unlikable
Brand-X Brand-Y Brand-Z
In the above example, first the individual respondent scores for each dimension
are obtained and then the average scores of all 100 respondents, for each
dimension and for each brand were plotted graphically. The maximum score
possible for each brand is + 30 and the minimum score possible for each brand
is –30. Brand-X has score +14. Brand-Y has score +7, and Brand-Z has score
–11. From the scale we can identify which phrase needs improvement for each
Brand. For example, Brand-X needs to be improved upon benefits and Brand-Y
on pleasantness, coldness and likeability. Brand Z needs to be improved on all
the attributes.
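The averaging step can be sketched as follows, using hypothetical ratings on two of the dimensions:

```python
# Hypothetical semantic differential ratings (+3 .. -3) given by five
# respondents to Brand-X on two of the bipolar dimensions
ratings = {
    "Useful/Useless":      [3, 2, 2, 1, 2],
    "Pleasant/Unpleasant": [1, 0, 2, 1, 1],
}

# Average score per dimension: these are the values plotted as the profile
profile = {dim: sum(r) / len(r) for dim, r in ratings.items()}
print(profile)   # {'Useful/Useless': 2.0, 'Pleasant/Unpleasant': 1.0}

# Summing the dimension averages gives the brand's overall score; with
# ten dimensions weighted +3 to -3 the possible range is -30 to +30
overall = sum(profile.values())
```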
c) Stapel Scale: The Stapel scale was originally developed to measure the
direction and intensity of an attitude simultaneously. Modern versions of the
Stapel scale place a single adjective as a substitute for the Semantic
Differential when it is difficult to create pairs of bipolar adjectives. The
modified Stapel scale places a single adjective in the centre of an even number
of numerical values (say, +4, +3, +2, +1, –1, –2, –3, –4). This scale measures
how close to or how distant from the adjective a given stimulus is perceived to
be. The following is an example of a Stapel scale.
+4 +3 +2 +1 –1 –2 –3 –4
Fast Service
Friendly
Honest
Convenient Location
Convenient Hours
Dull
Good Services
High Saving Rates
Each respondent is asked to circle his/her opinion on a score against each
phrase that describes the object. The final score of the respondent on the scale
is the sum of his/her ratings for all the items. Also, the average score for each
phrase is obtained by totalling the final scores of all the respondents for that
phrase and dividing by the number of respondents. The following Figure
shows the opinions of 100 respondents on two banks.
+4 +3 +2 +1 –1 –2 –3 –4
Fast Service
Friendly
Honest
Convenient Location
Convenient Hours
Dull
Good Services
High Saving Rates
Bank-X Bank-Y
In the above example, first the individual respondent’s scores for each phrase
that describes the selected bank are obtained, and then the average scores of
all 100 respondents for each phrase are plotted graphically. The maximum
score possible for each bank is +32 and the minimum possible score is –32. In
the example, Bank-X has a score of +24, and Bank-Y has a score of +3. From
the scale we can identify which phrase needs improvement for each Bank.
The advantages and disadvantages of the Stapel scale are very similar to those
for the Semantic Differential scale. However, the Stapel scale tends to be easier
to construct and administer, especially over the telephone, since the Stapel scale
does not call for bipolar adjectives as the Semantic Differential scale does.
However, research on comparing the Stapel scale with Semantic differential
scale suggests that the results of both the scales are largely the same.
This is a non-comparative scale, since it deals with a single concept (the brand
of a detergent). On the other hand, a comparative scale asks a respondent to
rate one concept in comparison with another. For example, you may ask:

Which one of the following brands of detergent do you prefer?

Brand-X    Brand-Y

In this example you are comparing one brand of detergent with another brand.
Therefore, in many situations, comparative scaling presents ‘the ideal situation’
as a reference for comparison with actual situation.
4) Number of Categories: While there is no single optimal number of categories,
traditional guidelines suggest that there should be between five and nine
categories. Also, if a neutral or indifferent scale response is possible for at least
some of the respondents, an odd number of categories should be used.
However, the researcher must determine the number of meaningful positions
that are best suited for a specific problem.
1) In paired comparison, the order in which the objects are presented may
____________ results.
2) A researcher wants to measure consumer preference between 7 brands of
bath soap and has decided to use the paired comparison scaling technique.
How many pairs of brands will the researcher present to the respondents?
________________
3) In a semantic differential scale there are 20 scale items. Should all the
positive adjectives be on the left side and all the negative adjectives on the
right side? Can you explain?
................................................................................................................
................................................................................................................
................................................................................................................
5.7 LET US SUM UP
Ordinal Scale : In this scale, the items are ranked according to whether they
have more or less of a characteristic.
Paired Comparison Scale : This is a comparative scaling technique in which
a respondent is presented with two objects at a time and asked to select one
object according to some criterion.
Q-Sort Scale : This is a comparative scale that uses a rank order procedure
to sort objects based on similarity with respect to some criterion.
Rank Order Scale : In this scale, the respondents are presented with several
items simultaneously and asked to order or rank them according to some
criterion.
Ratio Scale : Ratio scales permit the researcher to compare both differences
in scores and relative magnitude of scores.
Scaling : Scaling is the assignment of objects to numbers or semantics
according to a rule.
Semantic Differential Scale : This is a seven point rating scale with end
points associated with bipolar labels (such as good and bad, complex and
simple) that have semantic meaning.
Stapel Scale : The Stapel scale places a single adjective as a substitute for the
Semantic Differential when it is difficult to create pairs of bipolar adjectives.
2) 21
3) No. Some of the positive adjectives may be placed on the left side and
some on the right side. This prevents the respondent with positive
(negative) attitude from simply checking the left (right) side without
reading the description of the words.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 6 PROCESSING OF DATA
STRUCTURE
6.0 Objectives
6.1 Introduction
6.2 Editing of Data
6.3 Coding of Data
6.4 Classification of Data
6.4.1 Types of Classification
6.4.1.1 Classification According to External Characteristics
6.4.1.2 Classification According to Internal Characteristics
6.4.1.3 Preparation of Frequency Distribution
6.5 Tabulation of Data
6.5.1 Types of Tables
6.5.2 Parts of a Statistical Table
6.5.3 Requisites of a Good Statistical Table
6.6 Let Us Sum Up
6.7 Key Words
6.8 Answers to Self Assessment Exercises
6.9 Terminal Questions/Exercises
6.10 Further Reading
6.0 OBJECTIVES
After studying this unit, you should be able to:
l evaluate the steps involved in processing of data,
l check for obvious mistakes in data and improve the quality of data,
l describe various approaches to classify data,
l construct frequency distribution of discrete and continuous data, and
l develop an appropriate data tabulation device.
6.1 INTRODUCTION
In Unit 3 we have discussed various methods of collection of data. Once the
collection of data is over, the next step is to organize data so that meaningful
conclusions may be drawn. The information content of the observations has to
be reduced to a relatively few concepts and aggregates. The data collected
from the field has to be processed as laid down in the research plan. This is
possible only through systematic processing of data. Data processing involves
editing, coding, classification and tabulation of the data collected so that they
are amenable to analysis. This is an intermediary stage between the collection
of data and their analysis and interpretation. In this unit, therefore, we will learn
about different stages of processing of data in detail.
Care should be taken to see that the data are as accurate and complete as
possible, and that the units of observation and the number of decimal places
are the same for the same variable.
The following practical guidelines may be handy while editing the data:
1) The editor should have a copy of the instructions given to the interviewers.
2) The editor should not destroy or erase the original entry. Original entry should
be crossed out in such a manner that they are still legible.
3) All answers, which are modified or filled in afresh by the editor, have to be
indicated.
4) All completed schedules should have the signature of the editor and the date.
For checking the quality of data collected, it is advisable to take a small sample
of the questionnaire and examine them thoroughly. This helps in understanding
the following types of problems: (1) whether all the questions are answered, (2)
whether the answers are properly recorded, (3) whether there is any bias, (4)
whether there is any interviewer dishonesty, (5) whether there are
inconsistencies. At times, it may be worthwhile to group the same set of
questionnaires according to the investigators (whether any particular investigator
has specific problems) or according to geographical regions (whether any
particular region has specific problems) or according to the sex or background
of the investigators, and corrective actions may be taken if any problem is
observed.
Numerical answers to be converted to the same units: Against the question
“What is the plinth area of your house?”, answers could be either in square
feet or in square metres. It will be convenient to convert all the answers to
such questions into the same unit, square metres for example.
A careful study of the answers is the starting point of coding. Next, a coding
frame is to be developed by listing the answers and by assigning the codes to
them. A coding manual is to be prepared with the details of variable names,
codes and instructions. Normally, the coding manual should be prepared before
the collection of data, except for open-ended and partially coded questions;
these two categories are to be taken care of after the data collection. The following are
the broad general rules for coding:
2) Each qualitative question should have codes. Quantitative variables may or may
not be coded depending on the purpose. Monthly income should not be coded if
one of the objectives is to compute average monthly income. But if it is used as
a classificatory variable it may be coded to indicate poor, middle or upper
income group.
3) All responses including “don’t know”, “no opinion” “no response” etc., are to
be coded.
Sometimes it is not possible to anticipate all the responses and some questions
are not coded before collection of data. Responses of all the questions are to
be studied carefully and codes are to be decided by examining the essence of
the answers. In partially coded questions, usually there is an option “Any Other
(specify)”. Depending on the purpose, responses to this question may be
examined and additional codes may be assigned.
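A coding frame of this kind can be sketched as a simple lookup; the specific codes, the question, and the handling of “Any Other (specify)” below are hypothetical:

```python
# Hypothetical coding frame for a partially coded question
coding_frame = {
    "yes": 1,
    "no": 2,
    "don't know": 8,    # non-substantive responses are coded too
    "no response": 9,
}
OTHER = 7               # "Any Other (specify)" -- examined after collection

def code_answer(answer: str) -> int:
    """Map a raw answer to its code; unanticipated answers fall into OTHER."""
    return coding_frame.get(answer.strip().lower(), OTHER)

print(code_answer("Yes"))              # 1
print(code_answer("maybe next year"))  # 7 (falls into "Any Other")
```

After the data are collected, the answers collected under OTHER can be examined and split into additional codes if needed.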
female based on sex). This may go on, based on other attributes like married
and unmarried, rural and urban, and so on. The following table is an example
of manifold classification.

Population
    Employed    Unemployed
0 12 1,000-2,000 6
1 25 2,000-3,000 10
2 20 3,000-4,000 15
3 7 4,000-5,000 25
4 3 5,000-6,000 9
5 1 6,000-7,000 4
Total 68 Total 69
Illustration 1: A survey of 50 college students was conducted to know how
many times a week they go to the theatre to see movies. The following
data were obtained:
3 2 2 1 4 1 0 1 1 2 4 1 3 3 2 1 3 4 3 2 0 1 3 4 3
1 4 3 2 2 1 3 1 2 3 2 3 4 4 2 4 3 4 2 3 3 2 0 4 3
To have a discrete frequency table, we may take the help of ‘Tally’ marks as
indicated below.

    Number of visits   Frequency
           0               5
           1               8
           2              12
           3              15
           4              10
        Total             50
From the above frequency table it is clear that more than half the students (27
out of 50) go to the theatre twice or thrice a week and very few do not go
even once a week. These were not so obvious from the raw data.
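The tallying above can be reproduced in a few lines of Python. The digits are copied from the illustration as printed (reproduction may have altered a digit or two, so treat individual counts as a sketch), but the method is exact:

```python
from collections import Counter

# Weekly theatre visits of 50 students, as printed in Illustration 1.
visits = [3,2,2,1,4,1,0,1,1,2,4,1,3,3,2,1,3,4,3,2,0,1,3,4,3,
          1,4,3,2,2,1,3,1,2,3,2,3,4,4,2,4,3,4,2,3,3,2,0,4,3]

# Counter does the tallying that tally marks do by hand.
freq = Counter(visits)
for value in sorted(freq):
    print(value, freq[value])
print("Total", sum(freq.values()))
```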
1) The highest and the lowest values of the observations are to be identified and
the lower limit of the first class and upper limit of the last class may be decided.
2) The number of classes is to be decided. There is no hard and fast rule: there
should not be too few (fewer than 5, say), to avoid high information loss, nor
too many (more than 12, say), so that the table does not become unmanageable.
3) The lower and the upper limits should be convenient numerals like 0-5, 0-10,
100-200 etc.
4) The class intervals should also be numerically convenient, like 5, 10, 20 etc., and
values like 3, 9, 17 etc., should be avoided.
5) As far as possible, the class width may be made uniform for ease in subsequent
calculation.
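Rules 1 to 5 can be sketched as a small helper that picks a convenient uniform class width. Rounding the width up to a multiple of 5 is one reasonable reading of rules 3 and 4, not the only one:

```python
import math

def class_intervals(lowest, highest, n_classes):
    """Return (lower, upper) class limits with a convenient uniform width."""
    raw_width = (highest - lowest) / n_classes
    width = math.ceil(raw_width / 5) * 5          # round up to a multiple of 5
    lower = (lowest // width) * width             # start at a convenient limit
    return [(lower + i * width, lower + (i + 1) * width)
            for i in range(n_classes)]

# Observations ranging from 12 to 58, grouped into 5 classes of width 10.
print(class_intervals(12, 58, 5))
# -> [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
```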
It is often quite useful to present the frequency distribution in two further
ways. One is the relative or percentage frequency distribution: relative
frequencies are computed by dividing the frequency of each class by the total
frequency, and multiplying the relative frequencies by 100 gives percentages.
The other is the cumulative frequency distribution, in which frequencies
are cumulated to give the total of all previous frequencies up to and including
the present class; cumulating may be done either from the lowest class (from
below) or from the highest class (from above). The following table illustrates
this concept.
Illustration 3
Column (5) in the above table gives cumulative frequency of a particular class,
which is obtained as discussed earlier. The cumulative frequency of the second
class is obtained by adding its class frequency (23) to the previous class
frequency (2). The cumulative frequency of the next class is obtained by adding
its class frequency (19) to the cumulative frequency of the previous class (25).
Cumulative frequencies may be interpreted as the number of observations below
the upper class limit of a class. For example, a cumulative frequency of 44 in
the third class (25-30) indicates that 44 labourers received a daily wage of less
than Rs. 30. Cumulation from the highest class may also be done as shown in
column (6). It has a similar interpretation.
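Both cumulations are mechanical. The sketch below uses the three class frequencies quoted in the text (2, 23 and 19); the remaining classes of the illustration are not reproduced here:

```python
# Frequencies of the first three classes discussed in the text.
freqs = [2, 23, 19]
total = sum(freqs)                                   # 44

relative = [f / total for f in freqs]                # relative frequencies
percentages = [100 * r for r in relative]            # percentage frequencies

cum_from_below = []                                  # "less than" cumulation
running = 0
for f in freqs:
    running += f
    cum_from_below.append(running)                   # -> [2, 25, 44]

cum_from_above = []                                  # "more than" cumulation
running = 0
for f in reversed(freqs):
    running += f
    cum_from_above.insert(0, running)                # -> [44, 42, 19]

print(cum_from_below, cum_from_above)
```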
Firstly, some simple graphs can be drawn to show all the frequency
distributions. This is done in the next unit (Unit 7). Secondly, frequency
distribution methods are also used for discrete data, if the number of
observations is large and the spread is wide.
Illustration 4

Sl.  Sales        Profit (Rupees in thousands)
No.  (Rupees    Upto 10  10-20  20-50  50-100  100-200  200 and  Total
     in lakhs)                                          more
1    Up to 1       10       3                                      13
2    1-2           12      12     19                               43
3    2-5           11      15     20     10       8                64
4    5-10           2       8     15      5      10                40
5    10-20                  2     12      4       9        6       33
6    20 and                       2      1       2        2        7
     more
7    Total         35      40     68     20      29        8      200
The above bivariate frequency table is prepared on the basis of sales and profit
data of 200 companies. As discussed earlier, class limits for both Sales and
Profit are decided first. Tally marks are placed in the appropriate row and column
(not shown here). Suppose a company’s Sales and Profit figures are Rs. 2.5
lakhs and Rs. 49,000 respectively. It is placed in class 3 of Sales (2 to 5 lakhs)
and column (5), showing the profit class interval of 20 to 50 thousand. The last
column (Column (9)) gives the total over all class intervals of Profit. Hence it
gives the frequency distribution of Sales. The frequency distribution in this
column is known as Marginal Frequency distribution of Sales. Similarly, the
figures in Serial No 7 (Row 7) are obtained by summing up over all the class
intervals of Sales. This is the frequency distribution of profit or the Marginal
Frequencies of Profit. The entire table is also known as Joint Frequency
Distribution of Sales and Profit.
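A joint frequency table with its marginal distributions can be built as follows. The class limits mirror the first few classes of the table above, but the three sample companies are invented for illustration:

```python
from collections import defaultdict

def cross_tabulate(pairs, row_classes, col_classes):
    """Count (sales, profit) pairs into a joint table plus marginals.

    `row_classes` / `col_classes` are (lower, upper) half-open intervals.
    """
    def locate(value, classes):
        for i, (lo, hi) in enumerate(classes):
            if lo <= value < hi:
                return i
        return None

    table = defaultdict(int)
    for sales, profit in pairs:
        r, c = locate(sales, row_classes), locate(profit, col_classes)
        if r is not None and c is not None:
            table[(r, c)] += 1

    # Marginal distributions: sum each row over all columns, and vice versa.
    row_marginal = [sum(table[(r, c)] for c in range(len(col_classes)))
                    for r in range(len(row_classes))]
    col_marginal = [sum(table[(r, c)] for r in range(len(row_classes)))
                    for c in range(len(col_classes))]
    return table, row_marginal, col_marginal

# Sales in lakhs, profit in thousands: the Rs. 2.5 lakh / Rs. 49,000
# company from the text falls in row (2, 5) and column (20, 50).
rows = [(0, 1), (1, 2), (2, 5), (5, 10)]
cols = [(0, 10), (10, 20), (20, 50), (50, 100)]
data = [(2.5, 49), (0.8, 5), (1.5, 15)]          # illustrative companies
table, row_m, col_m = cross_tabulate(data, rows, cols)
print(dict(table), row_m, col_m)
```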
Tables may be classified, depending upon the use and objectives of the data to
be presented, into simple tables and complex tables. Let us discuss them along
with illustrations.
Simple Table: In this case data are presented only for one variable or
characteristics. Therefore, this type of table is also known as one way table.
The table showing the data relating to the sales of a company in different years
is an example of a simple table.
Look at the following tables for an example of this type of table.
Illustration 5
Table 6.5 : Population of India During 1961–2001 (In thousands)
Census Year Population
1961 439235
1971 548160
1981 683329
1991 846303
2001 1027015
Source: Census of India, various documents.
Class     Frequency
20-30         2
30-40         5
40-50        21
50-60        19
60-70        11
70-80         5
80-90         2
Total        65
A simple table may be prepared for descriptive or qualitative data also. The
following example illustrates it.
Education Level              Number of Persons
Illiterate                          22
Literate but below primary          10
Primary                              5
High School                          2
College and above                    1
All                                 40
Complex Table: A complex table may contain data pertaining to more than
one characteristic. The population data given below is an example.
Illustration 6
Table 6.8 : Rural and Urban Population of India During 1961–2001
(In thousands)
Population
Census Year Rural Urban Total
1961 360298 78937 439235
1971 439046 109114 548160
1981 523867 159463 683329
1991 628691 217611 846303
2001 741660 285355 1027015
Note: The total may not add up exactly due to rounding off error.
Source: Census of India, various documents.
In the above example, rural and urban population may be subdivided into males
and females as indicated below.
Table 6.9 : Rural and Urban Population of India During 1961–2001 (sex-wise)
(In thousands)
Population
Census Year Rural Urban Total
Male Female Male Female Male Female
(1) (2) (3) (4) (5) (6) (7)
In each of the above categories, the persons could be grouped into child and
adult, worker and non-worker, or according to different age groups and so on.
A particular type of complex table that is of great use in research is a cross-
table, where the table is prepared based on the values of two or more
variables. The bivariate frequency table used earlier (illustration 4) is reproduced
here for illustration.
Illustration 7
Table 6.10 : Sales and Profit of 200 Companies
Sl.  Sales        Profit (Rupees in thousands)
No.  (Rupees    Upto 10  10-20  20-50  50-100  100-200  200 and  Total
     in lakhs)                                          more
(1)  (2)          (3)     (4)    (5)    (6)     (7)      (8)      (9)
1    Up to 1       10       3                                      13
2    1-2           12      12     19                               43
3    2-5           11      15     20     10       8                64
4    5-10           2       8     15      5      10                40
5    10-20                  2     12      4       9        6       33
6    20 and                       2      1       2        2        7
     more
7    Total         35      40     68     20      29        8      200
From a bivariate table, one may get some idea about the interrelationship
between the two variables. Suppose all the frequencies are concentrated in the
diagonal cells; then there is likely to be a strong relationship: a positive
relationship if the diagonal runs from the top-left corner to the bottom-right
corner, and a negative relationship if it runs from the bottom-left corner to
the top-right corner. If the frequencies are more or less equally distributed
over all the cells, then probably there is no strong relationship.
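One crude way to quantify "concentration in the diagonal cells" is the share of all frequencies lying on the main diagonal of a square cross-table. The two tables below are invented to contrast a strong and a weak pattern:

```python
def diagonal_share(table):
    """Fraction of the joint frequencies on the main diagonal."""
    total = sum(sum(row) for row in table)
    diag = sum(table[i][i] for i in range(len(table)))
    return diag / total

# Frequencies piled up on the diagonal: likely a strong relationship.
strong = [[10,  1,  0],
          [ 2, 12,  1],
          [ 0,  2,  9]]

# Frequencies spread evenly: probably no strong relationship.
weak = [[4, 4, 4],
        [4, 4, 4],
        [4, 4, 4]]

print(diagonal_share(strong), diagonal_share(weak))
```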
Multivariate tables may also be constructed but interpretation becomes difficult
once we go beyond two variables.
So far we have discussed and learnt about the types of tables and their
usefulness in presentation of data. Now, let us proceed to learn about the
different parts of a table, which enable us to have a clear understanding of the
rules and practices followed in the construction of a table.
A table should have the following four essential parts - title, caption or box
head (column), stub (row heading) and main data. At times it may also contain
an end note and source note below the table. The table should have a title,
which is usually placed above the statistical table. The title should be clearly
worded to give some idea of the table’s contents. Usually a report has many
tables. Hence the tables should be numbered to facilitate reference.
Caption refers to the title of the columns. It is also termed the “box head”.
There may be sub-captions under the main caption. Stub refers to the titles
given to the rows.
Some of these features are illustrated below with reference to the table on
Rural and Urban Population during 1961-2001, which was presented in earlier
illustration-6, Table 6.8.
1. Title of the Table: Rural and Urban Population of India during 1961–2001
   (in thousands)
   Stub: Census Year (rows numbered 1 to 5)
5. End Note: The total may not add up exactly due to rounding-off error.
Geographical: It can be used when the reader is familiar with the usual
geographical classification.
Based on Magnitude: At times, items in a table are arranged according to
the value of the characteristic. Usually the largest item is placed first and other
items follow in decreasing order. But this may be reversed also. Suppose that
state-wise population data is arranged in order of decreasing magnitude. This
will highlight the most populous state and the least populous state.
One point may be noted. The above arrangements are not exclusive. In a big
table, it is always possible and sometimes convenient to arrange the items
following two or three methods together. For example, it is possible to construct
a table in chronological order and within it in geographical order. Sometimes
information of the same table may be rearranged to produce another table to
highlight certain aspects. This will be clear from the following specimen tables.
Table A
Table B
Tables are prepared for making data easy to understand for the reader. It
should not be very large as the focus may be lost. A large table may be
logically broken into two or more small tables.
1) A good table must present the data in as clear and simple a manner as possible.
2) The title should be brief and self-explanatory. It should represent the
description of the contents of the table.
3) Rows and columns may be numbered to facilitate easy reference.
4) The table should not be too narrow or too wide. The space of columns and
rows should be carefully planned, so as to avoid unnecessary gaps.
5) Columns and rows which are directly comparable with one another should
be placed side by side.
6) Units of measurement should be clearly shown.
7) All the column figures should be properly aligned. Decimal points and plus
or minus signs also should be in perfect alignment.
8) Abbreviations should be avoided in a table. If their use is unavoidable, their
meanings must be clearly explained in a footnote.
9) If necessary, the derived data (percentages, indices, ratios, etc.) may also
be incorporated in the tables.
10) The sources of the data should be clearly stated so that the reliability of
the data could be verified, if needed.
Sl. No.  Caste  Education Level  Place of Origin
3        OC     PRIMARY          RURAL
4 BC ILLITERATE RURAL
6 SC PRIMARY RURAL
7 ST ILLITERATE RURAL
10 ST ILLITERATE RURAL
12 OC PRIMARY RURAL
14 BC ILLITERATE RURAL
15 SC PRIMARY RURAL
16 SC PRIMARY RURAL
20 ST PRIMARY URBAN
21 SC PRIMARY RURAL
23 ST PRIMARY RURAL
25 OC PRIMARY RURAL
27 OC PRIMARY URBAN
Processing of Data
6.6 LET US SUM UP
Once data collection is over, the next important steps are editing and coding.
Editing helps in maintaining consistency in quality of data. Editing is the first
stage in data processing. It is the process of examining the data collected to
detect errors and omissions and correct them for further analysis. Coding
makes further computation easier and necessary for efficient analysis of data.
Coding is the process of assigning some symbols to the answers. A coding
frame is developed by listing the answers and by assigning the codes to them.
The next stage is classification. Classification is the process of arranging data in
groups or classes on the basis of some characteristics. It helps in making
comparisons and drawing meaningful conclusions. The classified data may be
summarized by means of tabulations and frequency distributions. Cross
tabulation is particularly useful as it provides some clue about relationship and
its direction between two variables. Frequency distribution and its extensions
provide simple means to summarize data and for comparison of two sets of data.
Class Interval : The difference between the upper and lower limits of a class.
Class Limits : The lowest and the highest values that can be included in the
class.
Coding : A method to categorize data into groups and assign numerical values
or symbols to represent them.
Sl. No.   Class      Frequency
1         80–90          8
2         90–100        10
3         100–110       27
4         110–120       10
5         120–130        4
6         130–140        1
7         Total         60

Sl. No.   Class      Frequency
1         25–30          2
2         30–35         11
3         35–40         36
4         40–45          9
5         45–50          1
6         50–55          1
7         Total         60
Place    No. of Workers
Rural          34
Urban          16
All            50
Complex Table
Table : Distribution of 50 Unskilled Workers
Education Place of Origin
Level Rural Urban Total
SC ST BC OC Total SC ST BC OC Total
Illiterate 1 2 3 2 8 2 1 3 0 6 14
Below 1 0 2 3 6 1 0 0 2 3 9
Primary
Primary 5 2 1 6 14 0 1 0 3 4 18
High 2 0 0 4 6 0 0 0 3 3 9
School
Total 9 4 6 15 34 3 2 3 8 16 50
6) Draw a “less than” and “more than” cumulative frequency distribution for the
following data.
Income (Rs.)     500-600  600-700  700-800  800-900  900-1000
No. of families      25       40       65       35       15
7) What is tabulation? Draw the format of a statistical table and indicate its
various parts.
8) Describe the requisites of a good statistical table.
9) Prepare a blank table showing the age, sex and literacy of the population in a
city, according to five age groups from 0 to 100 years.
10) The following figures relate to the number of crimes (nearest-hundred) in four
metropolitan cities in India. In 1961, Bombay recorded the highest number of
crimes i.e. 19,400 followed by Calcutta with 14,200, Delhi 10,000 and Madras
5,700. In the year 1971, there was an increase of 5,700 in Bombay over its
1961 figure. The corresponding increase was 6,400 in Delhi and 1,500 in
Madras. However, the number of these crimes fell to 10,900 in the case of
Calcutta for the corresponding period. In 1981, Bombay recorded a total of
36,300 crimes. In that year, the number of crimes was 7,000 less in Delhi as
compared to Bombay. In Calcutta the number of crimes increased by 3,100 in
1981 as compared to 1971. In the case of Madras the increase in crimes was
by 8,500 in 1981 as compared to 1971. Present this data in tabular form.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 7 DIAGRAMMATIC AND GRAPHIC PRESENTATION
STRUCTURE
7.0 Objectives
7.1 Introduction
7.2 Importance of Visual Presentation of Data
7.3 Diagrammatic Presentation
7.3.1 Rules for Preparing Diagrams
7.4 Types of Diagrams
7.5 One Dimensional Bar Diagrams
7.5.1 Simple Bar Diagram
7.5.2 Multiple Bar Diagram
7.5.3 Sub-divided Bar Diagram
7.6 Pie Diagram
7.7 Structure Diagrams
7.7.1 Organisational Charts
7.7.2 Flow Charts
7.8 Graphic Presentation
7.9 Graphs of Time Series
7.9.1 Graphs of One Dependent Variable
7.9.2 Graphs of More Than One Dependent Variable
7.10 Graphs of Frequency Distribution
7.10.1 Histograms and Frequency Polygon
7.10.2 Cumulative Frequency Curves
7.11 Let Us Sum Up
7.12 Key Words
7.13 Answers to Self Assessment Exercises
7.14 Terminal Questions/Exercises
7.15 Further Reading
7.0 OBJECTIVES
After studying this Unit, you should be able to:
7.1 INTRODUCTION
In the previous Unit 6, you have studied the importance and techniques of
editing, coding, classification and tabulation that help to arrange the mass of
data (collected data) in a logical and precise manner. Tabulation is one of the
techniques for presentation of collected data which makes it easier to establish
trend, pattern, comparison etc. However, you might have noticed, it is a difficult
and cumbersome task for a researcher to interpret a table having a large mass
of numerical information. Sometimes it may fail to convey the message
meaningfully to the readers for whom it is meant. To overcome this
inconvenience, diagrammatic and graphic presentation of data has been invented
to supplement and explain the tables. Practically every day we can find the
presentation of cricket score, stock market index, cost of living index etc., in
news papers, television, magazines, reports etc. in the form of diagrams and
graphs. This kind of presentation is also termed as 'visual presentation' or
‘charting’.
In this unit, you will learn about the importance of visual presentation of
research data and some of the reasons why diagrammatic and graphic
presentation of data is so widely used. You will also study the different kinds of
diagrams and graphs, which are more popularly used for presenting the data in
research work. You will also learn the principles of presenting a frequency
distribution in the form of diagrams and graphs. As you are already familiar with graphs
and diagrams, we will proceed with further discussions.
1) They relieve the dullness of the numerical data: Any list of figures
becomes less comprehensible and more difficult to draw conclusions from as its
length increases. Scanning the figures from tables causes undue strain on the
mind. Data presented in the form of diagrams and graphs give a bird's eye
view of the entire data, create interest and leave a lasting impression on the
mind of readers.
2) They make comparison easy: This is one of the prime objectives of visual
presentation of data. Diagrams and graphs make quick comparison between two
or more sets of data simpler, and the direction of curves bring out hidden facts
and associations of the statistical data.
3) They save time and effort: The characteristics of statistical data, through
tables, can be grasped only after a great strain on the mind. Diagrams and
graphs reduce the strain and save a lot of time in understanding the basic
characteristics of the data.
6) They have become an integral part of research: In fact, nowadays it
is difficult to find any research work without visual support. The reason is that
this is the most convincing and appealing way of presenting the data. You can
find diagrammatic and graphic presentation of data in journals, magazines,
television, reports, advertisements etc. After having understood about the
importance of visual presentation, we shall move on to discuss about the
Diagrams and graphs which are more frequently used in the area of business
research.
1) You must have noted that the diagrams must be geometrically accurate.
Therefore, they should be drawn on the graphic axis i.e., ‘X’ axis (horizontal
line) and ‘Y’ axis (vertical line). However, the diagrams are generally drawn on
a plain paper after considering the scale.
2) While taking the scale on ‘X’ axis and ‘Y’ axis, you must ensure that the scale
showing the values should be in multiples of 2, 5, 10, 20, 50, etc.
3) The scale should be clearly set up, e.g., millions of tons, persons in Lakhs, value
in thousands etc. On ‘Y’ axis the scale starts from zero, as the vertical scale is
not broken.
4) Every diagram must have a concise and self explanatory title, which may be
written at the top or bottom of the diagram.
5) In order to draw the readers' attention, diagrams must be attractive and well
proportioned.
6) Different colours or shades should be used to exhibit various components of
diagrams and also an index must be provided for identification.
7) It is essential to choose a suitable type of diagram. The selection will depend
upon the number of variables, the minimum and maximum values, and the
objective of the presentation.
A large number of one-dimensional diagrams are available for presenting data,
such as the line diagram, simple bar diagram, multiple bar diagram, sub-divided
bar diagram, percentage bar diagram, deviation bar diagram, etc. We shall,
however, study only the simple bar diagram, multiple bar diagram, and
sub-divided bar diagram. Let us study these three kinds of diagrams with the
support of relevant illustrations.
Year               1995-96  1996-97  1997-98  1998-99  1999-00  2000-01  2001-02
Exports
(In Million kgs.)    167      209      410      316      192      215      160
Solution: The quantity of tea exported is given in million kgs. for different
years. A simple bar diagram will be constructed with 7 bars corresponding to
the 7 years. Now study the following vertical construction of bar diagram by
referring the guide lines for construction of simple bars, as explained in section
7.5.1.
[Vertical bar diagram: bars of 167, 209, 410, 316, 192, 215 and 160 million
kgs for the years 1995-96 to 2001-02.]
Figure 7.1: Simple Bar Diagram Showing the Tea Exports in Different Years.
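The same bars can be roughed out in plain Python. A real report would use a plotting library such as matplotlib; this text sketch only shows the proportions:

```python
# Tea exports from the illustration, one text bar per year.
years = ["1995-96", "1996-97", "1997-98", "1998-99",
         "1999-00", "2000-01", "2001-02"]
exports = [167, 209, 410, 316, 192, 215, 160]   # million kgs

scale = 10  # one '#' per 10 million kgs
for year, value in zip(years, exports):
    bar = "#" * round(value / scale)
    print(f"{year}  {bar} {value}")
```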
Illustration-2
The following data relates to the Profit and Loss of different industries in 1999-
2002. Present the data through simple bar diagram.
Solution : The given data represents positive and negative values i.e., profit
and loss. Let us draw the bars horizontally. Observe fig: 7.2 carefully and try
to understand the construction of simple bars horizontally.
[Horizontal bar diagram (Fig. 7.2), profit and loss figures by industry:
Sugar 14, Textiles –12, Oil 25, Cement 48.]
In this type of diagram, two or more than two bars are constructed side by
side horizontally for a period or related phenomenon. This type of diagram is
also called Compound Bar or Cluster Bar Diagram. The technique of preparing
such a diagram is the same as that of simple bar diagram. This diagram, on
the one hand, facilitates comparison of the values of different variables in a set
and on the other, it facilitates the comparison of the values of the same variable
over a period of time or phenomenon. To facilitate easy comparison, the
different bars of a set may be coloured or shaded differently to distinguish
between them. But the colour or shade for the bars representing the same
variable in different sets should be the same.
Let us consider the following illustration and learn the method of presentation
of the data in the form of a multiple bar diagram.
Illustration-3
[Chart title: Foreign Investment – Industry-wise Inflow; bar values include
2155, 1580, 1550, 1423, 1194 and 956 (Rs. in crores), grouped by the years
1997-98, 1998-99 and 1999-2000.]
Figure 7.3: Multiple Bar Diagram Showing the Inflow of Foreign Investment in Selected
Sectors During 1997-2000
The following table relates to the Indian Textile Exports to different countries:
Countries Year
1997-98 1998-99 1999-2000
USA 746.13 759.36 882.41
Germany 366.01 300.46 338.88
UK 403.07 337.94 341.42
Italy 241.64 233.14 215.48
Korea (Republic) 127.00 88.30 185.13
In this diagram, one bar is constructed for the total value of the different
components of the same variable. It is then sub-divided in proportion to the
values of the various components of that variable. This diagram shows the total
of the variable as well as the totals of its various components in a single bar.
Hence, the sub-divided bar serves the same purpose as multiple bars. The only
difference is that in a multiple bar diagram each component of a variable is
shown side by side, whereas in a sub-divided bar diagram each component is
shown one upon the other. It is also called a component bar diagram. This
method is suitable if the total values of the variables are small; otherwise the
scale becomes too small to depict the data. To study relative changes, all
components may be converted into percentages and drawn as sub-divided bars.
Such a construction is called a sub-divided percentage bar. The limitation is
that all the parts do not have a common base, which prevents accurate
comparison of the various components of a set.
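Converting a year's components to percentages for a sub-divided percentage bar is a one-line computation; the three component values below are illustrative:

```python
def to_percentages(components):
    """Express each component as a percentage of the total (1 decimal)."""
    total = sum(components)
    return [round(100 * c / total, 1) for c in components]

# e.g. one year's exports split over three destination countries
year_components = [378, 105, 56]
shares = to_percentages(year_components)
print(shares)  # percentages summing to (approximately) 100
```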
Illustration-4
The following data relates to India’s exports of electronic goods to different
countries during 1994-98. Represent the data by a sub-divided bar diagram.
[Sub-divided bars (Rs. in crores): each year's bar for 1994-95 to 1998-99 is
divided by destination country.]
Figure 7.4: Sub-divided Bar Diagram Showing India's Exports of Electronic Goods to
Different Countries During 1994-99.
Draw a sub-divided bar diagram for the following table. Do you agree that
this diagram is more effective for comparison of figures than the multiple
bar diagram? Justify your opinion.
In constructing a pie diagram the first step is to convert the various values of
components of the variable into percentages and then the percentages
transposed into corresponding degrees. The total percentage of the various
components i.e., 100 is taken as 360° (degrees around the centre of a circle)
and the degree of various components are calculated in proportion to the
percentage values of different components. It is expressed as:
Degrees for a component = (360° / 100) × component's percentage
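As a quick numerical sketch of this conversion (the component values below are invented for illustration, not taken from the unit):

```python
# Hypothetical component values of a variable (invented for illustration).
components = {"A": 30, "B": 45, "C": 25}

total = sum(components.values())
# Step 1: convert each component value into a percentage of the total.
percentages = {k: v / total * 100 for k, v in components.items()}
# Step 2: degrees for a component = (360 / 100) x component's percentage.
degrees = {k: 360 / 100 * p for k, p in percentages.items()}
```

The degrees of all components necessarily add up to 360, which is a useful check on the arithmetic.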
It should be noted that when the data comprises more than one variable, to show a two-dimensional effect for comparison among the variables, we have to obtain the square root of the total of each variable. These square roots would represent the radii of the circles, which are then sub-divided. A pie diagram helps us in emphasizing the area and in ascertaining the relationship between the various components as well as among the variables. However, compared to a bar diagram, a pie diagram is less effective for accurate interpretation when the components are large in number. Let us draw the pie diagram with the help of the data contained in the following table.
Illustration 5
[Pie chart (partially reproduced): segments of 10.9%, 18.2% and 56.4%; legend: Radio, Daily wages, Local traders, Co-farmers, Personal visits, Market office.]
What features of this distribution does your pie diagram mainly illustrate?
7.7 STRUCTURE DIAGRAMS
There are several important diagram formats used to display structural (qualitative) information in the form of charts. The format depends upon the nature of the information. Under this type of diagram we will discuss two different charts, i.e., (1) Organisational Charts and (2) Flow Charts.
Organisational charts are most commonly used to represent the internal structure of organisations. There is no standard format for this kind of diagram, as the design depends on the nature of the organisation. A special format is used in the following illustration, which relates to the organisational structure of IGNOU. Study Fig. 7.6 and try to understand the preparation of this kind of diagram for other organisations.
[Fig. 7.6 (organisational chart, partially reproduced): Visitor; Board of Management; Vice Chancellor; Pro-Vice Chancellors; …]
Flow charts are most commonly used in any situation where we wish to represent information that flows through different stages to its ultimate point. These charts can also be used to indicate the flow of various aspects, i.e., material flow, product flow (distribution channels), funds flow, etc.
The following Figure 7.7 relates to the marketing channels for fruits, which will give you an understanding of flow charts.
[Figure 7.7 (flow chart, partially reproduced): Growers; Pre-Harvest Contractors; Processors; Commission Agent in Wholesale Market; Wholesaler; Exports; Retailer; Consumer.]
7.8 GRAPHIC PRESENTATION
The shape of a graph offers easy and appropriate answers to several questions,
such as:
l The direction of curves on the graph makes it very easy to draw comparisons.
l The presentation of time series data on a graph makes it possible to interpolate or extrapolate the values, which helps in forecasting.
l The graph of a frequency distribution helps us to determine the values of Mode, Median, Quartiles, Percentiles, etc.
l The shape of the graph helps in demonstrating the degree of inequality and the direction of correlation.
For all such advantages it is necessary for a researcher to have an
understanding of different types of graphic presentation of data. In practice,
there are a variety of graphs which can be used to depict the data. However,
here we will discuss only a few graphs which are more frequently used in
business research.
Broadly, the graphs of statistical data may be classified into two types, one is
graphs of time series, another is graphs of frequency distribution. We will
discuss both these types, after studying the parts of a graph.
Parts of a Graph
The foremost requirement for a researcher is to be aware of the basic
principles for using the graph paper for presentation of statistical data
graphically.
[Chart 7.1 (omitted): the X and Y axes divide the graph paper into four quadrants. Quadrant I: X positive, Y positive; Quadrant II: X negative, Y positive; Quadrant III: X negative, Y negative; Quadrant IV: X positive, Y negative.]
Chart 7.1: Parts of a Graph
After understanding the above parts of a graph, let us study the different types
of graphs.
1) On X-axis we take the time as an independent variable and on Y axis the values
of data as dependent variable. Plot the different points corresponding to given
data; then the points are joined by a straight line in the order of time.
2) Equal magnitude of scale must be maintained on X-axis as well as on Y-axis.
3) The Y-axis normally starts with zero. In case there is a wide difference between the lowest value of the data and zero (the origin), the Y-axis can be broken and a false base line may be drawn. This is explained under a related problem later in this section.
4) If the variables are in different units, double scales can be taken on the Y-axis.
5) The scales adopted should be clearly indicated and the graph must have a
self-explanatory title.
6) Unfortunately, graphs lend themselves to considerable misuse. The same data can take different graphical shapes depending on the relative sizes of the two axes. In order to avoid such misrepresentation, the convention in research is to construct graphs, wherever possible, such that the vertical axis is around 2/3 to 3/4 the length of the horizontal axis.
After having learnt the principles for construction of historigrams, we move on to discuss their types. Among the various types that have been developed, the frequently used graphs are one-variable graphs and graphs with more than one dependent variable. We will now look at the construction of these graphs.
When there is only one dependent variable, the values of the dependent variable
are taken on Y axis, while the time is taken on X-axis. Study the following
illustration carefully and try to understand the method of construction for one
dependent variable historigrams.
Illustration 6
The following data relates to India's exports to USA during the period of 1994-
2000. Represent the data graphically.
[Figure 7.8 (line graph omitted): exports to the USA (in million $) plotted against the years 1994-2000, rising from 5310 in 1994 to 10687 in 2000.]
Figure 7.8: Historigram Showing India's Exports to USA During 1994-2000.
[Figure 7.9 (line graph omitted): the same data re-drawn with a false base line; the Y-axis is broken so that the scale starts near 5000 rather than zero.]
When the data of time series relate to more than one dependent variable,
curves must be drawn for each variable separately. These graphs are prepared
in the same manner as we prepare one dependent variable historigram. Let us
consider the following data to construct historigrams. Study Figure 7.10 carefully
and understand the procedure for preparation of this type of graph.
Years:                        1991  1992  1993  1994  1995  1996  1997  1998  1999  2000
Sales (Rs. in lakh):            31    58    42    65    75    80    72    96    83    98
Cost of Sales (Rs. in lakh):    42    50    48    55    82    75    62    80    67    73
Profit/Loss (Rs. in lakh):     –11    +8    –6   +10    –7    +5   +10   +20   +16   +25
Solution: The given data comprises three variables, so we have to draw a separate curve for each variable. In this graph it is not necessary to draw a false base line because the minimum value is close to the point of origin (zero). For easy identification, each curve is marked differently.
[Figure 7.10 (line graph omitted): three curves plotted against the years 1991-2000; Y-axis from –20 to 80 (Rs. in lakh).]
Figure 7.10 : Historigram Showing Sales, Cost of Sales and Profit/Loss of a Company
During 1991-2000
The above graph clearly reveals that, with the passage of time, the profits are rising after 1996, even though the sales are fluctuating slightly.
Let us study the procedure involved in the preparation of these types of graphs.
There should not be any gap between two successive rectangles, and the data must be in the exclusive form of classes. However, we cannot construct a histogram for a distribution with open-end classes, and it can be quite misleading if the distribution has unequal class intervals.
The value of mode can be determined from the histogram. The procedure for
locating the mode is to draw a straight line from the top right corner of the
highest rectangle (Modal Class) to the top right corner of the preceding
rectangle (Pre Modal Class). Similarly, draw a straight line from the top left
corner of the highest rectangle to top left corner of the succeeding rectangle
(Post Modal Class). Draw a perpendicular from the point of intersection of
these two straight lines to X-axis. The point where it meets the X-axis gives
the value of the mode. This is shown in Figure 7.11. However, graphic location of the mode is not possible in a multi-modal distribution.
Let us, now, take up an illustration to learn how to draw a histogram, and
frequency polygon practically and also determine the mode. The data relates to
the sales of computers by different companies.
Illustration-8
Sales (Rs. in crores):   0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80
No. of Companies:           8     20     35     50     90     70     30     15
[Figure 7.11 (omitted): a histogram with the frequency polygon superimposed; X-axis: sale of computers (Rs. in crores), Y-axis: number of companies; the perpendicular locating the mode meets the X-axis at Z = 46.67.]
Figure 7.11: Histogram and Frequency Polygon for Computer Sales of Various
Companies
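The graphical value Z = 46.67 shown in the figure can be cross-checked numerically using the standard interpolation formula for the mode, Mode = L + Δ1/(Δ1 + Δ2) × i (introduced later in this course), applied to the Illustration-8 frequencies; a sketch:

```python
# Classes and frequencies from Illustration-8 (sales in Rs. crores).
classes = [(0, 10), (10, 20), (20, 30), (30, 40),
           (40, 50), (50, 60), (60, 70), (70, 80)]
freq = [8, 20, 35, 50, 90, 70, 30, 15]

m = freq.index(max(freq))            # index of the modal class (40-50)
L = classes[m][0]                    # lower limit of the modal class
i = classes[m][1] - classes[m][0]    # width of the modal class
d1 = freq[m] - freq[m - 1]           # excess over the preceding class
d2 = freq[m] - freq[m + 1]           # excess over the succeeding class
mode = L + d1 / (d1 + d2) * i        # 40 + 40/(40+20) * 10
```

The result, 46.67, agrees with the graphical location Z marked on the histogram.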
Sometimes we are interested in knowing how many families there are in a city whose earnings are less than Rs. 5,000 p.m., or whose earnings are more than Rs. 20,000 p.m. In order to obtain this information, we have first of all to convert the ordinary frequency table into a cumulative frequency table. When
the frequencies are added they are called cumulative frequencies. The curves
so obtained from the cumulative frequencies are called ‘cumulative frequency
curves’, popularly known as “ogives”. There are two types of ogives namely
less than ogive, and more than ogive. Let us know about the procedure
involved in drawing these two ogives.
In less than ogive, we start with the upper limit of each class and the
cumulative (addition) starts from the top. When these frequencies are plotted
we get less than ogive. In case of more than ogive we start with the lower
limit of each class and the cumulation starts from the bottom. When these
frequencies are plotted we get more than ogive. You should bear in mind that
while drawing ogives the classes must be in exclusive form.
The ogives are useful to determine the number of items above or below a
given value. It is also useful for comparison between two or more frequency
distributions and to determine certain values (positional values) such as mode,
median, quartiles, percentiles etc. Let us take up an illustration to understand
how to draw ogives practically. Observe carefully the procedures involved in it.
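The two cumulations described above can be sketched in a few lines. The seven-class frequency distribution used here is an assumed example, not the operating-expenses data of the illustration that follows:

```python
from itertools import accumulate

# Assumed frequency distribution (class upper limits and frequencies).
upper_limits = [20, 25, 30, 35, 40, 45, 50]
freq = [2, 23, 19, 14, 5, 4, 3]

# 'Less than' cumulation starts from the top of the table.
less_than = list(accumulate(freq))
# 'More than' cumulation starts from the bottom of the table.
more_than = list(accumulate(freq[::-1]))[::-1]
```

Plotting `less_than` against the upper class limits gives the less-than ogive, and `more_than` against the lower class limits gives the more-than ogive.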
Note: Mode and Median are explained in Unit 8; similarly, quartiles are explained in Unit 9. This illustration can be better understood only after studying those units.
Illustration-9
The cumulative frequencies presented in the above table have the following
interpretation.
The ‘less than’ cumulative frequencies are to be read against upper class limits.
In contrast, the ‘more than’ cumulative frequencies are to be read against
lower class boundaries. For instance, there are 7 units with operating expenses
of less than Rs. 20,000, there are 160 units with operating expenses of less
than Rs. 120,000. On the other hand, there are 153 units with operating
expenses more than Rs. 60,000; no units with operating expenses more than or
equal to Rs. 2,00,000.
[Figure 7.12 (omitted): both ogives on one graph; X-axis: operating expenses (Rs. in '000), Y-axis: cumulative frequency from 0 to 160; marked positional values Q1 = 60.18, Me = 80.77, Q3 = 112.31.]
Fig 7.12: 'Less than' and 'More than' Cumulative Frequency Curves Showing the Operating Expenses (Rs. in '000) of Small Scale Industrial Units.
Now, look at Figure 7.12, which shows both the cumulative curves on the same graph. Study it carefully and understand the procedure for drawing ogives.
From the above ogives, the median can be located by drawing a perpendicular
from the intersection of the two ogives to X-axis. The point where the
perpendicular touches X-axis would be the Median of the distribution. Similarly,
the perpendicular drawn from the intersection of the two curves to the Y-axis
would divide the sum of frequencies into two equal parts. The values of
positional averages like Q1, D6, P50, etc., can also be located with the help of
an item’s value on the less than ogive. In the above figure determination of Q1
and Q3 are shown as an illustration.
c) How many sample families are approximately spending less than Rs. 3,800 on food?
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
We have discussed the method for constructing simple bar diagram, multiple bar
diagram, sub-divided bar diagram, pie diagram and structure diagrams.
Continuous Data : Data that may progress from one class to the next without a break and may be expressed by either fractions or whole numbers.
Discrete Data : Data that do not progress from one class to the next without a break, i.e., where classes represent distinct categories or counts, and may be represented by whole numbers only.
False Base Line : A line that is drawn between the origin point (zero) and the first centimetre, by breaking the Y-axis, in historigrams. Hence the scale of the Y-axis does not start at zero.
Flow Chart : Presents the information which flows through various situations
to the ultimate point.
E. Steps : 1) Find out the percentages of each reason for buying face cream.
2) Convert the percentages into degree of angle. 3) Then depict the percentages
in a circle with the help of their respective degree of angles.
7) Draw multiple bar and sub-divided bar diagrams to represent the following data relating to the enrolment of various programmes in an open university over a period of four years, and comment on it.
8) Construct a pie diagram to describe the following data which relates to the
amount spent on various heads under Rural development programme.
What features of this distribution does your pie diagram mainly illustrate?
9) The following table gives the index numbers of wholesale prices (average) of cereals, pulses and oilseeds over a period of 7 years. Compare these prices through a suitable graph.
10) Draw a histogram and frequency polygon of the following distribution. Locate the approximate mode with the help of the histogram.
11) The following data relating to the sales of 80 companies are given below:
Sales (Rs.Lakhs) No. of Companies
5-15 8
15-25 13
25-35 19
35-45 14
45-55 10
55-65 7
65-75 6
75-85 3
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
8.0 OBJECTIVES
After studying this unit, you should be able to:
l explain the meaning and use of percentages, ratios and rates for data
analysis,
l discuss the computational aspects involved in working out the statistical
derivatives,
l describe the concept and significance of various measures of central
tendency, and
l compute various measures of central tendency, such as arithmetic mean, weighted mean, median, mode, geometric mean, and harmonic mean.
8.1 INTRODUCTION
In Unit 6 we discussed the method of classifying and tabulating of data.
Diagrammatic and graphic presentations are covered in the previous unit
(Unit-7). They give some idea about the existing pattern of data. So far no big
numerical computation was involved. Quantitative data has to be condensed in a
meaningful manner, so that it can be easily understood and interpreted. One of
the common methods for condensing the quantitative data is to compute
statistical derivatives, such as Percentages, Ratios, Rates, etc. These are simple
derivatives. Further, it is necessary to summarise and analyse the data. The first
step in that direction is the computation of Central Tendency or Average, which
gives a bird's-eye view of the entire data. In this Unit, we will discuss computation
of statistical derivatives based on simple calculations. Further, numerical methods
for summarizing and describing data – measures of Central Tendency – are
discussed. The purpose is to identify one value, which can be obtained from the
data, to represent the entire data set.
8.2 STATISTICAL DERIVATIVES
Statistical derivatives are quantities obtained by simple computation from the given data. Though very easy to compute, they often give meaningful insight into the data. Here we discuss three often-used measures: percentage, ratio and rate. These measures point out an existing relationship among factors and thereby help in better interpretation.
8.2.1 Percentage
As we have noted earlier, the frequency distribution may be regarded as simple
counting and checking as to how many cases are in each group or class. The
relative frequency distribution gives the proportion of cases in individual classes.
On multiplication by 100, the percentage frequencies are obtained. Converting to percentages has some advantages: the data is now more easily understood, and comparison becomes simpler because percentages standardize the data.
useful in other tables also, and are particularly important in case of bivariate
tables. We show one application of percentages below. Let us try to understand
the following illustration.
Illustration 1
The following table gives the total number of workers and their categories for
all India and major states. Compute meaningful percentages.
Table: Total Workers and Their Categories-India and Major States : 2001
(In thousands)
Sl. No.   State/India   Cultivators   Agricultural Labourers   Household Industry Workers   Other Workers   Total Workers
Solution: In the table above, the row total gives the total workers of a state/all India, and the column total gives the aggregate values of the different categories of workers and of all workers. Thus, it is possible to compute meaningful percentages from both rows and columns. The row percentages are computed by dividing the figures in columns (3), (4), (5) and (6) by the figure in column (7) and multiplying by 100. The figures are presented in tabular form below. The percentage of cultivators in Jammu & Kashmir is obtained as (1600 ÷ 3688) × 100, which equals 43.38. Similarly, other figures are obtained.
1. Jammu &
Kashmir
2. Himachal
Pradesh
3. Punjab
4. Haryana
5. Rajasthan
6. Uttar Pradesh
7. Bihar
8. Assam
9. West Bengal
10. Orissa
11. Madhya
Pradesh
12. Gujarat
13. Maha-
rashtra
14. Andhra
Pradesh
15. Karnataka
16. Kerala
17. Tamil Nadu
INDIA 100.00 100.00 100.00 100.00 100.00
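The row-percentage computation described in the solution can be verified in a couple of lines; only the two figures quoted in the text (cultivators 1600 and row total 3688, both in thousands) are used here:

```python
# Figures for Jammu & Kashmir quoted in the solution (in thousands).
row_total = 3688     # total workers (column 7)
cultivators = 1600   # cultivators (column 3)

# Row percentage = category figure / row total x 100.
pct_cultivators = cultivators / row_total * 100
```

The same division is repeated for every category and every state to fill the percentage table.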
8.2.2 Ratio
between Rs 30-35 is 70:14 or 5:1. Wherever possible, it is convenient to reduce ratios to the form n1 : n2, the most preferred value of n2 being 1. Thus, representation in the form of a ratio also reduces the size of the numbers, which facilitates easy comparison and quick grasp. As the number of categories increases, the ratio is a better derivative for presentation as it will be easy and less confusing.
There are several types of ratios used in statistical work. Let us discuss them.
Time ratio: This ratio is a measure which expresses the changes in a series
of values arranged in a time sequence and is typically shown as percentage.
Mainly, there are two types of time ratios :
i) Those employing a fixed base period: Under this method, for instance, if
you are interested in studying the sales of a product in the current year, you
would select a particular past year, say 1990 as the base year and compare
the current year’s production with the production of 1990.
ii) Those employing a moving base: For example, for computation of the current year's sales, last year's sales would be taken as the base (for 1991, 1990 is the base; for 1992, 1991 is the base; and so on).
Ratios are more often used in financial economics to indicate the financial
status of an organization. Look at the following illustration:
Illustration 2
The following table gives the balance sheet of XYZ Company for the year
2002–03. Compute useful financial ratios.
Solution: Three common ratios may be computed from the above balance
sheet: current ratio, cash ratio, and debt-equity ratio. However, these ratios are
discussed in detail in MCO-05 : Accounting for Managerial Decisions, under
Unit-5 : Techniques of Financial Analysis.
Current ratio = (Current assets, loans and advances + current investments) / (Current liabilities and provisions + short-term debt)
              = (330 + 10) / (150 + 50 + 60) = 340 / 260 = 1.31
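The same computation in code, using the figures as read from the balance-sheet example:

```python
# Figures from the XYZ Company balance-sheet example.
current_assets = 330 + 10             # current assets, loans & advances + current investments
current_liabilities = 150 + 50 + 60   # current liabilities and provisions + short-term debt

current_ratio = current_assets / current_liabilities
```

A current ratio above 1 indicates that current assets exceed current liabilities; here it works out to 1.31.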
8.2.3 Rate
The concept of ratio may be extended to the rate. The rate is also a
comparison of two figures, but not of the same variable, and it is usually
expressed in percentage. It is a measure of the number of times a value occurs
in relation to the number of times the value could occur, i.e. number of actual
occurrences divided by number of possible occurrences. Unemployment rate in
a country is given by total number of unemployed person divided by total
number of employable persons. It is clear now that a rate is different from a
ratio. For example, we may say that in a town the ratio of the number of
unemployed persons to that of all persons is 0.05: 1. The same message would
be conveyed if we say that unemployment rate in the town is 0.05, or more
commonly, 5 per cent. Sometimes a rate is defined as the number of units of one variable corresponding to a single unit of another variable; the two variables could be in different units. For example, the seed rate refers to the amount of seed required per unit area of land. The following table gives some examples of rates.
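A minimal sketch of the unemployment-rate computation described above, using invented figures for a hypothetical town:

```python
# Invented figures for a hypothetical town (not from the unit).
unemployed = 5_000     # actual occurrences
employable = 100_000   # possible occurrences

rate = unemployed / employable   # rate as a proportion, e.g. 0.05
rate_percent = rate * 100        # the same rate expressed in per cent
```

Expressed as a ratio this would be 0.05 : 1, but as a rate it is simply quoted as 5 per cent.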
2) What is a rate?
..................................................................................................................
..................................................................................................................
..................................................................................................................
To start with, we list the properties that an ideal measure of central tendency should possess. Some of the measures are discussed in detail later.
Some of the important measures of central tendency which are most commonly used in business and industry are: arithmetic mean, weighted arithmetic mean, median, mode, geometric mean and harmonic mean. Among them, median and mode are positional averages and the rest are termed mathematical averages.
Most of the time, when we refer to the average of something, we are talking about the arithmetic mean. This is the most important measure of central tendency and is commonly called the mean.
Mean of ungrouped data: The mean or the arithmetic mean of a set of data is given by:

X̄ = (X₁ + X₂ + … + Xₙ) / N

This formula can be written more compactly as:

Arithmetic mean (x̄) = Σx / N

where Σx is the sum of the values of all observations and N is the number of observations. The Greek letter sigma, Σ, indicates "the sum of".
Illustration 3
Suppose that the wages (in Rs) earned by a labourer for 7 days are 22, 25, 29, 58, 30, 24 and 23. The mean wage of the labourer is given by:

X̄ = (22 + 25 + 29 + 58 + 30 + 24 + 23) / 7 = 211 / 7 = Rs. 30.14
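The computation of Illustration 3 can be reproduced directly:

```python
# Daily wages (in Rs) from Illustration 3.
wages = [22, 25, 29, 58, 30, 24, 23]

# Arithmetic mean = sum of observations / number of observations.
mean = sum(wages) / len(wages)   # 211 / 7
```

Note how the single unusually high wage (58) pulls the mean above most of the individual values, a sensitivity the median does not share.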
Mean of grouped data: We have seen how to obtain the mean from ungrouped data. In Unit-6, we learnt the preparation of frequency distributions (grouped data). Let us consider what modifications are required for calculating the mean of grouped data.
When we have grouped data, either discrete or continuous, the expression for the mean becomes:

x̄ = Σfx / N

where Σfx = Σ(f × x) and N = Σf (the sum of the frequencies).
Let us consider an illustration to understand the application of the formula.
Illustration 4
Now, to compute the mean wage, multiply each variable with its corresponding
frequency (f × x) and obtain the total (Σfx).
Divide this total by the number of observations (Σf or N). Practically, we compute the mean as follows:

x̄ = Σfx / N = 1024 / 35 = 29.26
Illustration-5
The following table gives the daily wages for 70 labourers on a particular day.
Daily Wages (Rs):    15-20  20-25  25-30  30-35  35-40  40-45  45-50
No. of labourers:       2     23     19     14      5      4      3
Solution: For obtaining the estimated value of mean we have to follow the
procedure as explained above. This is elaborated below.
X̄ = Σfx / N = 2030 / 70 = Rs. 29

Hence, the mean daily wage is Rs. 29.
Alternatively, taking the assumed mean A as 32.5:

X̄ = A + (Σfd / N) × i = 32.5 + (−49 / 70) × 5 = 29
Hence mean daily wage is Rs. 29, as obtained earlier.
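Both the direct (midpoint) method and the assumed-mean short-cut of Illustration 5 can be verified as follows:

```python
# Frequency distribution from Illustration 5 (daily wages of 70 labourers).
lower = [15, 20, 25, 30, 35, 40, 45]
freq = [2, 23, 19, 14, 5, 4, 3]
width = 5

mid = [l + width / 2 for l in lower]                  # class midpoints 17.5, 22.5, ...
N = sum(freq)                                         # 70

# Direct method: mean = sum(f*x) / N.
direct = sum(f * x for f, x in zip(freq, mid)) / N    # 2030 / 70

# Assumed-mean (short-cut) method with A = 32.5 and step deviations d = (x - A)/i.
A = 32.5
d = [(x - A) / width for x in mid]                    # -3, -2, -1, 0, 1, 2, 3
shortcut = A + sum(f * di for f, di in zip(freq, d)) / N * width
```

Both routes give the same mean of Rs. 29, confirming that the short-cut is only a change of origin and scale.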
An important property of the arithmetic mean is that the means of several sets of data may be combined into a single mean for the combined sets of data. The combined mean may be defined as:

X̄₁₂…ₙ = (N₁X̄₁ + N₂X̄₂ + … + NₙX̄ₙ) / (N₁ + N₂ + … + Nₙ)

If we have to combine the means of four sets of data, then the above formula can be generalized as:

X̄₁₂₃₄ = (N₁X̄₁ + N₂X̄₂ + N₃X̄₃ + N₄X̄₄) / (N₁ + N₂ + N₃ + N₄)
Weighted Mean
The arithmetic mean, as discussed above, gives equal importance (weight) to all
the observations. But in some cases, all observations do not have the same
weightage. In such a case, we must compute weighted mean. The term
‘weight’, in statistical sense, stands for the relative importance of the different
variables. It can be defined as:

x̄w = ΣWx / ΣW

where x̄w is the weighted mean and W are the weights assigned to the variables (x).
Weighted mean is extensively used in index numbers; it will be discussed in detail in Unit 12: Index Numbers, of this course. For example, to compute the
cost of living index, we need the price index of different items and their
weightages (percentage of consumption). The important issue that arises is the
selection of weightages. If actual weightages are not available then estimated or
arbitrary weightages may be used. This is better than no weightages at all.
However, keeping the phenomena in mind, the weightages are to be assigned
logically. To understand this concept, let us take an illustration.
Illustration 6
Given below are Price index numbers and weightages for different group of
items of consumption for an average industrial worker. Compute the cost of
living index.
Solution: The cost of living index is obtained by taking the weighted average
as explained in the table below:
Group Item        Group Price Index (Pi)   Weight (Wi)   Wi × Pi
Food                     150                    55          8250
Clothing                 186                    15          2790
House rent               125                    17          2125
Fuel and Light           137                     8          1096
Others                   184                     5           920
Total                                      ΣW = 100    ΣWiPi = 15181
x̄w = ΣWiPi / ΣW = 15181 / 100 = 151.81

Therefore, the cost of living index is 151.81.
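The cost-of-living computation of Illustration 6 can be reproduced in a few lines:

```python
# Group price indices and weights from Illustration 6.
price_index = [150, 186, 125, 137, 184]   # Food, Clothing, House rent, Fuel & Light, Others
weight = [55, 15, 17, 8, 5]               # percentage-of-consumption weights

# Weighted mean = sum(W*P) / sum(W).
cost_of_living = sum(w * p for w, p in zip(weight, price_index)) / sum(weight)
```

Because food carries the largest weight (55), its index dominates the result, which is exactly the behaviour a weighted mean is meant to capture.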
Self Assessment Exercise C
1) A student's marks in a Computer course are 69, 75 and 80 respectively in
the papers on Theory, Practical and Project Work.
What are the mean marks if the weights are 1, 2 and 4 respectively?
What would be the mean marks if all the papers have equal importance?
Use the following table
Since the class width is 150 for all the classes, the method of assumed
mean is useful. The following table may be helpful.
8.3.3 Median
The median is another measure of central tendency. The median represents the middle value of the data; it measures the central item in the data. Half of the items lie above the median, and the other half lie below it.
Median of Ungrouped Data: To find the median of ungrouped data, first array the data either in ascending or in descending order. If the number of observations (N) is odd, the median is the middle value. If it is even, the median is the mean of the middle two values. In formal language, the median is the ((N + 1)/2)th item in a data array, where N is the number of items. Let us consider the earlier Illustration 3 to locate the median value in two different sets of data.
Illustration-7
On arranging the daily wage data of the labourers (as given in Illustration 3) in ascending order, we get:

Rs. 22, 23, 24, 25, 29, 30, 58

There are seven observations, so the median is the ((7 + 1)/2)th = 4th item, i.e., the median wage is Rs. 25. Had there been one more observation, say Rs. 6, the order would have been as below:

Rs. 6, 22, 23, 24, 25, 29, 30, 58

There are eight observations, and the median is given by the mean of the fourth and the fifth observations (i.e., the ((8 + 1)/2)th item = 4.5th item). So, median wage = (24 + 25)/2 = Rs. 24.5.
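Both cases of Illustration-7 can be checked with a small helper function:

```python
def median(values):
    """Median of ungrouped data: the middle item of the sorted array,
    or the mean of the two middle items when N is even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2:                            # odd N: single middle item
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2   # even N: mean of the two middle items

wages = [22, 25, 29, 58, 30, 24, 23]   # Illustration 3 data
```

Unlike the mean, the median of the original seven wages (Rs. 25) is unaffected by the extreme value 58.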
Median of Grouped Data: Now, let us calculate the median from grouped
data. When the data is in the form of discrete series, the median can be
computed by examining the cumulative frequency distribution, as is shown below.
To compute the median wage from the data given in Illustration 4, we add one more row of cumulative frequency (the formation of cumulative frequency was discussed in Unit 6 of this course: Processing of Data).
According to the formula, the median is the ((N + 1)/2)th item; here the number of observations is 35. Therefore, the (35 + 1)/2 th item is the 18th item. Hence the 18th observation will be the median. By inspection it is clear that the median wage is Rs. 29.
Thus the median is: Median = L + ((N/2 − cf)/f) × i = 25 + ((35 − 25)/19) × 5 = Rs. 27.63.
It is to be noted that the median value may also be located with the help of
graph by drawing ogives or a less than cumulative frequency curve. This
method was discussed in detail in Unit 7 : Diagrammatic and Graphic
Presentation, of this block.
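The interpolation formula for the grouped median can be sketched as follows (the function name is ours; the numeric values are those quoted in the computation above):

```python
def grouped_median(L, half_n, cf, f, i):
    # Median = L + (N/2 - cf) / f * i for a continuous distribution, where
    # L is the lower limit of the median class, cf the cumulative frequency
    # of the preceding class, f the class frequency, and i the class width.
    return L + (half_n - cf) / f * i

# With L = 25, N/2 = 35, cf = 25, f = 19 and i = 5:
grouped_median(25, 35, 25, 19, 5)   # about 27.63
```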
..................................................................................................................
..................................................................................................................
8.3.4 Mode
The mode is also a measure of central tendency. This measure differs from the arithmetic mean and, to some extent, resembles the median because it is not calculated by the ordinary processes of arithmetic. The mode of the data is the value that appears the maximum number of times. For example, in ungrouped data, the foot sizes (in inches) of ten persons are as follows: 5, 8, 6, 9, 11, 10, 9, 8, 10, 9. Here the number 9 appears thrice. Therefore, the modal foot size is 9 inches. In grouped data, the method of calculating the mode differs between discrete and continuous distributions.
In discrete data, for example, consider the earlier Illustration 6: the modal wage is Rs. 29, as it is the wage for the maximum number of days, i.e., six days. For continuous data, we usually refer to the modal class or group as the class with the maximum frequency (as per the observation approach). Therefore, the mode of a continuous distribution may be computed using the expression:

Mode = L + (Δ1 / (Δ1 + Δ2)) × i
where, L = lower limit of the modal class, i = width of the modal class, Δ1 = excess of the frequency of the modal class (f1) over the frequency of the preceding class (f0), and Δ2 = excess of the frequency of the modal class (f1) over the frequency of the succeeding class (f2). The letter Δ is read as delta.
It is to be noted that while using the formula for mode, you must arrange the
class intervals uniformly throughout, otherwise you will get misleading results.
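The mode formula for a continuous distribution can be sketched as follows (the function name and the numbers are ours, chosen only to illustrate the formula, not taken from the text):

```python
def grouped_mode(L, f0, f1, f2, i):
    # Mode = L + d1 / (d1 + d2) * i, where d1 = f1 - f0 is the excess over
    # the preceding class and d2 = f1 - f2 the excess over the succeeding class.
    d1 = f1 - f0
    d2 = f1 - f2
    return L + d1 / (d1 + d2) * i

# Illustrative numbers: modal class 20-25 with f1 = 23, preceding
# frequency f0 = 18, succeeding frequency f2 = 19, class width i = 5.
grouped_mode(20, 18, 23, 19, 5)   # 20 + 5/(5 + 4) * 5, about 22.78
```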
To illustrate the computation of mode, let us consider the grouped data of
earlier illustration 7.
Illustration 10

Class     Frequency        Class     Frequency
20-25     23               40-45     4
25-30     19               45-50     3
30-35     14
The mode may not be unique all the time. There may be more than one mode or no
mode (no value that occurs more than once) at all. In such a case it is difficult
to interpret and compare the distributions. It is not amenable to arithmetic and
algebraic manipulations. For example, we cannot get the mode of the combined
data set from the modes of the constituent data sets.
L = , f0 = , f1 = , f2 = and i= .
For a moderately skewed distribution, it has been empirically observed that the difference between the mean and the mode is approximately three times the difference between the mean and the median. This was illustrated in Fig. 8.1 (b) and (c). The expression is:

Mean − Mode = 3 (Mean − Median)

Sometimes this expression is used to calculate the value of one measure when the values of the other two measures are known.
From the above discussion, it is clear that for nominal data only mode can be
used, for ordinal data both mode and median can be used whereas for ratio and
interval levels of data all three measures can be calculated.
[Figure 8.1: (a) a symmetrical distribution, in which Mode = Median = Mean (x̄); (b) and (c) skewed distributions, in which the mode, median and mean occupy different positions.]
Stability: Quite often a researcher studies a sample to infer about the entire population. The mean is generally more stable than the median or the mode. If we calculate the means, medians and modes of different samples from a population, the means will generally be more in agreement than the medians or the modes. Thus,
mean is a more reliable measure of central tendency. Normally, the choice of a
suitable measure of central tendency depends on the common practice in a
particular industry. According to its requirement, each case must be judged
independently. For example, the Mean sales of different products may be useful
for many business decisions. The median price of a product is more useful to
middle class families buying a new product. The mode may be a more useful
measure for the garment industry to know the modal height of the population to
decide the quantity of garments to be produced for different sizes.
Hence, the choice of the measure of central tendency depends on (1) type of
data (2) shape of the distribution (3) purpose of the study. Whenever possible,
all the three measures can be computed. This will indicate the type of
distribution.
Geometric Mean (GM): The geometric mean is defined as the Nth root of the product of all the N observations. It may be expressed as:

G.M. = N√(product of all N values)

Thus, the geometric mean of four numbers is the fourth root of their product.

Harmonic Mean (HM): The harmonic mean is the number of observations divided by the sum of the reciprocals of the observations:

H.M. = N / Σ(1/x)

For example, the harmonic mean of 4 and 6 is 2 / (1/4 + 1/6) = 2 / (5/12) = 4.8. Suppose a car moves half the distance at the speed of 60 km/hr and the other half at the speed of 80 km/hr. Then the average speed of the car is 68.57 km/hr, which is the harmonic mean of 60 and 80. The harmonic mean is useful in averaging rates.
For any set of positive data for which these means can be computed, the following inequality holds:

x̄ ≥ G.M. ≥ H.M.

with equality only when all the observations are equal.
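The three means and the inequality between them can be sketched as follows (the function names are ours):

```python
import math

def arithmetic_mean(data):
    return sum(data) / len(data)

def geometric_mean(data):
    # N-th root of the product of all N observations
    return math.prod(data) ** (1 / len(data))

def harmonic_mean(data):
    # N divided by the sum of the reciprocals of the observations
    return len(data) / sum(1 / x for x in data)

harmonic_mean([4, 6])     # 2 / (1/4 + 1/6) = 4.8
harmonic_mean([60, 80])   # average speed over equal half-distances, about 68.57

# For positive, not-all-equal data: AM >= GM >= HM
speeds = [60, 80]
assert arithmetic_mean(speeds) >= geometric_mean(speeds) >= harmonic_mean(speeds)
```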
Rate : Amount of one variable per unit amount of some other variable.
Ratio : Relative value of one value with respect to another value.
Weighted Mean : An average in which each observation value is weighted by
some index of its importance.
The interpretation is obvious: out of all the workers in India, 13.46 per cent are in Uttar Pradesh and 10.45 per cent in Maharashtra. Andhra Pradesh has the highest share of agricultural labourers (12.86%), followed by Uttar Pradesh (12.66%) and Bihar (12.59%). The lowest share of household industry workers is in Himachal Pradesh, etc.
3) x̄123 = (N1x̄1 + N2x̄2 + N3x̄3) / (N1 + N2 + N3)

x̄123 = 116.43

D: Median = L + ((N/2 − c.f.)/f) × c ; Me = 359.76

E: Mode = L + (Δ1/(Δ1 + Δ2)) × i ; the modal sales value is Rs. 363.64 thousand.
5) The monthly salaries (in Rupees) of 11 staff members of an office are:
2000, 2500, 2100, 2400, 10000, 2100, 2300, 2450, 2600, 2550 and 2700.
Find mean, median and mode of the monthly salaries.
Which one among the above do you consider the best measure of central
tendency for the above data set and why?
6) Consider the data set given in problem 2 above.
Find mean deviation of the data set from (i) median (ii) 2400 and (iii) 2500.
Find mean squared deviation of the data set from (i) mean (ii) 3000 and (iii)
3100.
7) Mean examination marks in Mathematics in three sections are 68, 75 and 72, the
number of students being 32, 43 and 45 respectively in these sections. Find the
mean examination marks in Mathematics for all the three sections taken
together.
8) The followings are the volume of sales (in Rupees) achieved in a month by 25
marketing trainees of a firm:
1220 1280 1700 1400 400 350 1200 1550 1300 1400
1450 300 1800 200 1150 1225 1300 1100 450 1200
1800 475 1200 600 1200
The firm has decided to give the trainees some performance bonus as per the following rule: Rs. 100 if the volume of sales is below Rs. 500; Rs. 250 if the volume of sales is between Rs. 500 and Rs. 1,000; Rs. 400 if the volume of sales is between Rs. 1,000 and Rs. 1,500; and Rs. 600 if the volume of sales is above Rs. 1,500.
Find the average value of performance bonus of the trainees.
9) In an urban cooperative bank, the minimum deposit in a savings bank is Rs. 500.
The deposit balance at the end of a working day is given in the table below :
Table: Average Deposit Balance in ABC Urban Cooperative Bank
10) Refer to the table given in the previous problem. Compute (a) median and (b) mode by the graphical approach.
11) Refer to problem 8. Compute the approximate value of the mode using the relationship Mean − Mode = 3 (Mean − Median), and compare it with the value computed earlier.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
MEASURES OF VARIATION AND SKEWNESS
STRUCTURE
9.0 Objectives
9.1 Introduction
9.2 Variation – Why is it Important?
9.3 Significance of Variation
9.4 Measures of Variation
9.4.1 Range
9.4.2 Quartile Deviation
9.4.3 Mean Deviation
9.4.4 Standard Deviation
9.4.5 Coefficient of Variation
9.5 Skewness
9.6 Relative Skewness
9.7 Let Us Sum Up
9.8 Key Words
9.9 Answers to Self Assessment Exercises
9.10 Terminal Questions/Exercises
9.11 Further Reading
9.0 OBJECTIVES
After studying this Unit, you should be able to:
l describe the concept and significance of measuring variability for data analysis,
l compute various measures of variation and apply them in analysing data,
l choose an appropriate measure of variation under different situations,
l describe the importance of Skewness in data analysis,
l explain and differentiate the symmetrical, positively skewed and negatively
skewed data, and
l ascertain the value of the coefficient of skewness and comment on the nature
of distribution.
9.1 INTRODUCTION
In Unit 8, we have learnt about the measures of central tendency. They give us
only a single figure that represents the entire data. However, central tendency
alone cannot adequately describe a set of data, unless all the values of the
variables in the collected data are the same. Obviously, no average can
sufficiently analyse the data, if all the values in a distribution are widely spread.
Hence, the measures of central tendency must be supported and supplemented
with other measures for analysing the data more meaningfully. Generally, there
are three other characteristics of data which provide useful information for data
analysis, i.e., Variation, Skewness, and Kurtosis. The third characteristic, Kurtosis, is not within the scope of this course. In this unit, therefore, we shall
discuss the importance of measuring variation and skewness for describing
distribution of data and their computation. We shall also discuss the role of
normal curves in characterizing the data.
9.2 VARIATION – WHY IS IT IMPORTANT?
Measures of variation are statistics that indicate the degree to which numerical data tend to spread about an average value. Variation is also called dispersion, scatter, spread, etc. It is related to the homogeneity of the data. In the simple words of Simpson and Kafka, “the measurement of the scatterness of the mass of figures (data) in a series about an average is called measure of variation”. Therefore, we can say that variation measures the extent to which the items scatter from the average. To be more specific, an average is more meaningful when the data are examined in the light of variation. In fact, in the absence of a measure of dispersion, it is not possible to say which of two or more sets of data is represented more closely and adequately by its arithmetic mean value.
Here, the following illustration helps you to understand the necessity of
measuring variability of data for effective analysis.
Illustration-1
The data given below relates to the marks secured by three students (A, B and
C) in different subjects
Subjects Marks
A B C
Research methodology 50 50 10
Accounting for Managers 50 70 100
Financial Management 50 40 80
Marketing Management 50 40 30
Managerial Economics 50 50 30
Total 250 250 250
Mean ( x ) 50 50 50
In the above illustration, you may notice that the marks of the three students have the same mean, i.e., the average marks of A, B and C are all 50, and from this alone we might conclude that the three distributions are similar. But you should note that, on observing the distributions subject-wise, there is a wide difference in the marks of these three students. In the case of 'A' the marks in each subject are 50; hence we can say that each and every item of the data is perfectly represented by the mean, or, in other words, there is no variation. In the case of 'B' there is slight variation as compared to 'C', whereas in the case of 'C' not a single item is perfectly represented by the mean and the items vary widely from one another. Thus, different sets of data may have the same value of the average, but may differ greatly in terms of the spread or scatter of items. The study of variability, therefore, is necessary to gauge the degree of scatter of the items from the average in the collected data.
A family intends to cross a lake. They come to know that the average depth of the lake is 4 feet. The average height of the family members is 5.5 feet. So they decide that the lake can be crossed safely. While crossing the lake, at a particular place where the water is more than 6.5 feet deep, all the members of the family get drowned. The reason is that they relied on the average depth of the lake and their average height, but ignored the variability of the lake's depth and of their own heights. In the light of this example, we may understand the reason for measuring the variability of a given data set.
Keeping in view the above purposes, the variation of data must be taken into
account while taking business decisions.
9.4.1 Range
The range is the simplest measure of variation. It is defined simply as the difference between the highest value and the lowest value of observation in a set of data. In equation form, the absolute measure of range for both ungrouped and grouped data is:

Range = H − L

where H is the highest value and L is the lowest value. In grouped data, the absolute range is the difference between the upper limit of the highest class and the lower limit of the lowest class. The relative measure of range for ungrouped and grouped data, called the coefficient of range, is:

Coefficient of Range = (H − L) / (H + L)
Illustration-2
The following data relates to the total fares collected on Monday from three
different transport agencies.
The interpretation of the above result is simple. In the above illustration, the variation is nil in the case of the taxi fares of agency ‘B’, while the variation is small in agency ‘A’ and high in transport agency ‘C’. The coefficients of range for transport agencies ‘A’ and ‘C’ are as follows:
Its usefulness as a measure of variation is limited. Since it considers only the highest and lowest values of the data, it is greatly affected by extreme values. Therefore, the range is likely to change drastically from sample to sample. The range cannot be computed in the case of an open-ended distribution.
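The two measures of range can be sketched as follows (the function names are ours; the fare figures are illustrative, since the text's fare table is not reproduced here):

```python
def absolute_range(data):
    # Absolute range: highest value minus lowest value
    return max(data) - min(data)

def coefficient_of_range(data):
    # Relative measure: (H - L) / (H + L)
    return (max(data) - min(data)) / (max(data) + min(data))

fares = [10, 25, 40]          # illustrative fares, not from the text
absolute_range(fares)         # 30
coefficient_of_range(fares)   # 30 / 50 = 0.6
```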
The following data relates to the record of time (in minutes) of trucks waiting
to unload material.
Calculate the absolute and relative range and comment on whether you think it
is a useful measure of variation.
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
9.4.2 Quartile Deviation
The quartile deviation (Q.D.) is half of the difference between the upper and lower quartiles:

Q.D. = (Q3 − Q1) / 2
The relative measure of Q.D., called the coefficient of quartile deviation, is calculated as:

Coefficient of Q.D. = (Q3 − Q1) / (Q3 + Q1)
It is to be noted that the above formulae (absolute and relative) are applicable
to ungrouped data and grouped data as well. Let us take up an illustration to
ascertain the value of quartile deviation and coefficient of Q.D.
Illustration-3
The following data relates to the daily expenditure of the students in Delhi
University. Calculate quartile deviation and its co-efficient.
Daily expenditure : 50-100  100-150  150-200  200-250  250-300  300-350  350-400  400-450  450-500
No. of Students   : 18      14       21       15       12       13       8        5        2
Q1 is the (N/4)th observation, i.e., the (108/4)th = 27th observation, which lies in the 100-150 class. Thus Q1 = L1 + ((N/4 − c.f.)/f) × i, where ‘L1’ is the lower limit of the Q1 class, c.f. is the cumulative frequency of the class preceding the Q1 class, ‘f’ is the frequency of the Q1 class, and ‘i’ is the class interval. Now we substitute these values to obtain Q1:

Q1 = 100 + ((27 − 18)/14) × 50 = Rs. 132.14
Q3 is the 3(N/4)th observation, i.e., the 3(108/4)th = 81st observation. This observation lies within the cumulative frequency 93. So Q3 lies in the 300-350 class.
Q3 = L1 + ((3N/4 − c.f.)/f) × i

Here, as explained above, L1, c.f., f and i relate to the Q3 class.

Therefore, Q3 = 300 + ((81 − 80)/13) × 50 = Rs. 303.85

Q.D. = (Q3 − Q1)/2 = (303.85 − 132.14)/2 = Rs. 85.85

The relative measure of Q.D., i.e., Coefficient of Q.D. = (Q3 − Q1)/(Q3 + Q1) = (303.85 − 132.14)/(303.85 + 132.14) = 0.39
From the above data it may be concluded that the variation of daily expenditure among the sample students of DU is Rs. 85.85. The coefficient of Q.D. is 0.39; this relative value of variation may be compared with other variables on which the expenditure depends, such as the family income of the students, pocket money, spending habits, etc.
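The whole quartile computation for Illustration 3 can be sketched as follows (the function name is ours):

```python
def quartile(classes, freqs, q):
    # q-th quartile (q = 1 or 3) of a continuous frequency distribution:
    # locate the class holding the (qN/4)-th observation, then interpolate
    # with L + (qN/4 - c.f.) / f * i.
    n = sum(freqs)
    target = q * n / 4
    cf = 0
    for (low, high), f in zip(classes, freqs):
        if cf + f >= target:
            return low + (target - cf) / f * (high - low)
        cf += f

# Daily-expenditure data from Illustration 3:
classes = [(50, 100), (100, 150), (150, 200), (200, 250), (250, 300),
           (300, 350), (350, 400), (400, 450), (450, 500)]
freqs = [18, 14, 21, 15, 12, 13, 8, 5, 2]

q1 = quartile(classes, freqs, 1)   # about 132.14
q3 = quartile(classes, freqs, 3)   # about 303.85
qd = (q3 - q1) / 2                 # about 85.85
coeff_qd = (q3 - q1) / (q3 + q1)   # about 0.39
```

The function reproduces the hand computation: Q1 falls in the 100-150 class and Q3 in the 300-350 class.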
Compute the quartile deviation and its coefficient. Do you think this is an appropriate measure of variability? Give reasons for your opinion.
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
Coefficient of M.D. = M.D. / (the average used, x̄ or Me)
As an illustration, let us consider the following data, which relates to the sales
of Company A and Company B during 1995-2001.
Illustration-4
Compute the mean deviation and its co-efficient of the sales of two companies
A and B and comment on the result.
Mean sales of Company ‘A’ = ΣX/N = 3829/7 = Rs. 547 thousand

Mean sales of Company ‘B’ = ΣX/N = 34258/7 = Rs. 4894 thousand

Formula for mean deviation from the mean: M.D. = Σ|x − x̄| / N

M.D. of Company ‘A’ = 1294/7 = Rs. 184.9 thousand

M.D. of Company ‘B’ = 8456/7 = Rs. 1208 thousand
Coefficient of M.D.:

Company ‘A’ = M.D. / Mean = 184.9/547 = 0.34

Company ‘B’ = M.D. / Mean = 1208/4894 = 0.25
The coefficient of mean deviation of company ‘A’ sales is more than the
company 'B' sales. Hence we can conclude there is greater variability in the
sales of company ‘A’.
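The mean deviation and its coefficient can be sketched as follows (the function names are ours; the sales data are illustrative, since the company-wise tables are not reproduced here):

```python
def mean_deviation(data, centre=None):
    # Mean absolute deviation about the mean (or any supplied centre)
    if centre is None:
        centre = sum(data) / len(data)
    return sum(abs(x - centre) for x in data) / len(data)

def coefficient_of_md(data):
    # Coefficient of M.D. = M.D. / mean, a unit-free relative measure
    mean = sum(data) / len(data)
    return mean_deviation(data, mean) / mean

sales = [2, 4, 6, 8]       # illustrative figures, not from the text
mean_deviation(sales)      # mean 5, deviations 3, 1, 1, 3 -> M.D. = 2.0
coefficient_of_md(sales)   # 2.0 / 5 = 0.4
```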
The drawback of this method, it may be observed, is that the algebraic signs (+ or −) of the deviations are ignored. From the mathematical point of view this is unjustifiable, and it makes the measure unsuitable for further algebraic treatment. That is the reason the mean deviation is not frequently used in business research. The accuracy of the mean deviation depends upon how well the chosen average represents the data. Despite these drawbacks, this measure is most useful in cases where: i) the samples are small and no elaborate analysis is required; ii) reports are presented to a general public not familiar with statistical methods; and iii) it has some specific utility in the area of inventory control.
..........................................................................................................................
..........................................................................................................................
9.4.4 Standard Deviation
The standard deviation is the most familiar, important and widely used measure of variation. It is a significant measure for comparing the variability of two or more sets of data in terms of their distances from the mean. In practice, the mean deviation has been replaced by the standard deviation. As discussed earlier, while calculating the mean deviation the algebraic signs (+/−) are ignored, and it can be computed from any of the averages. In the computation of the standard deviation, however, the signs are taken into account: the deviations of the items, always taken from the mean, are squared (instead of having their signs ignored) and averaged, and finally the square root of this value is extracted. Thus, the standard deviation may be defined as “the square root of the arithmetic mean of the squares of deviations from the arithmetic mean of a given distribution”. This measure is also known as the root mean square deviation. If the values in a given data set are dispersed more widely from the mean, then the standard deviation becomes greater. It is usually denoted by σ (read as sigma). The square of the standard deviation (σ²) is called the “variance”.
σ = √( Σ(x − x̄)² / N ), or in simple form, σ = √( Σx² / N )
where Σx² = the sum of the squares of the deviations (x − x̄), and N = the number of observations.
If the collected data set is very large, then using an assumed mean is more convenient for computing the standard deviation. In such a case, the formula is slightly modified as:
σ = √( Σf·dx²/N − (Σf·dx/N)² ) × C

where x_A = assumed mean, dx = (X − x_A)/C, and C = the common factor.
The above formula is applicable only when the class intervals are equal.
Illustration-5

Profit (Rs. in lakh)    f     M.V. (x)    dx = (x − x_A)/C    fdx     fdx²
6-10                    9     8           −2                  −18     36
10-14                   11    12          −1                  −11     11
14-18                   20    16          0                   0       0
18-22                   16    20          1                   16      16
22-26                   9     24          2                   18      36
26-30                   5     28          3                   15      45
                        N = 70                                Σfdx = 20   Σfdx² = 144

In the above computation, we have taken the mid-value 16 as the assumed mean (x_A); the common factor C is 4.

σ = √( Σfdx²/N − (Σfdx/N)² ) × C

σ = √( 144/70 − (20/70)² ) × 4 = √(2.0571 − 0.0816) × 4 = √1.9755 × 4 = 1.4055 × 4 = 5.62
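The step-deviation (assumed-mean) method can be sketched as follows (the function name is ours; the frequencies and mid-values are those of Illustration 5):

```python
import math

def step_deviation_sd(mid_values, freqs, assumed_mean, common_factor):
    # sigma = sqrt( sum(f*dx^2)/N - (sum(f*dx)/N)^2 ) * C,
    # where dx = (x - assumed mean) / C
    n = sum(freqs)
    dx = [(x - assumed_mean) / common_factor for x in mid_values]
    sum_fdx = sum(f * d for f, d in zip(freqs, dx))
    sum_fdx2 = sum(f * d * d for f, d in zip(freqs, dx))
    return math.sqrt(sum_fdx2 / n - (sum_fdx / n) ** 2) * common_factor

mid = [8, 12, 16, 20, 24, 28]   # mid-values of the profit classes
f = [9, 11, 20, 16, 9, 5]       # N = 70
sigma = step_deviation_sd(mid, f, assumed_mean=16, common_factor=4)  # about 5.62
```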
Among all the measures of variation, the standard deviation is the only one possessing the mathematical properties necessary for advanced statistical work. It is least affected by sampling fluctuations. In a normal distribution, x̄ ± σ covers 68% of the values, whereas x̄ ± Q.D. covers 50% of the values and x̄ ± M.D. covers 57% of the values. This is the reason the standard deviation is called a “standard measure”.
State    Mean production    S.D.
I        83                 9.93
II       40                 5.24
III      70                 8.12
IV       59                 10.89
You may notice that the mean production of Paddy in four states is not equal.
In such a situation, to determine which state is more consistent in terms of
production, we shall compute the coefficient of variation.
C.V. = (σ / x̄) × 100
C.V. of State I = (9.93/83) × 100 = 11.96%;  C.V. of State II = (5.24/40) × 100 = 13.10%
C.V. of State III = (8.12/70) × 100 = 11.60%;  C.V. of State IV = (10.89/59) × 100 = 18.46%
It can be seen that the standard deviation is lowest in State II compared with the other states. However, since the C.V. is lowest in State III, it is the most consistent state in the production of paddy. Among the four states, State IV is the most inconsistent in the production of paddy.
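The comparison above can be sketched as follows (the function name is ours; the means and standard deviations are those of the four states):

```python
def coefficient_of_variation(mean, sd):
    # C.V. = (sigma / mean) * 100, a unit-free percentage measure
    return sd / mean * 100

# (mean production, standard deviation) for the four states:
states = {"I": (83, 9.93), "II": (40, 5.24), "III": (70, 8.12), "IV": (59, 10.89)}
cv = {s: coefficient_of_variation(m, sd) for s, (m, sd) in states.items()}
# State III has the lowest C.V. (about 11.60%), hence the most consistent
# production; State IV has the highest (about 18.46%).
```

Note that State II has the smallest standard deviation but not the smallest C.V., which is exactly why the relative measure is needed when the means differ.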
A prospective buyer tested the bursting pressures of a sample of 120 carry bags received from each of manufacturers A and B. The results are tabulated below:

No. of bags of A : 3   14   30   56   12   5
No. of bags of B : 8   16   23   34   24   15
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
9.5 SKEWNESS
The measure of skewness tells us the direction of dispersion about the centre
of the distribution. Measures of central tendency indicate only the single
representative figure of the distribution while measures of variation, indicate only
the spread of the individual values around the means. They do not give any
idea of the direction of spread. Two distributions may have the same mean and
variation but may differ widely in the shape of their distribution. A distribution is
often found skewed on either side of its average, which is termed as
asymmetrical distribution. Thus, skewness refers to the lack of symmetry in
distribution. Symmetry signifies that the values of the variable are equidistant from the average on both sides. In other words, a balanced pattern of distribution is called a symmetrical distribution, whereas an unbalanced pattern of distribution is called an asymmetrical distribution.
Fig.9.1
Carefully observe the figures presented above and try to understand the
following rules governing them.
It is clear from Figure 9.1 (a) that the data are symmetrical when the spread
of the frequencies is the same on both sides of the middle point of the
frequency polygon. In this case the value of mean, median, and mode coincide
i.e., Mean = Median = Mode.
In Figure (b), when there is a longer tail towards the right hand side of the centre, the skewness is said to be ‘Positively Skewed’. In such a case, Mean > Median > Mode.
In Figure (c), when there is a longer tail towards the left hand side of the centre, the skewness is said to be ‘Negatively Skewed’. In such a case, Mean < Median < Mode.
The relative measures of skewness indicate the magnitude as well as the direction of skewness in a distribution. The relationship of the mean, median and mode in measuring the degree of skewness is that, for a moderately asymmetrical distribution, the interval between the mean and the median is approximately one-third of the interval between the mean and the mode.
Tests of Skewness
In the light of the above discussion, we can summarise the following facts regarding the presence of skewness in a given distribution.
Karl Pearson’s coefficient of skewness is given by:

SKp = (x̄ − Mo) / σ
This method computes the coefficient of skewness by considering all the items of the data set. The value of this coefficient usually lies between the limits ±3.
If mode is ill-defined and cannot be easily located then using the approximate
empirical relationship between mean, median, and mode as stated in Unit-8
section 8.3.5, (mode = 3 median – 2 mean) the coefficient of skewness can be
determined by the removal of the mode and substituting median in its place.
Thus the changed formula is:
SKp = 3 (Mean − Median) / σ
Let us consider the following data to understand the application of Karl
Pearson’s formula for measuring the co-efficient of skewness.
The following measures are obtained from the profits of 100 shops in two different regions. Calculate Karl Pearson’s coefficient of skewness and comment on the results.

Coefficient of skewness for Region I = (16.62 − 18.47)/3.04 = −0.61

Coefficient of skewness for Region II = (45.36 − 36.94)/17.71 = 0.48
Based on the results, we can comment on the distributions of the two regions as follows. The coefficient of skewness for Region I is negative, while that of Region II is positive; since its magnitude is larger, the distribution of profits in Region I is the more skewed. The negative result for Region I indicates that the distribution is negatively skewed, so there is a greater concentration towards higher profits. In the case of Region II, the positive coefficient of skewness indicates that the distribution is positively skewed, so there is a greater concentration towards lower profits.
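Both forms of Karl Pearson's coefficient can be sketched as follows (the function name is ours; the means, modes and standard deviations are those of the two regions):

```python
def pearson_skewness(mean, sd, mode=None, median=None):
    # Karl Pearson's coefficient: (mean - mode) / sd; when the mode is
    # ill-defined, fall back to 3 * (mean - median) / sd.
    if mode is not None:
        return (mean - mode) / sd
    return 3 * (mean - median) / sd

pearson_skewness(16.62, 3.04, mode=18.47)    # Region I: about -0.61
pearson_skewness(45.36, 17.71, mode=36.94)   # Region II: about 0.48
```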
Illustration-7
The following statistical measures are given from the data of a factory before and after the settlement of a wage dispute. Calculate Pearson’s coefficient of skewness and comment.
Karl Pearson’s coefficient of skewness (SKp) = 3 (Mean − Median) / σ
b) After settlement of the wage dispute: SKp = 3 (24 − 23) / 4.95 = 3 / 4.95 = 0.61
From the above calculated values of the coefficient of skewness under the different situations, we may comment upon the nature of the distribution as follows. Before the settlement of the dispute the distribution was negatively skewed, and hence there was a greater concentration of wages towards the higher wages; after the settlement it became positively skewed. The mean wage of workers increased after the settlement of the dispute (before settlement the total wages were 1,200 × 22.8 = Rs. 27,360; after settlement they were 1,175 × 24 = Rs. 28,200). The workers who were getting low wages obtained considerably increased wages after the settlement of their dispute, while the wages of the workers who were getting high wages before the settlement fell.
C.V. = (σ / x̄) × 100

a) Before settlement, the coefficient of variation = (5.9/22.8) × 100 = 25.88%

b) After settlement, the coefficient of variation = (4.95/24.0) × 100 = 20.62%
Based on the computed values of variation, it may be concluded that there is sufficient evidence of lesser inequality in the distribution of wages after the settlement of the dispute. In other words, there was greater scatter in the wage payments before the dispute was settled.
..........................................................................................................................
SKB = ((Q3 − Q2) − (Q2 − Q1)) / ((Q3 − Q2) + (Q2 − Q1))

Alternatively,

SKB = (Q3 + Q1 − 2 Median) / (Q3 − Q1)
Illustration-8
Q1 = 62, Q2 = 141, Q3 = 190

Bowley’s coefficient of skewness (SKB) = (Q3 + Q1 − 2 Median) / (Q3 − Q1) = (190 + 62 − 2 × 141) / (190 − 62) = −30/128 = −0.23
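Bowley's quartile measure can be sketched as follows (the function name is ours; the quartiles are those of Illustration 8):

```python
def bowley_skewness(q1, q2, q3):
    # Bowley's quartile coefficient: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)
    return (q3 + q1 - 2 * q2) / (q3 - q1)

bowley_skewness(62, 141, 190)   # (252 - 282) / 128, about -0.23: slightly negative
```

Because it uses only the quartiles, this measure remains computable for open-end distributions, where Pearson's formula cannot be applied.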
Since the distribution is slightly negatively skewed, there is a greater concentration of the sales towards the higher sales than the lower sales of the distribution.
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
Through skewness, we study the shape of the distribution, i.e., whether the distribution is symmetrical or asymmetrical. A symmetrical distribution means a frequency distribution that forms a balanced pattern on both sides of the mean, median, and mode.
In such a distribution the mean, median, and mode are equal and lie at the centre of the distribution. In contrast, an asymmetrical distribution shows an unbalanced pattern of frequencies and is called a 'skewed' distribution. A skewed distribution may be positively skewed or negatively skewed. In a positively skewed distribution the mean is greater than the median and mode (x̄ > Me > Mo) and the data curve has a long tail on the right-hand side. On the other hand, in a negatively skewed distribution the mode is greater than the median and mean (Mo > Me > x̄) and the curve has a long tail on the left-hand side. In a moderately skewed distribution, the interval between the mean and the median is approximately one-third of the interval between the mean and the mode, i.e., Mean − Mode = 3 (Mean − Median). Based on this relationship, the degree of skewness is measured. There are two formulae for measuring the coefficient of skewness, called relative measures of skewness, proposed by Karl Pearson and Bowley. Bowley's formula is normally applied when the data is of an open-end type and/or the classes are unequal.
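The two relative measures can be made concrete with a short sketch. The numeric values below are illustrative assumptions, not data from this unit:

```python
def pearson_skewness(mean, median, std_dev):
    """Karl Pearson's coefficient of skewness, using the empirical
    relation Mean - Mode = 3*(Mean - Median):
    Sk = 3*(Mean - Median) / sigma."""
    return 3 * (mean - median) / std_dev

# Hypothetical distribution: mean 50, median 48, standard deviation 12
print(pearson_skewness(50, 48, 12))  # 0.5 -> positively skewed
```

Pearson's measure needs the mean and standard deviation, so Bowley's quartile-based formula is preferred when the distribution has open-end or unequal classes, as the unit notes.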
Range : It is the difference between the highest value and the lowest value of
observations.
The median earnings of the 1,600 families is Rs. 1,381. It reveals that 50% of the families are earning between Rs. 1,000 and Rs. 2,000. It is to be noted that very few (44 families out of the 1,600) fall in the last three classes of higher-earning groups.
In fact, this measure of variability gives us the best results when deviations are taken from the median, but the median is not a satisfactory measure when the dispersion in a distribution is very high. It is also not appropriate for large samples.
Since the mean bursting pressure of manufacturer B's bags is higher, these bags may be regarded as more standard. However, the bags of manufacturer A may be suggested for purchase, as they are more consistent: their coefficient of variation (CV) is significantly lower than that of manufacturer B's bags.
In case the buyer would not like to buy bags having more than 16 kg bursting pressure, then the average bursting pressure of manufacturer A's bags is higher than that of manufacturer B. The coefficient of variation is also much lower in the case of manufacturer A than in that of manufacturer B. Hence, in this case, we may suggest buying from manufacturer A.
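The comparison logic used above (higher mean versus lower coefficient of variation) can be sketched as follows. The bursting-pressure figures here are made-up placeholders, since the original data table is not reproduced in this excerpt:

```python
def coefficient_of_variation(mean, std_dev):
    """CV = (sigma / mean) * 100; being unit-free, it compares the
    consistency of series whose means differ."""
    return std_dev / mean * 100

# Hypothetical bursting pressures (kg) -- NOT the unit's actual data
cv_a = coefficient_of_variation(18.0, 1.2)   # manufacturer A
cv_b = coefficient_of_variation(20.0, 2.5)   # manufacturer B
# B has the higher mean, but A is more consistent (lower CV)
print(round(cv_a, 2), round(cv_b, 2))
```

The decision rule mirrors the text: pick the higher mean when a minimum performance matters, but the lower CV when consistency matters.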
8) A transport agency tested the tyres of two brands, A and B. The results are given in the table below.
Life (thousand units) Brand A Brand B
15-20 6 8
20-25 15 8
25-30 10 22
30-35 16 17
35-40 13 12
40-45 9 6
45-50 11 0
i) Which brand of tyres do you suggest to the transport agency to use on their fleet of trucks?
8) In a manufacturing firm, four employees on the same job show the following
results over a period of time.
                                     A    B    C    D
Mean time of completing
the job (minutes):                  61   70   83   80.5
Variance (σ²):                      64   81  121  100
10) The following table gives the number of defects per product and its frequency.
11) The following information was obtained from the records of a factory relating to the wages, before and after settlement of wages.
Regular M.Com 20 24 18 22 26 25 21 28 23 29
Distance M.Com 24 29 40 46 34 27 31 28 38 23
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 10 CORRELATION AND SIMPLE REGRESSION
STRUCTURE
10.0 Objectives
10.1 Introduction
10.2 Correlation
10.2.1 Scatter Diagram
10.3 The Correlation Coefficient
10.3.1 Karl Pearson’s Correlation Coefficient
10.3.2 Testing for the Significance of the Correlation Coefficient
10.3.3 Spearman’s Rank Correlation
10.4 Simple Linear Regression
10.5 Estimating the Linear Regression
10.5.1 Standard Error of Estimate
10.5.2 Coefficient of Determination
10.6 Difference Between Correlation and Regression
10.7 Let Us Sum Up
10.8 Key Words
10.9 Answers to Self Assessment Exercises
10.10 Terminal Questions/Exercises
10.11 Further Reading
Appendix Tables
10.0 OBJECTIVES
After studying this unit, you should be able to:
10.1 INTRODUCTION
In previous units, so far, we have discussed the statistical treatment of data
relating to one variable only. In many other situations researchers and decision-
makers need to consider the relationship between two or more variables. For
example, the sales manager of a company may observe that the sales are not
the same for each month. He/she also knows that the company’s advertising
expenditure varies from year to year. This manager would be interested in
knowing whether a relationship exists between sales and advertising
expenditure. If the manager could successfully define the relationship, he/she might use this result to do a better job of planning and to improve predictions of yearly sales, with the help of the regression technique, for his/her company.
Similarly, a researcher may be interested in studying the effect of research and
development expenditure on annual profits of a firm, the relationship that exists
between price index and purchasing power etc. The variables are said to be
closely related if a relationship exists between them.
This unit, therefore, introduces the concept of correlation and regression, some
statistical techniques of simple correlation and regression analysis. The methods
used are important to the researcher(s) and the decision-maker(s) who need to
determine the relationship between two variables for drawing conclusions and
decision-making.
10.2 CORRELATION
If two variables, say x and y vary or move together in the same or in the
opposite directions they are said to be correlated or associated. Thus,
correlation refers to the relationship between the variables. Generally, we find
the relationship in certain types of variables. For example, a relationship exists
between income and expenditure, absenteeism and production, advertisement
expenses and sales etc. Existence of the type of relationship may be different
from one set of variables to another set of variables. Let us discuss some of
the relationships with the help of Scatter Diagrams.
Figure 10.1: Scatter Diagrams: (a) Perfect Positive Correlation (r = 1), (b) Perfect Negative Correlation (r = –1), (c) Positive Correlation (r > 0), (d) Negative Correlation (r < 0), (e) Non-linear Correlation, (f) No Correlation (r = 0).
If X and Y variables move in the same direction (i.e., either both of them
increase or both decrease) the relationship between them is said to be positive
correlation [Fig. 10.1 (a) and (c)]. On the other hand, if X and Y variables
move in the opposite directions (i.e., if variable X increases and variable Y
decreases or vice-versa) the relationship between them is said to be negative
correlation [Fig. 10.1 (b) and (d)]. If Y is unaffected by any change in X
variable, then the relationship between them is said to be un-correlated [Fig.
10.1 (f)]. If the amount of variations in variable X bears a constant ratio to the
corresponding amount of variations in Y, then the relationship between them is
said to be linear-correlation [Fig. 10.1 (a) to (d)], otherwise it is non-linear
or curvilinear correlation [Fig. 10.1 (e)]. Since measuring non-linear
correlation for data analysis is far more complicated, we therefore, generally
make an assumption that the association between two variables is of the linear
type.
Table 10.1 : A Company’s Advertising Expenses and Sales Data (Rs. in crore)
Years : 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
Sales (Y) 60 55 50 40 35 30 20 15 11 10
The company’s sales manager claims the sales variability occurs because the
marketing department constantly changes its advertisement expenditure. He/she is
quite certain that there is a relationship between sales and advertising, but does
not know what the relationship is.
The different situations shown in Figure 10.1 are all possibilities for describing
the relationships between sales and advertising expenditure for the company. To
determine the appropriate relationship, we have to construct a scatter diagram
shown in Figure 10.2, considering the values shown in Table 10.1.
Figure 10.2: Scatter Diagram of Sales (Rs. crore) against Advertising Expenditure (Rs. crore) for a Company.
Figure 10.2 indicates that advertising expenditure and sales seem to be linearly
(positively) related. However, the strength of this relationship is not known, that
is, how close do the points come to fall on a straight line is yet to be
determined. The quantitative measure of strength of the linear relationship
between two variables (here sales and advertising expenditure) is called the
correlation coefficient. In the next section, therefore, we shall study the
methods for determining the coefficient of correlation.
. ..........................................................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................
2) How does the scatter diagram approach help in studying the correlation between two variables?
. ..........................................................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................
The simplified formulae (which are algebraically equivalent to the above formula) are:

1) r = Σxy / √(Σx² · Σy²),  where x = X − X̄ and y = Y − Ȳ

2) r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] · [ΣY² − (ΣY)²/n]}
i) ‘r’ is a dimensionless number whose numerical value lies between +1 and –1. The value +1 represents a perfect positive correlation, while the value –1 represents a perfect negative correlation. The value 0 (zero) represents a lack of correlation.
Figure 10.1 shows a number of scatter plots with corresponding values for
correlation coefficient.
ii) The coefficient of correlation is a pure number and is independent of the units of
measurement of the variables.
iii) The correlation coefficient is independent of any change in the origin and scale
of X and Y values.
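The raw-score (simplified) formula above translates directly into code. This sketch is added for illustration and also demonstrates property (iii), since a change of origin and scale leaves r unchanged:

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's r via the raw-score formula:
    r = [SXY - SX*SY/n] / sqrt([SX2 - SX^2/n] * [SY2 - SY^2/n])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    num = sum(x * y for x, y in zip(xs, ys)) - sx * sy / n
    den = math.sqrt((sum(x * x for x in xs) - sx ** 2 / n) *
                    (sum(y * y for y in ys) - sy ** 2 / n))
    return num / den

xs = [1, 2, 3, 4]
print(pearson_r(xs, [2 * x + 5 for x in xs]))   # 1.0: perfect positive
print(pearson_r(xs, [10 - 3 * x for x in xs]))  # -1.0: perfect negative
```

Because Y here is an exact linear function of X, r reaches its limiting values ±1; real data fall strictly between them.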
Remark: Care should be taken when interpreting the correlation results.
Although a change in advertising may, in fact, cause sales to change, the fact
that the two variables are correlated does not guarantee a cause and effect
relationship. Two seemingly unconnected variables may often be highly
correlated. For example, we may observe a high degree of correlation: (i) between the height and the income of individuals, or (ii) between the size of the shoes and the marks secured by a group of persons, even though it is not possible to conceive of them as being causally related. When correlation exists between two such seemingly unrelated variables, it is called spurious or non-sense correlation. Therefore, we must avoid basing conclusions on spurious correlations.
Illustration 2
We know that

r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] · [ΣY² − (ΣY)²/n]}
that data will be useful for forecasting (this is discussed in Section 10.4 on ‘Simple Linear Regression’).
You may notice that the manual calculations will be cumbersome for real-life research work. Therefore, statistical packages like Minitab, SPSS, SAS, etc., may be used to calculate ‘r’ and the other measures as well.
Once the coefficient of correlation has been obtained from sample data one is
normally interested in asking the questions: Is there an association between the
two variables? Or with what confidence can we make a statement about the
association between the two variables? Such questions are best answered
statistically by using the following procedure.
Testing of the null hypothesis (testing hypothesis and t-test are discussed in
detail in Units 15 and 16 of this course) that population correlation coefficient
equals zero (variables in the population are uncorrelated) versus alternative
hypothesis that it does not equal zero, is carried out by using the t-statistic:

t = r √[(n − 2) / (1 − r²)]

where r is the correlation coefficient from the sample.
Referring to the table of t-distribution for (n–2) degrees of freedom, we can find
the critical value for t at any desired level of significance (5% level of
significance is commonly used). If the calculated value of t (as obtained by the
above formula) is less than or equal to the table value of t, we accept the null
hypothesis (H0), meaning that the correlation between the two variables is not
significantly different from zero.
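The test procedure just described can be expressed in a few lines; the critical values are read from a t-table, as in the illustrations that follow:

```python
import math

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r**2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# r = 0.55 with n = 12: t is about 2.08, below the 5% table value 2.228 -> accept H0
print(round(t_statistic(0.55, 12), 2))
# the same r = 0.55 with n = 100: t is about 6.52, above 1.99 -> reject H0
print(round(t_statistic(0.55, 100), 2))
```

The contrast between the two calls shows why sample size matters: the same r of 0.55 is insignificant at n = 12 but highly significant at n = 100.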
Illustration 3
Solution: Let us take the null hypothesis (H0) that the variables in the
population are uncorrelated.
Applying t-test,
t = r √[(n − 2) / (1 − r²)] = 0.55 √[(12 − 2) / (1 − 0.55²)] = 2.08
From the t-distribution table (given at the end of this unit) with 10 degrees of freedom, for a 5% level of significance (two-tailed), the table value of t is 2.228. The calculated value of t is less than the table value of t. Therefore, we can conclude that this r of 0.55 for n = 12 is not significantly different from zero. Hence our null hypothesis (H0) holds true, i.e., the variables in the population are uncorrelated.
Solution: Let us take the hypothesis that the variables in the population are
uncorrelated. Apply the t-test:
t = r √[(n − 2) / (1 − r²)] = 0.55 √[(100 − 2) / (1 − 0.55²)] = 6.52
Referring to the table of the t-distribution for n − 2 = 98 degrees of freedom, the critical value for t at a 5% level of significance is 1.99 (approximately). Since the calculated value of t (6.52) exceeds the table value of t (1.99), we can conclude that there is a statistically significant association between the variables. Hence, our null hypothesis does not hold true.
R = 1 − 6Σd² / (N³ − N), where N = number of pairs of ranks, and Σd² = sum of the squares of the differences between the ranks of the two variables.
Illustration 5
Salesmen employed by a company were given one month of training. At the end of the training, a test was conducted on a sample of 10 salesmen, who were ranked on the basis of their performance in the test. They were then posted to their respective areas. After six months, they were rated in terms of their sales performance. Find the degree of association between the two sets of ranks.
Salesmen:                 1  2   3  4  5   6  7  8  9  10
Ranks in training (X):    7  1  10  5  6   8  9  2  3   4
Ranks on sales
performance (Y):          6  3   9  4  8  10  7  2  1   5
Solution: Table 10.3: Calculation of Coefficient of Rank Correlation
Salesmen Ranks Secured Ranks Secured Difference
in Training on Sales in Ranks D2
X Y D = (X–Y)
1 7 6 1 1
2 1 3 –2 4
3 10 9 1 1
4 5 4 1 1
5 6 8 –2 4
6 8 10 –2 4
7 9 7 2 4
8 2 2 0 0
9 3 1 2 4
10 4 5 –1 1
ΣD2 = 24
R = 1 − 6ΣD² / (N³ − N) = 1 − (6 × 24) / (10³ − 10) = 1 − 144/990 = 0.855
Thus, we can say that there is a high degree of positive correlation between the training and sales performance of the salesmen.
Now we proceed to test the significance of the results obtained. We are
interested in testing the null hypothesis (H0) that the two sets of ranks are not
associated in the population and that the observed value of R differs from zero
only by chance. The test used is the t-statistic.
t = R √[(n − 2) / (1 − R²)] = 0.855 √[(10 − 2) / (1 − 0.855²)] = 4.66
Referring to the t-distribution table for 8 d.f. (n − 2), the critical value for t at a 5% level of significance is 2.306. The calculated value of t is greater than the table value. Hence, we reject the null hypothesis, concluding that performance in training and performance on sales are closely associated.
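Illustration 5 can be reproduced with a short function. This sketch assumes untied ranks:

```python
def spearman_R(rank_x, rank_y):
    """Spearman's rank correlation for untied ranks:
    R = 1 - 6 * sum(d^2) / (N^3 - N)."""
    n = len(rank_x)
    d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n ** 3 - n)

# Illustration 5: ranks in training (X) and on sales performance (Y)
X = [7, 1, 10, 5, 6, 8, 9, 2, 3, 4]
Y = [6, 3, 9, 4, 8, 10, 7, 2, 1, 5]
print(round(spearman_R(X, Y), 3))  # 0.855, matching the hand computation
```

Because the inputs are already ranks, no sorting step is needed; with raw scores, each series would first be converted to ranks.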
When ranks are tied, a correction factor (t³ − t)/12 is added to Σd² for each group of t tied ranks:

R = 1 − 6 [Σd² + (t³ − t)/12 + (t³ − t)/12 + …] / (N³ − N)
3
1) Compute the degree of relationship between the price of shares (X) and the price of debentures (Y) over a period of 8 years by using Karl Pearson’s formula, and test the significance (5% level) of the association. Comment on the result.
Price of shares:      42  43  41   53  54  49  41  55
Price of debentures:  98  99  98  102  97  93  95  94
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
2) Consider the above exercise and assign the ranks to price of shares and price
of debentures. Find the degree of association by applying Spearman’s formula
and test its significance.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
10.4 SIMPLE LINEAR REGRESSION
Once we identify that correlation exists between two variables, we can develop an estimating equation, known as the regression equation or estimating line, i.e., a formula which helps us to estimate or predict the unknown value of one variable from the known value of another variable. In the
words of Ya-Lun-Chou, “regression analysis attempts to establish the nature of
the relationship between variables, that is, to study the functional relationship
between the variables and thereby provide a mechanism for prediction, or
forecasting.” For example, if we confirm that advertisement expenditure (independent variable) and sales (dependent variable) are correlated, we can
predict the required amount of advertising expenses for a given amount of sales
or vice-versa. Thus, the statistical method which is used for prediction is called
regression analysis. And, when the relationship between the variables is linear,
the technique is called simple linear regression.
Hence, the technique of regression goes one step further than correlation and uses relationships that have held in the past as a guide to what may happen in the future. To do this, we need the regression equation and the correlation coefficient. The latter is used to determine whether the variables are really moving together.
Yi = β0 + β1Xi + ei
wherein
β0 = Y-intercept,
β1 = slope of the line (the change in Y per unit change in X), and
ei = error term (i.e., the difference between the actual Y value and the value of Y predicted by the model).
i) Regression of Y on X
ii) Regression of X on Y.
When we draw the regression lines with the help of a scatter diagram as shown earlier in Fig. 10.1, we may get an infinite number of possible regression
lines for a set of data points. We must, therefore, establish a criterion for
selecting the best line. The criterion used is the Least Squares Method.
According to the least squares criterion, the best regression line is the one that minimizes the sum of squared vertical distances between the observed (X, Y) points and the regression line, i.e., Σ(Y − Ŷ)² is the least value, while the sum of the positive and negative deviations is zero, i.e., Σ(Y − Ŷ) = 0. It is important to note that the vertical distance between an observed point (X, Y) and the regression line is called the ‘error’.
Regression Equations
As we discussed above, there are two regression equations, also called
estimating equations, for the two regression lines (Y on X, and X on Y). These
equations are, algebraic expressions of the regression lines, expressed as
follows:
Regression Equation of Y on X
Ŷ = a + bX
Ŷ − Ȳ = byx (X − X̄)

byx = r (σy / σx) = [ΣXY − (ΣX)(ΣY)/N] / [ΣX² − (ΣX)²/N]
Regression equation of X on Y
X̂ = a + bY
X̂ − X̄ = bxy (Y − Ȳ)

bxy = r (σx / σy) = [ΣXY − (ΣX)(ΣY)/N] / [ΣY² − (ΣY)²/N]
It is worthwhile to note that the estimated simple regression line always passes through X̄ and Ȳ (which is shown in Figure 10.3). The following illustration shows how the estimated regression equations are obtained, and how they are used to estimate the value of Y for a given X value.
(Rs. in lakh)
Advertisement
expenditure (X):  0.8  1.0  1.6  2.0  2.2  2.6  3.0  3.0  4.0  4.0  4.0  4.6
Sales (Y):         22   28   22   26   34   18   30   38   30   40   50   46
Solution:
(Rs. in lakh)
Advertising   Sales
    (X)        (Y)       X²        Y²        XY
    0.8         22      0.64       484      17.6
    1.0         28      1.00       784      28.0
    1.6         22      2.56       484      35.2
    2.0         26      4.00       676      52.0
    2.2         34      4.84     1,156      74.8
    2.6         18      6.76       324      46.8
    3.0         30      9.00       900      90.0
    3.0         38      9.00     1,444     114.0
    4.0         30     16.00       900     120.0
    4.0         40     16.00     1,600     160.0
    4.0         50     16.00     2,500     200.0
    4.6         46     21.16     2,116     211.6
ΣX = 32.8   ΣY = 384   ΣX² = 106.96   ΣY² = 13,368   ΣXY = 1,150
Now we establish the best regression line (estimated by the least squares method):

Ŷ − Ȳ = byx (X − X̄)

Ȳ = 384/12 = 32;  X̄ = 32.8/12 = 2.733

byx = [ΣXY − (ΣX)(ΣY)/N] / [ΣX² − (ΣX)²/N]
byx = [1,150 − (32.8)(384)/12] / [106.96 − (32.8)²/12] = 5.801

Ŷ − 32 = 5.801 (X − 2.733)
Ŷ = 5.801X − 15.854 + 32 = 5.801X + 16.146
or Ŷ = 16.146 + 5.801X
which is shown in Figure 10.3. Note that, as said earlier, this line passes through X̄ (2.733) and Ȳ (32).
Figure 10.3: Least Squares Regression Line of a Company’s Advertising Expenditure and Sales. The chart plots Sales (Rs. lakh) against Advertising (Rs. lakh), showing the observed points used to fit the estimating line Ŷ = 16.146 + 5.801X, the points on the estimating line, and the positive and negative errors around it; the line passes through X̄ = 2.733 and Ȳ = 32.
Ŷ = 16.146 + 5.801X
wherein Ŷ = estimated sales for given value of X, and
X = level of advertising expenditure.
To find Ŷ, the estimate of expected sales, we substitute the specified advertising level into the regression model. For example, if we know that the company’s marketing department has decided to spend Rs. 2,50,000 (X = 2.5) on advertisement during the next quarter, the most likely estimate of sales (Ŷ) is:

Ŷ = 16.146 + 5.801 × 2.5 = 30.6485, i.e., Rs. 30,64,850
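The regression of Y on X computed above can be checked in code. Note that keeping full precision gives an intercept of about 16.143; the 16.146 in the worked example comes from rounding byx and X̄ before substituting:

```python
def least_squares_fit(xs, ys):
    """Least squares fit of Y-hat = a + b*X:
    b = [SXY - SX*SY/n] / [SX2 - SX^2/n],  a = mean(Y) - b * mean(X)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    b = (sum(x * y for x, y in zip(xs, ys)) - sx * sy / n) / \
        (sum(x * x for x in xs) - sx ** 2 / n)
    a = sy / n - b * sx / n
    return a, b

X = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
Y = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]
a, b = least_squares_fit(X, Y)
print(round(a, 3), round(b, 3))   # about 16.143 and 5.801
print(round(a + b * 2.5, 2))      # about 30.65 lakh of sales at X = 2.5
```

Swapping the roles of the two lists in the call gives the regression of X on Y, whose slope is the bxy computed next.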
Regression Equation of X on Y

X̂ − X̄ = bxy (Y − Ȳ)

bxy = [ΣXY − (ΣX)(ΣY)/N] / [ΣY² − (ΣY)²/N]
    = [1,150 − (32.8)(384)/12] / [13,368 − (384)²/12] = 0.093

X̂ − 2.733 = 0.093 (Y − 32), i.e., X̂ = −0.243 + 0.093Y
The following points about the regression should be noted:
1) The geometric mean of the two regression coefficients (byx and bxy) gives the coefficient of correlation, i.e., r = ±√(byx × bxy).
2) Both the regression coefficients will always have the same sign (+ or –).
Once the line of best fit is drawn, the next process in the study of regression
analysis is how to measure the reliability of the estimated regression equation.
Statisticians have developed a technique to measure the reliability of the
estimated regression equation called “Standard Error of Estimate (Se).” This Se
is similar to the standard deviation which we discussed in Unit-9 of this course.
We will recall that the standard deviation is used to measure the variability of a
distribution about its mean. Similarly, the standard error of estimate
measures the variability, or spread, of the observed values around the
regression line. We would say that both are measures of variability. The
larger the value of Se, the greater the spread of data points around the
regression line. If Se is zero, then all data points would lie exactly on the
regression line. In that case the estimated equation is said to be a perfect
estimator. The formula to measure Se is expressed as:
Se =
∑ (Y − Ŷ) 2
n
Illustration 7
R&D (Rs. lakh): 2.5 3.0 4.2 3.0 5.0 7.8 6.5
Solution: To calculate Se for this problem, we must first obtain the value of Σ(Y − Ŷ)². We have done this in Table 10.5.
Σ(Y − Ŷ)² = 24.62

We can now find the standard error of estimate as follows:

Se = √[Σ(Y − Ŷ)² / n] = √(24.62 / 7) = 1.875
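The Se computation can be sketched as below. Note that this unit's formula divides by n; many texts divide by n − 2 to correct for the two estimated coefficients:

```python
import math

def standard_error_of_estimate(y_actual, y_estimated):
    """Se = sqrt(sum((Y - Y_hat)^2) / n): the spread of the observed
    values around the regression line (dividing by n, as in this unit)."""
    n = len(y_actual)
    sse = sum((y - yh) ** 2 for y, yh in zip(y_actual, y_estimated))
    return math.sqrt(sse / n)

# With the illustration's residual sum of squares 24.62 over n = 7:
print(round(math.sqrt(24.62 / 7), 3))  # 1.875
```

A smaller Se means the observed points hug the fitted line more tightly; Se = 0 would mean a perfect estimator, as the text notes.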
R² = Explained variation / Total variation,  or  R² = 1 − [Σ(Y − Ŷ)² / Σ(Y − Ȳ)²]

where Σ(Y − Ȳ)² = ΣY² − (ΣY)²/n. For simple linear regression,

R² = r²
Refer to Illustration 6, where we computed ‘r’ with the help of the regression coefficients (bxy and byx), as an example for R²:
r = 0.734
R2 = r2 = 0.7342 = 0.5388
This means that 53.88 per cent of the variation in the sales (Y) can be
explained by the level of advertising expenditure (X) for the company.
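The same interpretation can be verified numerically. This sketch uses the variation-ratio definition and checks it against r²:

```python
def coefficient_of_determination(y_actual, y_estimated):
    """R^2 = 1 - sum((Y - Y_hat)^2) / sum((Y - Y_bar)^2):
    explained variation over total variation."""
    n = len(y_actual)
    y_bar = sum(y_actual) / n
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_actual, y_estimated))
    ss_tot = sum((y - y_bar) ** 2 for y in y_actual)
    return 1 - ss_res / ss_tot

# A perfect fit explains all of the variation:
print(coefficient_of_determination([22, 26, 30], [22, 26, 30]))  # 1.0
# For simple linear regression R^2 = r^2; with r = 0.734:
print(round(0.734 ** 2, 4))  # 0.5388, i.e., 53.88% of the variation
```

Unlike r, R² has a direct proportion-of-variation reading, which is why the text interprets 0.5388 as "53.88 per cent of the variation in sales".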
You are given the following data relating to age of Autos and their maintenance
costs. Obtain the two regression equations by the method of least squares and
estimate the likely maintenance cost when the age of Auto is 5 years and also
compute the standard error of estimate.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
reflects upon the nature of the variables (i.e., which is the dependent variable and which is the independent variable). Regression coefficients, therefore, are not symmetric in X and Y (i.e., byx ≠ bxy).
2) Correlation need not imply cause and effect relationship between the variables
under study. But regression analysis clearly indicates the cause and effect
relationship between the variables. The variable corresponding to cause is taken
as independent variable and the variable corresponding to effect is taken as
dependent variable.
3) Correlation coefficient ‘r’ is a relative measure of the linear relationship
between X and Y variables and is independent of the units of measurement.
It is a number lying between ±1. Whereas the regression coefficient byx (or
bxy) is an absolute measure representing the change in the value of the
variable Y (or X) for a unit change in the value of the variable X (or Y).
Once the functional form of the regression curve is known, by substituting the value of the independent variable we can obtain the value of the dependent variable, which will be in the units of measurement of that variable.
4) There may be spurious (non-sense) correlation between two variables which
is due to pure chance and has no practical relevance. For example, the
correlation between the size of shoe and the income of a group of
individuals. There is no such thing as spurious regression.
5) Correlation analysis is confined only to study of linear relationship between
the variables and, therefore, has limited applications. Whereas regression
analysis has much wider applications as it studies linear as well as non-linear
relationships between the variables.
Least Squares Criterion: The criterion for determining a regression line that
minimizes the sum of squared errors.
2. R = – 0.185
t = – 1.149
C) Y on X : Ŷ = 5 + 3.25x
X on Y : X̂ = – 3 + 0.297y
Students:             A  B  C  D   E  F  G  H  I   J
Rank by 1st judge:    5  2  4  1   8  9  7  6  3  10
Rank by 2nd judge:    1  9  7  8  10  2  4  5  3   6
Find out whether the judges are in agreement with each other or not and apply
the t-test for significance at 5% level.
9) A sales manager of a soft drink company is studying the effect of its latest
advertising campaign. People chosen at random were called and asked how
many bottles they had bought in the past week and how many advertisements
of this product they had seen in the past week.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
10.11 FURTHER READING
A number of good text books are available for the topics dealt with in this unit.
The following books may be used for more indepth study.
Richard I. Levin and David S. Rubin, 1996, Statistics for Management.
Prentice Hall of India Pvt. Ltd., New Delhi.
Peters, W.S. and G.W. Summers, 1968, Statistical Analysis for Business
Decisions, Prentice Hall, Englewood-cliffs.
Hooda, R.P., 2000, Statistics for Business and Economics, MacMillan India
Ltd., New Delhi.
Gupta, S.P., 1989, Elementary Statistical Methods, Sultan Chand & Sons, New Delhi.
Chandan, J.S., Statistics for Business and Economics, Vikas Publishing House Pvt. Ltd., New Delhi.
APPENDIX : TABLE OF t-DISTRIBUTION AREA
The table gives points of the t-distribution corresponding to degrees of freedom and the upper tail area (suitable for use in one-tail tests).
UNIT 11 TIME SERIES ANALYSIS
STRUCTURE
11.0 Objectives
11.1 Introduction
11.2 Definition and Utility of Time Series Analysis
11.3 Components of Time Series
11.4 Decomposition of Time Series
11.5 Preliminary Adjustments
11.6 Methods of Measurement of Trend
11.6.1 Freehand Method
11.6.2 Least Square Method
11.7 Let Us Sum Up
11.8 Key Words
11.9 Answers to Self Assessment Questions
11.10 Terminal Questions/Exercises
11.11 Further Reading
11.0 OBJECTIVES
After studying this unit, you should be able to:
11.1 INTRODUCTION
In the previous units, you have learnt statistical treatment of data collected for
research work. The nature of data varied from case to case. You have come
across quantitative data for a group of respondents collected with a view to
understanding one or more parameters of that group, such as investment, profit,
consumption, weight etc. But when a nation, a state, an institution or a business unit etc., intends to study the behaviour of some element, such as the price of a
product, exports of a product, investment, sales, profit etc., as they have
behaved over a period of time, the information shall have to be collected for a
fairly long period, usually at equal time intervals. Thus, a set of any quantitative
data collected and arranged on the basis of time is called ‘Time Series’.
Depending on the research objective, the unit of time may be a decade, a year,
a month, or a week etc. Typical time series are the sales of a firm in
successive years, monthly production figures of a cement mill, daily closing
price of shares in Bombay stock market, hourly temperature of a patient.
Usually, the quantitative data of the variable under study are denoted by y1, y2,
...yn and the corresponding time units are denoted by t1, t2, ..., tn. The variable
‘y’ shall have variations, as you will see ups and downs in the values. These
changes account for the behaviour of that variable.
Instantly it comes to our mind that 'time' is responsible for these changes, but this is not true: time (t) is not the cause, and the changes in the variable (y) are not its effect. What we must understand is that a number of causes affect the variable and have operated on it during the given time period. Hence, time is only the basis for data analysis.
Forecasting any event helps in the process of decision making. Forecasting is
possible if we are able to understand the past behaviour of that particular
activity. For understanding the past behaviour, a researcher needs not only the
past data but also a detailed analysis of the same. Thus, in this unit we will
discuss the need for analysis of time series, fluctuations of time series which
account for changes in the series over a period of time, and measurement of
trend for forecasting.
Another question: Shall sunflower oil be sold again in future for Rs. 60 per
kg? No doubt, your answer would be ‘Yes’. Have you ever thought about how
you answered the above two questions? Probably you have not! The analysis of
these answers shall lead us to arrive at the following observations:
– There are several causes which affect the variable gradually and permanently.
Therefore we are prompted to answer ‘No’ for the first question.
– There are several causes which affect the variable for the time being only. For this reason we are prompted to answer 'Yes' for the second question.
The causes which affect the variable gradually and permanently are termed as
“Long-Term Causes”. The examples of such causes are: increase in the rate of
capital formation, technological innovations, the introduction of automation,
changes in productivity, improved marketing etc. The effect of long term causes
is reflected in the tendency of a behaviour, to move in an upward or downward
direction, termed as ‘Trend’ or ‘Secular Trend’. It reveals as to how the time
series has behaved over the period under study.
The causes which affect the variables for the time being only are labelled as
“Short-Term Causes”. The short term causes are further divided into two parts,
they are ‘Regular’ and ‘Irregular’. Regular causes are further divided into two
parts, namely ‘cyclical causes’ and ‘seasonal causes’. The cyclical variations
are also termed as business cycle fluctuations, as they influence the variable. A
business cycle is composed of prosperity, recession, depression and recovery.
The periodic movements from prosperity to recovery and back again to
prosperity vary both in time and intensity. The seasonal causes, like weather
conditions, business climate and even local customs and ceremonies together
play an important role in giving rise to seasonal movements to almost all the
business activities. For instance, the yearly weather conditions directly affect
agricultural production and marketing.
It is worthwhile to say that the seasonal variations analysis will be possible only
if the season-wise data are available. This fact must be checked first. For
analysing the seasonal effects various methods are available. Among them
seasonal index by ‘Ratio to Moving Average Method’ is the most widely used.
However, if collected data provides only yearly values, there is no possibility of
obtaining seasonal variations. Therefore, the residual amount after eliminating
trend will be the effect of irregular or random causes.
Additive Model: It is based on the assumption that the four components are independent of one another. Under this assumption, the pattern of occurrence and the magnitude of movements in any particular component are not affected by the other components. In this model the values of the four components are expressed in the original units of measurement. Thus, the original or observed data 'Y' is the total of the four component values, that is:

Y = T + S + C + I

Multiplicative Model: Under this model the observed data 'Y' is taken as the product of the four component values, that is:

Y = T × S × C × I

In this model the values of all the components, except the trend values, are expressed as percentages.
In business research, normally, the multiplicative model is more suited and is used more frequently for the analysis of time series, because business and economic time series data result from the interaction of a number of factors which individually cannot be held responsible for generating any specific type of variation.
Components

Year  Quarter  Series  Trend  Seasonal  Cyclical-erratic
                (O)     (T)   (100 S)   (100 CI)
 1      1        79      80     120        82
        2        58      85      80        85
        3        84      90      92       102
        4       107      95     108       105
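The multiplicative relation can be checked directly against the table above. A minimal Python sketch, assuming (as the headings suggest) that the seasonal and cyclical-erratic columns are percentages to be divided by 100; differences of up to about one unit come from rounding in the table:

```python
# Recombine the table's components under the multiplicative model:
# the observed series O should be close to T x (S/100) x (CI/100).
rows = [
    # (observed O, trend T, seasonal 100S, cyclical-erratic 100CI)
    (79, 80, 120, 82),
    (58, 85, 80, 85),
    (84, 90, 92, 102),
    (107, 95, 108, 105),
]

for observed, trend, seasonal, cyclical in rows:
    recombined = trend * (seasonal / 100) * (cyclical / 100)
    print(f"O = {observed:3d}, T*S*CI = {recombined:6.2f}")
```

Each recombined value agrees with the observed series to within about one unit, which is what the multiplicative model claims.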
a) Time is the cause for the ups and downs in the values of the variable under
study.
b) The variable under study in time series analysis is denoted by ‘y’.
c) ‘Trend’ values are a major component of the time series.
d) Analysis of time series helps in knowing current accomplishment.
e) Weather conditions, customs, habits etc., are causes for cyclical variations.
f) The analysis of time series is done to know the expected quantity
change in the variable under study.
..................................................................................................................
..................................................................................................................
..................................................................................................................
11.6 METHODS OF MEASUREMENT OF TREND
The effect of long-term causes is seen in the trend values we compute. A trend is also known as a 'secular trend' or 'long-term trend'. There are several methods of isolating the trend, of which we shall discuss only the two most frequently used in the analysis of business and economic time series data. They are: the Free-hand Method, and the Method of Least Squares.
11.6.1 Freehand Method
Though this method is very simple, it does not have common acceptance, because it is highly subjective: it gives varying trend values for the same data when attempted by different researchers, or even by the same researcher at different times. Hence, it is not advisable to use it as a basis for forecasting, particularly when the time series is subject to very irregular movements. Let us consider an illustration to draw a trend line by the free-hand method.
Illustration 1
From the following data, find the trend line by using the Free-hand (Graphic) Method.

Year:                  1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
Foodgrain production:    35   55   40   85  135  110  130  150  130  120
(lakh tonnes)
[Fig. 1: Food Grain Production (in lakh tonnes), 1994-2003. The y-axis shows production (lakh tonnes) from 0 to 180; the plot shows the original data together with the free-hand trend line.]
11.6.2 Least Square Method
This is also known as the straight-line method. It is the method most commonly used in research to estimate the trend of time series data, as it is mathematically designed to satisfy two conditions. They are:
i) the sum of the deviations of the actual values of y from the computed trend values is zero, i.e., Σ(y − Yc) = 0, and
ii) the sum of the squares of these deviations is least, i.e., Σ(y − Yc)² is a minimum.
The straight-line method thus gives a line of best fit to the given data. The straight line which satisfies the above conditions, making use of the regression equation, is given by:
Yc = a + bx
where, ‘Yc’ represents the trend value of the time series variable y, ‘a’ and ‘b’
are constant values of which ‘a’ is the trend value at the point of origin
and ‘b’ is the amount by which the trend value changes per unit of
time, and ‘x’ is the unit of time (value of the independent variable).
The values of constants, ‘a’ and ‘b’, are determined by the following two
normal equations.
∑y = na + b∑x .................(i)
∑xy = a∑x + b∑x² .................(ii)

When the point of origin is chosen so that ∑x = 0, these equations simplify, and the values of the two constants are obtained by the following formulae:

a = ∑y / N,  and  b = ∑xy / ∑x²
It is to be noted that when the number of time units involved is even, the point
of origin will have to be chosen between the two middle time units.
Illustration 2
Year    Sales (thousand tonnes)
1998     70
1999     75
2000     90
2001     98
2002     85
2003     91
2004    100
Solution: To find the straight-line equation (Yc = a + bx) for the given time series data, we have to substitute the values in the expressions already arrived at, that is:

a = ∑y / N,  and  b = ∑xy / ∑x²
In order to make the total of x equal to zero, we must take the median year (i.e., 2001) as the origin. Study the following table carefully to understand the procedure for fitting the straight line.
a = ∑y / N = 609 / 7 = 87;  b = ∑xy / ∑x² = 117 / 28 = 4.18
Yc = 87 + 4.18x
From the above equation, we can also find the monthly increase in sales as follows:

4.18 thousand tonnes ÷ 12 = 4,180 ÷ 12 = 348.33 tonnes per month
This is because the trend values increase by the constant amount 'b' every year; hence the annual increase in sales is 4.18 thousand tonnes.
For estimating the sales of 2006, 'x' would be 5 (because for 2004 'x' was 3):
Y2006 = 87 + 4.18 (5) = 107.9 thousand tonnes.
For estimating the sales of 2008, 'x' would be 7:
Y2008 = 87 + 4.18 (7) = 116.3 thousand tonnes.
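The whole computation of Illustration 2 can be reproduced in a few lines of Python. This is only a sketch restating the unit's formulae, with x coded from the middle year so that Σx = 0:

```python
# Least-squares trend Yc = a + bx, with the origin at the middle year (2001)
# so that sum(x) = 0, giving a = sum(y)/N and b = sum(xy)/sum(x^2).
years = [1998, 1999, 2000, 2001, 2002, 2003, 2004]
sales = [70, 75, 90, 98, 85, 91, 100]            # thousand tonnes

n = len(years)
origin = years[n // 2]                           # 2001, the median year
x = [year - origin for year in years]            # -3, -2, -1, 0, 1, 2, 3

a = sum(sales) / n                                                        # 609/7
b = sum(xi * yi for xi, yi in zip(x, sales)) / sum(xi * xi for xi in x)   # 117/28

def trend(year):
    """Trend value Yc = a + b * x for any year."""
    return a + b * (year - origin)

print(a, round(b, 2))             # 87.0 4.18
print(round(trend(2006), 1))      # 107.9 (thousand tonnes)
```

The same `trend` function extrapolates to any future year, which is exactly how the unit's forecasts for 2006 and 2008 are obtained.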
Years Production
1996 40
1997 60
1998 45
1999 83
2000 130
2001 135
2002 150
2003 120
2004 200
The quantitative values of the variable under study are denoted by y1, y2, y3, ... and the corresponding time units are denoted by x1, x2, x3, ... . The variable 'y' shall have variations; you will see ups and downs in the values. There are a number of causes which affect the variable during a given time period. Therefore, time becomes the basis of analysis: time is not the cause, and the changes in the values of the variable are not the effect.
The causes which affect the variable gradually and permanently are termed as
Long-term causes. The causes which affect the variable only for the time being
are termed as Short-term causes. The time series are usually the result of the
effects of one or more of the four components. These are trend variations (T),
seasonal variations (S), Cyclical variations (C) and Irregular variations (I).
When we try to analyse the time series, we try to isolate and measure the
effects of various kinds of these components on a series.
1) Additive model, which considers the sum of the various components as resulting in the given values of the overall time series data; symbolically: Y = T + C + S + I.
2) Multiplicative model, which considers the product of the various components; symbolically: Y = T × C × S × I.
The trend analysis brings out the effect of long-term causes. There are
different methods of isolating trends, among these we have discussed only two
methods which are usually used in research work, i.e. free hand and least
square methods.
Long-term predictions can be made on the basis of trends, and only the least square method of trend computation offers this possibility.
‘y’ 24 28 38 33 49 50 66 68
5) The production (in thousand tons) in a sugar factory during 1994 to 2001
has been as follows:
Year        1994 1995 1996 1997 1998 1999 2000 2001
Production    35   38   49   41   56   58   76   75
(Hint: The point of origin must be taken between 1997 and 1998).
i) Find the trend values by applying the method of least square.
ii) What is the monthly increase in production?
iii) Estimate the production of sugar for the year 2008.
6) The following data relates to a survey of used car sales in a city for the
period 1993-2001. Predict sales for 2006 by using the linear trend
equation.
Years 1993 1994 1995 1996 1997 1998 1999 2000 2001
Sales 214 320 305 298 360 450 340 500 520
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
Probability and Probability Rules
UNIT 13 PROBABILITY AND PROBABILITY RULES
STRUCTURE
13.0 Objectives
13.1 Introduction
13.2 Meaning and History of Probability
13.3 Terminology
13.4 Fundamental Concepts and Approaches to Probability
13.5 Probability Rules
13.5.1 Addition Rule for Mutually Exclusive Events
13.5.2 Addition Rule for Non-mutually Exclusive Events
13.6 Probability Under Statistical Independence
13.7 Probability Under Statistical Dependence
13.8 Bayes' Theorem: Revision of A Priori Probability
13.9 Let Us Sum Up
13.10 Key Words
13.11 Answers to Self Assessment Exercises
13.12 Terminal Questions/Exercises
13.13 Further Reading
13.0 OBJECTIVES
After studying this unit, you should be able to:
l comprehend the concept of probability,
l acquaint yourself with the terminology related to probability,
l understand the probability rules and their application in determining probability,
l differentiate between determination of probability under the condition of
statistical independence and statistical dependence,
l apply probability concepts and rules to real life problems, and
l appreciate the relevance of the study of probability in decision making.
13.1 INTRODUCTION
In the previous units we have discussed the application of descriptive statistics.
The subject matter of probability and probability rules provide a foundation for
Inferential Statistics. There are various business situations in which the decision
makers are forced to apply the concepts of probability. Decision making in
various situations is facilitated through formal and precise expressions for the
uncertainties involved. For instance, formal and precise expression of the uncertainties in stock market prices and product quality may go a long way to help analyse, and facilitate decisions on, portfolio and sales planning respectively.
Probability theory provides us with the means to arrive at precise expressions
for taking care of uncertainties involved in different situations.
This unit starts with the meaning of probability and its brief historical evolution.
Its meaning has been described. The next section covers fundamental concept
of probability as well as three approaches for determining probability. These
approaches are : i) Classical approach; ii) Relative frequency of occurrence
approach, and iii) Subjective approach.
Thereafter the addition rule for probability has been explained for both mutually
exclusive events and non-mutually exclusive events. Proceeding further the unit
addresses the important aspects of probability rules, the conditions of statistical
independence and statistical dependence. The concepts of marginal, joint, and conditional probability have been explained with suitable examples.
If conditions of certainty alone were to prevail, life would be much simpler. As is obvious, there are numerous real-life situations in which conditions of uncertainty and risk prevail. Consequently, we have to rely on the theory of chance, or probability, in order to have a better idea about the possible outcomes. There are social, economic and business sectors in which decision making becomes a real challenge for managers. They may be in the dark about the possible consequences of their decisions and actions. Due to increasing competitiveness, the stakes have become higher and the cost of making a wrong decision has become enormous.
13.3 TERMINOLOGY
Before we proceed to discuss the fundamental concepts and approaches to
determining probability, let us now acquaint ourselves with the terminology
relevant to probability.
ii) Trial and Events: To conduct an experiment once is termed a trial, while possible outcomes or combinations of outcomes are termed events. For example, the toss of a coin is a trial, and the occurrence of either a head or a tail is an event.
iii) Sample Space: The set of all possible outcomes of an experiment is called the sample space for that experiment. For example, in a single throw of a die, the sample space is (1, 2, 3, 4, 5, 6).
iv) Collectively Exhaustive Events: It is the set of all possible events that can
result from an experiment. It is obvious that the sum total of probability value of
each of these events will always be one. For example, in a single toss of a fair
coin, the collectively exhaustive events are either head or tail. Since
vi) Equally Likely Events: When all the possible outcomes of an experiment have an equal probability of occurrence, such events are called equally likely events. For example, in the case of tossing a fair coin, we have already seen that
P(Head) = P (Tail) = 0.5
Many common experiments in real life also can have events, which have all of
the above properties. The best example is that of a single toss of a coin, where
both the possible outcomes or events of either head or tail coming on top are
collectively exhaustive, mutually exclusive and equally likely events.
(i) The value of the probability of any event lies between 0 and 1. This may be expressed as follows:
0 ≤ P (Event) ≤ 1
If the value of probability of an event is equal to zero, then the event is never
expected to occur and if the probability value is equal to one, the event is
always expected to occur.
(ii) The sum of the simple probabilities for all possible outcomes of an activity must
be equal to one.
Before proceeding further, let us discuss the different approaches to defining the probability concept.
Approaches to Probability
There are three approaches to determine probability. These are :
a) Classical Approach: The classical approach to defining probability is based on the premise that all possible outcomes or elementary events of an experiment are mutually exclusive and equally likely. The term equally likely means that each of all the possible outcomes has an equal chance of occurrence. Hence, as per this approach, the probability of occurrence of any event 'E' is given as:

P(E) = Number of outcomes favourable to E / Total number of possible outcomes
Example: When we toss a fair coin, the probability of getting a head would be 1/2. Similarly, when a die is thrown, the probability of getting an odd number is 3/6, or 1/2.
The premise that all outcomes are equally likely assumes that the outcomes are symmetrical. Symmetrical outcomes are possible only when the coin or die being tossed is fair. This requirement restricts the application of the classical approach to experiments which give rise to symmetrical outcomes; it therefore provides no answer to problems involving asymmetrical outcomes. And we do come across such situations more often in real life.
Thus, the classical approach to probability suffers from the following limitations:
i) The approach is not useful when the events cannot be considered "equally likely".
ii) It fails to deal with questions like: what is the probability that a bulb will burn out within 2,000 hours? What is the probability that a female will die before the age of 50 years? etc.
From this available data the company can estimate the probability of death for that age group as 'P', where:

P = 30 / 10,000 = 0.003
This approach too has limited practical utility, because the computation of probability requires repetition of an experiment a large number of times. This is particularly a problem where an event occurs only once, so that repetitive occurrence under precisely the same conditions is neither possible nor desirable.
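The relative-frequency idea lends itself to simulation. A small sketch (not from the unit) estimating the probability of an odd number on a fair die as the ratio of occurrences to trials:

```python
import random

# Relative frequency: probability is estimated as occurrences / trials,
# and the estimate stabilises as the number of trials grows.
random.seed(42)                      # fixed seed so the run is repeatable
trials = 100_000
occurrences = sum(random.randint(1, 6) % 2 == 1 for _ in range(trials))
estimate = occurrences / trials

print(round(estimate, 3))            # close to the true value 0.5
```

With 100,000 trials the estimate lands within about one per cent of the classical answer, which illustrates why the approach needs a large number of repetitions.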
Importantly, these three approaches complement one another, because where one fails, the other takes over. However, all are identical in as much as probability is defined as a ratio or a weight assigned to the likelihood of occurrence of an event.
The following rules of probability are useful for calculating the probability of an
event/events under different situations.
If A and B are mutually exclusive events, the probability of either A or B occurring is the sum of their individual probabilities:

P(A or B) = P(A) + P(B)

This rule is depicted in Figure 13.1 below.

[Figure 13.1: Two non-overlapping circles representing P(A) and P(B).]
The essential requirement for any two events to be mutually exclusive is that there are no outcomes common to the occurrence of both. This condition is satisfied when the sample space does not contain any outcome favourable to the occurrence of both A and B, i.e., A ∩ B = φ.
There is an important special case: for any event E, either E happens or it does not. So the events E and 'not E' (written Ē) are exhaustive and mutually exclusive, and

P(E) = 1 − P(Ē), i.e., P(Ē) = 1 − P(E).
[Figure 13.2: Two overlapping circles P(A) and P(B), with the overlapping region representing P(A and B) = P(A ∩ B).]
Thus, it is clear that the probability of outcomes that are common to both the
events is to be subtracted from the sum of their simple probability.
Solution: These events are not mutually exclusive, so the required probability of drawing a Jack or a spade is given by:

P(Jack or Spade) = P(Jack) + P(Spade) − P(Jack and Spade) = 4/52 + 13/52 − 1/52 = 16/52
Solution: P(Male or over 35) = P(Male) + P(over 35) − P(Male and over 35)
= 3/5 + 2/5 − 1/5 = 4/5
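The addition rule for non-mutually exclusive events can also be checked by brute-force counting. A Python sketch, assuming the Jack-or-spade illustration above refers to a standard 52-card deck:

```python
# Verify P(A or B) = P(A) + P(B) - P(A and B) by enumerating a 52-card deck.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in ranks for suit in suits]

p_jack = sum(rank == "J" for rank, suit in deck) / len(deck)        # 4/52
p_spade = sum(suit == "spades" for rank, suit in deck) / len(deck)  # 13/52
p_both = sum(rank == "J" and suit == "spades"
             for rank, suit in deck) / len(deck)                    # 1/52
p_either = sum(rank == "J" or suit == "spades"
               for rank, suit in deck) / len(deck)                  # 16/52

# The common outcome (Jack of spades) is counted once, not twice.
assert abs(p_either - (p_jack + p_spade - p_both)) < 1e-12
print(round(p_either, 4))      # 0.3077
```

Counting "Jack or spade" directly gives the same 16/52 as adding the two probabilities and subtracting the overlap, which is the whole content of the rule.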
P (H) = ½ = 0.5
Another example: in a throw of a fair die, the marginal probability of the face bearing the number 3 is:

P(3) = 1/6 = 0.167

Since the throws of the die are independent of each other, this is a case of statistical independence.
Take another example: when a fair die is thrown twice in quick succession, the probability of having 2 in the first throw and 4 in the second throw is given as:
P(2 in 1st throw and 4 in 2nd throw) = P(2 in the 1st throw) × P(4 in the 2nd throw)
= 1/6 × 1/6 = 1/36 = 0.028
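Under statistical independence the joint probability is simply the product of the marginals. Enumerating all 36 equally likely outcomes of two throws confirms the 1/36 above (a sketch, not part of the original unit):

```python
from fractions import Fraction
from itertools import product

# Enumerate all ordered pairs from two throws of a fair die.
outcomes = list(product(range(1, 7), repeat=2))      # 36 equally likely pairs
favourable = sum(1 for first, second in outcomes if first == 2 and second == 4)
p = Fraction(favourable, len(outcomes))

print(p)                       # 1/36
print(round(float(p), 3))      # 0.028
```

Exactly one pair out of 36, namely (2, 4), is favourable, matching the multiplication rule's 1/6 × 1/6.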
For example, suppose we want to find out the probability of heads coming up in the second toss of a fair coin, given that the first toss has already resulted in a head. Symbolically, we can write it as P(H2/H1). As the two tosses are statistically independent of each other,

P(H2/H1) = P(H2)
The following table 13.1 summarizes these three types of probabilities, their
symbols and their mathematical formulae under statistical independence.
Table 13.1
For example, suppose the first child of a couple is a girl and we want to find the probability that the second child will be a boy. In this case:
P (B/G) = P (B)
As both the events are independent of each other, the conditional probability of
having the second child as a boy on condition that their first child was a girl, is
equal to the marginal probability of having the second child as a boy.
For example, take the case of rain in different states of India. Suppose, the
probability of having rain in different states of India is given as:
Then, finding the probability of having rainfall in Gujarat, on condition that during this period Bihar receives heavy rainfall, is a case of statistical independence, as the two events (rain in Gujarat and rain in Bihar) are quite independent of each other. So this conditional probability is equal to the marginal probability of having rainfall in Gujarat (which is equal to 0.5).
For example, an urn contains 3 white balls and 7 black balls. We draw a ball from the urn, replace it, and then draw a second ball. Now we have to find the probability of drawing a black ball in the second draw, on condition that the ball drawn in the first attempt was a white one.
a) The time until the failure of a watch and of a second watch marketed by
different companies – yes/no
b) The life span of the current Indian PM and that of current Pakistani
President – Yes/no.
c) The takeover of a company and a rise in the price of its stock – Yes/no.
3) What is the probability that in selecting two cards one at a time from a deck
with replacement, the second card is
4) A bag contains 32 marbles: 4 are red, 9 are black, 12 are blue, 6 are yellow and
1 is purple. Marbles are drawn one at a time with replacement. What is the
probability that:
a) The second marble is yellow given the first one was yellow?
b) The second marble is yellow given the first one was black?
c) The third marble is purple given both the first and second were purple?
..................................................................................................................
..................................................................................................................
..................................................................................................................
There are three types of probability under statistical dependence case. They
are:
a) Conditional Probability;
b) Joint Probability;
c) Marginal Probability.
Let us discuss each of these three concepts.
The conditional probability of event A, given that event B has already occurred, can be calculated as follows:

P(A/B) = P(AB) / P(B)
(i) P(CD) = 3/10 = the joint probability that the ball drawn is coloured as well as dotted.
Similarly, P(CS) = 1/10, P(GD) = 2/10, and P(GS) = 4/10.
i) P(D/C) = P(DC) / P(C)

where P(C) = probability of drawing a coloured ball from the box = 4/10 (4 coloured balls out of 10 balls).

∴ P(D/C) = (3/10) / (4/10) = 0.75
ii) Similarly, P(S/C) = the conditional probability of drawing a striped ball, knowing that it is a coloured one:

P(S/C) = P(SC) / P(C) = (1/10) / (4/10) = 0.25

Thus, the probability that a ball is dotted, given that it is coloured, is 0.75; similarly, the probability that it is striped, given that it is coloured, is 0.25.
b) Continuing the same illustration, suppose we wish to find (i) P(D/G) and (ii) P(S/G).

Solution: i) P(D/G) = P(DG) / P(G) = (2/10) / (6/10) = 1/3 = 0.33

ii) P(S/G) = P(SG) / P(G) = (4/10) / (6/10) = 2/3 = 0.67
Solution: (i) P(G/D) = P(GD) / P(D) = (2/10) / (5/10) = 0.4
(ii) P(C/D) = P(CD) / P(D) = (3/10) / (5/10) = 0.6

Solution: (i) P(C/S) = P(CS) / P(S) = (1/10) / (5/10) = 0.2
(ii) P(G/S) = P(GS) / P(S) = (4/10) / (5/10) = 0.8
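All of the conditional and marginal probabilities above follow mechanically from the counts in the box. A Python sketch, assuming (from the joint probabilities given) a 10-ball box with 3 coloured-dotted, 1 coloured-striped, 2 grey-dotted and 4 grey-striped balls; exact fractions avoid any rounding:

```python
from collections import Counter
from fractions import Fraction

# Box contents inferred from the joint probabilities in the text:
# P(CD) = 3/10, P(CS) = 1/10, P(GD) = 2/10, P(GS) = 4/10.
balls = ["CD"] * 3 + ["CS"] * 1 + ["GD"] * 2 + ["GS"] * 4
n = len(balls)
count = Counter(balls)

def p_marginal(trait):
    """Marginal probability of a single trait: 'C', 'G', 'D' or 'S'."""
    return Fraction(sum(c for kind, c in count.items() if trait in kind), n)

def p_cond(trait, given):
    """Conditional probability under dependence: P(trait | given)."""
    joint = Fraction(sum(c for kind, c in count.items()
                         if trait in kind and given in kind), n)
    return joint / p_marginal(given)

print(float(p_cond("D", "C")))   # 0.75 = P(D/C)
print(float(p_cond("S", "C")))   # 0.25 = P(S/C)
print(float(p_cond("G", "S")))   # 0.8  = P(G/S)
print(float(p_marginal("D")))    # 0.5  = P(CD) + P(GD)
```

The last line is the marginal probability of a dotted ball, obtained by summing the joint probabilities of the events in which 'dotted' occurs, exactly as the text describes.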
The formula for calculating the joint probability of two events under the condition of statistical dependence is derived from the formula of conditional probability.
Therefore, the joint probability of two statistically dependent events A and B is given by the following formula:

P(AB) = P(A/B) × P(B) = P(B/A) × P(A)

Since P(AB) = P(BA), the two products on the right-hand side must be equal to each other.
Notice that this formula is not the same under conditions of statistical
independence, i.e., P (BA) = P(B) × P (A). Continuing with our previous
illustration 4, of a box containing 10 balls, the value of different joint
probabilities can be calculated as follows:
Applying the above general formula, P(AB) = P(A/B) × P(B), to our illustration in terms of coloured, dotted, striped, and grey balls, we can calculate the joint probabilities P(CD), P(GS), P(GD), and P(CS) as follows:
Note: The values of P(C/D), P(G/S), P(G/D), and P(C/S) have already been computed under conditional probability under statistical dependence.
c) Marginal Probability Under the Condition of Statistical Dependence
Finally, we discuss the concept of marginal probability under the condition of
statistical dependence. It can be computed by summing up all the probabilities
of those joint events in which that event occurs whose marginal probability we
want to calculate.
Solution: i) We can obtain the marginal probability of the event 'dotted balls' by adding the probabilities of all the joint events in which dotted balls occurred.
In the same manner, we can compute the marginal probabilities of the remaining events as follows:
iv) P (S) = P (CS) + P (GS) = 1/10 + 4/10 = 0.5
The following Table 13.2 summarizes the three types of probabilities, their symbols and their mathematical formulae under statistical dependence.
1) According to a survey, the probability that a family owns two cars, given that its annual income is greater than Rs. 35,000, is 0.75. Of the households surveyed, 60 per cent had incomes over Rs. 35,000 and 52 per cent had two cars. What is the probability that a family has two cars and an income over Rs. 35,000 a year?
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
2) Given that P(A) = 3/14, P (B) = 1/6, P(C) = 1/3, P (AC) = 1/7 and P (B/C)
= 5/21, find the following probabilities: P (A/C), P (C/A), P (BC), P (C/B).
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
3) At a restaurant, a social worker gathers the following data. Of those visiting the
restaurant, 59% are male, 32 per cent are alcoholics and 21 per cent are male
alcoholics. What is the probability that a random male visitor to the restaurant is
an alcoholic?
..................................................................................................................
..................................................................................................................
..................................................................................................................
[Diagram: A priori probability and New Information are fed into the Bayes' Process, which yields Posterior Probabilities.]
P(A/B) = P(AB) / P(B)
Now, we can find out the value of P (F/3), as well as P (L/3), by using the
formula
P(F/3) = P(F and 3) / P(3) = 0.083 / 0.383 = 0.216, and
P(L/3) = P(L and 3) / P(3) = 0.300 / 0.383 = 0.784
Our original estimate of the probability of the fair die being rolled was 0.5, and similarly for the loaded die it was again 0.5. But with a single roll of the die, given that 3 has appeared on top, the probability that the loaded die was rolled increases to 0.78, while the probability that the fair die was rolled decreases to 0.22. This example illustrates the power of Bayes' theorem.
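The revision above can be reproduced numerically. A sketch of the Bayes computation, assuming (to match the unit's figures) that the loaded die shows a 3 with probability 0.6 while the fair die shows it with probability 1/6:

```python
# Prior probabilities: either die is equally likely to have been picked.
prior = {"fair": 0.5, "loaded": 0.5}
# Likelihood of observing a 3 on one roll. The loaded-die value of 0.6 is
# an assumption chosen to reproduce the unit's numbers.
likelihood = {"fair": 1 / 6, "loaded": 0.6}

# Joint probabilities P(die and 3), their total P(3), and the posteriors.
joint = {die: prior[die] * likelihood[die] for die in prior}
p_three = sum(joint.values())
posterior = {die: joint[die] / p_three for die in joint}

print(round(joint["fair"], 3), round(joint["loaded"], 3))  # 0.083 0.3
print(round(p_three, 3))                                   # 0.383
print(round(posterior["loaded"], 2))                       # 0.78
print(round(posterior["fair"], 2))                         # 0.22
```

The slight differences from the text's 0.216 and 0.784 arise only because the text rounds the intermediate values before dividing.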
1. There are two machines, A and B, in a factory. As per the past information,
these two machines produced 30% and 70% of items of output respectively.
Further, 5% of the items produced by machine A and 1% produced by machine
B were defective. If a defective item is drawn at random, what is the
probability that the defective item was produced by machine A or machine B?
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
Geektonight Notes
13.9 LET US SUM UP
At the beginning of this unit the historical evolution and the meaning of
probability has been discussed. Contribution of leading mathematicians has been
highlighted. Fundamental concepts and approaches to determining probability
have been explained. The three approaches used to determine probability in risky and uncertain situations, namely the classical, the relative frequency, and the subjective approach, have been discussed.
2. a) 1/2; b) 1/2.
3. a) P (Face2/Red1) = 3/13
b) P (Ace2/Face1) = 1/13
3. 0.356.
Supplementary Illustrations
Here, we have
P (A) = 0.25
P (B) = 0.40 and
P ( A ∪ B) = 0.5, then P ( A ∩ B) = ?
P (A ∪ B) = P (A) + P (B) – P (A ∩ B)

∴ P (A ∩ B) = 0.25 + 0.40 – 0.5 = 0.15
A non-leap year consists of 365 days, i.e., 52 full weeks and one extra day. So, a non-leap year contains 53 Mondays only when that extra day is a Monday. But that extra day can be any one of the seven days, viz., Sunday, Monday, Tuesday, Wednesday, Thursday, Friday or Saturday. Hence, the required probability is 1/7.
4) What is the probability of having at least one head on two tosses of a fair coin?

The possible ways in which at least one head may occur are H1 H2, H1 T2 and T1 H2. Since the two tosses are statistically independent events, each of these outcomes, P (H1 H2), P (H1 T2) and P (T1 H2), has a probability of 0.25. Therefore, the probability of at least one head on two tosses is 0.25 × 3 = 0.75.
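The same answer can be checked by enumerating the sample space, for example in Python:

```python
from itertools import product

# Enumerate the four equally likely outcomes of two tosses of a fair coin.
outcomes = list(product("HT", repeat=2))        # ('H','H'), ('H','T'), ('T','H'), ('T','T')
favourable = [o for o in outcomes if "H" in o]  # outcomes with at least one head
p_at_least_one_head = len(favourable) / len(outcomes)
print(p_at_least_one_head)  # 0.75
```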
5) Suppose we are tossing an unfair coin, where the probability of getting a head in a toss is 0.8. If we have to calculate the probability of having three heads in three consecutive trials, then, as the tosses are independent,

P (H1 H2 H3) = 0.8 × 0.8 × 0.8 = 0.512

If we have to calculate the probability of having three consecutive tails in three trials,

P (T1 T2 T3) = 0.2 × 0.2 × 0.2 = 0.008
Suppose at random one ball is picked out from the urn, then we have to
find out the probability that:
P (WL) = 0.4
P (YL) = 0.3
P (WN) = 0.2
P (YN) = 0.1
Also, P (W) = 0.6
P (Y) = 0.4
P (L) = 0.7
P (N) = 0.3
As we also know that:
P (W) = P (WL) + P (WN) = 0.4 + 0.2 = 0.6
P (Y) = P (YL) + P (YN) = 0.3 + 0.1 = 0.4
P (L) = P (WL) + P (YL) = 0.4 + 0.3 = 0.7
and P (N) = P (WN) + P (YN) = 0.2 + 0.1 = 0.3
So, (i) P (L) = 0.7
(ii) P (L/Y) = P (LY) / P (Y) = 0.3 / 0.4 = 0.75
and P (C/D) = P (CD) / P (D) = (3/10) / (5/10) = 0.6
P (M1/D) = P (M1 and D) / P (D) = 0.018 / 0.098 = 0.1837

P (M2/D) = P (M2 and D) / P (D) = 0.030 / 0.098 = 0.3061

P (M3/D) = P (M3 and D) / P (D) = 0.050 / 0.098 = 0.5102

Total = 1.0000
These three conditional probabilities are called the posterior probabilities.
It is clear from the revised probability values that the probability of a defective unit having been produced on M1 is 0.18, on M2 is 0.31, and on M3 is 0.51, against the prior probabilities 0.3, 0.2, and 0.5 respectively. And the probability that a unit produced by this firm is defective is 0.098.
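A minimal sketch of the same posterior computation in Python. The defect rates 0.06, 0.15 and 0.10 for the three machines are assumptions back-solved from the joint probabilities above, i.e., P(Mi and D) divided by the priors 0.3, 0.2 and 0.5:

```python
# Posterior probabilities P(Mi/D) for the three-machine example.
priors = {"M1": 0.3, "M2": 0.2, "M3": 0.5}
defect_rates = {"M1": 0.06, "M2": 0.15, "M3": 0.10}  # back-solved assumptions

# Total probability of a defective unit: P(D) = sum of P(Mi) * P(D/Mi)
p_d = sum(priors[m] * defect_rates[m] for m in priors)          # 0.098

# Bayes' theorem: P(Mi/D) = P(Mi and D) / P(D)
posterior = {m: priors[m] * defect_rates[m] / p_d for m in priors}
print({m: round(p, 4) for m, p in posterior.items()})
# {'M1': 0.1837, 'M2': 0.3061, 'M3': 0.5102}
```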
13.12 TERMINAL QUESTIONS/EXERCISES
4. State and prove the addition rule of probability for two mutually exclusive
events.
7. One ticket is drawn at random from an urn containing tickets numbered from 1
to 50. Find out the probability that:
i) It is a multiple of 5 or 7
ii) It is a multiple of 4 or 3
[Answer: i) 8/25, ii) 12/25]
8. If two dice are being rolled, then find out the probabilities that:
(b) If two coins are tossed once, what is the probability of getting
(i) Both heads
(ii) At least one head ?
[Answer: (a) 1/2 (b) (i) 1/4 (ii) 3/4]
10. Given that P (A) = 3/14, P (B) = 1/6, P (C) = 1/3, P (AC) = 1/7, P (B/C) = 5/21.
11. A T.V. manufacturing firm purchases a certain item from three suppliers X, Y
and Z. They supply 60%, 30% and 10% respectively. It is known that 2%, 5%
and 8% of the items supplied by the respective suppliers are defective. On a
particular day, the firm received items from three suppliers and the contents get
mixed. An item is chosen at random. What is the probability that it is defective? If the item is found to be defective, what is the probability that it was supplied by X, Y or Z?
[Ans. P (D) = 0.035 P (X/D) = 0.34, P (Y/D) = 0.43, and P (Z/D) = 0.23].
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
Levin, R.I. and Rubin, D.S., 1991, Statistics for Management, PHI: New Delhi.

Hooda, R.P., 2001, Statistics for Business and Economics, MacMillan India Limited: Delhi.

Gupta, S.P., 2000, Statistical Methods, Sultan Chand & Sons: Delhi.
UNIT 14 PROBABILITY DISTRIBUTIONS
STRUCTURE
14.0 Objectives
14.1 Introduction
14.2 Types of Probability Distribution
14.3 Concept of Random Variables
14.4 Discrete Probability Distribution
14.4.1 Binomial Distribution
14.4.2 Poisson Distribution
14.5 Continuous Probability Distribution
14.5.1 Normal Distribution
14.5.2 Characteristics of Normal Distribution
14.5.3 Importance and Application of Normal Distribution
14.6 Let Us Sum Up
14.7 Key Words
14.8 Answers to Self Assessment Exercises
14.9 Terminal Questions/Exercises
14.10 Further Reading
14.0 OBJECTIVES
After studying this unit, you should be able to:
14.1 INTRODUCTION
A probability distribution is essentially an extension of the theory of probability
which we have already discussed in the previous unit. This unit introduces the
concept of a probability distribution, and to show how the various basic
probability distributions (binomial, poisson, and normal) are constructed. All these
probability distributions have immensely useful applications and explain a wide
variety of business situations which call for computation of desired probabilities.
This means that the total probability of unity for a certain experiment is distributed over a set of disjoint events making up a complete group. In general, a tabular recording of the probabilities of all the possible outcomes that could result if a random experiment were performed is called a probability distribution.
In the frequency distribution, the class frequencies add up to the total number
of observations (N), whereas in the case of a probability distribution the possible
outcomes (probabilities) add up to ‘one’. Like the former, a probability
distribution is also described by a curve and has its own mean, dispersion, and
skewness.
Table 14.2: Probability Distribution of the Possible No. of Heads from Two-toss Experiment of a Fair Coin

No. of Heads (H)    Tosses (outcomes)     Probability P (H)
0                   (T, T)                1/4 = 0.25
1                   (H, T) + (T, H)       1/2 = 0.50
2                   (H, H)                1/4 = 0.25
We must note that the above tables are not the real outcome of tossing a fair coin twice. But it is a theoretical outcome, i.e., it represents the way in which we expect our two-toss experiment of an unbiased coin to behave over time.
In the example given in the Introduction, we have seen that the outcomes of the experiment of a two-toss of a fair coin were expressed in terms of the number of heads. We found, in the example, that H (head) can assume values of 0, 1 and 2, and corresponding to each value a probability is associated. This uncertain real variable H, which assumes different numerical values depending on the outcomes of an experiment, and to each of whose values a probability assignment can be made, is known as a random variable. The resulting representation of all the values with their probabilities is termed the probability distribution of H.
H:      0      1      2
P (H):  0.25   0.50   0.25
In this case, as we find that H takes only discrete values, the variable H is called a discrete random variable, and the resulting distribution is a discrete
probability distribution. The function that specifies the probability distribution
of a discrete random variable is called the probability mass function (p.m.f.).
In the above situations, we have seen that the random variable takes a limited
number of values. There are certain situations where the variable under
consideration may have infinite values. Consider for example, that we are
interested in ascertaining the probability distribution of the weight of one kg.
coffee packs. We have reasons to believe that the packing process is such that
a certain percentage of the packs weigh slightly below one kg. and some packs weigh above one kg. It is easy to see that it is essentially by chance that a pack
will weigh exactly 1 kg., and there are an infinite number of values that the
random variable ‘weight’ can take. In such cases, it makes sense to talk of the
probability that the weight will be between two values, rather than the
probability of the weight taking any specific value. These types of random
variables which can take an infinitely large number of values are called
continuous random variables, and the resulting distribution is called a
continuous probability distribution. The function that specifies the probability
distribution of a continuous random variable is called the probability density
function (p.d.f.).
Table 14.4

X        P (X)      X P (X)
100      0.3        30
110      0.6        66
120      0.1        12

Expected value E (X) = Σ X P (X) = 108
Now, we will examine situations involving discrete random variables and discuss the methods for assessing them.
14.4 DISCRETE PROBABILITY DISTRIBUTION

14.4.1 Binomial Distribution

The binomial distribution is the basic and the most common probability distribution. It has been used to
describe a wide variety of processes in business. For example, a quality control
manager wants to know the probability of obtaining defective products in a
random sample of 10 products. If 10 per cent of the products are defective, he/
she can quickly obtain the answer, from tables of the binomial probability
distributions. It is also known as the Bernoulli distribution, as it originated with the Swiss mathematician James Bernoulli (1654-1705).
The binomial distribution describes discrete, not continuous, data resulting from
an experiment known as Bernoulli Process. Binomial distribution is a probability
distribution expressing the probability of one set of dichotomous alternatives, i.e.,
success or failure.
Assumptions:

i) Each trial has only two possible outcomes, either yes or no, success or failure, etc.

ii) Regardless of how many times the experiment is performed, the probability of the outcome, each time, remains the same.

iii) The trials are mutually independent, i.e., the outcome of any trial is neither affected by others nor affects others.
The determining equation for nCr can easily be written as:

nCr = n! / [r! (n − r)!]

Since n! can be simplified by cancelling the common factors of (n − r)!, the following form of the equation, for carrying out computations of the binomial probability, is perhaps more convenient:

P (r) = [n! / (r! (n − r)!)] p^r q^(n−r)

If n is large, say 50C3, then we can write (with the help of the above explanation):

50C3 = (50 × 49 × 48) / (3 × 2 × 1)

Similarly,

75C5 = (75 × 74 × 73 × 72 × 71) / (5 × 4 × 3 × 2 × 1), and so on.
Illustration 1
A fair coin is tossed six times. What is the probability of obtaining four or more
heads?
Solution: When a fair coin is tossed, the probabilities of head and tail in the case of an unbiased coin are equal, i.e.,

p = q = ½ or 0.5
P (r) = [n! / (r! (n − r)!)] p^r q^(n−r)

P (4) = [6! / (4! (6 − 4)!)] (0.5)^4 (0.5)^2

     = [(6 × 5 × 4 × 3 × 2 × 1) / ((4 × 3 × 2 × 1) (2 × 1))] (0.0625) (0.25)

     = [720 / ((24) (2))] (0.0625) (0.25) = 15 × 0.0625 × 0.25

     = 0.234
The probability of obtaining 5 heads is:

P (5) = 6C5 (1/2)^5 (1/2)^(6−5)

     = [6! / (5! (6 − 5)!)] (0.5)^5 (0.5)^1

     = [(6 × 5 × 4 × 3 × 2 × 1) / ((5 × 4 × 3 × 2 × 1) (1))] (0.03125) (0.5)

     = 6 × 0.03125 × 0.5

     = 0.094
The probability of obtaining 6 heads is:

P (6) = 6C6 (1/2)^6 (1/2)^(6−6)

     = [6! / (6! (6 − 6)!)] (0.5)^6 (0.5)^0

     = [(6 × 5 × 4 × 3 × 2 × 1) / ((6 × 5 × 4 × 3 × 2 × 1) (1))] (0.015625) (1)

     = 1 × 0.015625 × 1

     = 0.016
∴ The probability of obtaining 4 or more heads is :
0.234 + 0.094 + 0.016 = 0.344
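Illustration 1 can be verified in Python with the binomial formula P(r) = nCr p^r q^(n−r):

```python
from math import comb

# Binomial probability mass function: P(r) = nCr * p^r * q^(n-r)
def binom_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Probability of 4 or more heads in 6 tosses of a fair coin
p_4_or_more = sum(binom_pmf(r, 6, 0.5) for r in (4, 5, 6))
print(round(p_4_or_more, 3))  # 0.344
```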
Illustration 2
The probability of a worker not suffering from the disease is:

q = 1 − 1/5 = 4/5

By the binomial probability law, the probability that out of 10 workers, 'r' workers suffer from the disease is given by:

P (r) = nCr p^r q^(n−r) = 10Cr (1/5)^r (4/5)^(10−r); r = 0, 1, 2, … 10
i) The required probability that exactly 2 workers will suffer from the disease is given by:

P (2) = 10C2 (1/5)^2 (4/5)^(10−2) = 0.302

ii) The required probability that not more than 2 workers will suffer from the disease is given by:

P (0) = 10C0 (1/5)^0 (4/5)^(10−0) = 0.107

P (1) = 10C1 (1/5)^1 (4/5)^(10−1) = 0.269

P (2) = 10C2 (1/5)^2 (4/5)^(10−2) = 0.302

∴ P (not more than 2) = 0.107 + 0.269 + 0.302 = 0.678
Illustration 3

If the probability of a defective bolt is 0.1, find the mean and standard deviation for the distribution of defective bolts in a total of 500.

Solution: Mean = np = 500 × 0.1 = 50

∴ σ = √(npq) = √(500 × 0.1 × 0.9) = √45 = 6.71
i) Determine the values of ‘p’ and ‘q’. If one of these values is known, the other
can be found out by the simple relationship p = 1–q and q = 1–p. If p and q are
equal, we can say, the distribution is symmetrical. On the other hand if ‘p’ and
‘q’ are not equal, the distribution is skewed. The distribution is positively
skewed, in case ‘p’ is less than 0.5, otherwise it is negatively skewed.
ii) Expand the binomial (p + q)n. The power ‘n’ is equal to one less than the
number of terms in the expanded binomial. For example, if 3 coins are tossed
(n = 3) there will be four terms, when 5 coins are tossed (n = 5) there will be 6
terms, and so on.
iii) Multiply each term of the expanded binomial by N (the total frequency), in
order to obtain the expected frequency in each category.
Let us consider an illustration for fitting a binomial distribution.
Illustration 4
Eight coins are tossed at a time 256 times. Number of heads observed at each
throw is recorded and the results are given below. Find the expected
frequencies. What are the theoretical values of mean and standard deviation?
Also calculate the mean and standard deviation of the observed frequencies.
Solution: The chance of getting a head in a single throw of one coin is ½.
Hence, p = ½, q = ½, n = 8, N = 256

Mean = np = 8 × ½ = 4

S.D. = √(npq) = √(8 × ½ × ½) = √2 = 1.414
Note: The procedure for computation of mean and standard deviation of the
observed frequencies has been already discussed in Units 8 and 9 of this
course. Check these values by computing on your own.
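The fitting procedure of Illustration 4 can be sketched in Python; the expected frequencies are N · nCr · p^r · q^(n−r) for r = 0 to 8:

```python
from math import comb

# Fit a binomial distribution: 8 coins tossed 256 times, p = q = 1/2.
n, N, p = 8, 256, 0.5
expected = [N * comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]
print([round(f) for f in expected])  # [1, 8, 28, 56, 70, 56, 28, 8, 1]

mean = n * p                     # theoretical mean = 4
sd = (n * p * (1 - p)) ** 0.5    # theoretical S.D. = sqrt(2), about 1.414
```

The expected frequencies sum back to N = 256, as they must.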
3) The following data shows the result of the experiment of throwing 5 coins at a
time 3,100 times and the number of heads appearing in each throw. Find the
expected frequencies and comment on the results. Also calculate mean and
standard deviation of the theoretical values.
No. of heads: 0 1 2 3 4 5
frequency: 32 225 710 1,085 820 228
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
The Poisson distribution is used in practice where there are infrequently occurring events with respect to time, volume (similar units), area, etc. For instance, the number of deaths or accidents occurring in a specific time, the number of defects in production, the number of workers absent per day, etc.
This would comparatively be simpler to deal with, and is given by the Poisson distribution formula as follows:

p (r) = m^r e^(−m) / r!
where, p (r) = Probability of successes desired
c) It consists of a single parameter "m" only. So, the entire distribution can be obtained by knowing this value alone.

In the Poisson distribution, the mean (m) and the variance (σ²) have the same value, i.e.,

Mean = Variance = np = m
Since n is large and p is small, the Poisson distribution is applicable. Apply the formula:

p (r) = m^r e^(−m) / r!

P (5) = m^5 e^(−m) / 5!, where m = np = 200 × 0.02 = 4;

e = 2.7183 (constant)

∴ P (5) = (4^5 × 2.7183^(−4)) / (5 × 4 × 3 × 2 × 1)

        = (1024 × 0.0183) / 120 = 0.156
Illustration 6
P (4) = m^4 e^(−m) / 4!, where m = np = 30 (0.02) = 0.6

e = 2.7183 (constant)
Illustration 7
We can write P (r) = (0.439^r × e^(−0.439)) / r!. Substituting r = 0, 1, 2, 3, and 4, we get the probabilities for various values of r, as shown below:

P (0) = m^0 e^(−m) / 0! = (0.439^0 × 2.7183^(−0.439)) / 0!

      = 1 × (0.6443) / 1 = 0.6443

N (P0) = P (0) × N = 0.6443 × 330 = 212.62
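The remaining expected frequencies of this Poisson fit (m = 0.439, N = 330) can be generated the same way:

```python
from math import exp, factorial

# Expected frequencies N * p(r) for a Poisson fit with m = 0.439, N = 330.
m, N = 0.439, 330
expected = [N * m**r * exp(-m) / factorial(r) for r in range(5)]
for r, f in enumerate(expected):
    print(r, round(f, 1))
# r = 0 gives about 212.7; the 212.62 above reflects rounding e^-0.439 to 0.6443
```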
Note: We can use Appendix Table-2, given at the end of this block, to determine Poisson probabilities quickly.
3) Four hundred car air-conditioners are inspected as they come off the
production line and the number of defects per set is recorded below. Find
the expected frequencies by assuming the Poisson model.
No. of defects : 0 1 2 3 4 5
14.5 CONTINUOUS PROBABILITY DISTRIBUTION
In the previous sections, we have examined situations involving discrete random
variables and the resulting probability distributions. Let us now consider a
situation, where the variable of interest may take any value within a given
range. Suppose that we are planning to release water for hydropower
generation and irrigation. Depending on how much water we have in the
reservoir, viz., whether it is above or below the ‘normal’ level, we decide on
the quantity of water and time of its release. The variable indicating the
difference between the actual level and the normal level of water in the
reservoir, can take positive or negative values, integer or otherwise. Moreover,
this value is contingent upon the inflow to the reservoir, which in turn is
uncertain. This type of random variable which can take an infinite number of
values is called a continuous random variable, and the probability distribution
of such a variable is called a continuous probability distribution.
Now we present one important probability density function (p.d.f), viz., the
normal distribution.
The normal distribution is the most versatile of all the continuous probability
distributions. It is useful in statistical inferences, in characterising uncertainties
in many real life situations, and in approximating other probability distributions.
As stated earlier, the normal distribution is suitable for dealing with variables
whose magnitudes are continuous. Many statistical data concerning business
problems are displayed in the form of normal distribution. Height, weight and
dimensions of a product are some of the continuous random variables which are
found to be normally distributed. This knowledge helps us in calculating the
probability of different events in varied situations, which in turn is useful for
decision-making.
Now we turn to examine the characteristics of normal distribution with the help
of the figure 14.1, and explain the methods of calculating the probability of
different events using the distribution.
[Figure 14.1: The normal curve. The distribution is symmetrical around a vertical line erected at the mean, where the mean, median and mode coincide; both the left-hand and right-hand tails extend indefinitely but never reach the horizontal axis.]
1) The curve has a single peak; thus it is unimodal, i.e., it has only one mode, and has a bell shape.
3) The two tails of the normal probability distribution extend indefinitely but never
touch the horizontal axis.
Irrespective of the value of mean (µ) and standard deviation (σ), for a normal
distribution, the total area under the curve is 1.00. The area under the normal
curve is approximately distributed by its standard deviation as follows:
µ±1σ covers 68% area, i.e., 34.13% area will lie on either side of µ.
Z = (X − µ) / σ
Where, X = the value of the random variable, µ = the mean, and σ = the standard deviation of the distribution.
Step 2: Look up the probability of z value from the Appendix Table-3, given at
the end of this block, of normal curve areas. This Table is set up to
provide the area under the curve to any specified value of Z. (The
area under the normal curve is equal to 1. The curve is also called
the standard probability curve).
Let us consider the following illustration to understand how the table should be consulted in order to find the area under the normal curve.
Illustration 8
(a) Find the area under the normal curve for Z = 1.54.
Solution: Consulting the Appendix Table-3 given at the end of this block, we
find the entry corresponding to Z = 1.54 the area is 0.4382 and this measures
the Shaded area between Z = 0 and Z = 1.54 as shown in the following figure.
[Figure: the shaded area 0.4382 lies between Z = 0 (at µ) and Z = 1.54.]
(b) Find the area under the normal curve between Z = –1.46 and Z = 0.

Solution: Since the curve is symmetrical, we can obtain the area between Z = –1.46 and Z = 0 by considering the area corresponding to Z = 1.46. Hence, when we look up Z = 1.46 in Appendix Table-3 given at the end of this block, we see the probability value of 0.4279. This value is also the probability value for Z = –1.46, which must be shaded to the left of µ as shown in the following figure.

[Figure: the shaded area 0.4279 lies between Z = –1.46 and µ.]
(c) Find the area to the right of Z = 0.25.

Solution: The table value for Z = 0.25 is 0.0987, which is the area between Z = 0 and Z = 0.25. Since the total area to the right of µ is 0.5000, the area to the right of Z = 0.25 is 0.5000 – 0.0987 = 0.4013.

[Figure: the shaded area 0.4013 lies to the right of Z = 0.25.]
d) Find the area to the left of Z = 1.83.
Solution: If we are interested in finding the area to the left of Z (positive
value), we add 0.5000 to the table value given for Z. Here, the table value for
Z (1.83) = 0.4664. Therefore, the total area to the left of Z = 0.9664 (0.5000 +
0.4664) i.e., equal to the shaded area as shown below:
[Figure: the areas 0.5000 (to the left of µ) and 0.4664 (between µ and Z = 1.83) together make up the shaded region to the left of Z = 1.83.]
Illustration 9

The heights of 1,000 soldiers are normally distributed with a mean of 68.22 inches and a variance of 10.8 (square inches). How many soldiers are expected to be over six feet (72 inches) tall?

Solution: Z = (X − µ) / σ

X = 72 inches; µ = 68.22 inches; and σ = √10.8 = 3.286

∴ Z = (72 − 68.22) / 3.286 = 1.15
[Figure: the area 0.3749 lies between µ = 68.22 and X = 72; the shaded area 0.1251 lies to the right of X = 72.]
Area to the right of the ordinate at Z = 1.15 from the normal table is (0.5 – 0.3749) = 0.1251. Hence, the probability of getting soldiers above six feet is 0.1251 and, out of 1,000 soldiers, the expectation is 1,000 × 0.1251 = 125.1 or 125. Thus, the expected number of soldiers over six feet tall is 125.
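The look-up in Appendix Table-3 can be replicated with the error function from Python's standard library (norm_cdf below is a standard identity for the normal CDF, not a function from the text):

```python
from math import erf, sqrt

# Standard normal cumulative distribution function via the error function.
def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 68.22, sqrt(10.8)   # variance 10.8 as given in the text
z = (72 - mu) / sigma           # about 1.15
p_above = 1 - norm_cdf(z)       # about 0.125
print(round(1000 * p_above))    # about 125 soldiers over six feet
```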
Illustration 10
(a) 15,000 students appeared for an examination. The mean marks were 49 and the
standard deviation of marks was 6. Assuming the marks to be normally
distributed, what proportion of students scored more than 55 marks?
Solution: Z = (X − µ) / σ

X = 55; µ = 49; σ = 6

∴ Z = (55 − 49) / 6 = 1
For Z = 1, the area is 0.3413 (as per Appendix Table-3). Hence, the proportion of students scoring more than 55 marks is 0.5 − 0.3413 = 0.1587, i.e., about 15.87 per cent.
(b) If in the same examination, Grade ‘A’ is to be given to students scoring more
than 70 marks, what proportion of students will receive grade ‘A’?
Solution: Z = (X − µ) / σ

X = 70; µ = 49; σ = 6

∴ Z = (70 − 49) / 6 = 3.5
The table gives the area under the standard normal curve corresponding to Z = 3.5 as 0.4998. Hence, the proportion of students receiving grade 'A' is 0.5 − 0.4998 = 0.0002.
Illustration 11
In a training programme (self-administered) to develop marketing skills of marketing
personnel of a company, the participants indicate that the mean time on the
programme is 500 hours and that this normally distributed random variable has a
standard deviation of 100 hours. Find out the probability that a participant selected
at random will take:
i) fewer than 570 hours to complete the programme, and
ii) between 430 and 580 hours to complete the programme.
Solution: (i) To get the Z value for the probability that a candidate selected at
random will take fewer than 570 hours, we have
Z = (x − µ)/σ = (570 − 500)/100 = 70/100 = 0.7
[Figure: the area 0.2580 lies between µ = 500 and 570 (Z = 0.7); P (less than 570) = 0.5000 + 0.2580 = 0.7580.]
Thus, the probability of a participant taking less than 570 hours to complete the
programme, is marginally higher than 75 per cent.
ii) In order to get the probability, of a participant chosen at random, that he will take
between 430 and 580 hours to complete the programme, we must, first, compute
the Z value for 430 and 580 hours.
Z = (x − µ)/σ

Z for 580 = (580 − 500)/100 = 80/100 = 0.8

Z for 430 = (430 − 500)/100 = −70/100 = −0.7
The table shows that the probability values for Z = −0.7 and Z = 0.8 are 0.2580 and 0.2881 respectively. This situation is shown in the following figure.

[Figure: the shaded areas 0.2580 (between Z = −0.7 and µ) and 0.2881 (between µ and Z = 0.8).]
Thus, the probability that the random variables lie between 430 and 580 hours is
0.5461 (0.2580 + 0.2881).
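Both parts of Illustration 11 can be checked numerically. norm_cdf below is the standard normal CDF built from the error function; it is a standard identity, not part of the text:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 500, 100
# (i) fewer than 570 hours
p_less_570 = norm_cdf((570 - mu) / sigma)
# (ii) between 430 and 580 hours
p_between = norm_cdf((580 - mu) / sigma) - norm_cdf((430 - mu) / sigma)
print(round(p_less_570, 4), round(p_between, 4))
# close to the 0.7580 and 0.5461 obtained above from the tables
```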
(iii) To fit sampling distribution of various statistics like mean or variance etc.
2) Given a normal distribution with µ = 100 and σ = 10, what is the probability that:

a) X > 75?
b) X < 70?
c) X > 112?
d) 75 < X < 85?
e) X < 80 or X > 110?
Poisson Distribution: It is the limiting form of the binomial distribution, where the probability of success is very low and the total number of trials is very high.
Probability: Any numerical value between 0 and 1 both inclusive, telling about
the likelihood of occurrence of an event.
Probability Distribution: A curve that shows all the values that the random
variable can take and the likelihood that each will occur.
3) Define a binomial probability distribution. State the conditions under which the
binomial probability model is appropriate by illustrations.
8) If the probability of a defective bolt is 0.1, find (a) the mean, and (b) the standard deviation of defective bolts in a total of 900. (Ans. (a) 90; (b) 9)
a) What is the probability that at least one browsing customer will buy
something during a specified hour?
b) What is the probability that at least 4 browsing customers will buy
something during a specified hour?
c) What is the probability that no browsing customer will buy anything during a
specified hour?
d) What is the probability that not more than 4 browsing customers will
buy something during a specified hour?
[Ans. (a) .9953 (b) .7031 (c) .0047 (d) .5155]
10) Given a binomial distribution with n = 28 trials and p = .025, use the Poisson
approximation to the binomial to find:
11) The average number of customer arrivals per minute at a departmental stores is
2. Find the probability that during one particular minute:
12) A set of 5 fair coins was thrown 80 times, and the number of heads in each
throw was recorded and given in the following table. Estimate the probability of
the appearance of head in each throw for each coin and calculate the
theoretical frequency of each number of heads on the assumption that the
binomial law holds:
No. of heads: 0 1 2 3 4 5
Frequency: 6 20 28 12 8 6
13) Fit a poisson distribution to the following observed data and calculate the
expected frequencies:
Deaths: 0 1 2 3 4
Frequency: 122 60 15 2 1
14) Given that a random variable X has a binomial distribution with n = 50 trials and p = .25, use the normal approximation to the binomial to find:
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
STRUCTURE

15.0 Objectives
15.1 Introduction
15.2 Point Estimation and Standard Errors
15.3 Interval Estimation
15.4 Confidence Limits, Confidence Interval and Confidence Co-efficient
15.5 Testing Hypothesis – Introduction
15.6 Theory of Testing Hypothesis – Level of Significance, Type-I and
Type-II Errors and Power of a Test
15.7 Two-tailed and One-tailed Tests
15.8 Steps to Follow for Testing Hypothesis
15.9 Tests of Significance for Population Mean–Z-test for variables
15.10 Tests of Significance for Population Proportion – Z-test for Attributes
15.11 Let Us Sum Up
15.12 Key Words and Symbols Used
15.13 Answers to Self Assessment Exercises
15.14 Terminal Questions/Exercises
15.15 Further Reading
15.0 OBJECTIVES
After studying this unit, you should be able to:
l estimate population characteristics (parameters) on the basis of a sample,
l get familiar with the criteria of a good estimator,
l differentiate between a point estimator and an interval estimator,
l comprehend the concept of statistical hypothesis,
l perform tests of significance of population mean and population proportion,
and
l make decisions on the basis of testing hypothesis.
15.1 INTRODUCTION
Let us suppose that we have taken a random sample from a population with a
view to knowing its characteristics, also known as its parameters. We are then
confronted with the problem of drawing inferences about the population on the
basis of the known sample drawn from it. We may look at two different
scenarios. In the first case, the population is completely unknown and we would
like to throw some light on its parameters with the help of a random sample
drawn from the population. Thus, if µ denotes the population mean, then we
intend to make a guess about it on the basis of a random sample. This is
known as estimation. For example, one may be interested to know the average
income of people living in the city of Delhi or the average life in burning hours
of a fluorescent tube light produced by ‘Indian Electrical’ or proportion of
people suffering from T.B. in city ‘B’ or the percentage of smokers in town
‘C’ and so on.
This is known as the problem of testing of hypothesis. In the previous examples, we may be interested in testing whether the average income in the city of
Delhi is, say, Rs. 2,000 per month. In the second example, we may like to
verify whether the claims made by Indian Electrical, that their fluorescent lamps
would last 5,000 hours, is justified. Some social workers may believe that 20%
of the population in city B suffers from T.B. We would like to make our
comment after a test of hypothesis. In the last example, some human activists,
concerned about the hazards of passive smoking, assert that 30% of the people
staying in town C are smokers. We may share their opinion once we have
satisfied ourselves after performing a statistical test of hypothesis.
At this juncture, we must make a distinction between the two terms Estimator
and estimate. 'T' is defined to be an estimator of a parameter θ, if T estimates
θ. Thus T is a statistic and its value may differ from one sample to another
sample. In other words, T may be considered as a random variable. The
probability distribution of T is known as sampling distribution of T. As already
discussed, the sample mean x is an estimator of population mean µ. The value
of the estimator, as obtained on the basis of a given sample, is known as its
estimate. Thus x is an estimator of µ, the average income of Delhi, and the
value of x i.e., Rs. 2,000/-, as obtained from the sample, is the estimate of µ.
In order to choose the best estimator among these estimators along with
“unbiasedness”, we introduce a second criterion, known as, minimum variance.
A statistic T is defined to be a minimum variance unbiased estimator (MVUE)
of θ if T is unbiased for θ and T has minimum variance among all the
unbiased estimators of θ. We may note that sample mean ( x ) is an MVUE for
µ.
We know that x = Σxi / n                                  …(15.1)

∴ E ( x ) = E (Σxi / n)

          = (1/n) . [Σ E (xi)]

          = (1/n) . [Σ µ]    [x1, x2, … xn are taken from a population having µ as its population mean]

          = (1/n) . nµ

∴ E ( x ) = µ
Similarly, V ( x ) = V (Σxi / n)

          = (1/n²) . [Σ V (xi)]

          = (1/n²) . [Σ σ²]    [where σ² is the population variance]

          = (1/n²) . [nσ²] = σ²/n                      ……(15.2)
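The two results E(x̄) = µ and V(x̄) = σ²/n can be illustrated with a quick simulation; the uniform population on [0, 10] below is only an assumed example (its mean is 5 and its variance is 100/12):

```python
import random

# Simulate many sample means of size n from a uniform population on [0, 10]:
# mu = 5 and sigma^2 = 100/12, so V(x_bar) should be close to (100/12)/n.
random.seed(42)
n, reps = 25, 20000
means = []
for _ in range(reps):
    sample = [random.uniform(0, 10) for _ in range(n)]
    means.append(sum(sample) / n)

avg_of_means = sum(means) / reps                                   # close to 5
var_of_means = sum((m - avg_of_means) ** 2 for m in means) / reps  # close to (100/12)/25
print(round(avg_of_means, 2), round(var_of_means, 3))
```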
It can be proved that x has the minimum variance among all the unbiased estimators of µ.
Consistency: If T is an estimator of θ, then it is obvious that T should be in
the neighbourhood of θ. T is known to be consistent for θ, if the difference
between T and θ can be made as small as we please by increasing the sample
size n sufficiently.
We can further add that T would be a consistent estimator of θ if
i) E(T) → θ and
ii) V(T) → 0 as n → ∞.
As an example, consider the sample proportion p = x/n, where x is the number of units in the sample possessing the characteristic under study. Since x follows a binomial distribution with E(x) = nP and V(x) = nP(1 − P), we have:
E(p) = E(x/n) = nP/n = P …(15.3)
and V(p) = V(x/n) = V(x)/n²
= nP(1 − P)/n²
= P(1 − P)/n …(15.4)
Thus if we take a random sample of size ‘n’ from a population where the
proportion of population possessing a certain characteristic is ‘P’ and the
sample contains x units possessing that characteristic, then an estimate of
population proportion (P) is given by:
P̂ = x/n …(15.5)
In other words, the estimate of the population proportion is given by the
corresponding sample estimate i.e., P̂ = p …(15.6)
As V(p) = P(1 − P)/n → 0 as n → ∞,
it follows from Eq. (15.4) that p is a consistent estimator of P. We can further
establish that p is an efficient as well as a sufficient estimator of P. Thus we
advocate the use of the sample proportion p to estimate the population proportion,
as it satisfies all the desirable properties of an estimator.
In order to estimate the proportion of people suffering from T.B. in city B, if
we find the number of people suffering from T.B. is ‘x’ in a random sample of
size ‘n’, taken from city B, then sample estimate p = x/n would provide the
estimate of the proportion of people in that city suffering from T.B. Similarly,
the percentage of smokers as found from a random sample of people of town
C would provide the estimate of the percentage of smokers in town C.
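The point estimate P̂ = x/n of Eq. (15.5) is straightforward to compute; a minimal Python sketch, using the smoker figures that appear later in the unit (x = 70 out of n = 350):

```python
# Point estimate of a population proportion: p-hat = x / n  (Eq. 15.5).
def estimate_proportion(x, n):
    """Return the sample proportion p = x/n, the estimate of P."""
    return x / n

# Smokers example: 70 smokers in a random sample of 350 people.
p_hat = estimate_proportion(70, 350)
print(p_hat)  # 0.2, i.e., an estimated 20% smokers
```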
C) Estimation of Population Variance and Standard Error: Standard error
of a statistic T, to be denoted by S.E. (T), may be defined as the standard
deviation of T as obtained from the sampling distribution of T. In order to
compute the standard error of sample mean, it may be noted that from Eq. (15.2):
S.E.(x̄) = σ/√n for simple random sampling with replacement (SRSWR), and
S.E.(x̄) = (σ/√n)·√((N − n)/(N − 1)) for simple random sampling without replacement (SRSWOR),
where σ is the population standard deviation (S.D.), n is the sample size, N is the
population size, and the factor √((N − n)/(N − 1)) is known as the finite population
corrector (f.p.c.) or finite population multiplier (f.p.m.), which may be ignored for a
large population.
In order to find S.E., it is necessary to estimate σ 2 or σ in case it is unknown.
If x1, x2 …, xn denote n sample observations drawn from a population with
mean µ and variance σ 2, then the sample variance:
S² = Σ(xᵢ − x̄)²/n …(15.7)
may be considered to be an estimator of σ².
Since E(xᵢ) = µ and V(xᵢ) = E(xᵢ − µ)² = σ² …(15.8)
We have
Σ(xᵢ − x̄)² = Σ[(xᵢ − µ) − (x̄ − µ)]²
= Σ(xᵢ − µ)² − 2(x̄ − µ)·n(x̄ − µ) + n(x̄ − µ)²
[since Σ(xᵢ − µ) = Σxᵢ − Σµ = nx̄ − nµ = n(x̄ − µ)]
= Σ(xᵢ − µ)² − 2n(x̄ − µ)² + n(x̄ − µ)²
= Σ(xᵢ − µ)² − n(x̄ − µ)² …(15.9)
And E(x̄ − µ)² = V(x̄) = σ²/n …(15.10)
∴ E[Σ(xᵢ − x̄)²] = Σ E(xᵢ − µ)² − n·E(x̄ − µ)²
= Σσ² − n·(σ²/n) = nσ² − σ² = (n − 1)σ²
∴ E(S²) = ((n − 1)/n)·σ² ≠ σ² …(15.11)
As E(S²) = ((n − 1)/n)·σ²,
∴ E[(n/(n − 1))·S²] = σ² …(15.12)
Thus (S′)² = (n/(n − 1))·S² = Σ(xᵢ − x̄)²/(n − 1) is an unbiased estimator of σ² …(15.13)
So, we use (S′)² = Σ(xᵢ − x̄)²/(n − 1) as an estimator of σ², and
S′ = √(Σ(xᵢ − x̄)²/(n − 1)) as an estimator of σ.
An estimate of S.E.(x̄) is given by:
Ŝ.E.(x̄) = S′/√n for SRSWR
= (S′/√n)·√((N − n)/(N − 1)) for SRSWOR ……(15.14)
From (15.4), it follows that V(p) = P(1 − P)/n, so
S.E.(p) = √(P(1 − P)/n) for SRSWR
= √(P(1 − P)/n)·√((N − n)/(N − 1)) for SRSWOR ……(15.15)
An estimate of the standard error of the sample proportion is given by:
Ŝ.E.(p) = √(p(1 − p)/n) for SRSWR
= √(p(1 − p)/n)·√((N − n)/(N − 1)) for SRSWOR ……(15.16)
Let us consider the following illustrations to estimate variance from sample and
also estimate the standard error.
Illustration 1
A sample of 32 fluorescent lights taken from Indian Electricals was tested for
the lives of the lights in burning hours. The data are presented below:
µ̂ = x̄. If we are further interested in estimating the standard error of x̄, then
we are to compute
Ŝ.E.(x̄) = s′/√n
where s′ = √(Σ(xᵢ − x̄)²/(n − 1)) = √((Σxᵢ² − n·x̄²)/(n − 1))
and x̄ = Σxᵢ/n, n = sample size.
We ignore f.p.c. as the population of lights is very large.
∴ ū = Σuᵢ/32 = −490/32 = −15.3125
As uᵢ = xᵢ − 5000, ∴ ū = x̄ − 5000, so that x̄ = 5000 + ū = 5000 − 15.3125 = 4984.69 ≈ 4985 hours.
(s′ₓ)² = (s′ᵤ)² = (Σuᵢ² − n·ū²)/(n − 1) = (237242 − 7503.125)/31 = 7410.93
∴ s′ = 86.0868
Hence Ŝ.E.(x̄) = s′/√n = 86.0868/√32 = 15.2183
So the estimate of the average life of lights as manufactured by Indian
Electricals is 4985 hours, the estimate of the population variance is 7410.93
(hours)², and the estimated standard error is 15.2183 hours.
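The arithmetic of Illustration 1 can be re-checked from the sample totals alone; in the sketch below Σuᵢ² = 237242 is inferred, being the value consistent with the variance quoted in the illustration:

```python
import math

# Re-checking Illustration 1 from the sample totals, with u_i = x_i - 5000.
# sum(u_i) = -490; sum(u_i^2) = 237242 is inferred as the value consistent
# with the variance quoted in the text.
n, sum_u, sum_u2 = 32, -490.0, 237242.0

u_bar = sum_u / n                                # -15.3125
x_bar = 5000 + u_bar                             # estimate of the mean life
var_hat = (sum_u2 - n * u_bar ** 2) / (n - 1)    # (s')^2, estimate of sigma^2
se = math.sqrt(var_hat) / math.sqrt(n)           # estimated S.E. of the mean

print(round(x_bar, 2), round(var_hat, 2), round(se, 2))
# 4984.69  7410.93  15.22
```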
Illustration 2
Thus we have p = x/n = 70/350 = 0.2
Hence the estimate of the proportions of smokers in the city is 0.2 or 20%.
Further
Ŝ.E.(p) = √(p(1 − p)/n) = √(0.2 × (1 − 0.2)/350) = 0.0214
..............................................................................................................
.............................................................................................................
3) In choosing between sample mean and sample median – which one would you
prefer?
..............................................................................................................
.............................................................................................................
4) The monthly earnings of 20 families, obtained from a random sample from a
village in West Bengal are given below:
Find an estimate of the average monthly earnings of the village. Also obtain an
estimate of the S.E. of the sample estimate.
..............................................................................................................
..............................................................................................................
.............................................................................................................
..............................................................................................................
5) In a sample of 900 people, 429 people are found to be consumers of tea. Estimate
the proportion of consumers of tea in the population. Also find the corresponding
standard error.
.............................................................................................................
..............................................................................................................
.............................................................................................................
Regarding the estimation of the average income of the people of Delhi city, one
may argue that it would be better to provide an interval which is likely to
contain the population mean. Thus, instead of saying the estimate of the
average income of Delhi is Rs. 2,000/-, we may suggest that, in all probability,
the estimate of the average income of Delhi would be from Rs. 1,900/- to Rs.
2,100/-. In the second example of estimating the average life of lights produced
by Indian Electricals where the estimate came out to be 4985 hours, the point
estimation may be a bone of contention between the producer and the potential
buyer. The buyer may think that the average life is rather less than 4985 hours.
An interval estimation of the life of lights might satisfy both the parties. Figure
15.1 shows some intervals for θ on the basis of different samples of the same
size from a population characterized by a parameter θ. A few intervals do not
contain θ.
Fig. 15.1: Confidence Intervals for θ
Suppose t₁ and t₂ are two statistics such that
P (θ < t₁) = α₁ and P (θ > t₂) = α₂,
where α₁ and α₂ are two small positive numbers. Combining these two
conditions, we may write:
P (t₁ ≤ θ ≤ t₂) = 1 − α ……(15.17)
where α = α₁ + α₂.
One may like to know why the term ‘confidence’ comes into the picture. If we
choose α1 and α2 in such a way that α = 0.01, then the probability that θ would
belong to the random interval [t1, t2] is 0.99. In other words, one feels 99%
confident that [t1, t2] would contain the unknown parameter θ. Similarly if we
select α = 0.05, then P [t1 ≤ θ ≤ t2] = 0.95, thereby implying that we are 95%
confident that θ lies between t1 and t2. (15.17) suggests that as α decreases,
(1–α) increases and the probability that the confidence interval [t1, t2] would
include the parameter θ also increases. Hence our endeavour would be to
reduce ‘α’ and thereby increase the confidence co-efficient (1–α).
Referring to the estimation of the average life of lights (θ), if we observe that
θ lies between 4935 hours and 5035 hours with probability 0.98, then it would
imply that if repeated samples of a fixed size (say n = 32) are taken from the
population of lights, as manufactured by Indian Electricals, then in 98 per cent
of cases, the interval [4935 hours, 5035 hours] would contain θ, the average life
of lights in the population while in 2 per cent of cases, the interval would not
contain θ. In this case, the confidence interval for θ is [4935 hours, 5035
hours]. Lower Confidence Limit of θ is 4935 hours, Upper Confidence Limit of
θ is 5035 hours, and the Confidence Co-efficient is 98 per cent.
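This repeated-sampling interpretation of "confidence" can be checked by simulation; in the sketch below the population parameters (µ = 5000, σ = 100) and the seed are arbitrary choices, not data from the text:

```python
import math
import random

# Simulating the "confidence" interpretation: draw repeated samples from a
# known normal population and count how often x_bar -/+ 1.96*sigma/sqrt(n)
# traps the true mean.  mu, sigma and the seed are arbitrary choices.
random.seed(42)
mu, sigma, n, trials = 5000.0, 100.0, 32, 2000

hits = 0
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(xs) / n
    half = 1.96 * sigma / math.sqrt(n)
    if x_bar - half <= mu <= x_bar + half:
        hits += 1

print(hits / trials)   # close to 0.95; a few intervals miss mu, as in Fig. 15.1
```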
Our next task would be to select the basis for estimating confidence interval.
Let us assume that we have taken a random sample of size ‘n’ from a normal
population characterized by the two parameters µ and σ, the population mean
and standard deviation respectively. Thus, in the case of estimating a
Confidence Interval for average income of people dwelling in Delhi city, we
assume that the distribution of income is normal and we have taken a random
sample from the city. In another example concerning average life of fluorescent
lights as produced by Indian Electricals, we assume that the life of a
fluorescent light is normally distributed and we have taken a random sample
from the population of fluorescent lights manufactured by Indian Electricals.
Figure 15.2 shows percentage of area under Normal Curve. It can be shown
that if a random sample of size ‘n’ is drawn from a normal population with
mean ‘µ’ and variance σ2, then ( x ) , the sample mean also follows normal
distribution with ‘µ’ as mean and σ2/n as variance. Further as we have
observed in Section 15.2.
S.E.(x̄) = σ/√n
From the properties of normal distribution, it follows that the interval
[µ − S.E.(x̄), µ + S.E.(x̄)] covers 68.27% area,
the interval [µ − 2 S.E.(x̄), µ + 2 S.E.(x̄)] covers 95.45% area, and the interval
[µ − 3 S.E.(x̄), µ + 3 S.E.(x̄)] covers 99.73% area.

Fig. 15.2: Percentage of Area under Normal Curve
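These percentages follow directly from the standard normal distribution function; in Python they can be reproduced with the error function:

```python
import math

# The quoted areas follow from the standard normal CDF:
# area within k standard errors of the mean = 2*phi(k) - 1 = erf(k / sqrt(2)).
def central_area(k):
    """Area under the normal curve within k S.E. of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * central_area(k), 2))   # 68.27, 95.45, 99.73
```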
Now let us consider a situation where the assumption of normality may not
hold. If the sample size is large enough, then the sample mean x̄ follows
approximately, i.e., asymptotically normal distribution with mean as µ and
standard error as σ/ n , µ and σ being the mean and S.D. of the population
under consideration. In case σ is unknown, we can replace it by the
corresponding sample standard deviation. One may ask the question as to how
large ‘n’ should be. It is rather difficult to specify an exact value of ‘n’ so that
the distribution of x̄ would be asymptotically normal. The larger the value of ‘n’,
the better. However for practical purposes, if ‘n’ exceeds 30, then we may
assume that x is asymptotically normal.
Our next question may be: what would be the confidence interval for µ? We start from
P [−u ≤ (x̄ − µ)/S.E.(x̄) ≤ u] = 1 − α
or, P [−u ≤ Z ≤ u] = 1 − α [where Z = (x̄ − µ)/S.E.(x̄) is a standard normal variable]
or, 2 φ(u) = 2 − α
or, φ(u) = 1 − α/2 ……(15.19)
Putting α = 0.10, we get φ(u) = 0.95, whence u = 1.645.
Thus the 100(1 − α)% or 100(1 − 0.1)% or 90% confidence interval to population
mean µ is given by:
[x̄ − 1.645 σ/√n, x̄ + 1.645 σ/√n]
Putting α = 0.05, 0.02 and 0.01 respectively in (15.19) and proceeding in a similar
manner, we get
95% Confidence Interval to µ = [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n] …(15.20)
98% Confidence Interval to µ = [x̄ − 2.33 σ/√n, x̄ + 2.33 σ/√n] …(15.21)
and 99% Confidence Interval to µ = [x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n] …(15.22)
Theoretically we may take any Confidence interval by choosing ‘u’ accordingly.
However in a majority of cases, we prefer 95% or 99% Confidence Interval.
These are shown in Figure 15.3 and Figure 15.4 below.
Fig. 15.3: 95% Confidence Interval for Population Mean, [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n]

Fig. 15.4: 99% Confidence Interval for Population Mean, [x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n]

Thus the 95% confidence interval to µ is:
[x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n]
If the assumption of normality does not hold but ‘n’ is greater than 30, the
above 95% confidence interval still may be used for estimating population mean.
In case σ is unknown, it may be replaced by the corresponding unbiased
estimate of σ, namely S|, so long as ‘n’ exceeds 30. However, we may face a
difficult situation in case σ is unknown and ‘n’ does not exceed 30. This
problem has been discussed in the next unit (Unit-16). Similarly, 99%
confidence interval to µ is given by :
[x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n]
In case σ is unknown and n > 30, the 99% confidence interval to µ is:
[x̄ − 2.58 S′/√n, x̄ + 2.58 S′/√n] …(15.23)
Similarly, for a large sample, the sample proportion p is asymptotically normal,
with estimated standard error
Ŝ.E.(p̂) = √(p(1 − p)/n)
Hence, 95% confidence interval to P is given by:
[p − 1.96 √(p(1 − p)/n), p + 1.96 √(p(1 − p)/n)] …(15.24)
and 99% confidence interval to P is given by:
[p − 2.58 √(p(1 − p)/n), p + 2.58 √(p(1 − p)/n)] …(15.25)
Illustration 3
In a random sample of 1,000 families from the city of Delhi, it was found that
the average income as obtained from the sample is Rs. 2,000/-. It is further
known that population S.D. is Rs. 258. Find 95% as well as 99% confidence
interval to population mean.
Solution: 95% confidence interval to µ = [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n]
and 99% confidence interval to µ = [x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n]
∴ x̄ − 1.96 σ/√n = Rs. 2000 − 1.96 × 258/√1000 = Rs. 1984.01
and x̄ + 1.96 σ/√n = Rs. 2000 + 1.96 × 258/√1000 = Rs. 2015.99
Also, x̄ − 2.58 σ/√n = Rs. 2000 − 2.58 × 258/√1000 = Rs. 1979
and x̄ + 2.58 σ/√n = Rs. 2000 + 2.58 × 258/√1000 = Rs. 2021
Hence we have
95% confidence interval to average income for the people of Delhi = [Rs.
1984.01 to Rs. 2015.99] and 99% confidence interval to average income for the
people of Delhi = [Rs. 1979 to Rs. 2021].
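Illustration 3 can be checked with a few lines of Python:

```python
import math

# Checking Illustration 3: x_bar = Rs. 2000, sigma = Rs. 258, n = 1000.
def mean_ci(x_bar, sigma, n, z):
    """Confidence interval x_bar -/+ z * sigma / sqrt(n)."""
    half = z * sigma / math.sqrt(n)
    return x_bar - half, x_bar + half

lo95, hi95 = mean_ci(2000.0, 258.0, 1000, 1.96)
lo99, hi99 = mean_ci(2000.0, 258.0, 1000, 2.58)
print(round(lo95, 2), round(hi95, 2))   # 1984.01 2015.99
print(round(lo99), round(hi99))         # 1979 2021
```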
Illustration 4
Calculate the 95% and 99% confidence limits to the average life of fluorescent
lights produced by Indian Electricals.
Solution: 95% confidence interval to µ = [x̄ − 1.96 S′/√n, x̄ + 1.96 S′/√n]
Similarly, 99% confidence interval to µ = [x̄ − 2.58 S′/√n, x̄ + 2.58 S′/√n]
where x̄ = Sample mean = 4985 hours, n = Sample size = 32, and S′ = 86.0868 hours.
∴ x̄ − 1.96 S′/√n = 4985 − 1.96 × 86.0868/√32 = 4955.17 hours
x̄ + 1.96 S′/√n = 4985 + 1.96 × 86.0868/√32 = 5014.83 hours
x̄ − 2.58 S′/√n = 4985 − 2.58 × 86.0868/√32 = 4945.74 hours
and x̄ + 2.58 S′/√n = 4985 + 2.58 × 86.0868/√32 = 5024.26 hours
Illustration 5
While interviewing 350 people in a city, the number of smokers was found to
be 70. Obtain the 99% lower confidence limit and the corresponding upper
confidence limit to the proportion of smokers in the city.
Solution: 99% Lower Confidence Limit to P is:
p − 2.58 √(p(1 − p)/n)
and 99% Upper Confidence Limit to P is:
p + 2.58 √(p(1 − p)/n)
provided np ≥ 5 and n(1 − p) ≥ 5.
∴ p = x/n = 70/350 = 0.2
As np = 350 × 0.2 = 70 and n (1–p) = 350 × 0.8 = 280 are rather large, we
can apply the formula for 99% Confidence Limit as mentioned already.
∴ 99% Lower Confidence Limit to P is:
0.2 − 2.58 × √(0.2 × (1 − 0.2)/350) = 0.2 − 0.0552 = 0.1448
and 99% Upper Confidence Limit to P is:
0.2 + 2.58 × √(0.2 × (1 − 0.2)/350) = 0.2 + 0.0552 = 0.2552
Hence the 99% Lower Confidence Limit and 99% Upper Confidence Limit for the
proportion of smokers in the city are 0.1448 and 0.2552 respectively.
Illustration 6
In a random sample of 19586 people from a town, 2358 people were found to
be suffering from T.B. With 95% Confidence as well as 98% Confidence, find
the limits between which the percentage of the population of the town suffering
from T.B. lies.
Solution: Let x be the number of people suffering from T.B. in the sample and
‘n’ as the number of people who were examined. Then the proportion of
people suffering from T.B. in the sample is given by:
x 2358
p = = = 0.1204
n 19586
As np = x = 2358 and n (1–p) = n–np = n–x
= 19586–2358 = 17228
are both very large numbers, we can apply the formula for finding Confidence
Interval as mentioned in the previous section. Thus 95% Confidence Interval to
P, the proportion of the population of the town suffering from T.B., is given by:
[p − 1.96 √(p(1 − p)/n), p + 1.96 √(p(1 − p)/n)] = [0.1158, 0.1249]
Similarly, the 98% Confidence Interval to P is given by:
[p − 2.33 √(p(1 − p)/n), p + 2.33 √(p(1 − p)/n)] = [0.1150, 0.1258]
Thereby, we can say with 95% confidence that the percentage of population in
the town suffering from T.B. lies between 11.58 and 12.49, and with 98%
confidence that the percentage of population suffering from T.B. lies between
11.50 and 12.58.
Illustration 7
A famous shoe company produces 80,000 pairs of shoes daily. From a sample
of 800 pairs, 3% are found to be of poor quality. Find the limits for the number
of substandard pairs of shoes that can be expected when the Confidence Level
is 0.99.
Solution: Here p = 0.03 and n = 800.
∴ Ŝ.E.(p̂) = √(p(1 − p)/n) = √(0.03 × (1 − 0.03)/800) = 0.0060
The 99% Lower Limit to the proportion of substandard pairs is 0.03 − 2.58 × 0.0060 = 0.0145, and the Upper Limit is 0.03 + 2.58 × 0.0060 = 0.0455. Hence, at the 99% Level of Confidence, the number of substandard pairs out of the 80,000 produced daily can be expected to lie between 80,000 × 0.0145 = 1,160 and 80,000 × 0.0455 = 3,640, approximately.
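Carrying Illustration 7 through in full precision (rounding S.E.(p̂) to 0.0060 before multiplying, as done above, shifts the limits slightly):

```python
import math

# Completing Illustration 7: p = 0.03, n = 800, N = 80,000 pairs per day.
# Limits on the NUMBER of substandard pairs = N x (99% limits on P).
p, n, N, z = 0.03, 800, 80000, 2.58

se = math.sqrt(p * (1 - p) / n)        # ~= 0.0060
lo, hi = p - z * se, p + z * se        # 99% limits on the proportion P
print(round(N * lo), round(N * hi))    # about 1155 and 3645 pairs
```

With the rounded S.E. of 0.0060 these become roughly 1,160 and 3,640 pairs.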
1) State with reasons, whether the following statements are true or false.
a) Confidence Interval provides a range of values that may not contain the
parameter.
b) Confidence Interval is a function of Confidence Co-efficient.
c) 95% Confidence Interval for population mean is x ± 1.96 S.E. ( x ) .
d) While computing Confidence Interval for population mean, if the population
S.D. is unknown, we can always replace it by the corresponding sample S.D.
e) 99% Upper Confidence Limit for population proportion is p + 1.96 √(p(1 − p)/n).
f) Confidence co-efficient does not contain Lower Confidence Limit and Upper
Confidence Limit.
g) If np ≥ 5 and np(1 − p) ≥ 5, one may apply the formula p ± zα √(p(1 − p)/n) for
computing Confidence Interval for population proportion.
h) The interval µ ± 3 S.E. ( x ) covers 96% area of the normal curve.
2) Differentiate between Point Estimation and Interval Estimation.
...............................................................................................................
..............................................................................................................
4) Out of 25,000 customers’ ledger accounts, a sample of 800 accounts was taken
to test the accuracy of posting and balancing and 50 mistakes were found.
Assign limits within which the number of wrong postings can be expected with
99% confidence.
...............................................................................................................
...............................................................................................................
...............................................................................................................
5) A sample of 20 items is drawn at random from a normal population comprising
200 items and having standard deviation as 10. If the sample mean is 40,
obtain 95% Interval Estimate of the population mean.
...............................................................................................................
...............................................................................................................
...............................................................................................................
6) A new variety of potato grown on 400 plots provided a mean yield of 980
quintals per acre with a S.D. of 15.34 quintals per acre. Find 99% Confidence
Limits for the mean yield in the population.
................................................................................................................
................................................................................................................
................................................................................................................
In order to answer this question, let us familiarise ourselves with a few terms
associated with the problem. A statement like ‘The average income of the
people belonging to the city of Delhi is Rs. 3,000 per month’ is known as a
null hypothesis. Thus, a null hypothesis may be described as an assumption or
a statement regarding a parameter (population mean, ‘µ’, in this case) or about
the form of a population. The term ‘null’ is used as we test the hypothesis on
the assumption that there is no difference or, to be more precise, no significant
difference between the value of a parameter and that of an estimator as
obtained from a random sample taken from the population. A hypothesis may
be simple or composite.
H0 : µ = 3,000
i.e., the null hypothesis is that the population mean is Rs. 3,000. Generally, we
write
H0 : µ = µ0
i.e., the null hypothesis is that the population mean µ equals µ0, where µ0 may be
any value as specified in a given situation.
The procedure which enables us to decide whether to accept or reject a hypothesis
is known as a test of hypothesis or test of significance or decision rule. Thus, the
entire process of hypothesis testing is either to reject or accept H0 only.
In the present problem, one may argue that since many people of Delhi city are
living in the slums and even on the pavements, the average income should be
less than Rs. 3000. So one alternative hypothesis may be :
H1 : µ < 3,000, i.e., the average income is less than Rs. 3,000, or symbolically:
H1 : µ < µ0, i.e., the population mean (µ) is less than µ0.
Again one may feel that since there are many multistoried buildings and many
new models of vehicles run through the streets of the city, the average income
must be more than Rs. 3,000. So another alternative hypothesis may be :
H2 : µ > 3000 i.e., the average income is more than Rs. 3,000.
Lastly, another group of people may opine that the average income is
significantly different from µ0. So the third alternative could be:
H3 : µ ≠ 3,000, i.e., the average income is different from Rs. 3,000.
Now while testing H0 we are liable to commit two types of errors. In the first
case, it may be that H0 is true but x falls on ω and as such, we reject H0.
This is known as type-I error or error of the first kind. Thus type-I error is
committed in rejecting a null hypothesis which is, in fact, true. Secondly, it may
be that H0 is false but x falls on A and hence we accept H0. This is known as
type-II error or error of the second kind. So type-II error may be described as
the error committed in accepting a null hypothesis which is, in fact, false. The
two kinds of errors are shown in Table 15.3.
It is obvious that we should take into account both types of errors and must try
to reduce them.Since committing these two types of errors may be regarded as
random events, we may modify our earlier statement and suggest that an
appropriate test of hypothesis should aim at reducing the probabilities of both
types of errors. Let ‘α’ (read as ‘alpha’) denote the probability of type-I error
and ‘β’ (read as ‘beta’) the probability of type-II error. Thus, by definition, we
have
α = The probability of the sample point falling on the critical region when H0 is
true, i.e., the value of θ is θ0 = P (x ∈ ω | θ0) …(15.26)
and β = The probability of the sample point falling on the acceptance region when
H1 is true, i.e., the value of θ is θ1
= P (x ∈ A | θ1) … (15.27)
Surely, our objective would be to reduce both type-I and type-II errors. But
since we have taken recourse to sampling, it is not possible to reduce both
types of errors simultaneously for a fixed sample size. As we try to reduce ‘α’,
β increases and a reduction in the value of β results in an increase in the value
of ‘α’. Thus, we fix α, the probability of type-I error to a given level (say, 5
per cent or 1 per cent) and subject to that fixation, we try to reduce β,
probability of type-II error. ‘α’ is also known as size of the critical region. It
is further known as level of significance as ‘α’ constitutes the basis for making
the difference (θ – θ0) as significant. The selection of ‘α’ level of significance,
depends on the experimenter.
The power of the test is given by:
1 − β = 1 − P (x ∈ A | θ = θ1)

Fig. 15.5: Power Curve of a Test
i.e., the probability that u0 would exceed uα/2 or u0 is less than u(1−α/2) is α.

Fig. 15.6: Critical region of a two-tailed Test (50α% area in each tail, below u(1−α/2) and above uα/2)
If the sample point x falls on one of the two tails, we reject H0 and accept H1
: θ ≠ θ0. The statistical test for H0 : θ = θ0 against H1 : θ ≠ θ0 is known as
both-sided test or two-tailed test as the critical region, ‘ω’ lies on both sides of
the probability curve, i.e., on the two tails of the curve. The critical region is
ω : u0 ≥ uα/2 or u0 ≤ u(1−α/2). It is obvious that a two-tailed test is
appropriate when there are reasons to believe that ‘u’ differs from θ0
significantly on both the left side and the right side, i.e., the value of the test
statistic ‘u’ as obtained from the sample is significantly either greater than or
less than the hypothetical value.
For testing the null hypothesis H0 : µ = 3000, i.e., the average income of the
people of Delhi city is Rs. 3000, one may think that the alternative hypothesis
would be H1 : µ ≠ 3000 i.e., the average income is not Rs. 3000 and as such,
we may advocate the application of a two-tailed test. Similarly, for testing the
null hypothesis that the average life of lights produced by Indian Electricals is
5,000 hours against the alternative hypothesis that the average life is not 5,000
hours, i.e., for testing H0 : µ = 5,000 against H1 : µ ≠ 5,000, we may prescribe
a two-tailed test. In the problem concerning the health of city B, we may be
interested in testing whether 20% of the population of city B really suffers from
T.B. i.e., testing H0 : P = 0.2 against H1 : P ≠ 0.2 and again a two-tailed test
is necessary and lastly regarding the harms of smoking, we may like to test H0
: P = 0.3 against H1 : P ≠ 0.3.
Right-tailed Tests
We may think of testing a null hypothesis against another pair of alternatives. If
we wish to test H0 : θ = θ0 against H1 : θ > θ0, then from (15.30) we have
P (u0 ≥ uα) = α. This suggests that a low value of α, say α = 0.01, implies
that the probability that u0 exceeds uα is 0.01. So the probability that u0
exceeds uα is rather small. Thus on the basis of a random sample drawn from
this population if it is found that u0 is greater than uα, then we have enough
evidence to suggest that H0 is not true. Then we reject H0 and accept H1. This
is exhibited in Figure 15.7 as shown below:
Fig. 15.7: Critical region of a right-tailed Test (100α% area beyond uα)
As shown in figure 15.7, the critical region lies on the right tail of the curve.
This is a one-sided test and as the critical region lies on the right tail of the
curve, it is known as right-tailed test or upper-tailed test. We apply a right-
tailed test when there is evidence to suggest that the value of the statistic u is
significantly greater than the hypothetical value θ0. In case of testing about the
average income of the citizens of Delhi, if one has prior information to suggest
that the average income of Delhi is more than Rs. 3,000, then we would like to
test H0 : µ = 3,000 against H1 : µ > 3,000 and we select the right-tailed test.
In a similar manner for testing the hypothesis that the average life of lights by
Indian Electricals is more than 5,000 hours or for testing the hypothesis that
more than 20 per cent suffer from T.B. in city B or for testing the hypothesis
that the per cent of smokers in town C is more than 30, we apply the right-
tailed test.
Left-tailed test
Lastly, we may be interested to test H0 : θ = θ0 against H2 : θ < θ0. From
(15.31), we have P (u0 ≤ u1–α) = α. Choosing α = 0.01, this implies that the
probability that u0 would be less than u1–α is 0.01, which is surely very low. So, if
on the basis of a random sample taken from the population, it is found that u0
is less than u1-α, then we have very serious doubts about the validity of H0. In
this case, we reject H0 and accept H2 : θ < θ0. This is reflected in Figure 15.8
shown below.
Fig. 15.8: Critical Region of a Left-tailed Test (100α% area below u(1−α))
This one-sided test is known as a left-tailed test or a lower-tailed test. We apply a left-tailed test when there is
enough indication to suggest that the value of the test statistic ‘u’ is significantly
less than the hypothetical value. Then for determining the status of Delhi city, if
somebody suggests with evidence that the average income is less than Rs.
3,000 and as such Delhi should not be regarded as a top grade city, then we
are to test H0 : µ = 3000 against H1 : µ < 3000, which is a left-tailed test. We
may further note that we apply left-tailed test when we would like to test the
hypothesis that the average life of lights of Indian Electricals is less than 5,000
hours or less than 20 per cent are suffering from T.B. in city B or less than 30
per cent are smokers in town C.
2) Choose the appropriate test statistic ‘u’ and sampling distribution of ‘u’ under
H0. In most cases ‘u’ follows a standard normal distribution under H0 and
hence Z-test can be recommended in such a case.
3) Select α, the level of significance of the test if it is not provided in the given
problem. In most cases, we choose α = 0.05 and α = 0.01 which are known as
5% level of significance and 1% level of significance.
7) Draw your own conclusion in very simple language which should be understood
even by a layman.
H : µ ≠ µ0 or,
H1 : µ > µ0 or,
H2 : µ < µ0.
As we have discussed in Section 15.2, the best statistic for the parameter µ is x̄.
It has been proved in that Section that E(x̄) = µ and
S.E.(x̄) = σ/√n
As such the test statistic:
z = (x̄ − E(x̄))/S.E.(x̄) = (x̄ − µ)/(σ/√n)
is a standard normal variable. Under H0, i.e., assuming the null hypothesis to be
true,
z0 = √n (x̄ − µ0)/σ
is a standard normal variable. As such, the test is known as
standard normal variate test or standard normal deviate test or Z-test. In order
to find the critical region for testing H0 against H from (15.28) and (15.29), we
find that :
P (u0 ≥ uα/2) = α/2
and P (u0 ≤ u(1−α/2)) = α/2
2
If we denote the standard normal variate by Z, and the upper α-point of the
standard normal distribution by Zα, and by Z(1–α/2) = –Zα/2, (as the standard
normal distribution is symmetrical about 0), the lower α-point of the standard
normal distribution, then the above two equations are reduced to :
P (Z0 ≥ Zα/2) = α/2 ……(15.33)
and P (Z0 ≤ −Zα/2) = α/2 ……(15.34)
From (15.33), we have:
1 − P (Z0 < Zα/2) = α/2
or 1 − φ(Zα/2) = α/2
or φ(Zα/2) = 1 − α/2
Putting α = 0.05, we get φ(Zα/2) = 0.975, whence Zα/2 = 1.96. Hence the critical
region for the two-tailed test at 5% level of significance is:
ω : |Z0| ≥ 1.96
where Z0 = √n (x̄ − µ0)/σ ……(15.36)
Proceeding in a similar manner, the critical region for the two-tailed test at 1%
level of significance is given by:
ω : |Z0| ≥ 2.58 ……(15.37)
Next, for the right-sided alternative H1 : µ > µ0, we have P (Z0 ≥ Zα) = α
or, 1 − φ(Zα) = α
or, φ(Zα) = 1 − α …… (15.38)
Putting α = 0.05 in (15.38), we get φ(Zα) = 0.95, whence Zα = 1.645.
Hence the critical region for this right-tailed test at 5% level of significance is:
ω : Z0 ≥ 1.645
Similarly the critical region at 1% level of significance would be:
ω : Z0 ≥ 2.33
Finally if we make up our minds to test H0 against H2 : µ < µ0, then from
(15.35), we get
P (Z0 ≤ –Zα) = α
or, φ (− Z α ) = α
or, 1− φ ( Z α ) = α
or, φ (Z α ) = 1 − α
Figure 15.9: Two-tailed Critical Region for Testing Population Mean at 5% Level of Significance (95% acceptance region between −1.96 and 1.96, with 2.5% area in each tail)
Figure 15.10: Two-tailed Critical Region for Testing Population Mean at 1% Level of Significance (99% acceptance region between −2.58 and 2.58, with 0.5% area in each tail)
Figure 15.11: Right-tailed Critical Region for Testing Population Mean at 5% Level of Significance (critical region ω : Z0 ≥ 1.645, with 5% area in the right tail)
Figure 15.12: Right-tailed Critical Region for Testing Population Mean at 1% Level of Significance (critical region ω : Z0 ≥ 2.33, with 1% area in the right tail)
95 % Area
Acceptance Region
Critical Region
ω : Z0 ≤ –1.645
5 % area –1.645 µ0
Figure 15.13: Left-tailed Critical Region for Testing Population Mean at 5% Level of
Significance
99 % Area
Acceptance Region
Critical Region
ω : Z0 ≤ –2.35
1 % area –2.35 µ0
Figure 15.14: Left-tailed Critical Region for Testing Population Mean at 1% Level of Significance
Case II: When the population standard deviation σ is unknown, we replace it by
the sample standard deviation
s| = √[ Σ(Xi − x̄)²/(n − 1) ]
in the test statistic used in Case-I, provided we have a sufficiently large sample
(as discussed earlier, n should exceed 30). Thus we consider
Z0 = √n (x̄ − µ0)/s|
As before, the critical region for the two-tailed test is
ω : |Z0| ≥ 1.96 at 5% level of significance
and ω : |Z0| ≥ 2.58 at 1% level of significance.
For the right-sided alternative H1 : µ > µ0, the critical region is
ω : Z0 ≥ 1.645 at 5% level of significance
and ω : Z0 ≥ 2.33 when the level of significance is 1%.
Lastly the critical region for the left-sided alternative H2 : µ < µ0 would be
provided by:
ω : Z0 ≤ –1.645 at 5% level of significance and ω : Z0 ≤ –2.33 at 1% level of
significance.
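The large-sample Z-test for a population mean sketched above can be written out in code as follows (a minimal illustration; the function name and the sample figures are hypothetical):

```python
import math
import statistics

def z_test_mean(sample, mu0, sigma=None):
    """Z0 = sqrt(n) * (x_bar - mu0) / sigma.

    If the population S.D. sigma is unknown, it is replaced by the
    sample S.D. s| (computed with divisor n - 1), which is valid for
    large samples (n > 30)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = sigma if sigma is not None else statistics.stdev(sample)
    return math.sqrt(n) * (x_bar - mu0) / s

# Hypothetical sample of n = 40 observations, testing H0: mu = 50
sample = [50 + 0.1 * i for i in range(40)]   # x_bar = 51.95
z0 = z_test_mean(sample, mu0=50, sigma=2.5)
# Two-tailed test at 5% level: reject H0 when |Z0| >= 1.96
reject = abs(z0) >= 1.96
```

Dropping the `sigma` argument makes the function fall back on s|, which corresponds to Case-II above.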
For example, if we want to test whether a fresh coin just out of a mint is
unbiased, then we are to test H0 : P = 0.5. Similarly, the problem of testing
whether 20% of the population of city B is suffering from T.B. amounts to
testing H0 : P = 0.2, and testing whether 30% of the population of a town are
smokers is equivalent to testing H0 : P = 0.3.
Hence, it follows that the sample proportion p = x/n follows normal distribution
with mean P0 and S.D. √[ P0 (1 − P0)/n ] under H0.
Thus Z0 = (p − P0)/√[ P0 (1 − P0)/n ] = √n (p − P0)/√[ P0 (1 − P0) ]
is a standard normal variate and as such we can apply the Z-test for attributes.
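The same Z-test for attributes can be sketched in code (the counts below are hypothetical, e.g. testing whether a coin is unbiased, H0 : P = 0.5):

```python
import math

def z_test_proportion(x, n, p0):
    """Z0 = sqrt(n) * (p - P0) / sqrt(P0 * (1 - P0)), valid for large n."""
    p = x / n                      # sample proportion
    return math.sqrt(n) * (p - p0) / math.sqrt(p0 * (1 - p0))

# Hypothetical data: 120 heads in 220 tosses of a coin, testing H0: P = 0.5
z0 = z_test_proportion(120, 220, 0.5)
# Two-tailed test at 5% level: reject H0 when |Z0| >= 1.96
reject = abs(z0) >= 1.96
```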
Illustration 8
we use Z0 = √n (x̄ − 1900)/σ
The critical region for this right-sided alternative is given by:
ω : Z0 ≥ 1.645 at 5% level of significance and ω : Z0 ≥ 2.33 at 1% level of
significance. Here,
Z0 = √50 (1926 − 1900)/110 = 1.671
Thus, we reject H0 at 5% level of significance but accept the null hypothesis at
1% level of significance. On the basis of the given data, we thus conclude that
the manufacturer’s claim is justifiable at 5% level of significance, but at 1%
level of significance we infer that the manufacturer has been unable to produce
cables with a higher breaking strength.
Illustration 9
A random sample of 500 flower stems has an average length of 11 cm. Can
this be regarded as a sample from a large population with mean as 10.8 cm
and standard deviation as 2.38 cm?
Solution: Let the length of the stem be denoted by x. Assume that µ denotes
the mean length of stems in the population. The sample size 500 being very
large, we apply the Z-test for testing H0 : µ = 10.8, i.e., the population mean is
10.8 cm, against H : µ ≠ 10.8, i.e., the population mean is not 10.8 cm.
Z0 = √n (x̄ − 10.8)/σ
and choosing the level of significance as 5%, we note that the critical region is:
ω : |Z0| ≥ 1.96
As per the given data,
n = 500, x̄ = 11 cm, σ = 2.38 cm
so that Z0 = √500 (11 − 10.8)/2.38 = 1.879.
Since |Z0| < 1.96, Z0 does not fall in the critical region and we accept H0. We
conclude that, on the basis of the given data, the sample can be regarded as
taken from a large population with mean as 10.8 cm and standard deviation as
2.38 cm.
Illustration 10
623, 648, 672, 685, 692, 650, 649, 666, 638, 629
We consider Z0 = √n (x̄ − 650)/σ
and recall that the critical region at 1% level of significance (selecting α =
0.01) for this left-tailed test is given by
ω : Z0 ≤ –2.33
since n = 10, σ = 12.83 hours, and
x̄ = (623 + 648 + 672 + 685 + 692 + 650 + 649 + 666 + 638 + 629)/10 = 655.2 hours
∴ Z0 = √10 (655.2 − 650)/12.83 = 1.282
As this does not fall in the critical region, H0 is accepted. Thus on the basis of
the given sample, we conclude that the manufacturer’s assertion was right.
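The arithmetic of Illustration 10 can be reproduced directly from the data given above (a quick numerical check):

```python
import math
import statistics

# Lifetimes (in hours) of the 10 sampled units from Illustration 10
lifetimes = [623, 648, 672, 685, 692, 650, 649, 666, 638, 629]
x_bar = statistics.mean(lifetimes)            # 655.2 hours
z0 = math.sqrt(10) * (x_bar - 650) / 12.83    # sigma = 12.83 hours (given)
# Left-tailed critical region at 1% level: reject H0 when Z0 <= -2.33
reject = z0 <= -2.33                          # H0 is accepted here
```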
Illustration 11
The heights of 12 students taken at random from St. Nicholas College, which
has 1,000 students and a standard deviation of height as 10 inches, are
recorded in inches as 65, 67, 63, 69, 71, 70, 65, 68, 63, 72, 61 and 66.
Do the data support the hypothesis that the mean height of all the students in
that college is 68.2 inches?
Solution: Letting x stand for the height of the students of St. Nicholas College,
we would like to test H0 : µ = 68.2 against H : µ ≠ 68.2,
where Z0 = (x̄ − 68.2)/S.E.(x̄)
In this case,
x̄ = (65 + 67 + 63 + 69 + 71 + 70 + 65 + 68 + 63 + 72 + 61 + 66)/12 = 66.67 inches
and, using the finite population correction,
S.E.(x̄) = (10/√12) × √[(1000 − 12)/(1000 − 1)] = 2.8868 × 0.9945 = 2.8708 inches
∴ Z0 = (66.67 − 68.2)/2.8708 = −0.533
Since |Z0| < 1.96, we accept the hypothesis at 5% level of significance; the data
support a mean height of 68.2 inches for all the students of the college.
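The standard error with the finite population correction used in Illustration 11 can be checked numerically (a small sketch of the arithmetic above):

```python
import math
import statistics

# Heights (inches) of the 12 sampled students; N = 1000, sigma = 10 (given)
heights = [65, 67, 63, 69, 71, 70, 65, 68, 63, 72, 61, 66]
N, n, sigma = 1000, len(heights), 10

x_bar = statistics.mean(heights)               # about 66.67 inches
fpc = math.sqrt((N - n) / (N - 1))             # finite population correction
se = (sigma / math.sqrt(n)) * fpc              # S.E. of the sample mean
z0 = (x_bar - 68.2) / se
# Two-tailed test at 5% level: reject H0: mu = 68.2 when |Z0| >= 1.96
reject = abs(z0) >= 1.96                       # H0 is accepted here
```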
Since n = 950; nP0 = 950 × 0.5 = 475; and nP0 (1–P0) = 237.5, we can apply
Z-test for proportion. Thus we compute :
Z0 = √n (p − 0.5)/√[0.5 (1 − 0.5)]
and note that the critical region at 1% level of significance for this two-tailed
test is:
ω : |Z0| ≥ 2.58
As p = x/n = 500/950 = 0.5263,
Z0 = √950 (0.5263 − 0.5)/0.5 = 1.621.
Since |Z0| < 2.58, we accept H0 at 1% level of significance.
Illustration 13
Solution: Let ‘p’ be the sample proportion of defectives and P, the proportion
of defective parts in the whole manufacturing process. Then we are to test
If we select α = 0.05, then the critical region for this right-tailed test is :
ω : Z0 ≥ 1.645
We have, as given, p = x/n = 60/800 = 0.075
Thus, Z0 falls in the acceptance region and we accept the null hypothesis. We
conclude that, on the basis of the given information, the manufacturer’s claim is
valid.
A family-planning activist claims that more than 33 per cent of the families in
her town have more than one child. A random sample of 160 families from the
town reveals that 50 families have more than one child. What is your inference?
Select α = 0.01.
Solution: If ‘P’ denotes the proportion of families in the town having more
than one child, then we want to test H0 : P = 0.33 against H1 : P > 0.33.
We consider Z0 = √n (p − 0.33)/√[0.33 (1 − 0.33)] as test statistic and note
that at 1% level of significance the critical region is ω : Z0 ≥ 2.33.
Here, p = 50/160 = 0.3125 and n = 160, so that
Z0 = √160 (0.3125 − 0.33)/√(0.33 × 0.67) = −0.471.
As Z0 does not fall in the critical region, we accept H0 and conclude that the
given data do not support the activist’s claim.
We have concluded our discussion by conducting tests for population mean and
population proportion under different types of alternative hypotheses.
Critical Region or Rejection Region: The set of values of the test statistic
leading to the rejection of H0. It is a part of the sample space and is denoted
by ω; if the sample point falls in ω, we reject H0.
95% confidence interval for µ = [ x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n ]
99% confidence interval for µ = [ x̄ − 2.58 σ/√n , x̄ + 2.58 σ/√n ]
If σ is unknown, we replace it by s| = √[ Σ(Xi − x̄)²/(n − 1) ], provided ‘n’
exceeds 30.
Minimum variance unbiased estimator: T is called a minimum variance unbiased
(best) estimator for θ if T has the minimum variance among all the unbiased
estimators of θ.
Z-test for population mean: For testing H0: µ = µ0, the test statistic is given by
Z0 = √n (x̄ − µ0)/σ
If σ is unknown and n > 30, we replace σ by s| in the expression for Z0.
Z-test for population proportion: Z0 = √n (p − P0)/√[ P0 (1 − P0) ], provided n
is large.
Under the assumption that the null hypothesis is true, Z0 follows standard
normal distribution. At 5% level of significance, the critical region for the two-
tailed test is given by
ω : |Zo| ≥ 1.96
The critical region for the right-tailed test is
ω : Zo ≥ 1.645
and the critical region for the left-tailed test is
ω : Zo ≤ –1.645
Similarly when the level of significance is 1%, the critical region for the two-
tailed test is
ω : |Zo| ≥ 2.58
For the right-tailed test, the critical region is
ω : Zo ≥ 2.33
and that for the left-tailed test is
ω : Zo ≤ –2.33
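These tabulated critical values can be recovered from the standard normal quantile function; a sketch using Python's standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

# Upper alpha-points of the standard normal distribution
two_tailed_5 = z.inv_cdf(1 - 0.05 / 2)   # 1.96:  |Z0| >= 1.96
two_tailed_1 = z.inv_cdf(1 - 0.01 / 2)   # 2.58:  |Z0| >= 2.58
one_tailed_5 = z.inv_cdf(1 - 0.05)       # 1.645: Z0 >= 1.645
one_tailed_1 = z.inv_cdf(1 - 0.01)       # 2.33:  Z0 >= 2.33
```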
C) 1. a) Yes, b) Yes, c) No, d) No, e) Yes, f) No, g) Yes, h) Yes.
4. Yes, Z0 = – 0.447
5. No, Z0 = 4.12
6. Yes, Z0 = 1.774
3) Discuss the role of normal distribution in interval estimation and also in testing
hypothesis.
5) Discuss how far the sample proportion satisfies the desirable properties of a
good estimator.
7) Describe how you could set confidence limits to population proportion on the
basis of a large sample.
9) Describe the different steps for testing the significance of population proportion.
10) 15 Life Insurance Policies in a sample of 250 taken out of 60,000 were found to
be insured for less than Rs. 7500. How many policies can be reasonably
expected to be insured for less than Rs. 7500 in the whole lot at 99%
confidence level.
(Ans: 1278 to 5922)
12) A manufacturer of ball-point pens claims that a certain type of pen produced by
him has a mean writing life of 550 pages with a S.D. of 35 pages. A purchaser
selects 20 such pens and the mean life is found to be 539 pages. At 5% level of
significance should the purchaser reject the manufacturer’s claim ?
(Ans: Yes, Z0 = –2.30)
13) In a sample of 550 guavas from a large consignment, 50 guavas are found to be
rotten. Estimate the percentage of defective guavas and assign limits within
which 95% of the rotten guavas would lie.
[Ans: (i) 9.09%; (ii) 0.0668 to 0.1150]
14) A die is thrown 59215 times out of which six appears 9500 times. Would you
consider the die to be unbiased ?
(Ans: No, Z0 = – 4.113)
15) A sample of 50 items is taken from a normal population with mean as 5 and
standard deviation as 3. The sample mean comes out to be 4.38. Can the
sample be regarded as a truly random sample?
(Ans: No, Z = –1.532)
16) A random sample of 600 apples was taken from a large consignment of 10,000
apples and 70 of them were found to be rotten. Show that the number of rotten
apples in the consignment with 95% confidence may be expected to be from
910 to 1,424.
17) The mean life of 500 bulbs, as obtained in a random sample manufactured by a
company, was found to be 900 hours with a standard deviation of 300 hours.
Test the hypothesis that the mean life is less than 900 hours. Select α = 0.05
and 0.01.
(Ans: Yes, Z0 = –3.7268)
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
Gupta, C.B., and Vijay Gupta, 1998, An Introduction to Statistical Methods, Vikas
Publishing House Pvt. Ltd., New Delhi.
UNIT 16 TESTS OF HYPOTHESIS – II
STRUCTURE
16.0 Objectives
16.1 Introduction
16.2 Small Samples versus Large Samples
16.3 Student’s t-distribution
16.4 Application of t-distribution to determine Confidence Interval for
Population Mean
16.5 Application of t-distribution for Testing Hypothesis Regarding Mean
16.6 t-test for Independent Samples
16.7 t-test for Dependent Samples
16.8 Let Us Sum Up
16.9 Key Words and Symbols
16.10 Answers to Self Assessment/Exercises
16.11 Terminal Questions/Exercises
16.12 Further Reading
16.0 OBJECTIVES
After studying this unit, you should be able to:
l differentiate between exact tests i.e., small sample tests and approximate tests,
i.e., large sample tests,
l be familiar with the properties and applications of t-distribution,
l find the interval estimation for mean using t-distribution,
l have an idea about the theory required for testing hypothesis using
t-distribution,
l apply t-test for independent samples, and
l apply t-test for dependent samples.
16.1 INTRODUCTION
In the previous unit, we considered different aspects of the problems of
inferences. We further noted the limitations of standard normal test or Z-test.
As discussed in Unit 15, we can not apply normal distribution for estimating
confidence intervals for population mean in case the population standard
deviation is unknown and sample size does not exceed 30, i.e., small samples.
We may further recall that as mentioned in Unit 15, we can not test hypothesis
concerning population mean when the sample is small and population standard
deviation is unspecified. In a situation like this, we use t-distribution which is
also known as student’s t-distribution. t-distribution was first applied by W.S.
Gosset who used to work in ‘Guinners Brewery’ in Dublin. The workers of
Guinners Brewery were not allowed to publish their research work. Hence
Gosset was compelled to publish his research work under the penname
‘student’ and hence the distribution is known as student’s t-distribution or simply
student’s distribution. Before we discuss t-distribution, let us differentiate
between exact tests and approximate tests.
16.2 SMALL SAMPLES VERSUS LARGE SAMPLES
Normally a sample is considered as small if its size is 30 or less whereas, the
sample with size exceeding 30 is considered as a large sample. All the tests
under consideration may be classified into two categories namely exact tests
and approximate tests. Exact tests are those tests that are based on the exact
sampling distribution of the test statistic and no approximation is made about the
form of parent population or the sampling distribution of the test statistic. Since
exact tests are valid for any sample size and usually cost as well as labour
increases with an increase in sample size; we prefer to take small samples for
conducting exact tests. Hence, the exact tests are also known as small sample
tests. It may be noted that while testing for population mean on the basis of a
random sample from a normal distribution, we apply exact tests or small sample
tests provided the population standard deviation is known. This was
demonstrated in Unit 15.
Approximate tests or large-sample tests, on the other hand, are based on a test
statistic of the form
Z0 = (T − θ0)/S.E.(T) or Z0 = (T − θ0)/[estimated S.E.(T)]
which, for a large sample, is approximately a standard normal variable. Thus,
for the population mean we use
Z = √n (x̄ − µ)/S|
where S| = √[ Σ(xi − x̄)²/(n − 1) ]
and for the population proportion we use
Z0 = √n (p − P0)/√[ P0 (1 − P0) ]
P0 being the specified population proportion, which again, for a large sample, is
an approximate standard normal variable.
16.3 STUDENT’S t-DISTRIBUTION
Since we cannot use Z-test, for a small sample, for population mean when the
population standard deviation is not known, we are on the look out for a new
test statistic. It is necessary to know a few terms first.
If x is a standard normal variable, then x² follows the χ²-distribution with 1 d.f.
Again x1² + x2² ~ χ2² and, in general,
x1² + x2² + … + xn² = Σxi² ~ χn² ……(16.1)
If we write u = Σxi², then the probability density function of u is given by:
f(u) = [1/(2^(n/2) Γ(n/2))] e^(−u/2) u^(n/2 − 1), u > 0
Figure 16.1: χ²-distribution.
If x1, x2, x3, …, xn are ‘n’ independent variables, each following normal
distribution with mean (µ) and variance (σ²), then Xi = (xi − µ)/σ is a standard
normal variable and as such
u = ΣXi² = Σ(xi − µ)²/σ² ~ χn² ……(16.2)
Let S² = Σ(xi − x̄)²/n denote the sample variance. As
Σ(xi − µ)²/σ² ~ χn²
and
n(x̄ − µ)²/σ² = (x̄ − µ)²/(σ²/n) ~ χ1²
(since x̄ ~ N(µ, σ/√n)), and since Σ(xi − µ)² = Σ(xi − x̄)² + n(x̄ − µ)², it
follows that
nS²/σ² = Σ(xi − x̄)²/σ² ~ χ(n−1)²
Student’s t-distribution: Consider two independent variables ‘y’ and ‘u’ such
that ‘y’ follows standard normal distribution and ‘u’ follows χ²-distribution with
m d.f. Then the ratio t = y/√(u/m) follows t-distribution
with m d.f. The probability density function of t is given by:
f(t) = const. [1 + t²/m]^−(m+1)/2, −∞ < t < ∞ ……(16.3)
where t = √n (x̄ − µ)/s|; const. = a constant required to make the area under
the curve equal to unity; and m = n − 1, the degrees of freedom of t.
Since we have:
f(t) = const. [1 + t²/m]^−(m+1)/2 for −∞ < t < ∞
∴ Log f = k − [(m + 1)/2] Log (1 + t²/m), where ‘k’ is a constant
= k − [(m + 1)/2] [t²/m − t⁴/2m² + …… to ∞]
[as Log (1 + x) = x − x²/2 + …… to ∞ for −1 < x ≤ 1, and t²/m is rather
small for a large m].
Hence Log f = k − [(m + 1)/m] · t²/2 + [(m + 1)/4m²] · t⁴ − …… to ∞
Since m is very large, (m + 1)/m tends to 1 and the remaining terms,
containing higher powers of m in the denominator, tend to zero. Thus we have:
Log f = k − t²/2
or, f = e^(k − t²/2) = e^k · e^(−t²/2) = const. e^(−t²/2)
which is the density of a standard normal variable. Looking from another angle,
as the mean of the t-distribution is zero and its standard deviation √[m/(m − 2)]
tends to unity for a large m, the t-distribution approaches the standard normal
distribution as m becomes large.
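This convergence is easy to verify numerically. Writing the normalizing constant of (16.3) explicitly as Γ((m+1)/2)/(√(mπ) Γ(m/2)), the t density can be compared with the standard normal density (the degrees of freedom chosen below are arbitrary):

```python
import math

def t_pdf(t, m):
    """Density of Student's t-distribution with m degrees of freedom."""
    const = math.gamma((m + 1) / 2) / (math.sqrt(m * math.pi) * math.gamma(m / 2))
    return const * (1 + t * t / m) ** (-(m + 1) / 2)

def normal_pdf(t):
    """Density of the standard normal distribution."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

# For m = 30 the two densities already agree closely near the centre
gap = abs(t_pdf(0, 30) - normal_pdf(0))
```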
If x1, x2, x3 …xn denote the n observations of a random sample drawn from a
normal population with mean as µ and the standard deviation as σ, then x1, x2,
x3 …xn can be described as ‘n’ independent random variables each following
normal distribution with the same mean µ and a common standard deviation σ.
If we consider the statistic:
√n (x̄ − µ)/s|
where x̄ = Σxi/n, the sample mean, and s| = √[ Σ(xi − x̄)²/(n − 1) ] is the
standard deviation with divisor as (n − 1) instead of n, then we may write:
√n (x̄ − µ)/s| = [√n (x̄ − µ)/σ]/(s|/σ), dividing both numerator and denominator by σ
= [(x̄ − µ)/(σ/√n)]/√[ Σ(xi − x̄)²/((n − 1)σ²) ] = y/√[u/(n − 1)]
where y = (x̄ − µ)/(σ/√n) is a standard normal variate.
Also u = Σ(xi − x̄)²/σ² follows χ²-distribution with (n − 1) d.f.
Hence, by definition, √n (x̄ − µ)/s| = y/√[u/(n − 1)], so that
t = √n (x̄ − µ)/s| ~ t(n−1)
We apply t-distribution for finding confidence interval for mean as well as for
testing hypothesis regarding mean. These are discussed in Sections 16.4 and
16.5 respectively.
16.4 APPLICATION OF t-DISTRIBUTION TO DETERMINE CONFIDENCE INTERVAL FOR POPULATION MEAN
Let us assume that we have a random sample of size ‘n’ from a normal
population with mean as µ and standard deviation as σ. We consider the case
when both µ and σ are unknown. We are interested in finding confidence
interval for population mean. In view of our discussion in Section 16.3, we
know that :
t = √n (x̄ − µ)/s|
follows t-distribution with (n–1) d.f. We may recall here that x denotes the
sample mean and s|, the sample standard deviation with divisor as (n–1) and not
‘n’. We denote the upper α-point of t-distribution with (n–1) d.f as tα, (n–1).
Since t-distribution is symmetrical about t = 0, the lower α-point of t-distribution
with (n–1) d.f would be denoted by –tα, (n–1). As per our discussion in Unit
15, in order to get 100 (1–α)% confidence interval for µ, we note that :
P[ −tα/2,(n−1) ≤ √n (x̄ − µ)/s| ≤ tα/2,(n−1) ] = 1 − α
or P[ −x̄ − (s|/√n)·tα/2,(n−1) ≤ −µ ≤ −x̄ + (s|/√n)·tα/2,(n−1) ] = 1 − α
or P[ x̄ − (s|/√n)·tα/2,(n−1) ≤ µ ≤ x̄ + (s|/√n)·tα/2,(n−1) ] = 1 − α
Thus the 100(1 − α)% confidence interval to µ is:
[ x̄ − (s|/√n)·tα/2,(n−1) , x̄ + (s|/√n)·tα/2,(n−1) ] ……(16.6)
100(1 − α)% Lower Confidence Limit to µ = x̄ − (s|/√n)·tα/2,(n−1)
and 100(1 − α)% Upper Confidence Limit to µ = x̄ + (s|/√n)·tα/2,(n−1)
Selecting α = 0.05, we may note that
95% Lower Confidence Limit to µ = x̄ − (s|/√n)·t0.025,(n−1)
and 95% Upper Confidence Limit to µ = x̄ + (s|/√n)·t0.025,(n−1) ……(16.7)
In a similar manner, setting α = 0.01, we get
99% Lower Confidence Limit to µ = x̄ − (s|/√n)·t0.005,(n−1)
and 99% Upper Confidence Limit to µ = x̄ + (s|/√n)·t0.005,(n−1) ……(16.8)
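Interval (16.6) can be sketched in code; the critical value t0.025,9 = 2.262 is read from a standard t-table (hard-coded below), and the sample figures are hypothetical:

```python
import math
import statistics

# Hypothetical sample of n = 10 observations
sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.2]
n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)        # s|, computed with divisor n - 1

t_crit = 2.262                      # t(0.025, 9) from a standard t-table
half_width = t_crit * s / math.sqrt(n)
ci_95 = (x_bar - half_width, x_bar + half_width)
```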
Values of tα,m for m = 1 to 30 and for some selected values of α are provided
in Appendix Table 5. Figures 16.3, 16.4 and 16.5 exhibit confidence intervals to
µ obtained by applying t-distribution as follows:
Fig. 16.3: 100(1−α)% Confidence Interval to µ, from x̄ − (s|/√n)·tα/2,(n−1) to x̄ + (s|/√n)·tα/2,(n−1), with α/2 of the area in each tail.
Fig. 16.4: 95% Confidence Interval to µ, from x̄ − (s|/√n)·t0.025,(n−1) to x̄ + (s|/√n)·t0.025,(n−1).
Fig. 16.5: 99% Confidence Interval to µ, from x̄ − (s|/√n)·t0.005,(n−1) to x̄ + (s|/√n)·t0.005,(n−1).
Illustration 1
Following are the lengths (in ft.) of 7 iron bars as obtained in a sample out of
100 such bars taken from SUR IRON FACTORY.
we have to find 95% confidence interval for the mean length of iron bars as
produced by SUR IRON FACTORY.
Solution: Let x denote the length of iron bars. We assume that x is normally
distributed with unknown mean µ and unknown standard deviation σ. Then
95% Lower Confidence Limit to µ = x̄ − (s|/√n)·√[(N − n)/(N − 1)]·t0.025,6
and 95% Upper Confidence Limit to µ = x̄ + (s|/√n)·√[(N − n)/(N − 1)]·t0.025,6
where x̄ = Σxi/n; s| = √[ Σ(xi − x̄)²/(n − 1) ]; n = sample size = 7;
N = population size = 100;
and √[(N − n)/(N − 1)] = finite population correction (fpc)
Thus, we have: x̄ = 28/7 = 4
Σ(xi − x̄)² = Σxi² − n x̄² = 112.0404 − 7 × 4² = 0.0404
so that s| = √(0.0404/6) = 0.082057
f.p.c. = √[(100 − 7)/(100 − 1)] = 0.969223
Consulting the t-table, t0.025,6 = 2.447. Hence
95% Lower Confidence Limit to µ = 4 − (0.082057/√7) × 0.969223 × 2.447 = 4 − 0.0736 = 3.9264
and 95% Upper Confidence Limit to µ = 4 + 0.0736 = 4.0736
So the 95% Confidence Interval for the mean length of iron bars = [3.93 ft, 4.07 ft].
Illustration 2
Find 90% confidence interval to µ given sample mean and sample S.D as 20.24
and 5.23 respectively, as computed on the basis of a sample of 11 observations
from a population containing 1000 units.
Solution: The 90% confidence interval to µ is
[ x̄ − (s|/√n)·t0.05,10 , x̄ + (s|/√n)·t0.05,10 ]
As S = √[ Σ(xi − x̄)²/n ] is the sample standard deviation (S.D.),
nS² = Σ(xi − x̄)²
Hence (s|)² = Σ(xi − x̄)²/(n − 1) = nS²/(n − 1)
or, s| = √[n/(n − 1)] · S = √(11/10) × 5.23 = 5.4853
Consulting Appendix Table-5, given at the end of this block, we find
t0.05,10 = 1.812.
Thus the 90% confidence interval to µ is given by:
[ 20.24 − (5.4853/√11) × 1.812 , 20.24 + (5.4853/√11) × 1.812 ] = [17.2432, 23.2368]
Illustration 3
The study hours per week of 17 teachers, selected at random from different
parts of West Bengal, were found to be:
6.6, 7.2, 6.8, 9.2, 6.9, 6.2, 6.7, 7.2, 9.7, 10.4, 7.4, 8.3, 7.0, 6.8, 7.6, 8.1, 7.8
Suppose, we are interested in computing 95% and 99% confidence intervals for
the average hours of study per week per teacher in the state of West Bengal.
Solution: If µ denotes the average hours of study per week per teacher in
West Bengal, then as discussed earlier,
s| s|
95% confidence interval to µ = x − .t 0.025, (n −1), x + .t 0.025, (n −1)
n n
s| s|
and 99% confidence interval to µ = x − .t 0.005 , ( n − 1), x + .t 0.005 , (n − 1)
n n
Here x̄ = Σxi/n = 129.9/17 = 7.64 hours, and
s|² = Σ(xi − x̄)²/(n − 1) = [Σxi² − n(x̄)²]/(n − 1)
= [1014.41 − 17 × (7.64)²]/(17 − 1)
= (1014.41 − 992.28)/16 = 1.3831
so that s| = √1.3831 = 1.1761
From Appendix Table-5, given at the end of this block, t0.025,16 = 2.120 and
t0.005,16 = 2.921.
Thus the 95% confidence interval to µ
= [ (7.64 − (1.1761/√17) × 2.120) hours , (7.64 + (1.1761/√17) × 2.120) hours ]
= [7.0353 hours, 8.2447 hours]
Similarly the 99% confidence interval to µ
= [ (7.64 − (1.1761/√17) × 2.921) hours , (7.64 + (1.1761/√17) × 2.921) hours ]
= [6.8068 hours, 8.4732 hours]
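The interval in Illustration 3 can be re-computed from the raw data; any small differences from the worked figures above come from rounding x̄ and s| during the hand computation:

```python
import math
import statistics

# Study hours per week of the 17 teachers from Illustration 3
hours = [6.6, 7.2, 6.8, 9.2, 6.9, 6.2, 6.7, 7.2, 9.7, 10.4, 7.4,
         8.3, 7.0, 6.8, 7.6, 8.1, 7.8]
n = len(hours)
x_bar = statistics.mean(hours)      # about 7.64 hours
s = statistics.stdev(hours)         # s| with divisor n - 1, about 1.17

t_crit = 2.120                      # t(0.025, 16) from Appendix Table-5
half_width = t_crit * s / math.sqrt(n)
ci_95 = (x_bar - half_width, x_bar + half_width)
```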
Illustration 4
The 90% confidence interval to µ is
[ x̄ − (s|/√n)·t0.05,(n−1) , x̄ + (s|/√n)·t0.05,(n−1) ]
In this case, n = 26, From Appendix Table-5, given at the end of this block,
t0.05, 25 = 1.708
Hence we have x̄ − (s|/√26) × 1.708 = 46.584
or x̄ − 0.33497 s| = 46.584 …(1)
and x̄ + (s|/√26) × 1.708 = 53.416
or, x̄ + 0.33497 s| = 53.416 …(2)
On adding equations (1) and (2), we get 2x̄ = 100, or x̄ = 50.
Substituting in (1), 50 − 0.33497 s| = 46.584
or s| = 3.416/0.33497 = 10.19793
Hence S = √[(n − 1)/n] · s| [from Illustration 2]
= 0.98058 × 10.19793 = 9.9999 ≈ 10
Thus the sample mean is 50 units and sample S.D is approximately 10 units.
Σ(xi − x̄)²/σ² ~ χn²
k) Z test has the widest range of applicability among all the commonly used
tests.
4) A random sample of size 10 drawn from a normal population yields sample mean
as 85 and sample S.D as 8.7. Compute 90% and 95% confidence intervals to
population mean.
5) Find 99% confidence limits for ‘µ’ given that a sample of 19 units drawn from a
population of 98 units provides sample mean as 15.627 and sample S.D as 2.348.
..........................................................................................................
..........................................................................................................
..........................................................................................................
6) A sample of size 10 drawn from a normal population produces the following results.
Σxi = 92 and Σxi² = 889
Obtain 95% confidence limits to µ.
..........................................................................................................
..........................................................................................................
..........................................................................................................
16.5 APPLICATION OF t-DISTRIBUTION FOR TESTING HYPOTHESIS REGARDING MEAN
Suppose we are interested in testing
H0 : µ = µ0
against H : µ ≠ µ0 i.e., the population mean is anything but µ0.
or H1 : µ > µ0 i.e., the population mean is greater than µ0.
or H2 : µ < µ0 i.e., the population mean is less than µ0.
As we have noted in Section 16.1, the proper test to apply in this situation is
undoubtedly the t-test. If we denote the upper α-point and the lower α-point of
the t-distribution with m d.f. by tα,m and t1−α,m = −tα,m respectively (as the
t-distribution is symmetrical about 0), then for testing H0, based on the
distribution of t, it may be possible to find critical values of t such that:
Hence, on the basis of a small random sample drawn from the population, if it
is found that t0 is greater than tα/2,m or t0 is less than −tα/2,m, i.e.,
|t0| ≥ tα/2,m, then we may suggest that there is enough evidence that H0 is
untrue and H is true. Then we reject H0 and accept H. The critical region for
this both-sided alternative is provided by:
ω : |t0| ≥ tα/2,m
This is shown in the following Figure 16.6. The critical region lies on both the tails.
Fig. 16.6: Critical Region for Both-tailed Test (acceptance region between −tα/2,m and tα/2,m, with α/2 of the area in each tail).
Secondly, in order to test the null hypothesis against the right-sided alternative
i.e., to test H0 against H1 : µ > µ0, from (16.11) we note that, as before, if we
choose a small value of α, then the probability that the observed value of t,
would exceed the critical value tα, m is very low. Thus one may have serious
questions in this case, about the validity of H0 if the value of t, as obtained on
the basis of a small random sample, really exceeds tα, m. We then reject H0
and accept H1. The critical region
ω : t0 ≥ tα, m ………(16.15)
lies on the right-tail of the curve and the test as such is called right-tailed test.
This is shown in Figure 16.7.
Fig. 16.7: Critical Region for Right-tailed Test (critical region ω : t0 ≥ tα,m, containing α of the area in the right tail).
Lastly, when we proceed to test H0 against the left-sided alternative
H2 : µ < µ0, we note that (16.12) suggests that if α is small, then the
probability that t0 would be less than the critical value –tα, m is very small. So
if the value of t0 as computed, on the basis of a small sample, is found to be
less than –tα, m, we would doubt the validity of H0 and accept H2. The critical
region
ω : t0 ≤ – tα, m …(16.16)
would lie on the left-tail and the test would be left-tailed test. This is depicted
in Fig. 16.8.
Fig. 16.8: Critical Region for Left-tailed Test (critical region ω : t0 ≤ −tα,m, containing α of the area in the left tail).
4) Whether the sample drawn is a small one. Again if the answer is ‘no’ i.e., n >
30, we would be satisfied with Z-test. However, if n ≤ 30 and the first three
conditions are fulfilled, we should recommend t-test.
t = √n (x̄ − µ)/s|
where n = sample size; x̄ = sample mean; and s| = sample S.D. with divisor as
(n − 1). The test statistic follows t-distribution with (n − 1) d.f.
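A sketch of this small-sample t-test in code (the data are hypothetical, with n = 10 and the two-tailed critical value t0.025,9 = 2.262 read from a t-table):

```python
import math
import statistics

def t_statistic(sample, mu0):
    """t0 = sqrt(n) * (x_bar - mu0) / s|, with s| using divisor n - 1."""
    n = len(sample)
    return math.sqrt(n) * (statistics.mean(sample) - mu0) / statistics.stdev(sample)

# Hypothetical small sample (n = 10), testing H0: mu = 9.8 against H: mu != 9.8
sample = [9.8, 10.2, 9.9, 10.4, 10.1, 9.7, 10.3, 10.0, 9.9, 10.2]
t0 = t_statistic(sample, mu0=9.8)
# Two-tailed critical region at 5% level with 9 d.f.: |t0| >= 2.262
reject = abs(t0) >= 2.262           # H0 is rejected here
```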
tα,m being the upper α-point of the t-distribution with m d.f.; then we reject H0.
In other words, H0 is rejected and H : µ ≠ µ0 is accepted if the observed value
of |t0|, as computed from the sample, exceeds or is equal to the critical value
tα/2,(n−1).
Figure 16.9 shows critical region at 5% level of significance while Figure 16.10
shows critical region at 1% level of significance.
Fig. 16.9: Critical Region for Both-tailed Test at 5% Level of Significance (acceptance region containing 95% of the area; critical regions ω : t0 ≤ −t0.025,(n−1) and ω : t0 ≥ t0.025,(n−1), with 0.025 of the area in each tail).
Fig. 16.10: Critical Region for Both-tailed Test at 1% Level of Significance (critical regions ω : t0 ≤ −t0.005,(n−1) and ω : t0 ≥ t0.005,(n−1), with 0.005 of the area in each tail).
Similarly, for testing H0 against the right-sided alternative H1 : µ > µ0, the
critical region is given by
ω : t0 ≥ tα,(n−1)
i.e., ω : t0 ≥ t0.05,(n−1) at 5% level of significance
and ω : t0 ≥ t0.01,(n−1) at 1% level of significance.
The following Figures 16.11 and 16.12 show these two critical regions.
Fig. 16.11: Critical Region for Right-tailed Test at 5% Level of Significance (critical region ω : t0 ≥ t0.05,(n−1), containing 0.05 of the area).
Fig. 16.12: Critical Region for Right-tailed Test at 1% Level of Significance (critical region ω : t0 ≥ t0.01,(n−1), containing 0.01 of the area).
Lastly, when we test H0 against the left-sided alternative H2 : µ < µ0, the
critical region would be:
ω : t0 ≤ −tα,(n−1)
i.e., ω : t0 ≤ −t0.05,(n−1) at 5% level of significance
and ω : t0 ≤ −t0.01,(n−1) at 1% level of significance.
These are depicted in the following Figure 16.13 and Figure 16.14 respectively.
Fig. 16.13: Critical Region for Left-tailed Test at 5% Level of Significance (critical region ω : t0 ≤ −t0.05,(n−1)).
Fig. 16.14: Critical Region for Left-tailed Test at 1% Level of Significance (critical region ω : t0 ≤ −t0.01,(n−1)).
Illustration 5
A random sample of 13 tins was taken to test the machine. Following were the
weights in kilograms of the 13 tins:
9.7, 9.6, 10.4, 10.3, 9.8, 10.2, 10.4, 9.5, 10.6, 10.8, 9.1, 9.4, 10.7
Solution: Let x denote the weight of the packed tins of oil. Since,
t0 = √n (x̄ − 10)/s|, where x̄ = Σxi/n
and s| = √[ Σ(xi − x̄)²/(n − 1) ] = √[ (Σxi² − n x̄²)/(n − 1) ]
ω : t0 ≤ –tα, (n–1)
Here x̄ = 130.5/13 = 10.0385 and s| = 0.5501. Hence
t0 = √13 (10.0385 − 10)/0.5501 = 0.252
which is greater than −1.782.
As t0 does not fall on the critical region w, we accept H0. So, on the basis of
the given data as obtained from the sample observations, we conclude that the
machine worked in accordance with the given specifications.
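The figures in Illustration 5 can be verified from the raw weights, computing the mean and s| exactly rather than by hand:

```python
import math
import statistics

# Weights (kg) of the 13 sampled tins from Illustration 5
weights = [9.7, 9.6, 10.4, 10.3, 9.8, 10.2, 10.4, 9.5, 10.6, 10.8, 9.1, 9.4, 10.7]
x_bar = statistics.mean(weights)        # about 10.0385 kg
s = statistics.stdev(weights)           # s| with divisor n - 1
t0 = math.sqrt(len(weights)) * (x_bar - 10) / s
# Left-tailed critical region at 5% level with 12 d.f.: reject when t0 <= -1.782
reject = t0 <= -1.782                   # H0 is accepted here
```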
Solution: Let x denote the inner diameter of steel tubes as produced by the
company. We are interested in testing
H0 : µ = 4 against
H : µ ≠ 4
Assuming that x follows normal distribution, we note that the sample size is 15
(<30) and the population S.D. is unknown. All these factors justify the
application of t-distribution. Thus we compute our test statistic as:
t = √n (x̄ − 4)/s|
As given, x̄ = 3.96 and S = 0.032.
∴ s| = √[n/(n − 1)] · S = √(15/14) × 0.032 = 0.033
t0 = √n (x̄ − 4)/s|
So, t0 = √15 (3.96 − 4)/0.033 = −4.695
Hence |t0| = 4.695.
The critical region for this two-tailed test is
ω : |t0| ≥ tα/2,(n−1)
Selecting the level of significance as 1%, from the t-table (Appendix Table-5)
we get t0.01/2,(15−1) = t0.005,14 = 2.977.
Thus, ω : |t0| ≥ 2.977
Since the computed value |t0| = 4.695 falls in ω, we reject H0. Hence the
sample mean is significantly different from the population mean.
Illustration 7
The mean weekly sales of detergent powder in the department stores of the
city of Delhi as produced by a company was 2,025 kg. The company carried
out a big advertising campaign to increase the sales of their detergent powder.
After the advertising campaign, the following figures were obtained from 20
departmental stores selected at random from all over the city (weight in kgs.).
Solution: Let us assume that x represents the weekly sales (in kg) of detergent
powder as produced by the company. If µ denotes the average (i.e., mean)
weekly sales in the city of Delhi, then we would like to test H0 : µ = 2025
against H1 : µ > 2025. The test statistic is
t0 = √n (x̄ − 2025)/s|
and the critical region for the right-sided alternative is given by:
ω : t0 ≥ tα,(n−1)
or ω : t0 ≥ 1.729
[By selecting α = 0.05 and consulting Appendix Table-5, given at the end of
this block, we find that for m = 20 − 1 = 19 and α = 0.05, the value of t is
1.729.]
xi ui = xi – 2000 u i2
2000 0 0
2023 23 529
2056 56 3136
2048 48 2304
2010 10 100
2025 25 625
2100 100 10000
2563 563 316969
2289 289 83521
2005 5 25
2082 82 6724
2056 56 3136
2049 49 2401
2020 20 400
2310 310 96100
2206 206 42436
2316 316 99856
2186 186 34596
2243 243 59049
2013 13 169
Total 2600 762076
From the above table, we have x̄ = 2000 + 2600/20 kg = 2130 kg
s| = √[ (Σui² − n ū²)/(n − 1) ]
= √[ (762076 − 20 × (130)²)/19 ] = 149.3981 kg
As t0 = √n (x̄ − 2025)/s|
∴ t0 = √20 (2130 − 2025)/149.3981 = 3.143
A glance at the critical region suggests that we reject H0 and accept H1. On
the basis of the given sample we, therefore, conclude that the advertising
campaign was successful in increasing the sales of the detergent powder
produced by the company.
Illustration 8
A random sample of 26 items taken from a normal population has the mean as
145.8 and S.D. as 15.62. At 1% level of significance, test the hypothesis that
the population mean is 150.
Solution: Here we would like to test H0 : µ = 150 i.e., the population mean is
150 against H : µ ≠ 150 i.e., the population mean is not 150. As the necessary
conditions for applying t-test are fulfilled, we compute
t0 = √n (x̄ − 150)/s|
and the critical region at 1% level of significance is:
ω : |t0| ≥ 2.787
∴ s| = √[n/(n − 1)] · S = √(26/25) × 15.62 = 15.9293
So, t0 = √26 (145.8 − 150)/15.9293 = −1.344
thereby |t0| = 1.344
As |t0| does not fall in the critical region, H0 is accepted. So on the basis of
the given data, we infer that the population mean is 150.
We may note that there are other factors, such as age, height, food habits,
living conditions etc., which could be attributed to a change in body weight. In
case we apply the drug to the same group of babies, these factors would be
constant, and if there is a significant increase in bodyweights, it would be due to
the treatment, i.e., the application of the restorative, except, may be, the chance
factor. Thus, in order to verify the efficacy of the restorative, the best course
of action would be to take a random sample of babies affected with rickets,
measure their bodyweights before applying the restorative, and take their
bodyweights for a second time, say a couple of months after applying the
restorative. The appropriate test to apply in this case is a paired t-test.
Similarly one may apply paired t-test to verify the necessity of a costly
management training for its sales personnel by recording the sales of the
selected trainees before and after the management training or the validity of
special coaching for a group of educationally backward students by verifying
their progress before and after the coaching programme or the increase in
productivity due to the application of a particular kind of fertiliser by recording
the productivity of a crop before and after applying this particular fertiliser and
so on.
Let us now discuss the theoretical background for the application of paired t-
test. In our earlier discussions, we were emphatic about the observations being
independent of each other. Now we consider a pair of random variables which
are dependent or correlated. Earlier, we considered normal distribution, to be
more precise, univariate normal distribution. Similarly, we may think of bivariate
normal distribution. Let x and y be two random variables following bivariate
normal distribution with mean µ1 and µ2 respectively, standard deviations σ1 and
σ2 respectively and a correlation co-efficient (ρ).
Thus ‘x’ and ‘y’ may be the bodyweight of the babies before and after the
application of the restorative, sales before and after the training programme,
marks of the weak students before and after the coaching, yield of a crop
before and after applying the fertiliser and so on.
Let us consider ‘n’ pairs of observations on ‘x‘ and ‘y’ and denote the ‘n’
pairs by (xi, yi) for i = 1, 2, 3, …, n.
Thus testing H0 : µ1 = µ2 is analogous to testing for the population mean when the
population standard deviation is unknown. In view of our discussion in Section
16.5, if the sample size is small, it is obvious that the appropriate test statistic
would be:
t = √n (ū − µu)/s′u   ……(16.17)
where n = sample size; ū = Σui/n; ui = xi − yi
s′u = √[Σ(ui − ū)²/(n − 1)] = √[(Σui² − n ū²)/(n − 1)]
As before, under H0, t0 = √n ū/s′u follows t-distribution with (n–1) d.f.
Thus for testing H0 : µu = 0 against H1 : µu ≠ 0,
the critical region is provided by :
ω : |t0| ≥ tα/2, (n–1)
For testing H0 against H1 : µ1 > µ2 i.e., H1 : µu > 0
We consider the critical region
ω : t ≥ tα, (n–1)
When the sample size exceeds 30, the assumption of normality for u may be
avoided, and the test statistic √n ū/s′u can be taken as a standard normal
variable; accordingly we may recommend the Z-test.
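The paired t-test just described can be sketched in code. The before/after figures below are made-up illustrative numbers (not data from this unit), chosen only to show the mechanics of computing ū, s′u and t0:

```python
import math

# Hypothetical paired observations, e.g. a measurement before and after a treatment
before = [120, 122, 118, 130, 125]
after  = [118, 120, 119, 126, 121]

n = len(before)
u = [x - y for x, y in zip(before, after)]   # u_i = x_i - y_i
u_bar = sum(u) / n                           # mean difference

# s'_u = sqrt((sum u_i^2 - n * u_bar^2) / (n - 1)), as in the text
s_u = math.sqrt((sum(ui * ui for ui in u) - n * u_bar ** 2) / (n - 1))

# Paired t statistic: t0 = sqrt(n) * u_bar / s'_u, with (n - 1) = 4 d.f.
t0 = math.sqrt(n) * u_bar / s_u
print(round(t0, 3))   # compare against t_alpha,(n-1) from Appendix Table-5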
Illustration 9
Is it reasonable to believe that the drug has no effect on the change of blood-
pressure?
Solution: Let x denote blood-pressure before applying the drug and y, the
blood-pressure after applying the drug. Further let µ1 denote the average blood-
pressure in the population before applying the drug and µ2, the average blood-
pressure after applying the drug. Thus the problem is reduced to testing
H0 : µ1 = µ2 against H1 : µ1 > µ2 i.e., the drug reduces blood-pressure. Under H0,
t0 = √n ū/s′u follows t-distribution with (n–1) d.f.
Thus the critical region would be
ω : t0 ≥ tα, (n–1)
or ω : t0 ≥ 1.895
By taking α = 0.05, tα, (n–1) = t0.05, 7 = 1.895 from Appendix Table-5.
From the given data, we find that n = 8, Σui = 6, Σui2 = 120
Hence ū = Σui/n = 6/8 = 0.75
and s′u = √[(Σui² − n ū²)/(n − 1)] = √[(120 − 8 × (0.75)²)/7] = √16.5 = 4.062
∴ t0 = √n ū/s′u = √8 × 0.75/4.062 = 0.522
Looking at the critical region, we find that H0 is accepted. Thus on the basis of
the given data we conclude that the drug has been unsuccessful in reducing
blood-pressure.
Illustration 10
A group of students was selected at random from the set of weak students in
statistics. They were given intensive coaching for three months. The marks in
statistics before and after the coaching are shown below.
Serial No. of student	Marks before coaching	Marks after coaching
1	19	32
2	38	36
3	28	30
4	32	30
5	35	40
6	10	25
7	15	30
8	29	20
9	16	15
Solution: Let x and y denote the marks in statistics before and after the
coaching respectively. If the corresponding mean marks in the population be µ1
and µ2 respectively, then we are to test :
H0 : µ1 = µ2 i.e., the coaching has not improved the standard of the students,
against the alternative hypothesis H1 : µ1 < µ2 i.e., the coaching has improved it.
We compute :
t0 = √n ū/s′u, which follows t-distribution with (n–1) d.f. under H0, and
s′u = √[(Σui² − n(ū)²)/(n − 1)]
since α = 0.05, n = 9,
consulting Appendix Table-5, we find that t0.05 , 8 = 1.86.
Thus the left-sided critical region is provided by ω : t0 ≤ –1.86.
Marks in Statistics
Serial No. of student	Before coaching (xi)	After coaching (yi)	ui = xi – yi	ui²
1	19	32	–13	169
2	38	36	2	4
3	28	30	–2	4
4	32	30	2	4
5	35	40	–5	25
6	10	25	–15	225
7	15	30	–15	225
8	29	20	9	81
9	16	15	1	1
Total	–	–	–36	738
Thus ū = Σui/n = –36/9 = –4
s′u = √[(738 − 9 × (−4)²)/8] = 8.6168
∴ t0 = √9 × (−4)/8.6168 = −1.393
A glance at the critical region suggests that we accept H0. On the basis of the
given data, therefore, we infer that the coaching has failed to improve the
standard of the students.
Illustration 11
Serial number	Sales (’000 Rs.)
of trainee	Before the course	After the course
1 15 16
2 16 17
3 13 19
4 20 18
5 18 22.5
6 17 18.3
7 16 19.2
8 19 18
9 20 20
10 15.5 16
11 16.2 17
12 15.8 17
13 18.7 20
14 18.3 18
15 20 22
Was the training programme effective in promoting sales? Select α = 0.05.
Solution: We are to test H0 : µ1 = µ2 against
H1 : µ1 < µ2
µ1 and µ2 being the average sales in the population before the training and
after the training. As before the critical region is :
ω : t0 ≤ –1.761
as m = n–1 = 14 and t0.05, 14 = 1.761
Serial number of trainee	Before the course	After the course	ui = xi – yi	ui²
1	15	16	–1	1
2	16	17	–1	1
3	13	19	–6	36
4	20	18	2	4
5	18	22.5	–4.5	20.25
6	17	18.3	–1.3	1.69
7	16	19.2	–3.2	10.24
8	19	18	1	1
9	20	20	0	0
10	15.5	16	–0.5	0.25
11	16.2	17	–0.8	0.64
12	15.8	17	–1.2	1.44
13	18.7	20	–1.3	1.69
14	18.3	18	0.3	0.09
15	20	22	–2	4
Total	–	–	–19.5	83.29
s′u = √[(Σui² − n ū²)/(n − 1)]
From the above table, we have
ū = −19.5/15 = −1.3
s′u = √[(83.29 − 15 × (−1.3)²)/14] = 2.0343
Hence t0 = √n ū/s′u = √15 × (−1.3)/2.0343 = −2.475
t0 being less than –1.761, we reject H0. Thus on the basis of the given sample,
we conclude that the training programme was effective in promoting sales.
Illustration 12
Six pairs of husbands and wives were selected at random and their IQs were
recorded as follows:
Pair : 1 2 3 4 5 6
IQ of Husband : 105 112 98 92 116 110
IQ of Wife : 102 108 100 96 112 110
Do the data suggest that there is no significant difference in average IQ
between the husband and wife? Use 1% level of significance.
Solution: Let x denote the IQ of husband and y, that of wife. We would like
to test
H0 : µ1 = µ2 i.e., there is no difference in IQ, against H1 : µ1 ≠ µ2.
The two-sided critical region at 1% level of significance is
ω : |t0| ≥ t0.005, (6−1)
i.e., ω : |t0| ≥ t0.005, 5 = 4.032
Pair	IQ of Husband (xi)	IQ of Wife (yi)	ui = xi – yi	ui²
1	105	102	3	9
2	112	108	4	16
3	98	100	–2	4
4	92	96	–4	16
5	116	112	4	16
6	110	110	0	0
Total	–	–	5	61
ū = Σui/n = 5/6 = 0.8333
s′u = √[(Σui² − n(ū)²)/(n − 1)] = √[(61 − 6 × (0.8333)²)/5] = 3.3715
so, t0 = √n ū/s′u = √6 × 0.8333/3.3715 = 0.605
Therefore, we accept H0 and conclude that, on the basis of the given sample,
there is no reason to believe that IQs of husbands and wives are different.
2) Describe the different steps one should undertake in order to apply t-test.
....................................................................................................................
....................................................................................................................
....................................................................................................................
5) A certain diet was introduced to increase the weight of pigs. A random sample of
12 pigs was taken and weighed before and after applying the new diet. The
differences in weights were :
7, 4, 6, 5, – 6, – 3, 1, 0, –5, –7, 6, 2
Can we conclude that the diet was successful in increasing the weight of the pigs?
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
We have concluded our discussion by describing the paired t-test and the critical
regions for t-tests applied to dependent samples: ω : |t0| ≥ tα/2, (n–1) for the
two-sided alternative, ω : t0 ≥ tα, (n–1) for the right-sided alternative, and
ω : t0 ≤ –tα, (n–1) for the left-sided alternative.
16.9 KEY WORDS AND SYMBOLS
Chi-square Distribution: If x1, x2, …, xm are ‘m’ independent standard normal
variables, then u = Σxi² follows χ²-distribution with m d.f. and this is denoted by
u ~ χ²m.
Degree of Freedom (d.f.): no. of observations – no. of constraints.
Large Sample: when sample size (n) is more than 30.
Large Sample Tests or Approximate Tests: tests based on large samples.
Paired Samples: Another term used for dependent samples.
Small Sample: when sample size (n) does not exceed 30.
Small Sample Tests or Exact Tests: tests based on small samples only.
t-distribution: If x is a standard normal variable and u is a chi-square variable
with m d.f., and x and u are independent, then the ratio
t = x/√(u/m)
follows t-distribution with m d.f. and is denoted by t ~ tm.
100 (1–α)% confidence interval for µ:
[x̄ − tα/2, (n−1) × s′/√n, x̄ + tα/2, (n−1) × s′/√n]
For testing the population mean from independent samples, we use the test statistic
t0 = √n (x̄ − µ0)/s′
and for testing for a particular effect (paired samples), we use
t0 = √n ū/s′u
where µ0 = specified value of the mean; s′ = sample S.D. with (n–1) divisor;
ū = mean of u = x – y, the difference in the paired sample; and s′u = sample S.D.
of u with (n–1) divisor.
5. No, t0 = – 0.226
6. No, t0 = 0.518
2) How would you distinguish between a t-test for independent sample and a paired
t-test?
6) A technician is making engine parts with axle diameter of 0.750 inch. A random
sample of 14 parts shows a mean diameter of 0.763 inch and a S.D. of 0.0528
inch.
7) St. Nicholas college has 500 students. The heights (in cm.) of 11 students chosen
at random provides the following results:
175, 173, 165, 170, 180, 163, 171, 174, 160, 169, 176
Determine the limits of mean height of the students of St. Nicholas college at 1%
level of significance.
(Ans: 164.6038 cm. and 176.4870 cm.)
8) For a sample of 15 units drawn from a normal population of 150 units, the mean
and S.D. are found to be 10.8 and 3.2 respectively. Find the confidence level for
the following confidence intervals.
(i) 9.415, 12.185
(ii) 9.113, 12.487
[Ans: (i) 90% (ii) 95%]
10) The following data relates to the sales of a new type of toothpaste in 15 selected
shops before and after a special sales promotion campaign.
Shop No.:	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
Sales before campaign (’000 Rs.):	15	16	13	14	18	19	12	16	20	11	12	9	15	17	21
Sales after campaign (’000 Rs.):	17	17	12	15	20	19	14	15	24	12	10	12	18	17	34
11. A suggestion was made that husbands are more intelligent than wives. A social
worker took a sample of 12 couples and applied I.Q. Tests to both husbands and
wives. The results are shown below:
Sl.No. I.Q. of
Husbands Wives
1. 110 115
2. 115 113
3. 102 104
4. 98 90
5. 90 93
6. 105 103
7. 104 106
8. 116 118
9. 109 110
10. 111 110
11. 87 100
12. 100 98
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
16.12 FURTHER READING
The following text books may be used for a more in-depth study of the topics
dealt with in this unit.
Levin and Rubin, 1996, Statistics for Management, Prentice-Hall of India Pvt. Ltd.,
New Delhi.
Hooda, R.P., 2000, Statistics for Business and Economics, MacMillan India Ltd.,
Delhi.
Gupta, S.P., 1999, Statistical Methods, Sultan Chand & Sons, New Delhi.
CHI-SQUARE TEST
STRUCTURE
17.0 Objectives
17.1 Introduction
17.2 Chi-Square Distribution
17.3 Chi-Square Test for Independence of Attributes
17.4 Chi-Square Test for Goodness of Fit
17.5 Conditions for Applying Chi-Square Test
17.6 Cells Pooling
17.7 Yates Correction
17.8 Limitations of Chi-Square Test
17.9 Let Us Sum Up
17.10 Key Words
17.11 Answers to Self Assessment Exercises
17.12 Terminal Questions/Exercises
17.13 Further Reading
Appendix Tables
17.0 OBJECTIVES
After studying this unit, you should be able to:
l explain and interpret interaction among attributes,
l use the chi-square distribution to see if two classifications of the same data
are independent of each other,
l use the chi-square statistic in developing and conducting tests of goodness-
of-fit, and
l analyse the independence of attributes by using the chi-square test.
17.1 INTRODUCTION
Chi-square tests enable us to test whether more than two population proportions
are equal. Also, if we classify a consumer population into several categories
(say high/medium/low income groups and strongly prefer/moderately prefer/
indifferent/do not prefer a product) with respect to two attributes (say consumer
income and consumer product preference), we can then use chi-square test to
test whether two attributes are independent of each other. In this unit you will
learn the chi-square test, its applications and the conditions under which the chi-
square test is applicable.
17.2 CHI-SQUARE DISTRIBUTION
The chi-square distribution is a probability distribution. Under some proper
conditions the chi-square distribution can be used as a sampling distribution of
chi-square. You will learn about these conditions in section 17.5 of this unit.
The chi-square distribution is known by its only parameter – number of degrees
of freedom. The meaning of degrees of freedom is the same as the one you
have used in student t-distribution. Figure 17.1 shows the three different chi-
square distributions for three different degrees of freedom.
Figure 17.1. Chi-Square Sampling Distributions for df = 2, 3 and 4
It is to be noted that when the degrees of freedom are very small, the chi-square
distribution is heavily skewed to the right. As the number of degrees of
freedom increases, the curve rapidly approaches a symmetric distribution. You
may be aware that when the distribution is symmetric, it can be approximated
by normal distribution. Therefore, when the degrees of freedom increase
sufficiently, the chi-square distribution approximates the normal distribution. This
is illustrated in Figure 17.2.
Figure 17.2. Chi-Square Sampling Distributions for df = 2, 4, 10, and 20
Like the student t-distribution, there is a separate chi-square distribution for each
number of degrees of freedom. Appendix Table-1 gives the most commonly
used tail areas that are used in tests of hypothesis using the chi-square distribution.
We will explain how to use this table to test hypotheses when we deal with
examples in the subsequent sections of this unit.
17.3 CHI-SQUARE TEST FOR INDEPENDENCE OF ATTRIBUTES
Illustration 1
Suppose in our example of consumer preference explained above, we divide
India into 6 geographical regions (south, north, east, west, central and north-
east). We also have two brands of a product, brand A and brand B.
The survey results can be classified according to the region and brand
preference as shown in the following table.
Consumer preference
Region Brand A Brand B Total
South 64 16 80
North 24 6 30
East 23 7 30
West 56 44 100
Central 12 18 30
North-east 12 18 30
Total 191 109 300
These survey results are referred to as the observed frequencies. Using this data
we have to determine whether or not the consumer's geographical location (region)
matters for brand preference. Here the null hypothesis (H0) is that the brand
preference is not related to the geographical region. In other words, the null
hypothesis is that the two
are independent. As a basis of comparison, we use the sample results that
would be obtained on the average if the null hypothesis of independence was
true. These hypothetical data are referred to as the expected frequencies.
The expected frequency of a cell is (row total × column total)/sample size. For
example, the expected frequency of the cell in row-1 and column-1 of the 6x2
brand preference contingency table referred to earlier is:
E = (80 × 191)/300 = 15280/300 = 50.93
Accordingly, the following table gives the calculated expected frequencies for
the rest of the cells of the 6x2 contingency table.
Consumer Preference
Region	Brand A	Brand B	Total
South	(80×191)/300 = 50.93	(80×109)/300 = 29.07	80
North	(30×191)/300 = 19.10	(30×109)/300 = 10.90	30
East	(30×191)/300 = 19.10	(30×109)/300 = 10.90	30
West	(100×191)/300 = 63.67	(100×109)/300 = 36.33	100
Central	(30×191)/300 = 19.10	(30×109)/300 = 10.90	30
North-east	(30×191)/300 = 19.10	(30×109)/300 = 10.90	30
Total	191	109	300
We use the following formula for calculating the chi-square value:
χ² = Σ (Oi − Ei)²/Ei
1) Subtract Ei from Oi for each of the 12 cells and square each of these differences,
(Oi–Ei)².
2) Divide each squared difference by Ei and obtain the total, i.e., Σ (Oi − Ei)²/Ei.
This gives the value of chi-square, which can range from zero to infinity. Thus,
the value of χ² is always non-negative.
Illustration 2
A TV channel programme manager wants to know whether there are any
significant differences between male and female viewers in the type of
programme they watch. A survey conducted for the purpose gives the
following results.
Type of TV Viewers Sex
programme Male Female Total
News 30 10 40
Serials 20 40 60
Total 50 50 100
Since we have a 2x2 contingency table, the degrees of freedom will be (r–1) ×
(c–1) = (2–1) × (2–1) = 1× 1 = 1. At 1 degree of freedom and 0.10
significance level the table value (from Appendix Table-4) is 2.706. Since the
calculated χ² value (16.67) is greater than the table value of χ² (2.706), we reject
the null hypothesis and conclude that the type of TV programme is dependent
on viewers' sex. Note that whenever the value of χ² is greater than the table
value of χ², the difference between theory and observation is significant.
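As a quick cross-check of the χ² value for this 2 × 2 table (the exact value is 50/3 ≈ 16.67), the same expected-frequency computation can be sketched as:

```python
# Observed frequencies: rows = programme type, columns = (male, female)
observed = [[30, 10],   # News
            [20, 40]]   # Serials

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)   # 100 viewers

# Expected frequencies under independence: (row total x column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_r, e_r in zip(observed, expected)
           for o, e in zip(o_r, e_r))

print(round(chi2, 2))   # 16.67 > 2.706, so H0 is rejected
```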
17.4 CHI-SQUARE TEST FOR GOODNESS OF FIT
The logic inherent in the chi-square test allows us to compare the observed
frequencies (Oi) with the expected frequencies (Ei). The expected frequencies
are calculated on the basis of our theoretical assumptions about the population
distribution. Let us explain the procedure of testing by going through some
illustrations.
Illustration 3
A salesman has 3 products to sell and there is a 40% chance of selling each
product when he meets a customer. The following is the frequency distribution
of sales.
H0: The sales of the three products follow a binomial distribution with P = 0.40.
H1: The sales of the three products do not follow a binomial distribution with P = 0.40.
x (number of products sold)	Probability
0	0.216
1	0.432
2	0.288
3	0.064
Total	1.000
We now calculate the expected frequency of sales for each situation. There are
130 customers visited by the salesman. We multiply each probability by 130 (no.
of customers visited) to arrive at the respective expected frequency. For
example, 0.216 × 130 = 28.08.
The following table shows the observed frequencies and the expected
frequencies.
χ² = Σ (Oi − Ei)²/Ei
Illustration 4
In order to plan how much cash to keep on hand, a bank manager is interested
in seeing whether the average deposit of a customer is normally distributed with
mean Rs. 15000 and standard deviation Rs. 6000. The following information is
available with the bank.
Calculate the χ² statistic and test whether the data follow a normal distribution
with mean Rs. 15000 and standard deviation Rs. 6000 (take the level of
significance as 0.10).
H0: The sample data of deposits is from a population having normal distribution
with mean Rs.15000 and standard deviation Rs.6000.
H1: The sample data of deposits is not from a population having normal
distribution with mean Rs.15000 and standard deviation Rs.6000.
The expected frequencies are calculated by multiplying the area under the
respective normal curve and the total sample size (n = 150).
For example, to obtain the area for deposits less than Rs. 10000, we calculate
the normal deviate as follows:
z = (x − µ)/σ = (10000 − 15000)/6000 = −0.83
From Appendix Table-3 (given at the end of this unit), this value (–0.83)
corresponds to a lower tail area of 0.5000 – 0.2967 = 0.2033. Multiplying 0.2033
by the sample size (150), we obtain the expected frequency 0.2033 × 150 =
30.50 depositors.
The calculations of the remaining expected frequencies are shown in the
following table.
We should note that from Appendix Table-3, for 0.83 the area to the left of x is
0.5000 + 0.2967 = 0.7967, and for ∞ the area to the left of x is 0.5000 + 0.5000 =
1.0000. Similarly, the area of the deposit range for normal deviate 0.83 is
0.7967 – 0.2033 = 0.5934, and for ∞ it is 1.0000 – 0.7967 = 0.2033.
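The tail-area lookup can be reproduced with the error function instead of the printed table. This sketch uses the rounded normal deviate −0.83, as in the text:

```python
import math

def phi(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, n = 15000, 6000, 150

# Normal deviate for a deposit of Rs. 10000, rounded to two places as in the text
z = round((10000 - mu) / sigma, 2)   # -0.83
area = phi(z)                        # lower tail area, about 0.2033
expected = area * n                  # about 30.5 depositors

print(round(area, 4), round(expected, 1))
```

The remaining class intervals follow the same pattern: take the difference of the two Φ values bounding the interval and multiply by the sample size.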
Once the expected frequencies are calculated, the procedure for calculating χ2
statistic will be the same as we have seen in illustration 3.
χ² = Σ (Oi − Ei)²/Ei
The following table gives the calculation of chi-square.
conclude that the data are well described by the normal distribution with mean
= Rs. 15000 and standard deviation = Rs. 6000.
Illustration 5
A small car company wishes to determine the frequency distribution of
warranty financed repairs per car for its new model car. On the basis of past
experience the company believes that the pattern of repairs follows a Poisson
distribution with mean number of repairs (λ) as 3. A sample of 400
observations is provided below:
No. of repairs per car:	0	1	2	3	4	5 or more
No. of cars:	20	57	98	85	78	62
H0: The number of repairs per car during warranty period follows a Poisson
probability distribution.
H1: The number of repairs per car during warranty period does not follow a Poisson
probability distribution.
As usual the expected frequencies are determined by multiplying the probability
values (in this case Poisson probability) by the total sample size of observed
frequencies. Appendix Table-2 provides the Poisson probability values. For
λ = 3.0 and for different x values we can directly read the probability values.
For example for λ = 3.0 and x = 0 the Poisson probability value is 0.0498, for
λ = 3.0 and x = 1 the Poisson probability value is 0.1494 and so on … .
x	Poisson probability	Expected frequency (400 × probability)
4	0.1680	67.20
5 or more	0.1848	73.92
Total	1.0000	400
It is to be noted that from Appendix Table-2 for λ = 3.0 we have taken the
Poisson probability values directly for x = 0,1,2,3 and 4. For x = 5 or more we
added the rest of the probability values (for x = 5 to x = 12) so that the sum
of all the probabilities for x = 0 to x = 5 or more will be 1.0000.
As usual we use the following formula for calculating the chi-square (χ²) value:
χ² = Σ (Oi − Ei)²/Ei
The following table gives the calculated χ2 value
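The calculation follows the same pattern as Illustration 3 and can be sketched as follows, computing the Poisson probabilities directly rather than reading them from Appendix Table-2:

```python
import math

lam = 3.0
observed = [20, 57, 98, 85, 78, 62]   # x = 0, 1, 2, 3, 4, 5 or more
n = sum(observed)                     # 400 cars

# Poisson probabilities for x = 0..4; the last class takes the remaining mass
probs = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(5)]
probs.append(1.0 - sum(probs))        # P(X >= 5), about 0.1847

expected = [p * n for p in probs]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(chi2, 2))   # about 4.79, with 6 - 1 = 5 d.f.
```

With χ² ≈ 4.79 below the tabulated value for 5 d.f. at the usual significance levels, the Poisson model would not be rejected for this data.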
Illustration 6
In order to know the brand preference between two washing detergents, a sample
of 1000 consumers was surveyed. 56% of the consumers preferred Brand X
and 44% preferred Brand Y. Do these data conform to the idea that consumers
have no special preference for either brand? Take the significance level as 0.05.
Solution: If consumers have no special preference, the expected frequencies are
500 each, while the observed frequencies are 560 and 440. Hence
χ² = (560 − 500)²/500 + (440 − 500)²/500 = 7.2 + 7.2 = 14.4.
The table value (by consulting Appendix Table-4) at 5% significance level
and n–1 = 2–1 = 1 degree of freedom is 3.841. Since the calculated χ² of
14.4 is greater than the table value, we reject the null hypothesis and
conclude that the brand names have special significance for consumer
preference.
17.5 CONDITIONS FOR APPLYING CHI-SQUARE TEST
a) Random Sample: In the chi-square test, the data set used is assumed to be a random
sample that represents the population. As with all significance tests, if you have
random sample data that represent the population, then any differences between the
table values and the calculated values are real and therefore significant. On the
other hand, if you have non-random sample data, significance cannot be established,
though the tests are nonetheless sometimes utilised as crude “rules of thumb” anyway.
For example, we reject the null hypothesis if the difference between observed
and expected frequencies is too large. But if the chi-square value is zero, we
should be careful in interpreting that absolutely no difference exists between
observed and expected frequencies. We should then verify the quality of the data
collected, i.e., whether the sample data represent the population or not.
b) Large Sample Size: To use the chi-square test you must have a sample size
large enough to guarantee the similarity between the theoretical distribution
and the sampling distribution of the chi-square statistic. Applying the chi-
square test to small samples exposes the researcher to an unacceptable rate
of type-II errors. However, there is no accepted cutoff sample size; many
researchers set the minimum sample size at 50. Remember that the chi-square
test statistic must be calculated on actual count data (nominal, ordinal or
interval data) and not on percentages, which would have the effect
of projecting the sample size as 100.
c) Adequate Cell Sizes: You have seen above that small sample size leads to
type-II error. That is, when the expected cell frequencies are too small, the
value of chi-square will be overestimated. This in turn will result in too
many rejections of the null hypothesis. To avoid making incorrect inferences
from chi-square tests we follow a general rule that the expected frequency
in any cell should be a minimum of 5.
17.6 CELLS POOLING
Illustration 7
A company marketing manager wishes to determine whether there are any
significant differences between regions in terms of a new product acceptance.
The following is the data obtained from interviewing a sample of 190
consumers.
Degree of Region
acceptance South North East West Total
Strong 30 25 20 30 105
Moderate 15 15 20 20 70
Poor 5 10 0 0 15
Total 50 50 40 50 190
Calculate the chi-square statistic. Test the independence of the two attributes at
the 0.05 level of significance.
Solution: Under the null hypothesis of independence, the expected frequencies are:
Degree of	Region
acceptance	South	North	East	West	Total
Strong	27.63	27.63	22.11	27.63	105
Moderate	18.42	18.42	14.74	18.42	70
Poor	3.95	3.95	3.16	3.95	15
Total	50	50	40	50	190
Since the expected frequencies (cell values) in the third row are less than 5, we
pool the third row with the second row of both observed frequencies and
expected frequencies. The revised observed frequency and expected frequency
tables are given below.
Degree of Region
acceptance South North East West Total
Strong 30 25 20 30 105
Moderate and 20 25 20 20 85
poor
Total 50 50 40 50 190
Degree of Region
acceptance South North East West Total
Strong 27.63 27.63 22.11 27.63 105
Moderate and 22.37 22.37 17.89 22.37 85
poor
Total 50 50 40 50 190
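The pooling step and the resulting χ² can be sketched as follows, with cell values taken from the revised tables above (a plain-Python sketch; names are our own):

```python
# Revised observed frequencies after pooling 'poor' into 'moderate'
observed = [[30, 25, 20, 30],    # Strong
            [20, 25, 20, 20]]    # Moderate and poor

row_totals = [sum(r) for r in observed]   # 105 and 85
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)                       # 190 consumers

# Expected frequencies under independence: (row total x column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_r, e_r in zip(observed, expected)
           for o, e in zip(o_r, e_r))

# After pooling the table is 2 x 4, so d.f. = (2-1) * (4-1) = 3
print(round(chi2, 2))   # about 1.92, well below the 0.05 table value of 7.815
```

With χ² ≈ 1.92 < 7.815, the null hypothesis of independence between region and degree of acceptance would be accepted.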
Illustration 8
The following table gives the number of typing errors per page in a 40-page
report. Test whether the typing errors per page follow a Poisson distribution with
mean (λ) number of errors equal to 3.0.
Since the expected frequencies of the first row are less than 5, we pool first
and second rows of observed and expected frequencies. Similarly, the expected
frequencies of the last 6 rows (with 5,6,7,8,9, and 10 or more errors) are less
than 5. Therefore we pool these rows with the row having the expected typing
errors as 4 or more.
As usual we use the following formula for calculating the chi-square (χ2) value.
χ² = Σ (Oi − Ei)²/Ei
The following table gives the calculated χ² value after pooling cells.
17.7 YATES CORRECTION
Suppose for a 2 × 2 contingency table, the four cell values a, b, c and d are
arranged in the following order.
a	b
c	d
Then the chi-square statistic with Yates correction for continuity is
χ² = N (|ad − bc| − N/2)² / [(a+b) (c+d) (a+c) (b+d)]
where N = a + b + c + d is the total sample size.
Illustration 9
Suppose we have the following data on the consumer preference of a new
product collected from the people living in north and south India.
South India North India Row total
Number of consumers who 4 51 55
prefer present product
Number of consumers who 14 38 52
prefer new product
Column total: 18 89 107
Do the data suggest that the new product is preferred by the people
independent of their region? Use α = 0.05.
H0: PS = PN (the proportion of people who prefer new product among south and north
India are the same).
H1: PS ≠ PN (the proportion of people who prefer new product among south and north
India are not the same).
In this illustration, (i) the sample size (n) = 107 (ii) the cell values are: a = 4,
b = 51, c = 14, d = 38, (iii) The corresponding row totals are: (a + b) = 55 and
(c + d) = 52, and column totals are (a + c) = 18 and (b + d) = 89.
χ² = 107 (|4 × 38 − 51 × 14| − 107/2)² / (55 × 52 × 18 × 89)
   = 107 (|152 − 714| − 53.5)² / 4581720
   = 107 × (508.5)² / 4581720 = 6.0386
The table value of χ² for (2−1)(2−1) = 1 degree of freedom and significance level
α = 0.05 is 3.841. Since the calculated chi-square value of 6.0386 is greater
than the table value, we reject H0, accept H1, and conclude that the preference
for the new product is not independent of the geographical region.
It may be observed that when N is large, the Yates correction will not make much
difference in the chi-square value. However, when N is small, the Yates
correction may overcorrect and overstate the probability.
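The Yates-corrected shortcut formula for a 2 × 2 table can be checked numerically. A minimal Python sketch, using the Illustration 9 cell values:

```python
def yates_chi2(a, b, c, d):
    """Chi-square for a 2x2 contingency table with Yates continuity
    correction: chi2 = N(|ad - bc| - N/2)^2 / [(a+b)(c+d)(a+c)(b+d)]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

chi2 = yates_chi2(4, 51, 14, 38)   # Illustration 9 data; chi2 is about 6.04
```

Since this value exceeds the table value 3.841 at 1 degree of freedom and α = 0.05, H0 is rejected, in agreement with the conclusion of the illustration.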
a) As explained in Section 17.5 (conditions for applying the chi-square test), the
chi-square test is highly sensitive to the sample size. As the sample size increases,
absolute differences become a smaller and smaller proportion of the expected value. This means
that a reasonably strong association may not come up as significant if the sample
size is small. Conversely, in a large sample we may find statistical significance
even when the findings are small and unimportant. That is, the findings are not
substantively significant, although they are statistically significant.
b) The chi-square test is also sensitive to small frequencies in the cells of a
contingency table. Generally, when the expected frequency in a cell of a table is
less than 5, chi-square can lead to erroneous conclusions, as explained in
Section 17.5. The rule of thumb here is that if either (i) an expected value in a
cell of a 2 × 2 contingency table is less than 5, or (ii) the expected values of
more than 20% of the cells in a table larger than 2 × 2 are less than 5, then the
chi-square test should not be applied directly. If a chi-square test must be
applied, then either the Yates correction or cell pooling should also be applied,
as appropriate.
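The rule of thumb in (b) can be expressed as a small check. This is an illustrative sketch (the function name and the example expected values are hypothetical, not from the original text):

```python
def chi_square_ok(expected, rows, cols):
    """Return True if the chi-square test may be applied directly.

    Rule of thumb: for a 2 x 2 table, no expected value may be below 5;
    for larger tables, at most 20% of the cells may have expected
    values below 5."""
    small = sum(1 for e in expected if e < 5)
    if rows == 2 and cols == 2:
        return small == 0
    return small <= 0.20 * len(expected)

print(chi_square_ok([3.2, 6.8, 10.0, 12.0], 2, 2))   # prints False: one cell below 5
```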
c) No directional hypothesis is assumed in the chi-square test. Chi-square tests the
null hypothesis that any relationship between the two attributes/variables arises
only by chance. That is, if a significant relationship is found, this is not
equivalent to establishing the researcher's hypothesis that attribute A causes
attribute B or that attribute B causes attribute A.
Self Assessment Exercise B
1) While calculating the expected frequencies of a chi-square distribution it was found
that some of the cells of expected frequencies have value below 5. Therefore,
some of the cells are pooled. The following statements tell you the size of the
contingency table before pooling and the rows/columns pooled. Determine the
number of degrees of freedom.
a) 5 × 4 contingency table. First two and last two rows are pooled.
b) 4 × 6 contingency table. First two and last two columns are pooled.
c) 6 × 3 contingency table. First two rows are pooled. 4th, 5th, and 6th rows
are pooled.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
The chi-square test for goodness-of-fit establishes whether the sample data
support the assumption that a particular distribution applies to the parent
population. It should be noted that many statistical procedures are based on
assumptions such as a normal distribution of the population. A chi-square
procedure allows us to test the null hypothesis that a particular distribution
applies. We also use the chi-square test to test whether the classification
criteria are independent or not.
When performing a chi-square test using contingency tables, it is assumed that all
cell frequencies are a minimum of 5. If this assumption is not met, we may use
the pooling method, but there is then a loss of information. In a 2 × 2
contingency table, if one or more cell frequencies are less than 5, we should
apply the Yates correction when computing the chi-square value.
In a chi-square test for goodness of-fit, the degrees of freedom are number of
categories – 1 (n–1). In a chi-square test for independence of attributes, the
degrees of freedom are (number of rows–1) × (number of columns–1). That is,
(r–1) × (c–1).
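These degrees-of-freedom rules, including the effect of pooling, can be sketched as follows (an illustrative sketch; the function names are not from the original text):

```python
def df_goodness_of_fit(n_categories):
    """Degrees of freedom for a goodness-of-fit test: categories - 1."""
    return n_categories - 1

def df_independence(rows, cols):
    """Degrees of freedom for a test of independence: (r - 1)(c - 1)."""
    return (rows - 1) * (cols - 1)

# Pooling reduces the number of rows/columns before computing df.
# E.g. a 5 x 4 table whose first two and last two rows are pooled
# becomes a 3 x 4 table:
print(df_independence(5 - 2, 4))   # prints 6
```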
Cells Pooling: When a contingency table contains one or more cells with
expected frequency less than 5, we combine two rows or columns before
calculating χ2. We combine these cells in order to get an expected frequency of
5 or more in each cell.
Chi-Square Distribution: A family of probability distributions, each characterised
by its degrees of freedom, used to test a number of different hypotheses about
variances, proportions and distributional goodness of fit.
Goodness of Fit: The chi-square test procedure used to validate our assumption
about a probability distribution is called a goodness-of-fit test.
3. b. H0: The preference for the brand is distributed independent of the consumers’
education level.
3. c. The table value of χ² at 3 d.f. and α = 0.05 is 7.815. Since the calculated value
(7.1178) is less than the table value of χ² (7.815), we accept H0.
B) 1. a) 6, b) 9, c) 4
2. a) 20.090, b) 22.362, c) 23.542, d) 8.558
3. i) Poisson probabilities and expected values
No. of repairs    Poisson probability    Expected frequency
per car (x)                              Ei = (2) × 150
(1)               (2)                    (3)
0 0.0498 7.47
1 0.1494 22.41
2 0.2240 33.60
3 0.2240 33.60
4 0.1680 25.20
5 or more 0.1848 27.72
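The expected-frequency column above can be reproduced with a few lines of Python (a sketch using only the standard library):

```python
import math

lam, n = 3.0, 150   # Poisson mean and total number of cars

# P(X = x) for x = 0..4, plus an open-ended "5 or more" class.
probs = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(5)]
probs.append(1.0 - sum(probs))          # P(X >= 5)

expected = [p * n for p in probs]
# expected[0] is about 7.47 and expected[1] about 22.41,
# matching column (3) of the table above.
```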
3.iii) At the 0.05 significance level and 4 degrees of freedom, the table value is 9.488.
Since the calculated chi-square value is greater than the table value, we reject
the null hypothesis that the number of repairs per car follows a Poisson
distribution.
9) The following table gives the number of telephone calls attended by a credit card
information attendant.
Day: Sunday Monday Tuesday Wednesday Thursday Friday Saturday
No. of calls attended: 45 50 24 36 33 27 42
Test whether the telephone calls are uniformly distributed over the days. Use the
0.10 significance level.
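As a sketch of how such a uniformity test proceeds (a hint, not a model answer): under the null hypothesis each day's expected count is the total divided by 7, and the usual χ² statistic is compared with the table value at 7 − 1 = 6 degrees of freedom.

```python
calls = [45, 50, 24, 36, 33, 27, 42]    # Sunday..Saturday
total = sum(calls)                       # 257 calls in all
expected = total / len(calls)            # uniform: same expectation each day

chi2 = sum((o - expected) ** 2 / expected for o in calls)
# chi2 is about 14.8, to be compared with the table value at 6 d.f.
```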
10) The following data give the preference of car makes by type of customer.
(a) Test the independence of the two attributes. Use 0.05 level of significance.
(b) Draw your conclusions.
11) A bath soap manufacturer introduced a new brand of soap in 4 colours. The
following data give information on consumer preference for the brand (the four
numeric columns correspond to the four colours; the last column is the row total).
Good 20 10 20 30 80
Fair 20 10 10 30 70
Poor 10 45 35 10 100
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
17.13 FURTHER READING
A number of good textbooks are available on the topics dealt with in this unit. The
following books may be used for more in-depth study.
Appendix Table-1 Binomial Probabilities
p
n r .01 .05 .10 .15 .20 .25 .30 .35 .40 .45 .50 .55 .60 .65 .70 .75 .80 .85 .90 .95
2 0 .980 .902 .810 .723 .640 .563 .490 .423 .360 .303 .250 .203 .160 .123 .090 .063 .040 .023 .010 .002
1 .020 .095 .180 .255 .320 .375 .420 .455 .480 .495 .500 .495 .480 .455 .420 .375 .320 .255 .180 .095
2 .000 .002 .010 .023 .040 .063 .090 .123 .160 .203 .250 .303 .360 .423 .490 .563 .640 .723 .810 .902
3 0 .970 .857 .729 .614 .512 .422 .343 .275 .216 .166 .125 .091 .064 .043 .027 .016 .008 .003 .001 .000
1 .029 .135 .243 .325 .384 .422 .441 .444 .432 .408 .375 .334 .288 .239 .189 .141 .096 .057 .027 .007
2 .000 .007 .027 .057 .096 .141 .189 .239 .288 .334 .375 .408 .432 .444 .441 .422 .384 .325 .243 .135
3 .000 .000 .001 .003 .008 .016 .027 .043 .064 .091 .125 .166 .216 .275 .343 .422 .512 .614 .729 .857
4 0 .961 .815 .656 .522 .410 .316 .240 .179 .130 .092 .062 .041 .026 .015 .008 .004 .002 .001 .000 .000
1 .039 .171 .292 .368 .410 .422 .412 .384 .346 .300 .250 .200 .154 .112 .076 .047 .026 .011 .004 .000
2 .001 .014 .049 .098 .154 .211 .265 .311 .346 .368 .375 .368 .346 .311 .265 .211 .154 .098 .049 .014
3 .000 .000 .004 .011 .026 .047 .076 .112 .154 .200 .250 .300 .346 .384 .412 .422 .410 .368 .292 .171
4 .000 .000 .000 .001 .002 .004 .008 .015 .026 .041 .062 .092 .130 .179 .240 .316 .410 .522 .656 .815
5 0 .951 .774 .590 .444 .328 .237 .168 .116 .078 .050 .031 .019 .010 .005 .002 .001 .000 .000 .000 .000
1 .048 .204 .328 .392 .410 .396 .360 .312 .259 .206 .156 .113 .077 .049 .028 .015 .006 .002 .000 .000
2 .001 .021 .073 .138 .205 .264 .309 .336 .346 .337 .312 .276 .230 .181 .132 .088 .051 .024 .008 .001
3 .000 .001 .008 .024 .051 .088 .132 .181 .230 .276 .312 .337 .346 .336 .309 .264 .205 .138 .073 .021
4 .000 .000 .000 .002 .006 .015 .028 .049 .077 .113 .156 .206 .259 .312 .360 .396 .410 .392 .328 .204
5 .000 .000 .000 .000 .000 .001 .002 .005 .010 .019 .031 .050 .078 .116 .168 .237 .328 .444 .590 .774
6 0 .941 .735 .531 .377 .262 .178 .118 .075 .047 .028 .016 .008 .004 .002 .001 .000 .000 .000 .000 .000
1 .057 .232 .354 .399 .393 .356 .303 .244 .187 .136 .094 .061 .037 .020 .010 .004 .002 .000 .000 .000
2 .001 .031 .098 .176 .246 .297 .324 .328 .311 .278 .234 .186 .138 .095 .060 .033 .015 .006 .001 .000
3 .000 .002 .015 .042 .082 .132 .185 .236 .276 .303 .312 .303 .276 .236 .185 .132 .082 .042 .015 .002
4 .000 .000 .001 .006 .015 .033 .060 .095 .138 .186 .234 .278 .311 .328 .324 .297 .246 .176 .098 .031
5 .000 .000 .000 .000 .002 .004 .010 .020 .037 .061 .094 .136 .187 .244 .303 .356 .393 .399 .354 .232
6 .000 .000 .000 .000 .000 .000 .001 .002 .004 .008 .016 .028 .047 .075 .118 .178 .262 .377 .531 .735
7 0 .932 .698 .478 .321 .210 .133 .082 .049 .028 .015 .008 .004 .002 .001 .000 .000 .000 .000 .000 .000
1 .066 .257 .372 .396 .367 .311 .247 .185 .131 .087 .055 .032 .017 .008 .004 .001 .000 .000 .000 .000
2 .002 .041 .124 .210 .275 .311 .318 .299 .261 .214 .164 .117 .077 .047 .025 .012 .004 .001 .000 .000
3 .000 .004 .023 .062 .115 .173 .227 .268 .290 .292 .273 .239 .194 .144 .097 .058 .029 .011 .003 .000
4 .000 .000 .003 .011 .029 .058 .097 .144 .194 .239 .273 .292 .290 .268 .227 .173 .115 .062 .023 .004
5 .000 .000 .000 .001 .004 .012 .025 .047 .077 .117 .164 .214 .261 .299 .318 .311 .275 .210 .124 .041
6 .000 .000 .000 .000 .000 .001 .004 .008 .017 .032 .055 .087 .131 .185 .247 .311 .367 .396 .372 .257
7 .000 .000 .000 .000 .000 .000 .000 .001 .002 .004 .008 .015 .028 .049 .082 .133 .210 .321 .478 .698
8 0 .923 .663 .430 .272 .168 .100 .058 .032 .017 .008 .004 .002 .001 .000 .000 .000 .000 .000 .000 .000
1 .075 .279 .383 .385 .336 .267 .198 .137 .090 .055 .031 .016 .008 .003 .001 .000 .000 .000 .000 .000
2 .003 .051 .149 .238 .294 .311 .296 .259 .209 .157 .109 .070 .041 .022 .010 .004 .001 .000 .000 .000
3 .000 .005 .033 .084 .147 .208 .254 .279 .279 .257 .219 .172 .124 .081 .047 .023 .009 .003 .000 .000
4 .000 .000 .005 .018 .046 .087 .136 .188 .232 .263 .273 .263 .232 .188 .136 .087 .046 .018 .005 .000
5 .000 .000 .000 .003 .009 .023 .047 .081 .124 .172 .219 .257 .279 .279 .254 .208 .147 .084 .033 .005
6 .000 .000 .000 .000 .001 .004 .010 .022 .041 .070 .109 .157 .209 .259 .296 .311 .294 .238 .149 .051
7 .000 .000 .000 .000 .000 .000 .001 .003 .008 .016 .031 .055 .090 .137 .198 .267 .336 .385 .383 .279
8 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .004 .008 .017 .032 .058 .100 .168 .272 .430 .663
Appendix Table-1 Binomial Probabilities (continued)
p
n r .01 .05 .10 .15 .20 .25 .30 .35 .40 .45 .50 .55 .60 .65 .70 .75 .80 .85 .90 .95
9 0 .914 .630 .387 .232 .134 .075 .040 .021 .010 .005 .002 .001 .000 .000 .000 .000 .000 .000 .000 .000
1 .083 .299 .387 .368 .302 .225 .156 .100 .060 .034 .018 .008 .004 .001 .000 .000 .000 .000 .000 .000
2 .003 .063 .172 .260 .302 .300 .267 .216 .161 .111 .070 .041 .021 .010 .004 .001 .000 .000 .000 .000
3 .000 .008 .045 .107 .176 .234 .267 .272 .251 .212 .164 .116 .074 .042 .021 .009 .003 .001 .000 .000
4 .000 .001 .007 .028 .066 .117 .172 .219 .251 .260 .246 .213 .167 .118 .074 .039 .017 .005 .001 .000
5 .000 .000 .001 .005 .017 .039 .074 .118 .167 .213 .246 .260 .251 .219 .172 .117 .066 .028 .007 .001
6 .000 .000 .000 .001 .003 .009 .021 .042 .074 .116 .164 .212 .251 .272 .267 .234 .176 .107 .045 .008
7 .000 .000 .000 .000 .000 .001 .004 .010 .021 .041 .070 .111 .161 .216 .267 .300 .302 .260 .172 .063
8 .000 .000 .000 .000 .000 .000 .000 .001 .004 .008 .018 .034 .060 .100 .156 .225 .302 .368 .387 .299
9 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .005 .010 .021 .040 .075 .134 .232 .387 .630
10 0 .904 .599 .349 .197 .107 .056 .028 .014 .006 .003 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .091 .315 .387 .347 .268 .188 .121 .072 .040 .021 .010 .004 .002 .000 .000 .000 .000 .000 .000 .000
2 .004 .075 .194 .276 .302 .282 .233 .176 .121 .076 .044 .023 .011 .004 .001 .000 .000 .000 .000 .000
3 .000 .010 .057 .130 .201 .250 .267 .252 .215 .166 .117 .075 .042 .021 .009 .003 .001 .000 .000 .000
4 .000 .001 .011 .040 .088 .146 .200 .238 .251 .238 .205 .160 .111 .069 .037 .016 .006 .001 .000 .000
5 .000 .000 .001 .008 .026 .058 .103 .154 .201 .234 .246 .234 .201 .154 .103 .058 .026 .008 .001 .000
6 .000 .000 .000 .001 .006 .016 .037 .069 .111 .160 .205 .238 .251 .238 .200 .146 .088 .040 .011 .001
7 .000 .000 .000 .000 .001 .003 .009 .021 .042 .075 .117 .166 .215 .252 .267 .250 .201 .130 .057 .010
8 .000 .000 .000 .000 .000 .000 .001 .004 .011 .023 .044 .076 .121 .176 .233 .282 .302 .276 .194 .075
9 .000 .000 .000 .000 .000 .000 .000 .000 .002 .004 .010 .021 .040 .072 .121 .188 .268 .347 .387 .315
10 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .006 .014 .028 .056 .107 .197 .349 .599
11 0 .895 .569 .314 .167 .086 .042 .020 .009 .004 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .099 .329 .384 .325 .236 .155 .093 .052 .027 .013 .005 .002 .001 .000 .000 .000 .000 .000 .000 .000
2 .005 .087 .213 .287 .295 .258 .200 .140 .089 .051 .027 .013 .005 .002 .001 .000 .000 .000 .000 .000
3 .000 .014 .071 .152 .221 .258 .257 .225 .177 .126 .081 .046 .023 .010 .004 .001 .000 .000 .000 .000
4 .000 .001 .016 .054 .111 .172 .220 .243 .236 .206 .161 .113 .070 .038 .017 .006 .002 .000 .000 .000
5 .000 .000 .002 .013 .039 .080 .132 .183 .221 .236 .226 .193 .147 .099 .057 .027 .010 .002 .000 .000
6 .000 .000 .000 .002 .010 .027 .057 .099 .147 .193 .226 .236 .221 .183 .132 .080 .039 .013 .002 .000
7 .000 .000 .000 .000 .002 .006 .017 .038 .070 .113 .161 .206 .236 .243 .220 .172 .111 .054 .016 .001
8 .000 .000 .000 .000 .000 .001 .004 .010 .023 .046 .081 .126 .177 .225 .257 .258 .221 .152 .071 .014
9 .000 .000 .000 .000 .000 .000 .001 .002 .005 .013 .027 .051 .089 .140 .200 .258 .295 .287 .213 .087
10 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .005 .013 .027 .052 .093 .155 .236 .325 .384 .329
11 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .004 .009 .020 .042 .086 .167 .314 .569
12 0 .886 .540 .282 .142 .069 .032 .014 .006 .002 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .107 .341 .377 .301 .206 .127 .071 .037 .017 .008 .003 .001 .000 .000 .000 .000 .000 .000 .000 .000
2 .006 .099 .230 .292 .283 .232 .168 .109 .064 .034 .016 .007 .002 .001 .000 .000 .000 .000 .000 .000
3 .000 .017 .085 .172 .236 .258 .240 .195 .142 .092 .054 .028 .012 .005 .001 .000 .000 .000 .000 .000
4 .000 .002 .021 .068 .133 .194 .231 .237 .213 .170 .121 .076 .042 .020 .008 .002 .001 .000 .000 .000
5 .000 .000 .004 .019 .053 .103 .158 .204 .227 .223 .193 .149 .101 .059 .029 .011 .003 .001 .000 .000
6 .000 .000 .000 .004 .016 .040 .079 .128 .177 .212 .226 .212 .177 .128 .079 .040 .016 .004 .000 .000
7 .000 .000 .000 .001 .003 .011 .029 .059 .101 .149 .193 .223 .227 .204 .158 .103 .053 .019 .004 .000
8 .000 .000 .000 .000 .001 .002 .008 .020 .042 .076 .121 .170 .213 .237 .231 .194 .133 .068 .021 .002
9 .000 .000 .000 .000 .000 .000 .001 .005 .012 .028 .054 .092 .142 .195 .240 .258 .236 .172 .085 .017
10 .000 .000 .000 .000 .000 .000 .000 .001 .002 .007 .016 .034 .064 .109 .168 .232 .283 .292 .230 .099
11 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .008 .017 .037 .071 .127 .206 .301 .377 .341
12 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .006 .014 .032 .069 .142 .282 .540
15 0 .860 .463 .206 .087 .035 .013 .005 .002 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .130 .366 .343 .231 .132 .067 .031 .013 .005 .002 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
2 .009 .135 .267 .286 .231 .156 .092 .048 .022 .009 .003 .001 .000 .000 .000 .000 .000 .000 .000 .000
3 .000 .031 .129 .218 .250 .225 .170 .111 .063 .032 .014 .005 .002 .000 .000 .000 .000 .000 .000 .000
4 .000 .005 .043 .116 .188 .225 .219 .179 .127 .078 .042 .019 .007 .002 .001 .000 .000 .000 .000 .000
5 .000 .001 .010 .045 .103 .165 .206 .212 .186 .140 .092 .051 .024 .010 .003 .001 .000 .000 .000 .000
6 .000 .000 .002 .013 .043 .092 .147 .191 .207 .191 .153 .105 .061 .030 .012 .003 .001 .000 .000 .000
7 .000 .000 .000 .003 .014 .039 .081 .132 .177 .201 .196 .165 .118 .071 .035 .013 .003 .001 .000 .000
8 .000 .000 .000 .001 .003 .013 .035 .071 .118 .165 .196 .201 .177 .132 .081 .039 .014 .003 .000 .000
9 .000 .000 .000 .000 .001 .003 .012 .030 .061 .105 .153 .191 .207 .191 .147 .092 .043 .013 .002 .000
10 .000 .000 .000 .000 .000 .001 .003 .010 .024 .051 .092 .140 .186 .212 .206 .165 .103 .045 .010 .001
11 .000 .000 .000 .000 .000 .000 .001 .002 .007 .019 .042 .078 .127 .179 .219 .225 .188 .116 .043 .005
12 .000 .000 .000 .000 .000 .000 .000 .000 .002 .005 .014 .032 .063 .111 .170 .225 .250 .218 .129 .031
13 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .009 .022 .048 .092 .156 .231 .286 .267 .135
14 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .005 .013 .031 .067 .132 .231 .343 .366
15 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .005 .013 .035 .087 .206 .463
Appendix Table-2 Direct Values for Determining Poisson Probabilities
For a given value of λ, the entry indicates the probability of obtaining a specified value of X.
µ
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0.9048 0.8187 0.7408 0.6703 0.6065 0.5488 0.4966 0.4493 0.4066 0.3679
1 0.0905 0.1637 0.2222 0.2681 0.3033 0.3293 0.3476 0.3595 0.3659 0.3679
2 0.0045 0.0164 0.0333 0.0536 0.0758 0.0988 0.1217 0.1438 0.1647 0.1839
3 0.0002 0.0011 0.0033 0.0072 0.0126 0.0198 0.0284 0.0383 0.0494 0.0613
4 0.0000 0.0001 0.0003 0.0007 0.0016 0.0030 0.0050 0.0077 0.0111 0.0153
5 0.0000 0.0000 0.0000 0.0001 0.0002 0.0004 0.0007 0.0012 0.0020 0.0031
6 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0003 0.0005
7 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
µ
x 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
0 0.3329 0.3012 0.2725 0.2466 0.2231 0.2019 0.1827 0.1653 0.1496 0.1353
1 0.3662 0.3614 0.3543 0.3452 0.3347 0.3230 0.3106 0.2975 0.2842 0.2707
2 0.2014 0.2169 0.2303 0.2417 0.2510 0.2584 0.2640 0.2678 0.2700 0.2707
3 0.0738 0.0867 0.0998 0.1128 0.1255 0.1378 0.1496 0.1607 0.1710 0.1804
4 0.0203 0.0260 0.0324 0.0395 0.0471 0.0551 0.0636 0.0723 0.0812 0.0902
5 0.0045 0.0062 0.0084 0.0111 0.0141 0.0176 0.0216 0.0260 0.0309 0.0361
6 0.0008 0.0012 0.0018 0.0026 0.0035 0.0047 0.0061 0.0078 0.0098 0.0120
7 0.0001 0.0002 0.0003 0.0005 0.0008 0.0011 0.0015 0.0020 0.0027 0.0034
8 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0003 0.0005 0.0006 0.0009
9 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002
µ
x 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
0 0.1225 0.1108 0.1003 0.0907 0.0821 0.0743 0.0672 0.0608 0.0550 0.0498
1 0.2572 0.2438 0.2306 0.2177 0.2052 0.1931 0.1815 0.1703 0.1596 0.1494
2 0.2700 0.2681 0.2652 0.2613 0.2565 0.2510 0.2450 0.2384 0.2314 0.2240
3 0.1890 0.1966 0.2033 0.2090 0.2138 0.2176 0.2205 0.2225 0.2237 0.2240
4 0.0992 0.1082 0.1169 0.1254 0.1336 0.1414 0.1488 0.1557 0.1622 0.1680
5 0.0417 0.0476 0.0538 0.0602 0.0668 0.0735 0.0804 0.0872 0.0940 0.1008
6 0.0146 0.0174 0.0206 0.0241 0.0278 0.0319 0.0362 0.0407 0.0455 0.0504
7 0.0044 0.0055 0.0068 0.0083 0.0099 0.0118 0.0139 0.0163 0.0188 0.0216
8 0.0011 0.0015 0.0019 0.0025 0.0031 0.0038 0.0047 0.0057 0.0068 0.0081
9 0.0003 0.0004 0.0005 0.0007 0.0009 0.0011 0.0014 0.0018 0.0022 0.0027
10 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0004 0.0005 0.0006 0.0008
11 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0002
12 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
µ
x 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
0 0.0450 0.0408 0.0369 0.0334 0.0302 0.0273 0.0247 0.0224 0.0202 0.0183
1 0.1397 0.1304 0.1217 0.1135 0.1057 0.0984 0.0915 0.0850 0.0789 0.0733
2 0.2165 0.2087 0.2008 0.1929 0.1850 0.1771 0.1692 0.1615 0.1539 0.1465
3 0.2237 0.2226 0.2209 0.2186 0.2158 0.2125 0.2087 0.2046 0.2001 0.1954
4 0.1734 0.1781 0.1823 0.1858 0.1888 0.1912 0.1931 0.1944 0.1951 0.1954
5 0.1075 0.1140 0.1203 0.1264 0.1322 0.1377 0.1429 0.1477 0.1522 0.1563
6 0.0555 0.0608 0.0662 0.0716 0.0771 0.0826 0.0881 0.0936 0.0989 0.1042
7 0.0246 0.0278 0.0312 0.0348 0.0385 0.0425 0.0466 0.0508 0.0551 0.0595
8 0.0095 0.0111 0.0129 0.0148 0.0169 0.0191 0.0215 0.0241 0.0269 0.0298
9 0.0033 0.0040 0.0047 0.0056 0.0066 0.0076 0.0089 0.0102 0.0116 0.0132
10 0.0010 0.0013 0.0016 0.0019 0.0023 0.0028 0.0033 0.0039 0.0045 0.0053
11 0.0003 0.0004 0.0005 0.0006 0.0007 0.0009 0.0011 0.0013 0.0016 0.0019
12 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005 0.0006
13 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002
14 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
Appendix Table-2 Direct Values for Determining Poisson Probabilities (continued)
µ
x 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0
0 0.0166 0.0150 0.0136 0.0123 0.0111 0.0101 0.0091 0.0082 0.0074 0.0067
1 0.0679 0.0630 0.0583 0.0540 0.0500 0.0462 0.0427 0.0395 0.0365 0.0337
2 0.1393 0.1323 0.1254 0.1188 0.1125 0.1063 0.1005 0.0948 0.0894 0.0842
3 0.1904 0.1852 0.1798 0.1743 0.1687 0.1631 0.1574 0.1517 0.1460 0.1404
4 0.1951 0.1944 0.1933 0.1917 0.1898 0.1875 0.1849 0.1820 0.1789 0.1755
5 0.1600 0.1633 0.1662 0.1687 0.1708 0.1725 0.1738 0.1747 0.1753 0.1755
6 0.1093 0.1143 0.1191 0.1237 0.1281 0.1323 0.1362 0.1398 0.1432 0.1462
7 0.0640 0.0686 0.0732 0.0778 0.0824 0.0869 0.0914 0.0959 0.1002 0.1044
8 0.0328 0.0360 0.0393 0.0428 0.0463 0.0500 0.0537 0.0575 0.0614 0.0653
9 0.0150 0.0168 0.0188 0.0209 0.0232 0.0255 0.0280 0.0307 0.0334 0.0363
10 0.0061 0.0071 0.0081 0.0092 0.0104 0.0118 0.0132 0.0147 0.0164 0.0181
11 0.0023 0.0027 0.0032 0.0037 0.0043 0.0049 0.0056 0.0064 0.0073 0.0082
12 0.0008 0.0009 0.0011 0.0014 0.0016 0.0019 0.0022 0.0026 0.0030 0.0034
13 0.0002 0.0003 0.0004 0.0005 0.0006 0.0007 0.0008 0.0009 0.0011 0.0013
14 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005
15 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002
µ
x 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0
0 0.0061 0.0055 0.0050 0.0045 0.0041 0.0037 0.0033 0.0030 0.0027 0.0025
1 0.0311 0.0287 0.0265 0.0244 0.0225 0.0207 0.0191 0.0176 0.0162 0.0149
2 0.0793 0.0746 0.0701 0.0659 0.0618 0.0580 0.0544 0.0509 0.0477 0.0446
3 0.1348 0.1293 0.1239 0.1185 0.1133 0.1082 0.1033 0.0985 0.0938 0.0892
4 0.1719 0.1681 0.1641 0.1600 0.1558 0.1515 0.1472 0.1428 0.1383 0.1339
5 0.1753 0.1748 0.1740 0.1728 0.1714 0.1697 0.1678 0.1656 0.1632 0.1606
6 0.1490 0.1515 0.1537 0.1555 0.1571 0.1584 0.1594 0.1601 0.1605 0.1606
7 0.1086 0.1125 0.1163 0.1200 0.1234 0.1267 0.1298 0.1326 0.1353 0.1377
8 0.0692 0.0731 0.0771 0.0810 0.0849 0.0887 0.0925 0.0962 0.0998 0.1033
9 0.0392 0.0423 0.0454 0.0486 0.0519 0.0552 0.0586 0.0620 0.0654 0.0688
10 0.0200 0.0220 0.0241 0.0262 0.0285 0.0309 0.0334 0.0359 0.0386 0.0413
11 0.0093 0.0104 0.0116 0.0129 0.0143 0.0157 0.0173 0.0190 0.0207 0.0225
12 0.0039 0.0045 0.0051 0.0058 0.0065 0.0073 0.0082 0.0092 0.0102 0.0113
13 0.0015 0.0018 0.0021 0.0024 0.0028 0.0032 0.0036 0.0041 0.0046 0.0052
14 0.0006 0.0007 0.0008 0.0009 0.0011 0.0013 0.0015 0.0017 0.0019 0.0022
15 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005 0.0006 0.0007 0.0008 0.0009
16 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002 0.0003 0.0003
17 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001
µ
x 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0
0 0.0022 0.0020 0.0018 0.0017 0.0015 0.0014 0.0012 0.0011 0.0010 0.0009
1 0.0137 0.0126 0.0116 0.0106 0.0098 0.0090 0.0082 0.0076 0.0070 0.0064
2 0.0417 0.0390 0.0364 0.0340 0.0318 0.0296 0.0276 0.0258 0.0240 0.0223
3 0.0848 0.0806 0.0765 0.0726 0.0688 0.0652 0.0617 0.0584 0.0552 0.0521
4 0.1294 0.1249 0.1205 0.1162 0.1118 0.1076 0.1034 0.0992 0.0952 0.0912
5 0.1579 0.1549 0.1519 0.1487 0.1454 0.1420 0.1385 0.1349 0.1314 0.1277
6 0.1605 0.1601 0.1595 0.1586 0.1575 0.1562 0.1546 0.1529 0.1511 0.1490
7 0.1399 0.1418 0.1435 0.1450 0.1462 0.1472 0.1480 0.1486 0.1489 0.1490
8 0.1066 0.1099 0.1130 0.1160 0.1188 0.1215 0.1240 0.1263 0.1284 0.1304
9 0.0723 0.0757 0.0791 0.0825 0.0858 0.0891 0.0923 0.0954 0.0985 0.1014
10 0.0441 0.0469 0.0498 0.0528 0.0558 0.0588 0.0618 0.0649 0.0679 0.0710
11 0.0245 0.0265 0.0285 0.0307 0.0330 0.0353 0.0377 0.0401 0.0426 0.0452
12 0.0124 0.0137 0.0150 0.0164 0.0179 0.0194 0.0210 0.0227 0.0245 0.0264
13 0.0058 0.0065 0.0073 0.0081 0.0089 0.0098 0.0108 0.0119 0.0130 0.0142
14 0.0025 0.0029 0.0033 0.0037 0.0041 0.0046 0.0052 0.0058 0.0064 0.0071
15 0.0010 0.0012 0.0014 0.0016 0.0018 0.0020 0.0023 0.0026 0.0029 0.0033
16 0.0004 0.0005 0.0005 0.0006 0.0007 0.0008 0.0010 0.0011 0.0013 0.0014
17 0.0001 0.0002 0.0002 0.0002 0.0003 0.0003 0.0004 0.0004 0.0005 0.0006
18 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002
19 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001
Appendix Table-2 Direct Values for Determining Poisson Probabilities (continued)
µ
x 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0
0 0.0008 0.0007 0.0007 0.0006 0.0006 0.0005 0.0005 0.0004 0.0004 0.0003
1 0.0059 0.0054 0.0049 0.0045 0.0041 0.0038 0.0035 0.0032 0.0029 0.0027
2 0.0208 0.0194 0.0180 0.0167 0.0156 0.0145 0.0134 0.0125 0.0116 0.0107
3 0.0492 0.0464 0.0438 0.0413 0.0389 0.0366 0.0345 0.0324 0.0305 0.0286
4 0.0874 0.0836 0.0799 0.0764 0.0729 0.0696 0.0663 0.0632 0.0602 0.0573
5 0.1241 0.1204 0.1167 0.1130 0.1094 0.1057 0.1021 0.0986 0.0951 0.0916
6 0.1468 0.1445 0.1420 0.1394 0.1367 0.1339 0.1311 0.1282 0.1252 0.1221
7 0.1489 0.1486 0.1481 0.1474 0.1465 0.1454 0.1442 0.1428 0.1413 0.1396
8 0.1321 0.1337 0.1351 0.1363 0.1373 0.1382 0.1388 0.1392 0.1395 0.1396
9 0.1042 0.1070 0.1096 0.1121 0.1144 0.1167 0.1187 0.1207 0.1224 0.1241
10 0.0740 0.0770 0.0800 0.0829 0.0858 0.0887 0.0914 0.0941 0.0967 0.0993
11 0.0478 0.0504 0.0531 0.0558 0.0585 0.0613 0.0640 0.0667 0.0695 0.0722
12 0.0283 0.0303 0.0323 0.0344 0.0366 0.0388 0.0411 0.0434 0.0457 0.0481
13 0.0154 0.0168 0.0181 0.0196 0.0211 0.0227 0.0243 0.0260 0.0278 0.0296
14 0.0078 0.0086 0.0095 0.0104 0.0113 0.0123 0.0134 0.0145 0.0157 0.0169
15 0.0037 0.0041 0.0046 0.0051 0.0057 0.0062 0.0069 0.0075 0.0083 0.0090
16 0.0016 0.0019 0.0021 0.0024 0.0026 0.0030 0.0033 0.0037 0.0041 0.0045
17 0.0007 0.0008 0.0009 0.0010 0.0012 0.0013 0.0015 0.0017 0.0019 0.0021
18 0.0003 0.0003 0.0004 0.0004 0.0005 0.0006 0.0006 0.0007 0.0008 0.0009
19 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002 0.0003 0.0003 0.0003 0.0004
20 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002
21 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001
µ
x 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0
0 0.0003 0.0003 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0001 0.0001
1 0.0025 0.0023 0.0021 0.0019 0.0017 0.0016 0.0014 0.0013 0.0012 0.0011
2 0.0100 0.0092 0.0086 0.0079 0.0074 0.0068 0.0063 0.0058 0.0054 0.0050
3 0.0269 0.0252 0.0237 0.0222 0.0208 0.0195 0.0183 0.0171 0.0160 0.0150
4 0.0544 0.0517 0.0491 0.0466 0.0443 0.0420 0.0398 0.0377 0.0357 0.0337
5 0.0882 0.0849 0.0816 0.0784 0.0752 0.0722 0.0692 0.0663 0.0635 0.0607
6 0.1191 0.1160 0.1128 0.1097 0.1066 0.1034 0.1003 0.0972 0.0941 0.0911
7 0.1378 0.1358 0.1338 0.1317 0.1294 0.1271 0.1247 0.1222 0.1197 0.1171
8 0.1395 0.1392 0.1388 0.1382 0.1375 0.1366 0.1356 0.1344 0.1332 0.1318
9 0.1256 0.1269 0.1280 0.1290 0.1299 0.1306 0.1311 0.1315 0.1317 0.1318
10 0.1017 0.1040 0.1063 0.1084 0.1104 0.1123 0.1140 0.1157 0.1172 0.1186
11 0.0749 0.0776 0.0802 0.0828 0.0853 0.0878 0.0902 0.0925 0.0948 0.0970
12 0.0505 0.0530 0.0555 0.0579 0.0604 0.0629 0.0654 0.0679 0.0703 0.0728
13 0.0315 0.0334 0.0354 0.0374 0.0395 0.0416 0.0438 0.0459 0.0481 0.0504
14 0.0182 0.0196 0.0210 0.0225 0.0240 0.0256 0.0272 0.0289 0.0306 0.0324
15 0.0098 0.0107 0.0116 0.0126 0.0136 0.0147 0.0158 0.0169 0.0182 0.0194
16 0.0050 0.0055 0.0060 0.0066 0.0072 0.0079 0.0086 0.0093 0.0101 0.0109
17 0.0024 0.0026 0.0029 0.0033 0.0036 0.0040 0.0044 0.0048 0.0053 0.0058
18 0.0011 0.0012 0.0014 0.0015 0.0017 0.0019 0.0021 0.0024 0.0026 0.0029
19 0.0005 0.0005 0.0006 0.0007 0.0008 0.0009 0.0010 0.0011 0.0012 0.0014
20 0.0002 0.0002 0.0002 0.0003 0.0003 0.0004 0.0004 0.0005 0.0005 0.0006
21 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002 0.0002 0.0003
22 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
µ
x 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0
0 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0000
1 0.0010 0.0009 0.0009 0.0008 0.0007 0.0007 0.0006 0.0005 0.0005 0.0005
2 0.0046 0.0043 0.0040 0.0037 0.0034 0.0031 0.0029 0.0027 0.0025 0.0023
3 0.0140 0.0131 0.0123 0.0115 0.0107 0.0100 0.0093 0.0087 0.0081 0.0076
4 0.0319 0.0302 0.0285 0.0269 0.0254 0.0240 0.0226 0.0213 0.0201 0.0189
Appendix Table-2 Direct Values for Determining Poisson Probabilities (continued)
5 0.0581 0.0555 0.0530 0.0506 0.0483 0.0460 0.0439 0.0418 0.0398 0.0378
6 0.0881 0.0851 0.0822 0.0793 0.0764 0.0736 0.0709 0.0682 0.0656 0.0631
7 0.1145 0.1118 0.1091 0.1064 0.1037 0.1010 0.0982 0.0955 0.0928 0.0901
8 0.1302 0.1286 0.1269 0.1251 0.1232 0.1212 0.1191 0.1170 0.1148 0.1126
9 0.1317 0.1315 0.1311 0.1306 0.1300 0.1293 0.1284 0.1274 0.1263 0.1251
10 0.1198 0.1210 0.1219 0.1228 0.1235 0.1241 0.1245 0.1249 0.1250 0.1251
11 0.0991 0.1012 0.1031 0.1049 0.1067 0.1083 0.1098 0.1112 0.1125 0.1137
12 0.0752 0.0776 0.0799 0.0822 0.0844 0.0866 0.0888 0.0908 0.0928 0.0948
13 0.0526 0.0549 0.0572 0.0594 0.0617 0.0640 0.0662 0.0685 0.0707 0.0729
14 0.0342 0.0361 0.0380 0.0399 0.0419 0.0439 0.0459 0.0479 0.0500 0.0521
15 0.0208 0.0221 0.0235 0.0250 0.0265 0.0281 0.0297 0.0313 0.0330 0.0347
16 0.0118 0.0127 0.0137 0.0147 0.0157 0.0168 0.0180 0.0192 0.0204 0.0217
17 0.0063 0.0069 0.0075 0.0081 0.0088 0.0095 0.0103 0.0111 0.0119 0.0128
18 0.0032 0.0035 0.0039 0.0042 0.0046 0.0051 0.0055 0.0060 0.0065 0.0071
19 0.0015 0.0017 0.0019 0.0021 0.0023 0.0026 0.0028 0.0031 0.0034 0.0037
20 0.0007 0.0008 0.0009 0.0010 0.0011 0.0012 0.0014 0.0015 0.0017 0.0019
21 0.0003 0.0003 0.0004 0.0004 0.0005 0.0006 0.0006 0.0007 0.0008 0.0009
22 0.0001 0.0001 0.0002 0.0002 0.0002 0.0002 0.0003 0.0003 0.0004 0.0004
23 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002
24 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001
µ
x 11 12 13 14 15 16 17 18 19 20
0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
1 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2 0.0010 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
3 0.0037 0.0018 0.0008 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000
4 0.0102 0.0053 0.0027 0.0013 0.0006 0.0003 0.0001 0.0001 0.0000 0.0000
5 0.0224 0.0127 0.0070 0.0037 0.0019 0.0010 0.0005 0.0002 0.0001 0.0001
6 0.0411 0.0255 0.0152 0.0087 0.0048 0.0026 0.0014 0.0007 0.0004 0.0002
7 0.0646 0.0437 0.0281 0.0174 0.0104 0.0060 0.0034 0.0018 0.0010 0.0005
8 0.0888 0.0655 0.0457 0.0304 0.0194 0.0120 0.0072 0.0042 0.0024 0.0013
9 0.1085 0.0874 0.0661 0.0473 0.0324 0.0213 0.0135 0.0083 0.0050 0.0029
10 0.1194 0.1048 0.0859 0.0663 0.0486 0.0341 0.0230 0.0150 0.0095 0.0058
11 0.1194 0.1144 0.1015 0.0844 0.0663 0.0496 0.0355 0.0245 0.0164 0.0106
12 0.1094 0.1144 0.1099 0.0984 0.0829 0.0661 0.0504 0.0368 0.0259 0.0176
13 0.0926 0.1056 0.1099 0.1060 0.0956 0.0814 0.0658 0.0509 0.0378 0.0271
14 0.0728 0.0905 0.1021 0.1060 0.1024 0.0930 0.0800 0.0655 0.0514 0.0387
15 0.0534 0.0724 0.0885 0.0989 0.1024 0.0992 0.0906 0.0786 0.0650 0.0516
16 0.0367 0.0543 0.0719 0.0866 0.0960 0.0992 0.0963 0.0884 0.0772 0.0646
17 0.0237 0.0383 0.0550 0.0713 0.0847 0.0934 0.0963 0.0936 0.0863 0.0760
18 0.0145 0.0256 0.0397 0.0554 0.0706 0.0830 0.0909 0.0936 0.0911 0.0844
19 0.0084 0.0161 0.0272 0.0409 0.0557 0.0699 0.0814 0.0887 0.0911 0.0888
20 0.0046 0.0097 0.0177 0.0286 0.0418 0.0559 0.0692 0.0798 0.0866 0.0888
21 0.0024 0.0055 0.0109 0.0191 0.0299 0.0426 0.0560 0.0684 0.0783 0.0846
22 0.0012 0.0030 0.0065 0.0121 0.0204 0.0310 0.0433 0.0560 0.0676 0.0769
23 0.0006 0.0016 0.0037 0.0074 0.0133 0.0216 0.0320 0.0438 0.0559 0.0669
24 0.0003 0.0008 0.0020 0.0043 0.0083 0.0144 0.0226 0.0328 0.0442 0.0557
25 0.0001 0.0004 0.0010 0.0024 0.0050 0.0092 0.0154 0.0237 0.0336 0.0446
26 0.0000 0.0002 0.0005 0.0013 0.0029 0.0057 0.0101 0.0164 0.0246 0.0343
27 0.0000 0.0001 0.0002 0.0007 0.0016 0.0034 0.0063 0.0109 0.0173 0.0254
28 0.0000 0.0000 0.0001 0.0003 0.0009 0.0019 0.0038 0.0070 0.0117 0.0181
29 0.0000 0.0000 0.0001 0.0002 0.0004 0.0011 0.0023 0.0044 0.0077 0.0125
30 0.0000 0.0000 0.0000 0.0001 0.0002 0.0006 0.0013 0.0026 0.0049 0.0083
31 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0007 0.0015 0.0030 0.0054
32 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0004 0.0009 0.0018 0.0034
33 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0005 0.0010 0.0020
34 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0006 0.0012
35 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0007
36 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0004
37 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002
38 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
39 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
Appendix Table-3 Areas of a Standard Normal Probability Distribution Between the
Mean and Positive Values of z
[Figure: standard normal curve showing 0.4429 of the area between the mean and z = 1.58]
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
0.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
0.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
0.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
0.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879
0.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
0.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2517 .2549
0.7 .2580 .2611 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
0.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
0.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319
1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767
2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986
3.0 .4987 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990
3.1 .4990 .4991 .4991 .4991 .4992 .4992 .4992 .4992 .4993 .4993
3.2 .4993 .4993 .4994 .4994 .4994 .4994 .4994 .4995 .4995 .4995
3.3 .4995 .4995 .4995 .4996 .4996 .4996 .4996 .4996 .4996 .4997
3.4 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4998
3.5 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998
3.6 .4998 .4998 .4998 .4999 .4999 .4999 .4999 .4999 .4999 .4999
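The tabled areas can also be reproduced from the error function available in any standard maths library: the area between the mean and a positive z equals ½·erf(z/√2). A small sketch:

```python
# Sketch: area under the standard normal curve between the mean and z,
# computed as 0.5 * erf(z / sqrt(2)).
from math import erf, sqrt

def area_mean_to_z(z):
    return 0.5 * erf(z / sqrt(2))

print(round(area_mean_to_z(1.58), 4))  # should match the 0.4429 shown above
```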
Appendix Table-5 Table of t (One Tail Area): values of t for a given tail area and degrees of freedom.
[Table values are not reproduced in this extract.]
UNIT 18 INTERPRETATION OF STATISTICAL DATA
STRUCTURE
18.0 Objectives
18.1 Introduction
18.2 Meaning of Interpretation
18.3 Why Interpretation?
18.4 Essentials for Interpretation
18.5 Precautions in Interpretation
18.6 Concluding Remarks on Interpretation
18.7 Conclusions and Generalizations
18.8 Methods of Generalization
18.8.1 Logical Method
18.8.2 Statistical Method
18.9 Statistical Fallacies
18.10 Conclusions
18.11 Let Us Sum Up
18.12 Key Words
18.13 Answers to Self Assessment Exercises
18.14 Terminal Questions
18.15 Further Reading
18.0 OBJECTIVES
After studying this unit, you should be able to:
l define interpretation,
l explain the need for interpretation,
l state the essentials for interpretation,
l narrate the precautions to be taken before interpretation,
l describe a conclusion and generalization,
l explain the methods of generalization, and
l illustrate statistical fallacies.
18.1 INTRODUCTION
We have studied in the previous units the various methods applied in the
collection and analysis of statistical data. Statistics are not an end in themselves
but they are a means to an end, the end being to draw certain conclusions
from them. This has to be done very carefully, otherwise misleading conclusions
may be drawn and the whole purpose of doing research may get vitiated.
18.2 MEANING OF INTERPRETATION
The following definitions can explain the meaning of interpretation.
a) The data are homogeneous: It is necessary to ascertain that the data are
strictly comparable. We must be careful to compare the like with the like and
not with the unlike.
b) The data are adequate: Sometimes the data are incomplete or insufficient,
so that it is neither possible to analyze them scientifically nor to draw any
inference from them. Such data must be completed first.
c) The data are suitable: Before considering the data for interpretation, the
researcher must confirm the required degree of suitability of the data.
Inappropriate data are like no data. Hence, no conclusion is possible with
unsuitable data.
The researcher must accomplish the task of interpretation only after considering
all relevant factors affecting the problem, so as to avoid false generalizations.
He/she should not conclude without evidence or draw hasty conclusions, and
should take all possible precautions for proper interpretation of the data.
1) Interpretation means:
....................................................................................................................
....................................................................................................................
....................................................................................................................
In everyday life, we often make generalizations. We believe that what is true
of the observed instances will be true of the unobserved instances. Since we
have had a uniform experience, we expect that we shall have it even in the
future. We are quite conscious of the fact that the observed instances do not
constitute all the members of the class concerned, but we have a tendency to
generalize. A generalization is a statement the scope of which is wider than
that of the observed instances on which it is based.
This method was first introduced by John Stuart Mill, who said that
generalization should be based on logical processes. Mill thought that discovering
causal connections is the fundamental task in generalization. If causal
connections hold good, generalization can be done with confidence. Five
methods of experimental enquiry have been given by Mill. These methods serve
the purpose of discovering causal connections. These methods are as follows.
A+B+C Produce X
A+P+Q Produce X
M + N + Non-A Produce Non-X
G + H + Non-A Produce Non-X
∴ A and X are causally connected.
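The schema above (the joint method of agreement and difference) can be mimicked mechanically: look for a factor present in every case that produces X and absent from every case that produces non-X. A toy sketch using the cases from the example — an illustration only, not Mill's full statement of the methods:

```python
# Toy sketch of Mill's joint method of agreement and difference:
# a candidate cause is present in all positive cases and in no negative case.

def candidate_causes(positive_cases, negative_cases):
    common = set.intersection(*map(set, positive_cases))       # method of agreement
    seen_in_negatives = set().union(*map(set, negative_cases))
    return common - seen_in_negatives                          # method of difference

positive = [{"A", "B", "C"}, {"A", "P", "Q"}]  # antecedents that produce X
negative = [{"M", "N"}, {"G", "H"}]            # antecedents that produce non-X
print(candidate_causes(positive, negative))    # {'A'}
```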
iv) The Method of Residues: This method is based on the principle of
elimination. Its statement is: subtract from any phenomenon such part as is
known by previous inductions to be the effect of certain antecedents, and the
residue of the phenomenon is the effect of the remaining antecedents. For
example, a loaded lorry weighs 11 tons and the dead weight of the lorry is
1 ton; the weight of the load is therefore 11 – 1 = 10 tons.
i) Collection of Data: The facts pertaining to the problem under study are to
be collected either by survey method or by observation method or by
experiment or from a library. (It was discussed in Unit-3).
iii) Analysis of Data: The processed data should then be properly analyzed
with the help of statistical tools, such as measures of central tendency,
measures of variation, measures of skewness, correlation, time series, index
numbers etc. (This was discussed in Units 8, 9, 10, 11 and 12 of this course).
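As an illustration of the kind of tools listed above, a short sketch (the sales figures are invented for the example) computing a measure of central tendency, a measure of variation, and Karl Pearson's coefficient of skewness, 3(mean − median)/σ:

```python
# Illustration only: the data below are invented for the example.
from statistics import mean, median, pstdev

sales = [12, 15, 14, 10, 18, 16, 11, 20, 13, 15]

m = mean(sales)            # central tendency
md = median(sales)
sd = pstdev(sales)         # variation (population standard deviation)
skew = 3 * (m - md) / sd   # Karl Pearson's coefficient of skewness
print(m, md, round(sd, 2), round(skew, 3))
```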
....................................................................................................................
5) Fill in the blanks with appropriate word (s) :
i) Extending the conclusion from observed instances to unobserved instances
is also called ____________.
ii) Logical method is associated with the name of ______________.
Unconscious bias is even more insidious. Perhaps all statistical reports contain
some unconscious bias, since statistical results are, after all, interpreted by human
beings. Each person may look at things in terms of his own experience and
his attitude towards the problem under study. People suffer from several
inhibitions, prejudices, ideologies and hardened attitudes, and cannot help
reflecting these in their interpretation of results. For example, a pessimist will
see the future as being dark, whereas an optimist may see it as being bright.
Inappropriate Comparisons: Comparisons between two things cannot be
made unless they are really alike. Unfortunately, this point is generally forgotten
and comparisons are made between two dissimilar things, thereby leading to
fallacious conclusions. For example, the cost of living index of Bangalore is 150
(with base year 1999) and that of Hyderabad is 155 (with base year 1995);
therefore, Hyderabad is a costlier city than Bangalore. This conclusion is
misleading, as the base years of the indices are different.
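To make such a comparison valid, both series must first be shifted to a common base year. A sketch with invented index values (the figures below are hypothetical, not the actual Bangalore/Hyderabad data):

```python
# Hypothetical illustration: rebase an index series so two cities share
# the same base year before comparing them.

def rebase(series, new_base_year):
    """Express every value relative to new_base_year = 100."""
    base = series[new_base_year]
    return {year: 100 * value / base for year, value in series.items()}

# Invented series for one city, originally on base 1995 (1995 = 100):
hyderabad = {1995: 100.0, 1999: 125.0, 2003: 155.0}

on_1999_base = rebase(hyderabad, 1999)
print(on_1999_base[2003])  # now comparable with an index on base 1999
```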
Failure to Comprehend the Data: Very often figures are interpreted without
comprehending the total background of the data and it may lead to wrong
conclusions. For example, see the following interpretations:
– The death rate in the army is 9 per thousand, whereas in the city of Delhi it is
15 per thousand. Therefore, it is safer to be in the army than in the city.
– Most of the patients who were admitted in the intensive care (IC) ward of a
hospital died. Therefore, it is unsafe to be admitted to intensive care ward in that
hospital.
18.10 CONCLUSIONS
Statistical methods and techniques are only tools and, as such, may very
often be misused. Some people believe that “figures can prove anything”;
others say that “figures don’t lie, but liars can figure”. Some people regard
statistics as the worst type of lies. That is why it is said that “an ounce of
truth can be produced from tons of statistics”. Mere quantitative results, or a
huge body of data without any definite purpose, can never help to explain
anything. The misuse of statistics may arise due to:
As a principle, statistics cannot prove anything, but they can be made to prove
anything, because statistics are like clay with which one can make God or the
Devil. The fault lies not with statistics but with the person who is using
them. The interpreter must carefully look into these points before he sets
about the task of interpretation. We may conclude with the words of Marshall,
who said: “Statistical arguments are often misleading at first, but free discussion
clears away statistical fallacies”.
18.11 LET US SUM UP
A statistician, having collected and analyzed data, has to draw inferences and
explain their significance. The process of explaining the data, after analysis, is
called interpretation of data. Interpretation is necessary because it is only
through interpretation that the researcher can explain the relations and patterns
that underlie his findings. Before interpretation, one must be satisfied that the
data are homogeneous, adequate, suitable and scientifically analyzed.
ii) There is no proportionate relationship between rainfall and yield. In fact,
excessive rain spoils the crop.
iii) At some places the depth may be 10 ft even though the average is 5 ft.
Hence, it is dangerous.
iv) It can also be due to an increase in the excise duties.
v) Not necessarily. It may be due to an increasing population or an increase in
consumption.
Note: These questions/exercises will help you to understand the unit better. Try
to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
18.15 FURTHER READING
The following textbooks may be used for a more in-depth study of the topics
dealt with in this unit.
B.N. Gupta, Statistics, Sahitya Bhavan, Agra.
S.P. Gupta, Statistical Methods, Sultan Chand & Sons, New Delhi.
B.N. Agarwal, Basic Statistics, Wiley Eastern Ltd.
P. Saravanavel, Research Methodology, Kitab Mahal, Allahabad.
C.R. Kothari, Research Methodology (Methods and Techniques), New Age
International Pvt. Ltd., New Delhi.
UNIT 19 REPORT WRITING
STRUCTURE
19.0 Objectives
19.1 Introduction
19.2 Purpose of a Report
19.3 Meaning
19.4 Types of Reports
19.5 Stages in Preparation of a Report
19.6 Characteristics of a Good Report
19.7 Structure of the Research Report
19.7.1 Prefatory Items
19.7.2 The Text/Body of the Report
19.7.3 Terminal Items
19.8 Check List for the Report
19.9 Let Us Sum Up
19.10 Key Words
19.11 Answers to Self Assessment Exercises
19.12 Terminal Questions
19.13 Further Reading
19.0 OBJECTIVES
After going through this unit, you should be able to :
l define a Report,
l explain the need for reporting,
l discuss the subject matter of various types of reports,
l identify the stages in preparation of a report,
l explain the characteristics of a good report,
l explain different parts of a report, and
l distinguish between a good and bad report.
19.1 INTRODUCTION
The final phase of the journey in research is the writing of the report.
After the collected data have been analyzed and interpreted, and generalizations
have been drawn, the report has to be prepared. The task of research is
incomplete till the report is presented.
Writing a report is the last step in a research study and requires a set of skills
somewhat different from those called for in the earlier stages of
research. This task should be accomplished by the researcher with utmost care.
19.3 MEANING
Reporting simply means communicating or informing through reports. The
researcher has collected some facts and figures, analyzed the same and arrived
at certain conclusions. He has to inform or report the same to the parties
interested. Therefore “reporting is communicating the facts, data and information
through reports to the persons for whom such facts and data are collected and
compiled”.
A report is not a complete description of what has been done during the period
of survey/research. It is only a statement of the most significant facts that are
necessary for understanding the conclusions drawn by the investigator. Thus,
“a report, by definition, is simply an account”. The report is thus an account
describing the procedure adopted, the findings arrived at and the conclusions
drawn by the investigator of a problem.
b) Written Report: Written reports are more formal, authentic and popular.
It is, thus, clear that the results of a research enquiry can be presented in a
number of ways. They may be termed as a technical report, a popular report,
an article, or a monograph.
i) Journalistic Report
ii) Business Report
iii) Project Report
iv) Dissertation
v) Enquiry Report (Commission Report), and
vi) Thesis
between one aspect and another by means of logical analysis. Logical treatment
often consists of developing material from the simple to the most complex.
Designing the Final Outline of the Report: It is the second stage in writing
the report. Having understood the subject matter, the next stage is structuring
the report and ordering the parts and sketching them. This stage can also be
called the planning and organization stage. Ideas may pass through the author’s
mind, but unless he first makes his plan/sketch/design he will be unable to achieve
a harmonious succession and will not even know where to begin and how to
end. Better communication of research results is partly a matter of language
but mostly a matter of planning and organizing the report.
Preparation of the Rough Draft: The third stage is the write up/drafting of
the report. This is the most crucial stage to the researcher, as he/she now sits
to write down what he/she has done in his/her research study and what and
how he/she wants to communicate the same. Here the clarity in
communicating/reporting is influenced by some factors such as who the readers
are, how technical the problem is, the researcher’s hold over the facts and
techniques, the researcher’s command over language (his communication skills),
the data and completeness of his notes and documentation and the availability
of analyzed results. Depending on the above factors, some authors may be able
to write the report with one or two drafts, while those with less command
over language or less clarity about the problem and subject matter may take
more time and have to prepare several drafts (first draft, second draft, third
draft and so on).
Finalization of the Report: This is the last stage, perhaps the most difficult
stage of all formal writing. It is easy to build the structure, but it takes more
time for polishing and giving finishing touches. Take, for example, the
construction of a house: up to the roofing (structure) stage the work is very
quick, but the finishing takes up most of the time before the building is ready.
i) It must be clear in informing the what, why, who, whom, when, where and how
of the research study.
ii) It should be neither too short nor too long. One should keep in mind the fact
that it should be long enough to cover the subject matter but short enough to
sustain the reader’s interest.
iii) It should be written in an objective style and simple language; correctness,
precision and clarity should be the watchwords of the scholar. Wordiness,
indirection and pompous language are barriers to communication.
iv) A good report must combine clear thinking, logical organization and sound
interpretation.
v) It should not be dull. It should be such as to sustain the reader’s interest.
vi) It must be accurate. Accuracy is one of the requirements of a report. It should
be factual with objective presentation. Exaggerations and superlatives should
be avoided.
vii) Clarity is another requirement of presentation. It is achieved by using familiar
words and unambiguous statements, explicitly defining new concepts and
unusual terms.
viii) Coherence is an essential part of clarity. There should be a logical flow of
ideas (i.e. continuity of thought) and a proper sequence of sentences. Each
sentence must be linked with the others so as to move the thoughts smoothly.
ix) Readability is an important requirement of good communication. Even a
technical report should be easily understandable. Technicalities should be
translated into language understandable by the readers.
x) A research report should be prepared according to the best composition
practices. Ensure readability through proper paragraphing, short sentences,
illustrations, examples, section headings, use of charts, graphs and diagrams.
xi) Draw sound inferences/conclusions from the statistical tables. But don’t repeat
the tables in text (verbal) form.
xii) Footnote references should be in proper form. The bibliography should be
reasonably complete and in proper form.
xiii) The report must be attractive in appearance, neat and clean whether typed or
printed.
xiv) The report should be free from mistakes of all types, viz. language mistakes,
factual mistakes, spelling mistakes, calculation mistakes etc.
The researcher should try to achieve these qualities in his report as far as possible.
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
5) What is meant by coherence?
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
7. Table of contents
8. List of tables
9. List of graphs/charts/
figures
10. List of cases, if any
11. Abstract or highlights (optional)
Let us discuss these items one by one in detail.
19.7.1 Prefatory Items
The various preliminaries to be included in the front pages of the report are
briefly narrated hereunder:
1) Title Page: The first page of the report is the title page. The title page should
carry a concise and adequately descriptive title of the research study, the name
of the author, the name of the institution to which it is submitted, and the date
of presentation.
4) Dedication: If the author wants to dedicate the work to whomsoever he/she
likes, he/she may do so.
7) List of Tables: The researcher will have collected a lot of data, analyzed
the same and presented it in the form of tables. These tables may be listed
chapter-wise, and the list presented with page numbers for easy location and
reference.
After the preliminary items, the body of the report is presented. It is the major
and main part of the report. It consists of the text and context chapters of the
study. Normally the body may be divided into 3 (three) parts.
i) Introduction
Generally this is the first chapter in the body of the report. It is devoted to
introducing the theoretical background of the problem and the methodology
adopted for attacking the problem.
This is the major and main part of the report. It is divided into several chapters
depending upon the number of objectives of the study, each being devoted to
presenting the results pertaining to some aspect. The chapters should be well
balanced, mutually related and arranged in logical sequence. The results should
be reported as accurately and completely as possible explaining as to their
bearing on the research questions and hypotheses.
Each chapter should be given an appropriate heading. Depending upon the need,
a chapter may also be divided into sections. The entire verbal presentation
should run in an independent stream and must be written according to best
composition rules. Each chapter should end with a summary and lead into the
next chapter with a smooth transition sentence.
While dealing with the subject matter of text the following aspects should be
taken care of. They are :
1) Headings
2) Quotations
3) Footnotes
4) Exhibits
Centre Head. A Centre head is typed in all capital letters. If the title is long,
the inverted pyramid style (i.e., the second line shorter than the first, the third
line shorter than the second) is used. All caps headings are not underlined.
Underlining is unnecessary because capital letters are enough to attract the
reader’s attention.
Example
Centre Subhead. The first letter of the first and the last word and all nouns,
adjectives, verbs and adverbs in the title are capitalized. Articles, prepositions
and conjunctions are not capitalized.
Example
Side Heads. Words in the side head are either written in all capitals or
capitalized as in the centre sub head and underlined.
Example: Import Substitution and Export Promotion
Paragraph Head. Words in a paragraph head are capitalized as in the centre
sub head and underlined. At the end, a colon appears, and then the paragraph
starts.
Example: Import Substitution and Export Promotion: The Seventh Five-Year
Plan of India has attempted ……
2) Quotations
How to Quote: a) All quotations should correspond exactly to the original in
wording, spelling, and punctuation.
3) Footnotes
Types of Footnotes: A footnote either indicates the source of the reference
or provides an explanation which is not important enough to include in the text.
In the traditional system, both kinds of footnotes are treated in the same form
and are included either at the bottom of the page or at the end of the chapter
or book.
In the modern system, explanatory footnotes are put at the bottom of the page
and are linked with the text with a footnote number. But source references are
incorporated within the text and are supplemented by a bibliographical note at
the end of the chapter or book.
Where to put the Footnote: Footnotes appear at the bottom of the page or
at the end of the chapter (before the appendices section).
b) In the text Arabic numerals are used for footnoting. Each new chapter begins
with number 1.
c) The number is typed half a space above the line or within parentheses. No
space is given between the number and the word. No punctuation mark is used
after the number.
d) The number is placed at the end of a sentence or, if necessary to clarify the
meaning, at the end of the relevant word or phrase. Commonly, the number
appears after the last quotation mark. In an indented paragraph, the number
appears at the end of the last sentence in the quotation.
4) Exhibits
Tables:
Tables can be numbered consecutively throughout the chapter as 1.1, 1.2, 1.3,…
wherein the first number refers to the chapter and the second number to the
table.
b) For the title and sub title, all capital letters are used.
c) Abbreviations and symbols are not used in the title or sub title.
1) Have the explanation and reference to the table been given in the text?
2) Is it essential to have the table for clarity and extra information?
3) Is the representation of the data comprehensive and understandable?
4) Is the table number correct?
5) Are the title and subtitle clear and concise?
6) Are the column headings clearly classified?
7) Are the row captions clearly classified?
8) Are the data accurately entered and represented?
9) Are the totals and other computations correct?
10) Has the source been given?
11) Have all the uncommon abbreviations been spelt out?
12) Have all footnote entries been made?
13) If column rules are used, have all rules been properly drawn?
1) Appendices
1) Original data
2) Long tables
3) Long quotations
4) Supportive legal decisions, laws and documents
5) Illustrative material
6) Extensive computations
7) Questionnaires and letters
8) Schedules or forms used in collecting data
9) Case studies / histories
10) Transcripts of interviews
2) Bibliographies
A bibliography contains the source of every reference cited in the footnote and
any other relevant works that the author has consulted. It gives the reader an
idea of the literature available on the subject that has influenced or aided the
author.
Bibliographical Information: The following information must be given for
each bibliographical reference.
3) Glossary
4) Index
Index may be either subject index or author index. Author index consists of
important names of persons discussed in the report, arranged in alphabetical
order. Subject index includes a detailed reference to all important matters
discussed in the report such as places, events, definitions, concepts etc., and
presented in alphabetical order. An index is not generally included in graduate/
post-graduate students’ research reports. However, if the report is prepared for
publication or intended as a work of reference, an index is desirable.
Typing Instructions: For typing a report, the following points should be kept in
mind.
Paper: A4-size white, thick, unruled paper is used.
Typing: Typing is done on only one side of the paper in double space.
Margins: Left side 1.5 inches, right side 0.5 inch, top and bottom 1.0 inch. But
on the first page of every major division, for example, at the beginning of a
chapter give 3 inches space at the top.
19.9 LET US SUM UP
The final stage of research investigation is reporting. The research results,
findings and conclusions drawn etc., have to be communicated. This can be
done in two ways i.e. orally or in writing. Written reports are more popular and
authentic even though oral reporting also has its place. Based on requirement
reports can be of two types viz., Technical reports and popular reports.
The total structure of a report can be divided into three main parts.
19.11 ANSWERS TO SELF ASSESSMENT
EXERCISES
Self Assessment Exercise C
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
19.13 FURTHER READING
The following textbooks may be used for a more in-depth study of the topics
dealt with in this unit.
1) V.P. Michael, Research Methodology in Management, Himalaya Publishing
House, Bombay.
2) O.R. Krishna Swamy, Methodology of Research in Social Sciences,
Himalaya Publishing House, Mumbai.
3) C.R. Kothari, Research Methodology, Wiley Eastern, New Delhi.
4) Berenson, Conrad and Raymond Cotton, Research and Report Writing for
Business and Economics, Random House, New York.