Research Methodology
Structure
Introduction
What is Research?
Types of Research
Exploratory Research
Conclusive Research
Process of Research
Research Applications in Social and Business Sciences
Features of a Good Research Study
Summary
Introduction
You might have watched on TV the panel discussion that takes place before a cricket
match starts. The facilitator asks the panel members questions like:
Which team will win the match today?
Will Sachin Tendulkar score a century?
What will be the score that the batting side will pile up?
In reply, the panel members quote factors such as the following:
The outcome of previous instances when the two sides met and the winning streak
of the teams at the venue
The number of centuries Tendulkar has scored on a particular ground and against the
opposing side
Weather conditions, etc.
The panel members are using the existing evidence or data
systematically to make match predictions. In other words, we could say that they are using
research methodology to answer the questions.
Research methodology refers to the procedures used in making systematic observations or
otherwise obtaining data, evidence, or information as part of a research project or study. It
defines what the activity of research is, how to proceed, how to measure progress and what
constitutes success. We will study more about the various aspects of research methodology in
this unit. But first, let us understand what research is.
name would be more appealing. One of the ways in which this can be done is by using the
scientific method of inquiry and following a structured approach to collect and analyse
information and then eventually subject it to the manager's judgement. This is no magic
mantra but a scientific and structured tool available to every manager, namely, research.
Thus, research refers to a wide range of activities involving a search for information, which is
used in various disciplines.
Research activities may range from a simple collection of facts (for example, the number of
MBA students who opt for higher studies abroad in a particular institute) to validation of
information (for example, is the new diet cola more popular among women?) to an
exhaustive theory and model construction (for example, constructing a model of India's
weather patterns in 2050 based on climate change projections). In this unit, we will discuss the
meaning of research, the types of research available to a researcher and the process of a
research study. We will also discuss the application of research in different areas of
management and describe the features of a good research study.
What is Research?
Different scholars have interpreted the term 'research' differently. Fred Kerlinger
(1986) stated that 'Scientific research is a systematic, controlled and critical investigation of
propositions about various phenomena'. Grinnell (1993) has simplified the debate and stated
'The word research is composed of two syllables, 're' and 'search'.
The dictionary defines the former as a prefix meaning 'again', 'anew' or 'over again'. Search is
defined as a verb meaning 'to examine closely and carefully', 'to test and try', or 'to probe'.
Together, they form a noun describing a careful, systematic, patient study and investigation in
some field of knowledge, undertaken to establish facts or principles. Thus, drawing from the
common threads of the above definitions, we derive that management research is an unbiased,
structured, and sequential method of enquiry, directed towards a clear implicit or explicit
business objective. This enquiry might lead to proving existing theorems and models or
arriving at new theories and models. Let us now understand each part of the definition. The
most important and difficult task of a researcher is to be as objective and neutral as possible.
Even though the researcher might have a lot of knowledge about the topic, he/she must not
try to deliberately get results in the direction of the hypotheses.
The second thing to be remembered is that you follow a structured and sequential method of
enquiry. For example, you may want to know what options are available to you if
you study abroad. You search the Internet and ask your relatives and friends about
the options for studying abroad. This is search and not research. For research, there must be a
structured approach that you need to follow, and only then will it be called scientific. Thus,
you may do a background analysis of how many students go abroad to study, and based on
this, form a hypothesis that 80 per cent of young Indians go to universities in the USA for
further study. Then, you conduct a small survey amongst the students who are intending to go
abroad for study. Based on the data collected, you are able to support or reject the
hypothesis. So, we can state that you have conducted a research study. We will understand the
process of research later in the unit. The last and most important aspect of our definition that
needs to be carefully considered is the decision-assisting nature of business research. As
Easterby-Smith et al. (2002) state, business research must have some practical consequences,
either immediately, when it is conducted for solving an immediate business problem, or when
the theory or model developed demands that managers and researchers work towards a goal,
whether immediate or futuristic; else the research loses its significance in the field of
management.
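The study-abroad example above can be made concrete with a small calculation. What follows is only a minimal sketch, not a prescribed method. It assumes that the hypothesis refers to students who intend to study abroad, that the survey is a simple random sample of such students, and that an exact binomial test is an acceptable way to check the hypothesised 80 per cent. The survey figures (50 respondents, 33 choosing the USA) and the function names are invented purely for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_p_value(successes, n, p0):
    """Two-sided exact p-value for H0: the true proportion equals p0.

    Adds up the probabilities of every outcome that is no more likely
    than the observed one when H0 is true.
    """
    observed = binomial_pmf(successes, n, p0)
    tolerance = observed * 1e-9  # guards against floating-point ties
    total = sum(
        binomial_pmf(k, n, p0)
        for k in range(n + 1)
        if binomial_pmf(k, n, p0) <= observed + tolerance
    )
    return min(1.0, total)


# Hypothetical survey data (assumed for illustration only).
respondents = 50
chose_usa = 33
hypothesised_share = 0.80

p_value = exact_binomial_p_value(chose_usa, respondents, hypothesised_share)
print(f"Observed share: {chose_usa / respondents:.2f}")
print(f"Two-sided p-value: {p_value:.4f}")
if p_value < 0.05:
    print("The data are not consistent with the 80 per cent hypothesis.")
else:
    print("The data do not contradict the 80 per cent hypothesis.")
```

The 0.05 cut-off used here is a common convention rather than a rule; the point is that the decision to support or reject the hypothesis follows from a stated procedure and not from the researcher's preference.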
The advantage of doing research is that one is able to take a decision with more confidence as
one has tested it through research. For example, you can conduct a study of young women
professionals and see that they have a need for a night crèche facility when they need to go
out of town on official duty. Here, you may conduct a small study to test what facilities
they would like in this crèche and how much they would be willing to pay for this facility. In
fact, it would not be wrong to say that without the tool of research there would be no new
business practices or methods, as no one would want to start something new (for example,
launch a new product, enter a new market segment, etc.) without testing through research.
Types of Research
Though every research conducted is unique, it is possible to categorize the research
approach that you may decide to take.
Sometimes, research may be done for a purely academic reason, out of a need to know.
For example, studies on employee dissatisfaction and attrition led to the study of the impact of
fixed working hours on family life and responsibilities. This study led to organizations
realizing that they need to have flexible working hours so that employees can better manage
their work-life balance. The context of this kind of study is vast and the time period is flexible.
This type of research is termed fundamental or basic research. On the other hand, you have
studies that are specific to a particular business decision. For example, you may want to find out
why a particular product is not doing well, in order to take corrective action.
Thus, the study you undertake would be of practical
value to the specific organization. It also has implications for immediate action. This
action-oriented research is termed applied research.
However, at this juncture we would like to advise the reader not to look at the two as
opposites of each other. It may happen that the research which started as applied might lead
to some fundamental and basic research, which expands the body of knowledge or vice versa.
The process followed in both basic and applied research is systematic and scientific; the
difference between them could simply be a matter of context and purpose.
Research studies can also be classified on the basis of the nature of inquiry or the
objective behind the study. On this basis, research can
be of the following types:
Exploratory research
Conclusive research
Exploratory Research
As the name suggests, exploratory research is used to gain a deeper understanding of
the issue or problem that is troubling the decision maker. The idea is to provide direction to
subsequent and more structured and rigorous research. The following are some examples of
exploratory research:
Let us say a diet food company wants to find out what kind of snacks customers like
to eat and where they generally buy healthy food from.
A reality show producer wants to make a show for children. He would like to know
what kind of shows children like to watch.
There is an investment bank that would like to know from its customers about what
kind of help they want from the bank while making their investments.
As can be seen from the examples above, an informal exploratory study would be needed.
Exploratory research studies are less structured, more flexible in approach and sometimes
could lead to some testable hypotheses. Exploratory studies are also conducted to develop the
research questionnaire.
(These will be discussed in detail in Unit 3.) The nature of the study being loosely structured
means that the researcher's skill in observing and recording all possible information will
increase the accuracy of the findings.
Conclusive Research
Conclusive research is carried out to test and validate the study
hypotheses. In contrast to exploratory research, these studies are more structured and definite.
The variables and constructs in the research are clearly defined; for example, studying
customer satisfaction with the different pizzas on the menu of Pizza Hut amongst heavy
consumers of pizza. This needs a clear definition of what customer satisfaction is and of
how heavy users will be identified.
The timeframe of the study and respondent selection are more formal and
representative. The emphasis on reliability and validity of the research findings is all the
more significant, as the results might need to be implemented.
Based on the nature of investigation required, conclusive research can
further be divided into the following types:
Descriptive research
Causal research
Descriptive research
The main goal of descriptive research is to describe the data and characteristics of
what is being studied. The census carried out every ten years by the Government of India is an
example of descriptive research. The census describes the number of people living in a
particular area. It also gives other related data about them. It is contemporary and
time-bound. Some more examples of descriptive research are as follows:
A study to distinguish between the characteristics of the customers who buy normal
petrol and those who buy premium petrol
A study to find out the level of involvement of middle level versus senior level
managers in a company's stock-related decisions
All the above research studies are conducted to test specific hypotheses and trends.
For example, in the second study we might test the hypothesis that the level of
involvement of senior level managers is higher than that of middle level managers in stock-related
decisions. Thus, these studies are more structured and require a formal, specific and
systematic approach to sampling, collecting information and testing the data to verify the
research hypotheses.
Causal research
Causal research studies explore the effect of one thing on another and more
specifically, the effect of one variable on another. For example, if a fast-food outlet currently
sells vegetarian fare, what will be the impact on sales if the price of the vegetarian food is
increased by 10 per cent? Causal research studies are highly structured and require a rigid
sequential approach to sampling, data collection and data analysis. This kind of research, like
research in pure sciences, requires experimentation to establish causality. In the majority of
situations, it is quantitative in nature and requires statistical testing of the information
collected.
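The fast-food pricing question above can be sketched as a small experiment. This is an illustration under stated assumptions, not the method the unit prescribes: ten comparable outlets are assumed to have been assigned at random, five keeping the old price and five raising it by 10 per cent, and an exact permutation test is used as one possible way of testing the difference in mean weekly sales. The sales figures and function name are invented.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(control, treated):
    """Two-sided exact permutation test on the difference in group means.

    Enumerates every way of splitting the pooled observations into two
    groups of the original sizes and counts how often the difference in
    means is at least as large as the one actually observed.
    """
    pooled = list(control) + list(treated)
    observed = abs(mean(control) - mean(treated))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, len(control)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen_set]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-9:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical weekly sales (units) per outlet, assumed for illustration only.
sales_old_price = [100, 102, 98, 101, 99]
sales_raised_price = [90, 92, 88, 91, 89]

p_value = exact_permutation_p_value(sales_old_price, sales_raised_price)
print(f"Mean sales at old price: {mean(sales_old_price)}")
print(f"Mean sales at raised price: {mean(sales_raised_price)}")
print(f"Exact two-sided p-value: {p_value:.4f}")
```

Random assignment of outlets is what allows a difference in sales to be read as an effect of the price change rather than of some other factor. With observational data the same calculation would show only an association.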
Process of Research
Any research starts with the need and desire to know more. This need might be purely
academic (basic or fundamental research) or there might be an immediate business decision
that requires an effective and workable solution (applied research).
While conducting research, information is gathered through a sound and scientific research
process. Each year, organizations spend enormous amounts of money on research and
development in order to maintain their competitive edge. Thus, we propose a broad
framework that can easily be followed in most research studies.
In the following paragraphs, we will briefly discuss the steps that in general any research
study might follow.
The Management Dilemma
Any research starts with the need and desire to know more. This is essentially the
management dilemma. It could be the researcher himself or herself, or it could be a business
manager who gets the study done by a researcher. The need might be purely academic (basic
or fundamental research) or there might be an immediate business decision that requires an
effective and workable solution (applied research).
Defining the research problem
This is the first and the most critical step of the research journey. For example, a soft
drink manufacturer who is making and selling aerated drinks now wants to expand his
business. He wants to know whether he should move into bottled water or
look at fruit juice based drinks. Thus, a comprehensive and detailed survey of the
bottled water as well as the fruit juice market will have to be done. He will also have to
decide whether he wants to know consumer acceptance of a new drink. Thus, there has to be
complete clarity in the mind of the researcher regarding the information he must collect.
Formulating the research hypotheses
In the model, we have drawn broken lines to link the problem definition
stage to the hypotheses formulation stage. The reason is that every research study might not
always begin with a hypothesis; in fact, the task of the study might be to collect detailed data
that might lead to, at the end of the study, some indicative hypotheses to be tested in
subsequent research. For example, while studying the lifestyle and eating-out behaviour of
consumers at Pizza Hut, one may find that the young student group consumes more pizzas.
This may lead to a hypothesis that young consumers consume more pizzas than older
consumers.
A hypothesis is, in fact, an assumption about the expected results of the research. For
example, in the above example of work-life balance among women professionals, we might
start with a hypothesis that the higher the work-family conflict, the higher the intention to leave the
job. Conversion of the defined problem into working hypotheses will be discussed in Unit 2.
Developing the research proposal
Once the management dilemma has been converted into a defined problem and a
working hypothesis, the next step is to develop a plan of investigation. This is called the
research proposal. The reason for its placement before the other stages is that before you start
you need to spell out the research problem, the scope and the objectives of the study and the
operational plan for achieving this. The proposal is a flexible contract about the proposed
methodology and once it is made and accepted, the research is ready to begin. The
formulation of a research proposal, its types and purpose will be explained in the next unit.
Research design formulation
Based on the orientation of the research, i.e., exploratory, descriptive or causal, the
researcher has a number of techniques for testing the stated objectives. These are termed
research designs. The main task of the design is to explain how the research
problem will be investigated. There are different kinds of designs available to you while conducting
a research study. These will be discussed in detail in Unit 3.
Sampling design
It is not always possible to study the entire population. Thus, one goes about studying
a small and representative sub-group. This sub-group is referred to as the sample of the study.
There are different techniques available for selecting the group based on certain assumptions.
The most important criterion for this selection is the representativeness of the sample
selected from the population under study.
Two categories of sampling designs available to the researcher are probability and
non-probability. In probability sampling designs, the population under study is finite and
one can calculate the probability of a person being selected. On the other hand, in
non-probability designs, one cannot calculate the probability of selection. The choice of one or
the other depends on the nature of the research, the degree of accuracy required (probability
sampling techniques yield more accurate results) and the time and financial resources
available for the research. Another important decision the researcher needs to take is to
determine the best sample size to be selected in order to obtain results that can be considered
representative of the population under study.
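As an illustration of a probability sampling design, the sketch below draws a simple random sample, which is only one of several probability designs and is chosen here for brevity. The population of 500 roll numbers, the sample size of 50 and the function names are assumptions made for the example.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw a sample without replacement; every member is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, sample_size)


def selection_probability(population_size, sample_size):
    """Chance that any one member is included in a simple random sample."""
    return sample_size / population_size


# Hypothetical finite population: 500 students identified by roll number.
students = list(range(1, 501))
sample = simple_random_sample(students, sample_size=50, seed=42)

print(f"Sampled {len(sample)} of {len(students)} students")
print(f"Selection probability per student: "
      f"{selection_probability(len(students), len(sample)):.2f}")
```

Because the population is finite and fully listed, the probability of selection can be stated in advance. In a non-probability design, such as approaching whoever is conveniently available, no such figure can be calculated.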
and the data collection plan helps in obtaining information from the specified population. The
data collection methods may be classified into secondary and primary data methods. Primary data,
as the name suggests, is original and collected first hand for the problem under study. There
are a number of primary data methods, such as personal/telephonic interviews, mail surveys and
questionnaires.
Secondary data is information that has been collected and compiled earlier, for example,
company records, magazine articles, expert opinion surveys, sales records, customer
feedback, government data and previous research done on the topic of interest. This step in
the research process requires careful and rigorous quality checks to ensure the reliability and
validity of the data collected.
Marketing function:
Research is the lifeline in the field of marketing, where it is carried out on a vast range
of topics and is conducted both in-house by the organization itself and outsourced to external
agencies. This could be related to the 4 Ps: product, price, place and promotion.
e) The research at every stage and at any cost must maintain the highest ethical
standards.
f) And lastly, the reason for a structured, ethical, justifiable and objective approach is the
fact that the research carried out by you must be replicable. This means that the process
followed by you must be 'reliable', i.e., in case the study is carried out under similar
conditions it should be able to reveal similar results.
Summary
Let us recapitulate the main points discussed in this unit:
Research may be done for a purely academic reason or a need to know (fundamental or
basic research) or it could be undertaken as it would be of practical value to an
organization with implications for immediate action (applied research).
Different kinds of studies are carried out in the area of business management, such
as marketing, finance, human resources and operations, each having their own
orientation and approach.
Structure
Introduction
Defining the Research Problem
Management Problem vs. Research Problem
Problem Identification Process
Components of the Research Problem
Formulating the Research Hypotheses
Types of Research Hypotheses
Writing a Research Proposal
Contents of a Research Proposal
Types of Research Proposal
Summary
Keywords
Introduction
In the last unit, you were introduced to the meaning of research as well as its types, process and
features. In this unit, we will focus on the research problem and the formulation of the
research hypothesis. The most important aspect of the business research method is to identify
the 'what', i.e., what is the exact research question to which you are seeking an answer. The
second important thing is that the process of arriving at the question should be logical and
follow a line of reasoning that can lend itself to scientific inquiry. This reasoning approach
needs to be converted into a possible research question. And based on the initial study of the
research topic, one should be able to make certain assumptions which can lend direction to
the study as research hypotheses. Thus, in this unit, we will understand how to identify a
problem that can be subjected to research and help us reduce decision risks.
commitment. This might be further different for males and females. These kinds of problems
require a model or framework to be developed to define the research approach.
We can say that a management problem is a difficulty faced by the decision maker and by itself
cannot be tested. In case the decision maker is a business manager, the management
problem is the search for answers to the problem faced by the manager, as in the above example
of how to reduce the turnover rate in a BPO company. This problem has to be reduced to a
simpler form of research question. And as said earlier, there can be more than one research
problem that can help the manager in taking a decision. It will depend on how the researcher
looks at it. For example, he may say that the research problem is:
Thus, as you can see, we can have many questions. Finally, the research problem you think is
most likely to give a possible solution is the one you decide to take as your research problem.
The process of identifying the research problem involves the following steps:
1. Management decision problem: The entire process explained above begins with
the identification of the difficulty encountered by the business manager/researcher. The
manager could do the study himself or give it to a researcher or a research agency. This
step requires the researcher/decision maker to carry out a problem appraisal, which
would involve a complete audit of the origin and symptoms of the diagnosed business
problem.
2. Discussion with subject experts: The next step involves getting the problem in the
right perspective through discussions with industry and subject experts. These
individuals are knowledgeable about the industry as well as the organization. They
could be found both within and outside the company. The information about the current
state and the future projections can be obtained in an interview. Thus, the researcher
must have a predetermined set of questions related to the doubts experienced in
problem formulation. It should be remembered that the purpose of the interview is
simply to gain clarity on the problem area and not to arrive at any kind of conclusion or
solutions to the problem. For example, for the organic food study, which is mentioned
in Table 2.1 as a decision problem, the researcher might decide to go to food experts
like doctors and dieticians to seek their opinion. This data should, in practice, be
supported with secondary data in the form of theory as well as organizational facts.
as the detailed organizational structure, with the job descriptions. It is to be
remembered here that the organizational data might not always be essential, for
example in case of basic research, where the nature of study is not company specific
but general.
6. Management research problem: Once the audit process of secondary review and
interviews and survey is over, the researcher is ready to focus and define the issues of
concern that need to be investigated further, in the form of an unambiguous and clearly
defined research problem. Here, it is important to remember that simply using the word
'problem' does not mean that there is something wrong that has to be corrected, it
simply indicates the gaps in information or knowledge base available to the researcher.
These might be the reason for his inability to take the correct decision. Second,
identifying all possible dimensions of the problem might be a monumental and
impossible task for the researcher. For example, the lack of sales of a newly launched
product could be due to consumer perceptions about the product, ineffective supply
chain, gaps in the distribution network, competitor offerings or advertising
ineffectiveness. It is the researcher who has to identify and then refine the most
probable cause of the problem and formalize it as the research problem. This would be
achieved through the five preliminary investigative steps indicated above. Once done,
the research problem has to be clearly defined in terms of certain components. This
will be discussed in the next section.
7. Theoretical foundation and model building: Having identified and defined the
variables under study, the next step is to try and form a theoretical framework. It can be
best understood as a schema or network of the probable relationship between the
identified variables. Another advantage of the model is that it clearly shows the
expected direction of the relationships between the concepts. There is also an
indication of whether the relationship would be positive or negative.
This step, however, is not mandatory as sometimes the objective of the research is to
explore the probable variables that might explain the observed phenomena and the
outcome of the study helps to finally develop a conceptual model.
8. Statement of research objectives: Next, the research question(s) that were
formulated need to be broken down into tasks or objectives that need to be met in order
to answer the research question. This section makes active use of verbs such as 'to find
out', 'to determine', 'to establish', and 'to measure' so as to spell out the objectives of the
study. In certain cases, the main objectives of the study might need to be broken down
into sub-objectives which clearly state the tasks to be accomplished.
In the organic food research, the objectives and sub-objectives of the study were as follows:
To address the problems of clarity and focus, we need to understand the components of a well
defined problem. These are:
a. The unit of analysis: The researcher must specify in the problem statement the
individual(s) from whom the research information is to be collected and on whom the
research results are applicable. This could be the entire organization, departments,
groups or individuals.
b. Research variables: The research problem also requires identification of the key
variables under study. A variable is any concept that varies and to which we can assign
numerals or values. A variable may be dichotomous in nature, that is, it can possess only
two values, such as male/female or customer/non-customer. Variables whose values can fit only
into a prescribed number of categories are discrete variables, for example, Strongly
Disagree (1) to Strongly Agree (5). There are still others that can take an indefinite set of values,
e.g., age, income and production data. These are called continuous variables.
Variables can be further classified into four categories, depending on the role they play in
the problem under consideration. These are:
1. Dependent variables
2. Independent variables
3. Moderating variables
4. Extraneous variables
1. Dependent variable (DV): The most important variable to be studied and analysed
in a research study is the effect, or dependent, variable. The entire research process is
involved in either describing this variable or investigating the probable causes of the
observed effect. Thus, this must be a measurable variable. For example, in the organic
food study, the consumer's purchase intentions as well as sales of organic food
products in the domestic market could serve as dependent variables.
3. Moderating variables (MV): Moderating variables are the ones that have a strong
effect on the relationship between the independent and dependent variables. These
variables must be considered in the expected pattern of relationship as they modify the
direction as well as the magnitude of the independent-dependent association. In the
organic food study, the strength of the relation between attitude and intention might be
modified by the education and the income level of the buyer. Here, education and
income are the moderating variables.
There might be instances when confusion might arise between a moderating variable
and an independent variable. Consider the following situation:
Proposition 1: Turnover intention (DV) is an inverse function of organizational
commitment (IV), especially for workers who have a higher job satisfaction level
(MV).
Another study might have the following proposition to test:
Proposition 2: Turnover intention (DV) is an inverse function of job satisfaction (IV), especially for
workers who have a higher organizational commitment (MV).
Thus, the two propositions study the relation between the same three variables. However, the
decision to classify one as independent and the other as moderating depends on the
research interest of the decision maker. At this stage, we can clearly distinguish
between the different kinds of variables discussed above. An independent variable is
the prime antecedent condition which is qualified as explaining the variance in the
dependent variable; the moderating variable is a contributing variable which might
impact the defined relationship.
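A moderating effect of the kind stated in Proposition 1 can be made visible by estimating the IV-DV relationship separately within each level of the moderator. The sketch below is illustrative only: the scores are invented, the two-group split on job satisfaction is an assumption, and a least-squares slope is just one simple way to summarise the strength of the relationship.

```python
from statistics import mean


def slope(x_values, y_values):
    """Least-squares slope of y on x (change in y per unit change in x)."""
    x_bar, y_bar = mean(x_values), mean(y_values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
    denominator = sum((x - x_bar) ** 2 for x in x_values)
    return numerator / denominator


# Hypothetical scores, assumed for illustration only.
commitment = [1, 2, 3, 4, 5]                     # IV: organizational commitment
turnover_high_satisfaction = [9, 7, 5, 3, 1]     # DV for workers with high job satisfaction (MV)
turnover_low_satisfaction = [8, 7.5, 7, 6.5, 6]  # DV for workers with low job satisfaction (MV)

slope_high = slope(commitment, turnover_high_satisfaction)
slope_low = slope(commitment, turnover_low_satisfaction)

print(f"Slope for high-satisfaction workers: {slope_high:.2f}")
print(f"Slope for low-satisfaction workers: {slope_low:.2f}")
```

In both groups turnover intention falls as commitment rises, but the fall is steeper among the high-satisfaction workers, which is the pattern Proposition 1 describes. Swapping the roles of commitment and satisfaction in the same data would correspond to Proposition 2.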
4. Extraneous variables are outside the domain of the study and responsible for
chance variations, but in some instances, their effect might need to be controlled.
Consumer liking for the electronic advertisement for the new diet drink will have
positive impact on brand awareness of the drink
High organizational commitment will lead to lower turnover intention
A hypothesis must be measurable and quantifiable
A hypothesis is a conjectural statement based on the existing literature and theories
about the topic and not based on the gut feel of the researcher
The validation of the hypothesis would necessarily involve testing the statistical
significance of the hypothesized relation
Directional Hypothesis: This type of hypothesis suggests the outcome the investigator
expects at the end of the study. Scientific journal articles generally use this form of
hypothesis. The investigator bases this hypothesis on the trends apparent from previous
research on this topic. Considering the previous example, a researcher may state the
hypothesis as, “High school students who participate in extracurricular activities have a lower
GPA than those who do not participate in such activities.” Such hypotheses provide a definite
direction to the prediction.
Causal Hypothesis: Some studies involve a measurement of the degree of influence of one
variable on another. In such cases, the researcher states the hypothesis in terms of the effect
of variations in a factor on another factor. This causal hypothesis is said to be bivariate
because it specifies two aspects, the cause and the effect. For the example mentioned, the
causal hypothesis will state, “High school students who participate in extracurricular
activities spend less time studying which leads to a lower GPA.” When verifying such
hypotheses, the researcher needs to use statistical techniques to demonstrate the presence of a
relationship between the cause and effect. Such hypotheses also need the researcher to rule
out the possibility that the effect is a result of a cause other than what the study has examined.
Executive summary
This is a broad overview that gives the purpose and objective of the study. In a short
paragraph the author gives a summary about the management problem/academic concern.
This is the detailed background of the management problem. It requires a sequential and
systematic build-up to the research questions and why the study should be done. The
researcher must be able to demonstrate that there could be a number of ways in which the
management dilemma could be answered.
For example, a pharmaceutical company develops a new hair growing solution and packages
it in two different types of bottles. The company wants to know which one people will buy. The
product testing could be done internally in the company; the two sample bottles could be
formulated and tested for their acceptability amongst likely consumers or retailers keeping
the product; or the two types could be developed, test launched and tested for their
sales potential. The researcher thus must spell out all the possibilities and then systematically
and logically argue for the research study. This section must be objective and written in
simple language, avoiding any metaphors or idioms to dramatize the plan. The logical
arguments should speak for themselves and be able to convince the reader of the need for the
study in order to find probable solutions to the management dilemma.
The clear definition of the problem broken down into specific objectives is the next step. This
section is crisp and to the point. It begins by stating the main thrust area of the study. For
example, in the above case, the problem statement could be:
To test the acceptability of a spray or capped bottle dispenser for a new hair growing
formulation.
The basic objectives of this research would be to:
Research design
This is the working section of the proposal as it needs to indicate the logical and systematic
approach intended to be followed in order to achieve the listed objectives. This would include
specifying the population to be studied, the sampling process and plan, sample size and
selection. It also details the information areas of the study and the probable sources of data,
i.e., the data collection methods. In case the process must include an instrument design, then
the intended approach needs to be detailed here. A note of caution must be given here: this is
not a simple statement of the sampling and data collection plan; it requires a clear and logical
justification for using the chosen techniques over the other methods available for research.
Results and outcomes of the research
Here the clear terms of contract or expected outcomes of the study must be spelt out. This is
essential even if it is academic research. The expected deliverables need to clearly
demonstrate how the researcher intends to link the findings of the proposed study design to
the stated research objectives. For example, in the pharmaceutical study, the expected
deliverables are:
External organizational proposals have their basis or origin within the company, but the scope
and nature of the study require more structured and objective research. For example, if the
above stated pharmaceutical company wishes to explore the herbal cosmetic market and
wants a market analysis and feasibility study conducted, the PR might be spelt out to
solicit proposals to address the research question, and execute an outsourced research
Summary
Let us recapitulate the main points discussed in this unit:
The most important step in research is to identify the decision to be made and how it
can be converted into a research problem
The problem definition process is a well-integrated, linked and stepwise process.
These include the unit of analysis—which is the individual or group that is to be
studied. The second element is a clear definition of the variables under study
By the time the research problem is identified and stated, the researcher should be
able to specify what is the causal or independent variable and which is the effect or
dependent variable under study. Also, it is best to acknowledge the effect or presence
of any external variables which might have a contingent effect on the cause and effect
relationship that is to be studied. These can be further classified as moderator,
intervening, and extraneous variables
It is advisable for the researcher to construct a model or theoretical framework based
on the process of problem formulation. This is recommended but not necessarily an
essential step, as in some studies the intent is to conduct the study
and then arrive at a theory or a model
The problem formulation process ultimately ends as a research hypothesis
An entire stepwise document in the shape of a formal plan to be followed is made.
This is called the research proposal
There are three different kinds of research proposals available to the researcher -
academic, internal and external
Introduction
Nature and Classification of Research Designs
Exploratory Research Designs
Secondary Resource Analysis
Case Study Method
Expert Opinion Survey
Focus Group Discussions
Descriptive Research Designs
Cross-Sectional Studies
Longitudinal Studies
Experimental Designs
Errors Affecting Research Design
Summary
Keywords
Introduction
In the last unit, we studied the defining of the research problem and the formulation of
the research hypothesis. However, in research, it is not enough to define the problem and
formulate the hypotheses. Research scholars and managers alike have found that
most research studies do not result in any significant findings because of a faulty research
design. Most researchers feel that once the problem is defined and hypotheses are made, one
can go ahead and collect the data on a specified group, or sample, and then analyse it using
statistical tests. However, unless the formulated research problem and the study hypotheses
are tested through a well-defined plan, answers are going to be based on trial and error rather
than any sound logic.
Several design approaches are available to a researcher, and the choice depends on whether the study is
of an exploratory or conclusive nature. The designs range from very simple, loosely structured ones to
highly scientific experimentation. In this unit, we will study the complete choice of designs,
along with detailed reasoning on which design should be used under what conditions. As
with experiments in science, business research is also subject to error, which
needs to be understood and controlled to give the decision maker more accurate results.
Thyer (1993) states that, 'A traditional research design is a blueprint or detailed plan
for how a research study is to be completed—operationalizing variables so they can be
measured, selecting a sample of interest to study, collecting data to be used as a basis for
testing hypotheses, and analysing the results'. Selltiz et al. (1962) state that, 'A research
design is the arrangement of conditions for collection and analysis of data in a manner that
aims to combine relevance to the research purpose with economy in procedure'.
One of the most comprehensive and holistic definitions has been given by Kerlinger
(1995). He refers to a research design as, '... a plan, structure and strategy of investigation
so conceived as to obtain answers to research questions or problems. The plan is the complete
scheme or programme of the research. It includes an outline of what the investigator will do
from writing the hypotheses and their operational implications to the final analysis of data'.
Thus, the formulated design must ensure three basic principles:
Convert the research question and the stated assumptions/ hypotheses into variables
that can be measured.
Specify the process to complete the above task.
Specify the 'control mechanism(s)' to follow so that the effect of other variables that
could influence the outcome of the study is controlled.
The researcher has a number of designs available to him for investigating the research
objectives. The classification that is universally followed is the one based upon the objective
or the purpose of the study. A simple classification based upon the research needs
ranges from the simple and loosely structured to the specific and more formally structured. The
best way is to view the designs on a continuum. Hence, in case the research objective is
diffuse and requires refinement, one uses the exploratory design, and this might lead to the
slightly more concrete descriptive design, in which one describes all the aspects of the constructs
and concepts under study. This leads to a more structured and controlled experimental
research design.
For example, a university professor might decide to do an exploratory analysis of the new
channels of distribution that are being used by the marketers to promote and sell products and
services. To do this, a structured and defined methodology might not be essential as the basic
objective is to understand how to teach this to students of marketing. The researcher can
make use of different methods and techniques in exploratory research, such as secondary data
sources, unstructured or structured observations, expert interviews and focus group
discussions with the concerned respondent group. Here, we will technique to collect the
information required to answer the research problem, given the created framework. Thus,
research designs have a critical and directive role to play in the research process. The
execution details of the research question to be investigated are referred to as the research
design.
Secondary Resource Analysis
Secondary sources of data, as the name suggests, are previously collected findings,
in facts and figures, which have been authenticated and
published. Consulting them is a fast and inexpensive way of collecting information. The past details can
sometimes point out to the researcher that his proposed research is redundant and has already
been established earlier. Alternatively, the researcher might find that a small but significant
aspect of the concept has not been addressed and should be studied. For example, a marketer
might have extensively studied the potential of the different channels of communication for
promoting a 'home maintenance service' in Greater Mumbai. However, none of the mixes
he tested showed any impact, which points to the need for studying the potential of WOM (word
of mouth) in a close-knit and predominantly Parsi colony, where this might be the most
effective culture-dependent technique. Thus, such insights might provide
leads for carrying out an experimental and conclusive research subsequently.
Expert Opinion Survey
At times, there might be a situation when the topic of research is such that there is no
previous information available on it. In these cases, it is advisable to seek help from experts
who might be able to provide some valuable insights based upon their experience in the field
or with the concept. This approach of collecting particulars from significant and erudite
people is referred to as the expert opinion survey. The methodology might be formal and
structured, which is useful when the opinions are to be authenticated or supported by secondary or
primary research, or it might be fluid and unstructured and require in-depth
interviewing of the expert. For example, the evaluation of the merit of marketing organic
food products in the domestic Indian market cannot be done with the help of secondary data
as no such structured data sources exist. In this case the following can be contacted:
On the other hand, more than ten participants might lead to confusion rather than any fruitful
discussion and would be unwieldy to manage. Generally, these discussions are carried
out in neutral settings by a trained observer, also referred to as the moderator. The moderator,
in most cases, does not participate in the discussion. His prime objective is to manage a
relatively non-structured and informal discussion. He initiates the process and then
steers it only towards the desired information needs.
Sometimes, there is more than one observer to record the verbal and non-verbal
content of the discussion. The conduct and recording of the dialogue require considerable
skill and behavioural understanding. Focus group discussions were carried out with the
typical consumers/buyers of grocery products. The objective was to establish the level of
awareness about health hazards, environmental concerns and awareness of organic food
products. A series of such focus group discussions carried out across four metros—Delhi,
Mumbai, Bengaluru and Hyderabad—revealed that even though the new age consumer was
concerned about health, the awareness about organic products varied from extremely low to
non-existent.
Descriptive Research Designs
As the name implies, the objective of descriptive research studies is to provide a
comprehensive and detailed explanation of the phenomena under study. The intended
objective might be to give a detailed sketch or profile of the respondent population being
studied. For example, to design an advertising and sales promotion campaign for high-end
watches, a marketer would require a holistic profile of the population that buys such luxury
products. Thus, a descriptive study, which generates data on the who, what, when, where, why
and how of luxury accessory brand purchase, would be the design necessary to fulfil the
research objectives.
Descriptive research thus leads to conclusive studies. Although such research lacks
the precision and accuracy of experimental designs, it lends itself to a wide range of
situations and is more frequently used in business research. Based on the time period of the
collection of the research information, descriptive research is further subdivided into two
categories: cross-sectional studies and longitudinal studies.
The cross-sectional study is carried out at a single moment in time and thus the
applicability is most relevant for a specific period. For example, the attitude of
Americans towards Asian-Americans captured by cross-sectional studies was vastly
different pre- and post-9/11, and a study done in 2012 would reveal a different attitude and behaviour
towards the population, which might not be absolutely in line with that found earlier.
Secondly, these studies are carried out on a section of respondents from the
population units under study (e.g., organizational employees, voters, consumers,
industry sectors). This sample is under consideration and under investigation only for
the time coordinates of the study.
There are also situations in which the population being studied is not of a
homogeneous nature and there are different groups that exist. Thus it becomes essential to
study the sub-segments independently. This variation of the design is termed as multiple
cross-sectional studies. Usually this multi-sample analysis is carried out at the same moment
in time. However, there might be instances when the data is obtained from different samples
at different time intervals and then they are compared. Cohort analysis is the name given to
such cross-sectional surveys conducted on different sample groups at different time intervals.
Cohorts are essentially groups of people who share a time period or have experienced an event
that took place at a particular time period. For example, in the 9/11 case, if we study and
compare the attitudes of middle-aged Americans versus teenaged Americans towards Asian-
Americans post the event, it would be a cohort analysis.
The technique is especially useful in predicting election results. Cohorts of males–
females, different religious sects, urban–rural or region-wise cohorts are studied by leading
opinion poll experts like Nielsen, Gallup and others. Thus, cross-sectional studies are
extremely useful for studying current patterns of behaviour or opinion.
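The cohort comparison described above can be sketched in a few lines of Python. This is only an illustration: the cohort labels, the 1-to-5 attitude scale and every score below are invented for the example and do not come from any actual survey.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey records: (cohort label, attitude score on a 1-5 scale).
# All values are made up purely to illustrate the mechanics.
responses = [
    ("teenaged", 4), ("teenaged", 5), ("teenaged", 3),
    ("middle-aged", 2), ("middle-aged", 3), ("middle-aged", 2),
]

def mean_by_cohort(records):
    """Group scores by cohort and return the mean score of each cohort."""
    grouped = defaultdict(list)
    for cohort, score in records:
        grouped[cohort].append(score)
    return {cohort: mean(scores) for cohort, scores in grouped.items()}

cohort_means = mean_by_cohort(responses)
for cohort, avg in cohort_means.items():
    print(f"{cohort}: mean attitude score = {avg:.2f}")
```

A real cohort analysis would draw a separate sample for each cohort and test whether the difference between the cohort means is statistically significant, which this sketch does not attempt.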
Longitudinal Studies
A single sample of the identified population that is studied over a longer period of
time is termed as a longitudinal study design. A panel of consumers specifically chosen to
study their grocery purchase pattern is an example of a longitudinal design. There are certain
distinguishing features of the same:
Panels that retain the same group of respondents throughout are defined as true panels, and the ones using a different group
every time are called omnibus panels. The advantages of a true panel are that it has a
more committed sample group that is likely to tolerate extended or long data collection
sessions. Secondly, collecting the profile information is a one-time task and need not be repeated
every time.
However, the problem is getting a committed group of people for the entire study period.
Secondly, there is an element of mortality or attrition where the members of the panel might
leave midway and the replaced new recruits might be vastly different and could skew the
results in an absolutely different direction. A third disadvantage is the highly structured study
situation, which might be responsible for consistent and structured behaviour that might
not reflect real or field conditions.
Experimental Designs
Experimental designs are conducted to infer causality. In an experiment, a researcher
actively manipulates one or more causal variables and measures their effects on the
dependent variables of interest. Since any changes in the dependent variable may be caused
by a number of other variables, the relationship between cause and effect often tends to be
probabilistic in nature. It is virtually impossible to prove causality. One can only infer a
cause-and-effect relationship.
The necessary conditions for making causal inferences are: (i) concomitant variation,
(ii) time order of occurrence of variables and (iii) absence of other possible causal factors.
The first condition implies that cause and effect variables should have a high correlation. The
second condition means that the causal variable must occur prior to or simultaneously with
the effect variable. The third condition means that all other variables except the one whose
influence we are trying to study should be absent or kept constant.
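As a rough illustration of the first condition, concomitant variation, the following Python sketch computes the Pearson correlation between a presumed cause and its effect. The variable names and the weekly figures are hypothetical. A high correlation speaks only to condition (i); it says nothing about time order or about the absence of other causal factors.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly data: advertising spend (cause) and sales (effect).
ad_spend = [10, 20, 30, 40, 50]
sales = [12, 25, 31, 48, 55]

r = pearson_r(ad_spend, sales)
print(f"correlation between ad spend and sales: {r:.3f}")
```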
There are two conditions that should be satisfied while conducting an experiment. These are:
Internal validity: Internal validity tries to examine whether the observed effect on a
dependent variable is actually caused by the treatments (independent variables) in
question. For an experiment to possess internal validity, all the other causal
factors except the one whose influence is being examined should be absent. Control of
extraneous variables is a necessary condition for inferring causality. Without internal
validity, the experiment gets confounded.
External validity: External validity refers to the generalization of the results of an
experiment. The concern is whether the results of an experiment can be generalized
beyond the experimental situations. If it is possible to generalize the results, then to
what population, settings, times, independent variables and dependent variables can
the results be projected? It is desirable to have an experiment that is valid both
internally and externally. However, in reality, a researcher might have to trade off one
type of validity for another. To remove the influence of an extraneous variable, a
researcher may set up an experiment with artificial settings, thereby increasing its
internal validity. However, in the process the external validity will be reduced.
There are four types of experimental designs. These are explained below:
1. Completely randomized design: Suppose we know that the sales of a product are influenced by the price. In this case,
sales are the dependent variable and the price is the independent variable. Let there be
three levels of price, namely, low, medium and high. We wish to determine the most
effective price level, that is, the one at which the sales are the highest. Here, the test units are the stores,
which are randomly assigned to the three treatment levels. The average sales for each
price level are computed and examined to see whether there is any significant
difference in sales at various price levels. The statistical technique to test for such a
difference is called analysis of variance (ANOVA).
The main limitation of completely randomized design is that it does not take into
account the effect of extraneous variables on the dependent variable. The possible
extraneous variables in the present example could be the size of the store, the
competitor's price and the price of the substitute product in question. This design
assumes that all the extraneous factors have the same influence on all the test units
which may not be true in reality. This design is very simple and inexpensive to
conduct.
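The two steps of this design, random assignment of stores to price levels followed by an analysis of variance, can be sketched in Python as follows. The store names, the number of stores (twelve), the fixed seed and all sales figures are assumptions made for illustration only. The function computes just the F statistic; judging its significance would still require an F table or a statistics package.

```python
import random
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over lists of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-treatment sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-treatment sum of squares: spread of observations around their group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Step 1: randomly assign hypothetical stores to the three price levels.
rng = random.Random(42)  # fixed seed so the assignment is reproducible
stores = [f"store_{i}" for i in range(1, 13)]
rng.shuffle(stores)
assignment = {
    "low": stores[0:4],
    "medium": stores[4:8],
    "high": stores[8:12],
}

# Step 2: made-up unit sales observed for the stores at each price level.
sales = {
    "low": [52, 48, 50, 54],
    "medium": [45, 47, 44, 46],
    "high": [38, 40, 36, 42],
}

f_stat = one_way_anova_f(list(sales.values()))
print(assignment)
print(f"F statistic = {f_stat:.2f}")
```

A large F value suggests that the average sales differ across price levels by more than the variation within each level would explain.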
2. Randomized block design: In the example considered in the completely randomized design, the price level (low,
medium and high) was considered as an independent variable and all the test units
(stores) were assumed to be more or less equal. However, all stores may not be of the
same size and, therefore, can be classified as small, medium and large size stores. In
this design, the extraneous variables, like the size of the store, could be treated as
different blocks. Now the treatments are randomly assigned to the blocks in such a way
that each treatment appears in each block at least once. The purpose of forming these
blocks is that it is hoped that the scores of the test units within each block would be
more or less homogeneous when the treatment is absent. What is assumed here is that
block (size of the store) is correlated to the dependent variable (sales). It may be noted
that blocking is done prior to the application of the treatment. In this experiment, one
might randomly assign twelve small-sized stores to three price levels in such a way that
there are four stores for each of the three price levels. Similarly, twelve medium-sized
stores and twelve large-sized stores may be randomly assigned to three price levels.
Now the technique of analysis of variance could be employed to analyse the effect of
treatment on the dependent variable and to separate the influence of the extraneous
variable (size of store) from the experiment.
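The blocked assignment described above, twelve stores of each size split equally across the three price levels, might look like the following Python sketch. The store identifiers, the helper name `assign_within_blocks` and the seed are hypothetical; the point is only that randomization happens separately inside each block.

```python
import random

def assign_within_blocks(blocks, treatments, seed=0):
    """Randomly assign the units of each block to treatments in equal numbers."""
    rng = random.Random(seed)
    plan = {}
    for block_name, units in blocks.items():
        shuffled = units[:]          # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        per_treatment = len(shuffled) // len(treatments)
        plan[block_name] = {
            t: shuffled[i * per_treatment:(i + 1) * per_treatment]
            for i, t in enumerate(treatments)
        }
    return plan

# Hypothetical store identifiers: twelve stores in each size block.
blocks = {
    size: [f"{size}_{i}" for i in range(1, 13)]
    for size in ("small", "medium", "large")
}
price_levels = ["low", "medium", "high"]

plan = assign_within_blocks(blocks, price_levels, seed=7)
for size, by_price in plan.items():
    for price, stores in by_price.items():
        print(size, price, stores)
```

Because every price level appears in every block, the effect of store size can later be separated from the effect of price in the analysis of variance.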
3. Factorial design: A factorial design may be employed to measure the effect of two
or more independent variables at various levels. The factorial designs allow for
interaction between the variables. An interaction is said to take place when the
simultaneous effect of two or more variables is different from the sum of their
individual effects. An individual may have a high preference for mangoes and may also
like ice-cream, which does not mean that he would like mango ice cream, leading to an
interaction. The sales of a product may be influenced by two factors, namely, price
level and store size. There may be three levels of price: low (A1), medium (A2) and
high (A3). The store size could be categorized into small (B1) and big (B2). This could be
conceptualized as a two-factor design with information reported in the form of a table.
In the table, each level of one factor may be presented as a row and each level of
the other factor would be presented as a column. This example could be summarized
in the form of a table having three rows and two columns. This would require 3 × 2 = 6
cells. Therefore, six different treatment combinations would be produced, each
with a specific level of price and store size. The respondents would be randomly
selected and randomly assigned to the six cells.
Respondents in each cell receive a specific treatment combination. For example, respondents
in the upper left hand corner cell would face the low price level and a small store. Similarly,
the respondents in the lower right hand corner cell will be subjected to both high price level
and big store.
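The six treatment cells of this two-factor example, and a simple check for interaction, can be sketched in Python. The mean sales figures per cell are invented for illustration. The check merely compares cell means; a real study would test the interaction with a two-way analysis of variance.

```python
from itertools import product

price_levels = ["low", "medium", "high"]   # factor A
store_sizes = ["small", "big"]             # factor B

# Every combination of one price level and one store size is a treatment cell.
cells = list(product(price_levels, store_sizes))
print(len(cells), "cells:", cells)

# Made-up mean sales per cell, used only to illustrate an interaction check.
mean_sales = {
    ("low", "small"): 50, ("low", "big"): 70,
    ("medium", "small"): 45, ("medium", "big"): 60,
    ("high", "small"): 40, ("high", "big"): 42,
}

# Effect of store size (big minus small) at each price level. If these
# differences are not equal, price and store size interact.
size_effect = {
    p: mean_sales[(p, "big")] - mean_sales[(p, "small")] for p in price_levels
}
has_interaction = len(set(size_effect.values())) > 1
print(size_effect, "interaction present:", has_interaction)
```

In these invented figures the advantage of a big store shrinks as the price rises, so the joint effect of the two factors is not simply the sum of their individual effects.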
The main advantages of factorial design are:
It is possible to measure the main effects and the interaction effect of two or more
independent variables at various levels.
It allows saving time and effort because all observations are employed to study the
effects of each factor.
The conclusion reached using factorial design has broader applications as each
factor is studied with different combinations of other factors.
The limitation of this design is that the number of combinations (number of cells) increases
with increased number of factors and levels. However, a fractional factorial design could be
used if the interest is in studying only a few of the interactions or main effects.
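To see how quickly the number of cells grows, the short Python sketch below multiplies the number of levels of each factor. The third and fourth factors are hypothetical additions to the price and store size example.

```python
from math import prod

def number_of_cells(levels_per_factor):
    """Cells in a full factorial design = product of the levels of each factor."""
    return prod(levels_per_factor)

print(number_of_cells([3, 2]))        # price (3 levels) x store size (2 levels) -> 6
print(number_of_cells([3, 2, 4]))     # adding a hypothetical 4-level factor -> 24
print(number_of_cells([3, 2, 4, 3]))  # and a hypothetical 3-level factor -> 72
```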
Summary
Let us recapitulate the main points discussed in this unit:
1. Research design is the blueprint or the framework for carrying out the research
study.
2. The researcher has a number of designs available to him for investigating the
research objectives. Based upon the objective or the purpose of the study, research
design may be exploratory, descriptive or experimental.
3. Exploratory designs are loosely structured and investigative in nature.
4. In case the hypothesis formulated is descriptive in nature, the study design would
also be descriptive. The study involves collecting the who, what, why, where,
when and how about the population under study.
5. Descriptive studies can further be divided into cross-sectional and longitudinal
designs.
In case the study is conducted on a single part of the population, it is called single
cross-sectional and in case it is done on more than one segment it is called multiple
cross-sectional.
The second type of descriptive design is the longitudinal design. Here, a selected
sample is studied at different intervals (fixed) of time to measure the variable(s) under
study.
6. Experimental designs are conducted to infer causality. There are four types of
experimental designs – pre-experimental designs, quasi-experimental designs, true
experimental designs and statistical designs.
Structure
Introduction
Classification of Data
Secondary Data
Summary
Keywords
Introduction
In the last unit, we discussed research design and its various aspects. Once the
research design is in place, it is time to answer the research problem and hypotheses. But this
cannot be done unless one collects the relevant information necessary for arriving at any
suitable conclusion. The information thus collected is usually termed as data. The researcher
has a choice of a wide variety of methods to collect data. It has to be remembered that there
might be a lot of information available on the topic under study; however, you need to pick
only that information which is of direct relevance to the current problem under study.
Classification of Data
Primary data, as the name suggests, is original, problem- or project-specific, and collected
for the specific needs spelt out by the researcher. Its accuracy and relevance are reasonably
high. The time and money required for this are quite high and sometimes a researcher might
not have the resources or the time or both to go ahead with this method. In this case, the
researcher can look at alternative sources of data which are economical and reliable enough
to take the study forward. These include the second category of data sources, namely
secondary data.
Secondary data is that information which is not topical or research-specific and has been
collected and compiled by some other researcher or investigative body. This type of data is
recorded and published in a structured format, and thus, is quicker to access and manage.
Secondly, in most instances, unless it is a data product, it is not too expensive to collect. The
information required is readily available as a data product or as the audit information which
the researcher or the organization can get and use for arriving at quick decisions. In
comparison to the original research-centric data, secondary data can be economically and
quickly collected by the decision maker in a short span of time. However, one must
remember that it is a little low on accuracy since what is primary and original for one
researcher would essentially become secondary and historical for someone else. Table 4.1
gives a snapshot of the major differences between the two methods.
Secondary Data
We have already discussed what secondary data is. Let us see what its uses, types
and sources are.
Uses of Secondary Data
Secondary data can be used for multiple purposes:
Problem identification and formulation stage: Existing information on the topic under
study is useful to help develop the research question.
Hypotheses designing: Previous research studies done in the area could help in
hypothesizing about expected results.
31 | P a g e
Sampling considerations: There might be respondent related databases available to
seek respondent statistics and relevant contact details. These would help during
sampling for the study.
Primary base: The secondary information collected can be used to design the
primary data collection instruments, in order to phrase and design the right
questions.
Validation board: Earlier records and studies can also be used to support or
validate the information collected through primary sources.
Before we examine the wide range of secondary sources available to the business researcher,
it is essential that one is aware of the advantages and disadvantages of using secondary
sources.
Advantages and Disadvantages of Secondary Data
There are multiple advantages of using secondary data.
Accuracy and stability of data: Data from recognized sources has the additional
advantage of accuracy and reliability.
Assessment of data: It can be used to compare and support the primary research
findings of the present study.
However, there is need for caution as well, because using secondary data might
carry some disadvantages, such as:
Applicability of data: The information might not be directly suitable for our
study. Also since it is old data it might not be applicable today.
Accuracy of data: All data that is available might not be reliable and accurate.
Types and Sources of Secondary Data
Secondary data can be divided into internal and external sources. Internal, as the name
implies, is an organization- or environment-specific source and includes the historical
output and records available with the organization which might be the backdrop of the
study. The data that is independent of the organization and covers the larger
industry-scape would be available in the form of published material, computerized databases or
data compiled by syndicated services. Discussed below are four major sources of
data: internal, external, computer-stored data and syndicated databases.
1. Internal sources of data
Compilation of various kinds of information and data is mandatory for any organization that
exists.
Company records: This includes all the data about the inception, the owners, the
mission and vision statements, infrastructure and other details, including the
process and manufacturing (if any) and sales, as well as historical timeline of the
events.
Employee records: All details regarding the employees (regular and part-time)
would be part of employee records.
Sales data: This data can take on different forms:
1. Cash register receipt
2. Salespersons' call records: This is a document to be prepared and updated every
day by each individual salesperson.
3. Sales invoices: These record the customers who have placed an order with the
company, along with complete details including the size of the order, location, price by
unit, terms of sale and shipment details (if any).
4. Financial records and sales reports: Besides this, there are other published sources
like warranty records, CRM data and customer grievance data which are
extremely critical for evaluating the health of a product or an organization.
3. Computer-stored data
Information today is also available in the electronic form. The databases available to
the researcher can be classified based on the type of information or by the method of storage
and recovery as described here. Figure 4.3 gives a classification of the sources of
computerized data.
Observation is a direct method of collecting primary data. It is one of the most appropriate
methods to use in case of descriptive research. The method of observation involves viewing
and recording individuals, groups, organizations or events in a scientific manner in order to
collect valuable data related to the topic under study.
The mode of observation could be standardized or structured observation. Here, the nature of
content to be recorded and the format and the broad areas of recording are predetermined.
Thus, the observer's bias is reduced and the authenticity and reliability of the information
collected is higher. For example, Fisher-Price carries out an observational study whenever
it comes out with a new toy. The observer is supposed to record the appeal of the toy for a
child.
The opposite of this is called unstructured observation. Here, the observer is supposed to
make a note of whatever he understands as relevant for the research study. This kind of
approach is more useful in exploratory studies. Since it lacks structure, the chances of
observer's bias are high. An example of this is the observation of consumers at a bank, a
restaurant or a doctor's clinic.
However, it is critical here to understand that the researcher must have a preconceived plan to
capture the observations made. It is not to be treated as a blank sheet where the observer
reports what he sees. The aspects to be observed must be clearly listed as in an audit form, or
they could be indicative areas on which the observation is to be made.
Another way of distinguishing observations is the level of respondent awareness of being
observed. The observation might be disguised; here, it is done without the knowledge of the
respondent, who has no idea that he/she is being observed. This can also be done with devices
like a one-way mirror or a hidden camera or a recorder. The only disadvantage is that this is
ethically an intrusion of an individual's right to privacy. On the other hand, the knowledge
that the person is under observation can be conveyed to the respondent, and this is
undisguised observation. The decision to choose one over the other depends upon the nature
of the study.
The observation method can also be distinguished on the basis of the setting in which the
information is being collected. This could be natural observation, which as the name
suggests, is carried out in actual real life locations, for example the observation of how
employees interact with each other during lunch breaks. On the other hand, it could be an
artificial or simulated environment. This is actively done in the armed forces where stress
tests are carried out to measure an individual's tolerance level.
Human observation: As the name suggests, this technique involves observation and recording
done by human observers. The task of the observer is simple and predefined in case of a
structured observation study as the format and the areas to be observed and recorded are
clearly defined. In an unstructured observation, the observer records in a narrative form the
entire event that he has observed.
Focus group discussion (FGD) is a highly versatile and dynamic method of collecting
primary data from a representative group of respondents. The process generally involves a
moderator who steers the discussion on the topic under study. There is a group of carefully
selected respondents who are invited and gathered at a neutral setting. The moderator initiates
the discussion and then the group carries it forward by holding a focused and interactive
discussion.
Key elements of a focus group
Size: The ideal recommended size for a group discussion is eight to twelve members. Fewer
than eight would not generate all the possible perspectives on the topic and the group
dynamics required for a meaningful session, and more than twelve would make it
difficult to get any meaningful insight.
Nature: Individuals who are from a similar background—in terms of demographic and
psychographic traits—must be included; otherwise disagreement might emerge as a
result of other factors rather than the one under study. The other requirement is that
the respondents must be similar in terms of the subject/policy/product knowledge and
experience with the product under study. Moreover, the organizer of the focus group
discussion must ensure that the following criteria are taken care of:
Acquaintance: It has been found that knowing each other in a group discussion is
disruptive and hampers the free flow of discussion. It is recommended that the group
should consist of strangers rather than subjects who know each other.
Setting: The space or setting in which the discussion takes place should be as neutral,
informal and comfortable as possible. In case one-way mirrors or cameras are
installed, there is a need to ensure that these gadgets are not directly visible.
Time period: The discussion should be held in a single sitting unless there is a 'before'
and 'after' design, which requires recording group perceptions before the study variable is
introduced and gauging the group's reactions afterwards. The ideal duration of discussion
should not exceed an hour and a half. This is usually preceded by a short rapport
formation session between the moderator and the group members.
The recording: This is most often machine recording even though sometimes this may
be accompanied by human recording as well.
The moderator: The moderator is the one who manages the discussion. He might be a
participant in the group discussion or he might be a non-participant. He must be a
good listener and unbiased in his conduct of the discussions.
Steps for planning and conducting focus groups
The focus group must be conducted in a stepwise manner:
1. Clearly define and list the research objectives of the study that requires group
discussion.
2. A comprehensive moderator's structured outline for conducting the whole process
needs to be charted out.
3. After this, the actual focus group discussion is carried out.
4. The summary of the findings is clubbed under different heads as indicated in the
focus group objectives and reported in a narrative form. This may include expressions like
'majority of the participants were of the view' or 'there was a considerable disagreement on
this issue'.
Brand-obsessive group: These are special respondent sub-strata who are passionately
involved with a brand or product category (say, cars). They are selected, as they can
provide valuable insights that can be successfully
incorporated into the brand's marketing strategy.
Online focus group: This is a recent addition to the methodology and is extensively
used today. Here, the respondents log in at the designated time into a web-based chat
room. The discussion between the moderator and the participants is real-time.
Primary Data Collection: Personal Interview Method
A personal interview is a one-to-one interaction between the investigator/ interviewer and the
interviewee. The purpose of the dialogue is research specific and ranges from completely
unstructured to highly structured.
Uses of the interview method
The interview has varied applications in business research and can be used effectively at
various stages.
1. Problem definition: The interview method can be used right at the beginning of the
study. Here, the researcher uses the method to get clarity about the topic under study.
2. Exploratory research: Here, because the structure is loose, this method can be actively
used.
3. Primary data collection: There are situations when the method is used as a primary
method of data collection. This is generally the case when the area to be investigated
is high on emotional responses.
The interview process
The steps undertaken for organizing a personal interview are somewhat similar to those of a
focus group discussion.
Interview objective: The information needs that are to be addressed by the instrument
should be clearly spelt out as study objectives. This step includes a clear definition of
the construct/variable(s) to be studied.
Interview guidelines: A typical interview may take from 20 minutes to close to an
hour. A brief outline to be used by the investigator is formulated depending upon the
contours of the interview.
Structure: Based on the needs of the study, the actual interview may be
unstructured, semi-structured or structured.
1. Unstructured: This type of interview has no defined guidelines. It usually begins with
a casually worded opening remark like 'so tell us/me something about yourself'. The
direction the interview will take is not known even to the researcher. The probability
of subjectivity is very high.
2. Semi-structured: This has a more defined format and usually only the broad areas to
be investigated are formulated. The questions, sequence and language are left to the
investigator's choice. Probing is of critical importance in obtaining meaningful
responses and uncovering hidden issues. After asking the initial question, the
direction of the interview is determined by the respondent's initial reply, the
interviewer's probes for elaboration and the respondent's answers.
3. Structured: This format has the highest reliability and validity. There is considerable
structure to the questions and the questioning is also done based on a prescribed
sequence. Structured interviews are sometimes used as the primary data collection
instrument as well.
Interviewing skills: The quality of the output and the depth of information collected
depend upon the probing and listening skills of the interviewer. His attitude needs to
be as objective as possible.
Analysis and interpretation: The information collected is not subjected to any
statistical analysis. Mostly the data is in narrative form; in the case of structured
interviews, it might be summarized in prose form. Figure 4.4 presents a classification of
the types of personal interview.
Personal methods: These are the traditional one-to-one methods that have been used
actively in all branches of social sciences. However, they are distinguished in terms of
the location of interview.
At-home interviews: This face-to-face interaction takes place at the respondent's
residence. Thus, the interviewer needs to initially contact the respondent to ascertain
the interview time.
Mall-intercept interviews: As the name suggests, this method involves conducting
interviews with the respondents as they are shopping in malls. Sometimes, product
testing or product reactions can be carried out through structured methods and
followed by 20-30 minute interviews to test the reactions.
Computer-assisted personal interviewing (CAPI): This technique is carried out with
the help of the computer. In this form of interviewing, the respondent faces an
assigned computer terminal and answers a questionnaire on the computer screen by
using the keyboard or a mouse. A number of pre-designed packages are available to
help the researcher design simple questions that are self-explanatory and instead of
probing, the respondent is guided to a set of questions depending on the answer given.
An interviewer is usually present during the respondent's computer-assisted
interview and is available for help and guidance, if required.
Telephone method: The telephone method replaces the face-to-face interaction
between the interviewer and interviewee. This involves calling up the subjects and
asking them a set of questions. The advantage of the method is that geographic
boundaries are not a constraint and the interview can be conducted at the individual
respondent's location. The format and sequencing of the questions remains the same.
Traditional telephonic interviews: The process can be accomplished using the
traditional telephone for conducting the questioning.
Computer-assisted telephone interviewing: In this process, the telephonic interview is
conducted using a computerized interview format. The interviewer sits in front of a
computer terminal and wears a mini-headset in order to hear the respondent's answers.
However, unlike the traditional method, where he had to record the responses manually,
the responses are simultaneously recorded on the computer.
Since the interview requires a one-to-one dialogue to be carried out, it is more cumbersome
and costly as compared to a focus group discussion. Also, conducting an interview requires
considerable skill on the part of the interviewer, and thus adequate training in interviewing
skills is needed for capturing comprehensive study-related data.
Summary
The researcher has access to two major sources of data: primary (original) sources
and secondary sources.
The focus group discussion is a cost-effective method and can ideally be done on a
small group of respondents to obtain meaningful data.
Interview method involves a dialogue between the interviewee and the interviewer.
This can range from unstructured to completely structured. Today the interviewer can
make use of the telephone as well as computer to assist him in conducting the
interview.
Introduction
In the previous unit, we studied the various types, sources and methods of collecting
data. In this unit, we will focus on the different types of measurements and statistical
techniques that are applicable for the same. The various formats of a rating scale and the
construction of an attitude measurement scale, along with the description of the distinct
criteria involved in analysing a good measurement scale are elaborated in this Unit.
The term 'measurement' means assigning numbers or some other symbols to the
characteristics of certain objects. When numbers are used, the researcher must have a rule for
assigning a number to an observation in a way that provides an accurate description. We do
not measure the object but some characteristics of it. Therefore, in research,
people/consumers are not measured; only their perceptions, attitude or any other relevant
characteristics are measured. There are two reasons for which numbers are usually assigned:
Firstly, numbers permit statistical analysis of the resulting data
Secondly, they facilitate the communication of measurement results.
Scaling is an extension of measurement. Scaling involves creating a continuum on
which measurements on objects are located. Suppose you want to measure the level of
satisfaction with Kingfisher Airlines and a scale of 1 to 11 is used for the said purpose.
This scale indicates the degree of satisfaction, with 1 = extremely dissatisfied and
11 = extremely satisfied.
In this Unit, you will also study the sources of measurement errors and the
criteria for evaluating measurements.
Types of Measurement Scales
There are four types of measurement scales—nominal, ordinal, interval and ratio. We
will discuss each one of them in detail. The choice of the measurement scale has implications
for the statistical technique to be used for data analysis.
Nominal scale: This is the lowest level of measurement. Here, numbers are assigned for the
purpose of identification of the objects. Any object which is assigned a higher number is in
no way superior to the one which is assigned a lower number. Each number is assigned to
only one object and each object has only one number assigned to it. It may be noted that the
objects are divided into mutually exclusive and collectively exhaustive categories.
Example:
What is your religion?
a) Hinduism b) Sikhism
c) Christianity d) Islam
e) Any other (please specify)
A Hindu may be assigned number 1, a Sikh may be assigned number
2, and a Christian may be assigned number 3, and so on. Any religion which is assigned a
higher number is in no way superior to the one which is assigned a lower number. The
assignment of numbers is only for the purpose of identification.
The assigned numbers cannot be added, subtracted, multiplied or divided. The only
arithmetic operation that can be carried out is a count of each category. Therefore, a
frequency distribution table can be prepared for the nominal scale variables and mode of
distribution can be worked out. One can also use chi-square test and compute contingency
coefficient using nominal scale variables.
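The statistics named above for nominal data can be sketched in a few lines of Python. This is only an illustration: the response codes, the cross-tabulation and the helper names (frequency_table, mode_of, chi_square, contingency_coefficient) are assumptions made for the example and do not come from the text.

```python
from collections import Counter

# Hypothetical coded answers to the religion question:
# 1 = Hinduism, 2 = Sikhism, 3 = Christianity, 4 = Islam
responses = [1, 2, 1, 3, 1, 4, 2, 1, 3, 1]


def frequency_table(codes):
    """Count how many times each category code occurs."""
    return dict(sorted(Counter(codes).items()))


def mode_of(codes):
    """Return the most frequent category code."""
    return Counter(codes).most_common(1)[0][0]


def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat


def contingency_coefficient(observed):
    """Contingency coefficient C = sqrt(chi2 / (chi2 + n))."""
    chi2 = chi_square(observed)
    n = sum(sum(row) for row in observed)
    return (chi2 / (chi2 + n)) ** 0.5


# Hypothetical 2 x 2 cross-tabulation of two nominal variables
table = [[10, 20], [20, 10]]
print(frequency_table(responses))
print(mode_of(responses))
print(chi_square(table), contingency_coefficient(table))
```

Note that the codes are only counted; they are never added or averaged, which is consistent with the restriction on arithmetic for nominal data.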
Ordinal scale: This is the next higher level of measurement than the nominal scale
measurement. One of the limitations of the nominal scale measurements is that we cannot say
whether the assigned number to an object is higher or lower than the one assigned to another
option.
The ordinal scale measurement takes care of this limitation. An ordinal scale
measurement tells whether an object has more or less of a characteristic than some other
object. However, it cannot answer how much more or how much less.
Example:
Rank the following attributes while choosing a restaurant for dinner. The most important
attribute may be ranked as 1, the next important may be assigned a rank of 2, and so on.
In the ordinal scale, the assigned ranks cannot be added, multiplied, subtracted or
divided. One can compute median, percentiles and quartiles of the distribution. The other
major statistical analyses which can be carried out are the rank order correlation
coefficient and the sign test. All the statistical techniques which are applicable in the case of nominal scale
measurement can also be used for the ordinal scale measurement. However, the reverse is not
true. This is because ordinal scale data can be converted into nominal scale data but not the
other way round.
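As a rough illustration of the ordinal statistics mentioned above, the sketch below computes a median rank and a rank order (Spearman) correlation coefficient. The rankings are hypothetical, and the formula used assumes that there are no tied ranks.

```python
from statistics import median


def spearman_rho(ranks_x, ranks_y):
    """Rank order correlation for two rankings of the same items (no ties)."""
    n = len(ranks_x)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical ranks given by two respondents to five restaurant attributes
respondent_a = [1, 2, 3, 4, 5]
respondent_b = [2, 1, 3, 5, 4]

# Hypothetical ranks given to one attribute by five different respondents
ranks_for_one_attribute = [1, 2, 2, 3, 5]

print(spearman_rho(respondent_a, respondent_b))
print(median(ranks_for_one_attribute))
```

The median uses only the order of the ranks, so it remains valid for ordinal data, whereas an arithmetic mean of ranks would not be.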
Interval scale: The interval scale measurement is the next higher level of measurement. It
takes care of the limitation of the ordinal scale measurement where the difference between
the score on the ordinal scale does not have any meaningful interpretation. In the interval
scale, the difference in the score on the scale has meaningful interpretations. It is assumed
that the respondent is able to answer the questions on a continuum scale. The mathematical
form of the data on the interval scale may be written as,
Y = a + bX, where a ≠ 0
In the interval scale, the difference in score has a meaningful interpretation while the ratio of
the score on this scale does not have a meaningful interpretation. This can be seen from the
following interval scale question:
How likely are you to buy a new designer carpet in the next six months?
Suppose a respondent ticks the response category 'likely' and another respondent ticks the
category 'unlikely'. If we use any of the scales A, B or C, we note that the difference between
the scores in each case is 2, whereas when the ratio of the scores is taken, it is 2, 3
and –1 for the scales A, B and C, respectively. Therefore, the ratio of the scores on the scale
does not have a meaningful interpretation. The following are some examples of interval scale
data:
The numbers on this scale can be added, subtracted, multiplied or divided. One can compute
arithmetic mean, standard deviation, correlation coefficient, and conduct a t-test, Z-test,
regression analysis and factor analysis. As the interval scale data can be converted into the
ordinal and the nominal scale data, all the techniques applicable for the ordinal and the
nominal scale data can also be used for interval scale data.
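The designer carpet example can be checked with a short sketch. The three codings below are assumptions, chosen only because they reproduce the differences and ratios quoted for scales A, B and C; the actual scales are not reproduced in this text.

```python
# Assumed codings of the same five-category likelihood question
scales = {
    "A": {"unlikely": 2, "likely": 4},   # categories coded 1 to 5
    "B": {"unlikely": 1, "likely": 3},   # categories coded 0 to 4
    "C": {"unlikely": -1, "likely": 1},  # categories coded -2 to +2
}


def difference_and_ratio(coding):
    """Return (likely - unlikely, likely / unlikely) for one coding."""
    diff = coding["likely"] - coding["unlikely"]
    ratio = coding["likely"] / coding["unlikely"]
    return diff, ratio


results = {name: difference_and_ratio(c) for name, c in scales.items()}
for name, (diff, ratio) in results.items():
    print(name, diff, ratio)
```

The difference stays at 2 whichever coding is used, while the ratio changes with the arbitrary origin, which is why only differences are interpreted on an interval scale.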
Ratio scale: This is the highest level of measurement and takes care of the limitations of the
interval scale measurement, where the ratio of the measurements on the scale does not have a
meaningful interpretation. The ratio scale measurement can be converted into interval,
ordinal and nominal scale. But the other way round is not possible. The mathematical form of
the ratio scale data is given by Y = b X. In this case, there is a natural zero (origin), whereas
in the interval scale, we had an arbitrary zero. Examples of the ratio scale data are weight,
distance travelled, income and sales of a company, to mention a few.
All the mathematical operations can be carried out using the ratio scale data. In addition to
the statistical analysis mentioned in the interval, ordinal and nominal scale data, one can
compute the coefficient of variation, geometric mean, and harmonic mean using the ratio
scale measurement.
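A minimal sketch of the additional ratio scale statistics, using Python's standard statistics module. The sales figures are hypothetical, and the coefficient of variation is computed here with the population standard deviation, which is one of several conventions.

```python
from statistics import geometric_mean, harmonic_mean, mean, pstdev

# Hypothetical monthly sales of a company (ratio scale: a natural zero exists)
monthly_sales = [120.0, 150.0, 180.0, 150.0]


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    return pstdev(values) / mean(values) * 100


print(coefficient_of_variation(monthly_sales))
print(geometric_mean(monthly_sales))
print(harmonic_mean(monthly_sales))
```

These measures rely on ratios of the values, so they are meaningful only when the scale has a natural zero.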
Attitude
Attitude is viewed as an enduring disposition to respond consistently in a given
manner to various aspects of the world, including persons, events and objects. A company is
able to sell its products or services when its customers have a favourable attitude
towards them. An attitude has three components:
Cognitive component: This component represents an individual's information and
knowledge about an object. It includes awareness of the existence of the object, beliefs about
the characteristics or attributes of the object and judgement about the relative importance of
each of the attributes. In a survey, if the respondents are asked to name the companies
manufacturing plastic products, some respondents may remember names like Tupperware,
Modicare and Pearl Pet. This is called unaided recall awareness. More names are likely to be
remembered when the investigator makes a mention of them. This is aided recall. The
examples of beliefs or judgements could be that the products of Tupperware are of high
quality, non-toxic and can be used in parties; a mutton dish can be cooked in a pressure
cooker in less than thirty minutes, and so on.
Affective component: The affective component summarizes a person's overall feeling or
emotions towards the objects. Examples of this component could be: the food cooked in a
pressure cooker is tasty, taste of orange juice is good or the taste of bitter gourd is very bad.
Intention or action component: This component of an attitude, also called the behavioural
component, reflects predisposition to an action by reflecting the consumer's buying or
purchase intention. It also reflects a person's expectations of future behaviour towards an
object.
There is a relationship between attitude and behaviour. If a consumer does not have a
favourable attitude towards a product, he/she will certainly not buy the product. However,
having a favourable attitude does not mean that it would be reflected in the purchase
behaviour. This is because the intention to buy a product has to be backed by the purchasing
power of the consumer. Therefore, a favourable attitude is a necessary condition for the
purchase of the product but it is not a sufficient condition. This relationship could hold
true at the aggregate level but not at the individual level.
Classification of Scales
One of the ways of classifying scales is based on the number of items in the scale. Based
upon this, the following classification may be proposed:
Single Item vs. Multiple Item Scale
Single item scale: In the single item scale, there is only one item to measure a given
construct. For example:
Consider the following question:
How satisfied are you with your current job?
Very Dissatisfied
Dissatisfied
Neutral
Satisfied
Very satisfied
The problem with the above question is that there are several aspects to a job, like pay, work
environment, rules and regulations, job security and communication with the seniors. The
respondent may be satisfied on some of the factors but may not be on others. By asking a
question as stated above, it will be difficult to analyse the problematic areas. To overcome
this problem, a multiple item scale is proposed.
Multiple item scale: In multiple item scale, there are many items that play an important role
in forming the underlying construct that the researcher is trying to measure. This is because
each item forms some part of the construct (satisfaction) which the researcher is trying to
measure. As an example, some of the following questions may be asked in a multiple item
scale:
How satisfied are you with the pay you are getting on your current job?
Very Dissatisfied
Dissatisfied
Neutral
Satisfied
Very satisfied
How satisfied are you with the rules and regulations of your organization?
Very Dissatisfied
Dissatisfied
Neutral
Satisfied
Very satisfied
Comparative Scales vs. Non-Comparative Scales
Comparative scales
In comparative scales, it is assumed that respondents make use of a standard frame of
reference before answering a question. For example:
A question like 'How do you rate Barista in comparison to Cafe Coffee Day on the basis of
quality of beverages?' is an example of the comparative rating scale. It involves direct
comparison of stimulus objects.
Example:
Please rate Domino's in comparison to Pizza Hut on the basis of your satisfaction level on an
11-point scale, based on the following parameters: 1 = Extremely poor, 6 = Average, 11 =
Extremely good. Circle your response:
Comparative scale data is generally interpreted in relative terms. Described below are the
types of comparative rating scales:
(i) Paired comparison scales: Here a respondent is presented with two objects and is
asked to select one according to whatever criterion he or she wants to use. The resulting data
from this scale is ordinal in nature. For example, suppose a parent wants to offer one of the
four items to a child: chocolate, burger, ice cream and pizza. The child is asked to choose
one out of the two from six possible pairs, i.e., chocolate or burger, chocolate or ice cream,
chocolate or pizza, burger or ice cream, burger or pizza and ice cream or pizza. In general, if
there are n items, the number of paired comparisons would be n(n – 1)/2. Paired comparison
technique is useful when the number of items is limited because it requires a direct
comparison and overt choice.
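The pair count in the example above can be verified with a short sketch that lists every pair and compares the total with n(n – 1)/2. The function names are illustrative only.

```python
from itertools import combinations

items = ["chocolate", "burger", "ice cream", "pizza"]


def paired_comparisons(objects):
    """List every unordered pair that would be shown to the respondent."""
    return list(combinations(objects, 2))


def number_of_pairs(n):
    """Number of paired comparisons for n items: n(n - 1)/2."""
    return n * (n - 1) // 2


pairs = paired_comparisons(items)
for first, second in pairs:
    print(f"{first} or {second}")
print(len(pairs), number_of_pairs(len(items)))
```

With ten items the count already rises to 45 pairs, which shows why the technique is practical only when the number of items is limited.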
(ii) Rank order scaling: In rank order scaling, respondents are presented with
several objects simultaneously and asked to order or rank them according to some criterion.
Consider, for example, the following question:
Rank the following soft drinks in order of your preference. The most preferred soft drink
should be ranked one, the second most preferred should be ranked two, and so on.
Like paired comparison, this approach is also comparative in nature. The problem with this
scale is that if a respondent does not like any of the above-mentioned soft drinks and is
forced to rank them in the order of his choice, then the soft drink which is ranked one should
be treated as the least disliked soft drink, and similarly, the other rankings can be interpreted.
The rank order scaling results in the ordinal data.
(iii) Constant sum rating scaling: In constant sum rating scale, the respondents are asked to
allocate a total of 100 points between various objects and brands. The respondent distributes
the points to the various objects in the order of his preference.
Consider the following example:
Allocate a total of 100 points among the various schools into which you would like to admit
your child. The points should be allocated in such a way that the sum total of the points
allocated to various schools adds up to 100.
Suppose Mother's International is awarded 30 points, whereas Laxman Public School is
awarded 15 points. One can make a statement that the respondent rates Mother's International
twice as high as Laxman Public School. This type of data is not only comparative in nature
but could also result in ratio scale measurement.
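A small sketch of how a constant sum response might be checked and interpreted. The points for Mother's International and Laxman Public School follow the example above; the other two schools and their points are hypothetical fillers added so that the allocation totals 100.

```python
# Points for the first two schools follow the text; "School C" and "School D"
# are hypothetical entries that bring the total to 100.
allocation = {
    "Mother's International": 30,
    "Laxman Public School": 15,
    "School C": 35,
    "School D": 20,
}


def is_valid_constant_sum(points, total=100):
    """A response is usable only if the points add up to the fixed total."""
    return sum(points.values()) == total


def preference_ratio(points, first, second):
    """How many times as high the first object is rated compared with the second."""
    return points[first] / points[second]


print(is_valid_constant_sum(allocation))
print(preference_ratio(allocation, "Mother's International", "Laxman Public School"))
```

The validity check matters in practice because a response whose points do not add up to the fixed total cannot be compared with other responses.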
(iv) Q-sort technique: This technique makes use of the rank order procedure in which
objects are sorted into different piles based on their similarity with respect to certain criterion.
Suppose there are 100 statements and an individual is asked to pile them into five groups in
such a way that the strongly agreed statements could be put in one pile, agreed statements
could be put in another pile, neutral statements form the third pile, disagreed statements come
in the fourth pile, strongly disagreed statements form the fifth pile, and so on. The data
generated in this way would be ordinal in nature. The distribution of the number of statements
in each pile should be such that the resulting data may follow a normal distribution.
Non-comparative scales
In non-comparative scales, respondents do not make use of any frame of reference before
answering the questions. The resulting data is generally assumed to be interval or ratio scale.
Non-comparative scales are divided into two categories, namely, the graphic rating scales and
the itemized rating scales. A useful and widely used itemized rating scale is the Likert scale.
To measure the preference of an individual towards fast food, one has to measure the distance
from the extreme left to the position where a tick mark has been put. The greater the distance,
the higher would be the individual's preference for fast food. This scale suffers from two
limitations. First, if a respondent has put a tick mark at a particular position and, after ten
minutes, is given another form to put a tick mark, it will be virtually impossible to
put a tick at the same position as before. Does it mean that the respondent's preference for
fast food has undergone a change in ten minutes?
Second, the basic assumption of this scale is that the respondents can distinguish the fine shade of
difference between preference/attitude, which need not be the case. Further, the coding,
editing and tabulation of data generated through such a procedure is a very tedious task and
researchers try to avoid using it.
Number of categories to be used: There is no hard and fast rule as to how many categories
should be used in an itemized rating scale. However, it is standard practice to use five or six
categories. Some researchers are of the opinion that more than five categories should be used
in situations where small changes in attitudes are to be measured. There are others that argue
that the respondents would find it difficult to distinguish between more than five categories.
Odd or even number of categories: It has been a matter of debate among the researchers as
to whether odd or even number of categories is to be used. By using even number of
categories, the scale would not have a neutral category and the respondent will be forced to
choose either the positive or the negative side of the attitude. If odd numbers of categories are
used, the respondent has the freedom to be neutral.
Balanced versus unbalanced scales: A balanced scale has an equal number of favourable
and unfavourable categories. The following is an example of a balanced scale:
How important is price to you in buying a new car?
– Very important
– Relatively important
– Neither important nor unimportant
– Relatively unimportant
– Very unimportant
In this question, there are five response categories, two of which emphasize the importance of price
and two others that do not show its importance. The middle category is neutral.
In an unbalanced scale, by contrast, the response categories are skewed towards one side.
For example, a question with four response categories skewed towards the importance
given to the price and only one category on the unimportant side is an unbalanced question.
Nature and degree of verbal description: Many researchers believe that each category
must have a verbal, numerical or pictorial description. Verbal description should be clearly
and precisely worded so that the respondents are able to differentiate between them. Further,
the researcher must decide whether to label every scale category, some scale categories or
only extreme scale categories.
Forced versus non-forced scales: In the forced scale, the respondent is forced to take a stand,
whereas in the non-forced scale, the respondent can be neutral if he/she so desires. The
argument for a forced scale is that those who are reluctant to reveal their attitude are
encouraged to do so with the forced scale. Paired comparison scale, rank order scale and
constant sum rating scales are examples of forced scales.
Physical form: There are many options that are available for the presentation of the scales. It
could be presented vertically or horizontally. The categories could be expressed in boxes,
discrete lines or as units on a continuum. They may or may not have numbers assigned to
them. The numerical values, if used, may be positive, negative or both.
Suppose we want to measure the perception about Jet Airways using a multi-item scale.
One of the questions is about the behaviour of the crew members. Given below is
a set of scale configurations that may be used to measure their behaviour:
Below, we will describe Likert scale, which is very commonly used in survey research.
Likert scale: This is a multiple item agree–disagree five-point scale. The respondents are
given a certain number of items (statements) on which they are asked to express their degree
of agreement/disagreement. This is also called a summative scale because the scores on
individual items can be added together to produce a total score for the respondent. An
assumption of the Likert scale is that each of the items (statements) measures some aspect of
a single common factor; otherwise the scores on the items cannot legitimately be summed up.
In a typical research study, there are generally twenty-five to thirty items on a Likert scale.
It may be noted that only anchor labels and no numerical values are assigned to the response
categories. Once the scale is administered, numerical values are assigned to the response
categories. The scale contains statements, some of which are favourable to the construct we
are trying to measure and some are unfavourable to it.
For example, out of the ten statements given, statements numbering 1, 2, 4, 6 and 9 in Table
5.1 are favourable statements, whereas the remaining are unfavourable statements. The
reason for having a mixture of favourable and unfavourable statements in a Likert scale is
that the responses by the respondent should not become monotonous while answering the
questions. Generally, in a Likert scale, there is an approximately equal number of favourable
and unfavourable statements. Once the scale is administered, numerical values are assigned to
the responses. The rule is that a 'strongly agree' response for a favourable statement should
get the same numerical value as the 'strongly disagree' response of the unfavourable
statement.
Suppose for a favourable statement, the numbering is done as: Strongly disagree = 1;
Disagree = 2; Neither agree nor disagree = 3; Agree = 4; and Strongly agree = 5.
Accordingly, an unfavourable statement would get the numerical values as: Strongly disagree
= 5; Disagree = 4; Neither agree nor disagree = 3; Agree = 2; and Strongly agree = 1. In order
to measure the image that the respondent has about the company, the scores are added.
For example, if a respondent has ticked (✓) statements numbering from one to ten as shown in
Table 5.1, his total score would be 3 + 5 + 4 + 4 + 5 + 4 + 4 + 5 +
4 + 4 = 42 out of 50. Now if there are 100 respondents and 100 statements, the score on the
image of the company can be worked out for each respondent by adding his/her scores on the
100 statements. The minimum score for each respondent will be 100, whereas the maximum
score would be 500.
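The scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the function names and the respondent's answers are hypothetical (the answers are chosen so that the item scores match the ones quoted above), while the 1 to 5 coding and the set of favourable statements (1, 2, 4, 6 and 9) are taken from the text.

```python
# Numerical values for a favourable statement, as given in the text.
FAVOURABLE_CODES = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither agree nor disagree": 3,
    "agree": 4,
    "strongly agree": 5,
}

def score_item(response, favourable):
    """Return the 1-5 score; unfavourable statements are reverse-coded."""
    value = FAVOURABLE_CODES[response]
    return value if favourable else 6 - value

def likert_total(responses, favourable_items):
    """Sum the item scores. `responses` maps statement number -> response label."""
    return sum(
        score_item(answer, number in favourable_items)
        for number, answer in responses.items()
    )

# Statements 1, 2, 4, 6 and 9 are favourable; the rest are unfavourable.
favourable_items = {1, 2, 4, 6, 9}

# Hypothetical answers of one respondent (assumed for illustration).
responses = {
    1: "neither agree nor disagree",
    2: "strongly agree",
    3: "disagree",
    4: "agree",
    5: "strongly disagree",
    6: "agree",
    7: "disagree",
    8: "strongly disagree",
    9: "agree",
    10: "disagree",
}

total = likert_total(responses, favourable_items)
print(total, "out of", 5 * len(responses))
```

With n statements, the total always lies between n (all items scored 1) and 5n (all items scored 5), which is why 100 statements give a range of 100 to 500.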
As mentioned earlier, a typical Likert scale comprises about 25–30 statements. In order to
select twenty-five statements from the 100 statements, we need to discard some of them. The
rule behind discarding the statements is that those items that are non-discriminating should be
removed.
As mentioned earlier, the score for each of the respondents on each of the
statements can be used to measure his/her total score about the image of the company.
Measurement Error
Measurement error occurs when the observed measurement on a construct or concept
deviates from its true value. The following are some of the reasons for measurement errors:
There are factors like mood, fatigue and health of the respondent which may influence the
observed response while the instrument is being administered.
The variations in the environment in which measurements are taken may also result in a
departure from the true value.
At times, the errors may be committed at the time of coding, on entering data from
questionnaire to the spreadsheet on the computer, and at the tabulation stage.
The observed measurement in any research need not be equal to the true measurement. The
observed measurement can be written as,
O = T + S + R
where
O = Observed measurement
T = True score
S = Systematic error
R = Random error
It may be noted that the errors consist of two components—systematic error and random
error. Systematic error causes a constant bias in the measurement. Suppose there is a
weighing scale that weighs 50 gm less for every one kg of product being weighed. The error
would consistently remain the same irrespective of the kind of product and the time at which
the product is weighed. Random error, on the other hand, involves influences that bias the
measurements but are not systematic. Suppose we use different weighing scales to weigh 1
kg of a product, and if systematic error is assumed to be absent, we may find that recorded
weights may fall within a range around the true value of the weight, thereby causing random
error.
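The decomposition O = T + S + R can be illustrated with a small simulation. This is a sketch under stated assumptions: the function `observe`, the normally distributed random error and its standard deviation of 5 gm are choices made for illustration; only the 50 gm shortfall per kg comes from the weighing-scale example above.

```python
import random
import statistics

def observe(true_value, systematic_error, random_sd, rng):
    """Return one observed measurement: O = T + S + R."""
    # Assumption: random error is drawn from a normal distribution with mean 0.
    random_error = rng.gauss(0, random_sd)
    return true_value + systematic_error + random_error

rng = random.Random(42)   # fixed seed so the run is repeatable
true_weight = 1000.0      # 1 kg expressed in grams

# A scale that always reads 50 gm less and has no random error.
biased = [observe(true_weight, -50.0, 0.0, rng) for _ in range(5)]

# Scales with no systematic error but some random error.
noisy = [observe(true_weight, 0.0, 5.0, rng) for _ in range(1000)]

print(biased)                  # every reading is off by the same amount
print(statistics.mean(noisy))  # readings scatter around the true value
print(min(noisy), max(noisy))
```

Averaging many readings brings the noisy measurements close to the true value, whereas no amount of averaging removes the constant bias of the first scale.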
1. Reliability
Reliability is concerned with consistency, accuracy and predictability of the scale. It refers to
the extent to which a measurement process is free from random errors. The reliability of a
scale can be measured using the following methods:
Test–retest reliability: In this method, repeated measurements of the same person or group
using the same scale under similar conditions are taken. A very high correlation between the
two scores indicates that the scale is reliable. The researcher has to be careful in deciding the
time difference between two observations. If the time difference between two observations is
very small, it is very likely that the respondent would give the same answer, which would result
in higher correlation. Further, if the difference is too large, the attitude might have undergone
a change during that period, resulting in a weak correlation, and hence poor reliability.
Therefore, the researcher has to be very careful in deciding the time difference between the
observations. Generally, a time difference of about five to six months is considered an ideal
period.
Split-half reliability method: This method is used in the case of multiple item scales. Here,
the number of items is randomly divided into two parts and a correlation coefficient between
the two is obtained. A high correlation indicates high internal consistency of the scale, and
hence greater reliability.
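Both reliability checks described above reduce to computing a correlation coefficient. The following is a minimal sketch, assuming the Pearson correlation is the coefficient intended; the function names and all the scores are hypothetical and exist only for illustration.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: this fails with ZeroDivisionError if either list has no variation.
    return cov / sqrt(var_x * var_y)

def split_half_r(item_scores, seed=0):
    """Randomly split the items into two halves and correlate the half totals.

    `item_scores` holds one list of item scores per respondent.
    """
    n_items = len(item_scores[0])
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    half_a, half_b = order[: n_items // 2], order[n_items // 2:]
    totals_a = [sum(row[i] for i in half_a) for row in item_scores]
    totals_b = [sum(row[i] for i in half_b) for row in item_scores]
    return pearson_r(totals_a, totals_b)

# Test-retest: total scores of six respondents on two occasions (made-up data).
first_round = [42, 35, 28, 47, 31, 39]
second_round = [40, 36, 30, 45, 29, 41]
test_retest_r = pearson_r(first_round, second_round)

# Split-half: six respondents answering a four-item scale (made-up data).
item_scores = [
    [5, 4, 5, 4],
    [2, 1, 2, 2],
    [4, 4, 3, 4],
    [1, 2, 1, 1],
    [3, 3, 4, 3],
    [5, 5, 4, 5],
]
split_r = split_half_r(item_scores)

print(round(test_retest_r, 2), round(split_r, 2))
```

A different random split of the items generally gives a slightly different coefficient, so the seed is fixed here only to make the run repeatable.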
2. Validity
The validity of a scale refers to the question whether we are measuring what we want to
measure. Validity of the scale refers to the extent to which the measurement process is free
from both systematic and random errors. The validity of a scale is a more serious issue than
reliability. There are different ways to measure validity.
Content validity: This is also called face validity. It involves subjective judgement by an
expert for assessing the appropriateness of the construct. For example, to measure the
perception of a customer towards Kingfisher Airlines, a multiple item scale is developed. A
set of fifteen items is proposed. These items when combined in an index measure the
perception of Kingfisher Airlines. In order to judge the content validity of these fifteen items,
a set of experts may be requested to examine the representativeness of the fifteen items. The
items covered may be lacking in the content validity if we have omitted the behaviour of the
crew, food quality, food quantity, etc., from the list.
In fact, conducting the exploratory research to exhaust the list of items measuring perception
of the airline would be of immense help in such a case.
Predictive validity: This involves the ability of a measured phenomenon at one point of time
to predict another phenomenon at some point in the future. If the correlation coefficient
between the two is high, the initial measure is said to have a high predictive ability. As an
example, consider the use of CAT (common admission test) to shortlist candidates for
admission to the MBA (Master of Business Administration) programme in a business
school. The CAT scores are supposed to predict the candidate's aptitude for studies towards
business education.
3. Sensitivity
Sensitivity refers to an instrument's ability to accurately measure the variability in a concept.
A dichotomous response category, such as agree or disagree, does not allow the recording of
any attitude changes. A more sensitive measure with numerous categories on the scale may
be required. For example, adding 'strongly agree', 'agree', 'neither agree nor disagree',
'disagree' and 'strongly disagree' categories will increase the sensitivity of the scale.
The sensitivity of scale based on a single question or a single item can be increased by adding
questions or items. In other words, because composite measures allow for a greater range of
possible scores, they are more sensitive than a single-item scale.
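The point about the range of possible scores can be made concrete with a short calculation. This sketch assumes each item is coded 1, 2, ..., up to the number of categories and that item scores are summed; the function name is hypothetical.

```python
def possible_scores(n_items, n_categories):
    """Count the distinct totals when each item is coded 1..n_categories."""
    lowest = n_items * 1
    highest = n_items * n_categories
    return highest - lowest + 1

print(possible_scores(1, 2))    # single agree/disagree item
print(possible_scores(1, 5))    # single five-category item
print(possible_scores(10, 5))   # ten five-category items summed
```

The more distinct totals a scale can produce, the finer the differences in attitude it is able to record.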
Summary
Let us recapitulate the main points discussed in the unit:
Measurement means the assignment of numbers or other symbols to the characteristics of
certain objects. Scaling is an extension of measurement.
Scaling involves creating a continuum on which measurements on the objects are located.
There are four types of measurement scales: nominal, ordinal, interval and ratio scale.
Attitude is the predisposition of an individual to evaluate some objects or
symbols. It has three components: cognitive, affective and intention or action component.
Scales can be classified as single-item and multiple-item scales. Another
classification could be whether the scales are comparative or non-comparative in nature.
The observed measurement need not be equal to the true value of the
measurement. Some systematic and random errors may be found in the observed
measurement.
There are three criteria for determining the accuracy of a
measurement—reliability, validity and sensitivity.
Introduction
The Questionnaire Method
Types of Questionnaire
Process of Questionnaire Designing
Advantages and Disadvantages of the Questionnaire Method
Summary
Keywords
Introduction
In the last unit, we discussed some methods of primary data collection, like observation,
focus group discussion and interviews. However, a discussion on data collection would be
incomplete if one did not talk about the questionnaire method. This is the most cost-effective
and widely used method, apart from being extremely user-friendly. The questionnaire method
is flexible enough to reveal data that is in the respondents' own words and language. It can be
made extremely scientific by framing questions which enable a very advanced level of
quantitative measurement and analysis. The pattern of questioning is always designed
keeping in mind the respondent's comfort and ease of answering. Today, with the wide use of
technology it is very easy to use the questionnaire method even without being present
physically in front of the respondent.
Even though all of us have filled a questionnaire at some time or the other and know what it
must include, designing a well structured and study specific questionnaire requires a
structured and logical path so that the effort of collecting information using the questionnaire
is meaningful. In this unit, you will learn about the various aspects of the questionnaire
method in detail. The entire process of questionnaire designing will be discussed at length,
with special reference to the different kinds of questionnaires available to the researcher.
There are certain criteria that must be kept in mind while designing the questionnaire. The
first and foremost requirement is that the spelt-out research objectives must be converted
into clear questions which will extract answers from the respondent. This is not as easy as it
sounds. For example, if one is asked, 'How many times did your teacher praise you in the
week?', it is very difficult to give an exact number. The second requirement is that the
questionnaire should be designed to engage the respondent and encourage a meaningful
response. For example, a questionnaire measuring stress cannot have a voluminous set of
questions which fatigue the subject. The questions, thus, should encourage response and be
easy to understand. Lastly, the questions should be self-explanatory and not confusing;
otherwise, the person will answer the way he understood the question and not in terms of
what was asked. This will be discussed in detail later, when we discuss the wording of the
questions.
Types of Questionnaires
There are many different types of questionnaires available to the researcher. The categorization can
be done on the basis of a variety of parameters. The two criteria that are most frequently used for
designing purposes are the degree of structure and the degree of concealment. Structure refers to
the degree to which the response category has been defined. Concealment refers
to the degree to which the purpose of the study is explained to the respondent. Instead of
considering them as individual types, most research studies use a mixed format. Thus, they will be
discussed here as a two-by-two matrix.
Let us discuss the types of questionnaires. Questionnaires can be categorized on the basis of their
structure or method of administration. Structure and concealment together give the following four
categories:
Formalized and unconcealed questionnaire: This is the one that is the most frequently used by all
management researchers. For example, if a new brokerage firm wants to understand the investment
behaviour of people, they would structure the questions and answers as follows:
2. Out of the following options, where do you invest? (Tick all that apply.)
Precious metals
Real estate
Stocks
Government instruments
Mutual funds
Any other
This kind of structured questionnaire is easy to administer, and has self-explanatory questions and
clearly defined answer categories.
Formalized and concealed questionnaire: These questionnaires have a formal method of questioning; however, the purpose is not clear to the
respondent. The research studies which are trying to find out the latent causes of behaviour and
cannot rely on direct questions use these. For example, young people cannot be asked direct
questions on whether they are likely to be indulging in corruption at work. Thus, the respondent has
to be given a set of questions that can give an indication of what are his basic values, opinions and
beliefs, as these would influence how he would react to issues.
Non-formalized and unconcealed: Some researchers argue that rather than giving the
respondents pre-designed response categories, it is better to give them unstructured questions
where they have the freedom of expressing themselves the way they want. Some examples of
these kinds of questions are given below:
1. Why do you think Maggi noodles are liked by young children?
2. How do you generally decide on where you are going to invest your money?
3. Give THREE reasons why you believe that the show Satyamev Jayate has affected the
common Indian.
The data obtained here is rich in content, but quantification cannot go beyond frequency and
percentages to represent the findings.
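As an illustration of the frequency-and-percentage summary mentioned above, the following sketch tallies open-ended answers after they have been grouped into themes. The theme labels and the answers are hypothetical, and the coding of free text into themes is assumed to have been done by hand beforehand.

```python
from collections import Counter

def frequency_table(coded_answers):
    """Return {theme: (frequency, percentage)} for a list of coded answers."""
    if not coded_answers:
        raise ValueError("no answers to summarize")
    counts = Counter(coded_answers)
    total = len(coded_answers)
    return {
        theme: (n, round(100 * n / total, 1))
        for theme, n in counts.most_common()
    }

# Hypothetical themes assigned to eight respondents' answers.
answers = [
    "taste", "quick to cook", "taste", "advertising",
    "taste", "quick to cook", "price", "taste",
]

table = frequency_table(answers)
for theme, (freq, pct) in table.items():
    print(f"{theme}: {freq} ({pct}%)")
```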
Non-formalized and concealed: If the objective of the research study is to uncover socially
unacceptable desires and subconscious and unconscious motivations, the investigator makes
use of questions of low structure and disguised purpose. However, these require interpretation
that is highly skilled. Cost, time and effort are also much higher than in others.
Another useful way of categorizing questionnaires is on the basis of the method of
administration. One kind of questionnaire necessitates a face-to-face interaction.
In this case, the interviewer reads out each question and makes a note of the
respondent's answers. This administration is called a schedule. It might have a mix of the
questionnaire types as described in the section above and might have some structured and
some unstructured questions. The other kind is the self-administered questionnaire, where the
respondent reads all the instructions and questions on his own and records his own statements
or responses.
Thus, all the questions and instructions need to be explicit and self-explanatory.
The selection of one over the other depends on certain study prerequisites.
Population characteristics: In case the population is illiterate or unable to write the responses,
then one must as a rule use the schedule, as the questionnaire cannot be effectively answered
by the subject himself.
Population spread: In case the sample to be studied is large and widely spread, then one needs
to use the questionnaire. When the resources available for the study are limited, then
schedules become expensive to use and the self-administered questionnaire is better.
Study area: In case one is studying a sensitive topic like harassment at work, a self-administered
questionnaire is suggested. However, in case the study topic needs additional
probing then a schedule is better. There is another categorization that is based upon the mode
of administration; this would be discussed in later sections of the unit.
Even though the questionnaire method is most used by researchers, designing a well-
structured questionnaire needs considerable skill. Presented below is a standardized process
that a researcher can follow.
1. Convert the research objectives into information areas
This is the first step of the design process. By this time the researcher is clear about the
research questions; research objectives; variables to be studied; research information required
and the characteristics of the population being studied. Once these tasks are done, one can
prepare a tabled framework so that the questions which need to be developed become clear.
2. Method of administration
Once the researcher has identified his information area, he needs to specify how the
information should be collected. The researcher usually has available a variety of methods for
administering the study. The main methods are personal schedule (discussed earlier in the
unit), self-administered questionnaire through mail, fax, e-mail and web-based questionnaire.
Profession like any other
profession that is not money making
Any other
Do we need to ask several questions instead of a single one? After deciding on the significance
of the question, one needs to ascertain whether a single question will serve the purpose
or should more than one question be asked. For example, in a TV serial study, one may give
ten popular serials to be ranked as 1 to 10 in order of preference. Then the second question
after the ranking question is:
'Why do you like the serial (the one you ranked No. 1/ prefer watching most)?' (Incorrect)
Here, one lady might say, 'Everyone in my family watches it'. While another might say, 'It
deals with the problems of living in a typical Indian joint family system' and yet another
might say, 'My friend recommended it to me'.
Thus, we need to ask her:
'What do you like about ?'
'Who all in your household watch the serial?' and
'How did you first hear about the serial?' (Correct)
Does the person remember? Many times, the question addressed might be putting too much
stress on an individual's memory. For example, consider the following questions:
How much did you spend on eating out last month? (Incorrect)
Such questions are beyond any normal individual's memory bank.
Thus, the questions listed above could have been rephrased as follows:
Can the respondent articulate? Sometimes the respondent might not know how to put the
answer in clear words. For example, if you ask a respondent to:
Describe a river rafting experience.
Most respondents would not know what phrases to use to give an answer. Thus, in the above
case, one can provide answer categories to the person as
follows:
Describe the river rafting experience. (Correct)
Sensitive information: There might be instances when the question being asked might be
embarrassing to the respondents and thus they would not be comfortable disclosing the data
required.
For example, questions such as the following will not get any answers.
Have you ever used fake receipts to claim your medical allowance?
(Incorrect)
Have you ever spit tobacco on the road (to tobacco consumers)?
(Incorrect)
However, in case the socially undesirable habit is in the context of a third person, the chances
of getting some correct responses are possible. Thus the questions should be rephrased as
follows:
Do you associate with people who use fake receipts to claim their medical allowance?
(Correct)
Do you think tobacco consumers spit tobacco on the road?
(Correct)
5. Determining the type of questions
Closed-ended questions
In closed-ended questions, both the question and response formats are structured and defined.
There are three kinds of formats, as we observed earlier: dichotomous questions,
multiple-choice questions and those that have a scaled response.
Probably in the next one year Definitely in the next one year
Sometimes, multiple-choice questions do not have verbal but rather numerical options for the
respondent to choose from, for example:
How much do you spend on grocery products (average in one month)?
Less than ₹2500/-
Between ₹2500–5000/-
More than ₹5000/-
Most multiple-choice questions are based upon ordinal or interval levels of measurement.
There could also be instances when multiple options are given to the respondent and he can
select all those that apply in the case. These kinds of multiple-choice questions are called
checklists. For example, in the organic food study, the retailer who does not stock organic
products was given multiple reasons as follows:
You do not currently sell organic food products because (Could be = 1)
You do not know about organic food products.
You are not interested.
Organic products do not have attractive packaging.
Organic food products are not supplied regularly.
Any other
Scales: Scales refer to the attitudinal scales that were discussed in detail in Unit 5.
Since these questions have been discussed in detail in the earlier unit, we will only
illustrate this with an example. The following is a question which has two sub-questions
designed on the Likert scale. These require simple agreement/disagreement
on the part of the respondent. This scale is based on the interval level of measurement.
Given below are statements related to your organization. Please indicate your
agreement/disagreement with each:
Clearly specify the issue: By reading the question, the person should be able to clearly
understand the information needed.
Which newspaper do you read? (Incorrect)
This might seem to be a well-defined and structured question. However, the 'you' could be the
person filling the questionnaire or the family. He could be reading different newspapers. He
might be reading different papers at home and, say, the college library. A better way to word
the question would be:
Which newspaper or newspapers did you personally read at home during the last month? In
case of more than one newspaper, please list all that you read. (Correct)
Use simple terminology: The researcher must take care to ask questions in a language that is
understood by the population under study. Technical words or difficult words that are not
used in everyday communication must be avoided.
Do you think thermal wear provides immunity? (Incorrect)
Do you think that thermal wear provides you protection from the cold? (Correct)
Avoid ambiguity in questioning: The words used in the questionnaire should mean the same
thing to all those answering the questionnaire. A lot of words are subjective and relative in
meaning. Consider the following question:
How often do you visit Pizza Hut?
Never
Occasionally
Sometimes
Often
Regularly (Incorrect)
These are ambiguous measures, as occasionally in the above question might be three to four
times in a week for one person, while for another it could be three times in a month. A much
better wording for this question would be the following:
In a typical month, how often do you visit Pizza Hut?
Less than once
1 or 2 times
3 or 4 times
More than 4 times (Correct)
Avoid leading questions: Any question that provides a clue to the respondents in terms of the
direction in which one wants them to answer is called a leading or biasing question.
For example,
Do you think that working mothers should buy ready-to-eat food even when it might contain
some chemical preservatives?
Yes
No
Don't know (Incorrect)
The question would mostly generate a negative answer, as no working mother would like to
buy something that is convenient but might be harmful.
Thus, it is advisable to construct a neutral question as follows:
Do you think that working mothers should buy ready-to-eat food?
Yes
No
Don't know (Correct)
Avoid loaded questions: Questions that address sensitive issues are termed as loaded
questions and the response to these questions might not always be honest, as the person might
not wish to admit the answer. For example, questions such as the following will rarely get an
affirmative answer:
Avoid double-barrelled questions: These are questions that have two separate options joined by an
'or' or 'and', like the following:
Do you think Nokia and Samsung have a wide variety of touch phones? Yes/no (Incorrect)
The problem is that the respondent may feel Nokia has better phones or Samsung has better
phones or both. These questions are referred to as double-barrelled and the researcher should
always split them into two separate questions. For example,
A wide variety of touch phones is available for:
Nokia
Samsung
Both (Correct)
7. Determine the questionnaire structure
The questions now have to be put together in a proper sequence.
Instructions: The questionnaires, even the schedules, always begin with standardized
instructions. These begin by greeting the respondent and then introducing the researcher and
then the purpose of questionnaire administration. For example, in the study on organic food
products, the following instructions
were given at the beginning of the questionnaire:
'Hi. We are carrying out a market research on the purchase behaviour for grocery
products/organic food. We are conducting a survey of consumers, retailers and experts in the
NCR for the same.
As you are involved in the purchase and/or consumption of food products, we seek your
cooperation for providing the following relevant information for our research. Thank you
very much.'
Opening questions: After instructions come the opening questions, which lead the reader
into the study topic. For example, a questionnaire on understanding the consumer's buying
behaviour in malls can ask an opening question that is generic in nature, such as:
What is your opinion about shopping at a mall?
Study questions: After the opening question/s, the bulk of the instrument needs to be
devoted to the main questions that are related to the specific information needs of the study.
Here also, the general rule is that the simpler questions, which do not require a lot of thinking
or response time should be asked first as they build the tempo for answering the more
difficult/sensitive questions later on. This method of going in a sequential manner from the
general to the specific is called the funnel approach.
Classification information: This is the information that is related to the basic socio-economic
and demographic traits of the person. These might include name (kept optional in some
cases), address, e-mail address and telephone number.
Layout of Questionnaire:
Having a good set of questions to ask the respondent doesn't totally guarantee success in
conducting a survey. The overall look of the questionnaire is also necessary to achieve the
goals of the survey.
Most often the respondents consider the questionnaire layout first before having the
motivation to complete the survey. Studies show that respondents may not be able to answer the
questions truthfully because of being pre-occupied or bothered by the number of pages to
answer, or the overall look of the questionnaire. Therefore, a good-looking questionnaire
layout is an important factor in increasing response rates.
Format
Placing a cover page on your survey questionnaire increases the level of motivation and
willingness to participate. The survey cover can instantly connect the respondents to the
survey and make them feel that they are important to make the survey a success.
The cover should contain the following:
1. The title of the survey or study
2. A one or two-sentence description of the survey, stating its purpose
3. Initial instructions
4. The name of the company conducting the survey
5. Any sponsors
The cover, as well as the back cover, should look simple to give an impression that the survey
is conducted in a professional manner. However, studies show that using colored covers
increases response rates by 2% to 4%, so feel free to add some spark to your cover.
2) The Instructions Page
On this page, explain further the purpose of the survey. Provide brief and specific instructions
on how the respondent should answer the questions. Also, instruct the respondent about the
deadline for completing the survey.
In addition, inform the respondent about confidentiality matters, and offer contact numbers
that the respondent may call if there are any problems or comments regarding the survey
questionnaire.
In forming the survey layout, the order of questions should be taken into consideration. The
questions should be arranged from general to specific. The very first question should be a
general one that still pertains to the goals or purpose of the survey, so that the respondent won't
get intimidated but rather becomes slowly engaged with the questionnaire. Being “general”
means that the first question should be applicable to all respondents and easy to answer in
just a few seconds.
The questions should be grouped according to their content. This helps the respondent to
organize his thoughts and reactions, leading to a more accurate response to the questions.
With regard to the appearance, the questions should be consistent in font style, font size, and
even the indentation.
5) Survey Length
According to Dillman (2000), the length of the survey varies depending on three factors
relating to the respondent: his sense of commitment, interest and sense of responsibility in
completing the survey. As a rule of thumb, keep the questions as short as possible to keep
these three levels at their peaks.
Probably the greatest benefit of the method is its adaptability. There is, actually speaking, no
domain or branch for which a questionnaire cannot be designed. It can be shaped in a manner
that can be easily understood by the population under study. The language, the content and
the manner of questioning can be modified suitably. The instrument is particularly suitable
for studies that are trying to establish the reasons for certain occurrences or behaviour. The
second advantage is that it assures anonymity if it is self-administered by the respondent, as
there is no pressure or embarrassment in revealing sensitive data.
A lot of questionnaires do not even require the person to fill in his/her name. Administering
the questionnaire is much faster and less expensive as compared to other primary and a few
secondary sources as well. There is considerable ease of quantitative coding and analysis of
the obtained information as most response categories are closed-ended and based on the
measurement levels as discussed in Unit 5. The chance of researcher bias is very little here.
Lastly, there is no pressure of immediate response, thus the subject can fill in the
questionnaire whenever he or she wants.
However, the method does not come without some disadvantages. The major disadvantage is
that the inexpensive standardized instrument has limited applicability, that is, it can be used
only with those who can read and write.
The return ratio, i.e., the number of people who return the duly filled-in questionnaires, is
sometimes not even 50 per cent of the number of forms distributed. Skewed sample response
could be another problem. This can occur in two cases; one, if the investigator distributes the
same to his friends and acquaintances and second, because of the self-selection of the
subjects. This means that the ones who fill in the questionnaire and return it might not be
representative of the population at large. In case the person is not clear about a question,
clarification with the researcher might not be possible.
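The return ratio mentioned above is a simple percentage. A minimal sketch follows; the function name and the figures of 400 forms distributed and 180 returned are hypothetical.

```python
def return_ratio(returned, distributed):
    """Percentage of distributed questionnaires that came back duly filled in."""
    if distributed <= 0:
        raise ValueError("at least one questionnaire must be distributed")
    if not 0 <= returned <= distributed:
        raise ValueError("returned must lie between 0 and distributed")
    return 100.0 * returned / distributed

# Hypothetical survey: 400 forms distributed, 180 returned.
ratio = return_ratio(180, 400)
print(f"Return ratio: {ratio:.1f} per cent")
```

A researcher who needs a given number of completed questionnaires can use an expected return ratio of this kind to decide how many forms to distribute.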
Summary
Learning Objectives
After going through this unit, you will be able to:
Explain the basic concepts of sampling
Distinguish between sample and census
Differentiate between a sampling and non-sampling error
Describe sampling design
Explain different types of probability sampling designs
Describe various types of non-probability sampling designs
Estimate the sample size required while estimating the population mean and proportion
Structure
7.1 Introduction
7.2 Sampling Concepts
7.2.1 Sample vs. Census
7.2.2 Sampling vs. Non-Sampling Error
7.3 Sampling Design
7.3.1 Probability Sampling Design
7.3.2 Non-Probability Sampling Designs
7.4 Determination of Sample Size
7.4.1 Sample Size for Estimating Population Mean
7.4.2 Determination of Sample Size for Estimating the Population Proportion
7.5 Summary
7.6 Keywords
7.1 Introduction
In the last unit, we discussed the concept of questionnaire designing. In this unit, we will
discuss an important aspect of research: sampling. Let us understand what sampling is and
what role it plays in research. As we have discussed earlier, research objectives are generally
translated into research questions that enable the researchers to identify the information
needs. Once the information needs are specified, the sources for collecting the information
are sought. Some of the information may be collected through secondary sources (published
material), whereas the rest may be obtained through primary sources. The primary methods
of collecting information include the observation method, personal interview with
questionnaire, telephone surveys and mail surveys. Surveys are, therefore, useful in
information collection, and their analysis plays a vital role in finding answers to research
questions. Survey respondents should be selected using the appropriate procedures;
otherwise, the researchers may not be able to get the right information to solve the problem
under investigation. This is done through sampling.
In this unit, we will discuss in detail the concept of sampling, including sampling and non-
sampling error, probability and non-probability sampling designs, as well as determination of
sample size.
Population: Population refers to any group of people or objects that form the subject of study
in a particular survey and are similar in one or more ways. For example, the number of full-
time MBA students in a business school could form one population. If there are 200 such
students, the population size would be 200. We may be interested in understanding their
perceptions about business education. If in an organization there are 1,000 engineers, out of
which 350 are mechanical engineers and we are interested in examining the proportion of
mechanical engineers who intend to leave the organization within six months, all the 350
mechanical engineers would form the population of interest. If the interest is in studying how
the patients in a hospital are looked after, then all the patients of the hospital would fall under
the category of population.
Element: An element comprises a single member of the population. Out of the 350
mechanical engineers mentioned above, each mechanical engineer would form an element of
the population.
Sampling frame: Sampling frame comprises all the elements of a population with proper
identification that is available to us for selection at any stage of sampling. Some examples of
sampling frames are:
The list of registered voters in a constituency
The telephone directory
The list of students registered with a university
The attendance sheet of a particular class
The payroll of an organization
When the population size is very large, it becomes virtually impossible to form a sampling
frame. We know that soft drinks have a large number of consumers and, therefore, it becomes
very difficult to form the sampling frame for the same.
Sample: It is a subset of the population. It comprises only some elements of the population.
For instance, if out of 350 mechanical engineers employed in an organization, 30 are
surveyed regarding their intention to leave the organization in the next six months, then these
30 members would constitute the sample.
Sampling unit: A sampling unit is a single member of the sample. If a sample of 50 students
is taken from a population of 200 MBA students in a business school, then each of the 50
students is a sampling unit.
Sampling: It is a process of selecting an adequate number of elements from the population so
that the study of the sample will not only help in understanding the characteristics of the
population but also enable us to generalize the results. We will see later that there are two
types of sampling designs—probability sampling design and non-probability sampling
design.
Census (or complete enumeration): An examination of each and every element of the
population is called census or complete enumeration. Census is an alternative to sampling.
We will discuss the inherent advantages of sampling over a complete enumeration later.
In a research study, we are generally interested in studying the characteristics of a population.
Suppose there are 2 lakh households in a town, and we are interested in estimating the
proportion of households that spend their summer vacations at a hill station. This information
can be obtained by asking every household in that town. If all the households in a population
are asked to provide information, such a survey is called a census. An alternative way of
obtaining the same information is by choosing a subset of all 2 lakh households and asking
them for the same information. This subset is called a sample.
Based upon the information obtained from the sample, a generalization about the population
characteristics could be made. However, that sample has to be representative of the
population. For a sample to be representative of the population, the distribution of sampling
units in the sample has to be in the same proportion as the elements in the population. For
example, if in a town there are 50, 35 and 15 per cent households in lower, middle and upper
income groups, respectively, then a sample taken from this population should have the same
proportions for it to be representative. There are several advantages of a sample over a
census, some of which are as follows:
Sample saves time and cost. Many times a decision-maker may not have enough time to
wait till all the information is available. Then, a sample could come to his rescue.
There are situations where a sample is the only option. When we want to estimate the average
life of fluorescent bulbs, they are burnt out completely. If we go for a complete enumeration,
there would not be anything left for use. Another example could be testing the quality of a
photographic film.
The study of sample instead of complete enumeration may, at times, produce more reliable
results. This is because by studying a sample, fatigue is reduced and fewer errors occur while
collecting the data, especially when a large number of elements are involved.
A census is appropriate when the population size is small, e.g., the number of public sector
banks in a country. Suppose the researcher is interested in collecting information from the
top management of a bank regarding their views on the monetary policy announced by the
Reserve Bank of India (RBI). In this case, a complete enumeration may be possible as the
population size is not very large.
There are two types of errors that may occur when we try to estimate the population
parameters from the sample. These are called sampling and non-sampling errors.
Sampling error: This error arises when a sample is not representative of the population. It is
the difference between the sample mean and the population mean. The sampling error reduces
as the sample size increases, because a larger sample tends to be more representative of the
population.
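The definition of sampling error can be illustrated with a small sketch. The population values and the function name sampling_error below are hypothetical, chosen only to show the arithmetic; they are not data from this unit.

```python
from statistics import mean

def sampling_error(population, sample):
    """Sampling error as defined above: sample mean minus population mean."""
    return mean(sample) - mean(population)

# Hypothetical population of five values and a sample of three of them.
population = [10, 20, 30, 40, 50]   # population mean = 30
sample = [10, 20, 30]               # sample mean = 20

print(sampling_error(population, sample))      # -10
# A census (every element examined) has no sampling error.
print(sampling_error(population, population))  # 0
```

The second call shows why the error shrinks as the sample grows towards the full population: when every element is included, the two means coincide.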
Probability sampling designs are used in conclusive research. In a probability sampling
design, each and every element of the population has a known chance of being selected in the
sample. The known chance does not mean equal chance. Simple random sampling is a special
case of probability sampling design where every element of the population has both known
and equal chances of being selected in the sample.
In case of non-probability sampling design, the elements of the population do not have any
known chance of being selected in the sample. These sampling designs are used in
exploratory research.
7.3.1 Probability Sampling Design
Under probability sampling design, the following sampling designs would be covered:
Simple random sampling with replacement (SRSWR)
Simple random sampling without replacement (SRSWOR)
Systematic sampling
Stratified random sampling
a. Simple random sampling with replacement (SRSWR)
Under this scheme, a list is prepared which consists of all the elements of the population from
where the samples are to be drawn. If there are 1,000 elements in the population, we write the
identification number or the name of all the 1,000 elements on 1,000 different slips. These are
put in a box and shuffled properly. If there are 20 elements to be selected from the
population, the simple random sampling procedure involves selecting a slip from the box and
reading the identification number. Once this is done, the chosen slip is put back to the box
and again a slip is picked up, and the identification number is read from that slip. This
process continues till a sample of 20 is selected. Please note that the first element is chosen
with a probability of 1/1,000. The second one is also selected with the same probability and
so are all the subsequent elements of the population.
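The slip-drawing procedure can be sketched in a few lines of Python. This is only a minimal illustration, assuming the identification numbers run from 1 to 1,000; the function name srswr and the seed value are hypothetical choices made so that the example can be repeated.

```python
import random

def srswr(population_ids, sample_size, seed=None):
    """Simple random sampling WITH replacement: each slip goes back in the box."""
    rng = random.Random(seed)
    # Every draw picks one slip uniformly from the full box, so each element
    # has probability 1/len(population_ids) on every draw.
    return [rng.choice(population_ids) for _ in range(sample_size)]

population = list(range(1, 1001))        # identification numbers 1..1000
sample = srswr(population, 20, seed=42)  # seed fixed only for repeatability
print(sample)
```

Because the slip is returned to the box, the same element can appear more than once in the sample.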
Simple random sampling (with or without replacement) is generally not used in consumer
research. This is because in consumer research, the population size is usually very large, which
creates problems during the preparation of a sampling frame. For example, the number of
consumers of soft drinks, pizza, shampoo, soap, chocolate, etc., is very large. However, these
(SRSWR and SRSWOR) designs could be useful when the population size is very small, for
example, the number of steel/aluminium-producing companies in India and the number of
banks in India. Since the population size is quite small, the preparation of a sampling frame
does not create any problem.
Another problem with these (SRSWR and SRSWOR) designs is that we may not get a
representative sample using such a scheme. Consider an example of a locality having 10,000
households, out of which 5,000 belong to low-income group, 3,500 belong to middle income
group and the remaining 1,500 belong to high-income group. Suppose it is decided to take a
sample of 100 households using simple random sampling. The selected sample may not
contain even a single household belonging to the high- and middle-income groups, and only
the low-income households may get selected, thus resulting in a non-representative sample.
c. Systematic sampling
Systematic sampling takes care of the limitation of the simple random sampling that the
sample may not be a representative one. In this design, the entire population is arranged in a
particular order. The order could be the calendar dates or the elements of a population
arranged in an ascending or a descending order of the magnitude, which may be assumed as
random. List of subjects arranged in the alphabetical order could also be used and they are
usually assumed to be random in order. Once this is done, the steps followed in the
systematic sampling design are as follows:
First of all, a sampling interval K = N/n is calculated, where N = the size of the population
and n = the size of the sample.
The sampling interval K should be an integer. If it is not, it is rounded off to make it an
integer.
A random number is selected from 1 to K. Let us call it C.
The first element to be selected from the ordered population would be C, the next element
would be C + K and the subsequent one would be C + 2K, and so on, till a sample of size n is
selected.
This way we can get representation from all the classes in the population and overcome the
limitations of the simple random sampling. To take an example, assume that there are 1,000
grocery shops in a small town. These shops could be arranged in an ascending order of their
sales, with the first shop having the smallest sales and the last shop having the highest sales.
If it is decided to take a sample of 50 shops, then our sampling interval K will be equal to
1000 ÷ 50 = 20. Now, we select a random number from 1 to 20. Suppose the chosen number
is 10. This means that the shop number 10 will be selected first and then shop number
10 + 20 = 30 and the next one would be 10 + 2 × 20 = 50, and so on, till all the 50 shops are
selected. This way, we can get a representative sample in the sense that it will contain small,
medium and large shops.
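The grocery shop example can be reproduced with a short sketch. The function name systematic_sample is hypothetical, and the code assumes the shops are already numbered 1 to N in ascending order of sales.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Return the positions C, C + K, C + 2K, ... for a systematic sample."""
    interval = round(population_size / sample_size)   # K = N/n, rounded to an integer
    if start is None:
        start = random.randint(1, interval)           # random start C between 1 and K
    return [start + i * interval for i in range(sample_size)]

# 1,000 shops, sample of 50, random start assumed to be 10 as in the text.
shops = systematic_sample(1000, 50, start=10)
print(shops[:3], shops[-1])   # [10, 30, 50] 990
```

When N/n is not a whole number and K is rounded, the last positions may run past the end of the list; in practice the interval or the sample size is then adjusted.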
It may be noted that in systematic sampling, the first unit of the sample is selected at random
(probability sampling design), and having chosen this, we have no control over the
subsequent units of sample (non-probability sampling). This design of sampling is called
mixed sampling.
The main advantage of systematic sampling design is its simplicity. When sampling from a
list of population arranged in a particular order, one can easily choose a random start as
described earlier. After having chosen a random start, every Kth item can be selected instead
of going for a simple random selection. This design is statistically more efficient than
simple random sampling, provided the condition of ordering of the population is satisfied.
The use of systematic sampling is quite common as it is easy and cheap to select a systematic
sample. In systematic sampling, one does not have to jump back and forth all over the
sampling frame wherever a random number leads, and neither does one have to check for
duplication of elements as compared to simple random sampling. Another advantage of
systematic sampling over simple random sampling is that one does not require a complete
sampling frame to draw a systematic sample. The investigator may be instructed to interview
every 10th customer entering a mall without a list of all customers.
stratification would be the household income. This is because the expenditure on
entertainment and household income are highly correlated.
Generally, stratification is done on the basis of demographic variables like age, income,
education and gender. Customers are usually stratified on the basis of life stages and income
levels to study their buying patterns. Companies may be stratified according to size and
profits for analysing the stock market reactions.
How many strata should be constructed?
Going by common sense, as many strata as possible should be used so that the elements of
each stratum will be as homogeneous as possible. However, it may not be practical to
increase the number of strata and, therefore, the number may have to be limited. Too many
strata may complicate the survey and make preparation and tabulation difficult. Costs of
adding more strata may be more than the benefits obtained. Further, the researcher may end
up with the practical difficulty of preparing a separate sampling frame as the simple random
samples are to be drawn from each stratum.
What would be the appropriate sample size to take from each stratum?
This question pertains to the number of observations to be taken from each stratum. At the
outset, one needs to determine the total sample size for the universe and then allocate it
between each stratum. This may be explained as follows:
Let there be a population of size N. Let this population be divided into three strata based on
a certain criterion. Let N1, N2 and N3 denote the sizes of strata 1, 2 and 3, respectively, such
that N = N1 + N2 + N3. These strata are mutually exclusive and collectively exhaustive. Each
of these three strata could be treated as a separate population. Now, if a total sample of size n
is to be taken from the population, the question arises as to how much of the sample should
be taken from strata 1, 2 and 3, respectively, so that the sample sizes from the strata add up
to n. Let the sizes of the samples from the first, second and third strata be n1, n2 and n3,
respectively, such that n = n1 + n2 + n3. There are two schemes that may be used to determine
the values of ni (i = 1, 2, 3) for each stratum. These are proportionate and disproportionate
allocation schemes.
Proportionate allocation scheme: In this scheme, the size of the sample in each stratum is
proportional to the size of the population of the strata. For example, if a bank wants to
conduct a survey to understand the problems that its customers are facing, it may be
appropriate to divide them into three strata based upon the size of their deposits with the
bank. Let us assume that there are 10,000 customers in a bank. Out of this, 1,500 of them are
big account holders (having deposits of more than Rs 10 lakh), 3,500 of them are medium
account holders (having deposits of more than Rs 2 lakh but less than Rs 10 lakh) and
the remaining 5,000 are small account holders (having deposits of less than Rs 2 lakh). Suppose
the total budget for sampling is fixed at Rs 20,000 and the cost of sampling a unit (customer) is
Rs 20. If a sample of 100 is to be chosen from all the three strata, the size of
This way the size of the sample chosen from each stratum is proportional to the size of the
stratum. Once we have determined the sample size from each stratum, one may use simple
random sampling, systematic sampling or any other sampling design to draw the sample
from each stratum.
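The proportionate allocation for the bank example can be checked with a short sketch. The function name proportionate_allocation and the stratum labels are hypothetical; the rule used is ni = n × Ni / N, which is what "proportional to the size of the stratum" implies.

```python
def proportionate_allocation(stratum_sizes, total_sample):
    """Allocate the total sample to strata in proportion to stratum size."""
    population = sum(stratum_sizes.values())          # N = N1 + N2 + N3
    return {
        name: round(total_sample * size / population) # ni = n * Ni / N
        for name, size in stratum_sizes.items()
    }

# Bank example: 10,000 customers in three strata, total sample of 100.
bank = {"big": 1500, "medium": 3500, "small": 5000}
print(proportionate_allocation(bank, 100))
# {'big': 15, 'medium': 35, 'small': 50}
```

With less convenient numbers the rounded values may not add up exactly to n, in which case one or two strata are adjusted by a unit.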
Disproportionate allocation: As per the proportionate allocation explained above, the sizes of
the samples from strata 1, 2 and 3 are 15, 35 and 50, respectively. As it is known that the cost
of sampling a unit is Rs 20, irrespective of the stratum from which the sample is drawn, the
bank would naturally be more interested in drawing a large sample from stratum 1, which has
the big customers, as it gets most of its business from stratum 1. In other words, the bank may
follow a disproportionate allocation of sample as the importance of each stratum is not the
same from the point of view of the bank. The bank may like to take a sample of 45 from
stratum 1, and 40 and 15 from strata 2 and 3, respectively. Also, a large sample may be
desired from the strata having more variability.
In all the above situations, the sampling unit may either be self-selected or selected because
of ease of availability. No effort is made to choose a representative sample. Therefore, in this
design, the difference between the population value (parameters) of interest and the sample
value (statistic) is unknown both in terms of the magnitude and direction. Therefore, it is not
possible to make an estimate of the sampling error and researchers would not be able to make
a conclusive statement about the results from such a sample. Because of this, convenience
sampling should not be used in conclusive research (descriptive and causal research).
purpose of an exploratory research is to gain an insight into the problem and generate a set of
hypotheses which could be tested with the help of conclusive research. When very little is
known about a subject, a small-scale convenience sampling can be of use in the exploratory
work to help understand the range of variability of responses in a subject area.
Judgemental sampling
Under judgemental sampling, experts in a particular field choose what they believe to be the
best sample for the study in question. Judgement sampling calls for special efforts to locate
and gain access to the individuals who have the required information. Here, the judgement of
an expert is used to identify a representative sample. For example, the shoppers at a shopping
centre may serve to represent the residents of a city or some cities may be selected to
represent a country. Judgemental sampling design is used when the required information is
possessed by a limited number/category of people. This approach may not empirically
produce satisfactory results and may, therefore, curtail generalizability of the findings due to
the fact that we are using a sample of experts (respondents) that are usually conveniently
available to us.
Further, there is no objective way to evaluate the precision of the results. A company wanting
to launch a new product may use judgemental sampling for selecting 'experts' who have prior
knowledge or experience of similar products. A focus group discussion of such experts may
be conducted to get valuable insights. Opinion leaders who are knowledgeable are included in
the organizational context. Enlightened opinions (views and knowledge) constitute a rich data
source. A very special effort is needed to locate and have access to individuals who possess
the required information.
The most common application of judgemental sampling is in business-to-business (B to B)
marketing. Here, a very small sample of lead users, key accounts or technologically
sophisticated firms or individuals is regularly used to test new product concepts, producing
programmes, etc.
Snowball sampling
Snowball sampling is generally used when it is difficult to identify the members of the
desired population, e.g., deep-sea divers, families with triplets, people using walking sticks,
doctors specializing in a particular ailment, etc. Under this design, each respondent, after
being interviewed, is asked to identify one or more experts in the field. This could result in a
very useful sample. The main problem is in making the initial contact. Once this is done,
these cases identify more members of the population, who then identify further members, and
so on. It may be difficult to get a representative sample. One plausible reason for this is that
initial respondents may identify other potential respondents who are similar to themselves.
The next problem is to identify new cases.
7.4 Determination of Sample Size
The size of a sample depends upon the basic characteristics of the population, the type of
information required from the survey and the cost involved. Therefore, a sample may vary in
size for several reasons. The size of the population does not influence the size of the sample,
as will be shown later on. There are various methods of determining the sample size in
practice:
Researchers may arbitrarily decide the size of the sample without giving any explicit
consideration to the accuracy of the sample results or the cost of sampling. This arbitrary
approach should be avoided.
For some projects, the total budget for the field survey (usually mentioned in a project
proposal) is allocated. If the cost of sampling per sample unit is
known, one can easily obtain the sample size by dividing the total budget allocation
by the cost of sampling per unit. This method concentrates only on the cost aspect of
sampling, rather than the value of information obtained from such a sample.
There are other researchers who decide on the sample size based on what was done by
the other researchers in similar studies. Again, this approach cannot be a substitute for
the formal scientific approach.
The most commonly used approach for determining the size of the sample is the
confidence interval approach covered under inferential statistics. This approach is
discussed below for determining the size of a sample for estimating the population mean
and population proportion. In a confidence interval approach, the following points are
taken into account for determining the sample size in estimation problems involving means:
(a) The variability of the population: It would be seen that the higher the variability as
measured by the population standard deviation, the larger will be the size of the sample. If the
standard deviation of the population is unknown, a researcher may use the estimates of the
standard deviation from previous studies. Alternatively, the estimates of the population
standard deviation can be computed from the sample data.
(b) The confidence attached to the estimate: It is a matter of judgement how much
confidence you want to attach to your estimate. Assuming a normal distribution, the higher
the confidence the researcher wants for the estimate, the larger will be the sample size. This is
because the value of the standard normal ordinate 'Z' will vary accordingly. For 90 per cent
confidence, the value of 'Z' would be 1.645 and for 95 per cent confidence, the corresponding
'Z' value would be 1.96, and so on (see Appendix 1 at the end of the book). It would be seen
later that a higher confidence would lead to a larger 'Z' value.
(c) The allowable error or margin of error: How accurate we want our estimate to be is
again a matter of judgement of the researcher. It will, of course, depend upon the objectives
of the study and the consequences resulting from a higher inaccuracy. If the researcher
seeks greater precision, the resulting sample size would be larger.
The formula for determining the sample size in such a case is given by,
n = (Z × σ / E)²
where σ is the population standard deviation, Z is the value of the standard normal ordinate
for the chosen confidence and E is the allowable error. It may be noted from above that the
size of the sample increases with the variability in the population and with the value of 'Z'
for a confidence interval. It decreases as the allowable error increases. It may also be noted
that the size of a sample does not depend upon the size of the population.
A solved out example for the determination of a sample size is given below:
Example 7.1: An economist is interested in estimating the average monthly household
expenditure on food items by the households of a town. Based on past data, it is estimated
that the standard deviation of the population on the monthly expenditure on food items is Rs 30.
With allowable error set at Rs 7, estimate the sample size required at 90 per cent confidence.
Solution:
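A minimal sketch of the computation is given here, assuming the standard confidence interval formula n = (Z × σ / E)², with σ = 30, E = 7 and Z = 1.645 for 90 per cent confidence as stated earlier in this unit. The function name sample_size_for_mean is hypothetical.

```python
import math

def sample_size_for_mean(sigma, allowable_error, z):
    """Sample size for estimating a population mean: n = (Z * sigma / E)^2."""
    n = (z * sigma / allowable_error) ** 2
    return math.ceil(n)   # round up so the allowable error is not exceeded

# Example 7.1: sigma = Rs 30, E = Rs 7, 90 per cent confidence (Z = 1.645).
n = sample_size_for_mean(sigma=30, allowable_error=7, z=1.645)
print(n)   # (1.645 * 30 / 7)^2 = 7.05^2 = 49.70, rounded up to 50
```

Under these assumptions a sample of about 50 households would be required.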
7.4.2 Determination of Sample Size for Estimating the Population Proportion
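The formula for determining the sample size in such a case is given by,
n = pq (Z / E)²
where p is the population proportion, q = 1 - p, Z is the value of the standard normal
ordinate for the chosen confidence and E is the allowable error.
The above formula will be used if the value of population proportion p is known. If, however,
p is unknown, we substitute the maximum value of pq in the above formula. It can be shown
that the maximum value of pq is 1/4, which occurs when p = 1/2 and q = 1/2.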
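The formula, including the fallback to pq = 1/4 when p is unknown, can be sketched as follows. The function name and the illustrative inputs (95 per cent confidence, allowable error of 0.05) are assumptions made for the example and are not taken from the unit.

```python
import math

def sample_size_for_proportion(allowable_error, z, p=None):
    """Sample size for estimating a proportion: n = p * q * (Z / E)^2."""
    if p is None:
        p = 0.5               # p unknown: pq takes its maximum value 1/4 at p = 1/2
    q = 1 - p
    return math.ceil(p * q * (z / allowable_error) ** 2)

# Hypothetical inputs: Z = 1.96 (95 per cent confidence), E = 0.05.
print(sample_size_for_proportion(0.05, 1.96))          # 385 (p unknown)
print(sample_size_for_proportion(0.05, 1.96, p=0.2))   # 246 (p known)
```

Using p = 1/2 gives the largest, and therefore the safest, sample size for a given confidence and allowable error.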
7.5 Summary
Let us recapitulate the main points discussed in this unit:
Surveys are useful for information collection. The survey respondents should be selected
using appropriate procedures. The process of selecting the right individuals, objects
or events for the study is known as sampling.
An alternative to sample is a census where each and every element of the population
(universe) is examined. There are many advantages of sampling over complete enumeration.
While estimating the population parameter using sample results, the researcher may incur
two types of errors: sampling error and non-sampling error.
The process of selecting samples from the population is referred to as sampling design. There
are two types of sampling designs: probability sampling designs and non-probability
sampling designs.
Under the non-probability sampling designs, there are convenience sampling, judgmental
sampling and snowball sampling.
Introduction
Data Editing
Field Editing
Centralized In-House Editing
Coding
Coding Closed-Ended Structured Questions
Coding Open-Ended Structured Questions
Classification and Tabulation of Data
Summary
Keywords
Introduction
In the last few units, you have learnt about the various aspects of data collection. The critical
job of the researcher begins after the data has been collected. He has to use this information
to assess whether he had been correct or incorrect while making certain assumptions in the
form of the hypotheses at the beginning of the study. The raw data that has been collected
must be refined and structured in such a format that it can lend itself to statistical enquiry.
This process of preparing the data for an analysis is a structured and sequential process. The
process starts by validating the measuring instrument, which could be a questionnaire or any
other primary technique. This is followed by editing, coding, classifying and tabulating the
obtained data. In this unit we will learn these steps of preparing the data through editing,
coding and tabulating, so that it is ready for any kind of statistical analysis, in order to
achieve the research objectives we had formulated earlier.
Data Editing
Data editing is the process that involves detecting and correcting errors (logical
inconsistencies) in data. After collection, the data is subjected to processing. Processing
requires that the researcher must go over all the raw data forms and check them for errors.
Validation becomes more important in the following cases:
In case the form had been translated into another language, expert analysis is done to see
whether the meaning of the questions in the two measures is the same or not.
The second case could be that the questionnaire survey has to be done at multiple locations
and it has been outsourced to an outside research agency.
The respondent seems to have used
the same response category for all the questions; for example, there is a tendency on a five
point scale to give 3 as the answer for all questions.
The form that is received back is incomplete, in the sense that either the person has not
answered all questions, or in case of a multiple-page questionnaire, one or more pages are
missing.
The forms received are not in the proportion of the sampling plan. For example, instead of
an equal representation from government and private sector employees, 65 per cent of the
forms are from the government sector. In such a case the researcher would either need to
discard the extra forms or get an equal number filled in by private sector employees.
Once the validation process has been completed, the next step is the editing of the raw data
obtained. While carrying out the editing, the researcher needs to ensure that:
Field Editing
Usually, the preliminary editing of the information obtained is done by the field investigators
or supervisors who review the filled forms for any inconsistencies, non-responses, illegible
responses or incomplete questionnaires. Thus, the errors can be corrected immediately and, if
need be, the respondent who filled in the form can be contacted again. The other advantage is
that regular field editing also lets one check whether the surveyor is handling the
instructions and probing correctly. Thus, the researcher can advise and
train the investigator on how to administer the questionnaire correctly.
First, one might detect an incorrect entry. For example, in case of a five-point scale one
might find that someone has used a value more than 5. In another case, one might be asking a
question like, 'how many days do you travel out of the city in a week?' and the person says
'15 days'. Here one can carry out a quick frequency check of the responses; this will
immediately detect an unexpected value.
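A quick frequency check of this kind can be sketched as follows. The response data and the function name frequency_check are hypothetical; the valid range of 0 to 7 days follows from the wording of the question.

```python
from collections import Counter

def frequency_check(responses, valid_values):
    """Tabulate the responses and flag any value outside the valid set."""
    counts = Counter(responses)
    unexpected = {value: count for value, count in counts.items()
                  if value not in valid_values}
    return counts, unexpected

# Hypothetical answers to 'how many days do you travel out of the city in a week?'
days_travelled = [2, 0, 1, 15, 3, 2, 7, 1]
counts, unexpected = frequency_check(days_travelled, valid_values=range(0, 8))
print(unexpected)   # {15: 1}
```

The same check works for a five-point scale by passing range(1, 6) as the valid values.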
The second and the major problem that most researchers face is that of 'armchair
interviewing' or a fudged interview. One way to handle this is to first scroll through the
answers to the open-ended questions, as these are generally difficult to fake when the
investigator is filling in multiple forms.
The researcher has some standard processes available to him to carry out the editing process.
These are briefly discussed below.
Backtracking: The best and the most efficient way of handling unsatisfactory responses is to
return to the field, and go back to the respondents. This technique is best used for industrial
surveys but is a little difficult in individual surveys.
Allocating missing values: This is a contingency plan that the researcher might need to adopt
in case going back to the field is not possible. Then the option might be to assign a missing
value to the blanks or the unsatisfactory responses. However, this works in case:
Plug value: In cases such as the third condition above, when the variable being studied is the
key variable, then sometimes the researcher might insert a plug value. Sometimes one can
plug an average or a neutral value in such cases, for example a 3 for a five-point scale or the
researcher might have to establish a rule as to what value will be put if the person has not
answered. Sometimes, the respondents' pattern of responses to other questions is used to
extrapolate and calculate an appropriate response for the missing answer.
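A minimal sketch of the plug-value idea follows, assuming that unanswered items are recorded as `None`. The helper names and the ratings are hypothetical; the neutral value 3 for a five-point scale is the one suggested in the text, and the average is taken over the answered items only.

```python
from statistics import mean

def plug_neutral(responses, neutral=3):
    """Replace each missing answer with the neutral point of the scale."""
    return [neutral if r is None else r for r in responses]

def plug_average(responses):
    """Replace each missing answer with the average of the answered items."""
    answered = [r for r in responses if r is not None]
    average = mean(answered)
    return [average if r is None else r for r in responses]

# Hypothetical five-point ratings with two unanswered items.
ratings = [4, None, 5, 2, None, 4]
print(plug_neutral(ratings))  # [4, 3, 5, 2, 3, 4]
print(plug_average(ratings))  # [4, 3.75, 5, 2, 3.75, 4]
```

Whichever rule is chosen, it should be fixed before data entry begins so that every form is treated the same way.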
Discarding unsatisfactory responses: If the response sheet has too many blanks/illegible or
multiple responses for a single answer, the form is not worth correcting and editing. Hence, it
is much better to completely discard the whole questionnaire.
Coding
The process of identifying and denoting a numeral to the responses given by a
respondent is called coding. This is essentially done in order to help the researcher in
recording the data in a tabular form later. It is advisable to assign a numeric code even for the
categorical data (e.g., gender). In fact, even responses to open-ended questions, which are in
statement form, are categorized and assigned numbers. The reason for doing this is that
the graphic representation of data in charts and figures becomes easier.
Usually, the codes that have been formulated are organized into fields, records and
files. For example, the gender of a person is one field and the codes used could be 0 for males
and 1 for females. All related fields, for example, all the demographic variables like age,
gender, income, marital status and education, could be one record. The records of the entire
sample under study form a single file. The data that is entered in a spreadsheet, such as
Excel, is in the form of a data matrix, which is simply a rectangular arrangement of the
data in rows and columns. Here, every row represents a single case or record.
Codebook formulation: In order to manage the data entry process, it is best to prepare a
method for entering the records. This coding scheme for all the variables under study is called
a code book. Generally, while designing the rules, care must be taken to decide on some
categories that are:
Comprehensive: Should cover all the possible answers to the question that was
asked.
Mutually exclusive: The categories and codes devised must be exclusive or clearly
different from each other.
Single variable entry: The response that is being entered and the code for it should
indicate only a single variable. For example, a 'working single mother' might seem an
apparently simple category which one could code as 'occupation'. However, it needs
three columns—occupation, marital status and family life. So, one needs to have three
different codes to enter this information.
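The single variable entry rule can be illustrated with a small sketch. The code book below is hypothetical: the variable names, the response labels and the numeric codes are assumptions made for the illustration, with 9 used as the non-response code.

```python
# Hypothetical code book: one column per variable, each with its own codes.
CODE_BOOK = {
    "occupation": {"not working": 0, "working": 1},
    "marital_status": {"single": 0, "married": 1},
    "family_life": {"no children": 0, "has children": 1},
}
MISSING_CODE = 9  # assumed code for a non-response

def encode(record):
    """Turn one respondent's answers into a row of numeric codes."""
    row = []
    for variable, codes in CODE_BOOK.items():
        answer = record.get(variable)
        row.append(codes.get(answer, MISSING_CODE))
    return row

# A 'working single mother' needs three separate entries, not one.
working_single_mother = {
    "occupation": "working",
    "marital_status": "single",
    "family_life": "has children",
}
print(encode(working_single_mother))      # [1, 0, 1]
print(encode({"occupation": "working"}))  # [1, 9, 9]
```

Each row produced this way is one record of the data matrix, with one column per variable.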
Based on the above rules, one creates a code book. This would generally contain information
on the question number, variable name, response descriptors, coding instructions and the
column descriptor. Table 8.2 gives an example from a questionnaire designed to measure the
consumer buying behavior for ready-to-eat food products.
As we have read in Unit 6, a questionnaire can have both closed-ended and open-ended
questions. When the questions are structured and the response categories are prescribed then
one does what is called pre-coding, i.e., giving numeral codes to the designed responses
before administration. However, if the questions are structured and the answers are
open-ended, one needs to decide on the codes after the administration of the survey. This is
called post-coding.
The method of coding for structured questions is easier as the response categories are decided
in advance. The coding method to be followed for different kinds of questions is discussed
below.
Dichotomous questions: For dichotomous questions, which are on a nominal scale, the
responses can be binary, for example: Do you eat ready-to-eat food? Yes = 1; No = 0. This
means if someone eats ready-to-eat food, he/she will be given a score of 1 and if not, then 0.
Ranking questions: For ranking questions where there are multiple objects to be ranked, the
person will have to make multiple columns, with column numbers equaling the number of
objects to be ranked.
Scaled questions: For questions that are on a scale, usually an interval scale, the
question/statement will have a single column and the coding instruction would indicate what
number needs to be allocated for the response options given in the scale. Consider the
following question:
Please indicate your level of agreement with the following statements.
SA – Strongly agree; A – Agree; N – Neutral; D – Disagree; SD – Strongly disagree
The code book for this will look as follows:
Missing values: It is advisable to use a standard format for signifying a non-response or a
missing value. For example, a code of 9 could be used for a single-column variable, 99 for a
double-column variable, 999 for a three-character variable and so on. The researcher
must take care as far as possible to use a value that is starkly different from the valid
responses. This is one of the reasons why 9 is suggested. However, in case you have a
10-point scale, do not use 9.
Sometimes, the data obtained from the primary instrument is so huge that it becomes
difficult to interpret. In such cases, the researcher might decide to reduce the information into
homogenous categories. This method of arrangement is called classification of data. This can
be done on the basis of class intervals.
Classification by class intervals: Numerical data, like the ratio scale data, can be classified
into class intervals. This is to assist the quantitative analysis of data. For example, the age
data obtained from the sample could be reduced to homogeneous grouped data: all those
below twenty-five form one group, those aged 25–35 form another group, and so on. Thus,
each group will have class limits: an upper and a lower limit. The difference between the
limits is termed the class magnitude. One can have class intervals of both equal and unequal
magnitude.
The decision on how many classes and whether equal or unequal depends upon the
judgement of the researcher. Generally, multiples of 2 or 5 are preferred. Some researchers
adopt the following formula for determining the size of the class interval:
I = R/(1 + 3.3 log N)
Where,
I = Size of class interval,
R = Range (i.e., difference between the values of the largest item and smallest item among
the given items),
N = Number of items to be grouped (log N is the logarithm of N to the base 10).
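The formula can be tried out with a short sketch. The function name and the figures (a range of 60 over 100 items) are hypothetical, and the logarithm is assumed to be to the base 10.

```python
import math

def class_interval_size(data_range, n_items):
    """I = R / (1 + 3.3 log N), with the logarithm taken to the base 10."""
    return data_range / (1 + 3.3 * math.log10(n_items))

# Hypothetical data: 100 items whose largest and smallest values differ by 60.
size = class_interval_size(60, 100)
print(round(size, 2))  # 7.89
```

In practice the result would be rounded to a convenient magnitude, for example 10 or 5, in line with the preference for multiples of 2 or 5.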
The class intervals that are decided upon could be exclusive, for example:
10–15
15–20
20–25
25–30
In this case, the upper limit of each is excluded from the category. Thus, we read the first
interval above as 10 and under 15, the next one as 15 and under 20, and so on.
The other kind is inclusive, that is:
10–15
16–20
21–25
26–30
Here, both the lower and the upper limits are included in the interval. It says 10–15 but
actually means 10–15.99. It is recommended that when one has continuous data it should be
signified as 10–15.99, as then all possibilities of the responses are exhausted here. However,
for discrete data one can use 10–15.
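The reading of exclusive class intervals ('10 and under 15') can be sketched as follows. The limits are those of the exclusive example above; the function name is hypothetical.

```python
from bisect import bisect_right

LIMITS = [10, 15, 20, 25, 30]  # class limits of the exclusive example

def exclusive_class(value):
    """Return the exclusive class a value belongs to, or None if it is out of range."""
    if not LIMITS[0] <= value < LIMITS[-1]:
        return None
    i = bisect_right(LIMITS, value) - 1
    return f"{LIMITS[i]} and under {LIMITS[i + 1]}"

print(exclusive_class(14.99))  # 10 and under 15
print(exclusive_class(15))     # 15 and under 20
print(exclusive_class(30))     # None
```

A value equal to an upper limit always falls in the next class, which is why every continuous value is assigned to exactly one class.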
Once the categories and codes have been decided upon, the researcher needs to arrange them
according to some logical pattern. This is referred to as tabulation of data. This involves an
orderly arrangement of data into an array that is suitable for a statistical analysis. Usually,
this is an orderly arrangement of the rows and columns. In case there is data to be entered for
one variable, the process is a simple tabulation and, when it is two or more variables, then
one carries out a cross-tabulation of data.
Summary
Data processing refers to preparing, for analysis, the primary data that has been collected
specifically for the study.
The researcher has to check for omissions or errors. This is the editing
stage of the data processing step. This is done first at the field and then at the central
office level.
At this stage, the research team conducts some data treatment such as allocating the
missing values, if possible, backtracking and sometimes, plugging the incomplete data.
Once this is completed, the researcher prepares the code book.
Classification into attributes or class intervals is carried out and the entered data is
now ready for analysis in a tabular form.
Introduction
Descriptive vs. Inferential Analysis
Descriptive Analysis
Inferential Analysis
Descriptive Analysis of Univariate Data
Analysis of Nominal Scale Data with only One Possible Response
Analysis of Ordinal Scaled Question
Descriptive Analysis of Bivariate Data
Summary
Keywords
Introduction
In the previous unit, we studied the processing of data collected from both primary and
secondary sources. The next step is to analyse the same so as to draw logical inferences
from them. The data collected in a survey could be voluminous in nature, depending
upon the size of the sample. In a typical research study there may be a large number of
variables that the researcher needs to analyse.
In this unit, we will concentrate on the descriptive analysis of univariate and bivariate data.
At the data analysis stage, the first step is to describe the sample which is followed by
inferential analysis. In the descriptive analysis, we describe the sample whereas the
inferential analysis deals with generalizing the results as obtained from the sample.
Descriptive Analysis
Descriptive analysis refers to transformation of raw data into a form that will facilitate
easy understanding and interpretation. Descriptive analysis deals with summary measures
relating to the sample data. The common ways of summarizing data are by calculating
average, range, standard deviation, frequency and percentage distribution. Below is a set of
typical questions that are required to be answered under descriptive statistics:
Inferential Analysis
After descriptive analysis has been carried out, the tools of inferential statistics are
applied. Under inferential statistics, inferences are drawn on population parameters based on
sample results. The researcher tries to generalize the results for the population based on
sample results. The following is an illustrative list of questions that are covered under
inferential statistics.
Is the average age of the population significantly different from 35?
Is the job satisfaction of unskilled workers significantly related with their pay
packet?
Do the users and non-users of a brand vary significantly with respect to age?
The first step under univariate analysis is the preparation of frequency distributions of
each variable. The frequency distribution is the counting of responses or observations for
each of the categories or codes assigned to a variable.
1. Mean
The mean represents the arithmetic average of a variable and is appropriate for
interval and ratio scale data. The mean is computed as:
X̄ = (X1 + X2 + ... + Xn)/n = ΣXi/n
Where,
Xi = Value of the ith observation
n = Number of observations
It is also possible to compute the value of mean when interval or ratio scale data are grouped
into categories or classes. The formula for mean in such a case is given by:
X̄ = Σ(fi × Xi)/Σfi, the sums being taken over the k classes
Where,
fi = Frequency of ith class
Xi = Midpoint of ith class
k = Number of classes
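Both versions of the mean can be sketched in Python. The function names and the figures are hypothetical and serve only to show the arithmetic.

```python
def simple_mean(values):
    """Arithmetic mean of ungrouped data: sum of the values divided by their number."""
    return sum(values) / len(values)

def grouped_mean(frequencies, midpoints):
    """Mean of grouped data: sum of f_i * X_i divided by the sum of f_i."""
    weighted_total = sum(f * x for f, x in zip(frequencies, midpoints))
    return weighted_total / sum(frequencies)

# Hypothetical ungrouped observations.
print(simple_mean([2, 4, 6, 8]))  # 5.0

# Hypothetical grouped data: three classes with midpoints 5, 15 and 25.
print(grouped_mean(frequencies=[2, 5, 3], midpoints=[5, 15, 25]))  # 16.0
```

The grouped version treats every item in a class as if it sat at the class midpoint, so it is an approximation to the mean of the raw data.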
2. Median
The median can be computed for ratio, interval or ordinal scale data. The median is
that value in the distribution such that 50 per cent of the observations are below it and 50 per
cent are above it. The median for the ungrouped data is defined as the middle value when the
data is arranged in ascending or descending order of magnitude.
In case the number of items in the sample is odd, say n, the value of the ((n + 1)/2)th item
gives the median. However, if there is an even number of items in the sample, say 2n, the
arithmetic mean of the nth and (n + 1)th items gives the median.
It is again emphasized that data needs to be arranged in ascending or descending orders of
magnitude before computing the median.
Example 9.2: The marks of 21 students in economics are 62, 38, 42, 43, 57, 72, 68, 60, 72,
70, 65, 47, 49, 39, 66, 73, 81, 55, 57, 57 and 59. Compute the median of the distribution.
Solution:
By arranging the data in ascending order of magnitude, we obtain: 38, 39, 42, 43, 47, 49, 55,
57, 57, 57, 59, 60, 62, 65, 66, 68, 70, 72, 72, 73 and 81.
The median will be the value of the 11th observation arranged as above. Therefore,
the value of median equals 59. This means 50 per cent of students score marks below 59 and
50 per cent score above 59.
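The result of Example 9.2 can be checked with Python's standard library. The marks are those given in the example; `statistics.median` sorts the data internally, so the ordering step does not have to be done by hand.

```python
from statistics import median

marks = [62, 38, 42, 43, 57, 72, 68, 60, 72, 70, 65,
         47, 49, 39, 66, 73, 81, 55, 57, 57, 59]

ordered = sorted(marks)
middle_position = (len(ordered) + 1) // 2  # the 11th item when n = 21

print(ordered[middle_position - 1])  # 59
print(median(marks))                 # 59
```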
The median could also be computed for the grouped data. In that case, first of all, the median
class is located and then the median is computed by interpolation, using the assumption that
all items are evenly spread over the entire class interval. The median for the grouped data is
computed using the following formula:
Median = l + ((N/2 – CF)/f) × h
Where,
l = Lower limit of the median class
f = Frequency of the median class
CF = Cumulative frequency of the class immediately preceding the median class
h = Size of the interval of the median class
N = Total number of observations
Given below is an example to illustrate the computation of median in the case of grouped
data:
Substituting these values in the formula for median, we get Median = 30.83
The results show that half of the companies have declared less than 30.83 per cent dividend
and the other half have declared more than 30.83 per cent dividend.
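The interpolation for grouped data can be sketched as below. The frequency table of the dividend example is not reproduced here, so the classes used in this sketch are hypothetical; only the method follows the text (locate the median class, then interpolate within it).

```python
def grouped_median(classes):
    """classes is a list of (lower_limit, upper_limit, frequency) in ascending order."""
    total = sum(freq for _, _, freq in classes)
    if total == 0:
        raise ValueError("no observations")
    half = total / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= half:
            # Median = l + ((N/2 - CF) / f) * h
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq

# Hypothetical frequency distribution with 40 observations.
table = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 10), (40, 50, 5)]
print(round(grouped_median(table), 2))  # 25.83
```

Here the median class is 20–30, because the cumulative frequency first reaches half of the observations (20) inside it.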
The limitation of median as a measure of central tendency is that it does not use each and
every observation in its computation since it is a positional average.
3. Mode
The mode is that measure of central tendency which is appropriate for nominal or
higher order scales. It is the point of maximum frequency in a distribution around which
other items of the set cluster densely. Mode should not be computed for ordinal or interval
data unless these data have been grouped first. The concept is widely used in business, e.g., a
shoe store owner would be naturally interested in knowing the size of the shoe that the
majority of the customers ask for. Similarly, a garment manufacturer is interested in
determining the size of the shirt that fits most people so as to plan its production accordingly.
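The shoe store illustration can be sketched with the standard library. The sizes below are hypothetical; `statistics.multimode` returns every value that shares the highest frequency, which is useful when a distribution has more than one mode.

```python
from statistics import multimode

# Hypothetical shoe sizes asked for by customers during one day.
shoe_sizes = [7, 8, 8, 9, 8, 10, 7, 9, 8]
print(multimode(shoe_sizes))    # [8]

# When two sizes are equally popular, both are reported.
print(multimode([7, 7, 8, 8]))  # [7, 8]
```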
(i) Range: This is the simplest measure of dispersion and is defined as the distance
between the highest (maximum) value and the lowest (minimum) value in an ordered set of
values. The range could be computed for interval scale and ratio scale data.
Range = Xmax – Xmin
Where,
Xmax = Maximum value of the variable
Xmin = Minimum value of the variable
The limitation of range as a measure of dispersion is that it considers only the extreme values
and ignores all other data points. The value of range could vary considerably from sample to
sample. Even with this limitation, range as a measure of dispersion is widely used in
industrial quality control for the preparation of control charts.
Example 9.6: The following are the prices of shares of a company from Monday to Friday:
Calculate the range of the distribution.
Solution:
L = Largest value = 210
S = Smallest value = 100
Therefore, range = L – S = 210 – 100 = 110.
In the case of a frequency distribution, range is calculated by taking the difference between
the lower limit of the lowest class and upper limit of the highest class. The limitation of range
is that it is not based on each and every observation of the distribution and, therefore, does
not take into account the form of distribution within the range.
(ii) Variance and standard deviation: Variance is defined as the mean squared
deviation of a variable from its arithmetic mean. The positive square root of the variance is
called standard deviation. The population standard deviation is denoted by σ and computed
using the following formula:
σ = √[Σ(X – μ)²/N]
Where,
σ = Population standard deviation
X = Value of observations
μ = Population mean of observations
N = Total number of observations in the population
However, in survey research, we generally take a sample from the population. If the standard
deviation is computed from the sample data, the following formula may be used.
s = √[Σ(X – X̄)²/(n – 1)]
Where,
s = Sample standard deviation
X̄ = Sample mean of observations
n = Total number of observations in the sample
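The difference between the population and the sample versions can be seen in a short sketch. The data are hypothetical. In Python's standard library `pstdev` divides the sum of squared deviations by N, while `stdev` divides by n – 1; treating n – 1 as the sample divisor is an assumption made here.

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical observations with a mean of 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(pvariance(values))           # 4   (sum of squared deviations 32, divided by N = 8)
print(pstdev(values))              # 2.0
print(round(variance(values), 3))  # 4.571 (32 divided by n - 1 = 7)
print(round(stdev(values), 3))     # 2.138
```

The sample figure is always a little larger than the population figure for the same data, and the gap narrows as the sample size grows.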
(iii) Coefficient of variation: This measure is computed for ratio scale measurement. The
standard deviation measures the variability of a variable around the mean. The unit of
measurement of standard deviation is the same as that of arithmetic mean of the variable
itself. The measure of dispersion is considerably affected by the unit of measurement. In such
a case, it is not possible to compare the variability of two distributions using standard
deviation as a measure of variability. To compare the variability of two or more distributions,
a measure of relative dispersion called the coefficient of variation can be used. This measure
is independent of units of measurement. The formula of coefficient of variation is:
CV = (Standard deviation/Arithmetic mean) × 100
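The claim that the coefficient of variation does not depend on the unit of measurement can be checked with a small sketch. The heights are hypothetical, and the sample standard deviation is used as an assumption.

```python
from math import isclose
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the arithmetic mean."""
    return stdev(values) / mean(values) * 100

# The same hypothetical heights, recorded in metres and in centimetres.
heights_m = [1.50, 1.60, 1.70, 1.80]
heights_cm = [150, 160, 170, 180]

cv_m = coefficient_of_variation(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
print(round(stdev(heights_m), 3), round(stdev(heights_cm), 3))  # 0.129 12.91
print(round(cv_m, 2), round(cv_cm, 2))                          # 7.82 7.82
print(isclose(cv_m, cv_cm))                                     # True
```

The standard deviations differ by a factor of 100 because of the change of unit, while the coefficient of variation stays the same.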
9.4 Descriptive Analysis of Bivariate Data
As already mentioned, bivariate analysis examines the relationship between two variables.
There are various methods used for carrying out bivariate analysis. We will discuss two
methods, namely, cross-tabulation and correlation coefficient in this course. The discussion of
correlation coefficient is taken up in Unit 13.
Cross-tabulation
In simple tabulation, the frequency and the percentage for each question is calculated. In
cross-tabulation, responses to two questions are combined and data is tabulated together. A
cross-tabulation counts the number of observations in each cross-category of two variables.
The descriptive result of a cross-tabulation is a frequency count for each cell in the analysis.
For example, in cross-tabulating a two-category measure of income (low- and high-income
households) with a two-category measure of purchase intention of a product (low and high
purchase intentions) the basic result is a cross-classification as shown in Table 9.5.
The results of cross-tabulation show the number of sample respondents with low income
having low purchase intention, low income with high purchase intention, high income with
low purchase intention and high income with high purchase intention.
As is the case with simple tabulation, the results of a cross-tabulation are more meaningful if
cell frequencies are computed as percentages. The percentages can be computed in three
ways. For a table such as Table 9.5, the percentages can be computed (1) row-wise so that the
percentages in each row add up to 100 per cent; (2) column-wise so that the percentages in
each column add up to 100 per cent or (3) as cell percentages, such that the percentages added
across all cells equal 100 per cent. The interpretation of percentages is different in each of the
three cases. Therefore, the question arises as to which of these percentages is most useful to
the researcher. What is the general rule for computing percentages?
The basis for calculating category percentages depends upon the nature of the relationship
between the variables. One of the variables could be viewed as the dependent variable and the
other one as the independent variable. In the cross-tabulation presented in Table 9.5, the
purchase intention could be treated as the dependent variable, which depends upon income
(the independent variable). The rule is to cast percentages in the direction of the independent
(causal) variable across the dependent variable.
For Table 9.5, there are 200 respondents with low income, out of which 120 have low
purchase intention for the product. In terms of percentages, 60 per cent of the respondents
with low income have low purchase intention for the product. Similarly, there are 250
respondents with high income, out of which 60 have low purchase intention and 190 have high
purchase intention for the product. By calculating percentages column-wise, it is seen that
24 per cent of the high-income respondents have low purchase intention whereas 76 per cent
have high purchase intention for the product. The results indicate that with increase in
income, the purchase intention for the product increases. Table 9.6 presents the percentages
column-wise as given below:
From the above example, it is clear that any two variables with each having certain categories
can be cross-tabulated. The interpretation of the cross-tabulation results may show a high
association between the two variables. That does not mean that one of them, the independent
variable, is the cause of the other, the dependent variable. Causality between the
two variables is more of an assumption made by the researcher based on his experience or
expectations. Just because there is high association between two variables, it does not imply a
cause-and-effect relationship.
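The percentages quoted for Table 9.5 can be reproduced with a short sketch. The counts 200, 120, 250, 60 and 190 come from the text; the 80 low-income respondents with high purchase intention are derived as 200 – 120. The dictionary layout and the function name are assumptions made for the illustration.

```python
# Counts by income group (independent variable) and purchase intention (dependent).
counts = {
    "low income": {"low intention": 120, "high intention": 80},   # 80 = 200 - 120
    "high income": {"low intention": 60, "high intention": 190},
}

def percentages_within_groups(table):
    """Cast percentages within each category of the independent variable."""
    result = {}
    for group, cells in table.items():
        group_total = sum(cells.values())
        result[group] = {label: 100 * n / group_total for label, n in cells.items()}
    return result

for group, cells in percentages_within_groups(counts).items():
    print(group, cells)
# low income {'low intention': 60.0, 'high intention': 40.0}
# high income {'low intention': 24.0, 'high intention': 76.0}
```

The percentages add up to 100 within each income group, which is what makes the two groups directly comparable. As the text notes, the comparison shows association only, not causation.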
As mentioned earlier, correlation coefficient would be discussed in Unit 13.
9.5 Summary
Let us recapitulate the main points discussed in this unit:
Data analysis could be univariate, bivariate and multivariate. Further, it could be descriptive
or inferential.
The type of analysis depends upon the level of measurement, i.e., nominal,
ordinal, interval and ratio.
The bivariate analysis of data is illustrated through cross-table and correlation coefficient.
Introduction
Concepts in Testing of Hypothesis
Steps in Testing of Hypothesis Exercise
Tests Concerning Means— Case of Single Population
Tests for Difference between Two Population Means
Tests Concerning Population Proportion— Case of Single Population
Tests for Difference between Two Population Proportions
Summary
Keywords
Introduction
In the previous unit, we studied the descriptive analysis of univariate and bivariate
data. In this unit, we will study the testing of hypothesis. A hypothesis is an assumption or a
statement that may or may not be true. The hypothesis is tested on the basis of information
obtained from a sample. Instead of asking, for example, what the mean assessed value of an
apartment in a multi-storeyed building is, one may be interested in knowing whether or not
the assessed value equals some particular value, say ₹80 lakh. Some other examples would be
whether a new drug is more effective than the existing drug based on the sample data, and
whether the proportion of smokers in a class is different from 0.30. The formulation of
hypothesis has already been discussed in Unit 2.
We will now study the concepts and steps in the testing of hypothesis exercise.
One-tailed and two-tailed tests: A test is called one-sided (or one-tailed) only if the null
hypothesis gets rejected when a value of the test statistic falls in one specified tail of the
distribution. Further, the test is called two-sided (or two-tailed) if null hypothesis gets
rejected when a value of the test statistic falls in either one or the other of the two tails of its
sampling distribution.
For example, consider a soft drink bottling plant which dispenses soft drinks in bottles
of 300 ml capacity. The bottling is done through an automatic plant. An overfilling of bottle
(liquid content more than 300 ml) means a huge loss to the company given the large volume
of sales. An under-filling means the customers are getting less than 300 ml of the drink when
they are paying for 300 ml. This could create a bad reputation of the company. The company
wants to avoid both overfilling and under-filling. Therefore, it would prefer to test the
hypothesis whether the mean content of the bottles is different from 300ml. This hypothesis
could be written as:
H0: μ = 300 ml
H1: μ ≠ 300 ml
The hypotheses stated above are called two-tailed or two-sided hypotheses. However, if the
concern is the overfilling of bottles, it could be stated as:
H0: μ = 300 ml
H1: μ > 300 ml
Such hypotheses are called one-tailed or one-sided hypotheses and the researcher would be
interested in the upper tail (right hand tail) of the distribution. If however, the concern is loss
of reputation of the company (under-filling of the bottles), the hypothesis may be stated as:
H0: μ = 300 ml
H1: μ < 300 ml
The hypothesis stated above is also a one-tailed hypothesis and the researcher would be
interested in the lower tail (left hand tail) of the distribution.
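Type I and type II errors: The acceptance or rejection of a hypothesis is based upon sample
results and there is always a possibility of a sample not being representative of the
population. This could result in errors, as a consequence of which inferences drawn could be
wrong. A type I error is committed when a null hypothesis that is true gets rejected, and a
type II error is committed when a null hypothesis that is false gets accepted.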
Setting up of a hypothesis: The first step is to establish the hypothesis to be tested. As you
know, statistical hypotheses are generally assumptions about the value of a population
parameter; a hypothesis specifies a single value or a range of values for the parameter. Two
different hypotheses are set up rather than a single one. These two hypotheses are generally
referred to as (1) the null hypothesis, denoted by H0, and (2) the alternative hypothesis,
denoted by H1.
The null hypothesis is the hypothesis of the population parameter taking a specified value. In
case of two populations, the null hypothesis is of no difference or the difference taking a
specified value. The hypothesis that is different from the null hypothesis is the alternative
hypothesis. If the null hypothesis H0 is rejected based upon the sample information, the
alternative hypothesis H1 is accepted. Therefore, the two hypotheses are constructed in such a
way that if one is true, the other one is false and vice versa.
Setting up a suitable significance level: The next step in the testing of hypothesis exercise is
to choose a suitable level of significance. The level of significance denoted by α is chosen
before drawing any sample. The level of significance denotes the probability of rejecting the
null hypothesis when it is true. The value of α varies from problem to problem, but usually it
is taken as either 5 per cent or 1 per cent. A 5 per cent level of significance means that there
are 5 chances out of hundred that a null hypothesis will get rejected when it should be
accepted. When the null hypothesis is rejected at any level of significance, the test result is
said to be significant. Further, if a hypothesis is rejected at 1 per cent level, it must also be
rejected at 5 per cent significance level.
Determination of a test statistic: The next step is to determine a suitable test statistic and its
distribution. As would be seen later, the test statistic could be t, Z, χ² or F, depending upon
various assumptions to be discussed later in the book.
Determination of critical region: Before a sample is drawn from the population, it is very
important to specify the values of a test statistic that will lead to rejection or acceptance of the
null hypothesis. The one that leads to the rejection of null hypothesis is called the critical
region. Given a level of significance, α, the optimal critical region for a two-tailed test
consists of that α/2 per cent area in the right hand tail of the distribution plus that α/2 per cent
in the left hand tail of the distribution where null hypothesis is rejected.
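For a test statistic that follows the standard normal distribution, the boundaries of the critical region can be sketched with Python's standard library instead of a printed table. The function name is hypothetical; `NormalDist().inv_cdf` returns the Z value below which a given proportion of the area lies.

```python
from statistics import NormalDist

def z_critical(alpha, tails=2):
    """Z value that cuts off alpha (one-tailed) or alpha/2 (two-tailed) in the upper tail."""
    tail_area = alpha / 2 if tails == 2 else alpha
    return NormalDist().inv_cdf(1 - tail_area)

# Two-tailed test at the 5 per cent level: reject if Z < -1.96 or Z > 1.96.
print(round(z_critical(0.05, tails=2), 3))  # 1.96

# One-tailed test at the 5 per cent level: the critical value is 1.645
# (used with a minus sign when the interest is in the left hand tail).
print(round(z_critical(0.05, tails=1), 3))  # 1.645
```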
Computing the value of test statistic: The next step is to compute the value of the test statistic
based upon a random sample of size n. Once the value of test statistic is computed, one needs
to examine whether the sample results fall in the critical region or in the acceptance region.
Making a decision: The hypothesis may be rejected or accepted depending upon whether the
value of the test statistic falls in the rejection or the acceptance region. Management decisions
are based upon the statistical decision of either rejecting or accepting the null hypothesis.
In case a hypothesis is rejected, the difference between the sample statistic and the
hypothesized population parameter is considered to be significant. On the other hand, if the
hypothesis is accepted, the difference between the sample statistic and the hypothesized
population parameter is not regarded as significant and can be attributed to chance.
Test Concerning Means – Case of Single Population
In this section, a number of illustrations will be taken up to explain the test of hypothesis
concerning mean. Two cases of large samples and small samples will be taken up.
In case of large sample
As mentioned earlier, if the sample size n is large, or if it is small but the value of the
population standard deviation is known, a Z test is appropriate. There can be alternate cases
of two-tailed and one-tailed tests of hypotheses. Corresponding to the null hypothesis H0: μ =
μ0, the following criteria could be used as shown in Table 10.2. The test statistic is:
Z = (X̄ – μH₀)/(σ/√n)
Where,
X̄ = Sample mean
σ = Population standard deviation
μH₀ = The value of μ under the assumption that the null hypothesis is true.
n = Size of sample.
Table 10.2 Criteria for accepting or rejecting null hypothesis under different cases of
alternative hypotheses
If the population standard deviation σ is unknown, the sample standard deviation is used as
an estimate of σ. It may be noted that Zα and Zα/2 are Z values such that the area to the right
under the standard normal distribution is α and α /2 respectively. Below are solved examples
using the above concepts:
Example 10.1:
A sample of 200 bulbs made by a company gives a lifetime mean of 1540 hours with a
standard deviation of 42 hours. Is it likely that the sample has been drawn from a population
with a mean lifetime of 1500 hours? You may use 5 per cent level of significance.
Solution:
In the above example, the sample size is large (n = 200), the sample mean (X̄) equals 1540
hours and the sample standard deviation (s) is equal to 42 hours. The null and alternative
hypotheses can be written as:
H0: μ = 1500 hours
H1: μ ≠ 1500 hours
It is a two-tailed test with the level of significance (α) equal to 0.05. Since n is large (n >
30), though the population standard deviation σ is unknown, one can use the Z test. The test
statistic is given by:
Z = (X̄ – μH₀)/(s/√n) = (1540 – 1500)/(42/√200) = 13.47
Where,
μH₀ = Value of μ under the assumption that the null hypothesis is true
s/√n = Estimated standard error of mean
The value of α = 0.05 and, since it is a two-tailed test, the critical values of Z are given by
–Zα/2 and Zα/2, that is, –1.96 and 1.96, which could be obtained from the standard normal
table given in Appendix 1 at the end of the book.
Rejection regions for Example 10.1
Since the computed value of Z = 13.47 lies in the rejection region, the null hypothesis is
rejected. Therefore, it can be concluded that the average life of the bulb is significantly
different from 1500 hours.
Example 10.2: On a typing test, a random sample of 36 graduates of a secretarial school
averaged 73.6 words with a standard deviation of 8.10 words per minute. Test an employer's
claim that the school's graduates average less than 75.0 words per minute using the 5 per cent
level of significance.
Solution:
H0 : µ = 75
H1 : µ < 75
X̄ = 73.6, s = 8.10, n = 36 and α = 0.05. As the sample size is large (n > 30), though the
population standard deviation σ is unknown, the Z test is appropriate. The test statistic is
given by:
Z = (X̄ – μH₀)/(s/√n) = (73.6 – 75)/(8.10/√36) = –1.04
Since it is a one-tailed test and the interest is in the left hand tail of the distribution, the
critical value of Z is given by –Zα = –1.645. Now, the computed value of Z lies in the
acceptance region, and the null hypothesis is accepted.
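The arithmetic of Examples 10.1 and 10.2 can be checked with a short sketch. The figures are those given in the two examples; the function names are hypothetical, and the sample standard deviation is used in place of the unknown population value, as the text allows for large samples.

```python
from math import sqrt

def z_statistic(sample_mean, mu_0, s, n):
    """Z = (sample mean - hypothesized mean) / (s / sqrt(n))."""
    return (sample_mean - mu_0) / (s / sqrt(n))

def reject_null(z, critical, tail):
    """Decide on H0 for a 'two', 'left' or 'right' tailed test."""
    if tail == "two":
        return abs(z) > critical
    if tail == "left":
        return z < -critical
    if tail == "right":
        return z > critical
    raise ValueError("tail must be 'two', 'left' or 'right'")

# Example 10.1: bulbs, two-tailed test at the 5 per cent level.
z_bulbs = z_statistic(1540, 1500, 42, 200)
print(round(z_bulbs, 2), reject_null(z_bulbs, 1.96, "two"))      # 13.47 True

# Example 10.2: typing speed, left-tailed test at the 5 per cent level.
z_typing = z_statistic(73.6, 75.0, 8.10, 36)
print(round(z_typing, 2), reject_null(z_typing, 1.645, "left"))  # -1.04 False
```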
In case of small sample
The procedure for testing the hypothesis of a mean is similar to what is explained in the case
of a large sample. The test statistic used in this case is:
t = (X̄ – μH₀)/(s/√n), with n – 1 degrees of freedom
A few examples pertaining to the 't' test are worked out for testing the hypothesis of mean in
case of a small sample.
Example 10.3:
Prices of shares (in ₹) of a company on different days in a month were found to be 66, 65,
69, 70, 69, 71, 70, 63, 64 and 68. Examine whether the mean price of shares in the month is
different from 65. You may use 10 per cent level of significance.
Solution :
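Since the sample size is n = 10, which is small, and the population standard deviation is
unknown, the appropriate test in this case would be t. The hypotheses are:
H0 : µ = 65
H1 : µ ≠ 65
First of all, we need to estimate the value of sample mean (X̄) and the standard deviation (s).
It is known that the sample mean and the standard deviation are given by the following formulae:
X̄ = ΣX / n = 675/10 = 67.5
s = √[Σ(X – X̄)² / (n – 1)] = √(70.5/9) = 2.80
The test statistic is given by:
t = (X̄ – µH0) / (s/√n) = (67.5 – 65) / (2.80/√10) = 2.82
The critical values of t with 9 degrees of freedom for a two-tailed test are given by –1.833
and 1.833. Since the computed value of t lies in the rejection region, the null hypothesis is
rejected. Therefore, the average price of the share of the company is different from 65.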
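As a rough check of Example 10.3, the sample mean, the standard deviation and the t statistic can be computed directly from the ten prices. The sketch below uses only Python's standard library; the variable names are illustrative, and the critical value 1.833 is the tabulated value quoted in the text.

```python
import math
import statistics

prices = [66, 65, 69, 70, 69, 71, 70, 63, 64, 68]  # data from Example 10.3
mu0 = 65                                            # mean under the null hypothesis

n = len(prices)
sample_mean = statistics.mean(prices)
sample_sd = statistics.stdev(prices)  # uses the n - 1 denominator

t_stat = (sample_mean - mu0) / (sample_sd / math.sqrt(n))

t_critical = 1.833  # tabulated t, 9 degrees of freedom, two-tailed, 10 per cent level
reject_h0 = abs(t_stat) > t_critical

print(sample_mean, round(sample_sd, 2), round(t_stat, 2), reject_h0)
```

The statistic comes out at roughly 2.82, beyond the critical value of 1.833, which agrees with the decision to reject the null hypothesis.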
Tests for Difference between Two Population Means
So far, we have been concerned with the testing of means of a single population. We took up
the cases of both large and small samples. It would be interesting to examine the difference
between the two population means. Again, various cases would be examined as discussed
below:
In case of large sample
In case both the sample sizes are greater than 30, a Z test is used. The hypothesis to be tested
may be written as:
H0 : µ1 = µ2
H1 : µ1 ≠ µ2
Where,
µ1 = mean of population 1
µ2 = mean of population 2
The above is a case of two-tailed test. The test statistic used is:
Z = (X̄1 – X̄2) / √(σ1²/n1 + σ2²/n2)
X̄1 = mean of sample drawn from population 1
X̄2 = mean of sample drawn from population 2
n1 = size of sample drawn from population 1
n2 = size of sample drawn from population 2
If σ1 and σ2 are unknown, their estimates given by ŝ1 and ŝ2 are used.
The Z value for the problem can be computed using the above formula and compared with
the table value to either accept or reject the hypothesis. Let us consider the following
problem:
Example 10.4: A study is carried out to examine whether the mean hourly wages of the
unskilled workers in the two cities—Ambala Cantt. and Lucknow are the same. A random
sample of hourly earnings in both the cities is taken and the results are presented in the Table
10.4.
Table 10.4 Survey Data on Hourly Earnings in two Cities
As the problem is of a two-tailed test, the critical values of Z at 5 per cent level of
significance are given by –Zα/2 = –1.96 and Zα/2 = 1.96. The sample value of Z = –2.83
lies in the rejection region, so the null hypothesis of equal mean hourly wages is rejected.
In case of small sample
If the sizes of both samples are less than 30 and the population standard deviations are
unknown, the Z test described above for the equality of two population means is not
applicable. A t test is used instead, under the assumption that the two population variances
are equal.
If the two population variances are equal, their respective unbiased estimates are taken to
be equal as well (ŝ1² = ŝ2² = ŝ²). In such a case, the test statistic becomes:
t = (X̄1 – X̄2) / [ŝ √(1/n1 + 1/n2)], with n1 + n2 – 2 degrees of freedom
To get an estimate of ŝ², a weighted average of s1² and s2² is used, where the weights
are the number of degrees of freedom of each sample. The weighted average is called a
'pooled estimate'. This pooled estimate is given by the expression:
ŝ² = [(n1 – 1) s1² + (n2 – 1) s2²] / (n1 + n2 – 2)
Once the value of t statistic is computed from the sample data, it is compared with the
tabulated value at the level of significance α to arrive at a decision regarding the acceptance
or rejection of the hypothesis. Let us work out a problem illustrating the concepts defined
above.
Example 10.5: Two drugs meant to provide relief to arthritis patients were produced in two
different laboratories. The first drug was administered to a group of 12 patients and produced
an average of 8.5 hours of relief with a standard deviation of 1.8 hours. The second drug was
tested on a sample of 8 patients and produced an average of 7.9 hours of relief with a
standard deviation of 2.1 hours. Test the hypothesis that the first drug provides a significantly
higher period of relief. You may use 5 per cent level of significance.
The critical value of t with 18 degrees of freedom at 5 per cent level of significance is given
by 1.734. The sample value of t = 0.685 lies in the acceptance region.
Therefore, the null hypothesis is accepted as there is not enough evidence to reject it. So, one
may conclude that the first drug is not significantly more effective than the second drug.
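The pooled-variance calculation behind Example 10.5 can be sketched as follows. The helper `pooled_t` is a hypothetical name, and the code assumes the equal-variance t test described above; the value it produces (about 0.684) differs from the 0.685 quoted in the text only through rounding.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic assuming equal population variances."""
    df = n1 + n2 - 2
    # Weighted average of the two sample variances, weights = degrees of freedom
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Figures from Example 10.5 (drug 1 versus drug 2)
t_stat, df = pooled_t(8.5, 1.8, 12, 7.9, 2.1, 8)

t_critical = 1.734  # tabulated t, 18 degrees of freedom, one-tailed, 5 per cent level
reject_h0 = t_stat > t_critical

print(round(t_stat, 3), df, reject_h0)
```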
Activity 2:
From an IT company, take a random sample of ten male and female software engineers with
two years of work experience. Test the hypothesis that there is no significant difference in
their average salaries at 5 per cent level of significance.
Hint: Refer to Section 10.2.1. While testing the hypothesis, you can follow the steps below:
a. State the hypotheses
b. Formulate an analysis plan
c. Analyse sample data
d. Interpret results
Tests Concerning Population Proportion—Case of Single Population
We have already discussed the tests concerning population means. In the tests about
proportion, one is interested in examining whether the respondents possess a particular
attribute or not.
The random variable in such a case is a binary one in the sense that it takes only two values:
yes or no. For example, a student either is a smoker or is not, a consumer either uses a
particular brand of product or does not, and a skilled worker is either satisfied or not
satisfied with the present job. At this stage it may be recalled that the binomial distribution
is the theoretically correct distribution to use while dealing with proportions. Further, as
the sample size increases, the binomial distribution approaches the normal distribution.
To be specific, whenever both np and nq (where n = number of trials, p = probability of
success and q = probability of failure) are at least 5, one can use the normal distribution as a
substitute for the binomial distribution.
Example 10.6: An officer of the health department claims that 60 per cent of the male
population of a village smokes. A random sample of 50 males showed that 35 of them were
smokers. Are these sample results consistent with the claim of the health officer? Use a level
of significance of 0.05.
It is a one-tailed test. For a given level of significance α = 0.05, the critical value of Z is
given by Zα = Z0.05 = 1.645. It is seen that the sample value of Z = 1.44 lies in the
acceptance region.
Therefore, there is not enough evidence to reject the null hypothesis. So it can be concluded
that the proportion of male smokers is not statistically different from 0.60.
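The sample value Z = 1.44 in Example 10.6 can be reproduced with the usual normal approximation for a single proportion. The sketch below assumes the standard form Z = (p̂ – p0) / √(p0 q0 / n), which is not printed in the text above, and the function name is illustrative.

```python
import math


def one_proportion_z(successes, n, p0):
    """Z statistic for a single proportion, using the normal approximation."""
    q0 = 1 - p0
    # The approximation is reasonable only when both n*p0 and n*q0 are at least 5
    if n * p0 < 5 or n * q0 < 5:
        raise ValueError("normal approximation not appropriate for this sample")
    p_hat = successes / n
    standard_error = math.sqrt(p0 * q0 / n)
    return (p_hat - p0) / standard_error


# Example 10.6: 35 smokers in a sample of 50, claimed proportion 0.60
z = one_proportion_z(successes=35, n=50, p0=0.60)

z_critical = 1.645  # value quoted in the text for alpha = 0.05
reject_h0 = z > z_critical

print(round(z, 2), reject_h0)
```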
Here, we need to test whether the two population proportions are equal or not. The
hypothesis under investigation is:
H0 : p1 = p2
H1 : p1 ≠ p2
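The alternative hypothesis assumed is two-sided. It could as well have been one-sided. The
test statistic is given by:
Z = (p̂1 – p̂2) / √[p̂ q̂ (1/n1 + 1/n2)]
Where p̂1 and p̂2 are the sample proportions, p̂ = (n1 p̂1 + n2 p̂2) / (n1 + n2) is the pooled
estimate of the common proportion under the null hypothesis, and q̂ = 1 – p̂.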
Now, for a given level of significance α, the sample Z value is compared with the critical Z
value to accept or reject the null hypothesis. We consider below a few examples to illustrate
the testing procedure described above.
The critical value of Z at 5 per cent level of significance is 1.645. The sample value Z = 1.13
lies in the acceptance region.
Summary
Let us recapitulate the main points discussed in this unit:
UNIT 11 :
STRUCTURE
Introduction
A Chi-Square Test for the Goodness of Fit
A Chi-Square Test for the Independence of Variables
A Chi-Square Test for the Equality of More than Two Population Proportions
Summary
Keywords
Introduction
In the last unit, we discussed the Z test for the equality of two population proportions.
Now, in case we have more than two populations and want to test the equality of all of them
simultaneously, it is not possible to do it using the Z test. This is because the Z test can
examine the equality of only two proportions at a time. In such a situation, the chi-square test
can come to the rescue and carry out the test in one go.
The chi-square test is widely used in research. For the use of the chi-square test,
data is required in the form of frequencies. Data expressed in percentages or proportions can
also be used, provided it can be converted into frequencies. The majority of the applications
of chi-square (χ²) are with discrete data. The test can also be applied to continuous data,
provided it is reduced to certain categories and tabulated in such a way that the chi-square
may be applied.
Compute the expected frequencies of the occurrence of certain events under the
assumption that the null hypothesis is true.
Make a note of the observed counts of the data points falling in different cells.
Compute the chi-square value given by the formula:
χ² = Σ (Oi – Ei)² / Ei, where the sum runs over i = 1, 2, ..., k
Where,
Oi = Observed frequency of ith cell
Ei = Expected frequency of ith cell
k = Total number of cells
k – 1 = degrees of freedom
Compare the sample value of the statistic as obtained in the previous step with the
critical value at a given level of significance and make the decision.
A goodness of fit test is a statistical test of how well the observed data supports the
assumption about the distribution of a population. The test also examines how well an
assumed distribution fits the data. Many times, the researcher assumes that the sample is
drawn from a normal or any other distribution of interest. A test of how normal or any other
distribution fits a given data may be of some interest. Consider, for example, the case of the
multinomial experiment which is the extension of a binomial experiment. In the multinomial
experiment, the number of the categories k is greater than 2. Further, a data point can fall into
one of the k categories and the probability of the data point falling in the ith category is a
constant and is denoted by pi, where i = 1, 2, 3, 4, ..., k. In summary, a multinomial
experiment has the following features:
There is a fixed number of trials.
The trials are statistically independent.
All the possible outcomes of a trial get classified into one of the several categories.
The probabilities for the different categories remain constant for each trial.
Consider as an example that a respondent can fall into any one of four non-overlapping
income categories. Let the probabilities that the respondent will fall into each of the four
groups be denoted by the four parameters p1, p2, p3 and p4. Given these, the
multinomial distribution with these parameters, and the number of people in a random
sample, specifies the probabilities of any combination of the cell counts.
Given such a situation, we may use a multinomial distribution to test how well the data fits
the assumption of k probabilities p1, p2, ..., pk of falling into the k cells. The hypothesis to be
tested is:
H0: Probabilities of the occurrence of events E1, E2, ..., Ek are given by the specified
probabilities p1, p2, ..., pk.
H1: Probabilities of the k events are not the pi stated in the null hypothesis.
Such hypotheses can be tested using the chi-square statistic. This is illustrated in the
examples that follow.
Example 11.1: The manager of ABC ice-cream parlour has to take a decision regarding how
much of each flavour of ice-cream he should stock so that the demands of the customers are
satisfied. The ice-cream suppliers claim that among the four most popular flavours, 62 per
cent customers prefer vanilla, 18 per cent chocolate, 12 per cent strawberry and 8 per cent
mango. A random sample of 200 customers produces the results as given below. At the
α = 0.05 significance level, test the claim that the percentages given by the suppliers are
correct.
Solution:
Let
Pv: proportion of customers preferring vanilla flavour
Pc : proportion of customers preferring chocolate flavour
Ps : proportion of customers preferring strawberry flavour
Pm: proportion of customers preferring mango flavour
Total χ² = 4.323
Table χ² with 3 degrees of freedom (5 per cent) = 7.815
As the sample χ² is less than the table value, it lies in the acceptance region, so H0 is
accepted. Therefore, the customer preference rates are as stated.
It may be worth pointing out that for the application of a chi-square test, the expected
frequency in each cell should be at least 5.0. In case it is found that one or more cells have the
expected frequency less than 5, one could still carry out the chi-square analysis by combining
them into meaningful cells so that the expected number has a total of at least 5. Another point
worth mentioning is that the degree of freedom, usually denoted by df in such cases, is given
by k – 1, where k denotes the number of cells (categories).
It may be noted that in Example 11.1, the hypothesized probabilities were not equal. There
are situations where the hypothesized probabilities in each category are equal or in other
words, the interest is in investigating the uniformity of the distribution. The following
example illustrates this.
Example 11.2: An insurance company provides auto insurance and is analysing the data
obtained from fatal crashes. A sample of the motor vehicle deaths is randomly selected for a
two-year period. The number of fatalities is listed below for the different days of the week. At
the 0.05 significance level, test the claim that accidents occur on different days with equal
frequency.
Day         Observed (O)   Expected (E)   O – E      (O – E)²   (O – E)²/E
Monday      31             25.714          5.286     27.942     1.087
Tuesday     20             25.714         –5.714     32.650     1.270
Wednesday   20             25.714         –5.714     32.650     1.270
Thursday    22             25.714         –3.714     13.794     0.536
Friday      22             25.714         –3.714     13.794     0.536
Saturday    29             25.714          3.286     10.798     0.420
Sunday      36             25.714         10.286    105.802     4.114
Total                                                           9.233
Degrees of freedom = 7 – 1 = 6
Critical (table) value of χ² = 12.592
Since the sample chi-square value is less than the tabulated χ², there is not enough evidence
to reject the null hypothesis that accidents occur on different days with equal frequency.
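The table for Example 11.2 can be regenerated in a few lines. This is a minimal sketch of the goodness-of-fit calculation for a uniform distribution; the dictionary and variable names are illustrative, and the critical value 12.592 is the tabulated figure from the text.

```python
# Observed fatalities by day of the week (Example 11.2)
observed = {
    "Monday": 31, "Tuesday": 20, "Wednesday": 20, "Thursday": 22,
    "Friday": 22, "Saturday": 29, "Sunday": 36,
}

total = sum(observed.values())
k = len(observed)
expected = total / k  # equal frequency on every day under H0

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = k - 1

critical = 12.592  # tabulated chi-square, 6 degrees of freedom, 5 per cent level
reject_h0 = chi_square > critical

print(round(expected, 3), round(chi_square, 3), df, reject_h0)
```

Every expected frequency here is well above 5, so no cells need to be combined before the test is applied.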
A Chi-Square Test for Independence of Variables
The chi-square test can be used to test the independence of two variables each having
at least two categories. The test makes use of contingency tables, also referred to as cross-
tabs with the cells corresponding to a cross classification of attributes or events.
Assuming that there are r rows and c columns, the count in the cell corresponding to the
ith row and the jth column is denoted by Oij, where i = 1, 2, ..., r and j = 1, 2, ..., c. The total
for row i is denoted by Ri whereas that corresponding to column j is denoted by Cj. The
total sample size is given by n, which is also the sum of all the r row totals or the sum of all
the c column totals.
The hypothesis test for independence is
H0: Row and column variables are independent of each other.
H1: Row and column variables are not independent.
The hypothesis is tested using a chi-square test statistic for independence given by:
χ² = Σi Σj (Oij – Eij)² / Eij
The degrees of freedom for the chi-square statistic are given by (r – 1) (c – 1).
For a given level of significance α, the sample value of the chi-square is compared with the
critical value for the degree of freedom (r – 1) (c – 1) to make a decision.
The expected frequency in the cell corresponding to the ith row and the jth column is given
by:
Eij = (Ri × Cj) / n
Example 11.3: A sample of 870 trainees was subjected to different types of training classified
as intensive, good and average and their performance was noted as above average, average
and poor. The resulting data is presented in the table below. Use a 5 per cent level of
significance to examine whether there is any relationship between the type of training and
performance.
                          Training
Performance      Intensive   Good   Average   Total
Above average    100         150    40        290
Average          100         100    100       300
Poor             50          80     150       280
Total            250         330    290       870
Solution:
H0: Attribute performance and the training are independent.
H1: Attribute performance and the training are not independent
The expected frequencies corresponding to the ith row and the jth column in the contingency
table are denoted by Eij , where i = 1, 2, 3 and j = 1, 2, 3.
The critical value of the chi-square at 5 per cent level of significance with 4
degrees of freedom is given by 9.49. The sample value of the chi-square falls in the rejection
region as shown in the figure below.
Therefore, the null hypothesis is rejected and one can conclude that there is an association
between the type of training and performance.
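The expected frequencies and the chi-square value for Example 11.3 can be worked out from the contingency table as sketched below. The code assumes the formula Eij = (Ri × Cj) / n for the expected frequencies; variable names are illustrative, and 9.49 is the critical value quoted in the text.

```python
# Observed counts from Example 11.3
# Rows: above average, average, poor; columns: intensive, good, average training
observed = [
    [100, 150, 40],
    [100, 100, 100],
    [50, 80, 150],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell under independence: Eij = Ri * Cj / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

critical = 9.49  # tabulated chi-square, 4 degrees of freedom, 5 per cent level
reject_h0 = chi_square > critical

print(round(chi_square, 2), df, reject_h0)
```

The statistic is a little over 107, far beyond 9.49, which is why the null hypothesis of independence is rejected.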
A Chi-Square Test for the Equality of More than Two Population Proportions
In certain situations, the researchers may be interested to test whether the proportion
of a particular characteristic is the same in several populations. The interest may lie in finding
out whether the proportion of people liking a movie is the same for the three age groups:
twenty-five and under, over twenty-five and under fifty, and fifty and over. To take another
example, the interest may be in determining whether the proportion of satisfied employees in
four categories (class I, class II, class III and class IV employees) is the same. In a sense, the
question of whether the proportions are equal is a question of whether the populations of the
different categories are homogeneous with respect to the characteristic being studied.
Therefore, the tests for equality of proportions across several populations are also called
tests of homogeneity.
The analysis is carried out exactly in the same way as was done for the other two cases. The
formula for a chi-square analysis remains the same. However, two important assumptions
here are different.
(i) We identify our populations (e.g., age groups or various classes of employees) and
sample directly from these populations.
(ii) As we identify the populations of interest and sample from them directly, the sizes of
the samples from the different populations of interest are fixed. This is also called a
chi-square analysis with fixed marginal totals. The hypothesis to be tested is as under:
H0: The proportion of people satisfying a particular characteristic is the same in all
populations.
H1: The proportion of people satisfying a particular characteristic is not the same in all
populations.
The expected frequency for each cell could also be obtained by using the formula as
explained earlier. There is an alternative way of computing the same, which would give
identical results. This is shown in the following example:
Example 11.5: An accountant wants to test the hypothesis that the proportion of incorrect
transactions in four client accounts is about the same. A random sample of 80 transactions of
one client reveals that 21 are incorrect; for the second client, the number is 25 out of 100; for
the third client, the number is 30 out of 90 sampled and for the fourth, 40 are incorrect out of
a sample of 110. Conduct the test at α = 0.05.
Let
p1 = Proportion of incorrect transaction for 1st client
p2 = Proportion of incorrect transaction for 2nd client
p3 = Proportion of incorrect transaction for 3rd client
p4 = Proportion of incorrect transaction for 4th client
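H0 : p1 = p2 = p3 = p4
H1 : All proportions are not the same
The observed data in the problem can be rewritten as:
Transactions   Client 1   Client 2   Client 3   Client 4   Total
Incorrect      21         25         30         40         116
Correct        59         75         60         70         264
Total          80         100        90         110        380
The expected frequencies in each cell would be the same using the formula Eij = (Ri × Cj) / n as
already explained. Now the value of the chi-square statistic can be calculated as: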
The critical value of the chi-square with 3 degrees of freedom at 5 per cent level of
significance equals 7.815. Since the sample value of χ2 is less than the critical value, there is
not enough evidence to reject the null hypothesis. Therefore, the null hypothesis is accepted
and it can be concluded that there is no significant difference in the proportion of incorrect
transactions for the four clients.
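The two ways of obtaining the expected frequencies in Example 11.5 can be compared in code. The sketch below assumes that the alternative method mentioned above is the pooled-proportion approach (overall proportion of incorrect transactions multiplied by each sample size); that reading, and all variable names, are assumptions made for illustration.

```python
# Data from Example 11.5
incorrect = [21, 25, 30, 40]
sample_sizes = [80, 100, 90, 110]
correct = [n_i - x for n_i, x in zip(sample_sizes, incorrect)]

total_n = sum(sample_sizes)

# Route 1 (assumed alternative): pooled proportion times each sample size
pooled_p = sum(incorrect) / total_n
expected_incorrect = [n_i * pooled_p for n_i in sample_sizes]
expected_correct = [n_i * (1 - pooled_p) for n_i in sample_sizes]

# Route 2: contingency-table formula Eij = Ri * Cj / n
row_totals = [sum(incorrect), sum(correct)]
expected_table = [[r * c / total_n for c in sample_sizes] for r in row_totals]

routes_agree = all(
    abs(a - b) < 1e-9
    for a, b in zip(expected_incorrect + expected_correct,
                    expected_table[0] + expected_table[1])
)

chi_square = sum(
    (o - e) ** 2 / e
    for o, e in zip(incorrect + correct, expected_incorrect + expected_correct)
)
df = len(sample_sizes) - 1

critical = 7.815  # tabulated chi-square, 3 degrees of freedom, 5 per cent level
reject_h0 = chi_square > critical

print(routes_agree, round(chi_square, 2), df, reject_h0)
```

Both routes give the same expected frequencies, and the resulting statistic (a little above 4.2) is below 7.815, in line with the conclusion in the text.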
Summary
Let us recapitulate the main points discussed in this unit:
STRUCTURE
Introduction
Completely Randomized Design in a One-Way ANOVA
Randomized Block Design in Two-Way ANOVA
Factorial Design
Summary
Keywords
INTRODUCTION
In Unit 10, we discussed the test of hypothesis concerning the equality of two
population means using both the Z and t tests. However, if there are more than two
populations, the test for the equality of means could be carried out by considering two
populations at a time. This would be a very cumbersome procedure. One easy way out could
be to use the analysis of variance (ANOVA) technique. The technique helps in performing
this test in one go and, therefore, is considered to be an important technique for analysis for
the researcher. Through this technique it is possible to draw inferences on whether the
samples have been drawn from populations having the same mean.
The technique has found applications in the fields of economics, psychology,
sociology, business and industry. It proves handy in situations where we want to compare the
means of more than two populations. Some examples could be to compare:
R.A. Fisher developed the theory concerning ANOVA. The basic principle underlying the
technique is that the total variation in the dependent variable is broken into two parts—one
which can be attributed to some specific causes and the other that may be attributed to
chance. The one which is attributed to specific causes is called the variation between samples
and the one which is attributed to chance is termed as the variation within samples.
Therefore, in ANOVA, the total variance may be decomposed into various components
corresponding to the sources of the variation.
In ANOVA, the dependent variable in question is metric (interval or ratio scale), whereas the
independent variables are categorical (nominal scale). If there is one independent variable
(one factor) divided into various categories, we have one-way or one-factor analysis of
variance. In the two-way or two-factor analysis of variance, two factors, each divided into
various categories, are involved.
In ANOVA, it is assumed that each of the samples is drawn from a normal population and
each of these populations has an equal variance. Another assumption that is made is that all
the factors except the one being tested are controlled (kept constant). Basically, two estimates
of the population variance are made. One estimate is based upon the variation 'between the
samples' and the other one is based upon the variation 'within the samples'. The two estimates
of variance can be compared for their equality using the F statistic.
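The decomposition described above can be illustrated with a small one-way ANOVA calculation. The data below are hypothetical and chosen only to keep the arithmetic simple; the sketch shows the total sum of squares splitting into a between-samples part and a within-samples part, and the F statistic as the ratio of the two variance estimates.

```python
# Hypothetical observations for three groups (not taken from the text)
groups = {
    "A": [10, 12, 11, 13],
    "B": [14, 15, 13, 16],
    "C": [9, 8, 10, 9],
}

all_values = [v for values in groups.values() for v in values]
n_total = len(all_values)
k = len(groups)
grand_mean = sum(all_values) / n_total

# Variation between samples (attributed to specific causes)
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Variation within samples (attributed to chance)
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in groups.values()
    for v in values
)

ss_total = sum((v - grand_mean) ** 2 for v in all_values)

# Two estimates of the population variance and their ratio
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_stat = ms_between / ms_within

print(round(ss_between, 3), round(ss_within, 3), round(f_stat, 2))
```

The computed F is then compared with the tabulated F value for (k – 1, n – k) degrees of freedom at the chosen level of significance.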
We had earlier partitioned the total sum of squares into two components—one which is due
to the differences between the sample (treatment sum of squares) and the other one due to the
differences within the samples (error sum of squares). Now, the error sum of squares includes
the sum of squares due to laboratories (called blocks) as an extraneous factor.
In two-way analysis of variance, we remove the effect of the extraneous factors (laboratories
or blocks) from the error sum of squares. Therefore, the total sum of squares is partitioned
into three components—one due to treatment, second due to block and the third one due to
chance (called the error sum of squares). It may be noted that the total sum of squares (TSS)
and the treatment sum of squares (TrSS) would remain the same as computed earlier in
Factorial Design
In factorial design, the dependent variable is measured on an interval or ratio scale and there
are two or more independent variables which are measured on a nominal scale. In the factorial
design, it is possible to examine the interaction between the variables. If there are two
independent variables each having three categories, there would be a total of nine cells
(treatment combinations). The details on
this are already explained in Unit 3 (Research Design). Let us consider an example to explain
factorial design.
It is generally observed that there are differences in the pay packages offered to fresh MBA
graduates. The variations could be either due to the type of business school where they have
studied or it could be due to their area of specialization. The variation can also be due to an
interaction between the business school and the area of specialization. For example,
specialization in finance from a particular business school might fetch a better package.
Summary
Let us recapitulate the main points discussed in this unit:
Keywords
Analysis of variance: A technique used to compare means of two or more samples
(using the F distribution). This technique can be used only for numerical data
Completely randomized design: A design that involves the testing of the equality of
means of two or more groups; there is one dependent variable and one independent
variable in this design
Factorial design: A design for an experiment that allows the experimenter to find out
the effect of two or more independent variables each having two or more categories
along with their interactions on dependent variable
One-way ANOVA: A technique that compares the mean of two or more groups
based on one independent variable (or factor)
Two-way ANOVA: A statistical test used to determine the effect of two nominal
predictor variables on a continuous outcome variable. A two-way ANOVA test
analyses the effect of the independent variables on the expected outcome along with
their relationship to the outcome itself.
Structure
Introduction
Concept of Correlation
Quantitative Estimate of a Linear Correlation
Testing the Significance of the Correlation Coefficient
Regression Analysis
Test of Significance of Regression Parameters
Goodness of Fit of Regression Equation
Uses of Regression Analysis in Prediction
Summary
Keywords
Introduction
Correlation and regression analysis are generally performed together. Correlation
measures the degree of association between two or more variables. Regression, on the
other hand, is used to explain the variations in one variable—usually called the dependent
variable—by a set of independent variables. It identifies the nature of the relationship. The
number of independent variables in regression analysis could be one or more. In case of one
independent variable, we classify it as a simple regression, whereas in case of more than one
independent variable, it is called a multiple regression analysis.
In this unit, you will study the importance of correlation and regression analysis in research
methodology, with a focus on the quantitative estimate of a linear correlation, the significance
of the correlation coefficient and regression parameters, and the goodness of fit of the
regression equation.
Concept of Correlation
Correlation measures the degree of association between two or more variables. When
we are dealing with two variables, we are talking in terms of simple correlation and when
more than two variables are involved, the subject matter of interest is called multiple
correlation. In this unit, we will discuss simple correlation. There are three types of
correlation:
1. Positive correlation: When two variables X and Y move in the same direction, the
correlation between the two is positive. If one variable increases, the other variable also
increases, and if one variable decreases, the other variable also decreases.
2. Negative correlation: When two variables X and Y move in opposite directions,
the correlation is negative. If one variable increases, the other decreases, and vice
versa. A typical example of negative correlation is the relationship between the quantity
demanded and the price of a commodity. The scatter of the points on the variables X and Y is
clustered around a negatively sloped straight line/curve in such a situation, as shown in
Figure 13.2. In the figure, we find that the variables X and Y are moving in opposite
directions.
3. Zero correlation: The correlation between two variables X and Y is zero when the
variables move in no connection with each other. If the variable X increases, Y may
increase or decrease in different situations.
Zero correlation does not mean that the variables are not related. We are dealing with a
linear correlation here and there could be a non-linear relation between them.
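The point that zero correlation does not rule out a relationship can be seen numerically. The sketch below uses the standard Karl Pearson product-moment formula, r = Σ(X – X̄)(Y – Ȳ) / √[Σ(X – X̄)² Σ(Y – Ȳ)²], on hypothetical data; the function name and the data are illustrative only.

```python
import math


def pearson_r(x, y):
    """Karl Pearson's coefficient of linear correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


x = [-3, -2, -1, 0, 1, 2, 3]

# Y is completely determined by X, but the relation is non-linear
y_nonlinear = [xi ** 2 for xi in x]
r_nonlinear = pearson_r(x, y_nonlinear)

# An exact positive linear relation for comparison
y_linear = [2 * xi + 1 for xi in x]
r_linear = pearson_r(x, y_linear)

print(r_nonlinear, round(r_linear, 6))
```

Here Y = X² gives a correlation of zero even though Y depends entirely on X, while the exact linear relation gives a correlation of +1.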
correlation. It may be noted that the closer the scatter of points is to the line, the higher is
the degree of correlation between the variables.
Regression Analysis
One of the problems with Karl Pearson's formula of correlation coefficient is that it is
applicable only when the relationship between the two variables is linear. There can,
however, be situations when the variables are connected in a non-linear relationship. It may
be noted that zero correlation and the independence of the two variables are not the same
thing. Zero correlation does not mean that the variables are not related. They may be non-
linearly related. However, statistical independence implies that there is a zero correlation
between the variables.
Another problem with the simple correlation coefficient is that it does not indicate
which variable is influencing which one. If, for example, the correlation coefficient between
the variables X and Y is 0.96, it can only be said that the variables X and Y are positively and
highly correlated. We cannot say whether the variable X influences Y, or Y influences X,
or there is a third variable Z influencing both these variables and so producing
a high correlation between X and Y. To overcome this limitation of correlation
analysis, we have another concept called regression analysis.
For example, food expenditure in households could be predicted by using family income and
family size as independent variables in regression. In another example, the amount spent by a
consumer at a retail store in the last three months can be explained by the store's location,
prices, credit policy, merchandise quality and speed of service by using the regression
analysis. Likewise, another example could be to predict the sales volume of a photocopier by
using a set of independent variables like the size of sales force, amount of the advertising
budget and the consumer attitudes towards the company's product. Similarly, the willingness
to export the product by the small entrepreneurs could be explained by the employee size,
firm revenue and the years of operation in the domestic market.
In regression analysis, it is assumed that there is a variable that is influencing another
variable. For example, we may write,
Y = f (X)
This indicates that the values of Y depend upon the values of X. Further, there is a one-way
causation between X and Y in the sense that it is X which influences the values of Y and not
the other way round. The variable Y is called a dependent variable or an effect variable,
whereas the variable X is called an independent variable, explanatory variable, causal
variable or a regressor. The relationship between Y and X may be assumed to be linear and
we may write the following expression as:
Y=α+βX
The above expression shows that if we have a set of paired data on the variables X and Y, all
the points will lie on a positively or negatively sloped straight line, depending upon whether
the sign of beta (β) is positive or negative. This means that the correlation coefficient
between X and Y will either be +1 or –1. In fact, such a thing rarely happens. If we plot the
data on the variables X and Y on a two-dimensional plane, not all the points would lie on a
positively or negatively sloped straight line. This is because the variable Y is not only
influenced by the variable X but also by many other variables, which we have ignored for
various reasons. The possible reasons for ignoring those variables could be the
non-availability of data, poor knowledge about the existence of such variables influencing the
dependent variable Y, errors of measurement in the variables X and Y, or the researcher's
inability to quantify such variables.
Therefore, to account for those variables which have been omitted for one reason or the other,
a stochastic error term is added to the above equation, which appears as:
Y=α+βX+U
Where,
U = Stochastic error term
α, β = Parameters to be estimated
The above equation is called a simple linear regression equation. This is because there is one
dependent variable and one independent variable. In case of multiple regressions, there are at
least two independent variables. The equation is estimated using the ordinary least squares
(OLS) method of estimation. The OLS method of estimation states that the regression line
should be drawn in such a way so as to minimize the error sum of squares.
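The OLS idea can be sketched for the simple linear regression Y = α + βX + U. The code below uses the standard closed-form OLS estimates (slope = Σ(X – X̄)(Y – Ȳ) / Σ(X – X̄)², intercept = Ȳ – slope × X̄), which are not printed in the text above, and hypothetical data; it then confirms that another line gives a larger error sum of squares.

```python
def ols_fit(x, y):
    """Closed-form OLS estimates (alpha, beta) for Y = alpha + beta * X + U."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def error_sum_of_squares(x, y, alpha, beta):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat = ols_fit(x, y)
sse_ols = error_sum_of_squares(x, y, alpha_hat, beta_hat)

# Any other line, such as this slightly shifted one, has a larger error sum of squares
sse_other = error_sum_of_squares(x, y, alpha_hat + 0.1, beta_hat - 0.05)

print(round(alpha_hat, 3), round(beta_hat, 3), round(sse_ols, 4), round(sse_other, 4))
```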
In the above expression, n and k denote the sample size and the number of parameters to be
estimated in a given regression. The standard error of estimates indicates how close the
scatter of the points is to the regression line. However, this measure suffers from the defect
that it depends upon the units of measurement and, therefore, the fit of the two regression
equations with different standard errors of estimates cannot be compared. To overcome this
problem, we will introduce the concept of r², the coefficient of determination, later in the
unit.
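The estimation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name fit_simple_ols and the five data points are assumptions made for the example and do not come from the unit. It computes the OLS estimates of α and β, the error sum of squares and the standard error of estimates with k = 2.

```python
import math

def fit_simple_ols(x, y):
    """Estimate Y = alpha + beta*X by ordinary least squares (hypothetical helper)."""
    n = len(x)
    k = 2  # parameters to be estimated: alpha and beta
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares about the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    # Residuals: observed Y minus fitted Y
    residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    error_ss = sum(e ** 2 for e in residuals)
    std_error = math.sqrt(error_ss / (n - k))
    return alpha, beta, error_ss, std_error

# Illustrative data only (not taken from the unit)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
alpha, beta, error_ss, std_error = fit_simple_ols(x, y)
print(alpha, beta, error_ss, std_error)
```

Because the standard error is expressed in the units of Y, rescaling Y (say, from rupees to thousands of rupees) changes its value even though the fit is equally good.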
Test of Significance of Regression Parameters
We need to test the significance of the regression coefficients α and β, which is carried
out with the help of the t-statistic. The hypothesis to be tested for the slope coefficient is
mentioned below as:
H0: β = 0
H1: β ≠ 0
The acceptance of the null hypothesis (H0) would indicate that the variable X does not
influence Y. In the above case, we have used a two-tailed test. Whether a
researcher should use a two-tailed or a one-tailed alternative depends upon whether the
direction of the relationship between the dependent and the causal variable is known. If
we know the direction of the relationship between the causal variable and the dependent
variable, we should go for a one-tailed test; if there is no clue about the direction of the
relationship between the two variables, a two-tailed alternative should be adopted.
The test statistic to be used to test the significance of the slope coefficient is given by:
t = β̂ / SE(β̂)
where β̂ is the OLS estimate of β and SE(β̂) is its standard error.
Once we compute the t-statistic, it is compared with the table value of t with n – k degrees of
freedom, where n is the number of observations in the sample and k represents the
number of parameters to be estimated in the regression equation (in the present case, k = 2).
If the computed value of |t| is greater than the tabulated value of t at a given level of
significance, the null hypothesis is rejected.
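The test described above can be illustrated with a short Python sketch. The helper name slope_t_statistic and the data are assumptions made for the example. The standard error of the slope is taken as the standard error of estimates divided by the square root of Σ(X – X̄)², which is the usual expression for simple linear regression. The table value 3.182 is the two-tailed 5 per cent value of t for 3 degrees of freedom.

```python
import math

def slope_t_statistic(x, y):
    """Return (t, degrees of freedom) for H0: beta = 0 (hypothetical helper)."""
    n, k = len(x), 2
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    alpha = y_bar - beta * x_bar
    error_ss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope coefficient
    se_beta = math.sqrt(error_ss / (n - k)) / math.sqrt(s_xx)
    return beta / se_beta, n - k

# Illustrative data only (not taken from the unit)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_value, df = slope_t_statistic(x, y)

table_t = 3.182  # two-tailed table value of t at the 5% level for 3 degrees of freedom
reject_null = abs(t_value) > table_t
print(t_value, df, reject_null)
```

With only five observations the computed |t| falls short of the table value, so the null hypothesis is not rejected for this illustrative data set.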
Goodness of Fit of Regression Equation
A researcher would be interested in knowing how good the estimated regression
equation is. To answer this question, there is a measure r² which, in the case of the simple linear
regression model, is simply the square of the correlation coefficient. This measure is also
called the coefficient of determination of a regression equation and it takes values between 0
and 1 (both values inclusive). It indicates the explanatory power of the regression model. If
for a particular regression model, r² is equal to 0.86, it means that 86 per cent of the
variations in the dependent variable Y are explained by the variations in the independent
variable X. The r² may be computed as:
r² = 1 – Σe² / Σ(Y – Ȳ)²
where Σe² is the error sum of squares and Σ(Y – Ȳ)² is the total sum of squares of Y about its mean.
The value of r² is free from the units of measurement and, therefore, can be used to
compare the goodness of fit of two or more regressions. The test for the goodness of fit is
carried out by using the F-statistic.
For a given level of significance α, the computed value of the F-statistic is compared with the
tabulated value of F with k – 1 degrees of freedom in the numerator and n – k degrees of
freedom in the denominator. If the computed F exceeds the tabulated F, the null hypothesis is
rejected in favour of the alternative hypothesis.
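As a rough illustration of the goodness-of-fit measures, the sketch below computes r² and the F-statistic for a simple regression. The function name goodness_of_fit and the data are assumptions made for the example. The F-statistic is written in its usual form, [r² / (k – 1)] / [(1 – r²) / (n – k)], which matches the degrees of freedom quoted above.

```python
def goodness_of_fit(x, y, k=2):
    """Return (r_squared, f_statistic) for Y = alpha + beta*X (hypothetical helper)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    alpha = y_bar - beta * x_bar
    error_ss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    # Share of the variation in Y explained by X
    r_squared = 1 - error_ss / total_ss
    # F with (k - 1) and (n - k) degrees of freedom
    f_statistic = (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))
    return r_squared, f_statistic

# Illustrative data only (not taken from the unit)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_squared, f_statistic = goodness_of_fit(x, y)
print(r_squared, f_statistic)
```

In simple linear regression the F-statistic equals the square of the t-statistic for the slope, so the two tests lead to the same conclusion.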
Uses of Regression Analysis in Prediction
The regression analysis can be employed for prediction. The prediction estimates could be
both point and interval. Further, the interval prediction can be approximate as well as exact.
Summary
Let us recapitulate the main points discussed in the unit:
In this unit, the concept of correlation is defined as measuring the association between two
variables.
Keywords
Cluster Analysis
Uses of Cluster Analysis
Statistics Associated with Cluster Analysis
Key Concepts in Cluster Analysis
Process of Clustering
Summary
Glossary
Introduction
In the unit on univariate and bivariate analysis of data, we made a mention of multivariate
analysis. In the multivariate analysis of data, we analyse more than two variables at a time.
The multivariate analysis of data has a number of uses in research, which will be shown
through specific techniques. In this unit, we are going to discuss factor analysis, discriminant
analysis and cluster analysis - some very commonly used multivariate techniques.
Factor Analysis
Factor analysis is a data reduction method. It is a very useful method to reduce a large
number of variables resulting in data complexity to a few manageable factors. These factors
explain, for the most part, the variations in the original set of data. Factor analysis helps in
identifying the underlying structure of the data. A factor is a linear combination of variables.
It is a construct that is not directly observable but that needs to be inferred from the input
variables. The factors are statistically independent.
Factor analysis requires some specific conditions that must be ensured before
executing the technique. These are mentioned below:
The factor analysis exercise requires metric data.
This means the data should be either interval or ratio scale in nature.
The variables for factor analysis are identified through exploratory research.
Generally in a survey research, a five or seven-point Likert scale or any other interval
scale may be used.
As the responses to different statements are obtained through different scales, all the
responses need to be standardized. The standardization helps in comparison of different
responses from such scales. The standardization is carried out using the following formula:
Standardized value = (X – X̄) / s
where X̄ and s are the mean and the standard deviation of the responses to the statement in question.
The size of the sample respondents should be at least four to five times the number
of variables (number of statements).
The basic principle behind the application of factor
analysis is that the initial set of variables should be highly correlated. If the correlation
coefficients between all the variables are small, factor analysis may not be an appropriate
technique. A correlation matrix of the variables could be computed and tested for its
statistical significance.
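The standardization of responses and the check on correlations can be sketched as follows. The two sets of responses are hypothetical (one statement on a five-point scale and one on a seven-point scale), and the use of the sample standard deviation (divisor n – 1) is an assumption of this sketch.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert raw responses to standardized values: (X - mean) / standard deviation."""
    m, s = mean(values), stdev(values)  # stdev uses the n - 1 divisor
    return [(v - m) / s for v in values]

def correlation(a, b):
    """Correlation coefficient computed from the standardized values."""
    z_a, z_b = standardize(a), standardize(b)
    return sum(p * q for p, q in zip(z_a, z_b)) / (len(a) - 1)

# Hypothetical responses of five respondents to two statements
statement_1 = [1, 2, 3, 4, 5]   # five-point scale
statement_2 = [2, 3, 5, 6, 7]   # seven-point scale

z_1 = standardize(statement_1)
r_12 = correlation(statement_1, statement_2)
print(z_1, r_12)
```

After standardization every statement has mean 0 and standard deviation 1, so responses recorded on different scales become comparable. A high value of r_12 suggests that the two statements may load on a common factor.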
2. Rotation of factors: The second step in the factor analysis exercise is the rotation of
initial factor solutions. This is because the initial factors are very difficult to interpret.
Therefore, the initial solution is rotated so as to yield a solution that can be interpreted easily.
Most computer software gives options for orthogonal rotation (such as varimax)
and oblique rotation. Generally, the varimax rotation method is used, as this results in
independent factors.
We will explain all that is discussed above with the help of a numerical example.
Discriminant Analysis
scale, whereas the independent or predictor variables are either interval or ratio scale in
nature. When the dependent variable has two groups (categories), we have two-group
discriminant analysis and when there are more than two groups, it is a case of multiple
discriminant analysis. In case of two-group discriminant analysis, there is one discriminant
function, whereas in case of multiple discriminant analysis, the number of functions is one
less than the number of groups.
The objectives of discriminant analysis are the following:
To find a linear combination of variables that discriminates between the categories of the dependent
variable in the best possible manner
Y = b0 + b1X1 + b2X2 + ... + bKXK
Where,
Y = Dependent variable
bi = Coefficients of independent variables (i = 0, 1, 2, ..., K)
Xj = Predictor or independent variables (j = 1, 2, ..., K)
It may be kept in mind that the dependent variable Y should be a categorical variable,
whereas the independent variables (the Xs) should be continuous. As the dependent variable is
categorical, it should be coded as 0, 1, similar to dummy variable coding.
The method of estimating the bi is based on the principle that the ratio of the 'between group sum of
squares' to the 'within group sum of squares' be maximized. This will make the groups differ
as much as possible on the values of the discriminant function.
After the model has been estimated, the bi coefficients (also called
discriminant coefficients) are used to calculate Y, the discriminant score, by substituting the
values of Xj in the estimated discriminant model. To classify any new data point
into one of the groups, a decision rule is formulated based on a
cut-off score, which is usually the midpoint of the mean discriminant scores of the two
groups in the case of two-group discriminant analysis, provided the size of the samples in the two
groups is the same. The accuracy of classification is determined by using a classification
matrix (also called a confusion matrix).
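The classification step can be sketched in Python as shown below. The coefficients and the observations are hypothetical and are assumed to have been estimated already; the sketch does not estimate the discriminant function itself. It computes the discriminant scores, the cut-off score as the midpoint of the two group means (the groups are of equal size), the classification matrix and the hit ratio.

```python
def discriminant_score(coefficients, x):
    """Y = b0 + b1*X1 + ... + bK*XK for one observation."""
    b0, *bs = coefficients
    return b0 + sum(b * xi for b, xi in zip(bs, x))

def classify(score, cutoff):
    """Assumes group 1 has the higher mean discriminant score."""
    return 1 if score > cutoff else 0

# Hypothetical estimated coefficients (b0, b1, b2) and equal-sized groups
coefficients = [-5.0, 0.5, 1.0]
group_0 = [(2, 1), (3, 1), (2, 2)]
group_1 = [(6, 3), (7, 2), (6, 4)]

scores_0 = [discriminant_score(coefficients, x) for x in group_0]
scores_1 = [discriminant_score(coefficients, x) for x in group_1]
mean_0 = sum(scores_0) / len(scores_0)
mean_1 = sum(scores_1) / len(scores_1)
cutoff = (mean_0 + mean_1) / 2  # midpoint of the two group means

# Classification (confusion) matrix: rows = actual group, columns = predicted group
matrix = [[0, 0], [0, 0]]
for actual, scores in ((0, scores_0), (1, scores_1)):
    for s in scores:
        matrix[actual][classify(s, cutoff)] += 1

hit_ratio = (matrix[0][0] + matrix[1][1]) / (len(scores_0) + len(scores_1))

# Classifying a new (hypothetical) data point
new_group = classify(discriminant_score(coefficients, (4, 2)), cutoff)
print(cutoff, matrix, hit_ratio, new_group)
```

The hit ratio here is computed on the same observations used to set the cut-off, so it tends to overstate accuracy; a holdout sample gives a fairer estimate.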
The relative importance of the independent variables could be determined from the
standardized discriminant function coefficients and the structure matrix. The difference
between the standardized and unstandardized discriminant functions is that in the
unstandardized discriminant function we have a constant term, whereas in the standardized
discriminant function, there is no constant term.
Illustration of Discriminant Analysis
We will illustrate the estimation and the use of the discriminant model in the case of two
groups with the help of an example.
Cluster Analysis
Cluster analysis is similar to the techniques discussed above in that it analyses
multiple variables together. However, there are essential differences between the other data reduction
techniques and cluster analysis.
In factor analysis, the objective was to reduce the original correlated variables to a
manageable number of factors. However, the data reduction was carried out on the variables.
On the other hand, in cluster analysis the focus is on the individuals or entities and the
objective is to group the individuals on the variables.
The other data classification technique was two-group discriminant analysis. Here also, one
might wish to group individuals or objects into groups, but the technique has an established
classification rule and the objective of the technique is to validate the information to see
whether the groups obtained by the identified function are correctly classified or not. In
cluster analysis, the whole population/sample is initially undifferentiated; the technique attempts to assess
similarity in responses to the variables, and the grouping happens after the answers have been
obtained on the questions/variables.
Uses of Cluster Analysis
Cluster analysis has widespread use in the field of management. However, its most
valuable contribution is in the area of marketing. Some applications of the technique are as
follows:
Market segmentation: As we know, market segmentation is the process of splitting
customers/potential customers within a market into different groups/segments. The
advantage of the technique is that one can look at a combination of variables to
predict consumer or potential consumer groups.
Career planning and training analysis: In the area of human resources (HR) the
technique can be used to group people into clusters on the basis of their educational
qualification, experience, aptitude and aspirations. This grouping can assist the HR
division to effectively manage training and manpower development for the members
of different clusters.
Segmenting financial sectors/instruments: This is an emerging area where different
factors like raw material cost, financial allocations, seasonality and other factors are
being used to group sectors together to understand the growth and performance of a
group of industries.
Cluster analysis is the simplest of these techniques in terms of mathematical derivations. The simplest way to
explain the technique is to understand that it measures the distance between objects on
the basis of multiple variables and looks for similarity as a function of distance, i.e., the
shorter the distance between two objects, the more similar they are. For obtaining a cluster
solution to data that is collected on an interval or ratio scale, the statistical assessment of the
distance between two objects can be done by calculating the Euclidean distance between
them. The distance between persons A and B can be calculated as:
dAB = √[(XA1 – XB1)² + (XA2 – XB2)²]
Where XA1 and XA2 are the values of person A, and XB1 and XB2 the values of person B, on the
two variables under study.
For example, two variables–
nutrition and ease of preparation–were placed on a 10-point scale of importance (with 1 =
very unimportant and 10 = very important). The values selected by persons A and B were as
follows:
Then the distance between the two is,
Suppose there was a third person C who had selected
Then the distance between A and C would be 5.0 and between B and
C would be 1.0. Thus, B and C are the most similar pair as the inter-person distance is the
least and, as stated earlier, the shorter the distance, the greater the similarity. If, in addition to
having nutrition and ease of preparation for breakfast, we also had a variable that measured
cost, we would effectively have a three-dimensional solution. Then the formula would have
been:
dAB = √[(XA1 – XB1)² + (XA2 – XB2)² + (XA3 – XB3)²]
And generally, for any two objects i and j:
dij = √[Σk (Xik – Xjk)²]
Where,
Xik = Value of object/person i on variable k
dij= Distance between persons i and j
k = Variable (interval/ratio)
i = Object/person
j = Object/person
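The distance calculation can be sketched in Python as follows. The ratings used here are hypothetical: the values originally selected by persons A, B and C are not reproduced in this unit, so illustrative ratings have been chosen that give the A to C distance of 5.0 and the B to C distance of 1.0 quoted in the text.

```python
import math
from itertools import combinations

def euclidean_distance(p, q):
    """Square root of the sum of squared differences over all variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical ratings (nutrition, ease of preparation) on a 10-point scale
ratings = {"A": (7, 3), "B": (4, 6), "C": (4, 7)}

distances = {
    (i, j): euclidean_distance(ratings[i], ratings[j])
    for i, j in combinations(ratings, 2)
}
# The pair with the shortest distance is the most similar pair
most_similar = min(distances, key=distances.get)
print(distances, most_similar)
```

The same function works unchanged when a third variable such as cost is added, because it sums over however many variables each person has been rated on.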
Key Concepts in Cluster Analysis
The following statistics and concepts are associated with cluster analysis:
ANOVA table: The univariate or one-way ANOVA statistics for each clustering
variable. The higher the ANOVA value, the greater the difference between the
clusters on that variable.
Cluster variate: The variables or parameters used to cluster and calculate the similarity
between objects.
Cluster centroid: The average values of the objects on all variables in the cluster
variate.
Cluster seeds: Initial cluster centres in the non-hierarchical clustering that are the
initial points from which one starts. Then the clusters are created around these seeds.
Cluster membership: The address or the cluster to which a particular
person/object belongs.
Dendrogram: This is a tree-like diagram that graphically presents the cluster results.
The vertical axis represents the objects and the horizontal axis represents the inter-
respondent distance. The figures are to be read from left to right.
Distances between final cluster centres: These are the distances between the
individual pairs of clusters. A robust solution that is able to demarcate the groups
distinctly is the one where the inter-cluster distance is large; the larger the distance the
more distinct are the clusters.
Entropy group: Individuals or small groups that do not seem to fit into any cluster.
Final cluster centres: The mean value of the cluster on each of the variables that is
part of the cluster variate.
Hierarchical methods: A step-wise process that starts with the most similar pair and
formulates a tree-like structure composed of separate clusters.
Non-hierarchical methods: Cluster seeds or centres are the starting points and one
builds individual clusters around them based on some pre-specified distance from the seeds.
Summary: Number of cases in each cluster is indicated in the non-hierarchical
clustering method.
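To make the idea of a hierarchical method concrete, the sketch below implements a bare-bones single-linkage procedure in Python. It is an illustrative sketch rather than the routine of any statistical package: the function name single_linkage and the four objects are assumptions. At each step the two closest clusters are merged, where the distance between two clusters is the shortest distance between any of their members. The recorded merge history is the information a dendrogram displays.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(points):
    """Return the merge history as a list of (cluster_1, cluster_2, distance)."""
    clusters = [(name,) for name in points]  # every object starts as its own cluster
    merges = []

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members
        return min(euclidean(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the most similar pair of clusters and merge them
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical objects measured on two variables
points = {"A": (7, 3), "B": (4, 6), "C": (4, 7), "D": (8, 4)}
merges = single_linkage(points)
for c1, c2, d in merges:
    print(c1, c2, round(d, 3))
```

Reading the merge history from the first row to the last corresponds to reading a dendrogram from the smallest to the largest inter-object distance.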
Process of Clustering
(ii) Using the single-linkage method, prepare a dendrogram.
Summary
Factor loading: It gives the correlation coefficient between the factor score and the
variable in question
Bartlett's test of sphericity: Used to test the significance of the correlation matrix
Communality: It is a measure of the percentage of a variable's variation that is explained
by the factors
Wilks' lambda: It is given by the ratio of within-group sum of squares to total sum of squares
Hit ratio: It is the ratio of the number of correct predictions to the total number of cases
Cluster membership: The address or the cluster to which a particular person/object
belongs
Dendrogram: This is a tree-like diagram that graphically presents the cluster results.
The vertical axis represents the objects and the horizontal axis represents the inter-
respondent distance. The figures are to be read from left to right
Structure
Introduction
Types of Research Reports
Brief Reports
Detailed Reports
Report Writing: Structure of the Research Report
Preliminary Section
Main Report
Interpretations of Results and Suggested Recommendations
Report Writing: Formulation Rules for Writing the Report
Guidelines for Presenting Tabular Data
Guidelines for Visual Representations: Graphs
Summary
Keywords
Introduction
In the previous units, we have discussed and learnt about data collection and
processing. On completion of the research study and after obtaining the research results, the
real skill of the researcher lies in analysing and interpreting the findings and linking them
with the propositions formulated in the form of research hypotheses at the beginning of the
study. The statistical or qualitative summary of results would be little more than numbers or
conclusions unless one is able to present the documented version of the research endeavour.
One cannot overemphasize the significance of a well-documented and structured research
report. Just like all the other steps in the research process, this requires careful and sequential
treatment.
In this unit, we will be discussing in detail the documentation of the research study.
The format and the steps might be moderately adjusted and altered based on the reader's
requirement. Thus, it might be for an academic and theoretical purpose or might need to be
clearly spelt out and linked with the business manager's decision dilemma.
Brief Reports
These kinds of reports are not formally structured and are generally short,
sometimes not running more than four to five pages. The information provided has
limited scope and is a prelude to the formal structured report that would subsequently
follow. These reports could be designed in several ways.
Working papers or basic reports are written for the purpose of recording the process carried
out in terms of scope and framework of the study, the methodology followed and instrument
designed. The results and findings would also be recorded here. However, the interpretation
of the findings and study background might be missing, as the focus is more on the present
study rather than past literature.
Survey reports might or might not have an academic orientation. The focus here is to present
findings in an easy-to-comprehend format that includes figures and tables. The advantage of
these reports is that they are simple and easy to understand and present the findings in a clear
and usable format.
Detailed Reports
These are more formal and could be academic, technical or business reports.
Technical reports: These are major documents and would include all elements of the
basic report, as well as the interpretations and conclusions, as related to the obtained
results. This would have a complete problem background and any additional past
data/records that are essential for understanding and interpreting the study results. All
sources of data, sampling plan, data collection instrument(s), data analysis outputs
would be formally and sequentially documented.
Business reports: These reports include conclusions as understood by the business
manager. The tables, figures and numbers of the first report would now be pictorially
shown as bar charts and graphs and the reporting tone would be more in business
terms. Tabular data might be attached in the appendix.
Activity 1:
Find a technical and business report and examine the contents of the report against what has
been discussed in the unit. What deviations did you find from the stated structure? What do
you think could have been the reason for this?
Hint: You can obtain the reports from your library or from the Internet.
be presented immediately after the study objectives and a short reporting on methodology
could be presented in the appendix.
1. Preliminary Section
   Title Page
   Letter of Authorization
   Executive Summary
   Acknowledgements
   Table of Contents
2. Background Section
   Problem Statement
   Study Introduction and Background
   Scope and Objectives of the Study
   Review of Literature
3. Methodology Section
   Research Design
   Sampling Design
   Data Collection
   Data Analysis
4. Findings Section
   Results
   Interpretation of Results
5. Conclusions Section
   Conclusion and Recommendations
   Limitations of the Study
6. Appendices
7. Glossary of Terms
Preliminary Section
This section mainly consists of identification information for the study conducted. It has the
following individual elements:
Title page: The title should be crisp and indicative of the nature of the project, as illustrated
in the following examples.
Comparative analysis of BPO workers and schoolteachers with reference to their work-life
balance
Segmentation analysis of luxury apartment buyers in the National Capital Region
(NCR)
Letter of transmittal: This is the letter that broadly refers to the purpose behind the study.
The tone in this note can be slightly informal and indicative of the rapport between the client-
reader and the researcher. The letter broadly refers to three issues. It indicates the terms of the
study or objectives; next it goes on to broadly give an indication of the process carried out to
conduct the study and the implications of the findings. The conclusions generally are
indicative of the researcher's learnings from the study. A sample letter of transmittal is as
follows:
Dear Prem,
Please find the enclosed document which covers a summary of the findings of the November-
December 2011 study of the new product offering and its acceptability. I would be sending
three hard copies of the same tomorrow. Once the core group has discussed the direction of
the expected results, I would request you to kindly get back with your comments/queries/
suggestions, so that they can be incorporated in the preparation of the final report document.
The major finding of the study was that the response of the non-vegetarians consuming the
new vegan keema bonda pav at Just Bondas was positive. As you can observe, however, the
introduction of the vegan mockmeat bonda has not been well received by the regular customers
who visit the outlets for their regular alloo bonda. These findings, though on a small
respondent base, are significant as they could be an indication of a defecting loyal customer
base.
Best regards,
Nayan
Letter of authorization: The author of this letter is the business manager who formally gives
the permission for executing the project. The tone of this letter, unlike the above document, is
very precise and formal.
Table of contents: All reports should have a section that clearly indicates the division of the
report based on the formal areas of the study as indicated in the research structure. The major
divisions and subdivisions of the study, along with their starting page numbers, should be
presented. Once the major sections of the report are listed, the list of tables comes next,
followed by the list of figures and graphs, exhibits (if any) and finally the list of appendices.
Executive summary: The summary of the entire report, starting from the scope and
objectives of the study to the methodology employed and the results obtained, has to be
presented in a brief and concise manner. The executive summary essentially can be divided
into four or five sections. It begins with the study background, scope and objectives of the
study, followed by the execution, including the sample details and methodology of the study.
Next come the findings and results obtained. The fourth section covers the conclusions and
finally, the last section includes recommendations and suggestions.
Acknowledgements: A small note acknowledging the contribution of the respondents, the
corporates and the experts who provided inputs for accomplishing the study is included here.
Main Report
This is the most significant and academically robust part of the report.
Problem definition: This section begins with the formal definition of the research problem.
In case the study is an academic research, there is a separate section devoted to the review of
related literature, which presents a detailed reporting of work done on the same or related
topic of interest.
Study scope and objectives: The logical arguments then conclude in the form of definite
statements related to the purpose of the study. In case the study is causal in nature, the
formulated hypotheses are presented here as well.
Methodology of research: This section would essentially have five to six subsections specifying
the details of how the research was conducted. These would essentially be:
Research framework or design: The variables and concepts being investigated are
clearly defined, with a clear reference to the relationship being studied. The
justification for using a particular design also has to be presented here.
Sampling design: The entire sampling plan in terms of the population being studied,
along with the reasons for collecting the study-related information from the selected
group is given here.
Data collection methods: In this section, the researcher should clearly list the
information needed for the study as drawn from the study objectives stated earlier.
The secondary data sources considered and the primary instrument designed for the
specific study are discussed here. However, the final draft of the measuring instrument
can be included in the appendix.
Data analysis: The assumptions and constraints of the analysis need to be explained
here in simple, non-technical terms.
Study results and findings: This is the most critical chapter of the report and requires
special care; it is probably also one of the longest chapters in the document.
Limitations of the study: The last part of this section is a brief discussion of the
problems encountered during the study and the constraints in terms of time, financial or
human resources.
End section: The final section of the report provides all the supportive material for the study.
Some of the common details presented in this section are as follows:
Appendices: The appendix section follows the main body of the report and essentially
consists of two kinds of information:
1. Secondary information, such as long articles, long tables, legal or policy documents, or
technical information that the study uses, is based on or refers to and that needs to be
understood by the reader.
2. Primary data that can be compressed and presented in the main body of the report.
This includes the original questionnaire, discussion guides, formula used for the
study, sample details, original data, long tables and graphs which can be described in
statement form in the text.
Bibliography: This is an important part of the final section as it provides the complete details
of the information sources and papers cited in a standardized format. It is recommended to
follow the publication manuals from the American Psychological Association (APA) or the
Harvard method of citation for preparing this section. The reporting content of the
bibliography could also be in terms of:
Selected bibliography: Selective references are cited in terms of relevance
and reader requirement. Thus, the books or journals that are technical and not really
needed to understand the study outcomes are not reported.
Complete bibliography: All the items that have been referred to, even when
not cited in the text, are given here.
Annotated bibliography: Along with the complete details of the cited work, some
brief information about the nature of information sought from the article is given.
At this juncture we would like to refer to citation in the form of a footnote. To explain the
difference we would first like to explain what a typical footnote is:
Footnote: A typical footnote, as the name indicates, is part of the main report and
comes at the bottom of a page or at the end of the main text. This could refer to a
source that the author has referred to or it may be an explanation of a particular
concept referred to in the text.
The referencing protocol of a footnote and bibliography is different.
In a footnote, one gives the first name of the person first and the surname next. However, this
order is reversed in the bibliography. Here we start with the surname and then give the first
name.
In a bibliography, we generally mention the page numbers of the article or the total pages in
the book. However, in a footnote, the specific page from which the information is cited is
mentioned. A bibliography is generally arranged alphabetically depending on the author's
name, but in the footnote the reporting is based on the sequence in which they occur in the
text.
Glossary of terms: In case there are specific terms and technical jargon used in the report,
the researcher should consider putting a glossary in the form of a word list of terms used in
the study. This section is usually the last section of the report.
Activity 2:
Thus, some guidelines should be kept in mind while writing the report.
Command over the medium: A correct and effective language of communication is critical
in putting ideas and objectives in the vernacular of the reader/decision-maker.
Phrasing protocol: There is a debate about whether or not one should use personal
pronouns while reporting. The use of personal pronouns, as in
'I think...' or 'in my opinion...', lends subjectivity and personalization to the judgement.
Thus, the tone of the reporting should be neutral. For example: 'Given the nature of the
forecasted growth and the opinion of the respondents, it is likely that the...'
Whenever the writer is reproducing information verbatim from another document or
comment of an expert or published source, it must be in inverted commas or italics and the
author or source should be duly acknowledged.
For example:
Sarah Churchman, Head of Diversity, PricewaterhouseCoopers, states, 'At PricewaterhouseCoopers,
we firmly believe that promoting work-life balance is a "business-critical" issue and
not simply the "right thing to do".'
The writer should avoid long sentences and break up the
information in clear chunks, so that the reader can process it with ease.
Report formatting and presentation: In terms of paper quality, page margins and font style
and size, a professional standard should be maintained. The font style must be uniform
throughout the report. The topics, subtopics, headings and subheadings must be formatted in
the same manner throughout the report. The researcher can provide data relief and variation
by adequately supplementing the text with graphs and figures.
Table identification details: The table must have a title (1a) and an identification number
(1b). The table title should be short and usually would not include any verbs or articles. It
only refers to the population or parameter being studied. The title should be briefly yet clearly
descriptive of the information provided. The numbering of tables is usually in a series and
generally one makes use of Arabic numbers to identify them.
Data arrays: The arrangement of data in a table is usually done in an
ascending manner.
Measurement unit: The unit in which the parameter or information is presented should be
clearly mentioned.
Spaces, leaders and rulings (SLR): For limited data, the table need not be divided using
grid lines or rulings; simple white spaces add to the clarity of the information presented and
processed. In case the number of parameters is large, it is advisable to use vertical ruling.
Horizontal lines are drawn to separate the headings from the main data.
Data sources: In case the information documented and tabled is secondary in nature,
complete reference of the source must be cited after the footnote, if any.
Special mention: In case some figure or information is significant and the reader should pay
special attention to it, the number or figure can be bold or can be highlighted to increase
focus.
Guidelines for Visual Representations: Graphs
Similar to the summarized and succinct data in the form of tables, the data can also be
presented through visual representations in the form of graphs.
Line and curve graphs: Usually, when the objective is to demonstrate trends and some sort
of pattern in the data, a line chart is the best option available to the researcher. It is also
possible to show patterns of growth of different sectors or industries in the same time period
or to compare the change in the studied variable across different organizations or brands in
the same industry. Certain points to be kept in mind while formulating line charts include:
The time units or the causal variable being studied are to be put on the X-axis, or the
horizontal axis.
If the intention is to compare different series on the same chart, the lines
should be of different colours or forms.
Too many lines are not advisable; an ideal number would be five or fewer lines on the
chart.
The researcher must also take care to include the zero baseline in the chart; otherwise,
the data could appear misleading.
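Two of the points above (the limit on the number of lines and the zero baseline) lend themselves to a simple automated check before a chart is drawn. The following Python sketch is only one possible way to do this; the function check_line_chart and its parameters are hypothetical, and it assumes the researcher knows the lowest value that will be shown on the Y-axis.

```python
def check_line_chart(series, y_axis_min, max_lines=5):
    """Return a list of warnings for a planned line chart.

    series:     dict mapping each series name to its list of values
    y_axis_min: the lowest value that will be shown on the Y-axis
    max_lines:  the largest advisable number of lines (five, as above)
    """
    warnings = []
    # Too many lines make the chart hard to read.
    if len(series) > max_lines:
        warnings.append(
            f"{len(series)} lines planned; use {max_lines} or fewer"
        )
    # A Y-axis that starts above zero leaves out the zero baseline.
    if y_axis_min > 0:
        warnings.append("zero baseline missing; the trend may look exaggerated")
    return warnings


# Hypothetical sales index for two brands over four quarters.
planned = {"Brand A": [102, 104, 103, 107], "Brand B": [98, 99, 101, 100]}
print(check_line_chart(planned, y_axis_min=95))
```

In this made-up example the axis starts at 95, so the check flags the missing zero baseline: a rise from 102 to 107 would look far steeper than it really is.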
Area or stratum charts: Area charts are similar to line charts and are used to demonstrate
changes in a pattern over a period of time. The change in each of the
components is shown individually on the same chart, and the components are stacked one on top
of the other. The areas between the various lines indicate the scale or volume of the relevant
factors/categories (Figure 15.4).
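The stacking described above amounts to taking a running total of the components in each time period. A minimal Python sketch follows; the function name stack_components and the regional figures are hypothetical and used only for illustration.

```python
from itertools import accumulate


def stack_components(components):
    """Return the upper boundary line of each component once stacked.

    components: dict mapping each component name to its values per period,
    given in stacking order (the first component sits at the bottom).
    """
    names = list(components)
    # Group the values by period: one tuple of component values per period.
    periods = zip(*(components[name] for name in names))
    # Within a period, running totals give the height of each boundary line.
    stacked = [list(accumulate(values)) for values in periods]
    # Regroup by component so that each one has its own boundary line.
    return {name: [period[i] for period in stacked]
            for i, name in enumerate(names)}


# Hypothetical sales by region over three years.
regions = {"North": [10, 12, 15], "South": [5, 6, 5], "West": [3, 4, 6]}
for name, line in stack_components(regions).items():
    print(name, line)
```

The gap between two adjacent boundary lines is the value of the upper component, which is why the area between the lines indicates its volume, while the topmost line traces the overall total.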
Pie charts: Another way of demonstrating the area or stratum or sectional representation is
through pie charts. The critical difference between a line chart and a pie chart is that the pie chart
cannot show changes over time. It simply shows the cross-section of a single time period.
There are certain rules that the researcher should keep in mind while creating pie charts.
The complete data must be shown as 100 per cent of the area of the subject being graphed.
It is a good idea to have the percentages displayed within or above the pie rather than in the
legend, as it is then easier to understand the magnitude of each section in comparison to the
total.
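Both rules can be sketched in a few lines of Python: the shares are computed so that they add up to 100 per cent, and each label carries its percentage so that it can be placed on the slice itself. The function names pie_shares and pie_labels and the brand figures are hypothetical.

```python
def pie_shares(values):
    """Return each category's share of the total, as a percentage."""
    total = sum(values.values())
    return {name: 100 * value / total for name, value in values.items()}


def pie_labels(values):
    """Return labels that show the percentage next to the category name."""
    return [f"{name}: {share:.1f}%"
            for name, share in pie_shares(values).items()]


# Hypothetical units sold by three brands in a single period.
units = {"Brand A": 90, "Brand B": 70, "Brand C": 40}
print(pie_labels(units))
print(sum(pie_shares(units).values()))  # the whole pie is 100 per cent
```

Since a pie chart shows a single time period, a second period would need a second pie (or a different chart type) rather than extra slices.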
Bar charts and histograms: Bar diagrams are a very useful representation of the quantum or magnitude of
different objects on the same parameter. The comparative position of objects
becomes very clear. The usual practice is to formulate vertical bars; however, it is possible to
use horizontal bars as well if none of the variables is time-related. Horizontal bars are
especially useful when one is showing both positive and negative patterns on the same graph.
These are called bilateral bar charts and are especially useful to highlight the objects or
sectors showing a varied pattern on the studied parameter.
Another variation of the bar chart is the histogram. Here the bars are vertical and the height of
each bar reflects the relative or cumulative frequency of that particular value or class of the variable.
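The bar heights of a histogram can be worked out before any drawing is done. The Python sketch below is a minimal illustration under stated assumptions: the function name frequency_table, the class intervals and the ages are all hypothetical, each interval includes its lower limit but not its upper limit, and at least one observation falls inside the intervals.

```python
from itertools import accumulate


def frequency_table(data, class_limits):
    """Return counts, relative and cumulative frequencies per class.

    class_limits: ascending boundaries; class i covers the range
    class_limits[i] <= x < class_limits[i + 1].
    """
    counts = [0] * (len(class_limits) - 1)
    for x in data:
        for i in range(len(counts)):
            if class_limits[i] <= x < class_limits[i + 1]:
                counts[i] += 1
                break  # each observation belongs to one class only
    total = sum(counts)
    # Relative frequency: share of all observations in each class.
    relative = [count / total for count in counts]
    # Cumulative frequency: share of observations up to and including a class.
    cumulative = [running / total for running in accumulate(counts)]
    return counts, relative, cumulative


# Hypothetical respondent ages grouped into ten-year classes.
ages = [22, 25, 31, 34, 38, 41, 45, 52, 29, 36]
counts, relative, cumulative = frequency_table(ages, [20, 30, 40, 50, 60])
for count, rel, cum in zip(counts, relative, cumulative):
    print("#" * count, rel, cum)
```

Plotting the relative values gives the usual histogram; plotting the cumulative values gives bars that rise steadily to 1, which is the cumulative variant mentioned above.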
Summary
Let us recapitulate the main points discussed in this unit:
The most important task ahead of the researcher is to document the entire work done
in the form of a well-structured research report.
The orientation and structure will depend on what kind of report is being constructed.
A report could be brief or detailed, and academic, technical or business in nature.
The reports generally follow a standardized structure. The entire report can be divided
into three main sections—the preliminary section, the main body and endnotes.
There must be no ambiguity in either the presentation of the findings or the
representativeness of the findings.
Visual relief from the written text can be provided through figures, tables and graphs.
Keywords
Annotated bibliography: A bibliography that includes brief explanations or notes for
each reference
Bibliography: A list of the works of a specific author or publisher
Executive summary: The summary of the entire report, starting from the scope and
objectives of the study to the methodology employed and the results obtained,
presented in a brief and concise manner
Letter of transmittal: The letter that broadly refers to the purpose behind the study
Working paper: Report that is written for the purpose of recording the process
carried out in terms of scope and framework of the study, the methodology followed
and instrument designed