
MB 0040

Statistics for Management


Contents
Unit 1: Introduction
Unit 2: Statistical Survey
Unit 3: Classification, Tabulation and Presentation of Data
Unit 4: Measures Used to Summarise Data
Unit 5: Probabilities
Unit 6: Theoretical Distributions
Unit 7: Sampling and Sampling Distributions
Unit 8: Estimation
Unit 9: Testing of Hypothesis in Case of Large & Small Samples
Unit 10: Chi-Square
Unit 11: F – Distribution and Analysis of Variance (ANOVA)
Unit 12: Simple Correlation and Regression
Unit 13: Business Forecasting
Unit 14: Time Series Analysis
Unit 15: Index Numbers

Edition: Spring 2010
BKID – B1129, 7th Jan. 2010
Dean
Directorate of Distance Education
Sikkim Manipal University

Board of Studies
Chairman, HOD Management & Commerce, SMU – DDE
Additional Registrar, SMU – DDE
Controller of Examination, SMU – DDE
Dr. T. V. Narasimha Rao, Adjunct Faculty & Advisor, SMU – DDE
Prof. K. V. Varambally, Director, Manipal Institute of Management, Manipal
Mr. Pankaj Khanna, Director HR, Fidelity Mutual Fund
Mr. Shankar Jagannathan, Former Group Treasurer, Wipro Technologies Limited
Mr. Abraham Mathew, Chief Financial Officer, Infosys BPO, Bangalore
Ms. Sadhna Dash, Ex-Senior Manager, HR, Microsoft India Corporation (Pvt.) Ltd.

Content Preparation Team Content Editing Peer Review By


Prof. S. Santhanam Mr. Subhabaha Pal Prof. K. C. S. Rao
(Faculty – M P Birla Institute Lecturer, SMU – DDE Chairman, MBA (BT)
of Management) Bangalore School of Management
Studies, Pondicherry
Mrs. Sharada Rudramurthy University, Pondicherry
(Visiting Faculty – SIET)



This book is a distance education module comprising a collection of learning
material for our students.
All rights reserved. No part of this work may be reproduced in any form by any
means without permission in writing from Sikkim Manipal University of Health,
Medical and Technological Sciences, Gangtok, Sikkim.
Printed and Published on behalf of Sikkim Manipal University of Health, Medical and
Technological Sciences, Gangtok, Sikkim by Mr. Rajkumar Mascreen, GM, Manipal
Universal Learning Pvt. Ltd., Manipal – 576 104. Printed at Manipal Press Limited,
Manipal.
SUBJECT INTRODUCTION
Statistics, as a branch of applied mathematics, discusses data management
processes. It deals with collection, classification, presentation and analysis
of numerical data. It is useful in testing inferences relating to behavior of
economic data in the course of managerial decision making. This module
comprises 15 units:
Unit 1: Introduction
This unit discusses the definition of Statistics and also describes the two
categories of Statistics. It also deals with some of the statistical software
packages used for evaluating the data.
Unit 2: Statistical Survey
This unit briefly explains how to conduct a statistical survey. We will also
discuss the collection and analysis of numerical data, as well as the
various types of data.
Unit 3: Classification, Tabulation and Presentation of Data
This unit deals with some methods used for classification and presentation
of data in a tabular or graphical way that reveals certain patterns.
Unit 4: Measures used to Summarise Data
This unit explains the measures available for summarising data, such as
mean, median and mode. It also deals with the calculation of standard
deviation and coefficient of variation.
Unit 5: Probability
This unit describes the different ways of dealing with uncertainty using the
probability concepts. The rules of probability are described in detail. It also
describes the application of Bayes’ theorem.
Unit 6: Theoretical Distributions
This unit discusses the random variables both discrete and continuous. It
deals with the probability distributions associated with the random variables.
Unit 7: Sampling and Sampling Distributions
This unit describes the sampling design and also the theories of sampling. It
further deals with different sampling methods available. At the end of this
unit, central limit theorem is discussed.
Unit 8: Estimation
This unit explains the importance of estimation in business statistics and
deals with different types of estimation. We will discuss the calculation of
confidence intervals for the population mean when the standard deviation
is unknown. Finally, it deals with methods to calculate the sample size for
a given confidence level.
Unit 9: Testing of Hypothesis in Case of Large and Small Samples
This unit explains hypothesis testing, which is helpful in decision making. It
deals with testing of hypothesis in case of large and small samples. Finally,
we discuss the calculation of ‘t’ distribution statistics.
Unit 10: Chi-Square
This unit discusses the Chi-Square tests, which are non-parametric
tests. It further explains the application of the Chi-Square test when we
have few or no assumptions about the population.
Unit 11: F – Distribution and Analysis of Variance (ANOVA)
In this unit, we will discuss the purpose of using the analysis of variance
technique to evaluate the variations among more than two population
means. It also deals with conducting the F-test to draw inferences about
whether our samples are drawn from populations having the same mean or
not.
Unit 12: Simple Correlation and Regression
This unit explains techniques such as correlation and regression,
used for investigating the relationship between two or more variables. It also
discusses applying these techniques to measure the strength of
relationships between variables.
Unit 13: Business Forecasting
This unit deals with the business forecasting, the methods available in
forecasting, and the use of forecasting models in business improvement
processes.
Unit 14: Time Series Analysis
This unit briefly explains time series analysis and the different
components of a time series. It also deals with forecasting methods using
time series.
Unit 15: Index Numbers
This unit particularly deals with the meaning and definition of index numbers
and the types of indices. It also discusses different kinds of index numbers.
Finally, it deals with the limitations and uses of index numbers.
Module objectives
By the end of the module, ‘Statistics for Management’, you should be able
to:
• Apply the correct data collection methods
• Collect and manipulate data to draw valuable inferences
• Apply probability and probability distribution concepts to solve real life
problems
• Apply correct sampling designs for the data
• Test hypotheses using various test statistics
• Apply time series analysis in business scenarios
• Describe how much the economic variables have changed over time.
Statistics for Management Unit 1

Unit 1 Introduction
Structure:
1.1 Introduction to Statistics
Learning objectives
Importance of Statistics in modern business environment
1.2 Definition of Statistics
1.3 Scope and Applications of Statistics
1.4 Characteristics of Statistics
1.5 Functions of Statistics
1.6 Limitations of Statistics
1.7 Statistical Software
1.8 Summary
1.9 Terminal Questions
1.10 Answers to SAQs and TQs
Answers to Self Assessment Questions
Answers to Terminal Questions
1.11 References

1.1 Introduction
Welcome to the unit on Statistics. In this unit, you will study Statistics,
which deals with gathering, organising and analysing data.
Statistics plays an important role in almost every facet of human life. In the
business context, managers are required to justify decisions on the basis of
data. They need statistical models to support these decisions. Statistical skills
enable managers to collect, analyse and interpret data and make relevant
decisions. Statistical concepts and statistical thinking enable them to:
• Solve problems in almost any domain
• Support their decisions
• Reduce guesswork
1.1.1 Learning objectives
By the end of this unit, you should be able to:
• Describe the scope of Statistics
• Distinguish between statistical data and non-statistical data
• Recognise the functions of Statistics
• Recognise the limitations of Statistics
• Recall the computer programs used for analysing Statistics
1.1.2 Importance of Statistics in modern business environment
Advanced communication networks, rapid changes in consumer
behaviour, the varied expectations of different consumers and new market
openings leave modern managers with the difficult task of making quick and
appropriate decisions. They therefore need to depend more
on quantitative techniques such as mathematical models, statistics,
operations research and econometrics.

Caselet 1
Mr. Ravi, the new General Manager of a manufacturing company, is
concerned about the dwindling profits of the company. The Marketing
and Production Managers identify the reason as the guarantee period
given to customers, since the product has to be replaced if it fails within
the guarantee period. This replacement lowers the company's profits and
also causes loss of reputation. The General Manager is now thinking in
terms of reducing the percentage of units that fail within a year. This
means that he should take action to improve the life of the unit. After
preliminary studies he decides to:
I. Estimate the average life of the units and their variation.
II. Take action to improve the life.
III. Lower the replacement cost as much as possible.

As you can see, what the General Manager is doing here is using Statistics
to solve a problem and to increase profits.
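
The General Manager's first step in Caselet 1 can be sketched in a few lines of Python. The lifetimes below are hypothetical figures invented for illustration (they are not data from the caselet), and the 12-month guarantee period is likewise an assumption.

```python
import statistics

# Hypothetical lifetimes (in months) of ten units; illustrative values only.
lifetimes = [9, 14, 11, 18, 7, 16, 13, 20, 10, 12]
GUARANTEE_MONTHS = 12  # assumed guarantee period, not stated in the caselet

# Step I: estimate the average life of the units and their variation.
average_life = statistics.mean(lifetimes)
variation = statistics.stdev(lifetimes)  # sample standard deviation

# Share of units that would have to be replaced under the guarantee.
failed = [x for x in lifetimes if x < GUARANTEE_MONTHS]
failure_share = len(failed) / len(lifetimes)

print(f"Average life: {average_life:.1f} months")
print(f"Standard deviation: {variation:.2f} months")
print(f"Share failing within guarantee: {failure_share:.0%}")
```

With figures like these in hand, the manager can judge how far the average life must be raised, or the variation reduced, to bring the share of replacements down.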
Decision making is a key part of our day-to-day life. Even when we wish to
purchase a television, we like to know the price, quality, durability, and
maintainability of various brands and models before buying one. As you can
see, in this scenario we are collecting data and making an optimum
decision. In other words, we are using Statistics.
Similarly, a company that wishes to introduce a new product has to
collect data on market potential, consumer preferences, availability of raw
materials and the feasibility of producing the product. Hence, data
collection is the backbone of any decision-making process.

Many organisations find themselves data-rich but poor at drawing
information from that data. Therefore, it is important to develop the ability
to extract meaningful information from raw data to make better decisions.
Statistics plays an important role in this respect.
Statistics is broadly divided into two main categories. Figure 1.1 illustrates
the two categories. The two categories of Statistics are descriptive statistics
and inferential statistics.

[Figure 1.1: tree diagram. Statistics divides into Descriptive Statistics
(collecting, organising, summarising and presenting data) and Inferential
Statistics (making inferences, hypothesis testing, determining relationships
and making predictions).]

Fig. 1.1: Divisions in Statistics


• Descriptive Statistics: Descriptive statistics is used to present a
general description of data, summarised quantitatively. It is
especially useful when communicating the results of experiments,
for example in clinical research.

Caselet 2
In a firm, Human Resources Manager (HR Manager) calculates
the average salary of employees pertaining to production
department. The statistical data collected is related to production
department and does not give any information about other
departments of the firm. Here, the HR Manager is using
descriptive statistics. In this example, the HR Manager displays
the summarised numerical data in the form of tables, charts, and
diagrams, which comes under descriptive statistics.

• Inferential Statistics: Inferential statistics is used to make valid
inferences from the data, which help managers and professionals
make effective decisions.

Caselet 3
In a firm, the Human Resources Manager (HR Manager) uses the
average salary of employees in the production department to
estimate the average salary of employees in all other departments
of the firm. Here, the HR Manager is using inferential statistics,
because estimating an average beyond the data actually collected
is a matter of inference.

Statistical methods such as estimation, prediction and hypothesis testing
belong to inferential statistics. Researchers draw conclusions from the
collected sample data about the characteristics of the larger population
from which the samples are taken.
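
The difference between the two categories can be illustrated with a minimal Python sketch. The salary figures are generated at random purely for illustration, the sample is assumed to be drawn at random from the whole firm, and the interval uses the normal approximation for a large sample; none of these details come from the caselets.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Hypothetical salaries (in thousands) for a random sample of 40 employees.
sample = [round(rng.gauss(50, 8), 1) for _ in range(40)]

# Descriptive statistics: summarise the sample itself.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential statistics: estimate the mean salary of the whole firm
# with an approximate 95% confidence interval (normal approximation).
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
margin = z * sample_sd / math.sqrt(len(sample))
low, high = sample_mean - margin, sample_mean + margin

print(f"Sample mean: {sample_mean:.2f}")
print(f"95% interval for the population mean: {low:.2f} to {high:.2f}")
```

The first two figures only describe the 40 employees observed. The interval is a statement about employees who were never observed, which is why it carries a margin of error.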

Self Assessment Questions

1. In which of the following situations would you like to use Statistics?
a. Buying a house
b. Purchasing medicine prescribed by a doctor
c. Investing funds in several options
d. Attending relatives' marriages

1.2 Definition of Statistics

Statistics is usually and loosely defined as:


1. A collection of numerical data that measure something.
2. The science of recording, organising, analysing and reporting
quantitative information.
Professor A.L. Bowley gave several definitions of Statistics. He defined
Statistics as:
“i) The science of counting
ii) The science of averages
iii) The science of measurement of social phenomena, regarded as a whole
in all its manifestations.
iv) A subject not confined to any one science”¹
However, none of these definitions is complete.
According to Horace Secrist, “Statistics may be defined as the aggregate of
facts affected to a marked extent by multiplicity of causes, numerically
expressed, enumerated or estimated according to a reasonable standard of
accuracy, collected in a systematic manner, for a predetermined purpose
and placed in relation to each other”². This definition is both comprehensive
and exhaustive.
Prof. Boddington, on the other hand, defined Statistics as ‘The science of
estimates and probabilities’³. This definition is also not complete.
According to Croxton and Cowden, ‘Statistics is the science of collection,
presentation, analysis and interpretation of numerical data from logical
analysis’⁴.
The four different components of Statistics as per Croxton and Cowden are
shown in figure 1.2.

[Figure 1.2: four boxes labelled Collection of Data, Presentation of Data,
Analysis of Data and Interpretation of Data.]

Fig. 1.2: Basic components of Statistics according to Croxton and Cowden

1. Collection of Data
Careful planning is needed while collecting data. Different methods are
used for collecting data, such as the census method and the sampling
method. The investigator has to take care in selecting an appropriate
collection method.

1 Agarwal B L (2006) Basic Statistics, 4th ed., pp. 1-2, New Age International Publishers
2 Agarwal B L (2006) Basic Statistics, 4th ed., p. 1, New Age International Publishers
3 Agarwal B L (2006) Basic Statistics, 4th ed., p. 2, New Age International Publishers
4 Agarwal B L (2006) Basic Statistics, 4th ed., p. 2, New Age International Publishers

In the census method, every unit or object of the population is included
in the investigation. For example, if we want to study the average annual
income of all the families in a given area which has 500 families, we
must study the income of all 500 families. When the population is large,
the census method becomes difficult.
In the sampling method, a sample of units or objects is taken from the
population to describe the overall characteristics of the population from
which the sample was drawn. This method is helpful when the population
is large or when the results are needed in a short time.
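
The two collection methods can be compared with a short Python sketch. The incomes are randomly generated stand-ins for the 500 families in the example, and the sample size of 50 is an arbitrary choice made for illustration.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable
# Hypothetical annual incomes (in thousands) of all 500 families in the area.
population = [rng.randint(100, 900) for _ in range(500)]

# Census method: every family is included in the investigation.
census_mean = statistics.mean(population)

# Sampling method: only 50 randomly chosen families are studied.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"Census mean (500 families): {census_mean:.1f}")
print(f"Sample mean (50 families):  {sample_mean:.1f}")
```

The sample mean will generally differ a little from the census mean, but it is obtained from a tenth of the effort, which is the trade-off the sampling method offers.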
2. Presentation of Data
The collected data is condensed, summarised and presented in a tabular,
diagrammatic or graphical form for further analysis.
Tabulation is a systematic arrangement of classified data in rows and
columns. For the representation of data in diagrams, we use different
types of diagrams such as one-dimensional, two-dimensional and
three-dimensional diagrams.
• Line diagrams and bar diagrams are one-dimensional diagrams. (Refer
to figure 1.3 and figure 1.4 for illustrations of a line diagram and a
bar diagram respectively.)

Fig. 1.3: Line diagram Fig. 1.4: Bar diagram

• Pie-charts are two-dimensional diagrams in the form of a circle. In a
pie-chart, the total and its component parts are shown as parts of the
circle.


Example 1
The pie-chart in figure 1.5 represents the sales figures of SPQ Company
for the year 2008.

Fig. 1.5: Pie-chart representing sales figures of SPQ Company

3. Analysis of Data
The data presented has to be carefully analysed before any inference is
drawn from it. The analysis can take various forms, for example
measures of central tendency, dispersion, correlation and regression.
Measures of central tendency quantify the middle of the distribution.
When computed for a population, these measures are parameters; when
computed for a sample, they are statistics, which serve as estimates of
the population parameters. The three most common measures of the
centre of a distribution are the mean, the median and the mode.
Measures of dispersion quantify the spread of the distribution. Range,
interquartile range, mean absolute deviation and standard deviation are
four measures of dispersion.
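
As a rough illustration, the measures named above can be computed with Python's standard `statistics` module. The seven values are made up for the example, and the quartiles follow the module's default ("exclusive") method; other textbooks and packages may define quartiles slightly differently.

```python
import statistics

# Hypothetical data set, treated here as a complete population.
data = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of dispersion
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles, default method
iqr = q3 - q1
mad = sum(abs(x - mean) for x in data) / len(data)  # mean absolute deviation
sd = statistics.pstdev(data)  # population standard deviation

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, IQR={iqr}, MAD={mad:.3f}, SD={sd:.3f}")
```

If the same seven values were a sample rather than a population, `statistics.stdev` would be the usual choice in place of `statistics.pstdev`.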
4. Interpretation of Data
The final step is to draw conclusions from the analysed data.
Interpretation requires a high degree of skill and experience. Pie-charts,
for instance, make the data easy to interpret.


Example 2
The pie-chart in figure 1.6 shows the monthly expenses of Prasad's family.
From the pie-chart, we can infer that the family spent the most on
food and spent equal amounts on fuel and miscellaneous items.

Fig. 1.6: Pie-chart of Prasad’s family expenses

Thus, Statistics contains the tools and techniques required for the collection,
presentation, analysis and interpretation of data, and the definition given by
Croxton and Cowden is precise and comprehensive.

Self Assessment Questions


2. According to the definition of Statistics given by Croxton and Cowden,
what are the four components of Statistics?

1.3 Scope and Applications of Statistics


Statistical methods are applied to specific problems in various fields such as
Biology, Medicine, Agriculture, Commerce, Business, Economics, Industry,
Insurance, Sociology and Psychology.
In the field of medicine, statistical tools like t-tests are used to test the
efficacy of a new drug or medicine. In the field of economics, statistical
tools such as index numbers, estimation theory and time series analysis are
used in solving economic problems related to wages, price, production and
distribution of income. In the field of agriculture, analysis of variance
(ANOVA) is used in experiments to test the significance of the differences
among sample means.
In Biology, Medicine and Agriculture, statistical methods are applied in:
• Study of the growth of plants
• Movement of fish populations in the ocean
• Migration patterns of birds
• Analysis of the effect of newly invented medicines
• Theories of heredity
• Estimation of crop yield
• Study of the effect of fertilizer on yield
• Birth rate
• Death rate
• Population growth
• Growth of bacteria
Insurance companies decide on the insurance premiums based on the age
composition of the population and the mortality rates. Actuarial science is
used for the calculation of insurance premiums and dividends.
Statistics is a part of Economics, Commerce and Business. Statistical
analysis of the variations in price, demand and production is helpful to both
businessmen and economists. Cost of living index numbers help
governments in economic planning and fixation of wages. A government's
administrative system is fully dependent on production statistics, income
statistics, labour statistics and economic indices of cost and price. Economic
planning of any nation is entirely based on statistical facts. Cost of living
index numbers are also used to estimate the value of money. Analysis of
demand, price, production cost and inventory costs helps in decision making
in business activities.
Management of limited resources and labour needs statistical methods to
maximise profit. Planned recruitments and distribution of staff, proper quality
control methods, careful study of demand for goods in the market as well as
balanced investment help the producer to extract maximum profit out of
minimum capital. In manufacturing industries, statistical quality control
techniques help in increasing and controlling the quality of products at
minimum cost. Hence, statistics is applied in every sphere of human activity.

Self Assessment Questions


3. Mention some other areas where there is scope of applying statistics.


1.4 Characteristics of Statistics


There are several characteristics of Statistics. Not only does it deal with an
aggregate of facts, it also gets affected by multiple causes. Statistics are
numerically expressed and are estimated with varying degrees of accuracy.
Statistics are collected in a systematic manner and for pre-determined
purposes. To ensure comparative and analytical studies, statistical facts
need to be arranged in systematic, logical order. Let us look at each
characteristic in detail.
1. Statistics deals with aggregate of facts
A single figure cannot be analysed. Thus, the fact ‘Mr Kiran is 170 cm tall’
cannot be statistically analysed. On the other hand, if we know the heights
of 60 students of a class, we can comment upon the average height and
variation.
2. Statistics is affected to a marked extent by a multiplicity of causes
The yield of a crop, for example, is the result of several factors such as
fertility of soil, amount of rainfall, quality of seed used, and quality and
quantity of fertilizer used.
3. Statistics are numerically expressed
Only numerical facts can be statistically analysed. Therefore, a statement
such as ‘price decreases with increasing production’ cannot be called
statistics. Qualitative (categorical) data, for example the eye colour of a
person or the brand name of an automobile, cannot be called statistics either.
4. Statistics are enumerated or estimated with required degree of
accuracy
The facts should be collected from the field or estimated (computed) with
the required degree of accuracy. The degree of accuracy differs depending
on the purpose. For example, in measuring the length of screws, an
accuracy of up to a millimetre may be required, whereas while measuring
the heights of students in a class, an accuracy of up to a centimetre is
enough.
5. Statistics are collected in a systematic manner
The facts should be collected according to planned and scientific methods.
Otherwise, they are likely to be wrong and misleading.


6. Statistics are collected for a pre-determined purpose


There must be a definite purpose for collecting facts. Otherwise,
indiscriminate data collection might take place which would lead to wrong
diagnosis.
7. Statistics are placed in relation to each other
The facts must be placed in such a way that a comparative and analytical
study becomes possible. Thus, only related facts which are arranged in
logical order can be called Statistics. Statistical analysis cannot be used to
compare heterogeneous data.

Self Assessment Questions

4. a) Will the same degree of accuracy be needed when measuring the
height of a mountain and the height of a person?
b) Does Statistics deal with qualitative data?
5. Categorise the following data as qualitative or quantitative data.
a) The number of transactions occurring in an ATM per day
b) The popular brand name in cars is Maruthi.

1.5 Functions of Statistics


Statistics is used for various purposes. It is used to simplify mass data and
to make comparisons easier. It is also used to bring out trends and
tendencies in the data as well as the hidden relations between variables. All
this helps to make decision making much easier. Let us look at each
function of Statistics in detail.
1. Statistics simplifies mass data
The use of statistical concepts helps in simplification of complex data. Using
statistical concepts, the managers can make decisions more easily. The
statistical methods help in reducing the complexity of the data and
consequently in the understanding of any huge mass of data.
Solved Problem 1: Fifty people were interviewed to rate a regional movie
on a scale of 1 to 10, with 1 being for the top movie and 10 being for the
worst movie. Table 1.1a shows the ratings given by the 50 customers.
Simplify the data.

Table 1.1a. The ratings (scale of 1 to 10) for a regional movie given by
50 customers

1 5 7 6 8   7 5 3 4 7
1 2 5 8 7   4 7 4 2 4
9 8 7 2 5   4 5 7 9 8
7 8 9 6 7   2 3 2 8 7
6 3 5 7 6   3 9 5 4 8

The data in table 1.1a can be condensed by counting how often each rating
occurs (the frequency) and what share of the 50 ratings it represents (the
frequency distribution). The resulting frequency table is presented in
table 1.1b. It reduces the bulk data of 50 rating scores to a condensed and
simple form. From the tabled data, we can easily see that the most common
rating for the regional movie was 7 (given by 11 customers). Only two
customers gave a rating of 1, which means only two of the 50 customers
surveyed liked the regional movie the most.
Table 1.1b. Frequency table

Rating   Frequency   Frequency Distribution
1        2           2/50 = 0.04
2        5           5/50 = 0.10
3        4           4/50 = 0.08
4        6           6/50 = 0.12
5        7           7/50 = 0.14
6        4           4/50 = 0.08
7        11          11/50 = 0.22
8        7           7/50 = 0.14
9        4           4/50 = 0.08
10       0           0/50 = 0
Total    50          1
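
The condensation from Table 1.1a to Table 1.1b can be reproduced with a short Python sketch. It assumes that each five-digit group printed in Table 1.1a holds five single-digit ratings, a reading that matches the frequencies in Table 1.1b; the variable names are illustrative.

```python
from collections import Counter

# The 50 ratings from Table 1.1a, read in order (one digit per rating).
groups = "15768 75347 12587 47424 98725 45798 78967 23287 63576 39548"
ratings = [int(digit) for digit in groups.replace(" ", "")]

counts = Counter(ratings)  # frequency of each rating
total = len(ratings)

print("Rating  Frequency  Relative frequency")
for rating in range(1, 11):
    freq = counts[rating]  # a Counter returns 0 for ratings never given
    print(f"{rating:>6}  {freq:>9}  {freq / total:>18.2f}")
print(f"{'Total':>6}  {total:>9}  {1:>18.2f}")
```

Counting with `Counter` replaces the manual tallying, and dividing each count by the total gives the third column of the table.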

2. Statistics makes comparison easier


Without statistical methods and concepts, data cannot easily be collected
and compared. Statistics helps us to compare data
collected from different sources. Grand totals, measures of central
tendency, measures of dispersion, graphs and diagrams, and the coefficient
of correlation all provide ample scope for comparison.


Example 3
The curves in figure 1.7 and figure 1.8 show the profits of CBA Company
and ZYX Company respectively, for the ten years from 1998 to 2008. The
profits are plotted on the Y-axis and the timeline in years on the X-axis.
From the graphs, we can compare the profits of the two companies and
conclude that the profits of CBA Company in the year 2008 are higher than
those of ZYX Company.
The curve in figure 1.7 shows that the profits of CBA Company are
increasing, whereas the curve in figure 1.8 shows that the profits of ZYX
Company have been constant since the middle of the decade (1998-2008).

Fig. 1.7: Profits of CBA Fig. 1.8: Profits of ZYX

Hence, visual representation of numerical data helps you to compare the
data with less effort and make effective decisions.
3. Statistics brings out trends and tendencies in the data
After data is collected, it is easy to analyse the trend and tendencies in the
data by using the various concepts of Statistics.
4. Statistics brings out the hidden relations between variables
Statistical analysis helps in drawing inferences on data. Statistical analysis
brings out the hidden relations between variables.
5. Statistics makes decision making easier
With the proper application of Statistics and statistical software packages to
the collected data, managers can take effective decisions, which can
increase the profits of a business.


Self Assessment Questions


6. Total sales of a product in Area A is 840 for 30 working days. Total sales
of same product in Area B is 784 for 28 working days. Do you think that
Statistics needs to be applied to get an appropriate picture regarding
comparison of sales?

1.6 Limitations of Statistics


Despite all its characteristics and functions, Statistics also has certain
limitations.
1. Statistics does not deal with qualitative data
Qualitative data deals with meanings while quantitative data deals with
numbers. Qualitative data describes properties or characteristics that are
used to identify things. Quantitative data describes data in terms of quantity
using the numerical figure accompanied by measurement unit. Statistics
deals only with quantitative data.
Statistics deals with numerical data, which can be expressed in terms of
quantitative measurements. Qualitative phenomena like beauty and
intelligence cannot be expressed numerically, so statistical analysis
cannot be applied to them directly. However, statistical
techniques may be applied indirectly by first reducing the qualitative data to
accurate quantitative terms. For example, the intelligence of a group of
students can be studied on the basis of their marks in a particular
examination.
2. Statistics does not deal with individual fact
Statistical methods can be applied only to aggregates of facts, because
analysis and interpretation of data is highly difficult in case of individual
facts.
3. Statistical inferences (conclusions) are not exact
Statistical inferences are true only on an average. They are probabilistic
statements. For example, in case of data, which consists of height of 200
male persons taken from a graduate school, the inferences so obtained may
not hold true for an individual male person in particular.


4. Statistics can be misused and misinterpreted


Lack of sufficient knowledge of statistical science often leads to incorrect
conclusions. Therefore, proper care must be taken while selecting collection
method and also in choosing appropriate statistical models. Increasing
misuse of Statistics has led to increasing distrust in Statistics.
5. Laypersons cannot handle Statistics properly
The field of Statistics is so vast that it needs experience as well as skill to
effectively understand and apply the statistical concepts and models.
Hence, only statisticians can handle statistics properly.

1.7 Statistical Software


When the collected data is small, the analysis and interpretation can be
done without much difficulty. But when the amount of data is huge, the
process of analysis and interpretation becomes difficult. Therefore, there is a
need for tools to do the calculations in an easier way.
With the advent of computers, lots of statistical tools have been developed
which help the scientific and technical researchers or statisticians in getting
the most accurate and useful information from data. These statistical
packages help the statisticians in summarising, presenting and analysing
huge amounts of data in a short time. Some such statistical software
applications are Minitab, SPSS, and EViews. Let us look briefly at some of
these statistical tools.

Minitab
Minitab is a statistical software package that was designed especially for
the teaching of introductory statistics courses. It is our view that an easy-
to-use statistical software package is a vital and significant component of
such a course. This permits the student to focus on statistical concepts
and thinking rather than computations or the learning of a statistical
package. The main aim of any introductory statistics course should
always be the why of statistics rather than technical details that do little to
stimulate the majority of students or, in our opinion, do little to reinforce
the key concepts.
Source: http://www.minitab.com

SPSS
SPSS Inc. technology encapsulates advanced mathematical and
statistical expertise to extract predictive knowledge that when deployed
into existing processes makes them adaptive to improve outcomes.
Our Predictive Analytics Software will help you:
• Capture all the information you need about people's attitudes and
opinions
• Predict the outcomes of interactions before they occur
• Act on your insights by embedding analytic results into business
processes
Source: http://www.spss.com

EViews
EViews is a statistical software tool, which offers academic researchers,
corporations, government agencies, and students access to powerful
statistical, forecasting, and modeling tools through an innovative,
easy-to-use object-oriented interface.
EViews is the ideal package for anyone who works with time series,
cross-section, or longitudinal data. EViews offers an extensive array of
powerful features for data handling, statistics and econometric analysis,
forecasting and simulation, data presentation, and programming. EViews
generates forecasts or model simulations, and produces high quality
graphs and tables.
Source: http://www.eviews.com/

1.8 Summary
Decision making process becomes more efficient with the help of Statistics.
Statistics deals with aggregate of facts. Statistics is applied in all fields of
our activities. Statistical interpretation requires skilled and experienced
statisticians. Statistical data is numerical data or quantitative data but not
qualitative data.

Statistics is broadly divided into Descriptive and Inferential Statistics.
Descriptive Statistics gives the general description of quantitative data
whereas Inferential Statistics deals with reaching valid conclusions about the
data in order to make effective judgments. The statistical software packages
used by the interpreters or statisticians are Minitab, SPSS, Microsoft Excel,
EViews and others.

1.9 Terminal Questions


1. Mention the characteristics of Statistics.
2. Give the plural meaning of the word Statistics.
3. What are the limitations of Statistics?

1.10 Answers to SAQs and TQs

Answers to Self Assessment Questions


1. a) Yes
b) No
c) Yes
d) No
2. Industrial Quality control, Investment policies, to find Market potential for
a product.
3. The four components of Statistics are collection, presentation, analysis
and interpretation of data.
4. a) No
b) No
5. a) Quantitative data
b) Qualitative data
6. Yes

Answers to Terminal Questions


1. Refer to section 1.4
2. The science of estimates and probabilities.
3. Refer to section 1.6


Statistics for Management Unit 2

Unit 2 Statistical Survey


Structure:
2.1 Introduction
Learning Objectives
Definition of Statistical Survey
2.2 Stages of Statistical Survey
Planning of a Statistical Survey
Execution of Statistical Survey
2.3 Basic Terms used in Statistical Survey
Units or Individuals
Population or Universe
Sample
Quantitative Characteristic
Qualitative Characteristic
Variable
2.4 Collection of Data
Primary Data
Secondary Data
Pilot survey
2.5 Scrutiny and Editing of Data
2.6 Summary
2.7 Terminal Questions
2.8 Answers to SAQs and TQs
Answers to Self Assessment Questions
Answers to Terminal Questions
2.9 References

2.1 Introduction
In Unit 1, 'Introduction', you studied the definition of Statistics and its
broad divisions. You now have an idea about what Statistics is, the
characteristics of Statistics and the limitations of Statistics. In this unit,
'Statistical Survey', you will study the collection and analysis of
numerical data.
When the population is large, it is hard to conduct a survey. In such
situations, a sample is drawn and studied to determine the characteristics of
the entire population from which the sample was taken. The primary
purpose of conducting a sample survey is to obtain certain information about
the population and to draw or infer valid conclusions about the
characteristics of the population.
We can define the term 'survey' as a measurement tool, which is used to
gather people's opinions. Surveys differ in terms of purpose, field of study,
scope, and the source of information. Surveys are used by companies to
assess the level of satisfaction their customers feel, to find out what
products their customers choose and to determine which target population is
buying their products. All the following activities require collection and
analysis of data in a systematic manner.
• Formulation of a theory such as "Tobacco Consumption Leads to
Cancer"
• Framing of policies according to existing nature of a population
• Finding the relationship between characteristics of units in the
population
In other words, a search for knowledge by analysing numerical data is
known as Statistical Survey or Statistical Investigation.
2.1.1 Learning objectives
By the end of this unit, you will be able to:
• Recall the definition of Statistical survey
• Describe the activities involved in planning of a Statistical survey
• Recall the definition of terms used in Statistics
• Differentiate between sample and population
• Differentiate between quantitative and qualitative characteristics
• Describe various methods of data collection
• Distinguish between primary and secondary data
• Identify the sources of primary and secondary data
2.1.2 Definition of statistical survey
A Statistical survey is a scientific process of collection and analysis of
numerical data. Statistical surveys are used to collect numerical information
about units in a population. Surveys involve asking questions of individuals.
Surveys of human populations are common in government, health, social
science and marketing sectors.

2.2 Stages of Statistical Survey


A statistical survey is carried out in two stages: planning and execution.
Figure 2.1 shows these two broad stages.

Statistical Survey

Planning Execution

Fig. 2.1: Stages of Statistical Survey

2.2.1 Planning a Statistical Survey


The relevance and accuracy of data obtained in a survey depend upon the
care exercised in planning. A properly planned investigation can lead to the
best results with the least cost and time. Figure 2.2 explains the steps
involved in the planning stage.

Fig. 2.2: Explanation of steps involved in planning of a statistical survey

2.2.2 Execution of Statistical survey


Control methods should be adopted at every stage of carrying out the
investigation to check the accuracy, coverage, methods of measurements,
analysis and interpretation.
The collected data should be edited, classified, tabulated and presented in
diagrams and graphs. The data should be carefully and systematically
analysed and interpreted.

Self Assessment Questions


1. What are the main stages in a survey?
2. Training of investigators belongs to which stage?
3. Analysis of data is a part of execution of survey. Is this correct?

2.3 Basic Terms Used in Statistics


Statistics, being a specialised subject, has a number of terms which have to
be used. You need to know and understand these terms in order to do any
statistical work. Let us get you acquainted with some of the basic terms
used in Statistics.
2.3.1 Units or Individuals
In a Statistical survey, the objects on which the characteristics are
measured are called units or individuals.
2.3.2 Population or Universe
The totality of all units or individuals in a survey is called the population or
universe. If the number of objects in a population is finite, it is called a
finite population; otherwise it is known as an infinite population.
A numerical measure that describes a characteristic of the population is
known as a parameter. In figure 2.3, the eight consumers together constitute
the entire population.

Fig. 2.3: Population versus sample

Key Statistic
A parameter is a characteristic of a population. A population can have many
parameters.
A statistic is a characteristic of a sample. A sample can have many statistics.

2.3.3 Sample
A sample is a part or subset of the population. By studying the sample, you
can estimate the characteristics of the entire population from which the
sample is taken. A numerical measure that describes a characteristic of a
sample is known as a statistic.
If the population is large, it is hard to collect data on every unit. Hence, a
part of the population is chosen to study the characteristics of the entire
population. A sample is normally smaller than the population. Proper care
must be taken while choosing the sample. In figure 2.3, a sample of three
consumers is drawn from the entire population of eight consumers.
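The distinction between a parameter and a statistic can be sketched as follows. The spending figures are hypothetical, and the choice of the mean as the characteristic is an assumption made only for illustration: the same measure is computed once from all eight consumers and once from a random sample of three.

```python
import random
from statistics import mean

# Hypothetical monthly spend (in rupees) of a population of eight consumers.
population = [1200, 950, 1430, 800, 1100, 1250, 990, 1080]

# Parameter: a characteristic computed from the whole population.
population_mean = mean(population)

# Draw a simple random sample of three consumers (seed fixed for repeatability).
rng = random.Random(42)
sample = rng.sample(population, 3)

# Statistic: the same characteristic computed from the sample only.
sample_mean = mean(sample)

print("Population mean (parameter):", population_mean)
print("Sample:", sample)
print("Sample mean (statistic):", sample_mean)
```

A different seed gives a different sample and usually a different sample mean, whereas the population mean stays fixed. This is why a statistic is treated as an estimate of the parameter.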
2.3.4 Quantitative characteristic
A characteristic which is numerically measurable is called a quantitative
characteristic.
Quantitative data is data expressing a certain quantity, amount or range.
Usually, there are measurement units associated with the data, for example,
the height of a person in metres.
2.3.5 Qualitative characteristic
A characteristic which is not numerically measurable is called a qualitative
characteristic. Qualitative data is data describing the attributes or properties
that an object possesses.
Let us understand the basic terminologies of Statistics with the help of a
caselet.

Caselet 1
Consider a survey of the average number of children below 16 years
in a ward of a municipality. The number of houses in the ward is finite.
Therefore, the population is finite. The units are households. The
characteristic measured is the number of children below 16 years in a
household. It is measurable and hence quantitative. On the other hand,
in a survey to find the total number of blind people in a locality, the
characteristic 'blindness' is qualitative.

2.3.6 Variable
In a population, some characteristics remain the same for all units and some
others vary from unit to unit. The quantitative characteristic that varies from
unit to unit is called a variable. The qualitative characteristic that varies from
unit to unit is called an attribute.
A variable that assumes only some specified values in a given range is
known as a discrete variable. A variable that can assume any value in the
range is known as a continuous variable. For example, the number of children
per family and the number of petals in a flower are discrete variables. The
height and weight of persons are continuous variables.

Self Assessment Questions


4. Classify the following as finite or infinite population.
i) Production of a product in a factory for a day.
ii) Number of points in this page.
iii) The set of rational numbers.
iv) The weight of new born babies measured up to first decimal place
in a state during first week of February 2008.
5. Classify the following as attribute or variable.
i) Eye color of human beings
ii) Number of pages in a book of various subjects
6. Classify the following as discrete or continuous variable
i) Number of shares sold each day in a stock market.
ii) Temperatures recorded every half hour at a regional
meteorological centre.

2.4 Collection of Data


Collection of data is the first and most important stage in any Statistical
Survey. The method for collection of data depends upon various
considerations such as objective, scope, nature of investigation and
availability of resources. Direct personal interviews, third party agencies,
and questionnaires are some ways through which data is collected.

2.4.1 Primary data


Data collected for the first time keeping in view the objective of the survey is
known as primary data. Such data is likely to be more reliable. However, the
cost of collecting it is much higher. Primary data may be collected by the
census method, in which information is observed for each and every
individual of the population, or from a sample.

Key Statistic
A sample which consists of the entire population is called a census.

Collection of primary data can be done by any of the following methods.


1. Direct personal observation
2. Indirect oral interview
3. Information through agencies
4. Information through mailed questionnaires
5. Information through schedule filled by investigators
Let us look at each of them in detail.
Direct personal observation
In the direct personal observation method, as illustrated in figure 2.4, the
investigator collects data by having direct contact with units of investigation.
The accuracy of data depends upon the ability, training and attitude of the
investigator.

Fig. 2.4: Direct personal observation

The direct personal observation method is suitable where:
• The scope of investigation is narrow
• Investigation is confidential and requires personal attention of the
investigator
• Accuracy of data is important

Table 2.1 shows the merits and demerits of the direct personal observation
method.
Table 2.1: Merits and demerits of direct personal observation
Merits
1. We get the original data, which is more accurate and reliable.
2. Satisfactory information can be extracted by the investigator
through indirect questions.
3. Data is homogeneous and comparable.
4. Additional information can be gathered.
5. Misinterpretation of questions can be avoided.
Demerits
1. This method is more costly.
2. This method consumes more time.
3. This method cannot be used when the scope of investigation is wide.
4. Most of the data collected through this method is maintained confidential.
Hence, there is a chance of leakage of data.

Indirect oral interview


Indirect oral interview is used when the area to be covered is large. The
investigator collects the data from a third party or witness or head of
institution. This method is generally used by the police department in cases
related to enquiries on causes of fires, thefts or murders.
In this method, the investigator contacts witnesses or neighbors or friends or
some other third parties who are capable of supplying the necessary
information. Enquiry committees appointed by governments use this method
to get people's views and every possible detail regarding the enquiry. This
method is best suited when direct sources do not exist, cannot be relied
upon, or would be unwilling to take part in the survey. Table 2.2 shows
the merits and demerits of indirect oral interview.
Table 2.2: Merits and demerits of indirect oral interview
Merits
1. Economical in terms of time, cost and manpower.
2. Confidential information can be collected.
3. Information is likely to be unbiased and reliable.
Demerits
1. The degree of accuracy of information is less.

Collecting information through agencies


Collecting information through local agencies or correspondents is a method
generally adopted by newspapers and television channels. Local agents
are appointed in different parts of the area under investigation, and they
send the desired information at regular intervals. This method is illustrated
in figure 2.5.
This method is used where the area to be covered is very large and periodic
information is required. However, one disadvantage of this method is that
the information is likely to be affected by the bias of the correspondents or
agencies.

Fig. 2.5: Collecting information through agencies

Information through mailed questionnaires


Often, information is collected through questionnaires. The questionnaires
are filled with questions pertaining to the investigation. They are sent to the
respondents with a covering letter soliciting cooperation from the
respondents (respondents are the people who respond to questions in the
questionnaire). The respondents are asked to give correct information and
to mail the questionnaire back. The objectives of investigation are explained
in the covering letter together with assurance for keeping information
provided by the respondents as confidential.
Good questionnaire construction is an important contributing factor to the
success of a survey. When questionnaires are properly framed and
constructed, they become important tools by which statements can be made
about specific people or entire populations.
This method is generally adopted by research workers and other official and
non-official agencies. This method is used to cover large areas of
investigation. It is more economical and free from investigator's bias.
However, it results in many "non-response" situations. The respondent may
be illiterate. The respondent may also provide wrong information due to
wrong interpretation of questions.
If the questionnaire consists of invalid questions, or questions in incorrect
order, or questions in inappropriate format, or questions that are biased,
then the survey would be useless. An important method for checking and
making sure whether a questionnaire is accurately capturing the intended
information is to pre-test among a smaller subset of target respondents.
Success of questionnaire method of collection of data depends mainly on
proper drafting of the questionnaire. You have to keep the following points in
mind while preparing a questionnaire:
• The respondent should not need much time to complete the
questionnaire. It should be short.
• The questions asked should be well structured and unambiguous.
• The questions asked should be in proper logical sequence.
• Questions should be unbiased. The questions in the questionnaire
should not intrude on the privacy of the respondents.
• Completing the questionnaire should not involve much writing
work.
• Necessary instructions and a glossary should be given in the covering letter.
• Questions involving technical jargon and mathematical calculations
should be avoided.
• The completed questionnaire should be kept confidential and used only
for the purpose of the survey as mentioned in the investigation.
• There should not be any scope for misinterpretation in the questions.

There are different types of questions that can be used in the questionnaire.
A questionnaire can have Contingency questions, Matrix questions, Closed
ended questions and Open ended questions. Let's have a look at each one
in detail.
• Contingency questions are questions that are answered only if the
respondent gives a particular response to a previous question. This
avoids asking people questions that do not apply to them.
• Matrix questions are questions which are placed one under the other,
forming a matrix. The response categories are placed along the top and
a list of questions is placed down the side. This makes efficient use of
page space and respondents' time.
• Closed ended questions are those where the respondents' answers are
limited to a fixed set of responses. Usually scales are closed ended.
There are various types of closed ended questions.
Yes/no questions – here the respondents answer with "yes" or "no". Some
examples are:
• Are you a science graduate? Yes [ ] No [ ]
• Did you watch a movie last night? Yes [ ] No [ ]

Multiple choice – here the respondents have several options from which to
choose. For example:

Example 1
The sun rises in which direction?
East [ ]
West [ ]
North [ ]
South [ ]

Scaled questions – here the responses are graded on a continuum (for
example, rating the appearance of a product on a scale from 1 to 10, with 10
implying the most preferred appearance and 1 implying the least preferred
appearance). Scaled questions are mostly questions related to attitudes. A
Likert scale provides a number of attitude statements. The respondent has
to say how much they agree or disagree with each one.

Example 2
Read the following statement and then indicate by a tick whether you
strongly agree, agree, disagree or strongly disagree with the statement.
“Tasks when organised and prioritised take less time to complete.”
1. Strongly Agree [ ]
2. Agree [ ]
3. Disagree [ ]
4. Strongly Disagree [ ]
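Responses to a scaled question such as Example 2 are usually tabulated and given numeric codes before analysis. The following is a minimal sketch. The codes 4 to 1 and the eight answers are assumptions made for illustration, not part of the questionnaire itself.

```python
from collections import Counter
from statistics import mean

# Assumed numeric codes for the four response options in Example 2.
LIKERT_CODES = {
    "Strongly Agree": 4,
    "Agree": 3,
    "Disagree": 2,
    "Strongly Disagree": 1,
}

# Hypothetical answers from eight respondents.
responses = [
    "Agree", "Strongly Agree", "Agree", "Disagree",
    "Agree", "Strongly Agree", "Strongly Disagree", "Agree",
]

# Frequency of each response option.
frequency = Counter(responses)

# Convert each answer to its numeric code and summarise.
scores = [LIKERT_CODES[answer] for answer in responses]
average_score = mean(scores)

for option in LIKERT_CODES:
    print(f"{option}: {frequency[option]}")
print("Average score:", average_score)
```

Averaging the codes treats the gaps between options as equal, which is a simplifying assumption. The frequency table alone is often the safer summary.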

• Open ended questions are those questions for which the respondent
supplies their own answer without any fixed set of possible responses.
Examples of types of open ended questions include:
Sentence completion – In these, respondents complete an incomplete
sentence.
Example 3
Complete the sentence below.
“I like the management courses offered by Sikkim Manipal
University because ...”.

Story completion – In these, respondents complete an incomplete story.
Picture completion – In these, respondents fill in an empty conversation
balloon.
Thematic Apperception Test – In these, respondents explain a picture or
make up a story about what they think is happening in the picture.
Information through schedule filled by investigators
Information can be collected through schedules filled by investigators
through personal contact. In order to get reliable information, the
investigator should be well trained, tactful, unbiased and hard working.
A schedule is suitable for an extensive area of investigation through the
investigator's personal contact. The problem of non-response is minimised.
There is a difference between a schedule and a questionnaire. A schedule
is a form that the investigator fills in personally while surveying the units or
individuals. A questionnaire is a form sent (usually mailed) by an
investigator to respondents. The respondent has to fill it in and then send it
back to the investigator.
2.4.2 Secondary data
Any information that is used for the current investigation but is obtained
from data that has been collected and used by some other agency
or person in a separate investigation or survey is known as secondary data.
It is available in published or unpublished form.

In published form, secondary data is available in research papers,
newspapers, magazines, government publications, international publications,
and websites. Secondary data was originally collected for a different purpose.
Therefore, care should be exercised while making use of it.
The accuracy, reliability, objectives and scope of secondary data should be
examined thoroughly before use. Secondary data may be collected either by
census or by sampling methods.
The various sources of published data are:
• Reports and official publications of international and national
organisations as well as central and state governments
• Publications of several local bodies such as municipal corporations and
district boards
• Financial and economic journals
• Annual reports of various companies
• Publications brought out by research agencies and research scholars
Some journals (both academic and non-academic) are published at
regular intervals, such as yearly, monthly or weekly, whereas other
publications are more ad hoc. The internet is a powerful source of secondary
data, which can be accessed at any time for further analysis.
Unpublished sources
It is not necessary that all statistical contents have to be published.
Unpublished data such as records maintained by various government and
private offices, studies made by research institutions and scholars can also
be used where necessary. The following are some of the points that need to
be considered while using secondary data.
1. The collection and processing of the data
2. Accuracy of the data
3. The degree of summarisation of the data
4. The degree to which the data is comparable with other tabulations
5. How to interpret the data, especially when figures collected for one
purpose are used for another

With secondary data, people have to compromise between what they want
and what they are able to find.

The merits of secondary data are that:


• Secondary data is cheaper to obtain. Many government publications are
relatively cheap and libraries stock quantities of secondary data
produced by the government, by companies and other organisations.
• Large quantities of secondary data can be accessed through the
internet.
• Much of the available secondary data has been collected over the course
of many years and therefore it can be used to plot trends.
• Secondary data is valuable to the government, business and research
areas. In governments, it helps in making decisions and in planning
future policies. In business and industry areas such as marketing
and sales, it is used to appreciate the general economic and social
conditions and to provide information on competitors. It helps research
organisations by providing social, economic and industrial
information.

The demerits of secondary data are that:


• It is difficult to judge whether the secondary data is sufficiently accurate.
• It might be difficult to fit secondary data to the needs of the investigator.
• Secondary data might not be available for certain investigations. In such
situations, primary data has to be collected.

The differences between primary and secondary data are listed in the
table 2.3.
Table 2.3: Differences between primary and secondary data
Primary Data                                               Secondary Data
1. Data is original and thus more accurate and reliable.   1. Data may be less accurate and reliable.
2. Gathering data is expensive.                            2. Gathering data is cheap.
3. Data is not easily accessible.                          3. Data is easily accessible through the internet or other resources.
4. Most of the data is homogeneous.                        4. Data is not homogeneous.
5. Collection of data requires more time.                  5. Collection of data requires less time.
6. Extra precautionary measures need not be taken.         6. Data needs extra care.
7. Data gives detailed information.                        7. Data may not be adequate.

Self Assessment Questions


7. State whether the following data are primary or secondary.
i) An official of the Census Board of India is preparing a report on
census of population based on the survey data collected by the
Census Board.
ii) An HR representative of a software company is deciding on the
time taken to perform a particular job on a project on the basis of
random observations collected by him.
iii) A neurologist is examining the relationship between cigarette
smoking and brain tumor based on the data published in a famous
neurology journal.
2.4.3 Pilot survey
A pilot survey is a small trial survey undertaken before the main survey. It
gives a measure of the efficiency of the questionnaire. It reduces
inconvenience and loss of information. It helps in introducing necessary
changes.
When prior information about the nature of the population under study, or
about the operational and cost aspects of data collection and analysis, is not
available from earlier surveys, it is desirable to design and carry out a pilot
survey.

A pilot survey is preliminary research conducted before a complete survey
to test the effectiveness of conducting the research. The pilot survey should
be completed before the final survey begins. By conducting the pilot survey,
the investigator will be able to identify difficulties that were not
known at the survey proposal stage.
Pilot surveys have many other advantages.
• Pilot surveys provide the investigator with many ideas, approaches and
clues that are not foreseen before conducting the pilot survey. Such
ideas and clues increase the chances of getting accurate findings in the
main survey.
• Pilot surveys help in making necessary alterations in the data collecting
methods. Hence investigators can analyse data in the main survey more
efficiently.
• Pilot surveys save a lot of time and provide enough data for the
investigator to decide whether to go ahead with the main survey or not.
Apart from advantages, pilot survey also has certain limitations which are
discussed below.
• Pilot surveys are not based on strong statistical foundation and are
based on very small sample sizes.
• There is a possibility that the investigator might make wrong predictions
or assumptions on the basis of pilot data.
• If data and results from pilot surveys are included in the main survey
then it might lead to incorrect decisions.
• If the pilot participants are included in the main survey, then data
obtained from these participants might result in corruption of main data.
• Sometimes, if an expensive pilot study is unsuccessful, then the
investigator might find it very difficult to stop the main survey.

Self Assessment Questions


8. State whether the following statements are 'True' or 'False'.
i) Census conducted by Government of India is an example of
primary data.
ii) TV News Bulletins gather information on any event through their
agents.
iii) Schedules make respondents record their answers.
iv) A covering letter to the questionnaire brings confidence in
respondents.
v) Questions in questionnaire should be lengthy.

2.5 Scrutiny and Editing of Data


Before the collected data is used, it should be checked for completeness,
accuracy and reliability. By complete, we mean that all the required
information should be available. Editing the data is a time-consuming
but important task.
Data collected from various sources is usually disorganised and
needs to be condensed and analysed for further study. Valuable data
may be lost during condensation. Hence, the editing of any collected data
requires proper planning. While editing, it is important to have all the
sources of the collected data at hand and to keep the overall scope of the
survey in view.
There are different steps involved in editing the collected data. The data
must be checked for:
Legibility
The data must be legible. If a response is not presented clearly, the
investigator has to rewrite it.
Completeness
A blank response on a questionnaire implies that either the respondent
did not answer the question or the investigator did not record the answer. If
the fault is the investigator's, then the investigator has to fill in the
missing entry. If an entry is missing because the respondent omitted it,
then the investigator has to conduct the survey again to gather the missing
entry.
Consistency
The investigator has to examine each questionnaire for inconsistency
or inaccuracy in any statement. For example, the numerical figures of
characteristics such as income, height and weight may be inconsistent. In
such cases, it is the duty of the concerned investigators to make the
necessary corrections. The investigators also have to make sure that the
collected data is free from redundant responses and duplicate entries.
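The completeness, consistency and duplicate checks described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names and the plausible ranges are assumptions made for the example and are not prescribed by the survey method.

```python
# Hypothetical survey records; None marks an unanswered entry.
records = [
    {"id": 1, "income": 25000, "height_cm": 170},
    {"id": 2, "income": None, "height_cm": 165},    # incomplete entry
    {"id": 3, "income": 30000, "height_cm": 1750},  # implausible height
    {"id": 1, "income": 25000, "height_cm": 170},   # duplicate of record 1
]

# Assumed plausible ranges used for the consistency check.
PLAUSIBLE = {"income": (0, 10_000_000), "height_cm": (50, 250)}


def edit_checks(rows):
    """Return the ids of incomplete, inconsistent and duplicate records."""
    incomplete, inconsistent, duplicates = [], [], []
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:                      # exact repeat of an earlier record
            duplicates.append(row["id"])
            continue
        seen.add(key)
        if any(value is None for value in row.values()):
            incomplete.append(row["id"])     # completeness check failed
            continue
        for field, (low, high) in PLAUSIBLE.items():
            if not low <= row[field] <= high:
                inconsistent.append(row["id"])  # consistency check failed
                break
    return incomplete, inconsistent, duplicates


incomplete, inconsistent, duplicates = edit_checks(records)
print(incomplete, inconsistent, duplicates)
```

A record flagged here still has to be corrected by the investigator; the sketch only locates the entries that need attention.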


2.6 Summary
A Statistical survey is a search for knowledge. There are two main stages in
any Statistical survey - planning and execution. Planning a Statistical survey
encompasses the following issues.
i) The nature of problem
ii) The objectives
iii) The scope
iv) Statistical units
v) The degree of accuracy
vi) The time period
vii) The source of information
viii) The organisation
The collected data should be edited, analysed and interpreted for
completeness, accuracy and consistency. Sample is a subset of population.
Sample can never be larger than the population from which the sample was
taken.
A quantitative characteristic is one that is numerically measurable;
otherwise it is a qualitative characteristic. A quantitative
characteristic that varies from unit to unit is called a variable. A qualitative
characteristic that varies from unit to unit is called an attribute.
There are two categories of data - primary and secondary data. Primary
data is collected directly from the respondents whereas secondary data is
collected through agencies.
The various methods of collecting primary data are:
• Direct personal observation
• Indirect oral interview
• Information through agencies
• Information through mailed questionnaires
• Information through schedule filled by investigators
Questionnaires must be structured well and must not be ambiguous. A
covering letter must be included along with the questionnaire. A pilot
survey is beneficial when prior information about the survey does not exist
or when the results of the survey are needed quickly.


2.7 Terminal Questions


1. What is Statistical survey?
2. Enumerate the factors which should be kept in mind for proper planning.
3. What do you understand by the unit of measurement? Explain with
examples.
4. Distinguish between:
a) Primary and Secondary Data
b) Direct and Indirect Investigation
c) Questionnaire and Schedule

2.8 Answers to SAQs and TQs

Answers to self assessment questions


1. Planning and execution
2. Planning
3. Yes
4. i) Finite ii) Infinite iii) Infinite iv) Finite
5. i) Attribute ii) Variable
6. i) Discrete ii) Continuous
7. i) Primary data ii) Primary data iii) Secondary data
8. i) True ii) True iii) False iv) True v) False

Answers to Terminal Questions


1. Refer section 2.1.2.
2. Refer section 2.2.1.
3. It refers to the unit of the population on which measurements are made,
for example, the height of employees in an office. Employees are
individuals or units. Height is the measurement made on them.
4. a) Data collected for the first time by the investigator is primary data.
Data collected by some other persons but used by the investigator
for his study is known as secondary data.
b) Direct investigations are carried out directly by the investigator.
Investigation conducted through mail questionnaire is called indirect
investigation.

c) Questionnaires contain simple questions and are filled by
respondents. Schedules also contain questions but responses are
recorded directly by the investigator.

2.9 References
• B. L. Agarwal (2006), Basic Statistics, Fourth Edition, New Age
International Publishers
• Richard I. Levin, David S. Rubin (2008), Statistics for Management,
Seventh Edition, PHI Learning Private Limited
• Rand R. Wilcox (2009), Basic Statistics – Understanding Conventional
Methods and Modern Insights, Oxford University Press
• http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-3.pdf



Statistics for Management Unit 3

Unit 3 Classification, Tabulation and Presentation of Data
Structure:
3.1 Introduction
Learning objectives
3.2 Functions of Classification
Requisites of a good classification
Types of classification
Methods of classification
3.3 Tabulation
Basic difference between classification and tabulation
Parts of a table
Types of table
3.4 Frequency and Frequency Distribution
Derived frequency distributions
Bivariate and multivariate frequency distribution
Construction of frequency distribution
3.5 Presentation of Data
Diagrams
3.6 Graphical Presentation
Histogram
Frequency polygon
Frequency curve
Ogives
3.7 Summary
3.8 Terminal Questions
3.9 Answers to SAQs and TQs
Answers to Self Assessment Questions
Answers to Terminal Questions
3.10 References

3.1 Introduction
In unit 2, 'Statistical Survey', you studied surveys and the different
methods of collecting data. In this unit, 'Classification, Tabulation and
Presentation of Data', you will learn how collected data is simplified.
You will also learn some methods of graphical summarisation
that reveal certain patterns in the data.
Collected data in its raw form is voluminous and incomprehensible.
Therefore, it should be condensed and simplified for better
understanding and usefulness.
Classification is the first stage in simplification. It can be defined as a
systematic grouping of the units according to their common characteristics.
Each group is called a class.
For example, in a survey of industrial workers of a particular industry,
workers can be classified as unskilled, semi-skilled and skilled, each of
which forms a class.
3.1.1 Learning Objectives
By the end of this unit, you should be able to:
• Describe the functions and methods of classification
• Identify the parts of a table
• Describe the functions of tabulation
• Calculate the frequency and frequency distribution for the data
• Present numerical data graphically

3.2 Functions of Classification


Classification of data performs many functions.
• It condenses the bulk data
• It simplifies the data and makes the data more comprehensible
• It facilitates comparison of characteristics
• It renders the data ready for any statistical analysis
3.2.1 Requisites of a good classification
A good classification should be:
• Unambiguous: It should not lead to any confusion
• Exhaustive: Every unit should be allotted to a class
• Mutually exclusive: There should not be any overlapping, so that no
unit falls in more than one class
• Flexible: It should be capable of adjusting to changing situations
• Suitable: It should be suitable to the objectives of the survey
• Stable: It should remain stable throughout the investigation
• Homogeneous: There should be similar units in the same class
• Revealing: It should bring out essential features of the collected data
3.2.2 Types of classification
The important types of classification are:
Geographical classification
Data classified according to region is geographical classification.
Chronological classification
Data classified according to the time of its occurrence is called chronological
classification.
Conditional classification
Classification of data done according to certain conditions is called
conditional classification.
Qualitative classification
Classification of data according to characteristics that cannot be measured
numerically is called qualitative classification. Examples are the sex of a
person, marital status and colour.
Quantitative classification
Classification of data that is measurable either in discrete or continuous
form is called quantitative classification.
Statistical Series
Data arranged logically according to size, time of occurrence or some
other measurable or non-measurable characteristic is called a statistical
series.
3.2.3 Methods of classification
There are three methods of classification. They are:
• One-way classification
• Two-way classification
• Manifold classification
One-way classification
Classification done according to a single attribute or variable is known as
one-way classification.


Example 1
The data displayed in figure 3.1 is the number of students who have
secured more than 60% in various sub-modules of statistics. This is
an example of one-way classification.

Fig. 3.1: One-way classification

Two-way classification
Classification done according to two attributes or variables is known as two-
way classification.

Example 2
The data displayed in figure 3.2 is the classification, according to gender,
of students who have secured more than 60% in the respective sub-modules of
statistics. In the sub-module titled 'Basic Concepts', ten students got
more than 60%. Out of these ten students, four are males and six are females.

Fig. 3.2: Two-way classification


Manifold classification
Classification done according to more than two attributes or variables is
known as manifold classification.

Example 3
The figure 3.3 shows the classification of employees according to skill,
sex and education.

Fig. 3.3: Manifold classification example
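The three methods differ only in how many characteristics are used to form the classes. The short Python sketch below counts the same records in one, two and three ways. The employee records are hypothetical and are not the data behind figures 3.1 to 3.3.

```python
from collections import Counter

# Hypothetical employee records: (skill, sex, education).
employees = [
    ("skilled", "male", "graduate"),
    ("skilled", "female", "graduate"),
    ("unskilled", "male", "undergraduate"),
    ("skilled", "male", "graduate"),
    ("semi-skilled", "female", "undergraduate"),
]

# One-way: classes formed on a single attribute (skill).
one_way = Counter(skill for skill, _, _ in employees)

# Two-way: classes formed on two attributes (skill and sex).
two_way = Counter((skill, sex) for skill, sex, _ in employees)

# Manifold: classes formed on more than two attributes.
manifold = Counter(employees)

print(one_way)
print(two_way)
print(manifold)
```

Each counter is exhaustive and mutually exclusive by construction: every record is counted in exactly one class.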

Self Assessment Questions


1. Fill in the blanks
i. Classification is a systematic __________ of the units according to
their ____________ __________.
ii. Classification reduces _________ of the data.
iii. Classification of data that are non-measurable is known as ____ ___.
iv. Data arranged logically according to size is known as _________.
v. Manifold classification involves more than _________ variables.
vi. Data arranged according to time of occurrence is known as
_________.

3.3 Tabulation
Tabulation follows classification. It is a logical or systematic listing of related
data in rows and columns. The row of a table represents the horizontal
arrangement of data and column represents the vertical arrangement of
data. The presentation of data in tables should be simple, systematic and
unambiguous.


The objectives of tabulation are to:


i. Simplify complex data
ii. Highlight important characteristics
iii. Present data in minimum space
iv. Facilitate comparison
v. Bring out trends and tendencies
vi. Facilitate further analysis

3.3.1 Basic differences between classification and tabulation


Although they are closely related, there are a few differences
between classification and tabulation. They are displayed in table 3.1.

Table 3.1: Differences between classification and tabulation

Classification                        Tabulation
It is the basis for tabulation        It is the basis for further analysis
It is the basis for simplification    It is the basis for presentation
Data is divided into groups and       Data is listed according to a logical
sub-groups on the basis of            sequence of related characteristics
similarities and dissimilarities

3.3.2 Parts of a table


In this section, you will study the parts of a table, which will help you in
creating accurate tables with the data given. The parts of table are
illustrated in figure 3.4 along with the explanation of each tab (tabs from
1 to 10).

[1] Table 3.2. [2] Percentage of P.G. employees in age group and department-wise
[9] (Age in years)

[5] Departments    [3] Age
                   [4] 20 – 40    40 & above
[6] Accounts       2.564          1.282
    Finance        2.564          1.795
    Personal       3.846          1.282        [7]
    Production     2.564          2.051
    Marketing      1.282          1.795
[8] Total          12.820         8.205

[10] Source: ………..

Fig. 3.4: Parts of a table

Tab 1: Table number


Table number is to identify the table for reference. When there are many
tables in an analysis, then table numbers are helpful in identifying the tables.
Tab 2: Title
Title indicates the scope and the nature of contents in concise form. In other
words, title of a table gives information about the data contained in the body
of the table. Title should not be lengthy.
Tab 3 and Tab 4: Captions
Captions are the headings and subheadings describing the data present in
the columns.


Tab 5 and Tab 6: Stubs


Stubs are the headings and subheadings of rows.
Tab 7: Body of the table
Body of the table contains numerical information.
Tab 8: Ruling and Spacing
Ruling and spacing separate columns and rows. However, totals are
separated from main body by thick lines.
Tab 9: Head Note
Head note is given below the title of the table to indicate the units of
measurement of the data and is enclosed in brackets.
Tab 10: Source Note
Source note indicates the source from which the data is taken. The source
note is placed at the bottom left-hand corner of the table.
3.3.3 Types of table
Tables are classified on three bases:
• Purpose of investigation
• Nature of presented figures
• Construction
Purpose of investigation
Tables classified under this classification are of two types. They are:
1. General purpose table
A general purpose table, or reference table, facilitates easy reference to the
collected data. Such tables are formed without a specific objective, but can
be used for any specific purpose. They contain a large mass of data. Example:
Census.
2. Specific purpose table
A specific purpose table, also called a text table or summary table, deals
with a specific problem. Such tables are smaller in size and highlight the
relationship between characteristics. Example: Cost of living indices.


The nature of presented figures


Tables classified on this basis are of two types. They are:
1. Primary table
Primary tables contain data in the form in which it was originally
collected. Table 3.3 is a primary table.
2. Derived table
Derived tables present figures such as totals, averages and ratios,
which are derived from original data. Table 3.4 is a derived table
obtained from table 3.3.

Table 3.3: Distribution of employees according to age and educational level in
various departments

                      Age 20 – 40                       Age 40 and Above
Departments   A: Under   B:         C: Post     A: Under   B:         C: Post     Total
              Graduate   Graduate   Graduate    Graduate   Graduate   Graduate
Accounts      10         40         10          10         15         5           90
Finance       10         30         10          12         14         7           83
Personal      15         25         10          10         14         5           79
Production    10         30         10          8          12         6           76
Marketing     5          25         10          0          15         7           62
Total         50         150        50          40         70         30          390

Table 3.4: Percentage of P.G. employees’ age group according to department

Departments    Age
               20 – 40    40 & above
Accounts       2.564      1.282
Finance        2.564      1.795
Personal       3.846      1.282
Production     2.564      2.051
Marketing      1.282      1.795
Total          12.820     8.205


Construction
Different types of tables under this classification of tables are:
1. Simple table
Simple table presents only one characteristic. The table illustrated in table
3.5 is a simple table.
2. Complex table
Complex table presents two or more characteristics. The table illustrated in
table 3.6 is a complex table.
3. Cross-classified table
In the cross-classified table, the entries are classified in both directions. An
example of cross-classified table is illustrated in table 3.7.

Table 3.5: Defectives produced by batches

Batches No. of defectives


1 15
2 20
3 40
4 50

Table 3.6: Distribution of defectives according to batch and nature of defects

Batch    Defects
         Major    Minor
I        8        7
II       15       5
III      25       15
Total    48       27


Table 3.7: Population of a city according to age, sex and education during
2003 to 2005

                         Educated (Age)                        Not Educated (Age)
Years   Sex      Below     20 – 40   Above   Total     Below     20 – 40   Above   Total
                 20 yrs              40                20 yrs              40
2003    Male
        Female
2004    Male
        Female
2005    Male
        Female

Self Assessment Questions


2. State whether the following statements are true or false.
i) Tabulation presents the data in a minimum space
ii) Tabulation is a process of analysis
iii) General purpose table deals with specific objective
iv) Derived tables deal with total, percentages, ratios and so on
v) Row of a table is represented by the vertical arrangement of data

3.4 Frequency and Frequency Distribution


The number of units associated with each value of the variable is called
frequency of that value. Suppose, the variable takes the value 15 and the
value 15 occurs 3 times, then 3 is called the frequency of the value 15.
A systematic presentation of the values taken by variable together with
corresponding frequencies is called a frequency distribution of the variable.
It is presented in a tabular form called a frequency table. If class intervals are
not present, then it is called a discrete frequency distribution and is
displayed in table 3.8. A frequency distribution formed with class-intervals is
called a continuous frequency distribution, which is represented in table 3.9.


Table 3.8: Discrete frequency distribution

Number of Children No. of families


0 15
1 20
2 22
3 16
4 7
Total 80

Table 3.9: Continuous frequency distribution

Marks No. of Students


0 – 20 15
20 – 40 20
40 – 60 28
60 – 80 22
80 – 100 15
Total 100
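As a rough illustration of how such tables are built from raw observations, the Python sketch below counts a discrete variable directly and groups a continuous variable into classes of equal width. Both lists of observations are hypothetical (they are not the data behind tables 3.8 and 3.9), and a value that falls on a class boundary, such as 20, is assumed to belong to the class that starts at that boundary.

```python
from collections import Counter

# Hypothetical raw data: number of children in each of ten families.
children = [0, 2, 1, 2, 3, 1, 0, 2, 4, 1]

# Discrete frequency distribution: one frequency per distinct value.
discrete = Counter(children)
for value in sorted(discrete):
    print(value, discrete[value])

# Hypothetical marks of ten students, grouped into classes of width 20.
marks = [5, 18, 20, 35, 47, 59, 60, 72, 88, 99]
width = 20

# Each mark is mapped to the lower limit of its class: 0, 20, 40, ...
continuous = Counter((m // width) * width for m in marks)
for lower in sorted(continuous):
    print(f"{lower} - {lower + width}", continuous[lower])
```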

A continuous frequency distribution is divided into mutually exclusive
sub-ranges called class-intervals. Class intervals have lower and upper limits
known as lower class limits and upper class limits respectively. The
difference between the upper class limit and the lower class limit is termed
the class width. The middle value of a class interval is called the mid-value
of the class. It is the average of the class limits.
Solved Problem 1: For the class 10 – 20, find the lower class limit and
the upper class limit. Find also the width of the class interval and the mid
value of the class.
Solution: For the class 10 – 20, the lower class limit and the upper class
limit are 10 and 20 respectively. The width of the class is 20 – 10 = 10. The
mid value of the class is calculated as:
Mid value of the class = (10 + 20) / 2 = 15
Therefore, the mid value of the class is 15.

Key Statistic
Class intervals are of two types; exclusive and inclusive. The class
interval that does not include upper class limit is called an exclusive type
of class interval. The class interval that includes the upper class limit is
called an inclusive type of class interval.

In table 3.10, the class '0 – 9' includes the value '9'. In table 3.11, the class
'0 – 10' does not include the value '10'. If the value '10' occurs, it is
included in the class '10 – 20'.
Table 3.10: Inclusive type class interval
Marks Number of Students
0–9 15
10 – 19 20

Table 3.11: Exclusive type class interval


Marks Number of Students
0-10 15
10-20 20
20-30 28

Solved Problem 2: In a country music band of 48 members, 22 play the
guitar, 12 play brass and 14 play the piano. Provide a tabular display of the
frequency and relative frequency of the type of instrument for this
music band.
Solution: Table 3.12 represents the data given in solved problem 2.
Table 3.12. Frequency distribution of the type of instruments for the country
music band
Type of Instrument    Frequency    Relative Frequency
Guitar                22           22/48 = 0.4583
Brass                 12           12/48 = 0.2500
Piano                 14           14/48 = 0.2917
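The division in the last column can be checked with a short Python sketch. The variable names are chosen for this illustration only; the counts are those given in solved problem 2.

```python
# Frequencies from solved problem 2.
instruments = {"Guitar": 22, "Brass": 12, "Piano": 14}

total = sum(instruments.values())  # total number of band members

# Each frequency divided by the total, rounded to four decimal places.
relative = {name: round(count / total, 4) for name, count in instruments.items()}

for name, share in relative.items():
    print(f"{name:<8}{instruments[name]:>4}{share:>10.4f}")
```

The three shares add up to one, which is a quick check that no member has been left out or counted twice.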


3.4.1 Derived frequency distributions


From a given frequency distribution, we can form five derived frequency
distributions. They are:
i) Relative frequency distribution
If 'f' is the class frequency and 'N' is the total frequency, the relative
frequency distribution is formed by calculating f/N. Total of all the values
of relative frequency distribution will always be one.
ii) Percentage frequency distribution
The percentage frequency distribution is formed by multiplying
the ratio f/N by 100.
iii) Frequency density distribution
If “c” is the width of the class-interval and “f” is the frequency of the
class, then frequency density distribution is formed by calculating f/c.
iv) Less than cumulative frequency distribution
The less than cumulative frequency distribution is formed with number of
observations which are less than a given value.
v) More than cumulative frequency distribution
The more than cumulative distribution is formed with number of
observations, which are more than a given value.
Solved Problem 3: Consider the frequency distribution of marks given in
table 3.9. Form the derived frequency distributions, including the less than
and more than cumulative frequency distributions.
Solution: The derived frequency distributions are displayed in tables 3.13a,
3.13b and 3.13c.
Table 3.13a. Forms of derived frequency distribution
Marks       Relative freq.    Density         Percentage
            distribution      distribution    distribution
0 – 20      0.15              0.75            15
20 – 40     0.20              1.00            20
40 – 60     0.28              1.40            28
60 – 80     0.22              1.10            22
80 – 100    0.15              0.75            15
Total       1.00              –               100 %


Table 3.13b. Less than cumulative frequency distribution


Marks less than Less than cumulative frequency
0 0
20 15
40 35
60 63
80 85
100 100

Table 3.13c. More than cumulative frequency distribution


Marks more than More than cumulative frequency
0 100
20 85
40 65
60 37
80 15
100 0
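The five derived distributions of solved problem 3 can be reproduced with the following Python sketch. The class limits and frequencies are those of table 3.9; the variable names are illustrative only.

```python
from itertools import accumulate

# Class limits and frequencies from table 3.9.
classes = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]
freq = [15, 20, 28, 22, 15]
N = sum(freq)  # total frequency

relative = [f / N for f in freq]          # f / N
percentage = [100 * f / N for f in freq]  # (f / N) x 100
density = [f / (upper - lower) for f, (lower, upper) in zip(freq, classes)]  # f / c

# Less than cumulative frequency at each upper class limit (20, 40, ..., 100).
less_than = list(accumulate(freq))

# More than cumulative frequency at each lower class limit (0, 20, ..., 80).
more_than = [N - below for below in [0] + less_than[:-1]]

print(less_than)
print(more_than)
```

Tables 3.13b and 3.13c each contain one further row (0 for marks less than 0, and 0 for marks more than 100), which follows directly from the range of the data.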

3.4.2 Bivariate and multivariate frequency distribution


Frequency distribution of more than two variables is known as a multivariate
frequency distribution. If the number of variables is only two, then it is called
a bivariate frequency distribution. A bivariate frequency distribution with 'm'
rows and 'n' columns will have two marginal distributions and 'm+n'
conditional distributions.
In table 3.14a, the numbers in the last column represent the marginal
distribution of age and the numbers in the last row represent the marginal
distribution of salary. Any single column (excluding the totals) represents a
conditional distribution of age for a given salary, and any single row
represents a conditional distribution of salary for a given age. There are
4 rows (m = 4) and 3 columns (n = 3), so we have 4 + 3 = 7 conditional
distributions.


Table 3.14a. Distribution of age and salary


Age in Salary / Month (Rs.)
years 9,000 – 12,000 12,000 – 15,000 15,000 – 18,000 Total
20 – 30 10 3 0 13
30 – 40 8 12 3 23
40 – 50 6 15 10 31
50 – 60 0 3 18 21
Total 24 33 31 88

Table 3.14b. Conditional distribution of age for given salary


Age Salary (9000 - 12000)
20-30 10
30-40 8
40-50 6
50-60 –
Total 24

Table 3.14c. Conditional distribution of salary for given age


Salary           Age (40-50)
9000-12000       6
12000-15000      15
15000-18000      10
Total            31
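The marginal and conditional distributions can be read off the body of table 3.14a mechanically, as the Python sketch below shows. The cell frequencies are those of table 3.14a; the list layout and names are choices made for this illustration.

```python
# Body of table 3.14a: rows are age groups, columns are salary groups.
ages = ["20-30", "30-40", "40-50", "50-60"]
salaries = ["9,000-12,000", "12,000-15,000", "15,000-18,000"]
table = [
    [10, 3, 0],
    [8, 12, 3],
    [6, 15, 10],
    [0, 3, 18],
]

# Marginal distributions: row totals (age) and column totals (salary).
marginal_age = [sum(row) for row in table]
marginal_salary = [sum(column) for column in zip(*table)]

# Conditional distribution of age, given salary 9,000-12,000 (first column).
age_given_salary = [row[0] for row in table]

# Conditional distribution of salary, given age 40-50 (third row).
salary_given_age = table[2]

# One conditional distribution per row and one per column.
n_conditional = len(ages) + len(salaries)

print(marginal_age, marginal_salary, n_conditional)
```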

3.4.3 Construction of frequency distribution


The steps followed to construct a frequency distribution table are:
i. Determine the range = Highest value – Lowest value
ii. The number of class intervals is given by Sturges' rule, that is,
K = 1 + 3.322 log N (logarithm to base 10), where N is the total number
of observations.
iii. The width of the class interval is given by Range/K
In practice, divide the range by 2, 5, 10 or a multiple of 10, so
that the number of class intervals lies between 7 and 15. Avoid open-end
class intervals. Make sure that class intervals do not overlap.
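The steps above can be sketched in Python. The sketch assumes Sturges' rule in its usual form, K = 1 + 3.322 log10 N, and takes the class width as the range divided by K; the function names and the sample data are hypothetical.

```python
import math


def sturges_classes(n):
    """Number of class intervals, K = 1 + 3.322 * log10(n), rounded to an integer."""
    return round(1 + 3.322 * math.log10(n))


def class_width(values):
    """Return (K, width), where width is the range divided by K."""
    k = sturges_classes(len(values))
    data_range = max(values) - min(values)  # highest value - lowest value
    return k, data_range / k


# Hypothetical data: 100 observations taking the values 0, 1, ..., 99.
observations = list(range(100))
k, width = class_width(observations)
print(k, width)
```

In practice the computed width would then be rounded to a convenient figure such as 10 before the classes are written down.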


Key Statistic
If the class interval does not prescribe lower limit for first class or upper
limit for the last class, then it is known as open-end class interval.

Tally marks
Tally marks are used to construct frequency table. Tally mark is a small
vertical line drawn against a class as soon as we observe a value belonging
to the class. The fifth tally mark is crossed for easy counting purposes. The
table 3.15 represents the marks secured in mathematics by the students of
a class.
Example 4
From table 3.15, we can see that ten students got 90 marks in
mathematics, six students got 82 and seven got 75.

Table 3.15. Marks secured by students in mathematics

Marks secured in mathematics    Number of Students
90                              10
82                              6
75                              7
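Tallying can also be imitated in plain text. In the Python sketch below, the list of marks is reconstructed from the counts stated in Example 4, and the notation "||||/" for a completed group of five (four strokes crossed by a fifth) is a convention adopted only for this illustration.

```python
from collections import Counter

# Marks reconstructed from Example 4: ten 90s, six 82s and seven 75s.
marks = [90] * 10 + [82] * 6 + [75] * 7


def tally(count):
    """Render a count as tally marks, writing each full group of five as '||||/'."""
    groups, rest = divmod(count, 5)
    parts = ["||||/"] * groups
    if rest:
        parts.append("|" * rest)
    return " ".join(parts)


frequencies = Counter(marks)
for mark in sorted(frequencies, reverse=True):
    print(mark, tally(frequencies[mark]), frequencies[mark])
```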

Self Assessment Questions


3. Fill in the blanks.
i) If the data readings are 3, 4, 5, 6, 7, then it is called _________
variable.
ii) Height is generally __________ variable.
iii) There are ____________ derived frequency distributions for any
frequency distribution.
iv) Width of class-interval is given by the difference between ________
and ______.
v) There are ________ marginal distributions for a distribution.

vi) __________ formula is used to calculate the number of
class-intervals.
vii) The relative frequency distribution is obtained from frequency
distribution by calculating ___________.

3.5 Presentation of Data


Top management and the general public do not have the time to go through
masses of data to understand its nature. For them, diagrammatic and graphical
presentations are more intelligible, attractive and appealing.
Diagrammatic representations give a bird's-eye view of the data, facilitate
comparison of various aspects of the data and create lasting
impressions. However, they cannot be considered alternatives to
numerical data: mathematical calculations cannot be performed on them, and
they do not give accurate values.
3.5.1 Diagrams
Diagrams may be one-dimensional or two-dimensional. Bar diagrams are
one-dimensional and pie diagrams are two-dimensional. The different bar
diagrams are the simple bar diagram, multiple bar diagram, component
(sub-divided) bar diagram and percentage bar diagram.
Simple bar diagram
Simple bar diagram is drawn when items are to be compared with respect to
a single characteristic. A rectangular bar is constructed with height
proportional to the magnitude of the items.
Multiple bar diagram
Multiple bar diagrams are drawn when we have two or more sets of
comparable values.
Solved Problem 4: Table 3.16 represents the data regarding the yield per
acre of paddy in Karnataka over five years. Represent the data in a
bar diagram.

Table 3.16. Data regarding the yield per acre in Karnataka

Year 2001 2002 2003 2004 2005


Yield 20 22 25 27 30


Solution: The simple bar diagram in figure 3.5 shows the yield of paddy in
Karnataka.

Fig. 3.5: Simple bar diagram showing yield of paddy in Karnataka

Solved Problem 5: Create a multiple bar diagram for the data represented
in the table 3.17.
Table 3.17. Product A
Year           Cost of Manufacturing / Unit    Revenue / Unit
2002 – 2003    40                              70
2003 – 2004    45                              85
2004 – 2005    55                              90

Solution: The multiple bar diagram in figure 3.6 shows the cost and
revenue per unit.

Fig. 3.6: Multiple bar diagram showing the cost and revenue per unit


Component (sub-divided) bar diagram


Component bar diagrams are used when two or more characteristics are
observed on a unit. Each bar is proportionally subdivided.
Solved Problem 6: Represent the data in table 3.18 by a suitable bar
diagram.

Table 3.18. MBA Students according to their graduation course

Course     No. of Students
           Sec A    Sec B
B.E        10       5
M.Tech     15       10
MBBS       10       15
B.Com      35       30
BBM        30       40
Total      100      100

Solution: Figure 3.7 displays the component bar diagram of the
composition of MBA students according to their graduation course.

Fig 3.7: Component bar diagram

Key Statistic
It is easier to draw the bar diagram, if we first find the cumulative total for
each section.
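The cumulative totals mentioned above give the heights at which each component of a bar ends. A minimal Python sketch for the data of table 3.18 (the variable names are illustrative only):

```python
from itertools import accumulate

# Data from table 3.18.
courses = ["B.E", "M.Tech", "MBBS", "B.Com", "BBM"]
sec_a = [10, 15, 10, 35, 30]
sec_b = [5, 10, 15, 30, 40]

# Running totals: the height at which each component of the bar ends.
cum_a = list(accumulate(sec_a))
cum_b = list(accumulate(sec_b))

for course, a, b in zip(courses, cum_a, cum_b):
    print(f"{course:<8}{a:>5}{b:>5}")
```

The last cumulative total of each section equals the section total, which fixes the full height of the bar.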

Component pie diagram


It is drawn when data have magnitudes for two or more components. Circles
with areas proportional to the total magnitudes are drawn, so the radii are
proportional to the square roots of the totals. Each circle is then divided
into sectors according to the magnitudes of the components. If 'T' is the
total magnitude and 'R' is the magnitude of a component, then the angle 'A'
(in degrees) at the centre is given by:

A = (360 × R) / T
Solved Problem 7: Draw pie-diagram for the data in table 3.19, regarding
expenses of Prasad‟s family and Krishna‟s family.
Table 3.19. Monthly expenses of two families
Monthly Expenses of
Items
Prasad’s Family Krishna’s Family
Food 2000 4000
Rent 1000 1500
Fuel 500 1000
Misc 500 1500
Total 4000 8000

Solution: The areas of the circles should be in the ratio of the total
expenses, so the radii should be in the ratio of their square roots:

√4000 : √8000
= 63.245 : 89.443
= 1.26 : 1.79 (taking 1 cm = 50 units)

We draw two circles with radii 1.3 cm and 1.8 cm.
The angles at the centre are determined and presented in table 3.20.


Table 3.20. Monthly expenses represented in angles (in degrees)

Items    Monthly Expenses of
         Prasad’s Family    Krishna’s Family
Food     180                180
Rent     90                 67.5
Fuel     45                 45
Misc     45                 67.5
Total    360                360

Fig. 3.8: Pie-chart showing monthly expenses of Prasad’s family

Fig. 3.9: Pie-chart showing monthly expenses of Krishna’s family
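The angles and radii used in solved problem 7 can be verified with the Python sketch below. The expense figures are those of table 3.19 and the scale of 1 cm = 50 units is the one used in the solution; the function names are chosen for this illustration.

```python
import math

# Monthly expenses from table 3.19.
prasad = {"Food": 2000, "Rent": 1000, "Fuel": 500, "Misc": 500}
krishna = {"Food": 4000, "Rent": 1500, "Fuel": 1000, "Misc": 1500}


def angles(expenses):
    """Angle at the centre for each item: 360 x R / T, in degrees."""
    total = sum(expenses.values())
    return {item: 360 * amount / total for item, amount in expenses.items()}


def radius_cm(expenses, units_per_cm=50):
    """Radius proportional to the square root of the total, scaled to centimetres."""
    return math.sqrt(sum(expenses.values())) / units_per_cm


print(angles(prasad), round(radius_cm(prasad), 2))
print(angles(krishna), round(radius_cm(krishna), 2))
```

Taking the square root of the totals keeps the areas of the two circles, rather than their radii, in the ratio 4000 : 8000.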

Self Assessment Questions


4. State whether the following statements are true or false.

i) Diagrams give accurate value.
ii) Pie diagram is drawn according to degree subtended at the
center of a circle.
iii) Simple bar diagram is drawn for multiple characteristics.

3.6 Graphical Presentation


Graphs are used mainly for frequency distributions. Some types of graphs
are:
i) Histogram
ii) Frequency polygon
iii) Frequency curve
iv) Ogives [cumulative frequency curves]
3.6.1 Histogram
The frequency distribution is represented by a set of rectangular bars with
area proportional to class frequency. If the class intervals have equal width,
then the variable is taken along the X-axis and frequency along the Y-axis,
and a rectangle with height equal to the class frequency is constructed on
each class interval.
Solved Problem 8: Draw a histogram for the distribution of age as shown in
table 3.21.
Table 3.21. Distribution of age
Age 0-10 10-20 20-30 30-40 40-50
No. of people 5 10 15 12 8

Solution: The figure 3.10 displays the histogram for the distribution of age
data.

Fig. 3.10: Histogram for the distribution of age


To locate the mode, join the upper left corner of the highest rectangle to
the upper left corner of the rectangle on its right, and the upper right
corner of the highest rectangle to the upper right corner of the rectangle on
its left. From the point where these two lines intersect, draw a
perpendicular to the X-axis. The X-reading at the foot of the perpendicular
gives the mode of the distribution.
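For class intervals of equal width, the point of intersection in this construction can also be computed. If l is the lower limit of the highest (modal) class, c its width, f1 its frequency, and f0 and f2 the frequencies of the classes to its left and right, the two lines meet at x = l + c × (f1 – f0) / (2 × f1 – f0 – f2). The Python sketch below applies this to the data of table 3.21; the function name is hypothetical, and the formula is stated here as the algebraic counterpart of the drawing rather than as a method given in this unit.

```python
def mode_from_histogram(lower_limits, width, freq):
    """X-coordinate where the two diagonals of the modal rectangle intersect.

    Assumes equal class widths and a single highest class whose frequency
    exceeds that of at least one neighbour.
    """
    i = max(range(len(freq)), key=freq.__getitem__)  # index of the modal class
    f1 = freq[i]
    f0 = freq[i - 1] if i > 0 else 0                 # class to the left
    f2 = freq[i + 1] if i + 1 < len(freq) else 0     # class to the right
    return lower_limits[i] + width * (f1 - f0) / (2 * f1 - f0 - f2)


# Data from table 3.21: classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 10, 15, 12, 8]

mode = mode_from_histogram(lower_limits, 10, frequencies)
print(mode)
```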
If the widths of the rectangles are not equal, then we make the areas of the
rectangles proportional to the class frequencies and draw the histogram. This
is explained in solved problem 9.
Solved Problem 9: Suppose we have the frequency distribution shown in
table 3.22a. Draw a histogram for the data.
Table 3.22a. Frequency distribution data for solved problem 9

Age Frequency
0-10 5
10-30 20
30-60 45
60-70 12
70-90 16

Solution: Table 3.22a shows that the class intervals are unequal. Hence, the
class intervals are made equal and the frequencies are adjusted accordingly.
For the class interval 10-30:
• Divide the class interval into two equal class intervals
• Calculate the adjusted frequency by dividing the frequency of that class
interval by 2
Similarly, follow the procedure for the other unequal class intervals. Then, we
can construct the histogram with the adjusted frequencies. Table 3.22b shows
the class intervals along with the adjusted frequencies.

Table 3.22b. Adjusted frequency distribution of data for solved problem 9


Age Adjusted Frequency
0-10 5
10-20 10
20-30 10
30-40 15
40-50 15
50-60 15
60-70 12
70-80 8
80-90 8

The figure 3.11 displays the histogram for the distribution of age data when
the class intervals are unequal.

Fig. 3.11: Histogram when class intervals are unequal, for the data given in
solved problem 9
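
The adjustment of frequencies for unequal class intervals can be sketched in a few lines of Python. This is only an illustrative sketch: the helper name `adjust_frequencies` and the choice of 10 as the common width are assumptions made here, not part of the original solution.

```python
# Hypothetical helper: split each wide class into sub-classes of a common width
# and share its frequency equally among them (the adjustment used in Table 3.22b).
def adjust_frequencies(classes, base_width):
    adjusted = []
    for lower, upper, freq in classes:
        parts = (upper - lower) // base_width  # number of equal sub-classes
        for k in range(parts):
            start = lower + k * base_width
            adjusted.append((start, start + base_width, freq / parts))
    return adjusted

# (lower limit, upper limit, frequency) taken from Table 3.22a
table_3_22a = [(0, 10, 5), (10, 30, 20), (30, 60, 45), (60, 70, 12), (70, 90, 16)]
adjusted = adjust_frequencies(table_3_22a, base_width=10)

for lower, upper, freq in adjusted:
    print(f"{lower}-{upper}: {freq:g}")
```

The printed classes and frequencies should agree with the adjusted frequencies listed in Table 3.22b.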

3.6.2 Frequency polygon


The mid values of class intervals are plotted against frequency of the class
interval. These points are joined by straight lines and hence the frequency
polygon is obtained.
Solved Problem 10: Construct a frequency polygon for the data
represented in table 3.20.
Solution: The figure 3.12 shows the frequency polygon for the data in
solved problem 10.

Fig. 3.12: Frequency polygon

3.6.3 Frequency curve


First we draw the histogram for the given data. Then we join the mid points of
the tops of the rectangles by a smooth curve. The total area under the
frequency curve represents the total frequency. Frequency curves are the most
useful form of frequency distribution.
Solved Problem 11: Construct a frequency curve for the data represented
in table 3.19.
Solution: The figure 3.13 shows the frequency curve for the data in solved
problem 11.

Fig. 3.13: Frequency curve for the data of solved problem 11

3.6.4 Ogives
An ogive is obtained by drawing the graph of a cumulative frequency
distribution. Hence, ogives are also called cumulative frequency curves.
Since a cumulative frequency distribution can be of 'less than' or 'greater
than' type, we have less than and greater than types of ogives.
Less than ogive
Variables are taken along X-axis and less than cumulative frequencies are
taken along Y-axis. Less than cumulative frequencies are plotted against
upper limit of class interval and joined by a smooth-curve.
More than ogive
More than cumulative frequencies are plotted against lower limit of the
class-interval and joined by a smooth-curve.
From the meeting point of these two ogives if we draw a perpendicular to
X-axis, the point where it meets X-axis gives median of the distribution.
Solved Problem 12: Construct an ogive curve for the data represented in
table 3.23 and determine the median.
Table 3.23. Wage distribution of workers

Wage / day   No. of workers   Less than               Greater than
                              Wage   Cum. frequency   Wage   Cum. frequency
0 – 10        5               10      5                0     50
10 – 20      10               20     15               10     45
20 – 30      20               30     35               20     35
30 – 40      12               40     47               30     15
40 – 50       3               50     50               40      3
Total        50                                       50      0

Solution: The figure 3.14 displays the ogive curve for the data related to
wage distribution of workers.

Fig. 3.14: Ogive curves for the data of solved problem 12

Key Statistic
With the help of an ogive, we can find all positional values of a
distribution. An ogive curve gives, at a glance, percentage of readings
that lie above or below a specified value.
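
As a rough numerical companion to the ogives, the sketch below builds the 'less than' and 'greater than' cumulative frequencies for the wage data of Table 3.23 and estimates the median as the wage at which the cumulative frequency reaches half the total. The variable names are illustrative, and straight-line interpolation within a class is an assumption made here; the text itself joins the points by smooth curves.

```python
from itertools import accumulate

# Wage distribution of workers (Table 3.23)
lower_limits = [0, 10, 20, 30, 40]
upper_limits = [10, 20, 30, 40, 50]
workers = [5, 10, 20, 12, 3]

# 'Less than' cumulative frequencies, plotted against the upper limits
less_than = list(accumulate(workers))
total = less_than[-1]

# 'Greater than' cumulative frequencies, plotted against the lower limits
more_than = [total - c for c in [0] + less_than[:-1]]

# The two ogives meet where the cumulative frequency equals total / 2.
# Assumption: linear interpolation inside the class that contains that point.
half = total / 2
for lo, hi, f, cf in zip(lower_limits, upper_limits, workers, less_than):
    if cf >= half:
        median = lo + (half - (cf - f)) / f * (hi - lo)
        break

print(less_than)  # [5, 15, 35, 47, 50]
print(more_than)  # [50, 45, 35, 15, 3]
print(median)     # 25.0
```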

3.7 Summary
For better understanding and usefulness, the collected data is classified in a
systematic manner according to common characteristics. Classification
simplifies and makes data more comprehensible and renders the data ready
for statistical analysis.
Classified data is tabulated in rows and columns for presentation, using
various types of classification. The tabulated data should be simple and
unambiguous, which should be understood and interpreted easily.
Frequency distribution is a special type of tabulation. In more concise form,
it brings out the salient features of the distribution.
Data presented in diagram or graphical form is more appealing and gives
rough idea of the situation for busy executives.
Graphical data is visual representation of data in the form of line diagrams,
pie-charts, histograms, frequency polygons, frequency curves, or ogives.

In a pie chart, different segments of a circle represent percentage
contribution of various components to the total. It brings out the relative
importance of various components of data.
The graph of cumulative frequency distribution is the ogive curve.

3.8 Terminal Questions

1. Form a frequency distribution for the following data regarding the weight
of 50 people:

50 72 61 64 72 62 61 56 75 55
52 71 54 64 71 64 59 59 70 54
60 60 57 57 66 68 60 62 68 54
62 65 58 64 65 60 60 67 58 56
70 62 60 68 64 62 59 69 52 58
2. Junior executive of XYZ Company has prepared budget for a new
division of the company. The budget data is shown in table 3.24. Vice
president of the company wanted to see the summary of the budget in
diagrammatic form. Prepare a pie diagram.
Table 3.24. Budget of an XYZ Company

Category Rs. in Lakhs


Capital investment 140
Salary and wages 65
Raw material 100
Research and development expenses 15
Miscellaneous 40

3. ABC Ice cream Company attempts to keep all of its ten flavours of ice
cream in stock at each of its stores. In-charge of stores operation
collects data to the nearest half gallon on the daily amount of each
flavour.
i. Is the flavour classification discrete or continuous? Open or closed?
ii. Data collected, is it qualitative or quantitative?
iii. Is the amount collected on each flavour discrete or continuous?

4. Construct histogram for the data represented in table 3.25.


Table 3.25. Frequency table data for the terminal question 4

Class 0-5 5-10 10-15 15-20 20-25 25-30


Frequency 4 6 10 5 3 1

5. An association of real estate sellers has collected data on a sample of 100
sales people with respect to the monthly commission earned by them. The
data is represented in table 3.26. Construct an ogive. Find:
i. What proportion of sales people earn more than 25,000?
ii. What proportion earn between 15,000 and 25,000?
Table 3.26. Collected data of 100 people with respect to commissions earned

Earnings        5000 or less   5000-10000   10000-15000   15000-20000   20000-25000   25000-30000
No. of people   5              9            13            30            27            16

3.9 Answers to SAQs and TQs

Answers to Self Assessment Questions


1. i. Grouping, Common, Characteristics.
ii. Bulk
iii. Attribute
iv. Series
v. Two
vi. Series
2. i. True ii. False iii. False iv. True v. False
3. i. Discrete variable
ii. Continuous variable
iii. Five
iv. Upper class limit and lower class limit
v. Two
vi. Sturges'
vii. F / N
4. i. False ii. True iii. False

Answers to Terminal Questions


1. The solution for the terminal question 1 is represented in the table 3.27.
Table 3.27. Frequency distribution table for the data in terminal question 1

Class Interval   Frequency
50-55             6
55-60            11
60-65            18
65-70             8
70-75             6
75-80             1
Total            50

2. Table 3.28 displays the data required to construct the pie-chart
(figure 3.15) for the budget data of XYZ Company. Since the total budget is
Rs. 360 lakhs, each lakh corresponds to an angle of 1 degree, so the angles
are numerically equal to the amounts.
Table 3.28. Budget of an XYZ Company

Category                            Angle subtended at the centre of circle (degrees)
Capital investment                  140
Salary and wages                     65
Raw material                        100
Research and development expenses    15
Miscellaneous                        40
Fig. 3.15: Pie-chart representation of data of terminal question 2

3.
i. Discrete and closed
ii. Flavour is qualitative. Volume is quantitative
iii. Continuous
4. The figure 3.16 illustrates the histogram for the data in terminal
question 4.

Fig. 3.16: Histogram for the data in terminal question 4

5. The figure 3.17 is the ogive curve for the data given in terminal
question 5.
i. 16% ii. 57% (30 + 27 of the 100 people)

Fig. 3.17: Ogive curve for the data of terminal question 5

3.10 References
• B.L. Agarwal, (2006) Basic Statistics, Fourth Edition, New Age
International Publishers
• Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
• http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-5.pdf

Statistics for Management Unit 4

Unit 4 Measures Used to Summarise Data


Structure:
4.1 Introduction
Learning objectives
Objectives of statistical average
4.2 Requisites of a Good Average
4.3 Statistical Averages
Arithmetic mean
Properties of arithmetic mean
Merits and demerits of arithmetic mean
4.4 Median
Merits and demerits of median
4.5 Mode
Merits and demerits of mode
4.6 Geometric Mean
4.7 Harmonic Mean
4.8 Appropriate Situations for the Use of Various Averages
4.9 Positional Averages
4.10 Dispersion
Range
Quartile deviations
Mean deviation
4.11 Standard Deviation
Properties of standard deviation
4.12 Coefficient of Variance
4.13 Summary
4.14 Terminal Questions
4.15 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
4.16 References

4.1 Introduction
In unit 3, 'Classification, Tabulation and Presentation of Data', you studied
data classification and the representation of data in tables and graphs. In
this unit, 'Measures Used to Summarise Data', you will study the measures
used to summarise data, such as mean, median and mode.
Graphical representation is a good way to represent summarised data.
However, graphs provide only an overview and thus may not be used for
further analysis. Hence, we use summary statistics, such as averages, to
analyse the data. Mass data, which is collected, classified, tabulated and
presented systematically, is analysed further to reduce it to a single
representative figure. This single figure is usually found in the central part
of the range of all values and represents the entire data set. Hence, it is
called the measure of central tendency.
In other words, the tendency of data to cluster around a figure which is in a
central location is known as central tendency. A measure of central tendency,
or average of first order, describes the concentration of a large number of
values around a particular value. It is a single value which represents all
units.
4.1.1 Learning objectives
By the end of this unit, you will be able to:
• Describe the concept of statistical average
• Calculate arithmetic mean for discrete and continuous data
• Calculate median and mode of data
• Calculate quartiles, deciles and percentiles for the statistical data
• Compute coefficient of variance for the statistical data
4.1.2 Objectives of statistical average
The statistical average, or simply an average, refers to a measure of the
middle value of the data set. The objectives of statistical average are to:
• Present mass data in a concise form
The mass data is condensed to make the data readable and to use it for
further analysis.
• Facilitate comparison
It is difficult to compare two different sets of mass data directly, but we
can compare them after computing the average of each data set. While
comparing, the same measure of average should be used. Comparing the
mean salary of one group of employees with the median salary of another
leads to incorrect conclusions.
• Establish relationship between data sets
The average can be used to draw inferences about the unknown
relationships between the data sets. Computing the averages of the data
sets is helpful for estimating the average of the population.
• Provide basis for decision-making
In many fields, such as business, finance, insurance and other sectors,
managers compute the averages and draw useful inferences or
conclusions for taking effective decisions.

4.2 Requisites of a Good Average


The following are the requisites of a good average:
• It should be simple to calculate and easy to understand
• It should be based on all values
• It should not be affected by extreme values
• It should not be affected by sampling fluctuation
• It should be rigidly defined
• It should be capable of further algebraic treatment

4.3 Statistical Averages


The commonly used statistical averages are arithmetic mean, geometric
mean and harmonic mean.
4.3.1 Arithmetic mean
Arithmetic mean is defined as the sum of all values divided by the number of
values and is represented by $\bar{X}$.
Before you study how to compute arithmetic mean, you have to be familiar
with the terms such as discrete data, frequency and frequency distribution,
which are used in this unit.
If the number of values is finite, then the data is said to be discrete data.
The number of occurrences of each value of the data set is called frequency
of that value. A systematic presentation of the values taken by variable
together with corresponding frequencies is called a frequency distribution of
the variable.

Key Statistic
For discrete data, the arithmetic mean is given by:
$$\bar{X} = \frac{\sum X_i}{n}$$

Key Statistic
For discrete data with frequency, the arithmetic mean is given by:
$$\bar{X} = \frac{\sum f_i X_i}{\sum f_i}$$

Solved Problem 1: Find the arithmetic mean of 15, 17, 22, 21, 19, 26, 20.
Solution: The arithmetic mean $\bar{X}$ is given by:
$$\bar{X} = \frac{15 + 17 + 22 + 21 + 19 + 26 + 20}{7} = \frac{140}{7} = 20$$
Therefore, the arithmetic mean is 20.
Solved Problem 2: The data in table 4.1 shows the number of students with
respect to age. Calculate the arithmetic mean of the students' ages.

Table 4.1: Number of students with respect to age

Students' Age, (x)        20   23   25   28   30
Number of Students, (f)    3    5   10    6    1

Solution: The arithmetic mean $\bar{X}$ is given by:
$$\bar{X} = \frac{20 \times 3 + 23 \times 5 + 25 \times 10 + 28 \times 6 + 30 \times 1}{3 + 5 + 10 + 6 + 1} = \frac{623}{25} = 24.92$$

Therefore, the arithmetic mean $\bar{X}$ is 24.92.
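
The two formulas for discrete data can be checked with a short Python sketch. The function name `weighted_mean` is an illustrative choice made here; a series without frequencies is treated as having frequency 1 for every value.

```python
def weighted_mean(values, freqs):
    """Arithmetic mean of discrete data with frequencies: sum(f*x) / sum(f)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)

# Solved Problem 1: every value occurs once
values = [15, 17, 22, 21, 19, 26, 20]
simple_mean = weighted_mean(values, [1] * len(values))

# Solved Problem 2: ages with the number of students as frequency
ages = [20, 23, 25, 28, 30]
students = [3, 5, 10, 6, 1]
mean_age = weighted_mean(ages, students)

print(simple_mean)         # 20.0
print(round(mean_age, 2))  # 24.92
```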


Key Statistic
For continuous series, the arithmetic mean is given by:
$$\bar{X} = A + \frac{\sum fd}{\sum f} \times \text{C.I.}$$
where,
$$d = \frac{X - A}{\text{C.I.}}$$
C.I. is the width of the class interval
X is the mid value of the class
A is the assumed mean

Solved Problem 3: Table 4.2a shows the distribution of students according to
height. Table 4.2b shows the corresponding frequency table. Find the
arithmetic mean of the height of the students.
Table 4.2a. Number of students with respect to height

Height in cms, x        140-150   150-160   160-170   170-180
Number of Students, f   50        65        80        55

Table 4.2b. Frequency table for solved problem 3

Mid value (X)   f     d = (X – 155)/10   fd
145             50    –1                 –50
155             65     0                   0
165             80     1                  80
175             55     2                 110
Total           250                      140

Solution: The assumed mean is 155. Then the arithmetic mean is
calculated as:
$$\bar{X} = A + \frac{\sum fd}{\sum f} \times \text{C.I.} = 155 + \frac{140}{250} \times 10 = 155 + 5.6 = 160.6 \text{ cm}$$

Therefore, the arithmetic mean of the height of students is 160.6
centimeters.
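
The assumed-mean calculation of solved problem 3 can be reproduced with the following sketch. The function name `assumed_mean_average` and its parameter names are illustrative assumptions; the computation itself is the formula $\bar{X} = A + \frac{\sum fd}{\sum f} \times \text{C.I.}$ given above.

```python
def assumed_mean_average(mid_values, freqs, assumed, width):
    """Mean of grouped data by the assumed-mean (step deviation) method."""
    # d = (X - A) / C.I. for each class mid value
    d = [(x - assumed) / width for x in mid_values]
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    return assumed + sum_fd / sum(freqs) * width

mid_values = [145, 155, 165, 175]  # mid points of the height classes
freqs = [50, 65, 80, 55]           # number of students
mean_height = assumed_mean_average(mid_values, freqs, assumed=155, width=10)

# Direct calculation from the mid values, as a cross-check
direct = sum(f * x for x, f in zip(mid_values, freqs)) / sum(freqs)

print(round(mean_height, 1), round(direct, 1))  # 160.6 160.6
```

Both routes give 155 + 5.6 = 160.6 cm, which illustrates that the choice of assumed mean only simplifies the arithmetic and does not change the result.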
4.3.2 Properties of arithmetic mean
You have studied how to calculate arithmetic mean for grouped and
ungrouped data. Let us study about the properties of arithmetic mean which
are helpful in understanding the concept of arithmetic mean. The properties
of arithmetic mean are:
i. Algebraic sum of deviations of a set of values taken from their mean is
always zero, that is,
$$\sum (X - \bar{X}) = 0$$
ii. Sum of squares of deviations of a set of values from their mean is
always minimum, that is,
$$\sum (X - \bar{X})^2 \text{ is always minimum}$$
iii. Arithmetic mean is capable of further algebraic treatment. If
$\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k$ are the means of k sets of values
containing $n_1, n_2, \ldots, n_k$ values respectively, then their
combined arithmetic mean is given by:
$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + \cdots + n_k \bar{X}_k}{n_1 + n_2 + \cdots + n_k}$$

Solved Problem 4: If the average height of 30 men is 158 cm and the average
height of another group of 40 men is 162 cm, find the average height of the
combined group.
Solution: Given that,
$$n_1 = 30, \quad \bar{X}_1 = 158, \quad n_2 = 40, \quad \bar{X}_2 = 162$$
$$\bar{X} = \frac{30 \times 158 + 40 \times 162}{30 + 40} = \frac{11220}{70} = 160.28 \text{ cms}$$
The average height of the combined group is 160.28 cms.
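
A minimal sketch of the combined-mean formula, using the figures of solved problem 4. The function name `combined_mean` is an illustrative choice, and the function accepts any number of groups.

```python
def combined_mean(sizes, means):
    """Combined arithmetic mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# Two groups of men: 30 with mean height 158 cm, 40 with mean height 162 cm
combined = combined_mean([30, 40], [158, 162])
print(combined)  # about 160.2857, which the text reports as 160.28
```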
Solved Problem 5: From solved problem 4, if we are given any four of the
values $n_1$, $n_2$, $\bar{X}_1$, $\bar{X}_2$ and $\bar{X}$, we can find the fifth value. Suppose,
$$n_1 = 30, \quad n_2 = 40, \quad \bar{X}_2 = 162 \quad \text{and} \quad \bar{X} = 160.28$$
Find $\bar{X}_1$.
Solution: On substituting the given values in the following equation, we get,
$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$
Then,
$$160.28 = \frac{30 \bar{X}_1 + 40 \times 162}{30 + 40}$$
$$\Rightarrow 30 \bar{X}_1 + 40 \times 162 = 160.28 \times 70$$
$$\Rightarrow 30 \bar{X}_1 = 160.28 \times 70 - 6480$$
$$\Rightarrow 30 \bar{X}_1 = 11219.60 - 6480 = 4739.6$$
$$\Rightarrow \bar{X}_1 = \frac{4739.6}{30} \approx 157.99$$
Solved Problem 6: The data in table 4.3a shows the marks scored by the
students of a class in an examination. Calculate the mean of the marks
scored by the students.
Table 4.3a. Marks scored by students

Percentage marks     Less than 10   Less than 20   Less than 30   Less than 40   Less than 50   Less than 60   Less than 70
Number of students   4              16             20             65             85             97             100

Solution: In table 4.3a, the values given in the row 'number of students'
form a cumulative frequency distribution. We first convert it to a
frequency distribution by subtracting each cumulative frequency from the
next. The calculated values are shown in table 4.3b.

Table 4.3b. Frequency table for the solved problem 6

Marks (X)   Mid value of X   d = (Mid X – 35)/10   Frequency (f)   fd
0 – 10       5               –3                     4              –12
10 – 20     15               –2                    12              –24
20 – 30     25               –1                     4               –4
30 – 40     35                0                    45                0
40 – 50     45                1                    20               20
50 – 60     55                2                    12               24
60 – 70     65                3                     3                9
Total                                             100               13

The mean $\bar{X}$ is given by:
$$\bar{X} = 35 + \frac{13}{100} \times 10 = 36.3$$
Therefore, the mean score of the students is 36.3.
Solved Problem 7: The average weight of 100 screws in box 'A' is 10.4 gms.
It is mixed with 150 screws of box 'B'. The average weight of the mixed
screws is 10.9 gms. Find the average weight of the screws of box 'B'.
Solution: Given that:
$$n_1 = 100, \quad n_2 = 150, \quad \bar{X}_1 = 10.4 \quad \text{and} \quad \bar{X} = 10.9$$
$$\bar{X}_2 = ?$$
We know that:
$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$
$$\frac{100 \times 10.4 + 150 \bar{X}_2}{100 + 150} = 10.9$$
$$\Rightarrow 1040 + 150 \bar{X}_2 = 10.9 \times 250 = 2725$$
$$\Rightarrow 150 \bar{X}_2 = 2725 - 1040 \Rightarrow 150 \bar{X}_2 = 1685$$
$$\Rightarrow \bar{X}_2 = \frac{1685}{150} = 11.23 \text{ gms}$$

Therefore, the average weight of screws of box 'B' is 11.23 gms.


Solved Problem 8: A clerk calculated the arithmetic mean of 50 values as
39.2. However, it was found that instead of taking two values as 25 and 32,
he took them as 52 and 23. Find the corrected arithmetic mean.
Solution: Given that:
$$N = 50, \quad \bar{X} = 39.2$$
$$\therefore \text{Present Total} = N \bar{X} = 50 \times 39.2 = 1960$$
Corrected Total = Present Total – wrong values + correct values
$$\text{Corrected Total} = 1960 - 52 - 23 + 25 + 32 = 1942$$
$$\therefore \text{Corrected Average} = \frac{1942}{50} = 38.84$$
The corrected arithmetic mean, therefore, is 38.84.
Solved Problem 9: Find the missing frequency for the distribution in table
4.4a, given the mean value as 129.
Table 4.4a. Distribution table for solved problem 9

Class Interval   80-100   100-120   120-140   140-160   160-180   Total
Frequency        8        –         26        14        10        80

Solution: Let the missing frequency be 'f'. Taking the assumed mean
A = 110 and class width C.I. = 20, we get table 4.4b.

Table 4.4b. Frequency distribution table for solved problem 9

Mid X       d = (X – A)/C.I.   f        fd
90          –1                 8        –8
110 = A      0                 f         0
130          1                 26       26
150          2                 14       28
170          3                 10       30
Total                          58 + f   76

Since, in case of grouped data, the arithmetic mean is given by:
$$\bar{X} = A + \frac{\sum f_i d_i}{\sum f_i} \times \text{C.I.}$$
we have
$$129 = 110 + \frac{76}{58 + f} \times 20 \Rightarrow 19 = \frac{76}{58 + f} \times 20$$
that is,
$$19 f + 1102 = 1520 \Rightarrow 19 f = 418$$
$$f = 22$$
Hence, the missing frequency is 22. This agrees with the stated total of 80,
since 58 + 22 = 80.

4.3.3 Merits and demerits of arithmetic mean


The table 4.5 displays the merits and demerits of arithmetic mean.
Table 4.5. Merits and demerits of arithmetic mean
Merits:
• It is simple to calculate and easy to understand.
• It is based on all values.
• It is rigidly defined.
• It is more stable.
• It is capable of further algebraic treatment.
Demerits:
• It is affected by extreme values.
• It cannot be determined for distributions with open-end class intervals.
• It cannot be graphically located.
• Sometimes it is a value which is not in the series.

Self Assessment Questions


1. State whether the following statements are 'True' or 'False'.
i. For a given set of values if we add a constant 5 to every value, then
the arithmetic mean is affected.
ii. Arithmetic mean can be calculated for distribution with open-end
classes.
iii. Arithmetic mean is affected by extreme values.
iv. Arithmetic mean of 12, 16, 23, 25, 28, 32 is 22.

4.4 Median

Median of a set of values is the middle-most value when they are arranged
in ascending order of magnitude. Median is denoted by 'M'. In case of
discrete series, with or without frequency, it is given by:
$$M \text{ is the } \left( \frac{n + 1}{2} \right)^{\text{th}} \text{ value}$$

Key Statistic
To solve problems on median:
i. Arrange the data in ascending or descending order
ii. Make the class intervals of the exclusive type

Solved Problem 10: Find the median value of the following set of values:
45, 32, 31, 46, 40, 28, 27, 37, 36, 41, 47, 50.
Solution: Arranging in ascending order, we get:
27, 28, 31, 32, 36, 37, 40, 41, 45, 46, 47, 50
We have n = 12, so
$$\text{Median} = \left( \frac{12 + 1}{2} \right)^{\text{th}} \text{ value} = 6.5^{\text{th}} \text{ value}$$
The 6.5th value is the average of the 6th and 7th values:
$$\text{Median} = \frac{37 + 40}{2} = 38.5$$
The median for the given set of values is 38.5.
Solved Problem 11: Find the median value for the data shown in table
4.6a.
Table 4.6a. Data for solved problem 11

X   12   16   10   14   17   20   15
f    4    9    3    5    4    2   10

Solution: In this problem, we have n = 37. Arranging the values in
ascending order and accumulating the frequencies gives table 4.6b.

Table 4.6b. Frequency distribution table for solved problem 11

X    f    Cumulative frequency
10    3    3
12    4    7
14    5   12
15   10   22
16    9   31
17    4   35
20    2   37

$$\text{Median} = \left( \frac{n + 1}{2} \right)^{\text{th}} \text{ value} = \left( \frac{37 + 1}{2} \right)^{\text{th}} \text{ value} = 19^{\text{th}} \text{ value}$$
The cumulative frequency first reaches 19 at X = 15.
Therefore, the median, M is 15.
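
The positional rule for the median of a discrete series can be sketched as follows. The function name `median_from_frequencies` is illustrative; when n is even, the sketch averages the two middle values, which is how the 6.5th value is read in solved problem 10.

```python
def median_from_frequencies(values, freqs):
    """Median of a discrete series given values and their frequencies."""
    pairs = sorted(zip(values, freqs))  # arrange the values in ascending order
    n = sum(freqs)
    # 1-based middle positions; both are the same position when n is odd
    targets = [(n + 1) // 2, (n + 2) // 2]
    found = []
    for t in targets:
        running = 0
        for x, f in pairs:
            running += f  # cumulative frequency
            if running >= t:
                found.append(x)
                break
    return sum(found) / 2

# Solved Problem 11 (n = 37, so the 19th value is required)
m11 = median_from_frequencies([12, 16, 10, 14, 17, 20, 15], [4, 9, 3, 5, 4, 2, 10])

# Solved Problem 10 (n = 12, every value occurs once)
data10 = [45, 32, 31, 46, 40, 28, 27, 37, 36, 41, 47, 50]
m10 = median_from_frequencies(data10, [1] * len(data10))

print(m11, m10)  # 15.0 38.5
```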

Key Statistic
In case of continuous series, median M is given by:
$$M = \text{Lower limit of median class} + \frac{n/2 - Cf_p}{f_c} \times \text{C.I.}$$
where,
Cfp = Cumulative frequency up to previous class
fc = frequency of the median class
C.I. = Width of class interval

Solved Problem 12: Find the median of the data in table 4.7a.
Table 4.7a. Distribution of weight data for solved problem 12

Weight in Kg   30-35   35-40   40-45   45-50   50-55
Frequency      10      15      40      27      8

Solution: As the intervals are of the exclusive type, we organise the data as
shown in the table 4.7b. Here,
$$\frac{N}{2} = \frac{100}{2} = 50$$

Table 4.7b. Cumulative frequency table for data in solved problem 12

Weight   Frequency   Cumulative Frequency
30-35    10           10
35-40    15           25
40-45    40 (fc)      65
45-50    27           92
50-55     8          100

The median class is 40-45, the first class whose cumulative frequency
reaches 50.
$$M = \text{Lower limit of median class} + \frac{N/2 - Cf_p}{f_c} \times \text{C.I.}$$
where,
Lower limit of median class = 40
Cfp = Cumulative frequency up to previous class = 25
fc = frequency of the median class = 40
C.I. = Width of class interval = 5
$$\therefore \text{Median} = 40 + \frac{50 - 25}{40} \times 5 = 43.125$$
Hence, the median weight is 43.125 kg.
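
The median formula for a continuous series can be sketched as below. The function name `grouped_median` and its arguments are illustrative assumptions; the classes are assumed to be of equal width and of the exclusive type, as in the text.

```python
def grouped_median(lower_limits, freqs, width):
    """Median of a continuous series: L + (N/2 - Cf_p) / f_c * C.I."""
    half = sum(freqs) / 2
    cum_prev = 0  # cumulative frequency up to the previous class
    for lower, f in zip(lower_limits, freqs):
        if cum_prev + f >= half:  # first class whose cumulative frequency reaches N/2
            return lower + (half - cum_prev) / f * width
        cum_prev += f

# Weight data of Table 4.7a
median_weight = grouped_median([30, 35, 40, 45, 50], [10, 15, 40, 27, 8], width=5)
print(median_weight)  # 43.125
```

The same function can be used to check a missing-frequency answer: with the frequency 19 substituted into the distribution of solved problem 13, the computed median is 34.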
Solved Problem 13: Find the missing frequency for the data shown in table
4.8a, given that its median is 34.
Table 4.8a. Data for solved problem 13

Class interval   Frequency
0 – 10            4
10 – 20           9
20 – 30           –
30 – 40          20
40 – 50          18
50 – 60           7
60 – 70           3

Solution: Since the median is 34, it falls in the class interval 30-40. Let 'f'
be the missing frequency. Therefore, we have the data shown in table 4.8b.

Table 4.8b. Cumulative frequency distribution for data of solved problem 13

Class interval   Frequency   Cumulative frequency
0 – 10            4           4
10 – 20           9          13
20 – 30           f          13 + f
30 – 40          20          33 + f
40 – 50          18          51 + f
50 – 60           7          58 + f
60 – 70           3          61 + f

The median is given by:
$$\text{Median} = \text{L.L.} + \frac{N/2 - Cf_p}{f_c} \times \text{C.I.}$$
$$34 = 30 + \frac{(61 + f)/2 - (13 + f)}{20} \times 10$$
$$\Rightarrow 4 = \frac{61/2 + f/2 - 13 - f}{2} \Rightarrow 4 = \frac{35/2 - f/2}{2} \Rightarrow 16 = 35 - f$$
$$\Rightarrow f = 19$$
Therefore, the missing frequency is 19.
4.4.1 Merits and demerits of median
The table 4.9 displays the merits and demerits of median.
Table 4.9. Merits and demerits of median
Merits:
• It can be easily understood and computed.
• It is not affected by extreme values.
• It can be determined graphically (Ogives).
• It can be used for qualitative data.
• It can be calculated for distributions with open-end classes.
Demerits:
• It is not based on all values.
• It is not capable of further algebraic treatment.

4.5 Mode
Mode is the value which has the highest frequency and is denoted by Z.
Modal value is most useful for business people. For example, shoe and
readymade garment manufacturers will like to know the modal size of the
people to plan their operations. For discrete data with or without frequency,
it is that value corresponding to highest frequency.
Solved Problem 14: The following data relate to size of shoes. Find the
mode.
6, 7, 6, 8, 9, 9, 9, 10, 8, 7, 7, 9, 10, 9, 9, 9, 8, 8, 11
Solution: Arranging the data in ascending order and counting each size, we
obtain the frequencies shown in table 4.10.
Table 4.10. Frequency table for data in solved problem 14
Size   Frequency
6      2
7      3
8      4
9      7
10     2
11     1

Therefore, the modal value is 9, which corresponds to the highest frequency, 7.

Key Statistic
In case of continuous series, mode is given by:
$$\text{Mode} = \text{L.L.} + \frac{f_m - f_p}{2 f_m - f_p - f_s} \times \text{C.I.}$$
where,
L.L. = lower limit of modal class
fm = frequency of modal class
fp = frequency of previous class
fs = frequency of succeeding class
C.I. = width of class interval

Solved Problem 15: Praveen, an apartment builder, is concerned about the
plinth area that customers wish to have for their apartments. He collects
the data and summarises it in table 4.11. Find the modal plinth area.
Table 4.11. Customers wishing to have plinth area
Plinth Area Sq ft   No. of Customers
600 – 800            4
800 – 1000          10
1000 – 1200         15 (fp)
1200 – 1400         25 (fm)
1400 – 1600         12 (fs)
1600 – 1800          8
Above 1800           2

Solution: We note that the intervals are of the exclusive type and the highest
frequency is 25. Therefore, the corresponding interval is 1200-1400, which
is called the modal class.
$$\text{Mode} = \text{L.L.} + \frac{f_m - f_p}{2 f_m - f_p - f_s} \times \text{C.I.}$$
where,
L.L. = lower limit of modal class = 1200
fm = frequency of modal class = 25
fp = frequency of previous class = 15
fs = frequency of succeeding class = 12
C.I. = width of class interval = 200
Therefore, the mode is calculated as:
$$\text{Mode} = 1200 + \frac{25 - 15}{2 \times 25 - 15 - 12} \times 200 = 1200 + \frac{2000}{23} \approx 1286.96$$
Hence, the modal plinth area is approximately 1286.96 square feet.
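
The mode formula for a continuous series is a single expression, sketched below with the figures of solved problem 15. The function name `grouped_mode` and its parameter names are illustrative assumptions.

```python
def grouped_mode(lower, f_m, f_p, f_s, width):
    """Mode of a continuous series: L.L. + (fm - fp) / (2*fm - fp - fs) * C.I."""
    return lower + (f_m - f_p) / (2 * f_m - f_p - f_s) * width

# Modal class 1200-1400 with fm = 25, fp = 15, fs = 12 and width 200
modal_area = grouped_mode(lower=1200, f_m=25, f_p=15, f_s=12, width=200)
print(round(modal_area, 2))  # 1286.96
```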

Solved Problem 16: The distribution shown in table 4.12 gives the average
monthly balances of customers in a nationalised bank. The mode of the
distribution is 119. Find the total number of customers surveyed.

Table 4.12. Distribution of average monthly balances of customers

Class Interval   Frequency
0 – 50            78
50 – 100         123
100 – 150          –
150 – 200         82
200 – 250         51
250 – 300         47
300 – 350         18
350 – 400          9
400 – 450          6
450 – 500          4

Solution: Let the missing frequency be 'f'. Since the mode is given to be 119,
the modal class is 100 – 150, with fm = f, fp = 123, fs = 82 and C.I. = 50.
$$119 = 100 + \frac{f - 123}{2f - 123 - 82} \times 50 \Rightarrow 19 = \frac{f - 123}{2f - 205} \times 50$$
$$\Rightarrow 19 (2f - 205) = 50 (f - 123)$$
$$\Rightarrow 38f - 3895 = 50f - 6150$$
$$\Rightarrow 2255 = 12f$$
$$\Rightarrow f \approx 188$$
Therefore, the total number of customers surveyed is
78 + 123 + 188 + 82 + 51 + 47 + 18 + 9 + 6 + 4 = 606.

4.5.1 Merits and demerits of mode


The table 4.13 depicts the merits and demerits of mode.
Table 4.13. Merits and demerits of mode
Merits:
• In many cases it can be found by inspection.
• It is not affected by extreme values.
• It can be calculated for distributions with open end classes.
• It can be located graphically.
• It can be used for qualitative data.
Demerits:
• It is not based on all values.
• It is not capable of further mathematical treatment.
• It is much affected by sampling fluctuations.

Key Statistic
The empirical relationship between mean, median and mode:
Mean – Mode = 3 (Mean – Median)
which is same as,
Mode = 3 Median – 2 Mean.

4.6 Geometric Mean


The geometric mean (GM) of a series of "n" positive numbers is given by:
i. In case of discrete series without frequency,
$$GM = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$$
ii. In case of discrete series with frequency,
$$GM = \sqrt[N]{X_1^{f_1} \cdot X_2^{f_2} \cdots X_n^{f_n}}$$
where,
$$N = f_1 + f_2 + \cdots + f_n$$
iii. In case of continuous series,
$$GM = \sqrt[N]{X_1^{f_1} \cdot X_2^{f_2} \cdots X_n^{f_n}}$$
where $N = f_1 + f_2 + \cdots + f_n$ and $X_1, X_2, \ldots, X_n$ are the mid points of
the class intervals.

It is also given by
$$GM = \text{antilog} \left( \frac{\sum f \log x}{N} \right)$$
where f = 1 for every value in a series without frequency.

Solved Problem 17: The growth in bad-debt expense for Das Office
Supplies Company over the last few years is shown in table 4.14.
Calculate the average percentage increase in bad-debt expense over this
time period.
Table 4.14. Bad-debt expense growth for Das Office Supplies Company

Year           1992    1993    1994    1995    1996    1997    1998
Expense Rate   1.110   1.090   1.075   1.080   1.095   1.080   1.200

Solution: The geometric mean is given by:
$$GM = \sqrt[7]{(1.11)(1.09)(1.075)(1.08)(1.095)(1.08)(1.20)} = \sqrt[7]{1.9934} \approx 1.1036$$
Therefore, the average increase is 1.1036 – 1 = 0.1036, that is, about
10.36% per year.
Solved Problem 18: The share-price of a particular company was moving
up and down. The data shown in table 4.15a consolidates its movement for
the past 6 months. Find the appropriate average share-price.

Table 4.15a. Frequency table of share price

Share Price   110   115   118   119   120
Frequency       4    11    21     6     2

Solution: The data in table 4.15b is obtained from the data in table 4.15a.

Table 4.15b. Calculation of geometric mean of share prices

Share Price X   Frequency f   log X    f log X
110              4            2.0414    8.1656
115             11            2.0607   22.6677
118             21            2.0719   43.5099
119              6            2.0755   12.4530
120              2            2.0792    4.1584
Total           44            –        90.9546

The geometric mean GM is calculated as:
$$GM = \text{antilog} \left( \frac{\sum f \log x}{N} \right) = \text{antilog} \left( \frac{90.9546}{44} \right) = \text{antilog } 2.0672 = 116.7$$
The appropriate average share price is Rs. 116.70.
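
The logarithmic form of the geometric mean lends itself to a short Python sketch. The function name `geometric_mean` is an illustrative choice made here, and natural logarithms are used instead of the base-10 logarithm tables of the text; the base does not affect the result. The first call uses the growth factors exactly as they are printed in Table 4.14.

```python
import math

def geometric_mean(values, freqs=None):
    """Geometric mean via logs: exp(sum(f * ln x) / N), with f = 1 if no frequencies."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    return math.exp(sum(f * math.log(x) for x, f in zip(values, freqs)) / n)

# Growth factors as printed in Table 4.14
growth = [1.110, 1.090, 1.075, 1.080, 1.095, 1.080, 1.200]
gm_growth = geometric_mean(growth)

# Share prices and frequencies from Table 4.15a
gm_price = geometric_mean([110, 115, 118, 119, 120], [4, 11, 21, 6, 2])

print(round(gm_growth, 4))  # about 1.1036, an average increase of about 10.36%
print(round(gm_price, 1))   # about 116.7
```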

Key Statistic
Whenever data deal with rates, ratios, growth rates, and so on, the
geometric mean is the best measure.
The geometric mean is not defined if even one of the values is zero or
negative.

4.7 Harmonic Mean


If x1, x2, ..., xn are "n" values for discrete series without frequency, then
their harmonic mean (HM) is:
$$H.M. = \frac{n}{\sum (1 / x_i)}$$

Key Statistic
For discrete series with frequency, the harmonic mean is given by:
$$H.M. = \frac{N}{\sum (f_i / x_i)}$$
where fi are the frequencies corresponding to the values xi and N is the
total frequency, that is, the sum of all fi.

Solved Problem 19: Calculate the harmonic mean of 9.7, 9.8, 9.5, 9.4, 9.7.
Solution: The harmonic mean (HM) is calculated as shown in table 4.16.
Table 4.16. Calculation of harmonic mean
X       1/X
9.7     0.1031
9.8     0.1020
9.5     0.1053
9.4     0.1064
9.7     0.1031
Total   0.5199

$$\therefore HM = \frac{5}{0.5199} = 9.6172$$
Therefore, the harmonic mean is 9.6172.
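
A minimal sketch of the harmonic mean formula follows. The function name `harmonic_mean` is illustrative. Because the code does not round the reciprocals to four decimal places, its result differs from the hand calculation in the fourth decimal place.

```python
def harmonic_mean(values, freqs=None):
    """Harmonic mean: N / sum(f_i / x_i), with f_i = 1 if no frequencies are given."""
    if freqs is None:
        freqs = [1] * len(values)
    return sum(freqs) / sum(f / x for x, f in zip(values, freqs))

hm = harmonic_mean([9.7, 9.8, 9.5, 9.4, 9.7])
print(round(hm, 2))  # 9.62
```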

Self Assessment Questions


2. State whether the following statements are true 'T' or false 'F'.
i. Mode is based on all values
ii. Mode = 3 Median – Mean
iii. Geometric mean is used when we are interested in rate of growth
of any phenomena.
iv. Harmonic mean exists if one of the values is zero.
v. A.M < G.M < H.M for any two values „a‟ and „b‟.
vi. Arithmetic mean can be calculated accurately even when the
distribution has open-end class.
vii. Mode can be located graphically.
viii. Mode is used when data is on interval scale.

4.8 Appropriate Situations for the use of Various Averages


1. Arithmetic mean is used when:
a. In depth study of the variable is needed
b. The variable is continuous and additive in nature
c. The data are in the interval or ratio scale
d. The distribution is symmetrical

2. Median is used when:


a. The variable is discrete
b. There exist abnormal values
c. The distribution is skewed
d. The extreme values are missing
e. The characteristics studied are qualitative
f. The data are on the ordinal scale
3. Mode is used when:
a. The variable is discrete
b. There exist abnormal values
c. The distribution is skewed
d. The extreme values are missing
e. The characteristics studied are qualitative
4. Geometric mean is used when:
a. The rate of growth, ratios and percentages are to be studied
b. The variable is of multiplicative nature
5. Harmonic mean is used when:
a. The study is related to speed, time
b. Average of rates which produce equal effects has to be found

4.9 Positional Averages


Median is the mid-value of series of data. It divides the distribution into two
equal portions. Similarly, we can divide a given distribution into four, ten or
hundred or any other number of equal portions.

Key Statistic
Quartiles: When distribution is divided into four equal portions, then we
get first quartile (Q1), second quartile (Q2 = Median) and third quartile
(Q3) as the positional averages.

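For discrete series with or without frequency, Q1 and Q3 are given by:

$$Q_1 = \left(\frac{N + 1}{4}\right)^{th} \text{ value}$$

$$Q_3 = \left(\frac{3(N + 1)}{4}\right)^{th} \text{ value}$$

For continuous distribution Q1 and Q3 are given by:

$$Q_1 = L.L. + \frac{N/4 - Cf_p}{F_c} \times C.I.$$

$$Q_3 = L.L. + \frac{3N/4 - Cf_p}{F_c} \times C.I.$$

where L.L. is the lower limit of the quartile class, Cf_p is the cumulative
frequency of the class preceding it, F_c is the frequency of the quartile
class and C.I. is the width of the class interval.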
Solved Problem 20: Weekly sales of a product in 8 different shops are as
follows. Calculate the quartiles.
Sales in units: 309, 312, 305, 307, 310, 308, 308, 306
Solution: Arranging the data in ascending order, we have:
305, 306, 307, 308, 308, 309, 310, 312

$$Q_1 = \left(\frac{n + 1}{4}\right)^{th} \text{ value} = \left(\frac{8 + 1}{4}\right)^{th} \text{ value} = 2.25^{th} \text{ value}$$

= 2nd value + 0.25 (3rd value – 2nd value)
= 306 + 0.25 (307 – 306) = 306.25

$$Q_2 = \left(\frac{2(n + 1)}{4}\right)^{th} \text{ value} = 2.25 \times 2 = 4.5^{th} \text{ value}$$

= 4th value + 0.5 (5th value – 4th value)
= 308 + 0.5 (308 – 308) = 308

$$Q_3 = \left(\frac{3(n + 1)}{4}\right)^{th} \text{ value} = 2.25 \times 3 = 6.75^{th} \text{ value}$$

= 6th value + 0.75 (7th value – 6th value)
= 309 + 0.75 (310 – 309)
= 309 + 0.75 = 309.75
Therefore, Q1, Q2, and Q3 are 306.25, 308 and 309.75 respectively.
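The positional rule used in this solution can be sketched as follows, using the eight sales figures that the worked steps rely on. The helper names `positional_value` and `quartiles` are illustrative assumptions.

```python
def positional_value(ordered, position):
    """Value at a 1-based, possibly fractional position, by linear interpolation."""
    lower = int(position)              # whole part, e.g. 2 for the 2.25th value
    fraction = position - lower        # fractional part, e.g. 0.25
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    base = ordered[lower - 1]
    return base + fraction * (ordered[lower] - base)

def quartiles(data):
    """Q1, Q2, Q3 located at the k(n + 1)/4-th positions of the sorted data."""
    ordered = sorted(data)
    n = len(ordered)
    return tuple(positional_value(ordered, k * (n + 1) / 4) for k in (1, 2, 3))

sales = [309, 312, 305, 307, 310, 308, 308, 306]
print(quartiles(sales))
```

Python's standard `statistics.quantiles(data, n=4, method='exclusive')` follows the same (n + 1)-based positions, so it can serve as a cross-check.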
Solved Problem 21: The table 4.17a shows the distribution of weight of
students of 1st standard of a school. Find the quartiles.

Table 4.17a. Distribution of weight of 1st standard students

Class Interval   13 – 18   18 – 20   20 – 21   21 – 22   22 – 23   23 – 25   25 – 30
Frequency        22        27        51        42        32        16        10

Solution: The table 4.17b displays the cumulative frequency distribution of data
for solved problem 21.
Table 4.17b. Cumulative frequency distribution of data for solved problem 21

Class interval   Frequency   Cumulative frequency   Remarks
13 – 18          22          22
18 – 20          27          49                     P20 class
20 – 21          51          100                    Q1 class and Q2 class
21 – 22          42          142                    D7 class
22 – 23          32          174                    Q3 class
23 – 25          16          190
25 – 30          10          200

N = 200

$$Q_1 = \left(\frac{N}{4}\right)^{th} \text{ value} = 50^{th} \text{ value}$$

$$Q_1 = 20 + \frac{50 - 49}{51} \times 1 = 20.02$$

$$Q_2 = \left(\frac{N}{2}\right)^{th} \text{ value} = 100^{th} \text{ value}$$

$$Q_2 = 20 + \frac{100 - 49}{51} \times 1 = 21$$

$$Q_3 = \left(\frac{3N}{4}\right)^{th} \text{ value} = 150^{th} \text{ value}$$

$$Q_3 = 22 + \frac{150 - 142}{32} \times 1 = 22.25$$

Therefore, the quartiles Q1, Q2, and Q3 are 20.02, 21 and 22.25 respectively.

Key Statistic
To find the kth decile, we divide N by 10 and multiply by k. The kth decile
is the (kN/10)th value.

Solved Problem 22: Find the 7th decile for the data given in solved
problem 21.
Solution: The 7th decile is given by:

$$D_7 = \left(\frac{7N}{10}\right)^{th} \text{ value} = \left(\frac{7 \times 200}{10}\right)^{th} \text{ value} = 140^{th} \text{ value}$$

$$D_7 = 21 + \frac{140 - 100}{42} \times 1 = 21.95$$

Therefore, the 7th decile is 21.95.
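Quartiles and deciles of a grouped distribution all follow the same interpolation pattern, L.L. + (position - Cf_p) / F_c × C.I. The sketch below applies it to the weight distribution of solved problem 21; the function name `grouped_quantile` and its arguments are assumptions made for illustration.

```python
def grouped_quantile(classes, freqs, fraction):
    """Interpolated quantile of grouped data.

    classes  : list of (lower limit, upper limit) pairs
    freqs    : class frequencies
    fraction : 0.25 for Q1, 0.5 for Q2, 0.7 for D7, and so on
    """
    n = sum(freqs)
    target = fraction * n              # position of the required value
    cumulative = 0                     # Cf_p: cumulative frequency before the class
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")

classes = [(13, 18), (18, 20), (20, 21), (21, 22), (22, 23), (23, 25), (25, 30)]
freqs = [22, 27, 51, 42, 32, 16, 10]

q1 = grouped_quantile(classes, freqs, 0.25)
q2 = grouped_quantile(classes, freqs, 0.50)
q3 = grouped_quantile(classes, freqs, 0.75)
d7 = grouped_quantile(classes, freqs, 0.70)
print(round(q1, 2), round(q2, 2), round(q3, 2), round(d7, 2))
```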

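Key Statistic
To find the kth percentile, we divide N by 100 and multiply by k. The kth
percentile is the (kN/100)th value.

Solved Problem 23: For the solved problem 21, find the 20th percentile.
Solution: The 20th percentile is given by:

$$P_{20} = \left(\frac{20N}{100}\right)^{th} \text{ value} = \left(\frac{20 \times 200}{100}\right)^{th} \text{ value} = 40^{th} \text{ value}$$

The 40th value lies in the class 18 – 20, whose frequency is 27. The
cumulative frequency of the preceding class is 22 and the class width is 2.

$$P_{20} = 18 + \frac{40 - 22}{27} \times 2 = 19.33$$

Therefore, the 20th percentile is 19.33.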

Self Assessment Questions


3. State whether the following statements are 'True' or 'False'.
i. Quantiles are positional value.
ii. Quantiles help us to find percentage of readings below or above a
certain value.
iii. Q2 = P50 = D7 = Median


Key Statistic
Suppose the values x1, x2, …, xn are assigned the weights w1, w2, …, wn;
then their weighted average is given by:

$$\bar{X}_w = \frac{\sum W x}{\sum W}$$

and their weighted Geometric Mean is given by:

$$G_w = \text{antilog}\left(\frac{\sum W \log x}{\sum W}\right)$$

where 'W' acts as frequency.

Solved Problem 24: A professor assigns 5, 10, 10, 20, as weights for
assignments, presentations, first test and final test respectively. Moni and
Mani got the percentages in the above categories as shown in table 4.18.
Find the weighted percentage.
Table 4.18. Percentages of assignment weightages

Classification   Moni   Mani   Weight
Assignment       60     40     5
Presentation     80     60     10
First Test       50     100    10
Final Test       100    70     20
Total                          45

Solution: For Moni, the weighted arithmetic mean is given by:

$$\bar{X}_w = \frac{60 \times 5 + 80 \times 10 + 50 \times 10 + 100 \times 20}{45} = \frac{3600}{45} = 80\%$$

For Mani, it is given by:

$$\bar{X}_w = \frac{40 \times 5 + 60 \times 10 + 100 \times 10 + 70 \times 20}{45} = \frac{3200}{45} = 71.11\%$$

The weighted arithmetic means of the percentages obtained by Moni and Mani are
80% and 71.11% respectively.
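The weighted average can be recomputed from the values printed in table 4.18. This is a minimal sketch; `weighted_mean` and the list names are illustrative assumptions.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(W * x) / sum(W)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

weights = [5, 10, 10, 20]        # assignment, presentation, first test, final test
moni = [60, 80, 50, 100]         # Moni's percentages as printed in table 4.18
mani = [40, 60, 100, 70]         # Mani's percentages as printed in table 4.18

print(round(weighted_mean(moni, weights), 2))
print(round(weighted_mean(mani, weights), 2))
```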

Self Assessment Questions


4. State whether the following statements are true, 'T' or false, 'F'.
i. The cost of living index numbers calculated are based on weighted
averages.
ii. Many of the items which we use in our life can be assigned weights.

4.10 Dispersion
Dispersion describes another characteristic of a distribution: how widely the
values are spread around the average. Consider the two distributions of
weights of a product produced by two machines, shown in table 4.19.
Table 4.19. Distribution of weights of a product

Machine A B
Sample size 1000 1000
Average weight 80 80
Minimum weight 20 40
Maximum weight 140 100

Machine 'B' produces products with weights much closer to the average
than Machine 'A'. As a manufacturer or customer, we would choose
Machine 'B'. In other words, we choose that machine whose spread is
smaller.
The property of deviations of values from the average is called dispersion or
variation. The degree of variation is found by the measures of variation.
They are:
1. Range (R)
2. Quartile Deviations (Q.D)
3. Mean Deviations (M.D)
4. Standard Deviations (S.D)
These measures have units of measurement attached to them. Therefore, they
are known as absolute measures of variation. However, we may want to
compare two different distributions whose measurements are in different
units, say one in kilograms and another in centimetres. Then, we use the
following relative measures, which do not have any units attached to them:
1. Coefficient of Range
2. Coefficient of Quartile Deviations
3. Coefficient of Mean Deviations
4. Coefficient of Variations
In this unit, we study both the measures of variation and the corresponding
coefficients of variation.
The prerequisites of a good measure of variation are:
1. It should be easy to understand and simple to calculate.
2. It should be based on all values.
3. It should be rigidly defined.
4. It should not be affected by extreme values.
5. It should not be affected by sampling fluctuations.
6. It should be capable of further algebraic treatment.
4.10.1 Range
Range is the difference between the highest and lowest values of the data.

$$R = H - L$$

where H is the highest value and L is the lowest value.

$$\text{Coefficient of range} = \frac{H - L}{H + L}$$

The table 4.20 shows the merits and demerits of range.

Table 4.20. Merits and demerits of range

| Merits | Demerits |
|---|---|
| It is easily understood and simple to calculate. | It is affected by extreme values. |
| It is rigidly defined. | It is not based on all values. It uses extreme values only. |

Range is used:
• In statistical quality control
• When the study does not require deep analysis
• When data has no abnormal values
Solved Problem 25: Find the range of the following discrete series: 26, 28,
28, 26, 28, 30, 27, 29, 26, 24
Solution: The range 'R' is calculated as:
R = H - L
where H is the highest value and L is the lowest value.
R = 30 – 24 = 6
Therefore, the range of the given discrete series is 6.
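As a quick check, the range and its coefficient for the series of solved problem 25 can be computed as sketched below. The function names are illustrative assumptions.

```python
def value_range(data):
    """Absolute measure: highest value minus lowest value."""
    return max(data) - min(data)

def coefficient_of_range(data):
    """Relative measure: (H - L) / (H + L); it carries no unit."""
    high, low = max(data), min(data)
    return (high - low) / (high + low)

series = [26, 28, 28, 26, 28, 30, 27, 29, 26, 24]
print(value_range(series))
print(round(coefficient_of_range(series), 4))
```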
Solved Problem 26: Find the range for the continuous series of data shown
in table 4.21.
Table 4.21. Frequency table for data of solved problem 26

Class Interval 0-5 5-10 10-15 15-20 20-25


Frequency 10 15 25 12 8

Solution: Range R is calculated as:


R = 25 – 0 = 25
Therefore, the range of the given continuous series is 25.

Key Statistic
Range is not defined if the class intervals are open.


4.10.2 Quartile deviations


Unlike range, quartile deviation does not involve the extreme values. It is
defined as:

$$Q.D. = \frac{Q_3 - Q_1}{2}$$

The corresponding relative measure is:

$$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

Key Statistic
1. Q3 - Q1 is called the inter quartile range.
2. Q3 - Q1 covers the middle 50% of readings. Q3 and Q1 are also known
as the upper and lower limits of the middle 50% of readings.
3. Quartile range is not capable of further algebraic treatment.

Solved Problem 27: Compute the inter quartile range, Q.D and coefficient
of Q.D for the age distributions shown in table 4.22a..
Table 4.22a. Age distributions

Age (Years) 18 21 22 24 27 30 32
Frequency 7 13 20 36 14 8 2

Solution: The table 4.22b shows the cumulative frequency distributions for
the age distributions.
Table 4.22b. Cumulative frequency table for the age distributions

Age (Years) Frequency Cumulative Frequency


18 7 7
21 13 20
22 20 40
24 36 76
27 14 90
30 8 98
32 2 100
Total 100

$$Q_1 = \left(\frac{100 + 1}{4}\right)^{th} \text{ value} = 25.25^{th} \text{ value}$$

Q1 = 22

$$Q_3 = \left(\frac{3(100 + 1)}{4}\right)^{th} \text{ value} = 75.75^{th} \text{ value}$$

Q3 = 24
Therefore, the inter quartile range, Q3 – Q1 = 24 – 22 = 2 years.

$$Q.D. = \frac{24 - 22}{2} = 1 \text{ year}$$

$$\text{Coefficient of Q.D.} = \frac{24 - 22}{24 + 22} = \frac{2}{46} = 0.0435$$
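The steps of solved problem 27 can be sketched in code by expanding the frequency table into an ordered list and reading off the (N + 1)/4-th and 3(N + 1)/4-th values. The helper names below are assumptions made for illustration.

```python
def value_at(ordered, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    lower = int(position)
    fraction = position - lower
    base = ordered[lower - 1]
    if fraction == 0 or lower >= len(ordered):
        return base
    return base + fraction * (ordered[lower] - base)

def quartile_deviation(values, freqs):
    """Return (Q1, Q3, Q.D., coefficient of Q.D.) for a discrete frequency table."""
    ordered = sorted(x for x, f in zip(values, freqs) for _ in range(f))
    n = len(ordered)
    q1 = value_at(ordered, (n + 1) / 4)
    q3 = value_at(ordered, 3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1)

ages = [18, 21, 22, 24, 27, 30, 32]
freqs = [7, 13, 20, 36, 14, 8, 2]

q1, q3, qd, coeff = quartile_deviation(ages, freqs)
print(q1, q3, qd, round(coeff, 4))
```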

The table 4.23 shows the merits and demerits of quartile deviations.
Table 4.23. Merits and demerits of quartile deviations

| Merits | Demerits |
|---|---|
| It is easy to understand and to compute. | It is not based on all values. |
| It is rigidly defined. | It is affected by sampling fluctuations. |
| It is not affected by extreme values. | It is not capable of further algebraic treatment. |

4.10.3 Mean deviation


Mean deviation is defined as the mean of absolute deviations of the values
from the central value.
For discrete data with frequency, mean deviation is calculated as:

$$M.D.(\bar{X}) = \frac{\sum f \, |X - \bar{X}|}{N}$$

In case of continuous series 'X' represents mid value of class-interval.
Similarly, we can have mean deviation from median or mode. The mean is
replaced by median or mode in the above formula. However, mean deviation
from median is the least. It is known as minimal property of mean deviation.
The corresponding relative measures are the coefficients of mean deviation.

$$\text{Coefficient of M.D.}(\bar{X}) = \frac{M.D.(\bar{X})}{\bar{X}}$$

$$\text{Coefficient of M.D.(Median)} = \frac{M.D.(\text{Median})}{\text{Median}}$$
Solved Problem 28: Calculate mean deviation and also coefficient of mean
deviation using:
i) Mean
ii) Median
Compare the results.
Heights of plants (cms): 140, 147, 143, 145, 144, 158, 142, 141.
Solution: The deviations required for solved problem 28 are shown in
table 4.24.
Table 4.24. Data for the solved problem 28

X        From mean |X – 145|    From median |X – 143.5|
140      5                      3.5
141      4                      2.5
142      3                      1.5
143      2                      0.5
144      1                      0.5
145      0                      1.5
147      2                      3.5
158      13                     14.5
1160     30                     28.0

$$\bar{X} = \frac{1160}{8} = 145 \text{ cms}$$

$$\text{Mean deviation from mean} = \frac{30}{8} = 3.75 \text{ cms}$$

$$\text{Coefficient of M.D.}(\bar{X}) = \frac{3.75}{145} = 0.0259$$

$$\text{Median is } \left(\frac{8 + 1}{2}\right)^{th} \text{ value} = 4.5^{th} \text{ value}$$

Median = 143 + 0.5 (144 – 143) = 143.5 cms

$$\text{Mean deviation from median} = \frac{28}{8} = 3.5 \text{ cms}$$

$$\text{Coefficient of M.D.(Median)} = \frac{3.5}{143.5} = 0.0244$$

The mean deviation from the median (3.5 cms) is less than the mean
deviation from the mean (3.75 cms).
Solved Problem 29: The data in table 4.25a is the distribution of employees
of a firm according to their efficiency. Find the mean deviation and
coefficient of mean deviation from:
i. Mean
ii. Median
Table 4.25a. Distribution of employees according to their efficiency

Efficiency Index 18-22 22-26 26-30 30-34 34-38


Employees 20 30 11 3 1

Solution: The table 4.25b displays the frequency distribution of employees
used to calculate the mean deviation from mean and the mean deviation from
median.
Table 4.25b. Frequency distribution of employees

Efficiency Index   Frequency   d = (X – 28)/4   fd     f|X – 24|   Cf    |X – Med|   f|X – Med|
18 – 22            20          -2               -40    80          20    3.66        73.20
22 – 26            30          -1               -30    0           50    0.34        10.20
26 – 30            11          0                0      44          61    4.34        47.74
30 – 34            3           1                3      24          64    8.34        25.02
34 – 38            1           2                2      12          65    12.34       12.34
Total              65                           -65    160                           168.50

$$\bar{X} = 28 + \frac{-65}{65} \times 4 = 24$$

$$M.D.(\bar{X}) = \frac{160}{65} = 2.46$$

$$\left(\frac{N}{2}\right)^{th} \text{ value} = \frac{65}{2} = 32.5^{th} \text{ value}$$

Median class is 22 – 26

$$\text{Median} = 22 + \frac{32.5 - 20}{30} \times 4 = 22 + \frac{50}{30} = 22 + 1.66 = 23.66$$

$$M.D.(\text{Median}) = \frac{168.50}{65} = 2.59$$

$$\text{Coefficient of M.D.}(\bar{X}) = \frac{2.46}{24} = 0.1025$$

$$\text{Coefficient of M.D.(Median)} = \frac{2.59}{23.66} = 0.1095$$

Therefore, the mean deviation and coefficient of mean deviation from mean
are 2.46 and 0.1025 respectively. The mean deviation and coefficient of
mean deviation from median are 2.59 and 0.1095 respectively. The table
4.26 shows the merits and demerits of mean deviation.
Table 4.26. Merits and demerits of mean deviation

| Merits | Demerits |
|---|---|
| It is based on all values. | It is not capable of further algebraic treatment. |
| It is less affected by extreme values. | It does not take into account negative signs. |
| It is not affected much by sampling fluctuations. | |

The mean deviation, MD is used:
• When sample size is small.
• In statistical analysis of certain economic, business and social
phenomena.
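The grouped-data calculation of solved problem 29 (efficiency index of 65 employees) can be sketched as below. The function name `grouped_mean_deviation` is an assumption; class mid values stand in for X, and the median is interpolated within its class. Because the script does not round the median to 23.66 first, its figures from the median can differ slightly in the last decimal from a hand calculation.

```python
def grouped_mean_deviation(classes, freqs):
    """Return (mean, M.D. from mean, median, M.D. from median) for grouped data."""
    n = sum(freqs)
    mids = [(lower + upper) / 2 for lower, upper in classes]
    mean = sum(f * x for x, f in zip(mids, freqs)) / n

    # Median: interpolate inside the class that contains the (N/2)th value
    cumulative = 0
    median = None
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f >= n / 2:
            median = lower + (n / 2 - cumulative) / f * (upper - lower)
            break
        cumulative += f

    def mean_deviation(centre):
        # Mean of absolute deviations of the mid values from the chosen centre
        return sum(f * abs(x - centre) for x, f in zip(mids, freqs)) / n

    return mean, mean_deviation(mean), median, mean_deviation(median)

classes = [(18, 22), (22, 26), (26, 30), (30, 34), (34, 38)]
freqs = [20, 30, 11, 3, 1]

mean, md_mean, median, md_median = grouped_mean_deviation(classes, freqs)
print(round(mean, 2), round(md_mean, 2), round(median, 2), round(md_median, 2))
print(round(md_mean / mean, 4), round(md_median / median, 4))  # coefficients
```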

4.11 Standard Deviation


The range and Q.D are not based on all values. Mean deviation is based on
all values but does not take into consideration the positive or negative
sign of the deviations. A measure that removes both drawbacks is the
standard deviation (S.D).
The standard deviation of a set of values is the positive square root of mean
of the squared deviations of the values from their arithmetic mean. It is
denoted by 'σ' (sigma).
For discrete series without frequency it is given by:

$$\text{Variance} = \frac{\sum (X - \bar{X})^2}{N} \qquad (A)$$

$$\sigma = \sqrt{\text{Variance}}$$

For discrete series with frequency, it is given by:

$$\text{Variance} = \frac{\sum f (X - \bar{X})^2}{\sum f} \qquad (B)$$

$$\sigma = \sqrt{\text{Variance}}$$

where 'X' is the mid value of the class interval for continuous series. In
case of grouped data, alternative forms for (A) and (B) are the following.
For (A):

$$\text{Variance} = \frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2$$

where d = X - A and A is the assumed mean.
For (B):

$$\text{Variance} = \left[\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{\sum f}\right)^2\right] \times (C.I.)^2$$

where d = (X - A) / C.I., A is the assumed mean, C.I. is the class width and
N is the total frequency. In both cases:

$$\sigma = \sqrt{\text{Variance}}$$

Key Statistic
The square of standard deviation is called variance. It is denoted by σ².

Solved Problem 30: The diastolic blood pressures of men are distributed
as shown in table 4.27a. Find the standard deviation and variance.
Table 4.27a. Distribution of diastolic blood pressures of men
Pressure (mm)   78-80   80-82   82-84   84-86   86-88   88-90
No. of Men      3       15      26      23      9       4

Solution: The table 4.27b represents the frequency distribution of data
required for calculating the standard deviation.
Table 4.27b. Frequency distribution of data for solved problem 30

Class Interval   Mid value X   Frequency 'f'   d = (X – 83)/2   fd     fd²
78-80            79            3               -2               -6     12
80-82            81            15              -1               -15    15
82-84            83            26              0                0      0
84-86            85            23              1                23     23
86-88            87            9               2                18     36
88-90            89            4               3                12     36
Total                          80                               32     122

$$\sigma^2 = \left[\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{\sum f}\right)^2\right] \times (C.I.)^2$$

$$\sigma^2 = \left[\frac{122}{80} - \left(\frac{32}{80}\right)^2\right] \times (2)^2 = (1.525 - 0.16) \times 4 = 5.46 \text{ (mm}^2\text{)} = \text{Variance}$$

$$\text{Standard deviation} = \sigma = \sqrt{5.46} = 2.337 \text{ (mm)}$$
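The step-deviation calculation of solved problem 30 can be sketched as follows. The function name `grouped_variance` and its parameter names are assumptions made for illustration; the assumed mean 83 and class width 2 are the ones used in table 4.27b.

```python
import math

def grouped_variance(midpoints, freqs, assumed_mean, class_width):
    """Variance of grouped data by the step-deviation method."""
    n = sum(freqs)
    d = [(x - assumed_mean) / class_width for x in midpoints]   # step deviations
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    sum_fd2 = sum(f * di ** 2 for f, di in zip(freqs, d))
    return (sum_fd2 / n - (sum_fd / n) ** 2) * class_width ** 2

midpoints = [79, 81, 83, 85, 87, 89]
freqs = [3, 15, 26, 23, 9, 4]

variance = grouped_variance(midpoints, freqs, assumed_mean=83, class_width=2)
sd = math.sqrt(variance)
print(round(variance, 2), round(sd, 3))
```

Any assumed mean gives the same variance, which reflects the property that standard deviation is independent of origin.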


4.11.1 Properties of standard deviation


1. It is independent of origin but not independent of scale.
2. Standard deviation is always greater than or equal to zero.
3. It is the least of all root-mean-square deviations.
4. Suppose the mean of n1 values is X̄1 and that of n2 values is X̄2, and
the standard deviations of the n1 and n2 values are σ1 and σ2 respectively.
Then the combined standard deviation of both sets of values is given by:

$$\text{Variance} = \frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)}{n_1 + n_2}; \qquad \sigma = \sqrt{\text{Variance}}$$

where d1 = X̄ – X̄1 and d2 = X̄ – X̄2,
X̄ being the combined mean of the n1 and n2 values.
The table 4.28 shows the merits and demerits of standard deviation.
Table 4.28. Merits and demerits of standard deviation

| Merits | Demerits |
|---|---|
| It is rigidly defined. | It is difficult to understand. |
| It is based on all values. | It gives undue weightage for extreme values. |
| It is capable of further algebraic treatment. | It cannot be calculated for classes with open end interval. |
| It is not very much affected by sampling fluctuations. | |

Solved Problem 31: The average weight of 100 apples from area “A” is
150gms with standard deviation of 10gms. Similarly the average weight of
200 apples from area “B” is 200gms with standard deviation of 15gms. Find
the combined standard deviation.
Solution: Given that:

$$n_1 = 100, \quad n_2 = 200, \quad \bar{X}_1 = 150, \quad \bar{X}_2 = 200, \quad \sigma_1 = 10, \quad \sigma_2 = 15$$

$$\text{Combined average} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} = \frac{100 \times 150 + 200 \times 200}{100 + 200} = \frac{15000 + 40000}{300} = \frac{55000}{300} = 183.33 \text{ gms}$$

$$d_1^2 = (150 - 183.33)^2 = (33.33)^2 = 1110.8889$$

$$d_2^2 = (200 - 183.33)^2 = (16.67)^2 = 277.8889$$

$$\text{Variance} = \frac{100(100 + 1110.8889) + 200(225 + 277.8889)}{100 + 200} = \frac{100 \times 1210.8889 + 200 \times 502.8889}{300} = 738.8889$$

$$\text{Standard deviation} = \sqrt{738.8889} = 27.18$$

Hence, the combined standard deviation is 27.18 gms.
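The combined standard deviation formula of property 4 can be evaluated directly, which is a useful check on the arithmetic of solved problem 31. This is a minimal sketch; the function name `combined_sd` is an assumption. Evaluated without intermediate rounding, the formula gives a combined standard deviation of about 27.18 gms for the apple data.

```python
import math

def combined_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and standard deviation of two groups."""
    mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    d1 = mean - mean1                    # deviation of group 1 mean from combined mean
    d2 = mean - mean2                    # deviation of group 2 mean from combined mean
    variance = (n1 * (sd1 ** 2 + d1 ** 2) + n2 * (sd2 ** 2 + d2 ** 2)) / (n1 + n2)
    return mean, math.sqrt(variance)

# Apples from area A (100 apples) and area B (200 apples)
mean, sd = combined_sd(100, 150, 10, 200, 200, 15)
print(round(mean, 2), round(sd, 2))
```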

4.12 Coefficient of Variation


When we want to compare two different sets of values pertaining to different
characteristics or pertaining to same characteristic, then we use coefficient
of variation (CV). It is a relative measure expressed in percentage and is
defined as:
$$CV \text{ in } \% = \frac{S.D.}{\text{Mean}} \times 100$$
It is used to compare the homogeneity or stability or uniformity or
consistency of two or more data sets. A low value of coefficient of variation
indicates a low degree of variation.
Solved Problem 32: Find the standard deviations of the two series shown in
table 4.29a. State which series is more stable.
Table 4.29a. Data of series A and series B
Series A   192, 288, 236, 229, 184, 260, 384, 291, 330, 243
Series B   31, 48, 13, 51, 38, 43, 50, 36, 47, 82

Solution: The table 4.29b displays the values required to calculate the
coefficient of variation for data of series A.

Table 4.29b. Required values for series A

Series A (X)   d = X – 260   d²
192            -68           4624
288            28            784
236            -24           576
229            -31           961
184            -76           5776
260            0             0
384            124           15376
291            31            961
330            70            4900
243            -17           289
Total          +37           34247

$$\bar{X} = 260 + \frac{37}{10} = 263.7$$

$$\sigma^2 = \frac{34247}{10} - \left(\frac{37}{10}\right)^2 = 3424.7 - 13.69 = 3411.01 = (58.4)^2$$

$$\sigma = 58.4$$

$$CV\% = \frac{58.4}{263.7} \times 100 = 22.15\%$$

The coefficient of variation for the data of series A is 22.15%.

The table 4.29c displays the values required to calculate the coefficient of
variation for data of series B.
Table 4.29c. Required values for series B

Series B (X)   X²
31             961
48             2304
13             169
51             2601
38             1444
43             1849
50             2500
36             1296
47             2209
82             6724
Total 439      22057

$$\bar{X} = \frac{439}{10} = 43.9$$

$$\sigma^2 = \frac{22057}{10} - \left(\frac{439}{10}\right)^2 = 2205.7 - (43.9)^2 = 2205.7 - 1927.21 = 278.49$$

$$\sigma = \sqrt{278.49} = 16.69$$

$$CV\% = \frac{16.69}{43.9} \times 100 = 38.02\%$$

The series A is more stable, since the CV for series A (22.15) is less than
the CV for series B (38.02).
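The comparison of the two series can be reproduced with Python's standard `statistics` module. This is a sketch under two assumptions: the sixth value of series A is taken as 260, as the tabulated deviation d = 0 from the assumed mean 260 implies, and the population standard deviation (`pstdev`, divisor N) is used to match the formulas of this unit. The function name is illustrative.

```python
import statistics

def coefficient_of_variation(data):
    """CV in % = population standard deviation / mean x 100."""
    return statistics.pstdev(data) / statistics.fmean(data) * 100

series_a = [192, 288, 236, 229, 184, 260, 384, 291, 330, 243]
series_b = [31, 48, 13, 51, 38, 43, 50, 36, 47, 82]

cv_a = coefficient_of_variation(series_a)
cv_b = coefficient_of_variation(series_b)
print(round(cv_a, 2), round(cv_b, 2))
print("More stable series:", "A" if cv_a < cv_b else "B")
```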

Self Assessment Questions


5. State whether the following statements are true 'T' or false 'F'.
i. Standard deviation is based on all values.
ii. Standard deviation of a set of values is increased if every value of
the set is increased by a constant.
iii. Standard deviation can be calculated for distributions with open-end
classes.

iv. C.V % can be used to compare the variability of two sets of data
measuring the same characteristics.

4.13 Summary
The measures of central tendency and measures of dispersion summarise
mass data in terms of its two important features.
i. With respect to nature of data to cluster around a central value
ii. With respect to their spread from their central value
Arithmetic mean is defined as the sum of all values divided by number of
values.
Median of a set of values is the middle most value when the values are
arranged in the ascending order of magnitude.
Mode is the value which has the highest frequency.
The measures of variations are:
i. Range (R)
ii. Quartile Deviations ( Q.D)
iii. Mean Deviations (M.D)
iv. Standard Deviations (S.D)
Coefficient of variation is a relative measure expressed in percentage and is
defined as:

$$CV \text{ in } \% = \frac{S.D.}{\text{Mean}} \times 100$$
4.14 Terminal Questions
1. In an office there are 32 employees. Their salaries in Indian rupees are
as given in table 4.30. Find the mean salary per day.
Table 4.30. Salaries of 32 employees

Salary / day   60   70   80   90   100   120
Employees      3    5    8    10   4     2

2. A survey of 128 smokers gave the results represented in table 4.31,
which is the frequency distribution of smokers' daily expenses on
smoking. Find the mean expenses and standard deviation. Determine the
coefficient of variation.

Table 4.31. Survey results of 128 smokers

Expenditure (Rs.)   10 - 20   20 - 30   30 - 40   40 - 50   50 - 60   60 - 70   70 - 80
No. of Smokers      23        44        35        12        9         3         2

3. The average price/kg of Grade “A” tea is Rs.120 and that of grade “B”
tea is Rs.100. A trader mixes them and sells the mixture for Rs.115.
Find proportion of grade A and grade B in the mixture.
4. For the distribution shown in table 4.32, find the median and mode.
Table 4.32. Distribution data for terminal question 4

% Marks           0 - 10   10 - 20   20 - 30   30 - 40   40 - 50   50 - 60   60 - 70
No. of Students   4        9         19        20        18        7         80

5. Find the geometric mean of the following distribution


Table 4.33. Distribution data for terminal question 5

X 110 115 118 119 120


f 4 11 21 6 2

6. Find the harmonic mean of the following distribution


Table 4.34. Distribution data for terminal question 6

X 121 122 123 124 125


f 5 25 36 37 20

7. Find the quartile deviation and the coefficient of quartile deviation for
the data shown in table 4.35.
Table 4.35. Distribution data for terminal question 7

Age group                             15-20   20-25   25-30   30-35   35-40   40-45   Above 45
% of people who exercise regularly    15      31      19      15      8       7       7


8. Given sum of upper and lower quartiles as 122 and their difference as
23; find the quartile deviation of the series.
9. If C.V% = 22 and S.D = 4. Find the mean.
10. The table 4.36 shows the distribution of age at the time of first delivery
of 65 women. Find mean deviation from mean and median.
Table 4.36. Distribution of age at the time of first delivery of 65 women
Age 18 – 22 22 – 26 26 – 30 30 – 34 34 – 38
Frequency 20 30 11 3 1

11. Read the data given below and find the combined mean, S.D and
coefficient of variation.
n1 = 15, n2 = 20
X̄1 = 40, X̄2 = 50
σ1 = 3, σ2 = 5
12. Mean and Standard deviation of lengths of tails of 8 rats were found to
be 4.7 cm and 0.8 cm respectively. However, one reading was taken as
3.6 cm instead of 6.3 cm; find the corrected mean and standard
deviation.

4.15 Answers to SAQs and TQs

Answers to Self Assessment Questions


1 i- T, ii- F, iii – T, iv - F
2 i – F, ii – F, iii – T, iv – F, v – F, vi – F, vii – T, viii – T
3 i – T, ii – T, iii - F
4 i- T, ii - T
5 i – T, ii – F, iii – F, iv – T
'T' denotes 'True'
'F' denotes 'False'

Answers to Terminal Questions


1. Rs. 84.69
2. 31.64
3. 3:1


4. 34
5. 116.7
6. 123.33
7. Q.D = 11.07
Coefficient of Q.D = 0.338
8. 11.5
9. 18.18
10. 2.462
11. Combined Mean = 45.7
Combined S.D = 6.53,
C.V in % = 14.29
12. Corrected Mean = 5.0375 cm
Corrected S.D = 0.8336 cm

4.16 References
• B. L. Agarwal, (2006) Basic Statistics, Fourth Edition, New Age
International Publishers
• Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited


Unit 5 Probabilities
Structure:
5.1 Introduction
Learning objectives
Definition of probability
Basic terminology used in probability theory
Approaches to probability
5.2 Rules of Probability
Addition rule
Multiplication rule
5.3 Conditional Probability
5.4 Steps Involved in Solving Problems on Probability
5.5 Bayes’ Probability
5.6 Random Variables
5.7 Summary
5.8 Terminal Questions
5.9 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
5.10 References

5.1 Introduction
In Unit 4, ‘Measures used to Summarise Data’, you studied the measures of
central tendency and the measures of dispersion. In Unit 5,
‘Probabilities’, you will study ways of reducing the uncertainty
involved in our day to day lives by using probability theory.
Every human activity has an element of uncertainty. Uncertainty affects the
decision making process. In your daily lives, you very often use the word
‘probably’, like, probably it may rain today; probably the share price may go
up in the next week. Therefore, there is a need to handle uncertainty
systematically and scientifically.
Mathematicians and statisticians developed a separate field of mathematics
and named it as ‘Probability Theory’. The theory of probability helps us to
make wiser decisions by reducing the degree of uncertainty.


5.1.1 Learning objectives


By the end of this unit, you should be able to:
• Examine the use of probability theory in decision making
• Recognise the approaches to probability
• Apply the rules of probability for calculating different kinds of probabilities
• Apply the Bayes’ probability theorem by taking new information into
account
• Apply the concept of random variables to real life situations
5.1.2 Definition of probability
Probability is a numerical measure which indicates the chance of
occurrence of an event ‘A’. It is denoted by P(A). It is the ratio of the
number of outcomes favourable to the event ‘A’ (m) to the total number of
outcomes of the experiment (n). In other words:

$$P(A) = \frac{m}{n}$$

where, ‘m’ is the number of favourable outcomes of an event ‘A’ and ‘n’ is
the total number of outcomes of the experiment.

Key Statistic
The probability of event A [denoted P(A)], must lie within the interval
from 0 to 1.

5.1.3 Basic terminology used in probability theory


Experiment
An operation that results in a definite outcome is called an experiment.
Tossing a coin is an experiment if it shows head (H) or tail (T) on falling. The
figure 5.1 illustrates that if a coin stands on its edge, then it is not
considered as an experiment.

Fig. 5.1: If a coin stands on its edge, it is not an experiment.


Random Experiment
When the outcome of an experiment cannot be predicted, then it is called a
random experiment or stochastic experiment.
Sample Space
Sample Space or total number of outcomes of an experiment is the set of all
possible outcomes of a random experiment and is denoted by ‘S’.

Example 1
In tossing of coins, the outcomes are head and tail. The head is denoted
as ‘H’ and the tail as ‘T’. In tossing two coins, the sample space ‘S’ is
given by:
S = {HH, HT, TH, TT}
The number of outcomes is denoted by n(S). Here, n(S) = 4.

Key Statistic
If the number of outcomes is finite then it is called a finite sample space,
otherwise it is called an infinite sample space.

Event
Events may be a single outcome or combination of outcomes. Event is a
subset of sample space.

Example 2
In tossing a coin, getting a head (event A) is a single outcome. Therefore,

$$P(A) = \frac{1}{2}$$

In tossing two fair coins, for getting exactly one head (event A) the possible
outcomes are HT and TH. The sample space consists of
HH, HT, TH, and TT. Therefore,

$$P(A) = \frac{2}{4} = \frac{1}{2}$$


Equally likely events


Two or more events are said to be equally likely if they have equal chance
of occurrence.

Example 3
In tossing an unbiased coin, getting head and tail are equally likely.

Mutually exclusive events


Two or more events are said to be mutually exclusive if the occurrence of
one prevents the occurrence of other events.

Example 4
In tossing a coin, if head falls, it prevents the occurrence of tail and vice
versa.

Exhaustive set of events


A set of events is exhaustive if one or other of the events in the set occurs
whenever the experiment is conducted. It can be defined also as the set
whose sum of sample points forms the total sample points of the
experiment.
Complementation of an event
The probability of the complement of an event ‘A’ is given by:

$$P(A^c) = 1 - P(A)$$
Independent events
Two events are said to be independent of each other if the occurrence of
one is not affected by the occurrence of other or does not affect the
occurrence of the other.


Example 5
Consider tossing of three fair coins as shown in figure 5.2. Then,
S = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}
Let:
• A be the event of getting three heads
• B be the event of getting two heads
• C be the event of getting one head
• D be the event of not getting a head

Fig. 5.2: Tossing three fair coins

Then, the outcomes for events A, B, C, and D are:
A = {HHH}; B = {HHT, HTH, THH}; C = {HTT, THT, TTH}; D = {TTT}
Events A, B, C and D are mutually exclusive and exhaustive but not
equally likely.
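The claims of this example can be verified by enumerating the sample space. The sketch below is illustrative; the variable names are assumptions, and events are classified by the number of heads in each outcome.

```python
from fractions import Fraction
from itertools import product

# All 2 x 2 x 2 = 8 outcomes of tossing three coins
sample_space = ["".join(toss) for toss in product("HT", repeat=3)]

# Events A, B, C, D: exactly 3, 2, 1 and 0 heads respectively
events = {
    name: {outcome for outcome in sample_space if outcome.count("H") == heads}
    for name, heads in (("A", 3), ("B", 2), ("C", 1), ("D", 0))
}

# Classical probability: favourable outcomes / total outcomes
probabilities = {name: Fraction(len(outcomes), len(sample_space))
                 for name, outcomes in events.items()}

names = sorted(events)
mutually_exclusive = all(events[a].isdisjoint(events[b])
                         for i, a in enumerate(names) for b in names[i + 1:])
exhaustive = set().union(*events.values()) == set(sample_space)
equally_likely = len(set(probabilities.values())) == 1

print(probabilities)
print(mutually_exclusive, exhaustive, equally_likely)
```

Using `Fraction` keeps the probabilities exact (1/8, 3/8, 3/8, 1/8), which makes it clear at a glance that the four events are not equally likely.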

5.1.4 Approaches to probability


There are four approaches to probability. The figure 5.3. shows
the four approaches to probability. They are:
i) Classical / Mathematical / Priori approach
ii) Statistical / Relative frequency / Empirical / Posteriori approach
iii) Subjective approach
iv) Axiomatic approach


Fig. 5.3: Approaches to probability

Classical / Mathematical / Priori approach


Under this approach the probability of an event is known before conducting
the experiment.
The following are some of the examples of classical approach.
a) Getting a head in tossing a coin
b) Drawing a king from well shuffled pack
c) Getting a ‘6’ in throwing a die.
The probability of an event ‘A’ is defined as:

$$P(A) = \frac{m}{n}$$

where, ‘m’ is the number of favourable outcomes, ‘n’ is the total number of
outcomes of the experiment.
However, it is not possible to give probability to all events of our life. We
cannot attach a definite probability to the event ‘that it will rain today’.
Statistical / Relative Frequency / Empirical / Posteriori approach
Under this approach the probability of an event is arrived at after conducting
an experiment. If we want to know the probability that a particular household
in an area will have two earning members, then we have to gather data on
the households in that area and arrive at the probability. The greater the
number of households surveyed, the more accurate the probability arrived at.
The probability of an event ‘A’ in this case is defined as:
P(A) = Limit (m / n) as n → ∞
where ‘m’ is the number of times ‘A’ occurs in ‘n’ trials.
In real life, it is not always possible to conduct such experiments because of
high cost, the destructive nature of some experiments, or the vast area to be
covered.
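
The relative frequency idea can be illustrated with a simulation. This is only a sketch under stated assumptions: the die, the seed and the function name `relative_frequency` are illustrative and do not come from the module.

```python
import random


def relative_frequency(trials, seed=0):
    """Estimate P(die shows 6) as m/n from simulated throws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    m = sum(1 for _ in range(trials) if rng.randint(1, 6) == 6)
    return m / trials


# As n grows, m/n tends to settle near the classical value 1/6.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(n))
```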
Subjective approach
Under this approach the investigator or researcher assigns probability to the
events either from personal experience or from past records. It is more
suitable when the sample size is ten or fewer. The investigator has full
knowledge about the characteristics of each and every individual. However,
there is a chance of personal bias being introduced in such probability.
Axiomatic approach
This approach is based on set theory. The probability of an event is defined
as:
P(A) = n(A) / n(S), such that:
a. 0 ≤ P(Ai) ≤ 1
b. Σ P(Ai) = 1, for i = 1 to n
where A1, A2, …, An are ‘n’ mutually exclusive and exhaustive events.

Self Assessment Questions


1. To which approach does each of the following probability estimates belong?
i. The probability that India will win the game.
ii. The probability that Mr. Ram will resign from the post.
iii. Probability of drawing a red card.
iv. Probability that you will go to America this year.

5.2 Rules of Probability


Managers very often come across situations where they have to take
decisions about implementing either course of action A or course of action B
or course of action C. Sometimes, they have to take decisions regarding the
implementation of both A and B.


Example 6
A sales manager may like to know the probability that he will exceed the
target for product A or product B. Sometimes, he would like to know the
probability that the sales of both product A and product B will exceed the
target. The first type of probability is answered by the addition rule. The
second type of probability is answered by the multiplication rule.

5.2.1 Addition rule


The addition rule of probability states that:
i) If ‘A’ and ‘B’ are any two events then the probability of the occurrence of
either ‘A’ or ‘B’ is given by:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
ii) If ‘A’ and ‘B’ are two mutually exclusive events then the probability of
occurrence of either A or B is given by:
P(A ∪ B) = P(A) + P(B)
iii) If A, B and C are any three events then the probability of occurrence of
either A or B or C is given by:
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(B ∩ C) - P(A ∩ C) + P(A ∩ B ∩ C)
In terms of Venn diagrams, figure 5.4 illustrates the probability of
occurrence of either event ‘A’ or event ‘B’ when ‘A’ and ‘B’ overlap, that
is, when they are not mutually exclusive. Figure 5.5 illustrates the
probability of occurrence of either ‘A’ or ‘B’ when ‘A’ and ‘B’ are mutually
exclusive. Figure 5.6 illustrates the probability of occurrence of either ‘A’
or ‘B’ or ‘C’ when ‘A’, ‘B’ and ‘C’ overlap.

Fig. 5.4: A ∪ B for two events A and B that are not mutually exclusive
Fig. 5.5: A ∪ B for two mutually exclusive events A and B
Fig. 5.6: A ∪ B ∪ C for three events A, B and C that are not mutually exclusive


iv) If A1, A2, A3, …, An are ‘n’ mutually exclusive and exhaustive events
then the probability of occurrence of at least one of them is given by:
P(A1 ∪ A2 ∪ … ∪ An) = P(A1) + P(A2) + … + P(An)

5.2.2 Multiplication rule
If ‘A’ and ‘B’ are two independent events then the probability of occurrence
of ‘A’ and ‘B’ is given by:
P(A ∩ B) = P(A) × P(B)

5.3 Conditional Probability


Sometimes we wish to know the probability that the price of a particular
petroleum product will rise, given that the finance minister has increased the
petrol price. Such probabilities are known as conditional probabilities.
Thus the conditional probability of occurrence of an event ‘A’ given that the
event ‘B’ has already occurred is denoted by P (A / B). Here, ‘A’ and ‘B’ are
dependent events. Therefore, we have the following rules.
If ‘A’ and ‘B’ are dependent events, then the probability of occurrence of ‘A
and B’ is given by:
P(A ∩ B) = P(A) × P(B/A) = P(B) × P(A/B)
It follows that:
P(A/B) = P(A ∩ B) / P(B)
P(B/A) = P(A ∩ B) / P(A)

For any bivariate distribution, there exist two marginal distributions and
‘m + n’ conditional distributions, where ‘m’ and ‘n’ are the numbers of
classifications/characteristics studied on the two variables.


Example 7
Consider the example of a librarian who analysed the type of visitors and
their choice of library section. The data is represented in table 5.1a.
Table 5.1a: Bivariate distribution

Type of visitors                          Sections
(level of education)   News paper   Magazine   Novel (story)   Subject   Total
Under graduates            50          100          120           50      320
Graduates                  70           90           50          100      310
Post graduates            100           60           30          150      340
Total                     220          250          200          300      970

We can get the following distributions.

i) The table 5.1b represents the distribution of level of education
irrespective of the sections. Therefore, it is called a marginal
distribution.

Table 5.1b: Marginal distribution of level of education irrespective of
sections

Type of visitors     Frequency
Under graduates         320
Graduates               310
Post graduates          340
Total                   970

ii) The table 5.1c represents the distribution of people in sections
irrespective of their educational levels. It is another marginal
distribution. Thus, there are two marginal distributions for bivariate
data, the variables being sections and level of education.

Table 5.1c: Marginal distribution of people irrespective of their educational
levels

News paper   Magazine   Novels   Subjects   Total
   220          250       200       300      970

iii) The table 5.1d represents the distribution of people in sections given
that they are under graduates. Therefore, it is a conditional
distribution.

Table 5.1d: Conditional distribution

Level of education   News paper   Magazine   Novels   Subjects   Total
Under graduates          50          100       120       50       320

Thus for any bivariate distribution having ‘m’ and ‘n’ classifications there
exist two marginal distributions and ‘m + n’ conditional distributions. In
this case there are 3 + 4 = 7 conditional distributions.
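
The marginal and conditional distributions of Example 7 can be reproduced from the counts in table 5.1a. The sketch below is illustrative only; the dictionary layout and the helper `conditional_given_level` are assumptions made for this example.

```python
from fractions import Fraction

sections = ["News paper", "Magazine", "Novel", "Subject"]
# Counts from table 5.1a, one row per level of education.
counts = {
    "Under graduates": [50, 100, 120, 50],
    "Graduates": [70, 90, 50, 100],
    "Post graduates": [100, 60, 30, 150],
}

# Marginal distributions: row totals (table 5.1b) and column totals (table 5.1c).
row_totals = {level: sum(row) for level, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())


def conditional_given_level(level):
    """Relative frequencies of the sections given one level of education."""
    row = counts[level]
    return {s: Fraction(c, row_totals[level]) for s, c in zip(sections, row)}


# For example, P(Novel / Under graduate) = 120/320.
p_novel_given_ug = conditional_given_level("Under graduates")["Novel"]
print(row_totals, col_totals, grand_total, p_novel_given_ug)
```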


5.4 Steps Involved in Solving Problems on Probability


The figure 5.7 gives the explanation of steps involved in solving problems on
probability.

Fig. 5.7: Steps involved in solving problems on probability

Solved Problem 1: Calculate nCr for the following values of ‘n’ and ‘r’:
i. n = 10 and r = 2
ii. n = 16 and r = 3
Solution:
i. 10C2 = (10 × 9) / (1 × 2) = 45
ii. 16C3 = (16 × 15 × 14) / (1 × 2 × 3) = 560
The value of 16C3 is 560.
Key Statistic
nCr = nC(n-r)
nC0 = nCn = 1
0! = 1

Solved Problem 2: Calculate 16C13.
Solution:
16C13 = 16C(16-13) = 16C3 = 560
The value of 16C13 is 560.
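
These values can be checked with Python's standard library function `math.comb`. This is a small sketch added for illustration; the variable names are assumptions.

```python
from math import comb

c_10_2 = comb(10, 2)    # 10C2
c_16_3 = comb(16, 3)    # 16C3
c_16_13 = comb(16, 13)  # 16C13, equal to 16C3 by the rule nCr = nC(n-r)

print(c_10_2, c_16_3, c_16_13)
```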

Solved Problem 3: Find the probability of getting a head when a coin is
tossed.
Solution: Let ‘A’ be the event of getting a head.
S = {H, T}
∴ n(S) = 2
n(A) = 1
∴ P(A) = n(A) / n(S) = 1/2
Therefore, the probability of getting a head when a coin is tossed is 0.5.
Solved Problem 4: What is the probability of getting two heads when 3
coins are tossed and what is the probability of getting at least two heads?

Solution:
i) Let ‘A’ be the event of getting two heads.
S = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT} ⇒ n(S) = 8
A = {HHT, HTH, THH} ⇒ n(A) = 3
P(A) = n(A) / n(S) = 3/8

Therefore, the probability of getting two heads when three coins are
tossed is 3/8.

ii) Let ‘A’ be the event of getting at least two heads.
A = {HHH, HHT, HTH, THH} ⇒ n(A) = 4
∴ P(A) = 4/8 = 1/2

Therefore, the probability of getting at least two heads when three coins are
tossed is 1/2.
Solved Problem 5: What is the probability of getting a sum of ‘Nine’ when
two dice are thrown?
Solution: Let ‘A’ be the event of getting a sum of ‘Nine’.
n(S) = 6² = 36
A = {(6,3), (3,6), (4,5), (5,4)}
n(A) = 4
∴ P(A) = 4/36 = 1/9

Therefore, the probability of getting a sum of ‘Nine’ when two dice are
thrown is 1/9.
Solved Problem 6: What is the probability of getting at least a sum of ‘nine’
when two dice are thrown?

Solution:
Let ‘A’ be the event of getting a sum of at least nine.
n(S) = 6² = 36
A is the union of the mutually exclusive events of getting a sum of 9
or 10 or 11 or 12.
A = {(6,3), (3,6), (5,4), (4,5), (6,4), (4,6), (5,5), (6,5), (5,6), (6,6)} ⇒ n(A) = 10
∴ P(A) = 10/36 = 5/18

Therefore, the probability of getting at least a sum of ‘nine’ when two dice
are thrown is 5/18.
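
Solved Problems 5 and 6 can be verified by listing all 36 outcomes. The following Python sketch is illustrative; the helper `prob` is a hypothetical name introduced here.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing two dice.
rolls = list(product(range(1, 7), repeat=2))


def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    favourable = [r for r in rolls if event(r)]
    return Fraction(len(favourable), len(rolls))


p_sum_nine = prob(lambda r: sum(r) == 9)
p_at_least_nine = prob(lambda r: sum(r) >= 9)
print(p_sum_nine, p_at_least_nine)
```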
Solved Problem 7: A number is selected at random from the numbers 1 to
30. What is the probability that:
i. It is divisible by either 3 or 7
ii. It is divisible by 5 or 13
Solution:
i) Let ‘A’ be the event of selecting a number divisible by 3. Let ‘B’ be the
event of selecting a number divisible by 7.
n(S) = 30C1 = 30
A = {3, 6, 9, 12, 15, 18, 21, 24, 27, 30} ⇒ n(A) = 10
B = {7, 14, 21, 28} ⇒ n(B) = 4
A ∩ B = {21} ⇒ n(A ∩ B) = 1
A and B are not mutually exclusive.
∴ P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
= 10/30 + 4/30 - 1/30 = 13/30
Therefore, the probability that a number is divisible by 3 or 7 is 13/30.

ii) Let ‘A’ be the event of selecting a number divisible by 5. Let ‘B’ be the
event of selecting a number divisible by 13.
n(S) = 30C1 = 30
A = {5, 10, 15, 20, 25, 30} ⇒ n(A) = 6
B = {13, 26} ⇒ n(B) = 2
A and B are mutually exclusive.
∴ P(A ∪ B) = P(A) + P(B)
= 6/30 + 2/30 = 8/30 = 4/15

Therefore, the probability that a number is divisible by 5 or 13 is 4/15.
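
The addition rule used in Solved Problem 7 can be checked with Python sets. This is a minimal sketch; `divisible_by` and `P` are hypothetical helper names.

```python
from fractions import Fraction

S = set(range(1, 31))  # the numbers 1 to 30


def divisible_by(k):
    return {x for x in S if x % k == 0}


def P(event):
    return Fraction(len(event), len(S))


# i) Divisible by 3 or 7: the events overlap (21 belongs to both).
A, B = divisible_by(3), divisible_by(7)
p_3_or_7 = P(A) + P(B) - P(A & B)

# ii) Divisible by 5 or 13: the events are mutually exclusive.
C, D = divisible_by(5), divisible_by(13)
p_5_or_13 = P(C) + P(D)

print(p_3_or_7, p_5_or_13)
```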


Solved Problem 8: The Board of Directors of a company wants to form a
five-member quality management committee to monitor the quality of their
products. The company has 5 scientists, 4 engineers and 6 accountants. If the
members are chosen at random, find the probability that the committee will
contain 2 scientists, 1 engineer and 2 accountants.

Solution: Let ‘A’ be the event of selecting 2 scientists, 1 engineer and 2
accountants. Then,
n(S) = 15C5 = (15 × 14 × 13 × 12 × 11) / (1 × 2 × 3 × 4 × 5) = 3003
n(A) = 5C2 × 4C1 × 6C2
= [(5 × 4) / (1 × 2)] × 4 × [(6 × 5) / (1 × 2)] = 10 × 4 × 15 = 600
∴ P(A) = 600/3003
Therefore, the probability that the committee will contain 2 scientists,
1 engineer and 2 accountants is 600/3003.
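
The counts in Solved Problem 8 can be reproduced with `math.comb`. This is an illustrative sketch with assumed variable names. Note that `Fraction` reduces 600/3003 to its lowest terms, 200/1001.

```python
from fractions import Fraction
from math import comb

# Ways to choose any 5 people out of 5 + 4 + 6 = 15.
n_S = comb(15, 5)
# Ways to choose 2 of 5 scientists, 1 of 4 engineers and 2 of 6 accountants.
n_A = comb(5, 2) * comb(4, 1) * comb(6, 2)

p_committee = Fraction(n_A, n_S)
print(n_S, n_A, p_committee)
```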


Solved Problem 9: The odds in favour of a person hitting a target are 3 to 5.
The odds against another person hitting the target are 3 to 2. If each of
them fires once at the target, find the probability that:
i) Both of them hit the target
ii) At least one of them hits the target
Solution:
i) Let ‘A’ be the event of the first person hitting the target. Odds in
favour of 3 to 5 mean,
∴ P(A) = 3 / (3 + 5) = 3/8 (1st ratio)
Let ‘B’ be the event of the second person hitting the target. Odds against
of 3 to 2 mean,
∴ P(B) = 2 / (3 + 2) = 2/5 (2nd ratio)
Both hitting the target means A ∩ B, and A and B are independent.
∴ P(A ∩ B) = P(A) × P(B) = 3/8 × 2/5 = 3/20
Therefore, the probability that both persons hit the target is 3/20.
ii) As before, P(A) = 3/8 and P(B) = 2/5. At least one of them hitting the
target means A ∪ B.
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
= 3/8 + 2/5 - 3/20 = (15 + 16 - 6)/40 = 25/40 = 5/8

Therefore, the probability that at least one of the persons hits the target is
5/8.
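
The conversion from odds to probabilities in Solved Problem 9 can be sketched as follows. The two helper functions are hypothetical names introduced for this illustration, and the independence of the two shooters is assumed, as in the solution.

```python
from fractions import Fraction


def prob_from_odds_in_favour(a, b):
    """Odds in favour of a to b give probability a / (a + b)."""
    return Fraction(a, a + b)


def prob_from_odds_against(a, b):
    """Odds against of a to b give probability b / (a + b)."""
    return Fraction(b, a + b)


p_a = prob_from_odds_in_favour(3, 5)  # first person
p_b = prob_from_odds_against(3, 2)    # second person

p_both = p_a * p_b                    # multiplication rule (independent events)
p_at_least_one = p_a + p_b - p_both   # addition rule
print(p_a, p_b, p_both, p_at_least_one)
```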

Solved Problem 10: The probabilities that drivers A, B and C will drive
home safely after consuming liquor are 2/5, 3/7 and 3/4, respectively. What
is the probability that all of them will drive home safely after consuming
liquor?

Solution: Let ‘A’ be the event of driver ‘A’ driving safely after consuming
liquor. Let ‘B’ be the event of driver ‘B’ driving safely after consuming liquor.
Let ‘C’ be the event of driver ‘C’ driving safely after consuming liquor.
Given P(A) = 2/5, P(B) = 3/7, P(C) = 3/4
The events A, B, and C are independent. Therefore,
P(A ∩ B ∩ C) = P(A) × P(B) × P(C)
= 2/5 × 3/7 × 3/4 = 18/140 = 9/70
Therefore, the probability that all the drivers will drive home safely after
consuming liquor is 9/70.
Solved Problem 11: The probabilities that ‘A’ and ‘B’ will tell the truth are
2/3 and 4/5 respectively. What is the probability that:
i) They agree with each other
ii) They contradict each other while giving a testimony in the court
Solution:
i) Let ‘A’ be the event of A telling the truth. Let ‘B’ be the event of B
telling the truth.
Given P(A) = 2/3 ⇒ P(Aᶜ) = 1 - P(A) = 1/3
P(B) = 4/5 ⇒ P(Bᶜ) = 1/5
Both will agree if they both tell the truth or both lie, that is,
(A ∩ B) or (Aᶜ ∩ Bᶜ)
They are mutually exclusive. Therefore,
P[(A ∩ B) ∪ (Aᶜ ∩ Bᶜ)] = P(A ∩ B) + P(Aᶜ ∩ Bᶜ) = P(A) × P(B) + P(Aᶜ) × P(Bᶜ)
= 2/3 × 4/5 + 1/3 × 1/5 = 9/15 = 3/5
since the events A and B are independent.
Therefore, the probability that both A and B agree with each other is 3/5.

ii) They will contradict each other if A tells the truth and B lies, or B
tells the truth and A lies, that is,
(A ∩ Bᶜ) or (Aᶜ ∩ B)
They are mutually exclusive. Therefore,
P[(A ∩ Bᶜ) ∪ (Aᶜ ∩ B)] = P(A ∩ Bᶜ) + P(Aᶜ ∩ B) = P(A) × P(Bᶜ) + P(Aᶜ) × P(B)
= 2/3 × 1/5 + 1/3 × 4/5 = 6/15 = 2/5
since they are independent.
Therefore, the probability that A and B contradict each other is 2/5.
Solved Problem 12: A box contains five red and four blue similar shaped
balls. Two balls are drawn at random from the box. Find the probability that
both of them are red if:
i. the balls are drawn together
ii. the balls are drawn one after the other, with replacement
iii. the balls are drawn one after the other, without replacement
Solution:
i) Let ‘A’ be the event of drawing two red balls together.
n(S) = 9C2 = (9 × 8) / (1 × 2) = 36
n(A) = 5C2 = (5 × 4) / (1 × 2) = 10
∴ P(A) = 10/36 = 5/18
Therefore, the probability that both of them are red if the balls are drawn
together is 5/18.
ii) Let ‘A’ be the event of drawing a red ball in the first draw. Let ‘B’ be the
event of drawing a red ball in the second draw. The required probability
is given by:
P(A ∩ B) = P(A) × P(B) = 5/9 × 5/9 = 25/81
since the sample space does not change.
Therefore, the probability that both of them are red if the balls are drawn
one after the other, with replacement, is 25/81.
iii) Let ‘A’ be the event of drawing a red ball in the first draw. Let ‘B’ be the
event of drawing a red ball in the second draw. Since the first ball is not
replaced, the sample space changes for the second draw. Therefore the
required probability is given by:
P(A ∩ B) = P(A) × P(B/A) = 5/9 × 4/8 = 5/18
Therefore, the probability that both of them are red if the balls are drawn
one after the other, without replacement, is 5/18.
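
The three cases of Solved Problem 12 can be compared side by side. The Python sketch below is illustrative only; the variable names are assumptions.

```python
from fractions import Fraction
from math import comb

red, blue = 5, 4
total = red + blue

# i) Two balls drawn together: favourable pairs / all pairs.
p_together = Fraction(comb(red, 2), comb(total, 2))

# ii) One after the other, with replacement: the draws are independent.
p_with_replacement = Fraction(red, total) * Fraction(red, total)

# iii) Without replacement: the second draw is conditional on the first.
p_without_replacement = Fraction(red, total) * Fraction(red - 1, total - 1)

print(p_together, p_with_replacement, p_without_replacement)
```

Drawing two balls together and drawing them one after the other without replacement give the same probability, as the solution shows.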
Solved Problem 13: Box I contains 5 red and 6 blue balls. Box II contains 6
red and 4 blue balls. A ball is drawn at random from box I and is transferred
to box II. Now from box II a ball is drawn at random. What is the probability
that it is red?
Solution: A ball drawn from box I and transferred to box II could be either
red or blue. Let ‘A’ be the event of drawing a red ball from box I. Let ‘B’ be
the event of drawing a blue ball from box I. Let ‘C’ be the event of drawing a
red ball from box II.
∴ The required events are (A ∩ C) or (B ∩ C).
The events are mutually exclusive. After the transfer, box II holds 11 balls,
of which 7 are red if ‘A’ occurred and 6 are red if ‘B’ occurred. Therefore,
P[(A ∩ C) ∪ (B ∩ C)] = P(A ∩ C) + P(B ∩ C) = P(A) × P(C/A) + P(B) × P(C/B)
= 5/11 × 7/11 + 6/11 × 6/11 = (35 + 36)/121 = 71/121
Therefore, the required probability is 71/121.
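
The two-stage calculation in Solved Problem 13 can be sketched as a loop over the colour of the transferred ball. The dictionaries and variable names below are assumptions made for illustration.

```python
from fractions import Fraction

box1 = {"red": 5, "blue": 6}
box2 = {"red": 6, "blue": 4}
n1 = sum(box1.values())

p_red_from_box2 = Fraction(0)
for colour, count in box1.items():
    p_transfer = Fraction(count, n1)  # P(A) or P(B)
    after = dict(box2)
    after[colour] += 1                # box II after the transfer
    p_red_given = Fraction(after["red"], sum(after.values()))  # P(C/A) or P(C/B)
    p_red_from_box2 += p_transfer * p_red_given

print(p_red_from_box2)
```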
Solved Problem 14: The probabilities that component A and component B
of a machine will fail are 0.09 and 0.06 respectively. The machine will fail if
any one of them fails. Find the probability that it will fail?

Solution: Let ‘A’ and ‘B’ be the events that component A and component B
fail, respectively. Given that:
P(A) = 0.09, P(B) = 0.06
Assuming that the components fail independently,
P(A ∩ B) = P(A) × P(B) = 0.09 × 0.06 = 0.0054
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.09 + 0.06 - 0.0054 = 0.1446
Therefore, the probability that the machine will fail is 14.46%.
Solved Problem 15: What is the probability of getting 53 Mondays in a leap
year?
Solution: There are 366 days in a leap year. Hence, a leap year has 52 weeks
and 2 days, so it has at least 52 Mondays.
For one more Monday, one of the remaining 2 days must be a Monday. The
possible combinations of the remaining 2 days are:
1. Sunday and Monday
2. Monday and Tuesday
3. Tuesday and Wednesday
4. Wednesday and Thursday
5. Thursday and Friday
6. Friday and Saturday
7. Saturday and Sunday
∴ n(S) = 7 and n(A) = 2
∴ P(A) = 2/7
where A is the event of getting 53 Mondays.
Therefore, the probability of getting 53 Mondays in a leap year is 2/7.

Self Assessment Questions


2. Find the probabilities in the following cases:
i. Getting an even number when a die is thrown
ii. Selecting two ‘y’s’ from the letters x, x, x, x, y, y, y
iii. Selecting a King and Queen from a pack of cards, when two cards
are drawn at a time
iv. Getting 53 Mondays in ordinary year
3. Given P(A) = 0.6, P(B) = 0.7, and P(A ∩ B) = 0.5, find P(A ∪ B).


5.5 Bayes’ Probability


Let A1, A2, A3, A4 be mutually exclusive and exhaustive events of a random
experiment. Let ‘B’ be a common event. The figure 5.8 is the representation
of Bayes’ theorem in Venn diagram.

Fig. 5.8: Bayes’ theorem

The event ‘B’ is made up of four mutually exclusive and exhaustive events.
P(B) = P(A1 ∩ B) + P(A2 ∩ B) + P(A3 ∩ B) + P(A4 ∩ B)
= Σ P(Ai ∩ B) ……(1) [by using the Law of Marginal Probability]
We know that:
P(A1 ∩ B) = P(B) × P(A1/B) ……(2) [by the Law of Conditional
Probability for dependent events]
= P(A1) × P(B/A1) ……(3)
Consider:
P(A1/B) = P(A1 ∩ B) / P(B) …… from equation (2) above
= [P(A1) × P(B/A1)] / [Σ P(Ai ∩ B)] …… numerator from (3)

In general, Bayes’ theorem states that if A1, A2, …, An are ‘n’
mutually exclusive and exhaustive events and ‘B’ is an event common to all
of them, then the probability of occurrence of A1 given that ‘B’ has already
occurred is given by:
P(A1/B) = [P(A1) × P(B/A1)] / [Σ P(Ai) × P(B/Ai)]
where the sum runs over i = 1 to n.
Bayes’ probability is also a type of conditional probability. The table 5.2
displays the differences between conditional probability and Bayes’
probability:
Table 5.2: Differences between conditional probability and Bayes’ probability

Bayes’ Probability                         General Conditional Probability
1. Finds the probability of population     1. Finds the probability of getting a
   value, given the sample value.             sample value given the population
                                              value.
2. It is possible to incorporate           2. It is not possible to do so.
   latest information.
3. It is possible to incorporate cost      3. It is not possible in this case.
   aspects.

Whenever there are two probabilities connected with an event, then we
have to apply Bayes’ approach to solve it.
Solved Problem 16: The probabilities that Mr. Aravind, Mr. Anand and Mr.
Akil will become vice-president of a company are 0.40, 0.35 and 0.25
respectively. The probabilities that they will introduce a new product, if
appointed, are 0.10, 0.15 and 0.20 respectively. Given that a new product was
introduced, what is the probability that Mr. Anand became vice-president?
Solution: Let us assume the following:
• Let ‘A1’ be the event that Mr. Aravind became vice-president
• Let ‘A2’ be the event that Mr. Anand became vice-president
• Let ‘A3’ be the event that Mr. Akil became vice-president
• Let ‘B’ be the event that a new product was introduced

We are given that:
P(A1) = 0.40, P(A2) = 0.35, P(A3) = 0.25
P(B/A1) = 0.10, P(B/A2) = 0.15, P(B/A3) = 0.20
The given information can be put in the following form. We note that,
P(Ai ∩ B) = P(Ai) × P(B/Ai)
and P(B) = Σ P(Ai ∩ B)
∴ P(A2/B) = [P(B/A2) × P(A2)] / [Σ P(Ai ∩ B)]

The required probabilities are calculated and represented in the table 5.3.
Table 5.3: Required probabilities for the data in solved problem 16

Event   Prior         Conditional   Joint            Posterior
Ai      probability   probability   probability      probability
        P(Ai)         P(B/Ai)       P(Ai ∩ B)        P(Ai/B)
A1      0.40          0.10          0.0400           0.0400/0.1425 = 0.2807
A2      0.35          0.15          0.0525           0.0525/0.1425 = 0.3684
A3      0.25          0.20          0.0500           0.0500/0.1425 = 0.3509
Total   1.00                        P(B) = 0.1425    1.0000

Therefore, the required probability is P(A2/B) = 0.3684.
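
The columns of table 5.3 can be generated with a short routine. This is a sketch only; the function name `bayes_table` is a hypothetical helper, and exact fractions are used to avoid rounding until the values are printed.

```python
from fractions import Fraction


def bayes_table(priors, likelihoods):
    """Return (joint probabilities, P(B), posterior probabilities)."""
    joints = [p * l for p, l in zip(priors, likelihoods)]
    p_b = sum(joints)                       # P(B) = sum of P(Ai ∩ B)
    posteriors = [j / p_b for j in joints]  # P(Ai/B)
    return joints, p_b, posteriors


priors = [Fraction(40, 100), Fraction(35, 100), Fraction(25, 100)]
likelihoods = [Fraction(10, 100), Fraction(15, 100), Fraction(20, 100)]

joints, p_b, posteriors = bayes_table(priors, likelihoods)
for name, post in zip(("A1", "A2", "A3"), posteriors):
    print(name, round(float(post), 4))
```

The same function can be applied to other problems of this type by changing the two input lists.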


Solved Problem 17: A factory has three machines M1, M2 and M3. They
produce 4,000, 10,000 and 6,000 products per day respectively. From past
records, it is known that M1, M2 and M3 produce 5%, 4% and 8% defectives
respectively. A product is selected at random from the day’s production and
is found to be defective. What is the probability that it was not produced by
machine M3?
Solution: Let us have the following:
• Let ‘A1’ be the event that the product was produced by M1
• Let ‘A2’ be the event that the product was produced by M2
• Let ‘A3’ be the event that the product was produced by M3
• Let ‘B’ be the event that the product is defective.
Then we are given:
P(A1) = 4000/20000 = 0.2
P(A2) = 10000/20000 = 0.5
P(A3) = 6000/20000 = 0.3
P(B/A1) = 0.05, P(B/A2) = 0.04, P(B/A3) = 0.08
The above information is represented in table 5.4.
Table 5.4: Required probabilities for the data in solved problem 17

Event   Prior         Conditional   Joint           Posterior
Ai      probability   probability   probability     probability
        P(Ai)         P(B/Ai)       P(Ai ∩ B)       P(Ai/B)
A1      0.2           0.05          0.010           0.010/0.054 = 0.1852
A2      0.5           0.04          0.020           0.020/0.054 = 0.3704
A3      0.3           0.08          0.024           0.024/0.054 = 0.4444
Total   1.00                        P(B) = 0.054    1.0000

Required probability = 1 - P(A3/B) = 1 - 0.4444 = 0.5556
Hence, the required probability is 0.5556.


Self Assessment Questions


4. State whether the following statements are true or false.
i. Bayes’ probability estimates sample value
ii. Conditional probability can incorporate costs
iii. Bayes’ probability gives up to date information

5.6 Random Variable


If we can assign a real-valued function to every value of the variable in the
sample space, such that:
i. P(Xi) = P[X = Xi] for all values of i
ii. P(Xi) ≥ 0 for all values of i
iii. Σ P(Xi) = 1
then it is called a random variable.

If Xi is a discrete random variable then P(X) is known as the probability mass
function of X. If Xi is a continuous random variable then P(X) is called the
probability density function and is denoted by f(X).
For example, let us consider the tossing of three coins. The table 5.5
displays the probabilities of getting heads when three coins are tossed.
Table 5.5: Probabilities of getting heads when three coins are tossed

No. of Heads (Xi)   P(Xi)
3                   ⅛
2                   ⅜
1                   ⅜
0                   ⅛
Total               1

For every Xi, we are able to assign a P(Xi) such that:
Σ P(Xi) = 1

The probabilities of the number of heads form a probability distribution. A
systematic representation of a random variable with its values and
probabilities is called the probability distribution of that random variable.
The distribution will have its mean and standard deviation.
Mathematical expectation and variance of a random variable
Mathematical expectation of a random variable is denoted by E(X) and is
defined as:
E(X) = Σ Xi P(Xi)

Its variance is given by:
Var(X) = E[X - E(X)]² = E(X²) - [E(X)]² = Σ Xi² P(Xi) - [Σ Xi P(Xi)]²
where E(X²) = Σ Xi² P(Xi)
Its standard deviation is:
S.D.(X) = √Var(X) = √(E(X²) - [E(X)]²)

Solved Problem 18: A random variable takes the values -3, -2, 1, 0, 4, 6
with probabilities 1/12, 2/12, 3/12, 4/12, 1/12, 1/12 respectively. Find its
mean or expected value and variance?
Solution: The table 5.6 represents the values required to calculate
expectation and variance for the data in solved problem 18.
Table 5.6: Required values for calculating mean and variance for the data in
solved problem 18

Xi      P(Xi)   Xi P(Xi)   Xi² P(Xi)
-3      1/12    -3/12      9/12
-2      2/12    -4/12      8/12
1       3/12    3/12       3/12
0       4/12    0          0
4       1/12    4/12       16/12
6       1/12    6/12       36/12
Total           6/12       72/12 = 6

∴ E(X) = Σ Xi P(Xi) = 6/12 = 1/2
Var(X) = E(X²) - [E(X)]² = 6 - 1/4 = 23/4
S.D.(X) = √(23/4)
Hence, the mean, variance and standard deviation are 0.5, 5.75 and 2.4.
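
The calculation in table 5.6 can be repeated in a few lines of Python. This is an illustrative sketch; the variable names are assumptions.

```python
from fractions import Fraction
from math import sqrt

values = [-3, -2, 1, 0, 4, 6]
probs = [Fraction(k, 12) for k in (1, 2, 3, 4, 1, 1)]

mean = sum(x * p for x, p in zip(values, probs))               # E(X)
second_moment = sum(x * x * p for x, p in zip(values, probs))  # E(X²)
variance = second_moment - mean ** 2                           # E(X²) - [E(X)]²
std_dev = sqrt(variance)

print(mean, second_moment, variance, round(std_dev, 2))
```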
Solved Problem 19: Mr. Arun and Mr. Bandari play a game. If Mr. Arun
picks up an even number from 1 to 6, Mr. Bandari will pay him double the
amount equal to picked up number. If Mr. Arun picks up an odd number then
he has to pay amount equal to double the picked up number. What is Mr.
Arun’s expectation?
Solution: Let Xi be the amount Mr. Arun receives (negative when he has to
pay) and P(Xi) be its probability. The values are indicated in table 5.7.
Table 5.7: Required values for calculating the expectation for the data in
solved problem 19

No.     Xi      P(Xi)   Xi P(Xi)
1       -2      1/6     -2/6
2       4       1/6     4/6
3       -6      1/6     -6/6
4       8       1/6     8/6
5       -10     1/6     -10/6
6       12      1/6     12/6
Total           1       6/6 = 1

∴ The expectation of Mr. Arun is E(X) = 1.

Solved Problem 20: The table 5.8 displays the distribution of random
variable X. Find the following probabilities:
i) P(Xi ≥ 3)
ii) P(Xi = 0)
iii) P(1 ≤ Xi ≤ 3)
iv) P(Xi ≥ 4)

Table 5.8: Distribution of a random variable X

Xi      -3   -2   0    1    2    3    4    5
P(Xi)   K    2K   2K   3K   3K   2K   K    K

Solution: Since Xi is a random variable, Σ P(Xi) = 1
∴ K + 2K + 2K + 3K + 3K + 2K + K + K = 1
∴ 15K = 1 ⇒ K = 1/15
i) P(Xi ≥ 3) = P(Xi = 3) + P(Xi = 4) + P(Xi = 5)
= 2K + K + K = 4K = 4/15

ii) P(Xi = 0) = 2K = 2/15

iii) P(1 ≤ Xi ≤ 3)
= P(Xi = 1) + P(Xi = 2) + P(Xi = 3)
= 3K + 3K + 2K = 8K = 8/15
iv) P(Xi ≥ 4) = P(Xi = 4) + P(Xi = 5)
= K + K = 2K = 2/15

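The value of K and the four probabilities in Solved Problem 20 can be checked as follows. This is a sketch under the assumption that the multiples of K are read from table 5.8; `weights` and `prob` are hypothetical names.

```python
from fractions import Fraction

# Multiples of K for each value of Xi, read from table 5.8.
weights = {-3: 1, -2: 2, 0: 2, 1: 3, 2: 3, 3: 2, 4: 1, 5: 1}

# The probabilities must add up to 1, which fixes K.
K = Fraction(1, sum(weights.values()))
pmf = {x: w * K for x, w in weights.items()}


def prob(condition):
    """Add P(Xi) over all values Xi that satisfy the condition."""
    return sum(p for x, p in pmf.items() if condition(x))


print(K)
print(prob(lambda x: x >= 3), prob(lambda x: x == 0))
print(prob(lambda x: 1 <= x <= 3), prob(lambda x: x >= 4))
```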
Self Assessment Questions


5. Fill in the blanks.
i. For a random variable, Σ P(Xi) = ___________.
ii. Expectation of a random variable is the same as the ________ of the
probability distribution of that variable.
iii. Var(X) = E(X²) - ___________.

5.7 Summary
Probability plays an important role in the decision-making process. The basic
definitions and approaches were explained with examples. The real life
situations where you can apply different rules of probability were also
explained with examples.
When multiple events are involved in an experiment, the concerned
probabilities are calculated using the addition and multiplication rules of
probability.
Bayes’ theorem relates the probability of the occurrence of an event to
the occurrence or non-occurrence of an associated event. This is an
important theorem helpful for managers in business decisions.
A random variable is not a variable. It is a function. It can be discrete or
continuous.

5.8 Terminal Questions


1. Define independent events.
2. The probability of Mr. Sunil solving a problem is ¾. The probability of
Mr. Anish solving is ¼. What is the probability that a given problem will
be solved?
3. The probability that a contractor will get an electrical job is 0.8, that
he will get a plumbing job is 0.6 and that he will get both is 0.48. What is
the probability that he gets at least one? Are the events of getting the
electrical job and the plumbing job independent?
4. A box contains 4 red and 5 blue similar rings. What is the probability of
selecting at random two rings:
i. having the same colour
ii. having different colours
5. If P(A ∩ B) = 1/2 and P(B) = 2/3, find P(A/B).
6. If P(A ∪ B) = 0.8, P(A) = 0.6 and P(B) = 0.7, find P(B/A).
7. A shopkeeper sells two types of articles, namely washing machines and
ovens. He has 10 similar washing machines and 20 similar ovens. 2 of
the washing machines and 5 of the ovens have a 50% discount on them.
A customer selects an article at random. What is the probability that it is
a washing machine or an article with a discounted price?
8. The probability of a bomb hitting a target is 2/5. If four bombs are
dropped on a bridge and a single hit is enough to destroy it, what is the
probability that it will be destroyed?
9. The probability that a company A will survive for 20 years is 0.6. The
probability that its sister concern will survive for 20 years is 0.8. What is
the probability that at least one of them will survive for 20 years?


10. A recently developed car has two important components A and B. The
probability of failure of A and B are 0.2 and 0.1. What is the probability
that the car will fail?
11. The probability that a football player will play on ordinary ground is 0.6
and on green turf is 0.4. The probability that he will get a knee injury
when playing on ordinary ground is 0.07 and on green turf is 0.04.
Given that he got a knee injury, what is the probability that it was due
to play on ordinary ground?
12. Find E(X) and Var(X) for the distribution of the random variable X
represented in table 5.9.
Table 5.9: Distribution of a random variable

Xi      -3   -1   1    2    4    6    8
P(Xi)   K    K    2K   3K   2K   2K   K

5.9 Answers to SAQs and TQs

Answers to self assessment questions


1.
i) Relative frequency
ii) Subjective
iii) Classical
iv) Subjective
2.
i) ½ ii) 1/7 iii) 8/663 iv) 1/7
3. 0.8
4.
i) False ii) False iii) True
5.
i) 1 ii) Mean iii) [E(X)]2

Answers to terminal questions


1. Refer section 5.1.3.
2. 13/16
3. 0.92, yes
4. i) 4/9, ii) 5/9


5. 3/4
6. 5/6
7. 1/2
8. 544/625
9. 0.92
10. 0.28
11. 21/29
12. E(X) = 7/3, Var(X) = 115/18

5.10 References
• Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited


Statistics for Management Unit 6

Unit 6 Theoretical Distributions


Structure:
6.1 Introduction
Learning objectives
Random variables
6.2 Probability Distributions
Discrete probability distributions
Continuous probability distributions
6.3 Bernoulli Distribution
Repetition of a Bernoulli experiment
6.4 Binomial Distribution
Assumptions for applying a binomial distribution
Examples of binomial variate
Recurrence formula in case of binomial distribution
Case study on binomial distribution
6.5 Poisson Distribution
Assumptions for applying the Poisson distribution
Real life examples of Poisson variate
Recurrence relation
Case study on Poisson distribution
6.6 Normal Distribution
Standard Normal Distribution
6.7 Summary
6.8 Terminal Questions
6.9 Answers to SAQs and TQs
Answers to Self Assessment Questions
Answers to Terminal Questions
Answers to case studies
6.10 References

6.1 Introduction
In the unit-5, ‘Probabilities’, we have studied about basic probability theory
concepts. We have also studied the application of probability rules in solving
problems related to real life situations. We have ended the previous unit with
concept of random variables. In this unit-6, ‘Theoretical Distributions’, we will
discuss about the probability distributions of the random variables; both

Sikkim Manipal University Page No. 149


Statistics for Management Unit 6

discrete and continuous. Before studying this unit, you have to refresh the
concept of random variables which was covered in the previous unit.
Individuals and companies generate data that often resemble certain
theoretical distributions. Many characteristics of the theoretical
distributions have been derived mathematically, and we can use these
derived characteristics for a quick analysis of the observed distributions.
The examples of observed distributions are:
i. Number of male children in a family
ii. Number of defectives produced per production run
iii. Number of employees drawing salary in some brackets
The theoretical distributions are formed under certain assumptions. They
are classified into two types:
i. Discrete probability distributions
ii. Continuous probability distributions
Figure 6.1 shows the two groups of theoretical distributions.

Fig. 6.1: Theoretical distributions

6.1.1 Learning objectives


By the end of this unit, you should be able to:
• Differentiate between a Bernoulli process and a binomial process
• Compute probabilities using the binomial distribution
• Compute probabilities using the Poisson distribution
• Compute probabilities using the Normal distribution


6.1.2 Random variables


A random variable is a variable that takes a numerical value for every
outcome of a random experiment. We recap the definition of a random variable
given in Unit 5.
If we can assign a real-valued function to every value of the variable in the
sample space, such that:
i. $P(X_i) = P[X = X_i]$ for all values of i
ii. $P(X_i) \ge 0$ for all values of i
iii. $\sum_i P(X_i) = 1$
then it is called a random variable.
Hence, a random variable is not exactly a variable but a function.
Discrete random variable
A random variable is discrete when the number of possible outcomes in a
random experiment is countable. For example, when a fair coin is tossed
once, the number of possible outcomes is two (either head or tail). When the
fair coin is tossed twice, the number of possible outcomes is four (HH, HT,
TT, TH).
In both cases, the number of outcomes is finite. A random variable that
takes a finite (or countably infinite) number of values is called a discrete
random variable.
Continuous random variable
A random variable is continuous when the number of outcomes in a random
experiment is uncountable. For example, the departure and arrival times of
trains at a particular station are continuous random variables. The height
and the intelligence quotient of people are also examples of continuous
random variables.

6.2 Probability Distributions


Since random variables are either discrete or continuous, the probability
distributions associated with them are also either discrete or continuous.
The listing of all the possible outcomes of a random experiment along with
their respective probabilities is called the probability distribution.


6.2.1 Discrete probability distributions


A discrete probability distribution consists of all possible values of a
discrete random variable along with their corresponding probabilities. The
Bernoulli, binomial and Poisson distributions are all examples of discrete
probability distributions. In this unit, you will study all three in detail.
6.2.2 Continuous probability distributions
In a continuous probability distribution, the variable under consideration
can assume any value within a given range. Hence, it is not possible to list
all the values. One example of a continuous probability distribution is the
Normal distribution. In this unit, you will study the Normal distribution in
detail.

6.3 Bernoulli Distribution


A variable that assumes the values ‘1’ and ‘0’ with probabilities ‘p’ and ‘q’
(where q = 1 − p) is called a Bernoulli variable. It has only one parameter,
‘p’. For different values of ‘p’ (0 ≤ p ≤ 1), we get different Bernoulli
distributions. In these distributions, ‘1’ represents the occurrence of
success and ‘0’ represents the occurrence of failure.
In other words, the distribution assumes that the outcome of the experiment
is of a dichotomous nature, that is, success/failure, present/absent,
defective/non-defective, yes/no and so on.

Example 1
When a fair coin is tossed as shown in figure 6.2, the outcome is
either head or tail. The variable ‘X’ assumes ‘1’ or ‘0’.

Fig. 6.2: Flipping a coin


6.3.1 Repetition of a Bernoulli experiment


An experiment which results in two mutually exclusive and exhaustive
outcomes is called a Bernoulli experiment or a Bernoulli trial. Let a Bernoulli
experiment be repeated ‘n’ times under identical conditions.
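Let $X_i$, for i = 1 to n, assume the value ‘1’ (success) or ‘0’ (failure) on
the i-th repetition. Then each $X_i$ is a Bernoulli variate with probability
of success ‘p’.
Let $X = X_1 + X_2 + \dots + X_n$ denote the number of successes in the ‘n’
repetitions. Then ‘X’ follows a binomial distribution with parameters ‘n’
and ‘p’ (see section 6.4).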

Key Statistic
The mean and variance of a Bernoulli distribution are ‘p’ and ‘pq’
respectively.

Self Assessment Questions


1. State whether the following statements are true ‘T’ or false ‘F’.
i) The sum of probabilities sometimes will be greater than 1.
ii) The amount of time you study for an exam is a discrete random
variable.
iii) The Bernoulli distribution has only one parameter ‘p’.

6.4 Binomial Distribution


When a Bernoulli experiment is repeated ‘n’ times, it is called a binomial
process. The binomial distribution is a discrete probability distribution.
Its probability mass function is given by:
\[ P(X = x) = {}^{n}C_{x}\, q^{n-x}\, p^{x} \]
where x = 0, 1, 2, …, n.

Key Statistic
The binomial probability distribution is given by:
\[ (q + p)^n = q^n + {}^{n}C_{1}\, q^{n-1} p + {}^{n}C_{2}\, q^{n-2} p^{2} + \dots + p^n \]
where the successive terms of the expansion give the probabilities of 0,
1, 2, …, n successes.


The mean and variance of the distribution are ‘np’ and ‘npq’ respectively,
where, ‘n’ and ‘p’ are its parameters. This distribution is a unimodal
distribution. For fixed ‘n’ or ‘p’, as ‘p’ or ‘n’ increases, the distribution shifts
from left to right.

Key Statistic
The mean and variance of a binomial distribution are ‘np’ and ‘npq’
respectively, where, ‘n’ and ‘p’ are its parameters.
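
The formulas above can be checked numerically. The following Python sketch is
an illustration added here, not part of the original module. The function name
binomial_pmf and the values n = 6 and p = 0.5 are assumptions chosen only for
the example. It evaluates the probability mass function for every x and
compares the resulting mean and variance with ‘np’ and ‘npq’.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Return P(X = x) = nCx * q^(n-x) * p^x for a binomial variate."""
    q = 1 - p
    return comb(n, x) * q ** (n - x) * p ** x


# Illustrative values (assumed): six tosses of a fair coin.
n, p = 6, 0.5
probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(probabilities)
mean = sum(x * px for x, px in enumerate(probabilities))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probabilities))

print(f"sum of probabilities = {total:.4f}")
print(f"mean = {mean:.4f}, np = {n * p:.4f}")
print(f"variance = {variance:.4f}, npq = {n * p * (1 - p):.4f}")
```

Changing n and p in the sketch shows how the distribution shifts to the right
as either parameter increases.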

6.4.1 Assumptions for applying a binomial distribution


The following are assumptions under which a binomial distribution can be
applied.
i) The experiment should be of dichotomous nature.
In the binomial experiment, there must be only two possible outcomes
on each trial, such as ‘success’ or ‘failure’, ‘yes’ or ‘no’, ‘defective’ or
‘not defective’, ‘male’ or ‘female’, ‘pass’ or ‘fail’, ‘favourable’ or
‘unfavourable’ and so on. In this experiment, a success is coded as ‘1’
and a failure is coded as ‘0’.
ii) The probability of success should remain the same from experiment to
experiment.
Irrespective of the number of times the experiment is conducted, the
probability of success should be same for all the trials of the
experiment. For example, the probability of getting a head is always
0.5 irrespective of the number of times a fair coin is tossed.
iii) Experiments should be conducted under identical conditions.
There should not be any change in conditions while conducting
binomial experiments. Any change in conditions only leads to incorrect
conclusions for the given experiment.
iv) Experiments should be statistically independent.
We can apply a binomial distribution only when the events in an
experiment are statistically independent, which means occurrence of
one event does not affect the occurrence of other event.
In a manufacturing plant, the product part coming out of the production line
cannot be ‘defective’ and ‘not defective’ simultaneously. The product part
can be either ‘defective’ or ‘not defective’ but not both at the same time.


6.4.2 Examples of binomial variate


Some of the examples of binomial variate are:
i) Number of defectives in a random sample of 6 articles drawn from a
manufactured lot
ii) Number of seeds germinating among 10 seeds sown
iii) Number of heads turned in tossing 8 coins
6.4.3 Recurrence formula in binomial distribution
Key Statistic
Recurrence relation between successive terms of the binomial expansion
is given by:
\[ T_x = \frac{n - x + 1}{x} \cdot \frac{p}{q} \cdot T_{x-1} \]
where $T_{x-1} = N \cdot P(X = x - 1)$ and N is the total frequency.

This recurrence formula helps us to construct the theoretical distribution
for a given observed distribution.
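
As an illustration of how the recurrence can be used, the Python sketch below
builds the expected frequencies term by term, starting from $T_0 = N q^n$. The
function name binomial_expected_frequencies and the example values (n = 6,
p = 0.5, N = 64) are assumptions made for this sketch, and it requires p < 1
so that q is not zero.

```python
def binomial_expected_frequencies(n, p, total_frequency):
    """Build N * P(X = x) for x = 0..n using the recurrence relation."""
    q = 1 - p
    term = total_frequency * q ** n  # T_0 = N * q^n
    terms = [term]
    for x in range(1, n + 1):
        # T_x = ((n - x + 1) / x) * (p / q) * T_(x-1)
        term = term * (n - x + 1) / x * p / q
        terms.append(term)
    return terms


# Illustrative values (assumed): 64 sets of six tosses of a fair coin.
frequencies = binomial_expected_frequencies(n=6, p=0.5, total_frequency=64)
for x, frequency in enumerate(frequencies):
    print(x, round(frequency, 2))
```

Each term is obtained from the previous one with a single multiplication, so
no factorials need to be evaluated.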
There are three types of problems in calculating the distribution. They are
listed in table 6.1.
Table 6.1: Types of problems in calculating distribution

Type i      Finding the probability of events
Type ii     Finding the expected values
Type iii    Finding the distribution if parameters are given

Type i: Finding the probability of events

Solved Problem 1: An unbiased coin is tossed six times. What is the
probability that the tosses will result in:
i) Exactly two heads
ii) At least five heads
iii) At most two heads
iv) Not greater than one head
v) Not less than five heads
vi) At least one head


Solution: Let ‘A’ be the event of getting a head. Given that:
\[ p = \tfrac{1}{2}, \quad q = \tfrac{1}{2}, \quad n = 6 \]
∴ Binomial distribution is $(q + p)^n = \left(\tfrac{1}{2} + \tfrac{1}{2}\right)^6$
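i) The probability that the tosses will result in exactly two heads is given
by:
\[ P(X = 2) = {}^{6}C_{2} \left(\tfrac{1}{2}\right)^{6-2} \left(\tfrac{1}{2}\right)^{2} = \frac{6 \times 5}{1 \times 2} \times \frac{1}{2^4} \times \frac{1}{2^2} = \frac{15}{64} \]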
Therefore, the probability that the tosses will result in exactly two
heads is 15/64.
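ii) The probability that the tosses will result in at least five heads is given
by:
\[ P(X \ge 5) = P(X = 5) + P(X = 6) = {}^{6}C_{5} \left(\tfrac{1}{2}\right)^{6-5} \left(\tfrac{1}{2}\right)^{5} + {}^{6}C_{6} \left(\tfrac{1}{2}\right)^{6-6} \left(\tfrac{1}{2}\right)^{6} \]
\[ P(X \ge 5) = 6 \left(\tfrac{1}{2}\right)^{6} + \left(\tfrac{1}{2}\right)^{6} = \frac{7}{64} \]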
Therefore, the probability that the tosses will result in at least five
heads is 7/64.
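iii) The probability that the tosses will result in at most two heads is given
by:
\[ P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2) \]
\[ P(X \le 2) = \left(\tfrac{1}{2}\right)^{6} + {}^{6}C_{1} \left(\tfrac{1}{2}\right)^{6-1} \left(\tfrac{1}{2}\right)^{1} + {}^{6}C_{2} \left(\tfrac{1}{2}\right)^{6-2} \left(\tfrac{1}{2}\right)^{2} \]
\[ = \frac{1}{64} + \frac{6}{64} + \frac{15}{64} = \frac{22}{64} = \frac{11}{32} \]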
Therefore, the probability that the tosses will result in at most two
heads is 11/32.
iv) The probability that the tosses will result in not greater than one head
is given by:
\[ P(X \le 1) = P(X = 0) + P(X = 1) = \frac{1}{64} + \frac{6}{64} = \frac{7}{64} \]


Therefore, the probability that the tosses will result in not greater
than one head is 7/64.
v) The probability that the tosses will result in not less than five heads is
given by:
\[ P(X \ge 5) = P(X = 5) + P(X = 6) = \frac{6}{2^6} + \frac{1}{2^6} = \frac{7}{64} \]
Therefore, the probability that the tosses will result in not less than five
heads is 7/64.
vi) The probability that the tosses will result in at least one head is given
by:
\[ P(X \ge 1) = 1 - P(X = 0) = 1 - \frac{1}{2^6} = 1 - \frac{1}{64} = \frac{63}{64} \]
Therefore, the probability that the tosses will result in at least one head
is 63/64.
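
The six answers of solved problem 1 can be reproduced with exact fractions.
The Python sketch below is an added illustration. The names N_TOSSES, P_HEAD
and prob_heads are assumptions introduced for the example, while the values
n = 6 and p = 1/2 come from the problem.

```python
from fractions import Fraction
from math import comb

N_TOSSES = 6
P_HEAD = Fraction(1, 2)


def prob_heads(x):
    """P(X = x) for the number of heads in six tosses of an unbiased coin."""
    return comb(N_TOSSES, x) * (1 - P_HEAD) ** (N_TOSSES - x) * P_HEAD ** x


exactly_two = prob_heads(2)
at_least_five = prob_heads(5) + prob_heads(6)
at_most_two = sum(prob_heads(x) for x in range(0, 3))
not_greater_than_one = prob_heads(0) + prob_heads(1)
at_least_one = 1 - prob_heads(0)

print(exactly_two, at_least_five, at_most_two, not_greater_than_one, at_least_one)
# prints: 15/64 7/64 11/32 7/64 63/64
```

Note that ‘at least five heads’ and ‘not less than five heads’ describe the
same event, which is why parts ii and v give the same answer.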
The graph shown in figure 6.3 illustrates the binomial distribution obtained
for different values of ‘x’.

Fig. 6.3: Binomial probability distribution

Solved Problem 2: The probability of an employee getting an occupational
disease is 20%. In a firm having five employees, what is the probability that:
i) None of the employees get the disease
ii) Exactly two will get the disease
iii) More than four will contract the disease


Solution: Let ‘A’ be the event of an employee contracting the disease. Given
that:
\[ P(A) = 0.2 = p \]
\[ \therefore q = 1 - 0.2 = 0.8 \]
n = 5
∴ Binomial distribution is $(q + p)^n = (0.8 + 0.2)^5$

i) The probability that none of the employees get the disease is given by:
\[ P(X = 0) = 0.8^5 = 0.3277 \]
Therefore, the probability that none of the employees get the disease
is 0.3277.
ii) The probability that exactly two employees will get the disease is given
by:
\[ P(X = 2) = {}^{5}C_{2} \times 0.8^3 \times 0.2^2 = 10 \times 0.512 \times 0.04 = 0.2048 \]
Therefore, the probability that exactly two employees will get the
disease is 0.2048.
iii) The probability that more than four employees will get the disease is
given by:
\[ P(X > 4) = P(X = 5) = 0.2^5 = 0.00032 \]
Therefore, the probability that more than four employees will get the
disease is 0.00032.
Solved Problem 3: The probability that a bomb dropped on a bridge hits it
is 0.5. Eight bombs are dropped on the bridge. The bridge will be
destroyed if at least two bombs hit it. Find the probability that:
i) All bombs hit the bridge
ii) The bridge is destroyed
Solution: Let the probability that a bomb hits the bridge be p. Given
that:
\[ p = 0.5 \text{ and } n = 8 \]
\[ \therefore q = 1 - 0.5 = 0.5 \]
∴ Binomial distribution is $(q + p)^n = (0.5 + 0.5)^8$


i) The probability that all the bombs hit the bridge is given by:
\[ P(X = 8) = 0.5^8 = \left(\tfrac{1}{2}\right)^{8} = \frac{1}{256} \]
Therefore, the probability that all the bombs hit the bridge is 1/256.
ii) The bridge is destroyed if two or more bombs hit it. The required
probability is given by:
\[ P(X \ge 2) = 1 - \left[ P(X = 0) + P(X = 1) \right] \]
\[ = 1 - \left[ \left(\tfrac{1}{2}\right)^{8} + {}^{8}C_{1} \left(\tfrac{1}{2}\right)^{8} \right] = 1 - \left( \frac{1}{256} + \frac{8}{256} \right) = \frac{247}{256} \]

Therefore, the probability that the bridge is destroyed is 247/256.
Solved Problem 4: An engineering graduate student randomly guesses at
eight multiple-choice questions. There are four possible answers for every
question. However, there is only one correct answer. Assuming that all
questions are independent of each other, find the probability that the
student guesses five correct answers.
Solution: From the data given in solved problem 4, we can say that the
experiment is a binomial experiment for the following reasons.
• There is a fixed number of trials (8 questions).
• The probability of success for each question (the probability of guessing
the correct answer) is 0.25.
• It is given that the trials (questions) are independent of each other.
• There are only two possible outcomes for each question (guessing the
correct answer or guessing an incorrect answer).
Let X denote the number of correct guesses.
Then X is a binomial random variable with
\[ n = 8, \quad p = 0.25, \quad q = 0.75, \quad x = 5 \]
On substituting the given values in the binomial distribution formula, we get:
\[ P(X = x) = {}^{n}C_{x}\, q^{n-x}\, p^{x} \]
\[ P(X = 5) = {}^{8}C_{5}\, q^{8-5}\, p^{5} \]
\[ P(X = 5) = \frac{8!}{5! \times 3!} \times 0.75^3 \times 0.25^5 = 0.0231 \]


So, the probability that the graduate student guesses five correct answers is
0.0231.

Type ii: Finding the expected values

Solved Problem 5: A random sample of 5 sachets of coconut oil was
examined and two were found to be leaking. A wholesaler receives six
hundred and twenty five packets, each containing 5 sachets. Find the
expected number of packets that contain exactly one leaking sachet.
Solution: Given that:
n = 5
Probability of leaking p is given by:
\[ p = \frac{2}{5} \text{ and } N = 625 \]
\[ \therefore q = 1 - \frac{2}{5} = \frac{3}{5} \]
∴ Binomial distribution is $(q + p)^n = \left(\tfrac{3}{5} + \tfrac{2}{5}\right)^5$
\[ P(X = 1) = {}^{5}C_{1} \left(\tfrac{3}{5}\right)^{5-1} \left(\tfrac{2}{5}\right)^{1} = 5 \times \frac{81}{625} \times \frac{2}{5} = \frac{162}{625} \]
∴ The expected number of packets to contain exactly one leaking sachet is
given by:
\[ N \times P(X = 1) = 625 \times \frac{162}{625} = 162 \]
Hence, the expected number of packets to contain exactly one leaking
sachet is 162.
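
The same calculation extends to every possible number of leaking sachets. The
Python sketch below is an added illustration, with variable names chosen for
the example. It multiplies each binomial probability by N = 625 to obtain the
expected number of packets in each class.

```python
from fractions import Fraction
from math import comb

n = 5                # sachets per packet
p = Fraction(2, 5)   # probability that a sachet leaks
packets = 625        # N, the number of packets received

# Expected number of packets with x leaking sachets: N * P(X = x).
expected_packets = [
    packets * comb(n, x) * (1 - p) ** (n - x) * p ** x
    for x in range(n + 1)
]
for x, count in enumerate(expected_packets):
    print(f"{x} leaking sachets: {float(count):.1f} packets")
```

The expected counts add up to 625, and the entry for x = 1 is the value of 162
found above.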

Type iii: Finding the distributions

Solved Problem 6: For a binomial distribution with n = 5 and p = 0.2,
find:
i) P(x=3)
ii) P(x<4)

Solution: Given that:
\[ n = 5, \quad p = 0.2, \quad \text{and } q = 1 - p = 0.8 \]
i) \[ P(x = 3) = {}^{5}C_{3} \times 0.8^{5-3} \times 0.2^{3} = 10 \times 0.8^2 \times 0.2^3 = 0.0512 \]
ii) \[ P(x < 4) = P(x = 0) + P(x = 1) + P(x = 2) + P(x = 3) \]
\[ = {}^{5}C_{0} \times 0.8^{5-0} \times 0.2^{0} + {}^{5}C_{1} \times 0.8^{5-1} \times 0.2^{1} + {}^{5}C_{2} \times 0.8^{5-2} \times 0.2^{2} + {}^{5}C_{3} \times 0.8^{5-3} \times 0.2^{3} \]
\[ = 0.8^5 + 5 \times 0.8^4 \times 0.2 + 10 \times 0.8^3 \times 0.2^2 + 10 \times 0.8^2 \times 0.2^3 \]
\[ = 0.32768 + 0.4096 + 0.2048 + 0.0512 \]
\[ = 0.99328 \]
Therefore, the values for P(x=3) and P(x<4) are 0.0512 and 0.99328
respectively.
Solved Problem 7: Bring out the fallacy, if any, in the following statement
on the binomial distribution.
‘The mean of a binomial distribution is 4 and its variance is 5’.
Solution: Given that:
np = 4 (Mean) … (1)
npq = 5 (Variance) … (2)
Dividing equation (2) by equation (1):
\[ \frac{npq}{np} = \frac{5}{4} \quad \dots (3) \]
\[ \therefore q = \frac{5}{4} \]

Since ‘q’ is a probability, it cannot exceed 1. Here q > 1, so the statement
‘The mean of a binomial distribution is 4 and its variance is 5’ is wrong.
Solved Problem 8: Find the probability that X = 3 for a binomial distribution
whose mean is 3 and variance is 2.
Solution: Given that:
np = 3 (Mean) … (1)
npq = 2 (Variance) … (2)

Dividing equation (2) by equation (1), you get the value of ‘q’ as:
\[ q = \frac{npq}{np} = \frac{2}{3} \]
\[ \therefore p = 1 - q = \frac{1}{3} \]
Substituting the value of p in equation (1), you get the value of ‘n’ as:
\[ n = 9 \]
∴ Binomial distribution is $(q + p)^n = \left(\tfrac{2}{3} + \tfrac{1}{3}\right)^9$

Therefore, the probability that X = 3 is given by:
\[ P(X = 3) = {}^{9}C_{3} \left(\tfrac{2}{3}\right)^{6} \left(\tfrac{1}{3}\right)^{3} = \frac{1792}{6561} \]
Hence, the probability that X = 3 for a binomial distribution is 1792/6561.
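
The steps used in solved problems 7 and 8 can be sketched in Python as follows.
This is an added illustration. The function name binomial_parameters and its
error messages are assumptions made for the example. The sketch recovers ‘n’
and ‘p’ from a given mean and variance and rejects combinations that cannot
belong to a binomial distribution.

```python
from fractions import Fraction
from math import comb


def binomial_parameters(mean, variance):
    """Recover (n, p) from the mean np and the variance npq."""
    q = Fraction(variance) / Fraction(mean)
    if not 0 < q < 1:
        raise ValueError(f"q = {q} is not a valid probability of failure")
    p = 1 - q
    n = Fraction(mean) / p
    if n.denominator != 1:
        raise ValueError(f"n = {n} is not a whole number of trials")
    return int(n), p


# Solved problem 8: mean 3 and variance 2.
n, p = binomial_parameters(mean=3, variance=2)
prob_three = comb(n, 3) * (1 - p) ** (n - 3) * p ** 3
print(n, p, prob_three)

# Solved problem 7: mean 4 and variance 5 cannot be binomial.
try:
    binomial_parameters(mean=4, variance=5)
except ValueError as error:
    print("Fallacy:", error)
```

For a binomial distribution the variance npq is always smaller than the mean
np, so any statement with variance greater than the mean fails the first check.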

Self Assessment Questions


2. State whether the following statements are true ‘T’ or false ‘F’.
i) Mean of binomial distribution is ‘npq’.
ii) ‘n’ and ‘p’ are the parameters of binomial distribution.
iii) If the mean and variance of a binomial distribution are
6 and 5, then p = 1/6.
iv) Each trial in a binomial experiment has a different probability of
success, p.
6.4.4 Caselet on Binomial distribution
Case Study 1
Vinay is the operations manager of the books section of a large
department store. He has calculated that 0.4 is the probability that a
customer who is just browsing will buy something. Suppose that six
customers browse in the books section each hour. Vinay wants to
calculate the following probabilities.
What is the probability that:
i) Exactly four browsing customers will buy something during a
specified hour
ii) At least two browsing customers will buy something during a
specified hour
iii) None of the browsing customers will buy anything during a
specified hour


6.5 Poisson Distribution


A Poisson process arises when a binomial experiment is conducted a very
large number of times, that is, when the number of trials ‘n’ is large. The
Poisson distribution is also a discrete probability distribution. If the
probability of success ‘p’ is small and the number of trials ‘n’ is large,
the binomial probabilities are hard to calculate. In such cases, the
binomial distribution is approximated by the Poisson distribution.

Key Statistic
The probability distribution of a Poisson random variable X is given by:
\[ P(X = x) = \frac{e^{-m}\, m^{x}}{x!} \]
where,
x varies from ‘0’ to infinity
e = 2.71828, the base of the natural logarithm
m = mean number of successes in the given time interval

The mean and variance of the distribution are both ‘m’. Its standard
deviation is $\sqrt{m}$, and ‘m’ is called the parameter of the distribution.

Key Statistic
The mean of the Poisson distribution is also given by:
\[ m = np \]
where, ‘p’ is the probability of success and ‘n’ is the number of trials.

It is a unimodal distribution. It is also known as the distribution of ‘rare
events’. It is the limiting form of the binomial distribution as ‘n’ tends to
infinity and ‘p’ tends to zero, with ‘np’ held constant at ‘m’.
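
How close the approximation is can be seen numerically. The Python sketch
below is an added illustration. The function names and the values n = 2000 and
p = 0.002 are assumptions chosen for the example. It compares the binomial
probabilities with the Poisson probabilities for m = np.

```python
from math import comb, exp, factorial


def poisson_pmf(x, m):
    """P(X = x) = e^(-m) * m^x / x! for a Poisson variate with mean m."""
    return exp(-m) * m ** x / factorial(x)


def binomial_pmf(x, n, p):
    """P(X = x) = nCx * q^(n-x) * p^x for a binomial variate."""
    return comb(n, x) * (1 - p) ** (n - x) * p ** x


# Illustrative values (assumed): large n and small p.
n, p = 2000, 0.002
m = n * p

for x in range(6):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, m), 5))

largest_gap = max(
    abs(binomial_pmf(x, n, p) - poisson_pmf(x, m)) for x in range(11)
)
print("largest difference for x = 0..10:", largest_gap)
```

For these values the two sets of probabilities differ by well under 0.001,
which is why the simpler Poisson formula is preferred when ‘n’ is large and
‘p’ is small.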
6.5.1 Assumptions for applying the Poisson distribution
Poisson distribution can be applied under the following assumptions.
i) The outcome of trial / experiment must be of dichotomous nature
ii) The probability of success must remain the same for trials
iii) The trials should be conducted under identical conditions
iv) The trials should be statistically independent


v) The probability of success should be very small and ‘n’ should be large
such that ‘np’ is a constant m [Generally, p < 0.1 and n > 10]
6.5.2 Real life examples of Poisson variate
Some of the real life examples of Poisson variate are:
i) Number of accidents in any traffic circle
ii) Number of incoming telephone calls at an exchange per minute
iii) Number of radio-active particles emitted by substances
iv) Number of defects in a product
v) Number of micro-organisms developed during a period
6.5.3 Recurrence relation
Key Statistic
Recurrence relation between successive terms of a Poisson
expansion is given by:
\[ T_x = \frac{m}{x}\, T_{x-1} \]

Type i: Finding the probability of events

Solved Problem 9: Suppose two houses in a thousand catch fire in a year
and there are 2000 houses in a village. What is the probability that:
i) None of the houses catches fire
ii) At least one house catches fire
iii) Not more than two houses catch fire
Solution: The probability of a house catching fire is:
\[ p = \frac{2}{1000} = 0.002 \text{ and } n = 2000 \]
\[ \therefore m = np = 2000 \times 0.002 = 4 \]
Therefore, the required probabilities are calculated as follows:
i. The probability that none catches fire is given by:
\[ P(X = 0) = \frac{e^{-m}\, m^{0}}{0!} = e^{-4} = 0.01832 \]
Therefore, the probability that none of the houses catches fire is
0.01832.


ii. The probability that at least one catches fire is given by:
\[ P(X \ge 1) = 1 - P(X = 0) = 1 - 0.01832 = 0.98168 \]
Therefore, the probability that at least one house catches fire is
0.98168.
iii. The probability that not more than 2 houses catch fire is given by:
\[ P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2) = \frac{e^{-m} m^{0}}{0!} + \frac{e^{-m} m^{1}}{1!} + \frac{e^{-m} m^{2}}{2!} \]
\[ = e^{-m} \left( 1 + m + \frac{m^2}{2} \right) = e^{-4} \left( 1 + 4 + \frac{16}{2} \right) = e^{-4} \times 13 = 0.01832 \times 13 = 0.2382 \]
Therefore, the probability that not more than 2 houses catches fire is
0.2382.
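
The three answers can be cross-checked with a short Python sketch. This is an
added illustration with variable names chosen for the example. It evaluates the
Poisson formula without rounding $e^{-4}$ first, so the last answer comes out
as about 0.2381, whereas the worked solution obtains 0.2382 by multiplying the
rounded value 0.01832 by 13.

```python
from math import exp, factorial


def poisson_pmf(x, m):
    """P(X = x) = e^(-m) * m^x / x! for a Poisson variate with mean m."""
    return exp(-m) * m ** x / factorial(x)


n, p = 2000, 0.002  # houses in the village, chance that a house catches fire
m = n * p           # mean number of houses catching fire

none = poisson_pmf(0, m)
at_least_one = 1 - none
at_most_two = sum(poisson_pmf(x, m) for x in range(3))

print(round(none, 5), round(at_least_one, 5), round(at_most_two, 4))
```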
Solved Problem 10: One percent of bulbs manufactured by a firm are
expected to be defective. A carton contains 200 bulbs. Find the probability
that the carton contains 3 or more defective bulbs?
Solution: Given that:
The probability that a bulb is defective, p = 0.01,
n = 200
\[ \therefore m = np = 200 \times 0.01 = 2 \]
The probability that the carton contains 3 or more defective bulbs is given
by:
\[ P(X \ge 3) = 1 - \left[ P(X = 0) + P(X = 1) + P(X = 2) \right] \]
\[ = 1 - \left[ \frac{e^{-2}\, 2^{0}}{0!} + \frac{e^{-2}\, 2^{1}}{1!} + \frac{e^{-2}\, 2^{2}}{2!} \right] = 1 - e^{-2} (1 + 2 + 2) \]
\[ = 1 - 0.13534 \times 5 = 1 - 0.6767 = 0.3233 \]
Therefore, the probability that the carton contains 3 or more defective bulbs
is 0.3233.
Solved Problem 11: On an average, there are three mistakes on a page of
a book. The book contains 200 pages. What is the probability that a
randomly selected page has exactly one mistake?
Solution: Given that m = 3, the required probability is calculated as:
\[ P(X = 1) = \frac{e^{-3} \times 3^{1}}{1!} = 0.04979 \times 3 = 0.14937 \]
Hence, the probability that a randomly selected page has exactly one
mistake is 0.14937
Solved Problem 12: A sales representative of RSR Insurance Company
sells 3 insurance policies on an average in a week. Using the Poisson law,
calculate the probability that in a given week, the salesman will sell:
i. some life insurance policies
ii. two or more policies but less than 4 policies
Solution: In this problem, it is given that the mean ‘m’ is 3.
i. ‘Some policies’ means that the salesman sells one or more insurance
policies. Hence, we need P(X > 0), which is equal to 1 minus P(X = 0):
\[ P(X > 0) = 1 - P(X = 0) \]
Calculating P(X = 0) using the Poisson distribution formula:
\[ P(X = x) = \frac{e^{-m}\, m^{x}}{x!} \]
\[ P(X = 0) = \frac{e^{-3} \times 3^{0}}{0!} = 4.9787 \times 10^{-2} \]
\[ P(X > 0) = 1 - P(X = 0) = 1 - 0.0498 = 0.9502 \]

The probability that the salesman of RSR Insurance Company will sell
some life insurance policies is 95.02%.
ii. To find the probability of the salesman selling two or more but fewer
than four policies, we have to find P(2 ≤ X < 4).
\[ P(2 \le X < 4) = P(X = 2) + P(X = 3) \]
\[ P(2 \le X < 4) = \frac{e^{-3} \times 3^{2}}{2!} + \frac{e^{-3} \times 3^{3}}{3!} = e^{-3} \times 9 \]
\[ P(2 \le X < 4) = 0.4481 \]
The probability that the salesman of RSR Insurance Company will sell
two or more policies but less than four policies is 44.81%.


Type ii: Finding the expectations

Solved Problem 13: From the data given in solved problem 11, how many
pages would you expect to be free from mistakes?
Solution: Given that:
m = 3 and n = 200
\[ P(X = 0) = e^{-3} = 0.04979 \]
∴ Expected number of pages to be free from mistakes is given by:
\[ n \times P(X = 0) = 200 \times 0.04979 = 9.958 \approx 10 \text{ pages} \]
Expected number of pages to be free from mistakes is approximately 10
pages.

Type iii: Finding the distributions

Solved Problem 14: If X is a Poisson variate such that P(X = 1) = P(X = 2),
find P(X = 0).
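Solution: Let ‘m’ be the parameter of the distribution, and P(X = 1) =
P(X = 2)
\[ \therefore \frac{e^{-m}\, m^{1}}{1!} = \frac{e^{-m}\, m^{2}}{2!} \]
\[ \Rightarrow \frac{m}{1} = \frac{m^2}{2} \]
\[ \Rightarrow 2m = m^2 \Rightarrow m = 2 \quad (\text{since } m > 0) \]
\[ \therefore P(X = 0) = e^{-2} = 0.13534 \]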

Self Assessment Questions


3. State whether the following statements are true ‘T’ or false ‘F’.
i. ‘X’ is a Poisson variate if P < 0.1 and n > 10
ii. Example of bimodal distribution is Poisson distribution


6.5.4 Case Study on Poisson distribution


Case Study 2
Read the information and find the required probability.
On average, four pigeons hit the India Gate and are killed each week.
Ramesh, an official of the Archaeological Survey of India, requested the
Central Government to provide funds to buy equipment to scare
pigeons away from the monument. The concerned official from the
Central Government replied that unless the probability of more than two
birds being killed in any week exceeds 0.7, funds cannot be allocated.
Calculate and find out whether the Central Government allocates the funds.

6.6 Normal Distribution


So far in this unit, you have studied only the discrete probability
distributions. Now, you will study about the continuous probability
distributions. The Normal distribution is an important continuous probability
distribution.
Continuous random variables that can take any value in a given interval,
such as heights, weights, temperatures and amounts of rainfall, are
commonly treated as Normal random variables.
The following are some of the characteristics of the Normal distribution.
1. Normal distribution is a continuous probability distribution
2. Its probability density function is given by:
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2}} \]
where, ‘x’ varies from −∞ to +∞
3. Its mean is μ and standard deviation is σ, where μ and σ are the
parameters of the distribution
4. It is a bell-shaped curve and is symmetric about its mean
5. The mean divides the curve into two equal portions
6. Its Quartile Deviation, Q.D. = 2/3 σ
7. Its Mean Deviation, M.D. ≈ 4/5 σ
8. The X-axis is an asymptote to the curve
[An asymptote is a straight line that touches the curve at infinity]


9. The points of inflexion occur at μ ± σ
10. It is a unimodal distribution
11. Mean, median and mode coincide
12. The area under the Normal curve within certain limits is shown in table 6.2.
The graphical representation of table 6.2 is shown in figure 6.4.
Table 6.2: Area under the Normal curve for various values of ‘μ’ and ‘σ’

Limits          Area %
μ ± σ           68.2
μ ± 1.96σ       95
μ ± 2σ          95.4
μ ± 3σ          99.7

Fig. 6.4: Areas under the Normal distribution curve
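
The areas in table 6.2 can be recomputed with the NormalDist class from
Python's standard statistics module. The sketch below is an added
illustration, and the helper name area_within is an assumption made for the
example. Because the limits are stated in units of σ, the standard Normal
curve (μ = 0, σ = 1) is sufficient.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the Normal curve between mu - k*sigma and mu + k*sigma."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 1.96, 2, 3):
    print(f"mu ± {k} sigma: {100 * area_within(k):.2f}%")
```

The computed percentages agree closely with the rounded values listed in the
table.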

Key Statistic
The Normal distribution is the limiting form of binomial distribution.

6.6.1 Standard Normal distribution


A Normal distribution with mean ‘0’ and standard deviation ‘1’ is called the
Standard Normal distribution. Its probability density function is given by:
\[ f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^{2}} \]
Key Statistic
Any Normal distribution can be converted into a Standard Normal
distribution by the transformation:
\[ Z = \frac{x - \mu}{\sigma} \]
where ‘Z’ is called the Standard Normal variate, which gives the number of
standard deviations from x to the mean of the distribution,
x is the value of the random variable X,
μ is the mean of the distribution of the random variable X,
σ is the standard deviation of this distribution,
and ‘Z’ varies from −∞ to +∞.


The mean of its distribution is ‘0’ and standard deviation is ‘1’. The
statisticians have developed a standard normal table. The table gives the
probability that ‘z’ will lie between ‘0’ and ‘Z’. Therefore, to solve any
problem with a Normal distribution, we convert it to Standard Normal
distribution to calculate ‘z’ and then refer to the table, which gives the area
under the Normal curve between mean and any value of the normally
distributed random variable.
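
The table lookup described above can be imitated in Python with the error
function from the math module, using the identity
$P(0 < Z < z) = \tfrac{1}{2}\,\mathrm{erf}(z / \sqrt{2})$. The sketch below is
an added illustration. The function names and the example values (x = 515,
mean 500, standard deviation 10) are assumptions chosen for the example.

```python
from math import erf, sqrt


def standard_normal_variate(x, mean, std_dev):
    """Z = (x - mean) / std_dev."""
    return (x - mean) / std_dev


def area_from_mean(z):
    """P(0 < Z < z), the quantity listed in the standard normal table (z >= 0)."""
    return 0.5 * erf(z / sqrt(2))


# Illustrative values (assumed).
z = standard_normal_variate(515, mean=500, std_dev=10)
print(z, round(area_from_mean(z), 4))
# prints: 1.5 0.4332
```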

Key Statistic
The mean of Standard Normal distribution is ‘0’ and the standard
deviation is ‘1’.

Solved Problem 15: The weight of Cocavito packs packed by the filling
machine follows a normal distribution with a mean weight of 500 gms and a
standard deviation of 10 gms. A pack is selected at random. What is the
probability that:
i. The pack’s weight will exceed 515 gms?
ii. The pack’s weight will lie within 480 to 520 gms?
iii. The pack’s weight will be less than 480 gms or greater than 520 gms?
If 10,000 packs are supplied, how many packs will be rejected, given that
480 gms and 520 gms are the lower and upper limits for acceptance?
Solution: To solve this problem we will draw the normal curve as
shown in figure 6.5.
i. The probability that the pack’s weight will exceed 515 gms is given by:
\[ P(X > 515) = 0.5 - P(500 < X < 515) \]
\[ = 0.5 - P\left( \frac{500 - 500}{10} < Z < \frac{515 - 500}{10} \right) = 0.5 - P(0 < Z < 1.5) = 0.5 - 0.4332 = 0.0668 \]
Therefore, the probability that the pack’s weight will exceed 515 gms is
0.0668.

Fig. 6.5: Normal curve for solved problem 15 i

Note: The mean divides the curve into two equal portions, so P(X > μ) = 0.5.

ii. The probability that the pack’s weight lies within 480 to 520 gms is
given by:
\[ P(480 < X < 520) = P(480 < X < 500) + P(500 < X < 520) \]
\[ = P\left( \frac{480 - 500}{10} < Z < 0 \right) + P\left( 0 < Z < \frac{520 - 500}{10} \right) \]
\[ = P(-2 < Z < 0) + P(0 < Z < 2) = 0.4772 + 0.4772 = 0.9544 \]

Fig. 6.6: Normal curve for solved problem 15 ii

[0.9544 = 0.4772 × 2, since the distribution is symmetrical about the
mean as shown in figure 6.6.]
iii. The probability of acceptance is as found in (ii):
\[ P(480 < X < 520) = 0.9544 \]
If the weight lies outside these values, the pack will be rejected.
∴ The probability of rejection = 1 − 0.9544 = 0.0456
The number of packets that will be rejected is given by NP, where
N = 10,000 is the number of packs supplied.
∴ NP = 10000 × 0.0456 = 456
The number of packets that will be rejected is 456.
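
The answers to solved problem 15 can be cross-checked with the NormalDist
class from Python's standard statistics module. The sketch below is an added
illustration with variable names chosen for the example. It does not round
through a four-place table, so it gives a rejection probability of about
0.0455 and roughly 455 rejected packs, slightly below the table-based figure
of 456.

```python
from statistics import NormalDist

weights = NormalDist(mu=500, sigma=10)  # pack weights in gms

exceeds_515 = 1 - weights.cdf(515)
accepted = weights.cdf(520) - weights.cdf(480)
rejected = 1 - accepted

packs_supplied = 10_000
print(round(exceeds_515, 4), round(accepted, 4), round(rejected, 4))
print(round(packs_supplied * rejected))
```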

Type iii: Finding the distributions

Solved Problem 16: The sales volume of 1000 retail outlets of a soap
company follows a Normal distribution. 20% of the retail outlets sell less
than 50 units per day and 15% of them sell 200 units and above. Find:
i. The mean and standard deviation of the sales volume
ii. The expected number of retail outlets that sell between 50 and
148 units
Solution: Let ‘μ’ and ‘σ’ be the mean and standard deviation. The
given information can be represented in a graph as shown in figure
6.7.

Fig. 6.7: Normal Curve for solved problem 16


i. Given that 20% of the outlets sell less than 50 units, P(X < 50) = 0.20,
so:
\[ P(50 < X < \mu) = 0.30 \]
\[ P\left( \frac{50 - \mu}{\sigma} < Z < 0 \right) = 0.30 \Rightarrow \frac{50 - \mu}{\sigma} = -0.84 \]
\[ \text{or } 50 - \mu = -0.84\sigma \quad \dots (1) \]
And, since 15% of the outlets sell 200 units and above:
\[ P(\mu < X < 200) = 0.35 \]
\[ P\left( 0 < Z < \frac{200 - \mu}{\sigma} \right) = 0.35 \Rightarrow \frac{200 - \mu}{\sigma} = 1.04 \]
\[ 200 - \mu = 1.04\sigma \quad \dots (2) \]
Subtracting equation (1) from equation (2):
\[ 150 = 1.88\sigma \Rightarrow \sigma = 79.8 \]
From equation (1):
\[ \mu = 50 + 0.84\sigma = 50 + 0.84 \times 79.8 \approx 117 \]
ii. The probability that a retail outlet sells between 50 and 148 units is
given by:
\[ P(50 < X < 148) = P(50 < X < 117) + P(117 < X < 148) \]
\[ = P\left( \frac{50 - 117}{79.8} < Z < 0 \right) + P\left( 0 < Z < \frac{148 - 117}{79.8} \right) = 0.2995 + 0.1480 = 0.4475 \]
Expected number of retail outlets = 1000 × 0.4475 = 447.5 ≈ 448
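
The two simultaneous equations can also be solved without a printed table.
The Python sketch below is an added illustration that uses the inv_cdf method
of NormalDist from the standard statistics module. The variable names are
assumptions made for the example. Because it uses unrounded z values, its
results (mean about 117.2, standard deviation about 79.9 and about 450
outlets) differ slightly from the table-based answers above.

```python
from statistics import NormalDist

standard_normal = NormalDist()

z_low = standard_normal.inv_cdf(0.20)   # 20% sell fewer than 50 units
z_high = standard_normal.inv_cdf(0.85)  # 15% sell 200 units or more

# Solve 50 = mu + z_low * sigma and 200 = mu + z_high * sigma.
sigma = (200 - 50) / (z_high - z_low)
mu = 50 - z_low * sigma

sales = NormalDist(mu, sigma)
share = sales.cdf(148) - sales.cdf(50)
outlets = 1000 * share

print(round(mu, 1), round(sigma, 1), round(outlets))
```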

Self Assessment Questions


4. State whether the following statements are true ‘T’ or false ‘F’.
i. Quartile deviation of Normal distribution is 4/5 σ


ii. Mean and standard deviation of a Standard Normal distribution are
‘1’ and ‘0’
iii. Mean, median and mode coincide in a Normal distribution

6.7 Summary
Quick analysis of observed data can be done if it is identified with the
theoretical distribution. The probabilities associated with random variate of
the distribution help us to know the chances of occurrence of several events
within specified values. We can also extend the solution to the cost aspects.
Binomial distribution is applied when you run a finite series of independent
Bernoulli trials and the probability of success remains the same for every trial. In
this distribution, ‘1’ represents the occurrence of success and ‘0’ represents
the occurrence of failure.
Poisson distribution is a unimodal distribution with mean ‘m’ and standard
deviation √m. This distribution is the limiting form of the binomial distribution
as ‘n’ tends to infinity and ‘p’ tends to zero such that ‘np = m’ remains constant.
Normal distribution is a continuous probability distribution with probability
density function f(x) given by:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$
where, ‘x’ varies from −∞ to +∞.

Any Normal distribution can be converted into the Standard Normal
distribution with the transformation
$Z = \frac{X - \mu}{\sigma}$
where, ‘Z’ is called Standard Normal variate.

6.8 Terminal Questions


1. What are the assumptions under which binomial distribution is applied?
2. A shopkeeper notes that the probability that a customer will buy his
articles is 0.4. Six customers enter his shop in an hour. What is the
probability that:


i. At least one customer bought something?


ii. Exactly two bought something?
iii. None bought anything?
3. Find P(X = 2), given mean and variance of the binomial
distribution are 4 and 3 respectively.
4. Give real life examples of Poisson variate.
5. If the first two terms of a Poisson distribution are 150 and 90,
find P(X = 0).
6. The average number of phone calls at a booth per hour is 2. What is
the probability that there will be exactly one call in an hour?
7. The probability that a firm’s product will succeed against a competitor’s is 2/3. If
it has introduced 4 products in a month, what is the probability that:
i. Exactly two products succeed against the competitor’s?
ii. All products succeed against the competitor’s?
8. Mean life of electric bulbs produced by a company is 1500 hours with
a standard deviation of 300 hours. Assuming that the life of bulbs
follows a Normal distribution, what is the probability that a randomly
selected bulb will:
i. Fail within 1200 hours?
ii. Survive between 1350 and 1650 hours?
iii. Survive beyond 1950 hours?
9. Write short notes on Normal distribution
10. The height of students follows Normal distribution. 15% of them have
height less than 150 cm and 10 % have height above 180 cm. Find the
mean and standard deviation of the distribution?

6.9 Answers to SAQs and TQs

Answers to Self Assessment Questions


1. i – F, ii- F, iii- T
2. i-F, ii- T, iii- T, iv- F
3. i- T, ii- T
4. i- F, ii- F, iii- T
‘T’ denotes ‘True’
‘F’ denotes ‘False’


Answers to Terminal Questions


1. Refer section 6.4.1.
2. i) 14896 / 15625 ii) 4860 / 15625 iii) 729 / 15625
3. 16C2 (0.75)^14 (0.25)^2
4. Refer section 6.5.2.
5. e^(-0.6) = 0.5488
6. 0.2707
7. i) 8 / 27 ii) 16 / 81
8. i) 0.1587 ii) 0.3830 iii) 0.0668
9. Refer section 6.6.
10. Mean ≈ 163.4, S.D ≈ 12.9
Case Studies
Case Study 1:
The required probabilities are:
i. 0.1382
ii. 0.5444
iii. 0.0467
Case Study 2:
Yes, the Central Government will allocate funds as the probability of more
than two birds being killed in any week is 0.73 which is greater than 0.7.

6.10 References
• Richard I. Levin, David S. Rubin (2008), Statistics for Management,
Seventh Edition, PHI Learning Private Limited.


Unit 7 Sampling and Sampling Distributions


Structure:
7.1 Introduction
Learning Objectives
7.2 Population and Sample
Universe or Population
Types of Population
Sample
7.3 Advantages of Sampling
7.4 Sampling Theory
Law of Statistical Regularity
Principle of Inertia of Large Numbers
Principle of Persistence of Small Numbers
Principle of Validity
Principle of Optimisation
7.5 Terms Used in Sampling Theory
7.6 Errors in Statistics
Measures of Statistical Errors
7.7 Types of Sampling
Probability Sampling
Non-Probability Sampling
Caselet on Types of Sampling
7.8 Determination of Sample Size
7.9 Central Limit Theorem
7.10 Summary
7.11 Terminal Questions
7.12 Answers to SAQs and TQs
Answers to Self Assessment Questions
Answers to Caselets
Answers to Terminal Questions
7.13 References

7.1 Introduction
In Unit 6, ‘Theoretical Distributions’, you studied both
discrete and continuous random variables along with the probability
distributions of random variables. You studied the binomial,
Poisson and Normal distributions, which were explained with the help of
solved problems.
In this unit, ‘Sampling and Sampling Distributions’, we will discuss
statistical sampling and sampling designs. You will study the different
types of sampling and the laws of sampling theory. We will end this
unit with an important theorem called the central limit theorem.
In different fields of human activity, the decision making process is based on
the observations of few units which form a portion of the total population.
The process of studying only a portion of the population and making
decisions involves risk, the risk of making wrong decisions. This unit deals
with the various techniques of drawing samples from the population.
Evaluation of risk will be discussed in Unit 9, ‘Testing of Hypothesis in Case
of Large and Small Samples’.
When the sampling design is not done properly, the estimates or inferences
drawn from the sample can go wrong, and managerial decisions based
on the wrong conclusions may lead to loss of time, money and human
resources. This may badly affect the reputation of the organisation. Hence,
the risks involved in using an incorrect sampling design are of primary
concern to investigators.
7.1.1 Learning Objectives
By the end of this unit, you should be able to:
• Differentiate between a population and a sample
• Recall the laws of sampling theory
• Identify the various sampling errors
• Recognise the types of sampling available
• Determine the sample size
• Recall the Central Limit Theorem.

7.2 Population and Sample


7.2.1 Universe or Population
Statistical surveys or enquiries deal with studying various characteristics of
the units belonging to a group. The group consisting of all the units is called the
Universe or Population. Figure 7.1 illustrates the population.

Fig. 7.1: Illustration of population

Example 1
In the statistical survey aimed at determining average per capita income
of the people in the city, all earning individuals in the city form the
population.

7.2.2 Types of population


Figure 7.2 displays the types of population along with their explanations.

Finite Population          A population with a finite number of units
Infinite Population        A population with an infinite number of units
Existent Population        A population of concrete objects, like books in the library
Hypothetical Population    Throwing a coin an infinite number of times

Fig. 7.2: Types of population


Note: Although many populations appear to be exceedingly large, no truly
infinite population of physical objects actually exists. Given limited resources
and time, it is practically impossible to count, for example, the number of grains
of sand on a beach. Such populations are treated as infinite populations for our
study.
7.2.3 Sample
Sample is a finite subset of a population. A sample is drawn from a
population to estimate the characteristics of the population. Sampling is a
tool which enables us to draw conclusions about the characteristics of the
population. The figure 7.3 illustrates the population and sample.

Fig. 7.3: Illustration of population and sample

7.3 Advantages of Sampling


The advantages of sampling are:
• We get maximum information about the population in a short time.
• It results in a considerable saving of time and labour.
• The organisation and administration of a sample survey require
relatively less effort.
• The results obtained are reliable, and it is always possible to attach a degree of
reliability to them.
• There is a possibility of obtaining detailed information. In other words,
there is greater scope.
• In the case of an infinite population, it is the only available method.
• If the units are destroyed or affected adversely in the course of
investigation, then the only method is sampling.


7.4 Sampling Theory


Sampling theory is based on the following five important laws, which are
also shown in figure 7.4.
• Law of statistical regularity
• Principle of inertia of large numbers
• Principle of persistence of small numbers
• Principle of validity
• Principle of optimisation

Fig. 7.4: Laws of sampling

7.4.1 Law of statistical regularity


The law of statistical regularity states that a group of units chosen at random
from a large group tends to possess the characteristics of that large group.
If a particular characteristic of the population has a particular shape,
then the same characteristic will tend to follow the same shape in the sample.
7.4.2 Principle of inertia of large numbers
This principle states that “other things being equal, as the sample size
increases, the results tend to be more reliable and accurate”. Suppose that
the population mean is 25 units. If a sample of size 50 gives an average of
24.5 units, then a larger sample of size 100 is likely to give an average closer
to 25, say 24.8 units. In other words, the larger the sample size, the more
accurate the result tends to be.
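The principle of inertia of large numbers can be illustrated with a small simulation. This is only a sketch: the population mean of 25 comes from the text, while the Normal shape, the standard deviation of 5, the sample sizes and the function name are assumptions chosen for the illustration.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the run is repeatable

POPULATION_MEAN = 25   # value used in the text
POPULATION_SD = 5      # assumed for illustration only

def average_sampling_error(sample_size, repetitions=2000):
    """Average absolute gap between the sample mean and the population mean."""
    gaps = []
    for _ in range(repetitions):
        sample = [random.gauss(POPULATION_MEAN, POPULATION_SD)
                  for _ in range(sample_size)]
        gaps.append(abs(mean(sample) - POPULATION_MEAN))
    return mean(gaps)

error_small = average_sampling_error(50)
error_large = average_sampling_error(400)

print(f"n = 50 : average error {error_small:.3f}")
print(f"n = 400: average error {error_large:.3f}")
```

On average, the larger samples land closer to the population mean, although any single large sample can still be further away than a single small one.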
7.4.3 Principle of persistence of small numbers
If some of the units in a population possess markedly distinct
characteristics, then this will be reflected in the sample values also. For
example, if there are 300 blind persons in a population of 10,000 persons,
then a sample of one hundred will have more or less the same proportion of blind
persons in it, that is, about three.
7.4.4 Principle of validity
A sampling design is said to be valid if it enables us to obtain valid tests and
estimates of population parameters.
7.4.5 Principle of optimisation
This principle aims at obtaining a desired level of efficiency at minimum cost
or obtaining maximum possible efficiency with given level of cost.

7.5 Terms Used in Sampling Theory


Parameter
Any measure, like the mean or median, calculated from population values is
known as a parameter of the population and is denoted by a Greek letter (μ, σ
and so on).
Statistic
Any measure calculated from the sample is known as a statistic and is
denoted by an English letter (x̄, s and so on). A statistic is the sample
counterpart of a population parameter.
Sampling distribution
Sampling distribution consists of all the possible values of a statistic and
their respective probabilities for a given sample size.
Solved Problem 1: Consider the selection of two numbers from the given
five numbers (1, 2, 3, 4, 5). Find the possible combinations and their mean.
Solution: The possible combinations and their average are represented in
table 7.1a.


Table 7.1a: Possible combinations of given 5 numbers and their average

Combinations Numbers Selected Average


1 1,2 1.5
2 1,3 2
3 1,4 2.5
4 1,5 3
5 2,3 2.5
6 2,4 3
7 2,5 3.5
8 3,4 3.5
9 3,5 4
10 4,5 4.5

This gives the means of samples of size 2. We form a distribution of sample
means, which is represented in table 7.1b.
Table 7.1b: Frequency table for the data of solved problem 1
X (Mean)    f (Frequency)    fX      fX²
1.5         1                1.5     2.25
2           1                2.0     4.00
2.5         2                5.0     12.50
3           2                6.0     18.00
3.5         2                7.0     24.50
4           1                4.0     16.00
4.5         1                4.5     20.25
Total       10               30.0    97.50

∴ Mean of the distribution = ΣfX / N = 30 / 10 = 3
Mean of the population is (1 + 2 + 3 + 4 + 5) / 5 = 3
Table 7.1b represents the sampling distribution of the mean. We observe
that the mean of the sample means is equal to the population mean.


Key Statistic
The standard deviation of the sampling distribution of any statistic is called the
standard error of that statistic. It is denoted as ‘S’ and is given by:
$S^2 = \frac{\sum fX^2}{\sum f} - \left(\frac{\sum fX}{\sum f}\right)^2$
where, ‘f’ is the frequency and ‘X’ is the mean.

Therefore, the standard error or the standard deviation of sample means is
given by:
$S^2 = \frac{\sum fX^2}{\sum f} - \left(\frac{\sum fX}{\sum f}\right)^2 = \frac{97.50}{10} - (3)^2 = 0.75$
$S = \sqrt{0.7500} = 0.866$
Hence, the standard error of the mean ‘S’ is 0.866.
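The sampling distribution of solved problem 1 is small enough to enumerate directly. The following sketch, with variable names chosen only for this illustration, lists every sample of size 2, computes the mean and standard error of the sample means, and cross-checks the result against σ/√n multiplied by the finite population factor √((N − n)/(N − 1)).

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

population = [1, 2, 3, 4, 5]
sample_size = 2

# Every possible sample of size 2 drawn without replacement, and its mean
sample_means = [mean(sample) for sample in combinations(population, sample_size)]

mean_of_means = mean(sample_means)
standard_error = pstdev(sample_means)  # standard deviation of the sampling distribution

# Cross-check: sigma / sqrt(n) times the finite population factor
N, n = len(population), sample_size
sigma = pstdev(population)
formula_se = sigma / sqrt(n) * sqrt((N - n) / (N - 1))

print(len(sample_means), mean_of_means, round(standard_error, 3), round(formula_se, 3))
```

Both routes give a standard error of about 0.866, matching the value obtained from table 7.1b.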
Uses of standard error
Standard error helps us in:
i) Testing of hypothesis
ii) Constructing confidence interval for the statistics
iii) Giving reliability measure for the statistic by its reciprocal value

7.6 Errors in Statistics


The term ‘error’ denotes the difference between the population value and its
estimate provided by a sampling technique. Therefore, the term is not used
in its ordinary sense in statistics. There are four types of errors as shown in
figure 7.5.


Fig. 7.5: Errors in Statistics

Let us understand each of the error types and the factors causing
those errors.
Sampling errors
The sample results are bound to differ from the population results, since the sample
is only a small portion of the population. Sampling error is also known as inherent
error and cannot be avoided. It is not worthwhile to try to eliminate it completely.
These errors may be due to the following factors:
• Faulty selection of sample
• Substitution of units to be studied
• Faulty demarcation of sampling units
• Error due to bias in estimation
However, sampling errors follow random or chance variations and tend
to cancel each other out on averaging.
Non-sampling errors
Non-sampling errors are attributed to factors that can be controlled and
eliminated by suitable actions. It is worthwhile to eliminate these errors. They are
due to the following factors:
• Faulty planning, faulty definitions
• Defective methods of interviewing
• Personal bias of investigator
• Lack of trained and qualified investigators
• Respondents’ failure to answer
• Improper coverage
• Compiling errors
• Publication errors


Biased errors
Biased errors arise in both census and sampling methods. These errors occur due to
personal bias of the investigator and the instruments used for measuring.
They are also due to faulty collection of data, respondent’s bias and bias
due to non-response. Biased errors have a tendency to grow with sample
size. Therefore, they are also known as cumulative errors. The magnitude of
biased errors is directly proportional to the sample size.
Unbiased errors
The errors that are due to over-estimation and under-estimation such that
they are equal are known as unbiased errors. They are also known as
compensatory errors. They do not increase with sample size.
7.6.1 Measures of statistical errors
Key Statistic
Absolute error is the difference between the true value ‘t’ and the observed
value ‘a’. Symbolically, absolute error ‘AE’ is represented as:
$AE = t - a$
It is independent of the magnitude of the actual value.

Key Statistic
Relative error is the ratio of the absolute error to the actual value. It is
symbolically represented as:
$RE = \frac{AE}{a} = \frac{t - a}{a}$
It provides a degree of error for comparison purposes between different
sets of data.

Self Assessment Questions


1. State whether the following statements are true ‘T’ or false ‘F’.
i) Population is the aggregate of objects under study.
ii) The sampling method consumes time and resources.
iii) Any summarised figure from population is known as statistics.
iv) We adopt sampling technique in our activities.
v) Population is a subset of sample.

vi) An unbiased sample gives an accurate prediction of characteristics
of an entire population.
vii) The standard deviation of sampling distribution of a statistic is
known as standard error of that statistic.
viii) Standard error is used as a reliability measure.
ix) Faulty selection of sample contributes to sampling error.
x) Personal bias increases the non-sampling errors.
xi) Unbiased errors are cumulative in nature.
xii) Biased errors are also known as compensatory errors.

7.7 Types of Sampling


By choosing a sampling technique carefully, errors can be minimised. Let us
take a look at the different techniques available. The sampling techniques
may be broadly classified into:
i) Probability Sampling
ii) Non-Probability Sampling
7.7.1 Probability sampling
Probability sampling provides a scientific technique of drawing samples from
the population. The technique of drawing samples is according to the law in
which each unit has a predetermined probability of being included in the
sample. The different ways of assigning probability are:
i) Each unit has the same chance of being selected.
ii) Sampling units have varying probabilities.
iii) Units have probability proportional to their size.
We will discuss here some of the important probability sampling designs.
Simple random sampling
Under this technique, sample units are drawn in such a way that each and
every unit in the population has an equal and independent chance of being
included in the sample. If a sample unit is replaced before drawing the next
unit, then it is known as Simple Random Sampling With Replacement
[SRSWR]. If the sample unit is not replaced before drawing the next unit,
then it is called Simple Random Sampling Without Replacement [SRSWOR].
In the first case, the probability of drawing a unit is 1/N, where N is the population
size. In the second case, the probability of drawing any particular sample of n
units is 1/NCn, where NCn is the number of ways of choosing n units from N.


The selection of a simple random sample can be done by:
• Lottery method: In the lottery method, we identify each and every unit with
a distinct number by allotting an identical card. The cards are put in a
drum and thoroughly shuffled before each unit is drawn. Figure 7.6
represents a lotto machine through which samples can be selected
randomly.

Fig. 7.6: Lotto machine

• The use of a table of random numbers: There are several random
number tables. They are Tippett’s random number table, Fisher and
Yates’ tables, Kendall and Babington Smith’s random tables, Rand
Corporation random numbers and so on. Table 7.2 represents a
specimen of Tippett’s random numbers.
Table 7.2: Tippett’s random number table
2952 6641 3992 9792 7979 5911 3170 5624
4167 9524 1545 1396 7203 5356 1300 2693
2370 7483 3408 2762 3563 1089 6913 7691
0560 5246 1112 6107 6008 8126 4233 8776
2754 9143 1405 9025 7002 6111 8816 6446
Suppose, we want to select 10 units from a population size of 100. We
number the population units from 00 to 99. Then we start taking 2 digits.
Suppose, we start with 41 (second row) then the other numbers selected will
be 67, 95, 24, 15, 45, 13, 96, 72, 03.
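The reading of the random number table described above can be sketched in code. This is a hedged illustration only: the function name `pick_units` is hypothetical, and skipping repeated labels is an assumption suited to sampling without replacement. The digits are those of the second row of table 7.2.

```python
# Second row of the Tippett specimen reproduced in table 7.2
ROW_TWO = "4167 9524 1545 1396 7203 5356 1300 2693"

def pick_units(digit_row, how_many, population_size=100):
    """Read two-digit labels left to right, skipping repeats and out-of-range labels."""
    digits = digit_row.replace(" ", "")
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        label = int(digits[i:i + 2])
        if label < population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == how_many:
            break
    return chosen

sample_units = pick_units(ROW_TWO, how_many=10)
print(sample_units)
```

Starting from 41, the labels read off are the same ten units listed in the text (unit 03 is printed as 3).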
Stratified random sampling
This sampling design is most appropriate if the population is heterogeneous
with respect to characteristic under study or the population distribution is
highly skewed.


We subdivide the population into several groups or strata such that:
i) Units within each stratum are more homogeneous
ii) Units between strata are heterogeneous
iii) Strata do not overlap, in other words, every unit of the population belongs
to one and only one stratum
The criteria used for stratification are geographical, sociological, age, sex,
income and so on. The population of size ‘N’ is divided into ‘K’ relatively
homogeneous strata of sizes ‘N1’, ‘N2’, …, ‘Nk’ such that ‘N1 + N2
+ … + Nk = N’.
Then, we draw a simple random sample from each stratum, either
proportional to the size of the stratum or with equal units from each stratum.
Table 7.3 displays the merits and demerits of stratified random
sampling.
Table 7.3: Merits and demerits of stratified random sampling
Merits                                       Demerits
1. Sample is more representative             1. Many times the stratification is not effective
2. Provides more efficient estimate          2. Appropriate sample sizes are not drawn from each stratum
3. Administratively more convenient
4. Can be applied in situations where different degrees of accuracy are
   desired for different segments of the population


Example 2
The items produced by factories located at three cities ‘X’, ‘Y’ and ‘Z’ are
200, 300 and 500 respectively. We wish to draw a sample of 20 items
under proportional stratified sampling. We number the units from 000 to 999.
Then we refer to a random number table and select the numbers as represented in
table 7.4.
Table 7.4: Stratified random sampling

27717 43584 85192 88977 29490 69714 94015 62874


32444 48277 13025 14338 54066 15423 47724 66733
74108 82228 888570 74015 80217 36292 98525 24335
24432 24896 62880

The number of sample units to be selected from each factory is:

For Factory X: $20 \times \frac{200}{1000} = 4$

For Factory Y: $20 \times \frac{300}{1000} = 6$

For Factory Z: $20 \times \frac{500}{1000} = 10$
Total = 20
For first factory sample units selected are 174, 192, 069, 156.
For second factory sample units selected are 287, 432, 444, 482, 302,
254.
For third factory sample units selected are 854, 772, 733, 741, 822, 853,
570, 802, 629, 525.
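The proportional allocation used in Example 2 can be written as a short calculation. The sketch below is illustrative; the function name is an assumption, and simple rounding is used, which in general may need a small manual adjustment so that the allocations add up to the total sample size.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Share a sample among strata in proportion to their sizes."""
    population_size = sum(stratum_sizes.values())
    return {name: round(total_sample * size / population_size)
            for name, size in stratum_sizes.items()}

factories = {"X": 200, "Y": 300, "Z": 500}
allocation = proportional_allocation(factories, total_sample=20)
print(allocation)
```

For the three factories this reproduces the allocation of 4, 6 and 10 units.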
Systematic sampling
This design is recommended if we have a complete list of sampling units
arranged in some systematic order such as geographical, chronological or
alphabetical order.
Suppose the population size is ‘N’. The population units are serially
numbered ‘1’ to ‘N’ in some systematic order and we wish to draw a sample
of ‘n’ units. Then we divide the units from ‘1’ to ‘N’ into ‘n’ groups such that each
group has ‘K’ units.
This implies ‘nK = N’ or ‘K = N/n’. From the first group, we select a unit at
random. Suppose the unit selected is the 6th unit; thereafter we select every
Kth unit, that is, units 6 + K, 6 + 2K and so on. If ‘K’ is 20, ‘n’ is 5 and ‘N’ is
100, then the units selected are 6, 26, 46, 66, 86.
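The selection rule for systematic sampling can be sketched as follows. The function name and the optional `start` argument are assumptions made for this illustration; the sketch also assumes that N is an exact multiple of n.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Pick one unit from the first K units, then every K-th unit after it."""
    interval = population_size // sample_size      # K = N / n
    if start is None:
        start = random.randint(1, interval)        # random start in the first group
    return [start + step * interval for step in range(sample_size)]

units = systematic_sample(population_size=100, sample_size=5, start=6)
print(units)
```

With a start of 6 and K = 20, the units selected are 6, 26, 46, 66 and 86, as in the text. Leaving `start` unset draws the starting unit at random.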
The table 7.5 displays the merits and demerits of systematic sampling.
Table 7.5: Merits and demerits of systematic sampling
Merits                                          Demerits
1. Very easy to operate and easy to check.      1. In many cases we do not get an up-to-date list.
2. It saves time and labour.                    2. It gives biased results if a periodic feature exists in the data.
3. More efficient than simple random sampling
   if we have an up-to-date frame.

Cluster sampling
The total population is divided into recognisable sub-divisions, known as
clusters, such that within each cluster units are more heterogeneous and
between clusters they are homogeneous. Units are then selected from the chosen
clusters by suitable sampling techniques. Figure 7.7 represents
cluster sampling where each packet of candy forms a cluster.

Fig. 7.7: Cluster sampling

Multi-stage sampling
The total population is divided into several stages. The sampling process is
carried out through several stages. It is represented as in figure 7.8.


Fig. 7.8: Multistage sampling

Example 3
We want to select 1000 colleges from the southern states. In the first stage,
we may select any three states. In the second stage, we may select some
districts in those states. In the third stage, we may select the colleges in each
district. We may adopt any sampling technique at each stage.

The table 7.6 displays the merits and demerits of multi-stage sampling.
Table 7.6: Merits and demerits of multi stage sampling
Merits                                        Demerits
Greater flexibility in sampling method        Estimates are less accurate
Existing division can be used                 Investigator should have knowledge of the
                                              entire population that will be sampled

7.7.2 Non-probability sampling


Depending upon the object of enquiry and other considerations a
predetermined number of sample units is selected purposely so that they
represent the true characteristics of the population.
A serious drawback of this sampling design is that it is highly subjective in
nature. The selection of sample units depends entirely upon the personal
convenience, biases, prejudices and beliefs of the investigator. This method
will be more successful if the investigator is thoroughly skilled and
experienced.
Judgment Sampling
The choice of sample items depends exclusively on the judgment of the
investigator. The investigator’s experience and knowledge about the
population will help to select the sample units. It is the most suitable method
if the population size is small. Table 7.7 displays the merits and demerits
of judgment sampling.
Table 7.7: Merits and demerits of judgement sampling
Merits                                            Demerits
1. Most useful for small population               1. It is not a scientific method.
2. Most useful to study some unknown traits       2. It has a risk of investigator’s
   of a population some of whose                     bias being introduced.
   characteristics are known.
3. Helpful in solving day-to-day problems.

Convenience sampling
The sample units are selected according to convenience of the investigator.
It is also called “chunk” which refers to the fraction of the population being
investigated which is selected neither by probability nor by judgment.
Moreover, a list or framework should be available for the selection of the
sample. It is used to make pilot studies. However, there is a high chance of
bias being introduced.
Quota sampling
It is a type of judgment sampling. Under this design, quotas are set up
according to some specified characteristic such as age groups or income
groups. From each group a specified number of units are sampled
according to the quota allotted to the group. Within the group the selection
of sample units depends on personal judgment. It has a risk of personal
prejudice and bias entering the process. This method is often used in public
opinion studies.
7.7.3 Caselet on types of sampling

Caselet
Read the information and answer the questions.
You have been given 5 boxes of biscuits. There are orange, brown and
yellow colour biscuits. You are asked to sample the biscuits. The target
population here is all of the biscuits and the sampling unit is the biscuit.
Answer the following questions.
i) How would you apply simple random sampling?
ii) How would you apply stratified sampling?
iii) How would you apply cluster sampling?


7.8 Determination of Sample Size


Sample size depends upon the size of the population, the resources
available, the degree of accuracy desired, the homogeneity of the population,
the nature of the study, the method of sampling used and the nature of the
respondents. The following formulae are available to determine sample size.

Key Statistic
The formula used for calculating the sample size for a finite population is
given by:
$Z = \frac{P - P_s}{\sqrt{\frac{N - n}{N - 1}}\,\sqrt{\frac{PQ}{n}}}$ (For finite population)

where, ‘N’ is population size.

Key Statistic
The formula used for calculating the sample size for an infinite population is
given by:

$Z = \frac{P - P_s}{\sqrt{\frac{PQ}{n}}}$ (For infinite population)
where,
• Z = value according to the degree of accuracy desired
• P = Population value
• Ps = Sample value, which implies P − Ps is the error desired in the
result
• Q = 1 – P
• n = Sample size.


Key Statistic
The formula used for calculating the sample size for an infinite population,
when population mean and sample mean are given, is:

$Z = \frac{\mu - \mu_s}{\sigma / \sqrt{n}}$ (For infinite population)
where,
• μ = Population mean
• μs = Sample mean
• σ = Standard deviation of population
• n = Sample size

Key Statistic
The formula used for calculating the sample size for a finite population,
when population mean and sample mean are given, is:

$Z = \frac{\mu - \mu_s}{\frac{\sigma}{\sqrt{n}}\,\sqrt{\frac{N - n}{N - 1}}}$ (For finite population)
where,
• μ = Population mean
• μs = Sample mean
• σ = Standard deviation of population
• n = Sample size
• N = Size of population


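Key Statistic
The formula used for calculating the sample size, when the standard error of
the mean is given, is:

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
where,
• $\sigma_{\bar{x}}$ = Standard error of the mean (standard deviation of the sample means)
• σ = Population standard deviation
• n = Sample size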
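The formulae above give Z or the standard error in terms of n, so the sample size is obtained by solving for n. The sketch below does this for two infinite-population cases: n = Z²PQ / (P − Ps)² for a proportion, and n = (σ / σ_x̄)² when the standard error of the mean is given. The function names and the numerical inputs are assumptions chosen only for illustration, and the result is rounded up to the next whole unit.

```python
from math import ceil, sqrt

def sample_size_for_proportion(z, p, error):
    """Solve Z = error / sqrt(PQ / n) for n (infinite population)."""
    q = 1 - p
    return ceil(z ** 2 * p * q / error ** 2)

def sample_size_for_standard_error(sigma, standard_error):
    """Solve standard_error = sigma / sqrt(n) for n."""
    return ceil((sigma / standard_error) ** 2)

# Illustrative inputs: P = 0.6, permitted error 0.05, Z = 1.96
print(sample_size_for_proportion(z=1.96, p=0.6, error=0.05))

# Illustrative inputs: sigma = sqrt(50), standard error of the mean = 1.5
print(sample_size_for_standard_error(sigma=sqrt(50), standard_error=1.5))
```

Rounding up is a deliberate choice here: a fractional sample size is taken to the next whole unit so that the stated accuracy is still met.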

7.9 Central Limit Theorem


If X1, X2, …, Xn is a random sample of size ‘n’ from any population with mean ‘μ’
and finite variance ‘σ²’, then the sample mean (X̄) is approximately normally
distributed with mean ‘μ’ and variance ‘σ²/n’, provided ‘n’ is sufficiently large.
From the central limit theorem, we infer the following.
i) The mean of the sampling distribution of the mean will be equal to the
population mean
ii) The sampling distribution of the mean approaches the normal distribution
as the sample size increases
iii) It permits us to use sample statistics to make inferences about
population parameters irrespective of the shape of the frequency
distribution of the population.
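A small simulation can make the central limit theorem concrete. The sketch below is an illustration under stated assumptions: the population is taken to be uniform on (0, 1), which is far from bell-shaped, and the sample size of 30, the number of repetitions and the seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(11)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30
REPETITIONS = 5000

# Uniform(0, 1) population: mean 0.5 and standard deviation sqrt(1/12)
population_mean = 0.5
population_sd = sqrt(1 / 12)

sample_means = [mean([random.random() for _ in range(SAMPLE_SIZE)])
                for _ in range(REPETITIONS)]

mean_of_means = mean(sample_means)
observed_se = pstdev(sample_means)
theoretical_se = population_sd / sqrt(SAMPLE_SIZE)

# Under a Normal shape about 95% of sample means lie within 1.96 standard errors
within = sum(abs(m - population_mean) <= 1.96 * theoretical_se for m in sample_means)
share_within = within / REPETITIONS

print(round(mean_of_means, 3), round(observed_se, 4), round(theoretical_se, 4))
print(round(share_within, 3))
```

The mean of the sample means stays close to the population mean, their standard deviation is close to σ/√n, and roughly 95% of them fall within 1.96 standard errors, as a Normal shape would suggest.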

Self Assessment Questions


2. State whether the following statements are true ‘T’ or false ‘F’.
i) Sample in which units are selected by judgment is known as
probability sample.
ii) Judgment sampling does not give representativeness of a sample.
iii) Large sample size always results in minimising the standard error.
iv) A sampling plan that divides the population into well-defined
groups from which random samples are drawn is known as cluster
sampling.
v) The principles of simple random sampling are the theoretical basis
for statistical inference.

vi) If the mean of a certain population is 20, it is likely that most of the
sample means will be 20.
vii) Any sampling distribution can be totally described by its mean and
standard deviation.
viii) Sampling from an infinite population and from a finite population with
replacement results in:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
ix) The central limit theorem assures that the sampling distribution of
mean is always normal.
x) Stratified sampling is used when each group considered is more
homogeneous within itself and heterogeneous between groups.

7.10 Summary
There are two methods of studying the characteristics of a population: census
and sampling. The various advantages of sampling and the various errors
that could crop up in using these methods were explained.
There are two main methods of sampling, namely probability sampling and
non-probability sampling. The merits and demerits of each sampling method
were explained. We discussed the procedure for determining sample size.
We concluded the unit with the importance of the central limit theorem.

7.11 Terminal Questions


1. Discuss the errors that arise in statistical survey.
2. Describe simple random sampling.
3. Describe systematic sampling.
4. What is quota sampling and when do we use it?
5. What are the basic principles on which sampling theory is based?
6. Explain the sampling distribution of a statistic and its standard
error.
7. Discuss the uses of standard error.
8. The distribution of employees in three plants of a manufacturing unit is
as shown in table 7.8. Using random numbers discussed under the topic
‘Simple random sampling’, draw a random sample of size 15.


Table 7.8: Distribution of employees in three manufacturing plants


Plant A B C
Number of employees 100 200 200

9. Population proportion of tea drinkers is 0.6. Determine the sample size
such that the error between actual and observed proportion will be less
than or equal to 0.05 with 95% confidence (Z = 1.96).
10. The standard error of mean of bursting strength of card boards
produced by a company is 1.5 units. If the population standard deviation
is 50, find the sample size.

7.12 Answers to SAQs and TQs

Answers to Self Assessment Questions


1. i- T, ii- F, iii- T, iv- T, v- F, vi- T, vii- T, viii - T, ix- T, x- T, xi- F, xii- F
2. i- F, ii- T, iii- T, iv- F, v- T, vi- F, vii- F, viii - T, ix- T, x- T
‘T’ denotes True
‘F’ denotes False
Answers to caselets
i) You could apply simple random sampling by choosing biscuits at
random, either through drawing lots or using random numbers. That
way, each biscuit has an equal chance of being sampled.
ii) For stratified sampling, you divide the biscuits into strata and apply
simple random sampling to each one. Each stratum comprises a
group of biscuits with similar characteristics, so you could use the
different types of biscuits. One stratum could be orange biscuits,
another one could be brown biscuits, and the final one could be yellow
biscuits.
iii) For cluster sampling, you divide the biscuits into groups, but this time
each group needs to be similar to the others. Assuming each box of
biscuits is similar, you could take one of the boxes and sample all of
the biscuits in it.


Answers to Terminal Questions


1. Refer section 7.6
2. Refer section 7.7.1
3. Refer section 7.7.1
4. Refer section 7.7.2
5. Refer section 7.4
6. Refer section 7.5
7. Refer section 7.5
8. Refer section 7.7.1
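9. The sample size is approximately 369, from
$n = z^2 pq / E^2 = (1.96)^2 (0.6)(0.4) / (0.05)^2$.
10. The sample size is approximately 1,111, from
$n = (\sigma / \sigma_{\bar{x}})^2 = (50 / 1.5)^2$.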

7.13 References
• Richard I. Levin, David S. Rubin (2008). Statistics for Management,
Seventh Edition. PHI Learning Private Limited.



Statistics for Management Unit 8

Unit 8 Estimation

Structure:
8.1 Introduction
Learning objectives
8.2 Reasons for Making Estimates
8.3 Making Statistical Inference
8.4 Types of Estimates
Point estimate
Interval estimate
8.5 Criteria of a Good Estimator
Unbiasedness
Efficiency
Consistency
Sufficiency
8.6 Point Estimates
8.7 Interval Estimates
Case study on calculating estimates
Making the interval estimate
8.8 Interval Estimates and Confidence Intervals
Interval estimates of the mean of large samples
Interval estimates of the proportion of large samples
Interval estimates using the Student‟s „t‟ distribution
8.9 Determining the Sample Size in Estimation
8.10 Summary
8.11 Terminal Questions
8.12 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
8.13 References

8.1 Introduction
In Unit 7, ‘Sampling and Sampling Distributions’, you studied sampling
design, the different methods of sampling, and the sampling errors that
arise in sampling distributions. In this unit, ‘Estimation’, you will study
estimation and the different types of estimates. You will also study how to
calculate confidence intervals for the population mean when the standard
deviation is unknown. Finally, you will study how to calculate the sample
size when the confidence level is given.
Everyone makes estimates. When you are ready to cross a street, you
estimate the speed of any car that is approaching, the distance between you
and that car, and your own speed. Having made these quick estimates, you
decide whether to wait, walk, or run. With a knowledge of inferential
statistics, you can make estimates about a population using random
samples drawn from that population.
Learning objectives
By the end of this unit, you should be able to:
• Distinguish between a point estimate and an interval estimate
• Calculate the confidence interval
• Describe the types of estimates
• Describe interval estimates and confidence intervals
• Calculate the sample size if the confidence level is given

8.2 Reasons for Making Estimates


All managers must make quick estimates. The outcome of these estimates
can affect their organisations as seriously as the outcome of your decision
whether to cross the street. Credit managers estimate whether a purchaser
will eventually pay his bills.
Prospective home buyers make estimates concerning the behavior of
interest rates in the mortgage market. All these people make estimates
without worrying about whether they are scientific but with the hope that the
estimates bear a reasonable resemblance to the outcome.
Managers use estimates because in all but the most trivial decisions, they
must make rational decisions without complete information and with a great
deal of uncertainty about what the future will bring. As educated citizens and
professionals, you will be able to make more useful estimates by applying
the techniques described in this unit and in the subsequent units.


8.3 Making Statistical Inference


Statistical inference is based on estimation and hypothesis testing. In both
estimation and hypothesis testing, we make inferences about characteristics
of populations from information contained in samples.
In estimation, we try to estimate with reasonable accuracy the population
proportion and the population mean. To calculate the exact proportion or
the exact mean would be an impossible goal. Even so, we will be able to
make an estimate and implement some controls to avoid as much of the
error as possible.

8.4 Types of Estimates


The following are two types of estimates about a population.
i) Point estimate
ii) Interval estimate
8.4.1 Point estimate
A point estimate is a single number that is used to estimate an unknown
population parameter. A point estimate is often insufficient, because it is
either right or wrong. We do not know how wrong it is. Therefore, a point
estimate is much more useful if it is accompanied by an estimate of the error
that might be involved.
8.4.2 Interval estimate
An interval estimate is a range of values used to estimate a population
parameter. It indicates the error in the following two ways:
i) By the extent of its range
ii) By the probability of the true population parameter lying within that
range.

8.5 Criteria of a Good Estimator


8.5.1 Unbiasedness
Unbiasedness is a desirable property of a good estimator. The term
unbiasedness refers to the fact that a sample mean is an unbiased
estimator of a population mean because the mean of the sampling
distribution of sample means taken from the same population is equal to the
population mean itself.

We can say that a statistic is an unbiased estimator if, on average, it tends
to assume values that are above the population parameter being estimated
as frequently and to the same extent as it tends to assume values that are
below the population parameter being estimated.
8.5.2 Efficiency
Another desirable property of a good estimator is that it must be efficient.
Efficiency refers to the size of the standard error of the statistic. Let us
compare two statistics from a sample of the same size and try to decide
which one is the more efficient estimator. In this case, we would pick the
statistic that has the smaller standard error.

Example 1
Suppose we choose a sample of a given size and must decide whether
to use the sample mean or the sample median to estimate the
population mean.
If we calculate the standard error of the sample mean and find it to be
1.05, and then calculate the standard error of the sample median and
find it to be 1.6, we would say that the sample mean is a more efficient
estimator of the population mean, because its standard error is smaller.
It makes sense that an estimator with a smaller standard error (with less
variation) will have more chance of producing an estimate nearer to the
population parameter under consideration.
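
The comparison in Example 1 can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the normal population (mean 100, standard deviation 10), the sample size of 25, the number of trials, and the function name are all hypothetical and do not come from the example.

```python
import random
import statistics

def simulated_standard_errors(n=25, trials=2000, mu=100.0, sigma=10.0, seed=1):
    """Approximate the standard errors of the sample mean and sample median.

    All default values are illustrative assumptions, not data from the text.
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        # Draw one random sample of size n from a normal population.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.mean(sample))
        medians.append(statistics.median(sample))
    # The spread of a statistic across many samples approximates its standard error.
    return statistics.stdev(means), statistics.stdev(medians)

se_mean, se_median = simulated_standard_errors()
print(f"Standard error of the sample mean   ~ {se_mean:.3f}")
print(f"Standard error of the sample median ~ {se_median:.3f}")
```

For a normal population, the simulated standard error of the mean should come out smaller than that of the median, which is why the mean is the more efficient estimator in this setting.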

8.5.3 Consistency
A statistic is a consistent estimator of a population parameter if, as the
sample size increases, it becomes almost certain that the value of the
statistic comes very close to the value of the population parameter. If an
estimator is consistent, it becomes more reliable with large samples.
8.5.4 Sufficiency
An estimator is sufficient if it makes so much use of the information in the
sample that no other estimator could extract from the sample any additional
information about the population parameter being estimated.


8.6 Point Estimates


We can use the sample variance $s^2$ to estimate the population variance,
where the sample variance $s^2$ is given by the formula:

$s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$

Example 2
Table 8.1 displays the results of a sample of 35 boxes which contain
bolts.
Table 8.1: Results of samples of 35 boxes of bolts (bolts per box)
101 103 112 102 98 97 93
105 100 97 107 93 94 97
97 100 110 106 110 103 99
93 98 106 100 112 105 100
114 97 110 102 98 112 99

Consider table 8.1. We have taken a sample of 35 boxes of bolts
from a manufacturing line and have counted the bolts per box. We can
estimate the population mean, that is, the mean number of bolts per box,
by taking the mean for the 35 boxes we have sampled. This is calculated by
adding all the bolts and dividing by the number of boxes.

$\bar{X} = \dfrac{\sum X}{n} = \dfrac{3570}{35} = 102$
Thus, using the sample mean $\bar{X}$ as the estimator, we have a point
estimate of the population mean ‘µ’.
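
The point estimates in this section can be checked with a few lines of Python. This is a minimal sketch using the data of table 8.1; the variable names are illustrative assumptions, and the standard library function `statistics.variance` is used because it divides by n − 1, matching the formula for the sample variance given above.

```python
import statistics

# Bolts per box for the 35 sampled boxes (table 8.1).
bolts_per_box = [
    101, 103, 112, 102, 98, 97, 93,
    105, 100, 97, 107, 93, 94, 97,
    97, 100, 110, 106, 110, 103, 99,
    93, 98, 106, 100, 112, 105, 100,
    114, 97, 110, 102, 98, 112, 99,
]

n = len(bolts_per_box)
total = sum(bolts_per_box)

# Point estimate of the population mean: sum of X divided by n.
sample_mean = total / n

# Point estimate of the population variance: sum of squared deviations / (n - 1).
manual_variance = sum((x - sample_mean) ** 2 for x in bolts_per_box) / (n - 1)
library_variance = statistics.variance(bolts_per_box)

print(f"n = {n}, total = {total}, sample mean = {sample_mean}")
print(f"sample variance = {manual_variance:.3f}")
```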

8.7 Interval Estimates


The purpose of gathering samples is to learn more about a population. We
can compute this information from the sample data as either point
estimates, or as interval estimates.


Key Statistic
An interval estimate describes a range of values within which a
population parameter is likely to lie.

If we select and plot a large number of sample means from a population, the
distribution of these means will approximate a normal curve. Furthermore,
the mean of the sample means will be the same as the population mean.
8.7.1 Case study on calculating estimates

Case Study
The marketing research director needs an estimate of the average life in
months of car batteries his company manufactures. We select a random
sample of 200 batteries with a mean life of 36 months. If we use the
point estimate of the sample mean $\bar{x}$ as the best estimator of the
population mean ‘µ’, we would report that the mean life of the company’s
batteries is 36 months.
The director also asks for a statement about the uncertainty that is likely
to accompany this estimate, that is, a statement about the range within
which the unknown population mean is likely to lie. To provide such a
statement, we need to find the standard error of the mean. Our sample
size of 200 is large enough that we can apply the central limit theorem.
Suppose we have already estimated the standard deviation of the
population of the batteries and reported that it is 10 months.
Using this standard deviation, we can calculate the standard error of the
mean by using the formula $\sigma_{\bar{x}} = \sigma / \sqrt{n}$.
We find the standard error $\sigma_{\bar{x}} = 10 / \sqrt{200}$ to be 0.707 months.
(Cont. on topic ‘Making the interval estimate’)


8.7.2 Making the interval estimate

Case Study
(Cont. from topic ‘Interval Estimates’)

We can tell the director that our estimate of the life of the company’s
batteries is 36 months, and the standard error that accompanies this
estimate is 0.707 months. In other words, the actual mean life for all the
batteries may lie somewhere in the interval estimate of 35.293 to 36.707
months. This is helpful but insufficient information for the director.
Next, we need to calculate the chance that the actual life will lie in this
interval or in other intervals of different widths that we might choose,
such as $\pm 2\sigma_{\bar{x}}$ (2 × 0.707), $\pm 3\sigma_{\bar{x}}$ (3 × 0.707),
and so on.
The probability is 0.955 that the mean of a sample of size 200 will be
within ±2 standard errors of the population mean. It can be stated
differently as 95.5 percent of all the sample means are within ±2
standard errors of the population mean ‘µ’. Equivalently, the population
mean ‘µ’ will be located within ±2 standard errors of the sample mean
95.5 percent of the time.
Hence, we can now report to the director that the best estimate of the
life of the company’s batteries is 36 months, and we are 68.3 percent
confident that the life lies in the interval from 35.293 to 36.707
months ($36 \pm 1\sigma_{\bar{x}}$).

Similarly, we are 95.5 percent confident that the life falls within the
interval of 34.586 to 37.414 months ($36 \pm 2\sigma_{\bar{x}}$), and we are
99.7 percent confident that battery life falls within the interval of
33.879 to 38.121 months ($36 \pm 3\sigma_{\bar{x}}$).


8.8 Interval Estimates and Confidence Intervals


In using interval estimates, we are not confined to ±1, 2, and 3 standard
errors. For example, ±1.64 standard errors include about 90 percent of the
area under the curve, that is, 0.4495 of the area on either side of the
mean in a normal distribution. Similarly, ±2.58 standard errors include
about 99 percent of the area, or 49.51 percent on either side of the mean.
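
The multipliers quoted above can also be obtained without a printed table. The sketch below is one possible way, assuming Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the function name `z_for_confidence` is an illustrative assumption. Printed tables usually round the exact values (about 1.645 and 2.576) to 1.64 and 2.58.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Return z such that ±z standard errors cover `confidence` of the normal curve."""
    tail = (1 - confidence) / 2          # area left in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.3f}")
```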

Key Statistic
The probability that we associate with an interval estimate is called the
confidence level.

This probability indicates how confident we are that the interval estimate will
include the population parameter. A higher probability means more
confidence. In estimation, the most commonly used confidence levels are 90
percent, 95 percent, and 99 percent, but we are free to apply any
confidence level. The confidence interval is the range of the estimate we are
making.

Example 3
If we report that we are 90 percent confident that the mean of the
population of incomes of people in a certain community will lie between
Rs. 8,000 and Rs. 24,000, then the range Rs. 8,000 - Rs. 24,000 is our
confidence interval.

Often, however, we will express the confidence interval in standard errors
rather than in numerical values. Thus, we will often express confidence
intervals like this:

$\bar{x} + 1.64\sigma_{\bar{x}}$ = upper limit of the confidence interval

$\bar{x} - 1.64\sigma_{\bar{x}}$ = lower limit of the confidence interval

where $\sigma_{\bar{x}}$ is the standard error.

Thus, confidence limits are the upper and lower limits of the confidence
interval. In this case, $\bar{x} + 1.64\sigma_{\bar{x}}$ is called the upper
confidence limit (UCL) and $\bar{x} - 1.64\sigma_{\bar{x}}$ is the lower
confidence limit (LCL).

8.8.1 Interval estimates of the mean of large samples


If the population is finite and the sample is large relative to it, we use
the finite population multiplier to calculate the standard error. As
discussed in unit 7, the standard error of the mean of a finite population
can be calculated as:

$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} \times \sqrt{\dfrac{N - n}{N - 1}}$

The multiplier is applied when the sample size ‘n’ is greater than five
percent of the population size ‘N’, that is,

$\dfrac{n}{N} > 0.05$
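
The rule above can be sketched as a small function. The function name and the optional `population_size` parameter are illustrative assumptions; the figures used in the example call are those of Self Assessment Question 4 in this unit (N = 540, n = 60, standard deviation 1.368).

```python
import math

def standard_error_of_mean(sigma, n, population_size=None):
    """Standard error of the mean, with the finite population multiplier
    applied only when the sample exceeds 5 percent of the population."""
    standard_error = sigma / math.sqrt(n)
    if population_size is not None and n / population_size > 0.05:
        multiplier = math.sqrt((population_size - n) / (population_size - 1))
        standard_error *= multiplier
    return standard_error

se_finite = standard_error_of_mean(1.368, 60, population_size=540)
se_infinite = standard_error_of_mean(1.368, 60)
print(f"With the multiplier:    {se_finite:.3f}")
print(f"Without the multiplier: {se_infinite:.3f}")
```

Since 60/540 is about 0.11, which exceeds 0.05, the multiplier applies and reduces the standard error.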

8.8.2 Interval estimates of the proportion of large samples


Statisticians often use a sample to estimate a proportion of occurrences in a
population. For example, the government estimates, by a sampling
procedure, the unemployment rate, or the proportion of unemployed people,
in the country‟s workforce.
We know that for a binomial distribution, the mean and the standard
deviation are:
Mean $\mu = np$
Standard deviation $\sigma = \sqrt{npq}$
where,
• n = number of trials
• p = probability of success
• q = probability of failure = 1 − p
Since we take the mean of the sampling distribution of the proportion to be
the population proportion, we have $\mu_{\bar{p}} = p$.


Similarly, we can modify the formula for the standard deviation of the
binomial distribution, $\sqrt{npq}$, which measures the standard deviation
in the number of successes. To change the number of successes to the
proportion of successes, we divide $\sqrt{npq}$ by n and get $\sqrt{pq/n}$.
Therefore, the standard error of the proportion is given by:

$\sigma_{\bar{p}} = \sqrt{pq/n}$

Solved Problem 1: In a very large organisation, the director wanted to find
out what proportion of the employees prefer to provide their own retirement
benefits in lieu of a company-sponsored plan. A simple random sample of
75 employees was taken. It was found that 40%, that is, 0.4 of them are
interested in providing their own retirement plans. The management
requests that we use this sample to find an interval about which they can be
99 percent confident that it contains the true population proportion.
Solution: Here, n = 75, p = 0.4; q = 1 − p = 1 − 0.4 = 0.6

Therefore, standard error of the proportion
$= \sqrt{pq/n} = \sqrt{(0.4)(0.6)/75} = 0.057$

Therefore, the interval estimate for the 99% level of confidence is
0.4 ± 2.58 (0.057), that is, from 0.253 to 0.547.
Hence, we are 99 percent confident that the proportion of the total
population of employees who wish to establish their own retirement plans
lies between 0.253 and 0.547.
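
The calculation in Solved Problem 1 can be sketched in Python as follows. The function name `proportion_interval` is an illustrative assumption. Because the code does not round the standard error to 0.057 before multiplying, its limits (about 0.254 and 0.546) differ slightly in the third decimal place from the hand calculation.

```python
import math

def proportion_interval(p, n, z):
    """Return (lower, upper, standard_error) for a large-sample proportion interval."""
    q = 1 - p
    standard_error = math.sqrt(p * q / n)
    return p - z * standard_error, p + z * standard_error, standard_error

# Solved Problem 1: n = 75, sample proportion 0.4, z = 2.58 for 99% confidence.
lower, upper, standard_error = proportion_interval(p=0.4, n=75, z=2.58)
print(f"Standard error of the proportion: {standard_error:.3f}")
print(f"99% interval: {lower:.3f} to {upper:.3f}")
```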
8.8.3 Interval estimates using the Student’s ‘t’ distribution
So far, the sample sizes we were examining were all larger than 30. This is
not always the case. This section answers the question of how to handle
estimates where the normal distribution is not the appropriate sampling
distribution. In other words, we will discuss here how to make estimates
when the sample size is 30 or less and the population standard deviation is
unknown. For example, we may have data from only 10 weeks, or sample sizes
less than 30. Fortunately, another distribution exists that is appropriate
in these cases. It is called the ‘t’ distribution. Early theoretical work on
‘t’ distributions was done by W. S. Gosset in the early 1900s. Gosset was
employed by the Guinness Brewery in Dublin, Ireland, which did not permit
employees to publish research findings under their own names. So Gosset
adopted the pen name ‘Student’ and published under that name.
Consequently, the ‘t’ distribution is commonly called Student’s ‘t’
distribution, or simply Student’s distribution.
Conditions for usage
Statisticians often associate the ‘t’ distribution with small sample
statistics, because it is used when the sample size is 30 or less. This is
misleading, because the size of the sample is only one of the conditions
that lead us to use the ‘t’ distribution. The second condition is that the
population standard deviation must be unknown. Furthermore, in using the
‘t’ distribution, we assume that the population is normal or approximately
normal.
Degrees of freedom
“There is a different „t‟ distribution for each of the possible degrees of
freedom.”

Key Statistic
We can define degrees of freedom as the number of values we can
choose freely. We will use degrees of freedom when we select a ‘t’
distribution to estimate a population mean, and we will use ‘n − 1’ degrees
of freedom, where ‘n’ is the sample size.

For example, if we use a sample of 20 to estimate the mean of a population,
we will use 19 degrees of freedom in order to select the appropriate ‘t’
distribution. With two sample values, we have one degree of freedom
(2 − 1 = 1), and with seven sample values, we have six degrees of freedom
(7 − 1 = 6). In each of these two examples, then, we had ‘n − 1’ degrees of
freedom, where ‘n’ is the sample size. Similarly, a sample of 23 would
give us 22 degrees of freedom.

Key Statistic
In any estimation problem in which the sample size is 30 or less, the
standard deviation of the population is unknown, and the underlying
population can be assumed to be normal or approximately normal, use
the ‘t’ distribution.

Using the ‘t’ distribution table


We will discuss here the comparison between the ‘t’ and ‘z’ tables. The
table of ‘t’ distribution values differs in construction from the ‘z’ table
or normal distribution table. The ‘t’ table is more compact and shows areas
and ‘t’ values for only a few percentages (10, 5, 2, and 1 percent). Because
there is a different ‘t’ distribution for each number of degrees of freedom,
a more complete table would be quite lengthy, although we can conceive of
the need for one.
A second difference in the ‘t’ table is that it does not focus on the chance
that the population parameter being estimated will fall within our
confidence interval. Instead, it measures the chance that the population
parameter we are estimating will not be within our confidence interval
(that is, it will lie outside the confidence interval).
If we are making an estimate at the 90 percent confidence level, we would
look in the ‘t’ table under the 0.10 column (100 percent – 90 percent = 10
percent). This 0.10 chance of error is symbolised by the Greek letter
alpha ‘α’. We would find the appropriate ‘t’ values for confidence intervals
of 95 percent, 98 percent, and 99 percent under the columns headed 0.05,
0.02, and 0.01, respectively. A third difference in using the ‘t’ table is
that we must specify the degrees of freedom with which we are dealing.
Suppose we make an estimate at the 90 percent confidence level with a
sample size of 14, which gives 13 degrees of freedom. We then look under
the 0.10 column until we encounter the row labeled 13. Like a ‘z’ value,
the ‘t’ value of 1.771 found there shows that if we mark off plus and minus
1.771 $\hat{\sigma}_{\bar{x}}$ (estimated standard errors of $\bar{x}$) on
either side of the mean, the area under the curve between these two limits
will be 90 percent, and the area outside these limits (the chance of
error) will be 10 percent.
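
The table lookup described above can be sketched as follows. The only ‘t’ value used is the one quoted in the text (13 degrees of freedom, 0.10 column, t = 1.771). The sample mean of 50.0 and sample standard deviation of 7.0 are hypothetical figures chosen purely for illustration, and the names `T_TABLE` and `t_interval` are assumptions.

```python
import math

# A one-entry extract of the 't' table: (degrees of freedom, alpha) -> t value.
# Only the value quoted in the text is included here.
T_TABLE = {(13, 0.10): 1.771}

def t_interval(sample_mean, sample_sd, n, alpha, table=T_TABLE):
    """Confidence interval for the mean using the 't' distribution."""
    degrees_of_freedom = n - 1
    t_value = table[(degrees_of_freedom, alpha)]
    estimated_se = sample_sd / math.sqrt(n)   # estimated standard error of the mean
    return sample_mean - t_value * estimated_se, sample_mean + t_value * estimated_se

# Hypothetical sample: n = 14, mean 50.0, standard deviation 7.0, 90% confidence.
lower, upper = t_interval(sample_mean=50.0, sample_sd=7.0, n=14, alpha=0.10)
print(f"90% confidence interval: {lower:.3f} to {upper:.3f}")
```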
Self Assessment Questions
1. XY Pizza has developed quite a business in Bangalore by delivering
pizza orders promptly. It guarantees that its pizzas will be delivered in
30 minutes or less from the time the order was placed, and if the
delivery is late, the pizza is free. The time that it takes to deliver each
pizza order that is on time is recorded in the Pizza Time Book (PTB),
and the delivery time for those pizzas that are delivered late is recorded
as 30 minutes in the PTB. A sample of 12 random entries from the PTB
is listed in table 8.2.


Table 8.2: Twelve random entries of pizza delivery time

15.3 29.5 30 10.1 30 19.6


10.8 12.2 14.8 30 22.1 18.3

i) Find the mean for the sample.
ii) From what population was this sample drawn?
iii) Can this sample be used to estimate the average time that
it takes for XY Pizza to deliver a pizza? Explain.
2. Madhu, a frugal student, wants to buy a used bike. After randomly
selecting 125 wanted advertisements, he found the average price of the
bike to be Rs. 3250 with a standard deviation of Rs. 615. Establish an
interval estimate for the average price of bike so that Madhu can be:
i) 68.3% certain that the population mean lies in this interval.
ii) 95.5% certain that the population mean lies in this interval.
3. Given the following confidence levels, express the lower and upper limits
of the confidence interval for these levels in terms of $\bar{x}$ and
$\sigma_{\bar{x}}$ (use the normal distribution tables).
i) 54 percent
ii) 75 percent
iii) 94 percent
iv) 98 percent
4. From a population of 540, a sample of 60 individuals is taken. From this
sample the mean is found to be 6.2 and the standard deviation to be
1.368.
i) Find the estimated standard error of the mean.
ii) Construct a 96 % confidence interval of the mean.
5. For the following sample sizes and confidence levels, find the
appropriate ‘t’ values for constructing confidence intervals (use the ‘t’
table).
i) n = 28; 95%
ii) n = 8; 98%
iii) n = 13; 90%
iv) n = 25; 95%


8.9 Determining the Sample Size in Estimation


So far, in all the examples we have discussed in this unit, the sample size
was known. Now we try to determine the sample size ‘n’. If it is too
small, we may fail to achieve the objective; if it is too large, we will be
wasting resources. Let us examine a method that is useful in determining
what sample size is necessary for any specified level of precision.
Table 8.3 compares the ways of expressing the same confidence limits.
Table 8.3: Comparison of two ways of expressing the same confidence limits

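     Lower confidence limit                  Upper confidence limit
a.   $\bar{x} - 500$                         $\bar{x} + 500$
b.   $\bar{x} - z\sigma_{\bar{x}}$           $\bar{x} + z\sigma_{\bar{x}}$
c.   $\bar{x} - t\hat{\sigma}_{\bar{x}}$     $\bar{x} + t\hat{\sigma}_{\bar{x}}$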
Solved Problem 2: IMA Management University wants to conduct a survey
of the annual earnings of its graduates in international placements. It
knows from past experience that the standard deviation of the annual
earnings of its population of students is Rs. 1500. How large a sample
should be taken in order to estimate the mean annual earnings of last
year’s class within Rs. 500 at the 95% level of confidence?
Solution: From the given data, the estimate must lie within Rs. 500 on
either side of the population mean. That is,

$z\sigma_{\bar{x}} = 500$

At the 95% level of confidence, we know from the ‘z’ table that ‘z’ is 1.96.
Therefore,
$1.96\,\sigma_{\bar{x}} = 500$
$\sigma_{\bar{x}} = 500 / 1.96 \approx 255$
Now, if the standard error of the mean is 255, that leads us to:
$\sigma_{\bar{x}} = \sigma / \sqrt{n} = 255$
Since ‘σ’ is 1500, we can find ‘n’. That is:

$1500 / \sqrt{n} = 255$

Therefore,
$n = \left(\dfrac{1500}{255}\right)^2 = 34.6$
It implies that ‘n’ should be greater than 34.6, that is, at least 35, if
the university wants to estimate the mean with the required precision.
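
The steps of Solved Problem 2 can be combined into a single formula, $n = (z\sigma / E)^2$, where E is the allowed error on either side of the mean. The sketch below applies it; the function name `sample_size_for_mean` and the symbol E are illustrative assumptions. The result is rounded up because a sample size must be a whole number.

```python
import math

def sample_size_for_mean(sigma, margin, z):
    """Smallest whole sample size for which z * sigma / sqrt(n) <= margin."""
    return math.ceil((z * sigma / margin) ** 2)

# Solved Problem 2: sigma = Rs. 1500, margin = Rs. 500, z = 1.96 (95% confidence).
required_n = sample_size_for_mean(sigma=1500, margin=500, z=1.96)
print(f"Required sample size: {required_n}")
```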

8.10 Summary
In this unit, we have discussed point estimates and interval
estimates. These estimates are the foundations for inferential statistics in
estimation and hypothesis testing, which we will be discussing in the next
unit. In this unit, you have studied the concept of confidence levels and
the concept of making estimates when the sample sizes are small and large.
You have studied how to calculate a sample size, provided that we know
the level of accuracy we want in the estimate. We have also discussed that
if the sample size is 30 or less and the population standard deviation is
not known, we use the Student’s ‘t’ distribution for estimation.

8.11 Terminal Questions


1. XYZ Bank is determining the number of tellers available during the
Friday lunch rush hour. The bank has collected data on the number of
people who entered the bank during the past three months on Friday
from 11 am to 1 pm. Using the data from table 8.4, find the point
estimates of the mean and standard deviation of the population from
which the sample was drawn.
Table 8.4: Number of people who entered XYZ Bank
242 275 289 306 342 385
279 245 269 305 294 328
2. From a population known to have a standard deviation of 1.4, a sample
of 60 individuals is taken. The mean of this sample is found to be 6.2.
i) Find the standard error of the mean.
ii) Establish an interval estimate around the sample mean using one
standard error of the mean.

3. On collecting a sample of 250 from a population with a known standard
deviation of 13.7, the mean is found to be 112.4.
i) Find a 95% confidence interval for the mean.
ii) Find a 99% confidence interval for the mean.

8.12 Answers to SAQs and TQs

Answers to Self Assessment Questions


1. i) For the given sample the mean is 20.225 minutes.
ii) The sample was drawn from the Pizza Time Book (PTB) of XY
Pizza.
iii) No. Any delivery time over 30 minutes is recorded as 30 minutes, so
the sample will underestimate the average delivery time.
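2. The standard deviation, sample size, and sample mean are given as:
$\sigma = 615;\ n = 125;\ \bar{x} = 3250$
and the standard error is calculated as:
$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{615}{\sqrt{125}} = 55.01$
i) $\bar{x} \pm 1\sigma_{\bar{x}}$ = 3250 ± 55.01, giving a range between
3194.99 and 3305.01 to be 68.3% certain.
ii) 95.5% certain means $\bar{x} \pm 2\sigma_{\bar{x}}$ = 3250 ± 110.02,
giving a range between 3139.98 and 3360.02.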
3. The required lower and upper confidence limits are:
i. $\bar{x} \pm 0.74\sigma_{\bar{x}}$
ii. $\bar{x} \pm 1.15\sigma_{\bar{x}}$
iii. $\bar{x} \pm 1.88\sigma_{\bar{x}}$
iv. $\bar{x} \pm 2.33\sigma_{\bar{x}}$


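4.
i. $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} \times \sqrt{\dfrac{N - n}{N - 1}}$, as $\dfrac{n}{N} > 0.05$
$\sigma_{\bar{x}} = \dfrac{1.368}{\sqrt{60}} \times \sqrt{\dfrac{540 - 60}{540 - 1}} = 0.167$
ii. $\bar{x} \pm 2.05\sigma_{\bar{x}}$ = 6.2 ± 2.05 (0.167)
Hence, the LCL and UCL are 5.86 and 6.54 respectively.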


5.
i) 2.052
ii) 2.998
iii) 1.782
iv) 2.064

Answers to Terminal Questions


1. The mean and standard deviation are 296.583 people and 40.751
people.
2.
i) 0.181
ii) 6.019, 6.381
3.
i) 112.4 ± 1.697
ii) 112.4 ± 2.234

8.13 References
• Richard I. Levin, David S. Rubin (2008). Statistics for Management,
Seventh Edition. PHI Learning Private Limited.



Statistics for Management Unit 9

Unit 9 Testing of Hypothesis in Case of Large and Small Samples
Structure:
9.1 Introduction – Large Samples
Learning objectives
Assumptions
9.2 Testing Hypothesis
Null and alternate hypothesis
Interpreting the level of significance
Hypotheses are accepted and not proved
9.3 Selecting a Significance Level
Preference of type I error
Preference of type II error
Determine appropriate distribution
9.4 Two – Tailed Tests and One – Tailed Tests
Two – tailed tests
Case study on two –tailed and one-tailed tests
9.5 Classification of Test Statistics
Statistics used for testing of hypothesis
Test procedure
How to identify the right statistics for the test
9.6 Testing of Hypothesis in Case of Small Samples
Introduction – small samples
9.7 ‘t’ Distribution
Uses of ‘t’ test
9.8 Summary
9.9 Terminal Questions
9.10 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
9.11 References


9.1 Introduction
In Unit 8, ‘Estimation’, you studied estimation from samples and the
methods of estimation. In this unit, ‘Testing of Hypothesis in Case of
Large and Small Samples’, you will study hypotheses, assumptions, and the
testing of hypotheses. Estimation is about estimating population
parameters from a sample and finding confidence intervals for them.
A hypothesis is an opinion about a population parameter that may or
may not be true. Hypothesis testing is helpful in decision making. Before
starting this unit, refresh the concepts you have studied on sampling and
estimation.
Hypothesis testing begins with an assumption, called a hypothesis that we
make about a population parameter. We assume a certain value for a
population mean. To test the validity of our assumption, we gather sample
data and determine the difference between the hypothesised value and the
actual value of the sample mean. Then we judge whether the difference is
significant.
The smaller the difference, the greater the likelihood that our hypothesised
value for the mean is correct. The larger the difference, the smaller the
likelihood that our hypothesised value for the mean is correct. Unfortunately,
the difference between the hypothesised population parameter and the
actual statistic is more often neither so large that we automatically reject our
hypothesis nor so small that we just as quickly accept it. So in hypothesis
testing, as in most significant real-life decisions, clear-cut solutions are the
exception, not the rule.
9.1.1 Learning objectives
By the end of this unit, you should be able to:
 Describe the basic concepts of hypothesis testing
 Describe the different test statistics available
 Identify the test for a given problem
 Identify the type of errors preferred
9.1.2 Assumptions
Although hypothesis testing sounds like some formal statistical term
completely unrelated to business decision making, in fact managers
propose and test hypotheses all the time. For example, "If we drop the price
of this car model by Rs. 1,500, we will sell 50,000 cars this year" is a
hypothesis. To test this hypothesis, total car sales till the end of the year
have to be counted.
Managerial hypotheses are based on intuition; the marketplace decides
whether the manager's intuitions were correct. Hypothesis testing is about
making inferences about a population from only a small sample. The bottom
line in hypothesis testing is when we ask ourselves (and then decide)
whether a population like the one we think this is would be likely to
produce a sample like the one we are looking at.

9.2 Testing Hypothesis

9.2.1 Null and alternate hypothesis


In hypothesis testing, we must state the assumed or hypothesised value of
the population parameter before we begin sampling. The assumption we
wish to test is called the null hypothesis and is symbolised by $H_0$.

Example 1
We want to test the hypothesis that the population mean is equal to 500.
We would symbolise it as follows and read it as
'The null hypothesis is that the population mean = 500':
$$H_0: \mu = 500$$

The term 'null hypothesis' arises from earlier agricultural and medical
applications of statistics. In order to test the effectiveness of a new fertiliser
or drug, the tested hypothesis (the null hypothesis) was that it had no effect,
that is, there was no difference between treated and untreated samples. If
we use a hypothesised value of a population mean in a problem, we would
represent it symbolically as $\mu_{H_0}$. This is read 'the hypothesised value of
the population mean'.
If our sample results fail to support the null hypothesis, we must conclude
that something else is true. Whenever we reject the null hypothesis, the
conclusion we do accept is called the alternative hypothesis and is
symbolised $H_1$ ("H sub-one").


For the null hypothesis $H_0: \mu = 200$, we will consider three alternative
hypotheses:
$H_1: \mu \neq 200$ (population mean is not equal to 200)
$H_1: \mu > 200$ (population mean greater than 200)
$H_1: \mu < 200$ (population mean less than 200)
9.2.2 Interpreting the level of significance
The purpose of hypothesis testing is not to question the computed value of
the sample statistic but to make a judgment about the difference between
that sample statistic and a hypothesised population parameter.
The next step after stating the null and alternative hypotheses is to decide
what criterion to use for deciding whether to accept or reject the null
hypothesis. If we assume the hypothesis is correct, then the significance
level indicates the percentage of sample means that fall outside certain
limits. (In estimation, the confidence level indicates the percentage of sample
means that fall within the defined confidence limits.)
9.2.3 Hypotheses are accepted and not proved
Even if our sample statistic does fall in the non-shaded region (the region
shown in figure 9.1 that makes up 95 percent of the area under the curve),
this does not prove that our null hypothesis (H0) is true; it simply does not
provide statistical evidence to reject it. Why? It is because the only way in
which the hypothesis can be accepted with certainty is for us to know the
population parameter; unfortunately, this is not possible.
Therefore, whenever we say that we accept the null hypothesis, we actually
mean that there is not sufficient statistical evidence to reject it. Use of the
term accept, instead of do not reject, has become standard. It means that
when sample data do not cause us to reject a null hypothesis, we behave as
if that hypothesis is true.


Fig. 9.1: Acceptance and rejection region of sample

9.3 Selecting a Significance Level


There is no single standard or universal level of significance for testing
hypotheses. In some instances, a 5% level of significance is used. In the
published results of research papers, researchers often test hypotheses at
the 1 percent level of significance. Hence, it is possible to test a hypothesis
at any level of significance. But remember that our choice of the minimum
standard for an acceptable probability, or the significance level, is also the
risk we assume of rejecting a null hypothesis when it is true.
The higher the significance level we use for testing a hypothesis, the higher
the probability of rejecting a null hypothesis when it is true. A 5% level of
significance implies that we are ready to reject a true hypothesis in 5% of cases.
If the significance level is high, then we would rarely accept the null hypothesis
when it is not true but, at the same time, often reject it when it is true.
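The link between a significance level and the tabulated z value used later in this unit can be sketched as follows. This is a minimal illustration that assumes the test statistic follows the standard normal distribution; the helper name `z_critical` is hypothetical and not part of the unit.

```python
from statistics import NormalDist

def z_critical(alpha, two_tailed=False):
    """Return the positive tabulated z value for significance level alpha."""
    # A two-tailed test splits alpha equally between the two rejection regions.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

for alpha in (0.05, 0.02, 0.01):
    one = z_critical(alpha)
    two = z_critical(alpha, two_tailed=True)
    print(f"alpha = {alpha:.2f}: one-tailed z = {one:.3f}, two-tailed z = {two:.3f}")
```

A smaller significance level pushes the critical value further into the tail, so a true null hypothesis is rejected less often.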
When testing a hypothesis we come across four possible situations. The
table 9.1 shows the four possible situations.
Table 9.1: Possible situations when testing a hypothesis

| Test result says | Hypothesis is true | Hypothesis is false |
|---|---|---|
| Accept | Right decision | Type II error |
| Reject | Type I error | Right decision |


The combinations are:
1. If the hypothesis is true, and the test result accepts it, then we have
made a right decision.
2. If the hypothesis is true, and the test result rejects it, then we have made a
wrong decision (Type I error). Its probability is denoted by $\alpha$ and is also
known as the producer's risk.
3. If the hypothesis is false, and the test result accepts it, then we have made a
wrong decision (Type II error). Its probability is denoted by $\beta$ and is also
known as the consumer's risk. $1 - \beta$ is called the power of the test.
4. If the hypothesis is false, and the test result rejects it, then we have made
a right decision.
9.3.1 Preference of type I error
Suppose that making a Type I error (rejecting a null hypothesis when it is
true) involves the time and trouble of reworking a batch of chemicals that
should have been accepted. At the same time, making a Type II error
(accepting a null hypothesis when it is false) means taking a chance that an
entire group of users of this chemical compound will be poisoned.
Obviously, the management of this company will prefer a Type I error to a
Type II error and, as a result, will set very high levels of significance in its
testing to get low $\beta$'s.
9.3.2 Preference of type II error
Suppose, on the other hand, that making a Type I error involves
disassembling an entire engine at the factory, but making a Type II error
involves relatively inexpensive warranty repairs by the dealers. Then the
manufacturer is more likely to prefer a Type II error and will set lower
significance levels in its testing.
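The trade-off between the two types of error can be made concrete with a small calculation. The sketch below assumes a lower-tailed z test with a known population standard deviation; the function name and the figures (hypothesised mean 1,000, true mean 990, standard deviation 50, sample of 100) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu_true, sigma, n, alpha):
    """Beta for a lower-tailed z test of H0: mu = mu0 when the true mean is mu_true."""
    se = sigma / sqrt(n)
    # H0 is rejected when the sample mean falls below this cut-off.
    critical_mean = mu0 - NormalDist().inv_cdf(1 - alpha) * se
    # H0 is wrongly accepted whenever the sample mean is at or above the cut-off.
    return 1 - NormalDist(mu_true, se).cdf(critical_mean)

# Hypothetical figures, chosen only to show the direction of the effect
for alpha in (0.01, 0.05, 0.10):
    beta = type_ii_error(mu0=1000, mu_true=990, sigma=50, n=100, alpha=alpha)
    print(f"alpha = {alpha:.2f}: beta = {beta:.3f}, power = {1 - beta:.3f}")
```

Raising the significance level lowers $\beta$, which is why a firm that fears a Type II error more sets a higher significance level, and one that fears a Type I error more sets a lower one.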
9.3.3 Determine appropriate distribution
After deciding what level of significance to use, our next task in hypothesis
testing is to determine the appropriate probability distribution. We have a
choice between the normal distribution and the 't' distribution.
The rules for choosing the appropriate distribution are similar to those we
encountered in the unit on estimation. Table 9.2 summarises when to
use the normal and 't' distributions in making tests of means. Later in this
unit, we shall examine the distributions appropriate for testing hypotheses
about proportions.


You have to remember one more rule when testing the hypothesised values
of a mean. As in estimation, use the finite population multiplier whenever the
population is finite in size, sampling is done without replacement, and the
sample is more than five percent of the population.

Table 9.2: Conditions for using the normal and 't' distributions in
testing hypothesis about means

| | When the population standard deviation is known | When the population standard deviation is not known |
|---|---|---|
| Sample size 'n' is larger than 30 | Normal distribution, z-table | Normal distribution, z-table |
| Sample size 'n' is 30 or less and we assume the population is normal or approximately so | Normal distribution, z-table | 't' distribution, 't' table |
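The rule in Table 9.2 can be written as a small decision function. This is only a sketch of the table; the function name `choose_distribution` is hypothetical, and it assumes, as the table does, that a small sample comes from a normal or approximately normal population.

```python
def choose_distribution(n, sigma_known):
    """Return 'z' or 't' for a test about a mean, following Table 9.2."""
    if n > 30 or sigma_known:
        return "z"  # normal distribution, z-table
    return "t"      # n <= 30 and population S.D. not known: 't' table

print(choose_distribution(n=50, sigma_known=False))
print(choose_distribution(n=20, sigma_known=True))
print(choose_distribution(n=20, sigma_known=False))
```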

9.4 Two – Tailed Tests and One – Tailed Tests

9.4.1 Two – tailed tests


A two-tailed test of a hypothesis will reject the null hypothesis if the sample
mean is significantly higher than or lower than the hypothesised population
mean. Thus, in a two-tailed test, there are two rejection regions. This is
shown in figure 9.1.
A two-tailed test is appropriate when:
• the null hypothesis is $\mu = \mu_{H_0}$ (where $\mu_{H_0}$ is some specified value)
• the alternative hypothesis is $\mu \neq \mu_{H_0}$


9.4.2 Case study on two-tailed and one-tailed tests

Case Study
Assume that a manufacturer of light bulbs wants to produce bulbs with a
mean life of:
$$\mu = \mu_{H_0} = 1000 \text{ hours}$$
If the lifetime is shorter, he will lose customers to his competitors; if the
lifetime is longer, he will have a very high production cost because the
filaments will be excessively thick.
In order to see whether his production process is working properly, he
takes a sample of the output to test the hypothesis:
$$H_0: \mu = 1000$$
Because he does not want to deviate significantly from 1,000 hours in either
direction, he uses a two-tailed test, and the appropriate alternative
hypothesis is:
$$H_1: \mu \neq 1000$$
Therefore, he rejects the null hypothesis if the mean life of bulbs in the
sample is either too far above 1,000 hours or too far below 1,000 hours.

However, there are situations in which a two-tailed test is not appropriate,
and we must use a one-tailed test.


Case Study (contd.)


Consider the case of a wholesaler who buys light bulbs from the
manufacturer discussed earlier. The wholesaler buys bulbs in large lots
and does not want to accept a lot of bulbs unless their mean life is at
least 1,000 hours. As each shipment arrives, the wholesaler tests a
sample to decide whether he should accept the shipment. The wholesaler
will reject the shipment only if he feels that the mean life is below 1,000
hours. If he feels that the bulbs are better than expected (with a mean life
above 1,000 hours), he certainly will not reject the shipment because the
longer life comes at no extra cost.
So the wholesaler's hypotheses are:
$H_0: \mu = 1000$ hours and $H_1: \mu < 1000$ hours.
He rejects $H_0$ only if the mean life of the sampled bulbs is significantly
below 1,000 hours. Figure 9.2 illustrates this situation and shows why
this test is called a left-tailed test (or a lower-tailed test).

Fig. 9.2: Left-tailed test

In general, a left-tailed (lower-tailed) test is used if the hypotheses are
$H_0: \mu = \mu_{H_0}$ and $H_1: \mu < \mu_{H_0}$. In such a situation, it is sample
evidence with the sample mean significantly below the hypothesised population
mean that leads us to reject the null hypothesis in favour of the alternative
hypothesis. Stated differently, the rejection region is in the lower tail
(left tail) of the distribution of the sample mean, and that is why we call
this a lower-tailed test.


A left-tailed test is one of two kinds of one-tailed tests. As you have probably
guessed by now, the other kind of one-tailed test is a right-tailed test (or an
upper-tailed test). An upper-tailed test is used when the hypotheses are
$H_0: \mu = \mu_{H_0}$ and $H_1: \mu > \mu_{H_0}$. Only values of the sample mean that
are significantly above the hypothesised population mean will cause us to reject
the null hypothesis in favour of the alternative hypothesis. This is called an
upper-tailed test, as shown in figure 9.3, because the rejection region is in the
upper tail of the distribution of the sample mean.
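The three kinds of rejection region described above can be sketched as a single decision rule. This is an illustrative helper, not a procedure from the unit: the name `reject_null` is hypothetical, and it assumes z is computed as (sample mean minus hypothesised mean) divided by the standard error, so that a negative z means the sample mean lies below the hypothesised value.

```python
from statistics import NormalDist

def reject_null(z, alpha, tail):
    """Decide whether to reject H0 for a 'two', 'left' or 'right' tailed z test."""
    std_normal = NormalDist()
    if tail == "two":
        # Rejection regions in both tails, each with area alpha / 2
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    if tail == "left":
        # Rejection region in the lower tail only
        return z < -std_normal.inv_cdf(1 - alpha)
    if tail == "right":
        # Rejection region in the upper tail only
        return z > std_normal.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'two', 'left' or 'right'")

# The same statistic can lead to different decisions depending on the tail.
for tail in ("two", "left", "right"):
    print(tail, reject_null(-1.8, 0.05, tail))
```

With z = -1.8 at the 5% level, only the left-tailed test rejects, because the one-tailed cut-off (about 1.645) is less extreme than the two-tailed one (about 1.96).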

Fig. 9.3: Right-tailed test

This is to remind you again that, in each example of hypothesis testing,
when we accept a null hypothesis on the basis of sample information, we
are really saying that there is no statistical evidence to reject it. We are not
saying that the null hypothesis is true. The only way to prove a null
hypothesis is to know the population parameter, and that is not possible with
sampling. Thus, we accept the null hypothesis and behave as if it is true
simply because we can find no evidence to reject it.

Self Assessment Questions


1. For the following cases, specify which probability distribution to use in
hypothesis testing:
i. $H_0: \mu = 27$, $H_1: \mu \neq 27$, $\bar{x} = 33$, sample $\sigma = 4$, $n = 25$
ii. $H_0: \mu = 98.6$, $H_1: \mu > 98.6$, $\bar{x} = 99.1$, $\sigma = 1.5$, $n = 50$
iii. $H_0: \mu = 3.5$, $H_1: \mu < 3.5$, $\bar{x} = 2.8$, sample $\sigma = 0.6$, $n = 18$
iv. $H_0: \mu = 382$, $H_1: \mu \neq 382$, $\bar{x} = 363$, sample $\sigma = 68$, $n = 12$
v. $H_0: \mu = 57$, $H_1: \mu > 57$, $\bar{x} = 65$, sample $\sigma = 12$, $n = 42$


9.5. Classification of Test Statistics


9.5.1 Statistics used for testing of hypothesis
Tables 9.3a and 9.3b show the classification of statistics that are
used for testing of hypothesis for large samples (n > 30). When the characteristic
studied is an attribute (a proportion), table 9.3a is used; when it is a
variable (a mean), table 9.3b is used.
Table 9.3a: Large samples (n > 30) – attributes (proportions)

| Test No. | Description of test | Test statistic | Notes |
|---|---|---|---|
| 1 | Test for specified proportion – infinite population | $Z = \dfrac{P - P_s}{\left(\dfrac{PQ}{n}\right)^{1/2}}$ | $P$ = population proportion; $P_s$ = sample proportion; $Q = 1 - P$; $n$ = sample size |
| 2 | Test for specified proportion – finite population | $Z = \dfrac{P - P_s}{\left(\dfrac{PQ}{n}\right)^{1/2}\left(\dfrac{N-n}{N-1}\right)^{1/2}}$ | $P$ = population proportion; $P_s$ = sample proportion; $Q = 1 - P$; $n$ = sample size; $N$ = population size |
| 3 | Test between proportions – different populations | $Z = \dfrac{P_1 - P_2}{\left(\dfrac{P_1Q_1}{n_1} + \dfrac{P_2Q_2}{n_2}\right)^{1/2}}$ | $P_1$ = first sample proportion; $P_2$ = second sample proportion; $Q_1 = 1 - P_1$, $Q_2 = 1 - P_2$; $n_1$ = first sample size; $n_2$ = second sample size |
| 4 | Test between proportions – same population | $Z = \dfrac{P_1 - P_2}{\left[PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)\right]^{1/2}}$ | $P_1$ = first sample proportion; $P_2$ = second sample proportion; $P = \dfrac{n_1P_1 + n_2P_2}{n_1 + n_2}$ (pooled proportion); $Q = 1 - P$; $n_1$ = first sample size; $n_2$ = second sample size |


Table 9.3b: Large samples (n > 30) – variables (means)

| Test No. | Description of test | Test statistic | Notes |
|---|---|---|---|
| 5 | Test for specified mean – infinite population | $Z = \dfrac{\mu - \mu_s}{\sigma/\sqrt{n}}$ | $\mu$ = population mean; $\mu_s$ = sample mean; $\sigma$ = population S.D. We can use the sample S.D. ($s$) also in case the population S.D. is not given |
| 6 | Test for specified mean – finite population | $Z = \dfrac{\mu - \mu_s}{\dfrac{\sigma}{\sqrt{n}}\left(\dfrac{N-n}{N-1}\right)^{1/2}}$ | $\mu$ = population mean; $\mu_s$ = sample mean; $\sigma$ = population S.D.; $N$ = population size. We can use the sample S.D. ($s$) also in case the population S.D. is not given |
| 7 | Test between means – different populations | $Z = \dfrac{\mu_1 - \mu_2}{\left(\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}\right)^{1/2}}$ | $\mu_1$ = first sample mean; $\mu_2$ = second sample mean; $\sigma_1$, $\sigma_2$ = S.D. of the first and second samples; $n_1$ = first sample size; $n_2$ = second sample size |
| 8 | Test between means – same population | $Z = \dfrac{\mu_1 - \mu_2}{\sigma\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)^{1/2}}$ | where $\sigma = \left(\dfrac{n_1\sigma_1^2 + n_2\sigma_2^2}{n_1 + n_2}\right)^{1/2}$ |
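As a rough illustration of the test between means for different populations (test 7 in Table 9.3b), the statistic can be computed as below. The function name and the figures are made up for illustration and do not come from the unit.

```python
from math import sqrt

def z_two_means(mean1, mean2, sd1, sd2, n1, n2):
    """Z for the difference between two large-sample means (different populations)."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Hypothetical samples: two groups of 100 observations each
z = z_two_means(mean1=52.0, mean2=50.0, sd1=6.0, sd2=8.0, n1=100, n2=100)
print(round(z, 2))
```

The absolute value of the result would then be compared with the tabulated z value for the chosen significance level.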

9.5.2 Test procedure


The figure 9.4 displays the hypothesis testing procedure.


Fig. 9.4: Hypothesis testing procedure

9.5.3 How to identify the right statistics for the test


The figure 9.5 displays the step by step procedure to identify the right
statistics for the test.


Fig. 9.5: Identification of right statistics for the test

Self Assessment Questions


2. State whether the following statements are 'True' or 'False'.
i) The null hypothesis states that there is a significant difference between
observed and hypothetical values.
ii) 1% level of significance means we are ready to reject a true
hypothesis in 99% of cases.
iii) If the null hypothesis is $H_0: \mu = \mu_s$, then it is a two-tailed test.
iv) If the calculated value of a statistic is less than the tabulated value of the
statistic, then $H_0$ is accepted.
v) $1 - \beta$ is called the power of the test.
vi) If $n_1 = 300$, $n_2 = 500$, $\mu_1 = 50$, $\mu_2 = 60$, $\sigma_1 = 10$, $\sigma_2 = 12$ are the
results of two samples taken from two cities A and B, then we use the test
between means under different populations.
vii) If n > 30, then we do not apply the z test unless the population S.D. is
known.
Solved Problem 1: XYZ Press hypothesises that the average life of its latest
web-offset press is 14,500 hours. They know the standard deviation of the
press life is 2,100 hours. From a sample of 25 presses, the company finds a
sample mean of 13,000 hours. At 0.01 significance level, should the
company conclude that the average life of the presses is less than the
hypothesised 14,500 hours?
Solution: The procedure followed is described below.
1. Null hypothesis $H_0: \mu = 14500$
Alternate hypothesis $H_A: \mu < 14500$ (one-tailed test)
2. Level of significance $\alpha = 0.01 \Rightarrow Z_{tab} = 2.33$
3. Test statistic
$$Z = \frac{\mu - \mu_s}{\sigma/\sqrt{n}}$$
4. Given $\mu = 14500$, $\mu_s = 13000$, $\sigma = 2100$, $n = 25$
Note: Although n < 30, the population standard deviation is given; therefore
it becomes a Z test.
$$\frac{\sigma}{\sqrt{n}} = \frac{2100}{\sqrt{25}} = \frac{2100}{5} = 420$$
5. Test
$$Z_{cal} = \frac{14500 - 13000}{420} = 3.57$$
6. Conclusion: Since $Z_{cal}$ (3.57) > $Z_{tab}$ (2.33), $H_0$ is rejected.
Therefore, the average life of the press is less than 14,500 hours.
Solved Problem 2: Theatre owners in India know that a hit movie ran for an
average of 84 days with a standard deviation of 10 days in each city where the
movie was screened. A particular movie distributor was interested in
comparing the popularity of movies in his region with that of the population.
He chose 75 theatres at random in the region and found that a popular
movie ran for 81.5 days on average.
a. State appropriate hypotheses for testing whether there was a significant
difference between theatres in the distributor's region and the
population.
b. At a 1% significance level, test these hypotheses.
Solution: The procedure to be followed is explained in the form of steps.
1. Null hypothesis $H_0: \mu = 84$
Alternate hypothesis $H_A: \mu \neq 84$ (two-tailed test)
2. Level of significance 1% $\Rightarrow Z_{tab} = 2.58$
3. Test statistic
$$Z = \frac{\mu - \mu_s}{\sigma/\sqrt{n}}$$
4. Given $\mu = 84$, $\mu_s = 81.5$, $\sigma = 10$, $n = 75$
$$\frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{75}} = 1.1547$$
5. Test
$$Z_{cal} = \frac{84 - 81.5}{1.1547} = 2.165$$
6. Conclusion: Since $Z_{cal}$ (2.165) < $Z_{tab}$ (2.58), $H_0$ is accepted.
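The arithmetic of Solved Problems 1 and 2 can be checked with a few lines of code. This is a sketch under the assumption that the statistic is written as (hypothesised mean minus sample mean) divided by the standard error, as in Table 9.3b; the function name `z_test_mean` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(mu0, sample_mean, sigma, n):
    """Z = (mu0 - sample_mean) / (sigma / sqrt(n)) for a specified mean."""
    return (mu0 - sample_mean) / (sigma / sqrt(n))

# Solved Problem 1: one-tailed test at the 1% level
z1 = z_test_mean(mu0=14500, sample_mean=13000, sigma=2100, n=25)
z_tab_1 = NormalDist().inv_cdf(1 - 0.01)

# Solved Problem 2: two-tailed test at the 1% level
z2 = z_test_mean(mu0=84, sample_mean=81.5, sigma=10, n=75)
z_tab_2 = NormalDist().inv_cdf(1 - 0.01 / 2)

print(f"Problem 1: Z = {z1:.2f}, reject H0: {abs(z1) > z_tab_1}")
print(f"Problem 2: Z = {z2:.3f}, reject H0: {abs(z2) > z_tab_2}")
```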


Solved Problem 3: A ketchup manufacturer is in the process of deciding
whether to produce a new extra spicy brand of ketchup. The company's
market research team found in a survey of 6000 households that 335
households would buy the extra spicy brand. An earlier, more extensive
study carried out 2 years ago showed that 5% of the households would buy
the brand then. At 2% level of significance, should the company conclude
that there is an increased interest in the extra spicy flavour?
Solution: The procedure followed is explained in steps.
1. Null hypothesis $H_0: P = 0.05$ (interest has not changed)
Alternate hypothesis $H_A: P > 0.05$ (interest has increased; one-tailed test)
2. Level of significance 2% $\Rightarrow Z_{tab} = 2.05$
3. Test statistic
$$Z = \frac{P - P_s}{\left(\dfrac{PQ}{n}\right)^{1/2}}$$
4. Given $P = 0.05$, $P_s = 335/6000 = 0.05583$, $n = 6000$, $Q = 1 - P = 0.95$
$$\left(\frac{PQ}{n}\right)^{1/2} = \left(\frac{0.05 \times 0.95}{6000}\right)^{1/2} = 0.0028$$
5. Test
$$Z_{cal} = \frac{0.05 - 0.05583}{0.0028} = -2.08, \quad |Z_{cal}| = 2.08$$
6. Conclusion: Since $|Z_{cal}|$ (2.08) > $Z_{tab}$ (2.05), $H_0$ is rejected.
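The proportion test above can be reproduced as a sketch. The function name `z_test_proportion` is hypothetical; the count of 335 households is the figure that corresponds to the sample proportion 0.05583 used in the solution, and the optional `population_size` argument applies the finite population multiplier of Table 9.3a (test 2). Without rounding the standard error to 0.0028, the statistic comes out at about 2.07 in absolute value rather than 2.08, which does not change the conclusion.

```python
from math import sqrt

def z_test_proportion(p0, successes, n, population_size=None):
    """Z = (p0 - ps) / standard error for a specified proportion."""
    ps = successes / n
    se = sqrt(p0 * (1 - p0) / n)
    if population_size is not None:
        # Finite population multiplier for sampling without replacement
        se *= sqrt((population_size - n) / (population_size - 1))
    return (p0 - ps) / se

z = z_test_proportion(p0=0.05, successes=335, n=6000)
print(f"Z = {z:.3f}; reject H0 at the 2% level: {abs(z) > 2.05}")
```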


Solved Problem 4: Microsoft estimated that, out of 10,000 potential
software buyers, 35% were planning to wait to purchase the new OS, Windows
Vista, until an upgrade had been released. After an advertising campaign to
reassure the public, Microsoft surveyed 3000 buyers and found 950 who were
still sceptical. At 5% level of significance, can the company conclude that
the proportion of sceptical people had decreased? (Hint: use the z
distribution; the null hypothesis is rejected.)
Solution: The procedure followed is explained in steps below.
1. Null hypothesis $H_0: P = 0.35$
Alternate hypothesis $H_A: P < 0.35$ (one-tailed test)
2. Level of significance 5% $\Rightarrow Z_{tab} = 1.645$
3. Test statistic
$$Z = \frac{P - P_s}{\left(\dfrac{PQ}{n}\right)^{1/2}\left(\dfrac{N-n}{N-1}\right)^{1/2}}$$
4. Given $P_s = 950/3000 = 19/60 = 0.317$, $P = 0.35$, $Q = 0.65$, $N = 10000$,
$n = 3000$
$$\left(\frac{0.35 \times 0.65}{3000}\right)^{1/2}\left(\frac{10000 - 3000}{10000 - 1}\right)^{1/2} = 0.0073$$
5. Test
$$Z_{cal} = \frac{0.35 - 0.317}{0.0073} = 4.52$$
6. Conclusion: Since $Z_{cal}$ (4.52) > $Z_{tab}$ (1.645), $H_0$ is rejected.
Therefore, the proportion of sceptical people has significantly decreased.
Solved Problem 5: A machine is designed so as to pack 200 ml of a
medicine with a standard deviation of 5 ml. A sample of 100 bottles, when
measured, had a mean content of 201.3 ml. Test whether the machine is
functioning properly (use 5% level of significance).
Solution: The procedure followed is explained in steps below.
1. Null hypothesis $H_0: \mu = 200$
Alternate hypothesis $H_A: \mu \neq 200$ (two-tailed test)
2. Level of significance 5% implies $Z_{tab} = 1.96$
3. Test statistic
$$Z = \frac{\mu - \mu_s}{\sigma/\sqrt{n}}$$
4. Given $\mu = 200$, $\mu_s = 201.3$, $\sigma = 5$, $n = 100$
$$\frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{100}} = 0.5$$
5. Test
$$Z_{cal} = \frac{200 - 201.3}{0.5} = -2.60, \quad |Z_{cal}| = 2.60$$
6. Conclusion: Since $|Z_{cal}|$ (2.60) > $Z_{tab}$ (1.96), $H_0$ is rejected.
Therefore, the machine is not functioning properly.

9.6 Testing of Hypothesis in Case of Small Samples

9.6.1 Introduction – small samples


So far you have studied the testing of hypothesis when the sample size is
large, using the normal distribution. However, if the sample size is small, then
the distributions of the statistics are far from normal and hence the normal test
cannot be applied. To deal with small samples, tests of significance
known as Exact Sample Tests have been developed. For all practical
purposes the sample is termed small if n ≤ 30.
The fundamental assumptions in all exact sample tests are:
i. The parent population from which the sample is drawn is normally
distributed.
ii. The sample or samples are drawn at random.
iii. The samples are independent of each other.
It should be noted that the methods and theory of small samples are
applicable to large samples, but the reverse is not true.

9.7 ‘t’ Distribution


The 't' distribution was developed by W. S. Gosset, who published under the
pen name 'Student'. Therefore, it is known as Student's 't' distribution. The
properties of the 't' distribution are:
1. The 't' distribution is a continuous probability distribution.
2. The 't' statistic is defined as:
$$t = \frac{\bar{X} - \mu}{S}\sqrt{n}, \quad \text{where } S = \left[\frac{\sum (x - \bar{x})^2}{n - 1}\right]^{1/2}$$
3. The probability density function is given by:
$$f(t) = C\left(1 + \frac{t^2}{\nu}\right)^{-(\nu + 1)/2}$$
where,
C = constant required to make the area under the curve equal to
unity.
$\nu = n - 1$, the degrees of freedom.
4. The value of 't' ranges from $-\infty$ to $+\infty$.
5. $\nu$ is called the parameter of the distribution.
6. It is symmetrical about the mean.
7. Its mean is zero.
8. The variance of the distribution is greater than one (it equals
$\nu/(\nu - 2)$ for $\nu > 2$).
9. It has larger areas at the tails compared to the normal distribution and
lower height at the mean.
10. It tends to a normal distribution as $n \to \infty$.
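Several of the properties listed above can be checked numerically. The sketch below assumes the standard form of the normalising constant, C = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)), which the unit does not spell out; the function names are hypothetical.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(t, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large dof
    c = exp(lgamma((dof + 1) / 2) - lgamma(dof / 2)) / sqrt(dof * pi)
    return c * (1 + t * t / dof) ** (-(dof + 1) / 2)

def normal_pdf(z):
    """Density of the standard normal distribution."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def area_under_t(dof, half_width=60.0, step=0.01):
    """Approximate the area under the t curve by summing thin rectangles."""
    steps = int(2 * half_width / step)
    return sum(t_pdf(-half_width + i * step, dof) for i in range(steps + 1)) * step

print("symmetric:", t_pdf(-1.5, 5) == t_pdf(1.5, 5))
print("lower peak than normal:", t_pdf(0, 5) < normal_pdf(0))
print("heavier tail than normal:", t_pdf(3, 5) > normal_pdf(3))
print("area close to 1:", abs(area_under_t(5) - 1) < 1e-3)
print("near normal for large dof:", abs(t_pdf(0, 1000) - normal_pdf(0)) < 1e-3)
```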
9.7.1 Uses of ‘t’ test
The 't' test is used:
• To test a specified value.
• To test the differences between values (independent samples).
• As a paired 't' test (dependent samples).
• To construct confidence intervals for the estimates.
Table 9.6 displays the description of tests in case of small samples where
the characteristic studied is a variable and the population standard deviation
is not known.


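Table 9.6: Description of test in case of small samples

| Test No. | Description of test | Test statistic | Notes |
|---|---|---|---|
| 1 | Test for specified value – infinite population; D.O.F. $n - 1$ | $t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$, where $S^2 = \dfrac{\sum (x - \bar{X})^2}{n - 1}$ | $\bar{X}$ = sample mean; $\mu$ = population mean |
| 2 | Test for specified value – finite population; D.O.F. $n - 1$ | $t = \dfrac{\bar{X} - \mu}{\dfrac{S}{\sqrt{n}}\left(\dfrac{N - n}{N - 1}\right)^{1/2}}$ | $N$ = population size |
| 3 | Test between values – independent samples; D.O.F. $n_1 + n_2 - 2$ | $t = \dfrac{\bar{X} - \bar{Y}}{S\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)^{1/2}}$, where $S^2 = \dfrac{\sum (x - \bar{X})^2 + \sum (y - \bar{Y})^2}{n_1 + n_2 - 2}$ or $S^2 = \dfrac{n_1S_1^2 + n_2S_2^2}{n_1 + n_2 - 2}$ | $\bar{X}$ = first sample mean; $\bar{Y}$ = second sample mean; $S_1^2$, $S_2^2$ = sample variances computed with divisor $n$ |
| 4 | Paired 't' test (dependent samples); D.O.F. $n - 1$ | $t = \dfrac{\bar{d}}{S_d/\sqrt{n}}$, where $S_d^2 = \dfrac{\sum (d - \bar{d})^2}{n - 1}$ | $\bar{d}$ = mean of the differences; $n$ = sample size (number of pairs) |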

Solved Problem 6: A random sample of 10 bags of fertiliser is found to
have the following weights (kg):
45, 49, 50, 49, 44, 52, 48, 45, 46, 45.
Test at 5% level of significance whether the average packing weight can be
taken as 50 kg.
Solution: Table 9.7 displays the deviations of the weights from an assumed
mean A = 48 for the solved problem 6.

Table 9.7: Frequency table for the solved problem 6

| X | d = X − 48 | d² |
|---|---|---|
| 45 | −3 | 9 |
| 49 | +1 | 1 |
| 50 | +2 | 4 |
| 49 | +1 | 1 |
| 44 | −4 | 16 |
| 52 | +4 | 16 |
| 48 | 0 | 0 |
| 45 | −3 | 9 |
| 46 | −2 | 4 |
| 45 | −3 | 9 |
| Total | −7 | 69 |

$$\bar{X} = A + \frac{\sum d}{n} = 48 + \frac{-7}{10} = 47.3$$
$$S^2 = \frac{1}{n - 1}\left[\sum d^2 - \frac{(\sum d)^2}{n}\right] = \frac{1}{9}\left[69 - \frac{(-7)^2}{10}\right] = 7.12, \quad S = \sqrt{7.12} = 2.67$$

The steps followed are described as below.
1. Null hypothesis $H_0: \mu = 50$
Alternate hypothesis $H_A: \mu \neq 50$ (two-tailed test)
2. Level of significance 5% and D.O.F. 9 $\Rightarrow t_{tab} = 2.262$
3. Test statistic
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
4. Given $\bar{X} = 47.3$, $\mu = 50$, $S = 2.67$, $S/\sqrt{n} = 0.8438$
5. Test
$$t_{cal} = \frac{47.3 - 50.0}{0.8438} = -3.2, \quad |t_{cal}| = 3.2$$
6. Conclusion: Since $|t_{cal}|$ (3.2) > $t_{tab}$ (2.262), $H_0$ is rejected.
Therefore, the mean of the population cannot be considered as 50 kg.


Solved Problem 7: Suppose in the above problem, out of 1000 bags
packed in a day, a random sample of 10 was selected and the readings
were as given in solved problem 6. Test whether the population average
weight is 50 kg.
Solution: The steps followed are described as below.
1. Null hypothesis $H_0: \mu = 50$
Alternate hypothesis $H_A: \mu \neq 50$ (two-tailed test)
2. Level of significance 5% and D.O.F. 9 $\Rightarrow t_{tab} = 2.262$
3. Test statistic
$$t = \frac{\bar{X} - \mu}{\dfrac{S}{\sqrt{n}}\left(\dfrac{N - n}{N - 1}\right)^{1/2}}$$
4. Given $n = 10$, $N = 1000$, $\bar{X} = 47.3$, $S/\sqrt{n} = 0.8438$
$$\left(\frac{N - n}{N - 1}\right)^{1/2}\frac{S}{\sqrt{n}} = \left(\frac{990}{999}\right)^{1/2} \times 0.8438 = 0.9955 \times 0.8438 = 0.8400$$
5. Test
$$t_{cal} = \frac{47.3 - 50.0}{0.8400} = -3.21, \quad |t_{cal}| = 3.21$$
6. Conclusion: Since $|t_{cal}|$ (3.21) > $t_{tab}$ (2.262), $H_0$ is rejected.
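Solved Problems 6 and 7 can be recomputed directly from the ten weights. The sketch below is illustrative: the function name `t_one_sample` is hypothetical, the tabulated value 2.262 is taken from the text, and `statistics.stdev` uses the divisor n − 1 as in Table 9.6. With the unrounded multiplier $\sqrt{(N-n)/(N-1)}$, the finite population statistic comes out at about 3.21 in absolute value.

```python
from math import sqrt
from statistics import mean, stdev

weights = [45, 49, 50, 49, 44, 52, 48, 45, 46, 45]
T_TAB = 2.262  # two-tailed 5% value for 9 degrees of freedom, from the text

def t_one_sample(sample, mu0, population_size=None):
    """t = (sample mean - mu0) / standard error, with optional finite population multiplier."""
    n = len(sample)
    se = stdev(sample) / sqrt(n)  # stdev divides by n - 1
    if population_size is not None:
        se *= sqrt((population_size - n) / (population_size - 1))
    return (mean(sample) - mu0) / se

t6 = t_one_sample(weights, 50)                        # Solved Problem 6
t7 = t_one_sample(weights, 50, population_size=1000)  # Solved Problem 7

print(f"Problem 6: t = {t6:.2f}, reject H0: {abs(t6) > T_TAB}")
print(f"Problem 7: t = {t7:.2f}, reject H0: {abs(t7) > T_TAB}")
```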


Solved Problem 8: Average tensile strength of nine samples of paper is
found to be 15.8 units and variance is 10.3. Can we say at 1% level of
significance that it is a random sample drawn from a population whose
mean tensile strength is 17.5?
Solution: The steps followed are described as below.
1. Null hypothesis $H_0: \mu = 17.5$
Alternate hypothesis $H_A: \mu \neq 17.5$ (two-tailed test)
2. Level of significance 1% and D.O.F. 8 $\Rightarrow t_{tab} = 3.36$
3. Test statistic (the sample variance $S^2$ is computed with divisor $n$)
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n - 1}}$$
4. Given $\bar{X} = 15.8$, $\mu = 17.5$, $S^2 = 10.3$, $n = 9$
$$\frac{S}{\sqrt{n - 1}} = \sqrt{\frac{10.3}{8}} = 1.135$$
5. Test
$$t_{cal} = \frac{15.8 - 17.5}{1.135} = -1.498, \quad |t_{cal}| = 1.498$$
6. Conclusion: Since $|t_{cal}|$ (1.498) < $t_{tab}$ (3.36), $H_0$ is accepted.
Therefore, it can be considered as a random sample from a population whose
mean tensile strength is 17.5.
Solved Problem 9: Treatment 'A' gave brightness index for a substance on
5 randomly selected samples as 60, 41, 48, 39, 42. Treatment 'B' gave the
same index on another 7 randomly selected samples as 56, 42, 38, 69, 68,
64, 62. At 5% level of significance can we conclude that treatment 'B'
increases the brightness?
Solution: The steps followed are described as below.
1. Null hypothesis $H_0: \mu_1 = \mu_2$
2. Alternate hypothesis $H_A: \mu_1 < \mu_2$ (one-tailed test)
3. Level of significance 5% and D.O.F. 5 + 7 − 2 = 10 $\Rightarrow t_{tab} = 1.812$
4. Test statistic
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)^{1/2}}$$
5. Given that: Tables 9.8 and 9.9 show the frequency table data for
treatment 'A' and treatment 'B' respectively, with deviations taken from the
sample means $\bar{X}_1 = 230/5 = 46$ and $\bar{X}_2 = 399/7 = 57$.

Table 9.8: Frequency table for treatment 'A'

| X | d = X − 46 | d² |
|---|---|---|
| 60 | +14 | 196 |
| 41 | −5 | 25 |
| 48 | +2 | 4 |
| 39 | −7 | 49 |
| 42 | −4 | 16 |
| Total 230 | 0 | 290 |

Table 9.9: Frequency table for treatment 'B'

| X | d = X − 57 | d² |
|---|---|---|
| 56 | −1 | 1 |
| 42 | −15 | 225 |
| 38 | −19 | 361 |
| 69 | +12 | 144 |
| 68 | +11 | 121 |
| 64 | +7 | 49 |
| 62 | +5 | 25 |
| Total 399 | 0 | 926 |

$$S^2 = \frac{1}{n_1 + n_2 - 2}\left[\sum (X_1 - \bar{X}_1)^2 + \sum (X_2 - \bar{X}_2)^2\right] = \frac{1}{10}(290 + 926) = 121.6$$
$$S\left(\frac{1}{5} + \frac{1}{7}\right)^{1/2} = \sqrt{121.6 \times 0.3429} = 6.457$$
6. Test
$$t_{cal} = \frac{46 - 57}{6.457} = -1.70, \quad |t_{cal}| = 1.70$$
7. Conclusion: Since $|t_{cal}|$ (1.70) < $t_{tab}$ (1.812), $H_0$ is accepted.
Treatment 'B' is not superior to treatment 'A'.
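The pooled-variance calculation in Solved Problem 9 can be sketched in code. The data below are the values consistent with the column totals in Tables 9.8 and 9.9 (sums of 230 and 399, squared deviations of 290 and 926); the function name `pooled_t` is hypothetical.

```python
from math import sqrt
from statistics import mean

treatment_a = [60, 41, 48, 39, 42]
treatment_b = [56, 42, 38, 69, 68, 64, 62]

def pooled_t(x, y):
    """Return (t, degrees of freedom) for two independent small samples."""
    n1, n2 = len(x), len(y)
    ss_x = sum((v - mean(x)) ** 2 for v in x)  # sum of squared deviations, sample 1
    ss_y = sum((v - mean(y)) ** 2 for v in y)  # sum of squared deviations, sample 2
    pooled_variance = (ss_x + ss_y) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

t_cal, dof = pooled_t(treatment_a, treatment_b)
print(f"t = {t_cal:.2f} with {dof} degrees of freedom")
```

The absolute value of about 1.70 is below both the one-tailed and the two-tailed 5% tabulated values for 10 degrees of freedom, so the null hypothesis is accepted either way.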

Solved Problem 10: A sales manager wants to know whether a special
promotional campaign is a success. He had the data as shown in table
9.10a. Test at 5% level of significance whether it is a success.

Table 9.10a: Sales data before and after the campaign

| Retail outlet | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Sales before campaign | 50 | 48 | 31 | 42 | 28 | 53 |
| Sales after campaign | 56 | 55 | 30 | 45 | 29 | 58 |

Solution: Table 9.10b shows the frequency table calculated for the
sales data before and after the campaign.

Table 9.10b: Frequency table for the sales data before and after campaign

| Before campaign | After campaign | d = After − Before | d² |
|---|---|---|---|
| 50 | 56 | 6 | 36 |
| 48 | 55 | 7 | 49 |
| 31 | 30 | −1 | 1 |
| 42 | 45 | 3 | 9 |
| 28 | 29 | 1 | 1 |
| 53 | 58 | 5 | 25 |
| Total | | 21 | 121 |

$$\bar{d} = \frac{\sum d}{n} = \frac{21}{6} = 3.5$$
$$S_d^2 = \frac{1}{n - 1}\left[\sum d^2 - \frac{(\sum d)^2}{n}\right] = \frac{1}{5}\left[121 - \frac{441}{6}\right] = 9.5$$
$$\frac{S_d}{\sqrt{n}} = \sqrt{\frac{9.5}{6}} = \sqrt{1.5833} = 1.2583$$

The steps followed are described as below.
1. Null hypothesis $H_0: \mu_d = 0$
Alternate hypothesis $H_1: \mu_d > 0$ (one-tailed test)
2. Level of significance 5% and D.O.F. 5 $\Rightarrow t_{tab} = 2.02$
3. Test statistic
$$t = \frac{\bar{d}}{S_d/\sqrt{n}}$$
4. Test
$$t_{cal} = \frac{3.5}{1.2583} = 2.782$$
5. Conclusion: Since $t_{cal}$ (2.78) > $t_{tab}$ (2.02), $H_0$ is rejected.
The campaign has been a significant success.
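The paired test in Solved Problem 10 works on the differences between the two readings for each outlet, which a short sketch makes explicit. The variable names are illustrative, and the tabulated value 2.02 is taken from the text.

```python
from math import sqrt
from statistics import mean, stdev

before = [50, 48, 31, 42, 28, 53]
after = [56, 55, 30, 45, 29, 58]
T_TAB = 2.02  # one-tailed 5% value for 5 degrees of freedom, from the text

# The paired t test reduces to a one-sample t test on the differences.
differences = [a - b for a, b in zip(after, before)]
standard_error = stdev(differences) / sqrt(len(differences))  # stdev divides by n - 1
t_cal = mean(differences) / standard_error

print(f"mean difference = {mean(differences):.1f}, t = {t_cal:.2f}")
print("reject H0:", t_cal > T_TAB)
```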

Self Assessment Questions


3. Fill in the blanks.
i) The 't' distribution is a __________ probability distribution.
ii) The parameter of the 't' distribution is __________.
iii) The 't' distribution has ___________ areas at the tails than the normal
distribution.
iv) The mean and variance of the 't' distribution are ________ and
________.


9.8 Summary

In this unit, we have defined what is meant by a hypothesis and studied the
procedure for testing of hypothesis. We have defined what is meant by
significance level and types of errors. We have also seen the different types of
tests, two-tailed and one-tailed. You have also studied under what
circumstances these tests are done and the steps involved in
identifying the test.
We discussed the four tests available for small samples. These tests can be
used for small samples (n ≤ 30) whose population standard
deviations are not known. The different tests are illustrated with solved
problems.

9.9 Terminal Questions

1. Twenty households out of 1000 were using Brand „A‟ toothpaste. The
company increased the price of the brand. In a survey, they found that
only 12 households out of 1000 are using it now. Can we conclude at
5% level of significance that proportion of users has decreased?
2. A drill drills holes with standard deviation of depth 0.03cms. It is adjusted
to drill holes of depth 5.5cm. For 50 holes drilled, the mean depth is
5.503cm. Test at 5% level of significance whether the adjustment is
correct.
3. Out of 80 batteries produced by a process I, three were found to be
defective. Another sample of 130 produced by process II, two were
found to be defective. Test whether the proportion of defectives in two
processes differs, using 1% level of significance.
4. The table 9.11 displays the data related to mean weight of a product.
Test whether there is a significant difference in means of the plants.

Table 9.11: Mean weight of a product


Plant A Plant B
Size 300 200
Mean 75.4 74.3
Variance 65.6 57.8

Sikkim Manipal University Page No. 244


Statistics for Management Unit 9

5. A machine is set to produce a particular characteristic with mean 21.3
and S.D. 0.4. A random sample of 625 observations has a mean of 21.33.
Test whether the sample mean differs significantly from the population mean.
6. Out of 10,000 pumpkins harvested, 1000 were randomly selected and 8% were
found to be rotten. The grower claims that only 7% are rotten. Is his
claim tenable? Test at the 5% level of significance.
7. A group of seven-week-old chickens reared on a high protein diet
weigh 12, 15, 11, 16, 14, 14 and 16 ounces. Another group of 5 chickens
received a low protein diet and weigh 8, 10, 14, 10 and 13 ounces. Test whether
there is a significant increase in weight due to the high protein diet. Use the
5% level of significance.
8. The strength test results of two yarns are displayed in table 9.12. Is
there a significant difference between the means? Test at the 5% level of
significance.
Table 9.12: Strength results of the two yarns

          Sample Size    Mean    Sample Variance
Type A    4              52      42
Type B    9              42      56

9. The table 9.13 displays the results related to the memory capacity of 10
students before and after training. Test at the 5% level of significance
whether the training is effective.

Table 9.13: Memory capacity of 10 students

Roll No            1    2    3    4    5    6    7    8    9    10
Before Training    1    14   11   8    7    10   3    0    5    6
After Training     1    16   10   7    5    12   10   2    3    8


9.10 Answers to SAQs and TQs

Answers to Self Assessment Questions


1. i. Normal distribution
ii. Normal distribution
iii. ‘t’ distribution DOF
iv. Normal distribution
v. Normal distribution
2. i. False
ii. False
iii. True
iv. True
v. True
vi. True
vii. True
3. i. Continuous
ii. Degrees of freedom
iii. Larger
iv. Zero, greater than one

Answers to Terminal Questions


1. Zcal = 1.9457, Ho accepted
2. Zcal = 0.71, Ho accepted
3. Zcal = 0.50, Ho accepted
4. Zcal = 1.54, Ho accepted
5. Zcal = 18.75, Ho rejected
6. Zcal = 1.30, Ho accepted
7. tcal = 2.397, Ho is rejected
8. tcal = 2.21, Ho is rejected
9. tcal = 1.365, Ho is rejected

9.11 References

• Richard I. Levin, David S. Rubin (2008), Statistics for Management,
Seventh Edition, PHI Learning Private Limited.



Statistics for Management Unit 10

Unit 10 Chi-Square

Structure:
10.1 Introduction
Learning objectives
10.2 Chi-Square as a Test of Independence
Characteristics of χ² test
Degrees of freedom
Restrictions in applying χ² test
Practical applications of χ² test
Levels of significance
Steps in solving problems related to Chi-Square test
Interpretation of Chi-Square values
10.3 Chi-Square Distribution
Properties of χ² distribution
Conditions for applying the Chi-Square test
Uses of χ² test
10.4 Applications of Chi-Square test
Tests for independence of attributes
Test of goodness of fit
Test for specified variance
10.5 Summary
10.6 Terminal Questions
10.7 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
10.8 References

10.1 Introduction
In Unit 9, ‘Testing of Hypothesis for Large and Small Samples’, we
discussed how to test hypotheses using data from either one or two
samples. We used one-sample tests to determine whether a mean or a
proportion was significantly different from a hypothesised value. In the
two-sample tests, we examined the difference between either two means or two
proportions, and we tried to learn whether this difference was significant.


Suppose we have proportions from five populations instead of only two.
In such cases, the methods described for comparing two proportions
do not apply; we must use the Chi-Square test (χ² test). In this unit,
‘Chi-Square’, we discuss the Chi-Square tests, which enable us to test whether
more than two population proportions can be considered equal. The Chi-Square
test is a non-parametric test which can be applied to categorical or
qualitative data. It can be applied when we make few or no assumptions about
the population.
Chi-Square tests also allow us to do much more than test for the
equality of several proportions. If we classify a population into several
categories with respect to two attributes (such as age and job performance),
we can use a Chi-Square test to determine whether the two attributes
are independent of each other. So, Chi-Square tests can be applied to
contingency tables.
Learning objectives
By the end of this unit, you should be able to:
• Describe the non-parametric method of testing hypotheses
• Describe the characteristics of Chi-Square
• Recognise the applications of the Chi-Square test
• Describe the steps in solving problems related to the Chi-Square test
• Identify the conditions required for applying the Chi-Square test for a given
population distribution

10.2 Chi-Square as a Test of Independence


10.2.1 Characteristics of Chi-Square test
The following are the characteristics of the Chi-Square test (χ² test).
• The χ² test is based on frequencies and not on parameters
• It is a non-parametric test: no rigid assumptions regarding the parameters
of the population or populations are required
• The additive property also holds for the χ² test
• The χ² test is useful for testing hypotheses about the independence of
attributes
• The χ² test can be used in complex contingency tables
• The χ² test is very widely used for research purposes in behavioural and
social sciences, including business research
• It is defined as:
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

where, ‘O’ is the observed frequency and ‘E’ is the expected frequency.

Key Statistic
The observed frequencies are the frequencies obtained from the
observation, which are sample frequencies.
The expected frequencies are the calculated frequencies.
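The definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `chi_square_statistic` and the frequencies in the example are assumptions made here, not part of the unit.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all classes or cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical illustration: 100 observations in four classes,
# each class expected to hold 25 observations.
observed = [30, 20, 28, 22]
expected = [25, 25, 25, 25]
print(round(chi_square_statistic(observed, expected), 2))
```

The same function can be reused for any of the tests in this unit once the expected frequencies have been worked out.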

10.2.2 Degrees of freedom


The number of degrees of freedom for ‘n’ observations is ‘n − k’ and is usually
denoted by ‘ν’, where ‘k’ is the number of independent linear constraints
imposed upon them.

Example 1
Suppose we are asked to write any four numbers. All four numbers are then
of our own choice. If a restriction is imposed that the sum of these numbers
should be 50, we can choose only three numbers freely, because the fourth
is fixed by the sum. The degrees of freedom would now be 3.

If χ² is defined as the sum of the squares of ‘n’ independent standardised
normal variates and ‘k’ linear constraints are imposed upon them (such as the
estimation of some population parameter from the sample), then the degrees of
freedom are reduced from ‘n’ to ‘n − k’. If the deviations are taken from the
sample mean instead of the population mean, then ‘n’ is replaced by
n − 1 = ν. This is because one linear constraint has been imposed.


Key Statistic
The Chi-Square distribution has only one parameter, that is, degrees of
freedom.

10.2.3 Restrictions in applying χ² test


The sample observations should be independently and normally distributed.
For this, either the parent population should be large (for example,
greater than 50) or sampling should be done with replacement.
Constraints imposed upon the observations must be linear in character, for
example,
$$ \sum O_i = \sum E_i $$
The χ² distribution is essentially a continuous distribution, but its character of
continuity is maintained only when the individual frequencies of the variate
values are greater than or equal to 5. So, in applying the χ² test to
test goodness of fit or the dependency of variables in a
contingency table, no cell frequency should be less than 5. In practical
problems, we can combine adjacent classes with small frequencies into one so
that the pooled frequency is at least 5.

Key Statistic
The results of Chi-Square test cannot be accurate if the cell frequencies
in a contingency table are less than 5.
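The pooling of small frequencies described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `pool_small_classes`, the rule of merging neighbouring classes from left to right, and the example frequencies are all chosen here for illustration and are not prescribed by the unit.

```python
def pool_small_classes(observed, expected, minimum=5):
    """Merge adjacent classes until every pooled expected frequency >= minimum."""
    pooled_obs, pooled_exp = [], []
    run_obs = run_exp = 0
    for o, e in zip(observed, expected):
        run_obs += o
        run_exp += e
        if run_exp >= minimum:
            pooled_obs.append(run_obs)
            pooled_exp.append(run_exp)
            run_obs = run_exp = 0
    # A leftover tail that is still too small is merged into the last class.
    if run_exp > 0:
        if pooled_exp:
            pooled_obs[-1] += run_obs
            pooled_exp[-1] += run_exp
        else:
            pooled_obs.append(run_obs)
            pooled_exp.append(run_exp)
    return pooled_obs, pooled_exp


# Hypothetical frequencies: the first two and the last two classes are too small.
obs, exp = pool_small_classes([2, 4, 10, 12, 6, 1], [1.5, 4.5, 11, 12, 4, 2])
print(obs, exp)
```

After pooling, χ² is calculated on the pooled classes in the usual way.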

10.2.4 Practical applications of χ² test


In inferential statistics, the Chi-Square test can also be applied to
discrete distributions. In using the Chi-Square test, we need no assumptions
regarding the shape of the sampling distribution. The applications of the
Chi-Square test include testing:
i) the significance of sample variances
ii) the goodness of fit of a theoretical distribution
iii) the independence of attributes in a contingency table
iv) whether the observed results are consistent with the expected
segregations in breeding experiments in genetics


10.2.5 Levels of significance


Tables have been prepared for the values of ‘P’, the probability of getting a
value of χ² ≥ χ₀², where χ₀² is an observed value. From these tables, we can
find the value of ‘P’ corresponding to an observed value of χ² and then
test whether the difference between observed and theoretical
frequencies is significant or not. The smaller the value of ‘P’, the greater the
divergence between fact and theory, so small values lead us to suspect
the hypothesis. A value of ‘P’ very near to unity may also lead to a similar
result: if P = 1, then χ² = 0, showing that there is perfect agreement
between fact and theory, which is a very improbable event. There are two
conventional levels of significance. They are:
i) If P < 0.05, we say that the observed value of χ² is significant at the 5%
level of significance
ii) Similarly, if P < 0.01, the value is significant at the 1% level
The formula for calculating χ² is:
$$ \chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} $$

where, ‘f0’ is the observed frequency and ‘fe’ is the expected frequency.
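Instead of reading ‘P’ from printed tables, it can be computed directly. The sketch below is not taken from the unit: it assumes the standard closed-form series for the upper tail of the Chi-Square distribution with a whole number of degrees of freedom, and the function name `chi_square_upper_tail` and the observed value 12.0 are hypothetical.

```python
import math


def chi_square_upper_tail(x, dof):
    """P(chi-square with `dof` degrees of freedom >= x), for integer dof >= 1."""
    if dof < 1 or int(dof) != dof:
        raise ValueError("dof must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum of (x/2)^i / i! for i = 0 .. dof/2 - 1
        term, total = 1.0, 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd dof: two-sided normal tail at sqrt(x), plus
    # 2 * phi(sqrt(x)) * sum of x^(r - 1/2) / (1*3*...*(2r - 1)) for r = 1 .. (dof-1)/2
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return tail + 2.0 * density * total


# Hypothetical observed value of chi-square with 4 degrees of freedom.
p = chi_square_upper_tail(12.0, 4)
print("significant at 5%:", p < 0.05)
print("significant at 1%:", p < 0.01)
```

As a check on the sketch, the tabulated 5% values used later in this unit (3.841 for 1, 7.81 for 3 and 9.49 for 4 degrees of freedom) all give a ‘P’ of about 0.05.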


10.2.6 Steps in solving problems related to Chi-Square test
The figure 10.1 displays the steps required for solving the problems related
to Chi-Square test.


Fig. 10.1: Procedural steps in solving problems on Chi-Square test

10.2.7 Interpretation of Chi-Square values


After the χ² value has been calculated, it is compared with the tabulated
value. The χ² table comprises columns headed 0.05 for the 5% level of
significance, 0.01 for the 1% level of significance and so on. The left-hand
column indicates the degrees of freedom. If the calculated value of χ² falls in
the acceptance region, that is, it is less than the tabulated value, the null
hypothesis ‘Ho’ is accepted; otherwise it is rejected. The figure 10.2 displays
the acceptance and rejection regions of the Chi-Square distribution.


Fig. 10.2: Acceptance and rejection regions of Chi-Square distribution

Key Statistic
The Chi-Square curve will be on the positive side of X-axis because the
Chi-Square values are always positive.

10.3 Chi-Square Distribution


The square of a standard normal variate is called a Chi-Square variate with
1 degree of freedom (ν = 1), that is, if the variable ‘X’ is normally distributed
with mean ‘μ’ and standard deviation ‘σ’, then ((X − μ)/σ)² is a χ² variate
with ‘ν’ equal to 1.

Key Statistic
If X1, X2, ..., Xn are ‘n’ independent random variables following the
normal distribution with mean ‘μ’ and standard deviation ‘σ’,
then the χ² variate is given by:
$$ \chi^2 = \frac{(x_1 - \mu)^2}{\sigma^2} + \frac{(x_2 - \mu)^2}{\sigma^2} + \dots + \frac{(x_n - \mu)^2}{\sigma^2} $$
It is the sum of the squares of ‘n’ independent standard normal variates and
follows the χ² distribution with ‘n’ degrees of freedom.


10.3.1 Properties of χ² distribution


The following are some of the properties of the χ² distribution.
i) Mean of the χ² distribution = degrees of freedom = ν
ii) Standard deviation of the χ² distribution = √(2ν)
iii) The median of the χ² distribution divides the area under the curve into
two equal parts, each part being 0.5.
iv) The mode of the χ² distribution is equal to the degrees of freedom less 2,
that is, the mode is equal to ‘ν − 2’ (for ν ≥ 2).
v) The χ² distribution is always positively skewed.
vi) Tabulated χ² values increase as ‘ν’ increases; there is a different χ²
distribution for every number of degrees of freedom.
vii) The lowest value of χ² is zero and the highest is infinity (∞), that is,
0 ≤ χ² < ∞.
viii) When two Chi-Squares χ₁² and χ₂² are independent and follow χ²
distributions with ‘n1’ and ‘n2’ degrees of freedom, their sum χ₁² + χ₂² will
follow the χ² distribution with ‘n1 + n2’ degrees of freedom.
ix) When ν > 30, √(2χ²) – √(2ν − 1) approximately follows the standard normal
distribution.
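Property ix) can be turned into a small helper for cases where the degrees of freedom exceed the range of a printed table. The sketch below is only an illustration; the function name `normal_approx_z` and the values 50 and 40 are hypothetical.

```python
import math


def normal_approx_z(chi_sq, dof):
    """Return sqrt(2 * chi_sq) - sqrt(2 * dof - 1), intended for dof > 30."""
    if dof <= 30:
        raise ValueError("the approximation is stated for dof > 30")
    return math.sqrt(2.0 * chi_sq) - math.sqrt(2.0 * dof - 1.0)


# Hypothetical case: chi-square of 50 with 40 degrees of freedom.
z = normal_approx_z(50, 40)
print(round(z, 3))
```

The resulting ‘z’ is then compared with the usual standard normal values, such as 1.645 for a one-tailed test at the 5% level.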
10.3.2 Conditions for applying the Chi-Square test
The following are the conditions for using the Chi-Square test.
1. The frequencies used in Chi-Square test must be absolute and not in
relative terms.
2. The total number of observations collected for this test must be large.
3. Each of the observations which make up the sample of this test must be
independent of each other.
4. As the χ² test is based wholly on sample data, no assumption is made
concerning the population distribution. In other words, it is a
non-parametric test.
5. The χ² distribution is wholly dependent on degrees of freedom. As the
degrees of freedom increase, the Chi-Square distribution curve becomes more
symmetrical.


6. The expected frequency of any item or cell must not be less than 5. If it
is, the frequencies of adjacent items or cells should be pooled together so
that the pooled frequency is 5 or more.
7. The data should be expressed in original units for convenience of
comparison and the given distribution should not be replaced by relative
frequencies or proportions.
8. This test is used only for drawing inferences through test of the
hypothesis, so it cannot be used for estimation of parameter value.
10.3.3 Uses of χ² test
The χ² test is used broadly to:
• Test goodness of fit for one-way classification or for one variable only
• Test independence or interaction for more than one row or column in the
form of a contingency table concerning several attributes
• Test population variance ‘σ²’ through confidence intervals suggested by the
χ² test

10.4 Applications of χ² test


10.4.1 Tests for independence of attributes
In the test for independence, the null hypothesis is that the row and column
variables are independent of each other. You have studied earlier that the
hypothesis testing is done under the assumption that the null hypothesis is
true.
The following are properties of the test for independence.
• The data are the observed frequencies.
• The data are arranged in the form of a contingency table.
• The degrees of freedom ‘ν’ can be calculated as:
$$ \nu = (\text{Number of rows} - 1)(\text{Number of columns} - 1) $$
• The test for independence has a Chi-Square distribution and is always
a right-tailed test.
• The expected value is computed by taking the row total, multiplying it
by the column total and dividing by the grand total. That is:
$$ E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} $$
• The value of the test statistic does not change if the order of the rows or
columns is changed. The value also does not change if the
rows and columns are interchanged.
Solved Problem 1: Calculate the degrees of freedom for a contingency
table with three rows and two columns.
Solution: The degrees of freedom, denoted by ‘ν’, are calculated as:
$$ \nu = (\text{Number of rows} - 1)(\text{Number of columns} - 1) = (3 - 1)(2 - 1) = 2 $$

Hence, a contingency table with three rows and two columns has two
degrees of freedom.
Solved Problem 2: The table 10.1a gives the production in three shifts and
the number of defective goods that turned out in three weeks. Test at 5%
level of significance whether weeks and shifts are independent.

Table 10.1a. Production of defective goods in three shifts

Shift    Week 1    Week 2    Week 3    Total
I        15        5         20        40
II       20        10        20        50
III      25        15        20        60
Total    60        30        60        150


Solution: The table 10.1b displays the observed and expected values
required to calculate χ².
Table 10.1b. Observed and expected values for data of solved problem 2

Observed Value (O)    Expected Value (E)     (O – E)²    (O – E)²/E
15                    40 x 60/150 = 16       1           0.0625
20                    50 x 60/150 = 20       0           0.0000
25                    60 x 60/150 = 24       1           0.0417
5                     40 x 30/150 = 8        9           1.1250
10                    50 x 30/150 = 10       0           0.0000
15                    60 x 30/150 = 12       9           0.7500
20                    40 x 60/150 = 16       16          1.0000
20                    50 x 60/150 = 20       0           0.0000
20                    60 x 60/150 = 24       16          0.6667
χ²cal                                                    3.6459

The steps followed to calculate χ² are described below.

1. Null hypothesis ‘Ho’: The weeks and shifts are independent
Alternate hypothesis ‘HA’: The weeks and shifts are dependent
2. Level of significance is 5% and D.O.F = (3 – 1)(3 – 1) = 4, so χ²tab = 9.49
3. Test statistic
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
4. Test χ²cal = 3.6459
5. Conclusion: Since χ²cal (3.6459) < χ²tab (9.49), ‘Ho’ is accepted. Hence,
the attributes ‘week’ and ‘shift’ are independent.
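The calculation in Solved Problem 2 can be reproduced with a short Python sketch. The function names `expected_frequencies` and `chi_square_independence` are assumptions made for this illustration; the data are those of table 10.1a.

```python
def expected_frequencies(table):
    """Expected cell values: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return (chi-square, degrees of freedom) for a contingency table."""
    expected = expected_frequencies(table)
    chi_sq = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(table, expected)
                 for o, e in zip(obs_row, exp_row))
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, dof


# Rows are shifts I, II and III; columns are weeks 1, 2 and 3 (table 10.1a).
defectives = [[15, 5, 20], [20, 10, 20], [25, 15, 20]]
chi_sq, dof = chi_square_independence(defectives)
print(round(chi_sq, 4), dof)  # 3.6458 4
```

The small difference from 3.6459 arises because the table adds up terms that were already rounded to four decimals.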
Solved Problem 3: Out of 1000 people surveyed, 600 belonged to urban
areas and rest to rural areas. Among 500 who visited other states, 400
belonged to urban areas. Test at 5% level of significance whether area and
visiting other states are dependent.


Solution: The table 10.2a displays the information given in solved problem
3 in a tabulated form.
Table 10.2a. Data related to solved problem 3
Other States Urban Rural Total
Visited 400 100 500
Not Visited 200 300 500
Total 600 400 1000

The table 10.2b displays the observed and expected values for the
calculation of χ².
Table 10.2b. Observed and expected values for data of solved problem 3

Observed Value (O)    Expected Value (E)    (O – E)²    (O – E)²/E
400                   300                   10000       33.33
200                   300                   10000       33.33
100                   200                   10000       50.00
300                   200                   10000       50.00
χ²cal                                                   166.66

The steps followed for calculation of Chi-Square are described below.

1. Null hypothesis ‘Ho’: Area and visit are independent.
Alternate hypothesis ‘HA’: They are dependent.
2. Level of significance is 5% and D.O.F = (2 – 1)(2 – 1) = 1, so χ²tab = 3.841
3. Test statistic
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
4. Test χ²cal = 166.66
5. Conclusion: Since χ²cal (166.66) > χ²tab (3.841), ‘Ho’ is rejected. Hence,
‘area’ and ‘visit’ are dependent.
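For a 2 x 2 table, the same value can be cross-checked with the well-known shortcut formula N(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]. This shortcut is not used in the unit; it is offered here only as an independent check, and the function name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Visited: 400 urban, 100 rural; not visited: 200 urban, 300 rural.
print(round(chi_square_2x2(400, 100, 200, 300), 2))  # 166.67
```

The unrounded value is 166.67; the figure 166.66 in the table comes from adding terms rounded to two decimals.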
10.4.2 Test of goodness of fit
The test of goodness of fit of a statistical model measures how well
the model fits a set of observations. This test measures and summarises the
differences, if any, between the observed values and the values expected under
the considered statistical model. These test results are helpful to know whether
the samples are drawn from identical distributions or not. For ‘n’ categories,
the degrees of freedom are ‘n − 1’. When the hypothesised distribution is
uniform, the expected value of each category is equal to the average of the
observed values.
Solved Problem 4: A personnel manager is interested in determining
whether absenteeism is greater on one day of the week than on another.
He has the records for the past years. Test whether
absenteeism is uniformly distributed over the week.
Table 10.3a. Comparison of data about absenteeism

Day of week            Monday    Tuesday    Wednesday    Thursday    Friday
Number of absentees    66        57         54           48          75
Solution: If absenteeism is uniformly distributed over the week, then the
expected number of absentees per day is given by:
$$ E = \frac{66 + 57 + 54 + 48 + 75}{5} = \frac{300}{5} = 60 $$
The table 10.3b represents the expected values required for the
calculation of χ² for the data related to solved problem 4.
Table 10.3b. Observed and expected values for calculation of χ² for solved
problem 4

Observed Value (O)    Expected Value (E)    (O – E)²    (O – E)²/E
66                    60                    36          0.6000
57                    60                    9           0.1500
54                    60                    36          0.6000
48                    60                    144         2.4000
75                    60                    225         3.7500
χ²cal                                                   7.5000

The steps followed for calculation of Chi-Square are described below.

1. Null hypothesis ‘Ho’: Absenteeism is uniformly distributed over the week
Alternate hypothesis ‘HA’: Absenteeism is not uniformly distributed over the week
2. Level of significance is 5% and D.O.F = (5 – 1) = 4, so χ²tab = 9.49
3. Test statistic
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
4. Test χ²cal = 7.50
5. Conclusion: Since χ²cal (7.5) < χ²tab (9.49), ‘Ho’ is accepted. Hence,
absenteeism does not depend on the day of the week.
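The goodness of fit calculation for a uniform distribution can be sketched as below. The function name `uniform_goodness_of_fit` is an assumption made for this illustration; the data are those of table 10.3a and 9.49 is the tabulated 5% value for 4 degrees of freedom used in the solution.

```python
def uniform_goodness_of_fit(observed):
    """Return (chi-square, degrees of freedom) against equal expected frequencies."""
    expected = sum(observed) / len(observed)
    chi_sq = sum((o - expected) ** 2 / expected for o in observed)
    return chi_sq, len(observed) - 1


absentees = [66, 57, 54, 48, 75]  # Monday to Friday
chi_sq, dof = uniform_goodness_of_fit(absentees)
print(round(chi_sq, 2), dof)  # 7.5 4
print("calculated value below tabulated 9.49:", chi_sq < 9.49)
```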
Solved Problem 5: According to a theory in genetics, the proportions of beans
of types A, B, C and D in a generation should be 9:3:3:1. In an experiment
with 1600 beans, the frequencies of beans of types A, B, C and D were observed
to be 882, 313, 287 and 118 respectively. Does the result support the
theory?
Solution: The steps followed for calculation of Chi-Square are described
below.
1. Null hypothesis ‘Ho’: The result supports the theory
Alternate hypothesis ‘HA’: The result does not support the theory
2. Level of significance is 5% and D.O.F = (4 – 1) = 3, so χ²tab = 7.81
3. Test statistic
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
4. By the null hypothesis, E = Total number x Corresponding ratio.
The table 10.4 displays the observed and expected values for calculation of
χ² for solved problem 5.


Table 10.4: Observed and expected values for calculation of χ² for solved
problem 5

Observed Value (O)    Expected Value (E)      (O – E)²    (O – E)²/E
882                   1600 x 9/16 = 900       324         0.36
313                   1600 x 3/16 = 300       169         0.56
287                   1600 x 3/16 = 300       169         0.56
118                   1600 x 1/16 = 100       324         3.24
χ²cal                                                     4.72

5. Test χ²cal = 4.72
6. Conclusion: Since χ²cal (4.72) < χ²tab (7.81), ‘Ho’ is accepted. Therefore,
the result supports the theory.
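When the theory specifies a ratio, the expected frequencies follow from the total and the ratio. The sketch below illustrates this for Solved Problem 5; the function name `expected_from_ratio` is hypothetical.

```python
def expected_from_ratio(total, ratio):
    """Split `total` into expected frequencies in the given ratio."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]


observed = [882, 313, 287, 118]  # bean types A, B, C and D
expected = expected_from_ratio(sum(observed), [9, 3, 3, 1])
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(expected)          # [900.0, 300.0, 300.0, 100.0]
print(round(chi_sq, 2))  # 4.73
```

The unrounded value is 4.7267; the figure 4.72 in table 10.4 comes from adding terms rounded to two decimals. Either value is below the tabulated 7.81.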
10.4.3 Test for specified variance
Suppose we want to test whether the population has a given variance ‘σ₀²’.
Then,
Ho: σ² = σ₀² and HA: σ² ≠ σ₀²
and
$$ \chi^2 = \frac{\sum (X - \bar{X})^2}{\sigma_0^2} = \frac{nS^2}{\sigma_0^2} $$

If the calculated value lies between ‘K1’ and ‘K2’, then ‘Ho’ is accepted. ‘K1’
and ‘K2’ are the lower and upper critical values read from the χ² table for
‘n − 1’ degrees of freedom.
Solved Problem 6: The standard deviation of the heights of plants is known
to be 2 cm. Eight randomly selected plants have heights 172, 156, 154,
163, 170, 169, 170 and 164 cm. Test whether the sample standard
deviation differs significantly from the population standard deviation.


Solution: The table 10.5 displays the sample of heights of plants.

Table 10.5: Sample data of heights of plants

X        d = X – 160    d²
172      12             144
156      –4             16
154      –6             36
163      3              9
170      10             100
169      9              81
170      10             100
164      4              16
Total    38             502

$$ S^2 = \frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2 = \frac{502}{8} - \left(\frac{38}{8}\right)^2 = 40.1875 $$

$$ nS^2 = 8 \times 40.1875 = 321.5 $$

The steps followed for calculation of Chi-Square are described below.

1. Null hypothesis ‘Ho’: σ² = σ₀²
Alternate hypothesis ‘HA’: σ² ≠ σ₀²
2. Level of significance is 5% and D.O.F = (8 – 1) = 7, so K1 = 1.69 and K2 = 16.01
3. Test statistic
$$ \chi^2 = \frac{nS^2}{\sigma_0^2} $$
4. Test χ²cal = 321.5 / 4 = 80.375
5. Conclusion: Since χ²cal lies outside the interval from ‘K1’ to ‘K2’, ‘Ho’ is
rejected. Hence, the standard deviation of the sample differs significantly.
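The test for a specified variance can be sketched as below, using the plant heights of Solved Problem 6. The function name `variance_chi_square` is an assumption made for this illustration; K1 and K2 are the tabulated values quoted in the solution.

```python
def variance_chi_square(sample, sigma0):
    """Return (n * S^2 / sigma0^2, degrees of freedom) for the sample."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in sample)  # equals n * S^2
    return sum_sq_dev / sigma0 ** 2, n - 1


heights = [172, 156, 154, 163, 170, 169, 170, 164]
chi_sq, dof = variance_chi_square(heights, sigma0=2)
print(round(chi_sq, 3), dof)  # 80.375 7

k1, k2 = 1.69, 16.01  # tabulated values for 7 degrees of freedom at the 5% level
print("lies between K1 and K2:", k1 <= chi_sq <= k2)
```

The sketch takes deviations from the sample mean directly rather than from the working origin 160; both routes give nS² = 321.5.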


Self Assessment Questions


Fill in the blanks given below.
1. __________ divides the area under the χ² curve into two equal portions.
2. The number of parameters for the Chi-Square distribution is ____.
3. The mean of the χ² distribution is __________.
4. The χ² test is a __________ test.
5. A table with 4 rows and 2 columns has _____________ degrees of freedom.
6. The mode of the χ² distribution is equal to the degrees of freedom less ________.
7. The χ² test is wholly based on _________ data.
8. If there are four rows and five columns in the classification for a χ² test,
then the number of degrees of freedom is equal to __________.
9. If the calculated χ² value is greater than the tabulated χ² value, then
the null hypothesis is __________.

10.5 Summary
Chi-Square test is a non-parametric test. It is used to test the independence
of attributes, goodness of fit and specified variance. The Chi-Square test
does not require any assumptions regarding the shape of the population
distribution from which the sample was drawn.
Chi-Square test assumes that samples are drawn at random and external
forces, if any, act on them in equal magnitude.
Chi-Square distribution is a family of distributions. For every degree of
freedom, there will be one chi-square distribution.
An important criterion for applying the Chi-Square test is that the sample
size should be very large. None of the theoretical expected values
calculated should be less than five.
The important applications of Chi-Square test are the tests for
independence of attributes, the test of goodness of fit and the test for
specified variance.


10.6 Terminal Questions

1. Treatments ‘X’ and ‘Y’ were each given to 400 items of a material to
enhance the strength of the material. 80 items gained strength by treatment ‘X’
and 20 gained strength by treatment ‘Y’. Does the gain in strength
depend on the treatment?
2. The table 10.6 gives the liking for a particular model of car among different age
groups. Test whether liking for the car is independent of age.
Table 10.6: Data related to terminal question 2

                            Age
                            Below 20    20 – 39    40 – 59    60 and above    Total
Persons who liked car       140         80         40         20              280
Persons who disliked car    60          50         30         80              220
Total                       200         130        70         100             500

3. The demand for a particular spare part was found to vary from day to
day. In a sample study, the information represented in table 10.7 was
obtained. Test the hypothesis that the number demanded depends upon
the day.
Table 10.7: Data related to terminal question 3

Day                  Mon     Tue     Wed     Thur    Fri     Sat
Quantity demanded    1124    1125    1110    1120    1126    1115

4. In a survey of 200 boys, 75 were intelligent, and 40 of these had skilled
fathers, while 85 of the unintelligent boys had unskilled fathers. Can we
say on the basis of this information that skilled fathers had intelligent
boys?
5. The number of car accidents per month in a town was as follows: 6, 9, 4,
12, 8, 20, 14, 15, 2, 10. Test the hypothesis that the number of accidents is
the same every month.
6. In a particular industry, the postgraduates, graduates and undergraduates are
in the ratio 2:3:5. A firm belonging to the industry had 400, 550 and 1050
postgraduates, graduates and undergraduates respectively on its payroll. Do they
follow the earlier observation about the industry?
7. 36 random observations have a variance of 1.21. Can we conclude that the
population variance is 2.4?
8. The standard deviation of the quantity of shampoo filled in sachets is 3 ml. For
24 sachets selected at random, the standard deviation was observed to be
3.8 ml. What is your conclusion?

10.7 Answers to SAQs and TQs

Answers to Self Assessment Questions


1. Median
2. One
3. Degrees of Freedom
4. Non-parametric
5. 3
6. 2
7. Sample
8. 12
9. Rejected

Answers to Terminal Questions

1. χ²cal = 41.142, Ho rejected
2. χ²cal = 70.162, Ho rejected
3. χ²cal = 0.179, Ho accepted
4. χ²cal = 8.888, Ho rejected
5. χ²cal = 26.6, Ho rejected
6. χ²cal = 6.6667, Ho rejected
7. χ²cal = 18.15, K1 = 20.61, K2 = 53.16, Ho rejected
8. χ²cal = 38


10.8 References
• Richard I. Levin, David S. Rubin (2008), Statistics for Management,
Seventh Edition, PHI Learning Private Limited.
• S. P. Gupta (2006), Statistical Methods, Sultan Chand & Sons.



Statistics for Management Unit 11

Unit 11 F-Distribution and Analysis of Variance (ANOVA)

Structure:
11.1 Introduction
Learning objectives
11.2 Analysis of Variance (ANOVA)
11.3 Assumptions for F-test
Objectives of ANOVA
ANOVA table
Assumptions for study of ANOVA
11.4 Classification of ANOVA
ANOVA table in one-way ANOVA
Two way classifications
11.5 Summary
11.6 Terminal Questions
11.7 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
11.8 References

11.1 Introduction
In Unit 10, ‘Chi-Square’, you studied the characteristics and properties of
Chi-Square. We also discussed how to apply the Chi-Square test to given sample
data and how to calculate Chi-Square values for either rejecting or not
rejecting the null hypothesis. In this unit, ‘F-Distribution and Analysis of
Variance (ANOVA)’, we will discuss the purpose of using analysis of
variance and conducting the F-test.
In the previous unit, you studied that the Chi-Square test is used for testing the
differences among more than two sample proportions and for making inferences
about whether they come from the same population distribution or not. When we
want to compare the means of more than two populations, we have to use the
analysis of variance to evaluate the differences between the means.


Analysis of variance (ANOVA) enables us to test for the significance of
the differences among more than two sample means. Using
analysis of variance, we will be able to make inferences about whether our
samples are drawn from populations having the same mean or not.
Learning objectives
By the end of this unit, you should be able to:
• Evaluate mean differences between two or more populations using
analysis of variance
• Explain the classification of analysis of variance
• Describe the procedure for carrying out the two-way analysis of variance
• Recognise the assumptions for applying the ANOVA technique
• Interpret the result of the F-test to reject or not reject the null hypothesis
framed on two or more population variances

11.2 Analysis of Variance (ANOVA)


Analysis of variance is useful in such situations as comparing the mileage
achieved by five different brands of gasoline, testing which of four different
training methods produces the fastest learning record, or comparing the
first-year earnings of the graduates of half a dozen different business schools. In
each of these cases, we would compare the means of more than two
samples. Hence, the concept of Analysis of Variance (ANOVA) is used in many
fields, such as agriculture, medicine, finance, banking, insurance and
education.
In statistical terms, the variation among sets of statistical data is measured
by variance. When two or more sets of data are compared for any practical
purpose, their differences are studied through the techniques of ANOVA. With
the analysis of variance technique, we can test the null hypothesis against the
alternative hypothesis.
Null hypothesis, ‘H0’: All the samples are drawn from populations having the
same mean.
Alternate hypothesis, ‘HA’: Not all the means are equal, that is, at least one
mean differs.

Key Statistic
The technique of analysis of variance is referred to as ANOVA.


Initially the technique was applied in the fields of Zoology and Agriculture, but
at a later stage, it was applied to other fields also. In analysis of variance,
the degree of variation between two or more sets of data, as well as the factors
contributing towards the variation, is studied.
In fact, Analysis of Variance is the classification and cross-classification of
statistical data with the view of testing whether the means of specific
classifications differ significantly or whether they are homogeneous.
The Analysis of Variance is a method of splitting the total variation of data
into constituent parts which measure different sources of variation.
The total variation is split up into the following two components.
• Variation within the subgroups of samples
• Variation between the subgroups of the samples
Hence, the total variation is the sum of the variation between the samples and
the variation within the samples. After obtaining the above two variations,
they are tested for their significance by the F-test, which is also known as the
variance ratio test.
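The split of the total variation into its two components can be checked numerically. The sketch below is an illustration under stated assumptions: variation is measured by sums of squared deviations, the function name `sum_of_squares_split` is hypothetical, and the three samples are made-up data.

```python
def sum_of_squares_split(groups):
    """Return (total, between, within) sums of squared deviations."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    total = sum((x - grand_mean) ** 2 for x in all_values)
    # Between samples: each sample mean compared with the grand mean.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within samples: each value compared with its own sample mean.
    within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return total, between, within


samples = [[8, 10, 12], [14, 16, 18], [5, 7, 9]]  # hypothetical data
total, between, within = sum_of_squares_split(samples)
print(total, between, within)  # 150.0 126.0 24.0
```

For these data the between-samples and within-samples parts, 126 and 24, add up to the total of 150.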
The ‘F’ statistic is defined as F = S₁²/S₂², where S₁² > S₂². It is used to test
differences between variances, that is, whether two populations can be
considered to have the same variance or not. As you studied in unit 10,
the χ² test is used to test a specified variance. The sample
variances ‘S₁²’ and ‘S₂²’ are calculated as:
$$ S_1^2 = \frac{1}{n_1 - 1} \sum (X - \bar{X})^2 \quad \text{and} \quad S_2^2 = \frac{1}{n_2 - 1} \sum (Y - \bar{Y})^2 $$
where,
• ‘n1’ is the size of the first sample
• ‘n2’ is the size of the second sample
• X̄ and Ȳ denote the sample means of the random variables ‘X’ and ‘Y’
respectively
It is also known as variance ratio test. It has two degrees of freedom, one for
numerator of the ratio and another for denominator. They are represented
by:
1 = n1 – 1 and 2 = n2 – 1.

ikkim Manipal University Page No. 269


Statistics for Management Unit 11

where, ‘1’ and ‘2’ are degrees of freedom in numerator and denominator
respectively.

11.3 Assumptions for F – test


The following are the assumptions for applying the F-test.
 The samples are simple random samples.
 The samples are independent of each other.
 The parent populations from which they are drawn are normally
distributed
Note: 1. If $F \sim F(\nu_1, \nu_2)$, then $1/F \sim F(\nu_2, \nu_1)$
2. $n_1 F \sim \chi^2$ with $n_1$ degrees of freedom for an F distribution with $(n_1, \infty)$ degrees of freedom
The assumption that all the populations should have normal distribution is
hardly achieved in practical cases. Hence, it can be considered as a
limitation.
Solved Problem 1: The table 11.1a represents the time taken by workers to do
a job by method I and method II. Can we conclude that the variances of the
time distributions for method I and method II are the same?
Table 11.1a. Time taken by workers to finish a job by two different methods
Method I 27 23 16 20 26 22
Method II 33 35 34 27 42 32 38

Solution: The tables 11.1b and 11.1c show the working values required to
calculate the sample variances for the two methods. The deviations ‘d’ are
measured from a convenient assumed value (22 for method I and 35 for
method II).
Table 11.1b. Required values of the method I to calculate sample variance
X        d = X – 22    d²
27        5            25
23        1             1
16       –6            36
20       –2             4
26        4            16
22        0             0
Total     2            82

Table 11.1c. Required values of the method II to calculate sample variance
X        d = X – 35    d²
33       –2             4
35        0             0
34       –1             1
27       –8            64
42        7            49
32       –3             9
38        3             9
Total    –4           136

$S_1^2 = \frac{1}{n_1 - 1}\left[\sum d^2 - \frac{(\sum d)^2}{n_1}\right] = \frac{1}{5}\left[82 - \frac{4}{6}\right] = 16.267$

$S_2^2 = \frac{1}{n_2 - 1}\left[\sum d^2 - \frac{(\sum d)^2}{n_2}\right] = \frac{1}{6}\left[136 - \frac{16}{7}\right] = 22.286$

1. Null hypothesis ‘H0’: $\sigma_1^2 = \sigma_2^2$, that is, the population variances of
the two methods are equal.
Alternate hypothesis ‘H1’: $\sigma_1^2 \neq \sigma_2^2$, that is, the null hypothesis is wrong.
2. Level of significance 5% and D.O.F (6, 5), from the F-table:
Ftab = 4.95
3. Test statistic: $F = S_2^2 / S_1^2$ (the larger sample variance is placed in the
numerator)
4. Test: $F_{cal} = \frac{22.286}{16.267} = 1.37$
5. Conclusion: Since Fcal (1.37) < Ftab (4.95), ‘H0’ is accepted. Hence, there
is no significant difference between the variances of the two methods.
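The arithmetic of solved problem 1 can be checked with a short Python sketch. It is only an illustration: the function name variance_ratio is our own, and the table value 4.95 is copied from the solution rather than computed.

```python
from statistics import variance

def variance_ratio(sample_a, sample_b):
    """Return (F, df_numerator, df_denominator), larger variance on top."""
    var_a = variance(sample_a)  # unbiased sample variance, divides by n - 1
    var_b = variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

method_1 = [27, 23, 16, 20, 26, 22]
method_2 = [33, 35, 34, 27, 42, 32, 38]

f_cal, df_num, df_den = variance_ratio(method_1, method_2)
f_tab = 4.95  # 5% table value for (6, 5) degrees of freedom, taken from the text

print(round(f_cal, 2), (df_num, df_den))
print("accept H0" if f_cal < f_tab else "reject H0")
```

Placing the larger variance in the numerator keeps F at or above 1, which is why only the upper tail of the F-table is consulted.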
11.3.1 Objectives of ANOVA
The objectives of ANOVA are to:
1. Obtain a measure of the total variation within the
components
2. Find a measure of variation between or among the components. Then,
the significance of difference between the variations in two series or
more may be measured
In other words, with the help of the technique of ANOVA we can test the
hypothesis that the means of all the components constituting a population
are equal to the mean of the population or that the samples have come from
the same population.
11.3.2 ANOVA table
Key Statistic
A table showing the source of variance, the sum of squares, degrees of
freedom, mean square (variance) and the formula for the F-ratio is
known as ANOVA table.

Computation of test statistics


The actual analysis of variance is carried out on the basis of the ratio
between the variances. The variance ratio is obtained by dividing the
variance between the samples by the variance within the samples. The ratio
forms the test statistic known as the F-statistic, that is,

F-statistic = (Variance between the samples) / (Variance within the samples)

Key Statistic
The population means are unlikely to be the same if the variation
between the samples is large when compared to the variation
within each sample.


11.3.3 Assumptions for study of ANOVA


The underlying assumptions for the study of ANOVA are:
i) Each of the samples is a simple random sample
ii) The populations from which the samples are selected are normally
distributed
iii) Each of the samples is independent of the other samples
iv) Each of the populations has the same variance (that the means are
also identical is the hypothesis being tested)
v) The effects of the various components are additive

11.4 Classification of Analysis of Variance


ANOVA is mainly carried out under the following two classifications.
i) One way analysis of variance or one way classification
ii) Two way analysis of variance or two way classified data or manifold
classification
11.4.1 ANOVA table in one way analysis of variance
The ‘ANOVA’ table presents the various results obtained while carrying out
ANOVA. The table 11.2 represents the specimen of ANOVA table.
Table 11.2: ANOVA table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square
Between Samples       SSC              K – 1                MSC
Within Samples        SSE              N – K                MSE
Total                 SST              N – 1

where,
• SST = Total sum of the squares
• SSC = Sum of the squares of the columns
• SSE = Sum of the squares of the error
• MSC = Variance between samples = SSC / (K – 1)
• MSE = Variance within the samples = SSE / (N – K)
You have studied in the previous unit that a Chi-square distribution depends
on its degrees of freedom, and that it has only one degrees-of-freedom
parameter. The F-distribution, however, has a pair of degrees of freedom. One
is the number of degrees of freedom in the numerator of the F ratio. The other
is the number of degrees of freedom in the denominator. These degrees of
freedom determine the shape of the F-distribution. Hence, they are the
parameters of the F-distribution.
Just like the Chi-square distribution, the F-distribution is not a single
distribution. It is a family of distributions. There are many different
F-distributions, one for each pair of degrees of freedom.

Key Statistic
The number of degrees of freedom in the numerator of the F ratio is
calculated as:
Degrees of freedom in numerator = (Number of samples – 1) = k – 1
where, ‘k’ is the number of samples taken.

Key Statistic
The number of degrees of freedom in denominator of the F ratio is
calculated as:
Degrees of freedom in denominator = N – k
where, ‘N’ is total number of values in all samples combined and ‘k’ is
the number of samples taken.

Solved Problem 2: An official from Central Government is concerned about


the monthly expenses of three different boards, that is, Civil Supplies Board,
Electricity Board and Higher Education Board. He wants to find out whether
the boards spend equal amounts on personnel and equipment. He applies
the technique of analysis of variance to test his assumption at 0.05 level of
significance. He collects the monthly expenses of three different boards for
the previous few months and summarises them into a tabular form as shown
in table 11.3. Calculate the number of degrees of freedom to test at the
given level of significance?


Table 11.3: Monthly office expenses (Rs. thousands)

Civil Supplies Board 14 8 12 9 18


Electricity Board 15 9 8 10 13 13
Higher Education Board 8 16 12 6

Solution: In analysis of variance, we use the F-test to test the null


hypothesis. In calculating the F-statistic, the degrees of freedom must be
found out. You have studied earlier that for an F-distribution, there will be a
pair of degrees of freedom.
From the given data, the number of samples ‘k’ is 3 and the total number of
observations ‘N’ in all samples combined is 15 (5 + 6 + 4). Therefore, the
degrees of freedom are calculated as:
Degrees of freedom in numerator = (Number of samples – 1)
=k–1
=3–1
=2
Degrees of freedom in denominator = N – k
= 15 – 3
= 12
Hence the degrees of freedom in numerator and denominator are 2 and 12
respectively.
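The same count can be written as a small Python sketch. The function name anova_degrees_of_freedom and the dictionary layout are illustrative choices of ours; the figures are those of table 11.3.

```python
def anova_degrees_of_freedom(samples):
    """Return (numerator df, denominator df) for a one-way ANOVA."""
    k = len(samples)                        # number of samples
    n_total = sum(len(s) for s in samples)  # all observations combined
    return k - 1, n_total - k

expenses = {
    "Civil Supplies Board": [14, 8, 12, 9, 18],
    "Electricity Board": [15, 9, 8, 10, 13, 13],
    "Higher Education Board": [8, 16, 12, 6],
}

df_num, df_den = anova_degrees_of_freedom(list(expenses.values()))
print(df_num, df_den)
```

Note that the samples need not be of equal size; only the number of samples and the total number of observations enter the calculation.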
Solved Problem 3: The table 11.4a shows the yield (in Kg) per acre for 5
trial plots treated using four different treatments. Carry out an analysis of
variance and state the conclusion.
Table 11.4a. Yield in Kg per acre for 5 trial plots treated using four varieties of
treatment
Plot No.   Treatment 1   Treatment 2   Treatment 3   Treatment 4
1.         42            48            68            80
2.         50            66            52            94
3.         62            68            76            78
4.         34            78            64            82
5.         52            70            70            66


Solution: The table 11.4b displays the calculated totals of the yield per acre
for each of the four varieties of treatment used on 5 trial plots.
Table 11.4b. Calculated totals of the yield per acre of each of the four
treatments
Plot No.   Treatment 1 (X1)   Treatment 2 (X2)   Treatment 3 (X3)   Treatment 4 (X4)
1.         42                 48                 68                 80
2.         50                 66                 52                 94
3.         62                 68                 76                 78
4.         34                 78                 64                 82
5.         52                 70                 70                 66
Total      240                330                330                400

T = Sum of all observations = 42 + 50 + ... + 66 = 1300

Correction factor = $\frac{T^2}{N} = \frac{1300^2}{20} = 84500$

SST = Sum of squares of all observations – $\frac{T^2}{N}$
SST = Crude sum of squares of all observations – Correction factor
= $(42^2 + 50^2 + 62^2 + 34^2 + \dots + 66^2) - 84500 = 88736 - 84500 = 4236$

$SSC = \left[\frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \frac{(\sum X_3)^2}{n_3} + \dots + \frac{(\sum X_K)^2}{n_K}\right] - \frac{T^2}{N}$
= $\frac{240^2}{5} + \frac{330^2}{5} + \frac{330^2}{5} + \frac{400^2}{5} - 84500 = 87080 - 84500 = 2580$

SSE = SST – SSC = 4236 – 2580 = 1656

MSC = $\frac{SSC}{K - 1} = \frac{2580}{4 - 1} = \frac{2580}{3} = 860$

MSE = $\frac{SSE}{N - K} = \frac{1656}{20 - 4} = 103.5$

The degrees of freedom = (K – 1, N – K) = (3, 16).
[K is the number of columns and N is the total number of observations.]
The table 11.4c represents the ANOVA table for the solved problem 3.
Table 11.4c. ANOVA table for solved problem 3

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square
Between Samples       SSC = 2580       K – 1 = 3            MSC = 860
Within Samples        SSE = 1656       N – K = 16           MSE = 103.5
Total                 SST = 4236       N – 1 = 19

$F = \frac{MSC}{MSE} = \frac{860}{103.5} = 8.31$

The table value of ‘F’, at 5% level of significance for DF (3, 16), is 3.24
which is less than the calculated value of ‘F’. Therefore, the null hypothesis
is rejected. Hence, the treatments do not all have the same effect.
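The steps of solved problem 3 can be traced in a minimal Python sketch of the one-way computation. The function name one_way_anova and the dictionary keys are our own labels, and the table value 3.24 is taken from the solution, not calculated by the code.

```python
def one_way_anova(samples):
    """One-way ANOVA by the correction-factor method used in the text."""
    values = [v for sample in samples for v in sample]
    n_total = len(values)                  # N
    k = len(samples)                       # K, number of columns
    correction = sum(values) ** 2 / n_total            # T^2 / N
    sst = sum(v * v for v in values) - correction      # total sum of squares
    ssc = sum(sum(s) ** 2 / len(s) for s in samples) - correction
    sse = sst - ssc
    msc = ssc / (k - 1)                    # variance between samples
    mse = sse / (n_total - k)              # variance within samples
    return {"SST": sst, "SSC": ssc, "SSE": sse,
            "MSC": msc, "MSE": mse, "F": msc / mse,
            "df": (k - 1, n_total - k)}

treatments = [
    [42, 50, 62, 34, 52],   # treatment 1
    [48, 66, 68, 78, 70],   # treatment 2
    [68, 52, 76, 64, 70],   # treatment 3
    [80, 94, 78, 82, 66],   # treatment 4
]

result = one_way_anova(treatments)
f_tab = 3.24  # 5% table value for (3, 16) degrees of freedom, from the text
print(result)
print("reject H0" if result["F"] > f_tab else "accept H0")
```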
11.4.2 Two way classifications
In the two way classification, observations are classified into groups on the
basis of two criteria.
Procedure for carrying out the two way analysis of variance
1. a) Assume the means of all columns are equal. That is, the effects of all
factors in the first kind of treatment are equal.
$\mu_1 = \mu_2 = \mu_3 = \dots = \mu_c$
b) Assume the means of all rows are equal. That is, the effects of all
factors in the second kind of treatment are equal.
$\mu_1 = \mu_2 = \mu_3 = \mu_4 = \dots = \mu_r$
2. Compute the sum of all values ‘T’.
3. Find SST = Sum of squares of all observations – $T^2 / N$
4. Find SSC as:
$SSC = \left[\frac{(\sum x_1)^2}{n_1} + \frac{(\sum x_2)^2}{n_2} + \frac{(\sum x_3)^2}{n_3} + \dots + \frac{(\sum x_c)^2}{n_c}\right] - \frac{T^2}{N}$
where $\sum x_1, \sum x_2, \sum x_3, \dots$ are column totals and each ‘n’ is the
number of observations in that column.
5. Find SSR as:
$SSR = \left[\frac{(\sum x_{j1})^2}{n_1} + \frac{(\sum x_{j2})^2}{n_2} + \frac{(\sum x_{j3})^2}{n_3} + \dots + \frac{(\sum x_{jr})^2}{n_r}\right] - \frac{T^2}{N}$
where $\sum x_{j1}, \sum x_{j2}, \sum x_{j3}, \dots$ are row totals and each ‘n’ is the
number of observations in that row.
6. SSE = SST – SSC – SSR
7. $MSC = \frac{SSC}{c - 1}$; $MSR = \frac{SSR}{r - 1}$; $MSE = \frac{SSE}{(c - 1)(r - 1)}$
where, ‘c’ is the number of columns and ‘r’ is the number of rows.
8. $F_c = \frac{MSC}{MSE}$ and $F_r = \frac{MSR}{MSE}$

Degrees of freedom for Fc = {c – 1, (c – 1)(r – 1)}
Degrees of freedom for Fr = {r – 1, (c – 1)(r – 1)}
Fc is for column wise comparison
Fr is for row wise comparison
If Fc < table value of F, then $\mu_1 = \mu_2 = \mu_3 = \dots = \mu_c$ is accepted for the columns
If Fr < table value of F, then $\mu_1 = \mu_2 = \mu_3 = \dots = \mu_r$ is accepted for the rows
The table 11.5 displays the ANOVA table for two way analysis of variance.
Table 11.5: ANOVA Table for two way analysis of variance

Source of Variation   Sum of Squares   DF                  Mean Square   F Ratio
Between Columns       SSC              c – 1               MSC           Fc
Between Rows          SSR              r – 1               MSR           Fr
Residual              SSE              (c – 1) x (r – 1)   MSE
Total                 SST              N – 1

Solved Problem 4: Three varieties of crops ‘A’, ‘B’, ‘C’ are tested in a
randomised block design with four replications. The yields are given in table
11.6a. Test at 0.05 level of significance whether there is difference between
replications. Test also whether varieties differ significantly.


Table 11.6a. Yields of three crops tested with four replications

Variety   Replication 1   Replication 2   Replication 3   Replication 4
A         6               4               8               6
B         7               6               6               9
C         8               5               10              9

Solution: The table 11.6b represents the totals of yields of three crops
tested with four replications.
Table 11.6b. Totals of yields of three crops tested with four replications

Variety   Replication 1   Replication 2   Replication 3   Replication 4   Total
A         6               4               8               6               24
B         7               6               6               9               28
C         8               5               10              9               32
Total     21              15              24              24              84

N = 12, T = sum of all values = 6 + 7 + 8 + 4 + 6 + 5 + 8 + 6 + 10 + 6 + 9 + 9
= 84.

Correction factor = $\frac{T^2}{N} = \frac{84^2}{12} = 588$

SST = sum of squares of all values – $T^2 / N$
= $6^2 + 7^2 + 8^2 + 4^2 + 6^2 + 5^2 + 8^2 + 6^2 + 10^2 + 6^2 + 9^2 + 9^2 - 588 = 624 - 588 = 36$
SST = 36

For columns, SSC is calculated as:
$SSC = \left[\frac{(\sum x_1)^2}{n_1} + \frac{(\sum x_2)^2}{n_2} + \frac{(\sum x_3)^2}{n_3} + \frac{(\sum x_4)^2}{n_4}\right] - \frac{T^2}{N}$
= $\frac{21^2}{3} + \frac{15^2}{3} + \frac{24^2}{3} + \frac{24^2}{3} - 588 = \frac{1818}{3} - 588 = 606 - 588 = 18$

MSC = $\frac{SSC}{c - 1} = \frac{18}{3} = 6$

For rows, SSR is calculated as:
$SSR = \left[\frac{(\sum x_{j1})^2}{n_1} + \frac{(\sum x_{j2})^2}{n_2} + \frac{(\sum x_{j3})^2}{n_3}\right] - \frac{T^2}{N}$
= $\frac{24^2}{4} + \frac{28^2}{4} + \frac{32^2}{4} - 588 = \frac{2384}{4} - 588 = 596 - 588 = 8$

Hence, SSR = 8.

MSR = $\frac{SSR}{r - 1} = \frac{8}{2} = 4$

SSE = SST – SSC – SSR = 36 – 18 – 8 = 10

MSE = $\frac{SSE}{(r - 1)(c - 1)} = \frac{10}{6} = 1.667$

The table 11.6c represents the ANOVA table for data of solved problem 4.

Table 11.6c. ANOVA table for solved problem 4

Source of Variation   Sum of Squares   DF                      Mean Square   F Ratio
Between Columns       SSC = 18         c – 1 = 3               MSC = 6       Fc = 6/1.667 = 3.6
Between Rows          SSR = 8          r – 1 = 2               MSR = 4       Fr = 4/1.667 = 2.4
Residual              SSE = 10         (c – 1) x (r – 1) = 6   MSE = 1.667
Total                 SST = 36         N – 1 = 11

Between columns
Degrees of freedom (3, 6), Table value of ‘F’ = 4.757 at α = 0.05
Calculated value of ‘F’ = 3.6 < Table value of ‘F’
Therefore, we accept the hypothesis that there is no significant difference
between replications.
Between rows
Degrees of freedom (2, 6), Table value of ‘F’ = 5.143 at α = 0.05
Calculated ‘F’ value is 2.4 < Table value of ‘F’
Therefore, we accept the hypothesis that there is no significant difference
between the varieties.
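The two way procedure applied in solved problem 4 can be sketched in Python as follows. This is an illustration under the assumption of one observation per cell; the function name two_way_anova and the dictionary keys are our own, and table values of F still have to be looked up separately.

```python
def two_way_anova(table):
    """Two-way ANOVA (one observation per cell); table is a list of rows."""
    r = len(table)            # number of rows
    c = len(table[0])         # number of columns
    n_total = r * c
    total = sum(sum(row) for row in table)
    correction = total ** 2 / n_total                       # T^2 / N
    sst = sum(v * v for row in table for v in row) - correction
    col_totals = [sum(row[j] for row in table) for j in range(c)]
    row_totals = [sum(row) for row in table]
    ssc = sum(t * t for t in col_totals) / r - correction   # r values per column
    ssr = sum(t * t for t in row_totals) / c - correction   # c values per row
    sse = sst - ssc - ssr
    msc = ssc / (c - 1)
    msr = ssr / (r - 1)
    mse = sse / ((c - 1) * (r - 1))
    return {"SST": sst, "SSC": ssc, "SSR": ssr, "SSE": sse,
            "Fc": msc / mse, "Fr": msr / mse,
            "df_c": (c - 1, (c - 1) * (r - 1)),
            "df_r": (r - 1, (c - 1) * (r - 1))}

# Rows are varieties A, B, C; columns are replications 1 to 4.
yields = [
    [6, 4, 8, 6],
    [7, 6, 6, 9],
    [8, 5, 10, 9],
]

result = two_way_anova(yields)
print(result)
```

The two F ratios share the same denominator (the residual mean square) but have different numerator degrees of freedom, so each is compared with its own table value.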
Solved Problem 5: The Sales Manager of NML Manufacturing Company conducted a
performance study on three salesmen during three seasons, and the data is
presented in table 11.7a. He wants to know whether there is a significant
difference in performance between the salesmen and between the seasons,
using 0.05 level of significance.
Table 11.7a. Performance study of three salesmen

Sales men Season


Summer Rainy Winter
Salesman-I 32 20 24
Salesman-II 40 50 68
Salesman-III 54 46 58

Solution: The null hypothesis ‘H0’ is given as:
$H_0: \mu_A = \mu_B = \mu_C$
(There is no difference between the salesmen or the seasons)
To simplify the arithmetic, we may subtract some suitable number, for
example 50, from all the data without affecting the values of the variations.
The coded data is presented in table 11.7b.
Table 11.7b. Total values of the performance counts of salesmen

Salesmen        Summer   Rainy   Winter   Total
Salesman-I      7        5       17       29
Salesman-II     –1       2       18       19
Salesman-III    4        –4      8        8
Total           10       3       43       56

Correction factor = $\frac{T^2}{N} = \frac{56^2}{9} = 348.44$

Sum of squares between seasons = $\frac{10^2}{3} + \frac{3^2}{3} + \frac{43^2}{3} - \frac{T^2}{N}$
= 33.33 + 3 + 616.33 – 348.44 = 304.22
Degrees of freedom ν = (3 – 1) = 2

Sum of squares between salesmen = $\frac{29^2}{3} + \frac{19^2}{3} + \frac{8^2}{3} - \frac{T^2}{N}$
= 280.33 + 120.33 + 21.33 – 348.44
= 73.55
Degrees of freedom ν = (3 – 1) = 2

Total sum of squares
= $(7)^2 + (-1)^2 + (4)^2 + (5)^2 + (2)^2 + (-4)^2 + (17)^2 + (18)^2 + (8)^2 - CF$
= 49 + 1 + 16 + 25 + 4 + 16 + 289 + 324 + 64 – 348.44 = 439.56

Error sum of squares = 439.56 – 304.22 – 73.55 = 61.79
Degrees of freedom ν = (3 – 1)(3 – 1) = 4


Table 11.7c. ANOVA table for solved problem 5

Source of variation        Sum of squares   Degrees of freedom   Mean squares   Variance ratio
Between columns (Season)   304.22           2                    152.110        FC = 152.11/15.448 = 9.85
Between rows (Salesmen)    73.55            2                    36.775         FR = 36.775/15.448 = 2.38
Error                      61.79            4                    15.448
Total                      439.56           8

The calculated value of FC is greater than the table value of F for (2, 4)
degrees of freedom at the 5% level (9.85 > 6.94). Hence, there is a
significant difference between the three seasons.
The calculated value of FR is less than the table value of F, that is,
(2.38 < 6.94). Hence, there is no significant difference between the
salesmen’s performances.

Self Assessment Questions


1. State whether the following statements are true ‘T’ or false ‘F’.
i) Populations from which the samples are selected are normally
distributed.
ii) The effects of various components are not additive.
iii) Sum of squares due to factors is crude sum of squares minus
correction factor.
iv) Analysis of variance is useful to test several means.
v) Another tool applied to test several means is Z / t – test.
vi) ‘F’ ratio is always calculated with respect to mean square error.
vii) The F-distribution curve depends on the degrees of freedom.
viii) In applying analysis of variance, the sample sizes must be equal.
ix) In a one-way ANOVA, the null hypothesis always states that all the
population means are different.
x) The F-statistic is the ratio of variance between the samples to the
variance within the samples
xi) In a one-way ANOVA, if the F test statistic is greater than the critical
F value, you will reject null hypothesis, because there is a significant
difference between the sample means.

11.5 Summary
ANOVA is a statistical technique used to evaluate the differences between
three or more sample means. This helps us to judge whether the samples are
from populations having the same mean or not.
ANOVA is classified into one way ANOVA and two way ANOVA.
ANOVA is a parametric test, as it assumes normality of the population
distributions and deals with means.
The F-test is conducted for performing analysis of variance. The F-test is
used to test the equality of two variances. ANOVA is used to test the
equality of several means using the relation $\sigma_{\bar{x}} = \sigma / \sqrt{n}$. The
F-distribution has a pair of degrees of freedom.
The assumptions for applying the F-test are that the random samples must
be independent of each other and drawn from normally distributed populations.

11.6 Terminal Questions


1. The table 11.8 displays the number of claims processed per day by a group
of four employees of XYZ Insurance Company, observed for a number of days.
Test the hypothesis that the employees’ mean claims per day are all the
same. Use a 5% level of significance
(F = 1.47, Fc = 3.29).
Table 11.8: Claims processed per day of four employees of an XYZ
Insurance Company

Employee 1 15 17 14 12
Employee 2 12 10 13 17
Employee 3 11 14 13 15 12
Employee 4 13 12 12 14 10 9

2. Four makes of bulbs were tested for their length of life (in ‘000 hours)
and the data obtained is displayed in table 11.9. Test whether the length
of their life is significantly different.
Table 11.9. Four different makes of bulbs with their length of life

Make I Make II Make III Make IV


20 19 21 15
23 15 19 17
18 17 20 16
17 20 17 18
16 16
3. The table 11.10 represents the data on production rate by five workmen
on four machines. Test whether the rate is significantly different due to
workers and machines.

Table 11.10. Production rate of five workmen on four machines

Machines Workmen
I II III IV V
1 46 48 36 35 40
2 40 42 38 40 44
3 49 54 46 48 51
4 38 45 34 35 41
4. The percentage sugar content of tobacco in two samples is shown in
table 11.11. Test whether their population variances are the
same.
Table 11.11. Percentage sugar content of Tobacco in two samples

Sample A 2.4 2.7 2.6 2.1 2.5


Sample B 2.7 3.0 2.8 3.1 2.2 3.6

11.7 Answers to SAQs and TQs

Answers to Self Assessment Questions


1.
i) T ii) F iii) T iv) T v) F vi) T vii) T viii) F ix) F x) T xi) T

Answers to Terminal Questions


1. Fcal = 1.47, not significant
2. Fcal = 1.67, not significant
3. Fcal = 8.20 for workmen
Fcal = 19.20 for machines
Both are significant
4. Fcal = 4.08, not significant

11.8 References
 Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited



Statistics for Management Unit 12

Unit 12 Simple Correlation and Regression


Structure:
12.1 Introduction
Learning objectives
12.2 Correlation
Causation and Correlation
Types of Correlation
12.3 Measures of Correlation
Scatter diagram
Karl Pearson’s correlation coefficient
Properties of Karl Pearson’s correlation coefficient
Factors influencing the size of correlation coefficient
12.4 Probable Error
Conditions under which probable error can be used
12.5 Spearman’s Rank Correlation Coefficient
12.6 Partial Correlations
12.7 Multiple Correlations
12.8 Regression
Regression analysis
Regression lines
Regression coefficient
12.9 Standard Error of Estimate
12.10 Multiple Regression Analysis
12.11 Reliability of Estimates
12.12 Application of Multiple Regressions
12.13 Summary
12.14 Terminal Questions
12.15 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
12.16 References

12.1 Introduction
In unit 11, ‘F – Distribution and Analysis of Variance (ANOVA)’, you
studied the F-test, which is used to test the hypothesis of the equality of
two variances. You also studied ANOVA, which is used to test the differences
in several means. In this unit 12, ‘Simple Correlation and Regression’, we
will discuss techniques such as correlation and regression, which are used
for investigating the relationship between two or more variables.
Both correlation and regression are used to measure the strength of
relationships between variables. The following statistical tools measure the
relationship between the variables analysed in social science research.
1. Correlation
a. Simple correlation: In simple correlations, the relationships between
two variables are studied.
b. Partial correlations: In partial correlations, the relationships of any
two variables are studied, keeping all others constant.
c. Multiple correlations: In multiple correlations, the relationships
between variables are studied simultaneously.
2. Regression
a. Simple regression: In simple regression, we study the relationship
between only two variables at a time, in which one variable is
independent and the other is dependent.
b. Multiple regression: In this, we study the relationship between more
than two variables at a time, in which one variable is dependent and
others are independent variables.
3. Association of attributes
Correlation measures the direction and strength of the relationship
(positive or negative, perfect or imperfect) between two variables.
Regression analysis considers the relationship between variables and
estimates the value of one variable from the known value of another.
Association of attributes attempts to ascertain the extent of association
between two attributes.
12.1.1 Learning objectives
By the end of this unit, you should be able to:
• Calculate the coefficient for partial and multiple correlation
• Distinguish between parametric and non parametric measures of
correlation
• Apply the method of estimating unknown values from known values
through regression equations

12.2 Correlation
When two or more variables move in sympathy with each other, they are
said to be correlated. If both variables move in the same direction, they
are said to be positively correlated. If the variables move in opposite
directions, they are said to be negatively correlated. If they move
haphazardly, there is no correlation between them. Correlation analysis
deals with the following.
 Measuring the relationship between variables.
 Testing the relationship for its significance.
 Giving confidence interval for population correlation measure.
12.2.1 Causation and correlation
An observed correlation between two variables need not indicate that one
causes the other. It may be due to the following causes.
• Due to small sample sizes
Correlation may be present in the sample and not in the population.
• Due to a third factor
Correlation between the yields of rice and tea may be due to a third
factor, ‘rain’.
12.2.2 Types of correlation
The following are the three types of correlation.
i. Positive or Negative
ii. Simple, Partial and Multiple
iii. Linear and Non-linear
Positive and negative correlations: If both the variables (X and Y) vary
in the same direction, the correlation is positive: if variable X increases,
variable Y also increases; if variable X decreases, variable Y also
decreases. If the given variables vary in opposite directions, they are
said to be negatively correlated: if one variable increases, the other
variable decreases. In other words, the variables are negatively correlated
if there is an inverse relationship between them.
Simple, partial and multiple correlations: In simple correlation, the
relationship between two variables is studied. In partial and multiple
correlations, three or more variables are studied. In multiple correlation,
three or more variables are studied simultaneously. In partial correlation,
more than two variables are considered, but the effect of the other
variables is kept constant while the relationship between two variables is
studied.
Linear and non-linear correlation: Correlation depends upon the
constancy of the ratio of change between the variables. In linear correlation,
the amount of change in one variable bears a constant ratio to the amount of
change in the other variable. It is not so in non-linear correlation.

12.3 Measures of Correlation


The following are the measures of correlation.
i. Scatter Diagram
ii. Karl Pearson’s correlation coefficient
iii. Spearman’s Rank correlation coefficient
12.3.1 Scatter diagram
The ordered pairs of observed values are plotted on the XY plane as dots.
Therefore, it is also known as a dot diagram. It is a diagrammatic
representation of the relationship.
Interpreting a scatter plot
If the dots lie exactly on a straight line that runs from left bottom to right top,
then the variables are said to be perfectly positively correlated. The figure
12.1 represents the scatter diagram for perfectly positively correlated
variables.

Fig. 12.1: Perfect positive correlation

If the dots lie close to a straight line that runs from left bottom to right top,
then the variables are said to be positively correlated. The figure 12.2
represents the scatter diagram for positively correlated variables.

Fig. 12.2: Positive correlation

If the dots lie exactly on a straight line that runs from left top to right bottom,
then the variables are said to be perfectly or exactly negatively correlated.
The figure 12.3 represents the scatter diagram for the perfectly negatively
correlated variables.

Fig. 12.3: Perfect negative correlation

If the dots lie very close to a straight line that runs from left top to right
bottom, then the variables are said to be negatively correlated. The figure
12.4 represents the scatter diagram for the negatively correlated
variables.


Fig. 12.4: Negative correlation

If the dots lie all over the graph paper, then the variables have zero
correlation. The figure 12.5 represents the scatter diagram of the
variables with zero correlation.

Fig. 12.5: Zero correlation

A scatter diagram shows the direction in which the variables are related,
but it does not give any quantitative measure for comparison between data
sets.


12.3.2 Karl Pearson’s correlation coefficient

Key Statistic
Karl Pearson’s correlation coefficient is defined as:
i) $r = \frac{\sum xy}{N \sigma_x \sigma_y}$ (A)
where, $x = X - \bar{X}$ and $y = Y - \bar{Y}$,
$\sigma_x^2 = \frac{\sum (X - \bar{X})^2}{N}$ and $\sigma_y^2 = \frac{\sum (Y - \bar{Y})^2}{N}$,
‘N’ is the number of paired observations and $\frac{\sum xy}{N}$ is called the
covariance of ‘x’ and ‘y’.

Key Statistic
The other forms of Karl Pearson’s correlation coefficient formula are:
ii) $r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$ (B)

$r = \frac{N \sum XY - \sum X \sum Y}{[N \sum X^2 - (\sum X)^2]^{1/2} [N \sum Y^2 - (\sum Y)^2]^{1/2}}$ (C)

$r = \frac{N \sum dx\,dy - \sum dx \sum dy}{[N \sum dx^2 - (\sum dx)^2]^{1/2} [N \sum dy^2 - (\sum dy)^2]^{1/2}}$ (D)

where ‘dx’ and ‘dy’ are the deviations of ‘X’ and ‘Y’ from assumed values.
For all practical purposes, we can conveniently use form D;
whenever summary information is given, choose the proper form from A
to C.

12.3.3 Properties of Karl Pearson’s correlation coefficient


The following are the properties of Karl Pearson’s correlation coefficient.
 Its value always lies between – 1 and 1
 It is not affected by change of origin or change of scale
 It is a relative measure. It does not have any unit attached to it


12.3.4 Factors influencing the size of correlation coefficient


The size of ‘r’ is very much dependent upon the variability of measured
values in the correlation sample. The greater the variability, the higher will
be the correlation, everything else being equal. The size of ‘r’ is altered
when researchers select extreme groups of subjects in order to compare
these groups with respect to certain behaviours. Selecting extreme groups on
one variable increases the size of ‘r’ over what would be obtained with more
random sampling.
Combining two groups which differ in their mean values on one of the
variables is not likely to faithfully represent the true situation as far as the
correlation is concerned.
Addition of an extreme case (and conversely dropping of an extreme case)
can lead to changes in the amount of correlation. Dropping of such a case
leads to reduction in the correlation while the converse is also true [1].
Solved Problem 1: Find Karl Pearson’s correlation coefficient for the data
displayed in table 12.1a.
Table 12.1a: Data related to solved problem 1

X 20 16 12 8 4
Y 22 14 4 12 8
Solution: The table 12.1b displays the sums calculated for the data
represented in table 12.1a.
Table 12.1b: Sums related to solved problem 1

X          Y          X²           Y²           XY
20         22         400          484          440
16         14         256          196          224
12         4          144          16           48
8          12         64           144          96
4          8          16           64           32
ΣX = 60    ΣY = 60    ΣX² = 880    ΣY² = 904    ΣXY = 840

[1] Source: Aggarwal, Y. P., Statistical Methods, Sterling Publishers Pvt Ltd., New
Delhi, 1998, p. 131.


Applying the formula for ‘r’ and substituting the respective values
from the table, we get ‘r’ as:
$r = \frac{N \sum XY - \sum X \sum Y}{[N \sum X^2 - (\sum X)^2]^{1/2} [N \sum Y^2 - (\sum Y)^2]^{1/2}}$

$r = \frac{5(840) - (60)(60)}{\sqrt{5(880) - (60)^2} \cdot \sqrt{5(904) - (60)^2}} = \frac{600}{\sqrt{800 \times 920}}$

$r = 0.70$
Hence, Karl Pearson’s correlation coefficient is 0.70.
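The substitution above can be reproduced with a brief Python sketch of the raw-sums form of the formula. The function name pearson_r is our own label, and the sketch assumes the two lists are paired and of equal length.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's r from raw sums of the paired observations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_xx - sum_x ** 2) * sqrt(n * sum_yy - sum_y ** 2)
    return numerator / denominator

x = [20, 16, 12, 8, 4]
y = [22, 14, 4, 12, 8]

r = pearson_r(x, y)
print(round(r, 2))
```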
Solved Problem 2: Calculate Karl Pearson’s Coefficient of Correlation from
the data displayed in table 12.2a.
Table 12.2a: Data related to index of production and number of unemployed
Year                   1985   1986   1987   1988   1989   1990   1991   1992
Index of Production    100    102    104    107    105    112    103    99
Number of unemployed   15     12     13     11     12     12     19     26

Solution: The table 12.2b displays the sums required for calculation of Karl
Pearson’s correlation coefficient.
Table 12.2b: Sums related to data given in solved problem 2
Year    Index of Production (X)   x = X – 104   x²    No. of unemployed (Y)   y = Y – 15   y²    xy
1985    100                       –4            16    15                      0            0     0
1986    102                       –2            4     12                      –3           9     +6
1987    104                       0             0     13                      –2           4     0
1988    107                       +3            9     11                      –4           16    –12
1989    105                       +1            1     12                      –3           9     –3
1990    112                       +8            64    12                      –3           9     –24
1991    103                       –1            1     19                      +4           16    –4
1992    99                        –5            25    26                      +11          121   –55
Total   ΣX = 832                  Σx = 0        Σx² = 120   ΣY = 120          Σy = 0       Σy² = 184   Σxy = –92

$\bar{X} = \frac{832}{8} = 104$ and $\bar{Y} = \frac{120}{8} = 15$

$r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} = \frac{-92}{\sqrt{120 \times 184}} = -0.62$

Therefore, the correlation between production and unemployment is negative.
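The deviation form used in this solution, with deviations measured from the actual means, can be checked with a short Python sketch. The function name pearson_r_from_deviations is illustrative only.

```python
from math import sqrt

def pearson_r_from_deviations(xs, ys):
    """r = sum(xy) / sqrt(sum(x^2) * sum(y^2)), deviations taken from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    sum_xx = sum(a * a for a in dev_x)
    sum_yy = sum(b * b for b in dev_y)
    return sum_xy / sqrt(sum_xx * sum_yy)

production = [100, 102, 104, 107, 105, 112, 103, 99]
unemployed = [15, 12, 13, 11, 12, 12, 19, 26]

r = pearson_r_from_deviations(production, unemployed)
print(round(r, 2))
```

The negative sign of the result comes entirely from the sum of the products of deviations, since the denominator is always positive.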


Solved Problem 3: Calculate the correlation coefficient from the data
represented in table 12.3a.
Table 12.3a: Data related to solved problem 3

X 50 60 58 47 49 33 65 43 46 68
Y 48 65 50 48 55 58 63 48 50 70
Solution: Table 12.3b displays the deviations of X and Y from the assumed
means 50 and 55, together with the sums required for solved problem 3.
Table 12.3b: Calculation table for solved problem 3

X      dx = X - 50   dx²     Y      dy = Y - 55   dy²     dx·dy
50       0             0     48      -7            49        0
60     +10           100     65     +10           100     +100
58      +8            64     50      -5            25      -40
47      -3             9     48      -7            49      +21
49      -1             1     55       0             0        0
33     -17           289     58      +3             9      -51
65     +15           225     63      +8            64     +120
43      -7            49     48      -7            49      +49
46      -4            16     50      -5            25      +20
68     +18           324     70     +15           225     +270
ΣX = 519   Σdx = 19   Σdx² = 1077   ΣY = 555   Σdy = 5   Σdy² = 595   Σdx·dy = 489
Using the formula for calculating 'r':
\[ r = \frac{N \sum dx\,dy - \sum dx \sum dy}{\left[ N \sum dx^2 - (\sum dx)^2 \right]^{1/2} \left[ N \sum dy^2 - (\sum dy)^2 \right]^{1/2}} \]
and substituting the values, we get:
\[ r = \frac{10(489) - (19)(5)}{\sqrt{10(1077) - (19)^2} \; \sqrt{10(595) - (5)^2}} = \frac{4795}{\sqrt{10409 \times 5925}} \approx 0.611 \]
Hence, Karl Pearson's correlation coefficient 'r' is 0.611.

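Solved Problem 4: In bivariate data on 'x' and 'y', variance of 'x' = 49,
variance of 'y' = 9 and covariance (x, y) = -17.5. Find the coefficient of
correlation between 'x' and 'y'.
Solution: We know that:
\[ r = \frac{\sum xy}{N \sigma_x \sigma_y} \]
where 'x' and 'y' are deviations from the respective means, so that Σxy / N
is the covariance. Given:
\[ \frac{\sum xy}{N} = -17.5, \qquad \sigma_x = \sqrt{49} = 7, \qquad \sigma_y = \sqrt{9} = 3 \]
\[ r = \frac{-17.5}{7 \times 3} = -0.833 \]
Hence, there is a high negative correlation.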
Solved Problem 5: Ten observations on weight (x) and height (y) of a
particular age group gave the following data:
Σx = 56, Σy = 138, Σx² = 1357, Σy² = 2136, Σxy = 836
Find 'r'.
Solution: We know that:
\[ r = \frac{N \sum xy - \sum x \sum y}{\left[ N \sum x^2 - (\sum x)^2 \right]^{1/2} \left[ N \sum y^2 - (\sum y)^2 \right]^{1/2}} \]
Given N = 10, Σx = 56, Σy = 138, Σx² = 1357, Σy² = 2136 and Σxy = 836:
\[ r = \frac{10 \times 836 - (56)(138)}{\left[ 10 \times 1357 - (56)^2 \right]^{1/2} \left[ 10 \times 2136 - (138)^2 \right]^{1/2}} = \frac{632}{\sqrt{10434 \times 2316}} = 0.1286 \]

Hence, Karl Pearson's correlation coefficient is 0.1286.

12.4 Probable Error


It measures the extent to which the correlation coefficient is dependable. It is an
old measure for testing the reliability of 'r'. It is given by:
\[ P.E. = 0.6745 \, \frac{1 - r^2}{\sqrt{n}} \]
where 'r' is measured from a sample of size 'n'.
Probable error is used to:
i) Interpret the value of 'r':
• If r < P.E., then 'r' is not at all significant.
• If r > 6 P.E., then 'r' is highly significant.
• If P.E. < r < 6 P.E., we cannot say anything about the significance of
'r'.
ii) Construct the limits r ± P.E. within which the population correlation
coefficient 'ρ' is expected to lie.
12.4.1 Conditions under which probable error can be used
The following are some conditions under which probable error (P.E.) can be
used.
1. Samples should be drawn from a normal population
2. The value of 'r' must be determined from sample values
3. Samples must have been selected at random
Solved Problem 6: If r = 0.6 and N = 64, then:
a) Interpret 'r'
b) Find the limits within which 'ρ' is supposed to lie.
Solution:
\[ P.E. = 0.6745 \times \frac{1 - (0.6)^2}{\sqrt{64}} = 0.6745 \times \frac{0.64}{8} = 0.054 \]
a) 6 P.E. = 6 × 0.054 = 0.324
Since r = 0.6 > 6 P.E., it is highly significant.
b) Limits for the population 'ρ':
= 0.6 ± 0.054
= 0.546 to 0.654
Hence, the limits within which 'ρ' lies are 0.546 and 0.654.
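As a small illustration of the probable-error rule, the sketch below recomputes solved problem 6. The helper names `probable_error` and `interpret` are hypothetical, and the use of the absolute value of 'r' for negative coefficients is an assumption that the text does not state.

```python
from math import sqrt

def probable_error(r, n):
    """Probable error of r for a sample of size n: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)

def interpret(r, n):
    """Compare |r| with P.E. and 6 P.E. (absolute value is an assumption for negative r)."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "highly significant"
    return "inconclusive"

r, n = 0.6, 64  # values of solved problem 6
pe = probable_error(r, n)
print(round(pe, 3), interpret(r, n))
print(round(r - pe, 3), round(r + pe, 3))  # limits r - P.E. and r + P.E.
```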

12.5 Spearman’s Rank Correlation Coefficient


Karl Pearson's correlation coefficient assumes that:
i. Samples are drawn from a normal population
ii. The variables under study are affected by a large number of
independent causes so as to form a normal distribution.
When we do not know the shape of the population distribution, or when the
data are of a qualitative type, Spearman's rank correlation coefficient is used
to measure the relationship.

Key Statistic
Spearman's rank correlation coefficient is defined as:
\[ \rho = 1 - \frac{6 \sum D^2}{N^3 - N} \]
where D is the difference between the ranks assigned to the two variables
and N is the number of pairs.

The value of 'ρ' lies between '-1' and '+1' and its interpretation is the same as
that of Karl Pearson's correlation coefficient.

There are four types of problems. The table 12.4 represents the type of
problems involved in calculating rank correlation coefficient.
Table 12.4: Types of problems

Type i Ranks are assigned


Type ii Ranks are to be assigned and there is no tie between ranks
Type iii When there is tie between ranks
Type iv When ranks are assigned already
Solved Problem 7: In a singing competition, two judges assigned ranks
to seven candidates, as displayed in table 12.5a. Find Spearman's
rank correlation coefficient.
Table 12.5a: Ranks of seven candidates

Competitor 1 2 3 4 5 6 7
Judge I 5 6 4 3 2 7 1
Judge II 6 4 5 1 2 7 3
Solution: Table 12.5b shows the rank differences for solved problem 7.

Table 12.5b: Data of seven candidates

Competitor   R1 (Judge I)   R2 (Judge II)   D = R1 - R2   D²
1            5              6               -1            1
2            6              4               +2            4
3            4              5               -1            1
4            3              1               +2            4
5            2              2                0            0
6            7              7                0            0
7            1              3               -2            4
                                            ΣD² =         14
\[ \rho = 1 - \frac{6 \times 14}{7(7^2 - 1)} = 1 - \frac{84}{7 \times 48} = 1 - 0.25 = 0.75 \]

Hence, Spearman's rank correlation coefficient ρ is 0.75.
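When ranks are already assigned and there are no ties, the rank formula reduces to a few lines of Python. This is a minimal sketch; `spearman_rho` is an illustrative name, and the ranks are those of table 12.5a.

```python
def spearman_rho(ranks_1, ranks_2):
    """Spearman's rho = 1 - 6 * sum(D^2) / (N^3 - N), valid when no ranks are tied."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

# Ranks given by the two judges in table 12.5a
judge_1 = [5, 6, 4, 3, 2, 7, 1]
judge_2 = [6, 4, 5, 1, 2, 7, 3]
print(spearman_rho(judge_1, judge_2))
```

Recomputing the squared differences in code is a useful guard against sign or addition slips in the D and D² columns.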


Solved Problem 8: Find the rank difference coefficient of correlation (in
case of no ties) for the data displayed in table 12.6.
Table 12.6: Scores of students on test I and test II

          Score on   Score on   Rank on   Rank on   Difference      Difference
Student   Test I     Test II    Test I    Test II   between ranks   squared
          X          Y          R1        R2        D               D²
A         16          8         2         5         -3               9
B         14         14         3         3          0               0
C         18         12         1         4         -3               9
D         10         16         4         2          2               4
E          2         20         5         1          4              16
N = 5                                                          ΣD² = 38

Applying the formula, we get:
\[ \rho = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6(38)}{5^3 - 5} = 1 - 1.9 = -0.9 \]
The relationship between the scores on Test I and Test II is very high and
inverse.

Type ii: When ranks are to be assigned and there is no tie between ranks

Solved Problem 9: Table 12.7a represents the sales statistics of six
sales representatives in two different localities. Find whether there is a
relationship between the buying habits of the people in the two localities.
Table 12.7a: Sales data of six representatives

Representative 1 2 3 4 5 6
Locality I 70 40 65 110 60 20
Locality II 70 30 80 100 90 20
Solution: Table 12.7b shows the ranks assigned to the sales figures (rank 1
for the highest sales) and the values needed for the rank correlation
coefficient of solved problem 9.

Table 12.7b: Calculating the coefficient of correlation

                 Rank of sales in   Rank of sales in
Representative   Locality I, R1     Locality II, R2    D = R1 - R2   D²
1                2                  4                  -2            4
2                5                  5                   0            0
3                3                  3                   0            0
4                1                  1                   0            0
5                4                  2                   2            4
6                6                  6                   0            0
                                                       ΣD² =         8
\[ \rho = 1 - \frac{6 \times 8}{6(6^2 - 1)} = 1 - \frac{8}{35} = 0.7714 \]
Therefore, there is a high positive correlation between the buying habits of
the people in the two localities.
Type iii: When ranks are repeated

Solved Problem 10: Find the rank correlation coefficient for the data displayed
in table 12.8a.
Table 12.8a: Scores of students in test I and test II
Student A B C D E F G H I J
Score on Test I 20 30 22 28 32 40 20 16 14 18
Score on Test II 32 32 48 36 44 48 28 20 24 28

Solution: Table 12.8b displays the data required for calculating the
correlation coefficient. Tied scores share the average of the ranks they
would otherwise occupy.
Table 12.8b: Ranks of test I and test II

          Score on   Score on   Rank on   Rank on   Difference      Difference
Student   Test I     Test II    Test I    Test II   between ranks   squared
          X          Y          R1        R2        D               D²
A         20         32         6.5       5.5        1.0             1.00
B         30         32         3         5.5       -2.5             6.25
C         22         48         5         1.5        3.5            12.25
D         28         36         4         4          0               0
E         32         44         2         3         -1.0             1.00
F         40         48         1         1.5       -0.5             0.25
G         20         28         6.5       7.5       -1.0             1.00
H         16         20         9         10        -1.0             1.00
I         14         24         10        9          1.0             1.00
J         18         28         8         7.5        0.5             0.25
N = 10                                                          ΣD² = 24

\[ \rho = 1 - \frac{6 \left[ \sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3) + \frac{1}{12}(m_4^3 - m_4) \right]}{N^3 - N} \]

where m_i represents the number of times a rank is repeated. Here four ranks
are each repeated twice (score 20 on Test I; scores 32, 48 and 28 on Test II),
so each correction term is (2³ - 2)/12 = 0.5.

\[ \rho = 1 - \frac{6 \left[ 24 + 0.5 + 0.5 + 0.5 + 0.5 \right]}{10(10^2 - 1)} = 1 - \frac{156}{990} = 0.8424 \]
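The handling of repeated ranks can be sketched as follows. The function names are hypothetical, ranks are assigned in descending order of score (rank 1 for the highest), and the correction term Σ(m³ - m)/12 is applied inside the bracket that is multiplied by 6, which is the usual textbook form of the tie-corrected formula.

```python
from collections import Counter

def average_ranks(scores):
    """Rank scores in descending order; tied scores share the average of their ranks."""
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        first_position.setdefault(score, position)
    counts = Counter(scores)
    return [first_position[s] + (counts[s] - 1) / 2 for s in scores]

def spearman_rho_with_ties(xs, ys):
    """Spearman's rho with the (m^3 - m)/12 correction for each group of tied values."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = sum((m ** 3 - m) / 12
                     for counts in (Counter(xs), Counter(ys))
                     for m in counts.values() if m > 1)
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)

# Scores of table 12.8a
test_1 = [20, 30, 22, 28, 32, 40, 20, 16, 14, 18]
test_2 = [32, 32, 48, 36, 44, 48, 28, 20, 24, 28]
print(average_ranks(test_1))
print(round(spearman_rho_with_ties(test_1, test_2), 4))
```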

Testing of correlation
The 't' test is used to test the significance of a correlation coefficient.
Example 1
Table 12.9 displays the height and weight of a random sample of
six adults.
Table 12.9: Height and weight of six adults

Height (cm) 170 175 176 178 183 185
Weight (kg) 57 64 70 76 71 82

It is reasonable to assume that these variables are normally
distributed, so that Karl Pearson's correlation coefficient is the
appropriate measure of the degree of association between height and
weight. For these data, r = 0.875.
The hypothesis test for Pearson's population correlation coefficient ρ is:
H0: ρ = 0; this implies no correlation between the variables in the
population
H1: ρ > 0; this implies that there is positive correlation in the population
(increasing height is associated with increasing weight).
A 5% significance level is taken.
The statistic for the 't' test is:
\[ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.875 \sqrt{6 - 2}}{\sqrt{1 - (0.875)^2}} = \frac{1.75}{\sqrt{0.2344}} \approx 3.61 \]

The table value at the 5% significance level and 4 degrees of freedom
(6 - 2) is 2.132.
Since the calculated value is more than the tabulated value, the null
hypothesis is rejected. There is significant positive correlation between
height and weight.
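The test statistic t = r√(n - 2) / √(1 - r²) for this example can be evaluated with the sketch below. The function name is illustrative, and the critical value 2.132 is simply the table value quoted in the text, not something the code looks up.

```python
from math import sqrt

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

r, n = 0.875, 6          # values of Example 1
critical_value = 2.132   # one-tailed 5% table value for 4 d.f., as quoted in the text
t = t_statistic(r, n)
print(round(t, 2), "reject H0" if t > critical_value else "do not reject H0")
```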

12.6 Partial Correlation

Partial correlation is used in a situation where three or four variables are
involved. The three variables may be age, height and weight. The correlation
between height and weight can be computed by keeping age constant.
Age may be an important factor influencing the strength of the relationship
between height and weight. Partial correlation is used to hold the
effect of age constant: the effect of one variable is partialled out from the
correlation between the other two variables. This statistical technique is known
as partial correlation. The correlation between variables 'x' and 'y' is denoted
as 'rxy'.

Key Statistic
Partial correlation is denoted by the symbol 'r12.3'. It is the correlation
between variables 1 and 2 keeping the 3rd variable constant.
\[ r_{12.3} = \frac{r_{12} - r_{13} \, r_{23}}{\sqrt{1 - r_{13}^2} \; \sqrt{1 - r_{23}^2}} \]

where,
r12.3 = partial correlation between variables 1 and 2 keeping the 3rd
constant
r12 = correlation between variables 1 and 2
r13 = correlation between variables 1 and 3
r23 = correlation between variables 2 and 3
Similarly,
\[ r_{13.2} = \frac{r_{13} - r_{12} \, r_{23}}{\sqrt{1 - r_{12}^2} \; \sqrt{1 - r_{23}^2}} \quad \text{and} \quad r_{23.1} = \frac{r_{23} - r_{12} \, r_{13}}{\sqrt{1 - r_{12}^2} \; \sqrt{1 - r_{13}^2}} \]
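All three partial correlation coefficients follow the same pattern, so a single helper is enough to compute them. The sketch below is only illustrative: the function name `partial_r` and the zero-order values 0.9, 0.5 and 0.4 are hypothetical and are not taken from the module.

```python
from math import sqrt

def partial_r(r_ab, r_ac, r_bc):
    """Correlation between variables a and b with variable c held constant."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Hypothetical zero-order coefficients, for illustration only
r12, r13, r23 = 0.9, 0.5, 0.4

r12_3 = partial_r(r12, r13, r23)  # 1 and 2, holding 3 constant
r13_2 = partial_r(r13, r12, r23)  # 1 and 3, holding 2 constant
r23_1 = partial_r(r23, r12, r13)  # 2 and 3, holding 1 constant
print(round(r12_3, 4), round(r13_2, 4), round(r23_1, 4))
```

Only the order of the arguments changes between the three calls: the first argument is always the pair being correlated, and the other two are its correlations with the variable held constant.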

Self Assessment Questions

Calculate the required correlation coefficients.
1. i. From the following data, calculate the correlation between variables 1
and 2 keeping the 3rd constant.
r12 = 0.7; r13 = 0.6; r23 = 0.4
ii. Calculate r23.1 and r13.2 from the following:
r12 = 0.60; r13 = 0.51; r23 = 0.40
iii. Given the following zero order correlation coefficients, calculate the
partial correlation between variables 1 and 3 keeping the 2nd constant.
Interpret your result.
r12 = 0.8; r13 = 0.6; r23 = 0.5

12.7 Multiple Correlations


Three or more variables are involved in multiple correlation. The dependent
variable is denoted by X1 and the other variables are denoted by X2, X3 and so
on. Gupta S.P. has expressed that "the coefficient of multiple linear
correlation is represented by R1 and it is common to add subscripts
designating the variables involved"2. Thus R1.234 would represent the
coefficient of multiple linear correlation between X1 on the one hand and X2, X3
and X4 on the other. The subscript of the dependent variable is always to
the left of the point.
The coefficients of multiple correlation in terms of r12, r13 and r23 can be
expressed as:

\[ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}} \]

\[ R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{13}^2}} \]

\[ R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{12}^2}} \]

The coefficient of multiple correlation R1.23 is the same as R1.32.
A coefficient of multiple correlation lies between '0' and '1'. If the coefficient
of multiple correlation is '1', it shows that the correlation is perfect. If it is '0',
it shows that there is no linear relationship between the variables. The
coefficients of multiple correlation are always positive in sign and range
from '0' to '+1'. The coefficient of multiple determination can be obtained by
squaring R1.23.
An alternative formula for computing R1.23 is:
\[ R_{1.23} = \sqrt{r_{12}^2 + r_{13.2}^2 \, (1 - r_{12}^2)} \quad \text{or} \quad R_{1.23}^2 = r_{12}^2 + r_{13.2}^2 \, (1 - r_{12}^2) \]

2
Source: Gupta S.P, Statistical Method, 2006, Sultan Chand & Sons, New Delhi.

Similarly, alternative formulas for R1.24 and R1.34 can be computed. The
following formula can be used to determine a multiple correlation coefficient
with three independent variables.
\[ R_{1.234} = \sqrt{1 - (1 - r_{14}^2)(1 - r_{13.4}^2)(1 - r_{12.34}^2)} \]

Multiple correlation analysis measures the relationship between the given
variables. In this analysis, the degree of association is measured between
one variable (which is considered as the dependent variable) and a group of
other variables (which are considered as independent variables).

Solved Problem 11: The following are the zero order correlation
coefficients.

r12 = 0.98; r13 = 0.44; r23 = 0.54

Calculate the multiple correlation coefficient treating the first variable as
dependent and the second and third variables as independent.

Solution: The first variable is dependent. The second and third variables are
independent. Using the formula for the multiple correlation coefficient R1.23,
we get:

\[ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}} = \sqrt{\frac{(0.98)^2 + (0.44)^2 - 2(0.98)(0.44)(0.54)}{1 - (0.54)^2}} = 0.986 \]

Hence the multiple correlation coefficient is 0.986.
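The calculation can be reproduced, and cross-checked against the alternative formula R²1.23 = r12² + r²13.2 (1 - r12²), with a short sketch. The function names are illustrative; the coefficients are those of solved problem 11.

```python
from math import sqrt

def multiple_r(r12, r13, r23):
    """R_1.23 from the three zero-order correlation coefficients."""
    return sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))

def partial_r13_2(r12, r13, r23):
    """Partial correlation between variables 1 and 3 with variable 2 held constant."""
    return (r13 - r12 * r23) / sqrt((1 - r12 ** 2) * (1 - r23 ** 2))

r12, r13, r23 = 0.98, 0.44, 0.54  # values of solved problem 11
R = multiple_r(r12, r13, r23)

# Cross-check with the alternative formula
R_squared_alt = r12 ** 2 + partial_r13_2(r12, r13, r23) ** 2 * (1 - r12 ** 2)
print(round(R, 3), round(sqrt(R_squared_alt), 3))
```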

Self Assessment Questions


2. State whether the following statements are 'True' or 'False'.
i. A scatter diagram does not give us a quantitative measure of the
correlation coefficient.
ii. Correlation studies estimate the values of one variable from the
knowledge of the other.
iii. Correlation coefficient is an absolute measure.
iv. The correlation studied between height and weight, keeping age
constant, is a partial correlation.

12.8 Regression
According to M. M. Blair, regression is defined as, “the measure of the
average relationship between two or more variables in terms of the original
units of the data”3.
Correlation analysis attempts to study the relationship between the two
variables 'x' and 'y'. Regression analysis attempts to predict the average value
of one variable for a given value of the other. In regression, we attempt to
quantify the dependence of one variable on the other. For example, if there are
two variables 'x' and 'y' and 'y' depends on 'x', then the dependence is
expressed in the form of an equation.
12.8.1 Regression analysis
Regression analysis is used to estimate the values of the dependent
variable from the values of the independent variables. Regression analysis
is also used to get a measure of the error involved in using the regression line
as a basis for estimation. The regression coefficients can be used to calculate
the correlation coefficient. The square of the correlation coefficient measures
the degree of association that prevails between the given two variables.
12.8.2 Regression lines
For a set of paired observations there exist two straight lines. The line
drawn in such a way that the sum of vertical deviations is zero and the sum of
their squares is minimum is called the regression line of 'y' on 'x'. It is used
to estimate 'y' values for given 'x' values. The line drawn in such a way that the
sum of horizontal deviations is zero and the sum of their squares is minimum is
called the regression line of 'x' on 'y'. It is used to estimate 'x' values for
given 'y' values. The smaller the angle between these lines, the higher is the
correlation between the variables. The regression lines always intersect at
(X̄, Ȳ).
The regression lines have the following equations:
i) The regression equation of 'y' on 'x' is given by:
\[ Y - \bar{Y} = b_{yx} (X - \bar{X}) \]
ii) The regression equation of 'x' on 'y' is given by:
\[ X - \bar{X} = b_{xy} (Y - \bar{Y}) \]
where,
\[ b_{yx} = \frac{N \sum dx\,dy - (\sum dx)(\sum dy)}{N \sum dx^2 - (\sum dx)^2} \]
and
\[ b_{xy} = \frac{N \sum dx\,dy - (\sum dx)(\sum dy)}{N \sum dy^2 - (\sum dy)^2} \]
Here dx and dy are the deviations of X and Y from their assumed means.

The regression equations found under the above conditions are said to be
fitted by the method of least squares. 'byx' and 'bxy' are called regression
coefficients.

3
T. R. Jain, S. C. Aggarwal, Dr. R. K. Rana, Basic Statistics for Economists, 2006-
2007 Edition, V. K. Publications.

12.8.3 Regression coefficient


When a regression is linear, the regression coefficient is given by the
slope of the regression line.
• The geometric mean of the regression coefficients gives the correlation
coefficient:
\[ r = \pm \sqrt{b_{yx} \, b_{xy}}, \quad \text{that is,} \quad b_{yx} \, b_{xy} = r^2 \]
• The product of the regression coefficients never exceeds 1,
that is,
\[ b_{yx} \, b_{xy} \le 1 \]
• 'byx', 'bxy' and 'r' have the same sign. If 'byx' is negative, then 'bxy' is
also negative and 'r' is negative.
• They can also be expressed as:
\[ b_{yx} = r \, \frac{\sigma_y}{\sigma_x} \quad \text{and} \quad b_{xy} = r \, \frac{\sigma_x}{\sigma_y} \]
• A regression coefficient is an absolute measure.
The differences between correlation and regression coefficient are listed in
table 12.10.

Table 12.10: Differences between correlation and regression coefficient

Correlation Coefficient                 Regression Coefficient
The correlation coefficients are        The regression coefficients are not
symmetric: rxy = ryx.                   symmetric: in general, byx ≠ bxy.
'r' lies between -1 and 1.              One of 'byx' and 'bxy' can be greater
                                        than one, but then the other must be
                                        less than one, so that byx.bxy ≤ 1.
It has no units attached to it.         It has units attached to it.
There exists nonsense correlation.      There is no such nonsense regression.
It is not based on a cause and          It is based on a cause and effect
effect relationship.                    relationship.
It indirectly helps in estimation.      It is meant for estimation.

Solved Problem 12: Find the regression equations from the data represented in
table 12.11a. Then calculate the correlation coefficient.
Table 12.11a: Data of ages of wife and husband
Age of Husband 18 19 20 21 22 23 24 25 26 27
Age of Wife 17 17 18 18 19 19 19 20 21 22

Solution: Table 12.11b represents the data required for calculation of the
correlation and regression coefficients.
Table 12.11b: Data required for calculation of correlation and regression
coefficients
Age of                               Age of
husband (x)   dx = x - 22    dx²     wife (y)   dy = y - 19    dy²    dx·dy
18            -4             16      17         -2             4       8
19            -3              9      17         -2             4       6
20            -2              4      18         -1             1       2
21            -1              1      18         -1             1       1
22             0              0      19          0             0       0
23             1              1      19          0             0       0
24             2              4      19          0             0       0
25             3              9      20          1             1       3
26             4             16      21          2             4       8
27             5             25      22          3             9      15
Total 225      5             85      190         0            24      43

\[ \bar{X} = \frac{225}{10} = 22.5, \qquad \bar{Y} = \frac{190}{10} = 19 \]
The regression equation of Y on X is given by:
\[ Y - \bar{Y} = b_{yx} (X - \bar{X}) \]
\[ b_{yx} = \frac{10 \times 43 - (5)(0)}{10 \times 85 - (5)^2} = \frac{430}{825} = 0.521 \]
\[ Y - 19 = 0.521 (X - 22.5) \]
\[ Y = 0.521 X + 7.2775 \]
The regression equation of X on Y is:
\[ X - \bar{X} = b_{xy} (Y - \bar{Y}) \]
\[ b_{xy} = \frac{10 \times 43 - (5)(0)}{10 \times 24 - (0)^2} = \frac{430}{240} = 1.792 \]
\[ X - 22.5 = 1.792 (Y - 19) \]
\[ X = 1.792 Y - 11.548 \]
\[ r = \sqrt{b_{yx} \, b_{xy}} = \sqrt{0.521 \times 1.792} = 0.966 \]
Hence, the correlation coefficient 'r' is 0.966.
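The assumed-mean formulas for the two regression coefficients lend themselves to a direct check in code. The sketch below uses the data of table 12.11a with the assumed means 22 and 19; the function name is an illustrative choice, and taking the positive square root for 'r' is valid here only because both coefficients are positive.

```python
from math import sqrt

def regression_coefficients(xs, ys, assumed_x, assumed_y):
    """Return (byx, bxy) using deviations dx, dy from assumed means."""
    n = len(xs)
    dx = [x - assumed_x for x in xs]
    dy = [y - assumed_y for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    numerator = n * sum_dxdy - sum_dx * sum_dy
    byx = numerator / (n * sum_dx2 - sum_dx ** 2)
    bxy = numerator / (n * sum_dy2 - sum_dy ** 2)
    return byx, bxy

# Data of table 12.11a
husband = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
wife = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]

byx, bxy = regression_coefficients(husband, wife, 22, 19)
r = sqrt(byx * bxy)  # positive root, since both coefficients are positive
print(round(byx, 3), round(bxy, 3), round(r, 3))
```

Because the formulas subtract (Σdx)(Σdy) and (Σdx)², the result does not depend on which assumed means are chosen.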
Solved Problem 13: In a correlation study, we have the data represented in
table 12.12. Find the two regression equations.
Table 12.12: Data about series X and series Y

                          Series X   Series Y
Mean                      65         67
Standard Deviation        2.5        3.5
Correlation coefficient   0.8
Solution: The regression equation of 'y' on 'x' is given by:
\[ Y - \bar{Y} = r \, \frac{\sigma_y}{\sigma_x} (X - \bar{X}) \]
\[ Y - 67 = (0.8) \left( \frac{3.5}{2.5} \right) (X - 65) \]
\[ Y - 67 = 1.12 (X - 65) \]
\[ Y = 1.12 X - 5.8 \]
The regression equation of 'x' on 'y' is given by:
\[ X - \bar{X} = r \, \frac{\sigma_x}{\sigma_y} (Y - \bar{Y}) \]
\[ X - 65 = (0.8) \left( \frac{2.5}{3.5} \right) (Y - 67) \]
\[ X - 65 = 0.571 (Y - 67) \]
\[ X = 0.571 Y + 26.71 \]
Hence, the two regression equations are:
\[ Y = 1.12 X - 5.8 \]
\[ X = 0.571 Y + 26.71 \]

12.9 Standard Error of Estimate


The standard error of estimate helps to measure the accuracy of the
estimated figures in regression analysis. If the value of the standard error of
estimate is small, it shows that the estimate provided by the regression
equation is better and closer to the observed values. If the standard error of
estimate is zero, it shows that there is no variation about the line and the
correlation is perfect.
The standard error of estimate is used to ascertain how good and
representative the regression line is as a description of the average
relationship between two series.
The standard error of estimate of 'X' values from the computed values 'Xc' is:
\[ S_{x.y} = \sqrt{\frac{\sum (X - X_c)^2}{N}} \]
It can also be computed as:
\[ S_{x.y} = \sigma_x \sqrt{1 - r^2} \]
or as:
\[ S_{x.y} = \sqrt{\frac{\sum X^2 - a \sum X - b \sum XY}{N}} \]
where 'a' and 'b' are the constants of the regression equation of 'X' on 'Y',
Xc = a + bY. Similarly, the standard error of estimate of 'Y' values from 'Yc' is:
\[ S_{y.x} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}} \]
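The agreement between the direct formula (average squared deviation from the regression line) and the shortcut σ√(1 - r²) can be illustrated numerically. The sketch below is only an illustration: it reuses the five pairs of table 12.1a and works with the regression of Y on X, and all variable names are hypothetical.

```python
from math import sqrt

# Data of table 12.1a, reused here purely for illustration
X = [20, 16, 12, 8, 4]
Y = [22, 14, 4, 12, 8]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n
sxx = sum((x - mean_x) ** 2 for x in X)
syy = sum((y - mean_y) ** 2 for y in Y)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

# Regression line of Y on X: Yc = a + byx * X
byx = sxy / sxx
a = mean_y - byx * mean_x
y_c = [a + byx * x for x in X]

# Direct formula: root of the mean squared deviation from the line
se_direct = sqrt(sum((y - yc) ** 2 for y, yc in zip(Y, y_c)) / n)

# Shortcut formula: sigma_y * sqrt(1 - r^2)
r = sxy / sqrt(sxx * syy)
sigma_y = sqrt(syy / n)
se_shortcut = sigma_y * sqrt(1 - r ** 2)

print(round(se_direct, 4), round(se_shortcut, 4))
```

Both routes divide by N rather than N - 2, matching the formulas of this section.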

Solved Problem 14: Table 12.13 displays the results that were worked
out from scores in Statistics and Mathematics in a certain examination.

Table 12.13: Scores in statistics and mathematics

                     Scores in Statistics (X)   Scores in Mathematics (Y)
Mean                 40                         48
Standard Deviation   10                         15
Karl Pearson's correlation coefficient between 'x' and 'y' is +0.42. Find the
regression lines 'x' on 'y' and 'y' on 'x'. Use the regression lines to find the
value of 'y' when x = 50 and the value of 'x' when y = 30.
Solution: Given the following data:
X̄ = 40; Ȳ = 48; σx = 10; σy = 15; r = 0.42
The regression line 'x' on 'y' is:
\[ (X - \bar{X}) = r \, \frac{\sigma_x}{\sigma_y} \, (Y - \bar{Y}) \quad \ldots (1) \]

The regression line 'y' on 'x' is given as:
\[ (Y - \bar{Y}) = r \, \frac{\sigma_y}{\sigma_x} \, (X - \bar{X}) \quad \ldots (2) \]

Substituting the values, we get the respective equations as:
\[ X = 0.28 Y + 26.56 \quad \ldots (3) \]
\[ Y = 0.63 X + 22.80 \quad \ldots (4) \]
Therefore,
when y = 30, x = 34.96 by using equation (3) and
when x = 50, y = 54.3 by using equation (4)
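When only the means, standard deviations and 'r' are available, both regression lines follow from slope = r × (SD of dependent / SD of independent) and intercept = mean of dependent - slope × mean of independent. The sketch below applies this to the summary figures of table 12.13; the helper name `line_from_summary` is hypothetical.

```python
def line_from_summary(mean_dep, mean_ind, sd_dep, sd_ind, r):
    """Return (slope, intercept) of the regression of the dependent variable
    on the independent variable, from summary statistics only."""
    slope = r * sd_dep / sd_ind
    intercept = mean_dep - slope * mean_ind
    return slope, intercept

# Summary figures of table 12.13
mean_x, mean_y, sd_x, sd_y, r = 40, 48, 10, 15, 0.42

b_xy, a_xy = line_from_summary(mean_x, mean_y, sd_x, sd_y, r)  # x on y
b_yx, a_yx = line_from_summary(mean_y, mean_x, sd_y, sd_x, r)  # y on x

x_when_y_30 = a_xy + b_xy * 30
y_when_x_50 = a_yx + b_yx * 50
print(round(b_xy, 3), round(a_xy, 2), round(x_when_y_30, 2))
print(round(b_yx, 3), round(a_yx, 2), round(y_when_x_50, 2))
```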
Solved Problem 15: For the data in table 12.14a, obtain the two regression
equations. Estimate Y for X = 15 and estimate X for Y = 20.
Table 12.14a: Data of the solved problem 15

X 12 4 20 8 16
Y 18 22 10 16 14
Solution: Table 12.14b displays the values required for obtaining the
regression equations.
X̄ = (12 + 4 + 20 + 8 + 16) / 5 = 12 = mean of X
Ȳ = (18 + 22 + 10 + 16 + 14) / 5 = 16 = mean of Y

Table 12.14b: Calculated values of X and Y to obtain regression equations

            X - X̄      Y - Ȳ
X     Y     (X - 12)   (Y - 16)   (X - X̄)²   (Y - Ȳ)²   (X - X̄)(Y - Ȳ)
12    18     0          2           0          4            0
 4    22    -8          6          64         36          -48
20    10     8         -6          64         36          -48
 8    16    -4          0          16          0            0
16    14     4         -2          16          4           -8
Total                             160         80         -104

\[ b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{-104}{160} = -0.65 \]
and
\[ b_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2} = \frac{-104}{80} = -1.3 \]

The regression equation of X on Y is given by:
\[ (X - \bar{X}) = b_{xy} (Y - \bar{Y}) \]
\[ X - 12 = -1.3 (Y - 16) \]
\[ X = 32.8 - 1.3 Y \]
When Y = 20, X = 6.8.
The regression equation of Y on X is given by:
\[ (Y - \bar{Y}) = b_{yx} (X - \bar{X}) \]
\[ Y - 16 = -0.65 (X - 12) \]
\[ Y = 23.8 - 0.65 X \]
When X = 15, Y = 14.05.
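The two estimates can be reproduced with a small least-squares helper based on deviations from the actual means. This is a sketch using the data of table 12.14a; the function name `fit_line` is an illustrative choice.

```python
def fit_line(independent, dependent):
    """Least-squares line dependent = intercept + slope * independent,
    computed from deviations about the actual means."""
    n = len(independent)
    mean_i = sum(independent) / n
    mean_d = sum(dependent) / n
    s_id = sum((i - mean_i) * (d - mean_d) for i, d in zip(independent, dependent))
    s_ii = sum((i - mean_i) ** 2 for i in independent)
    slope = s_id / s_ii
    intercept = mean_d - slope * mean_i
    return slope, intercept

# Data of table 12.14a
X = [12, 4, 20, 8, 16]
Y = [18, 22, 10, 16, 14]

byx, a_yx = fit_line(X, Y)  # regression of Y on X
bxy, a_xy = fit_line(Y, X)  # regression of X on Y

print(round(a_yx + byx * 15, 2))  # estimate of Y for X = 15
print(round(a_xy + bxy * 20, 2))  # estimate of X for Y = 20
```

Note that each estimate uses its own line: Y is estimated from the line of Y on X, and X from the line of X on Y.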

12.10 Multiple Regression Analysis


Multiple regression analysis is an extension of two variable regression
analysis. In this analysis, two or more independent variables are used to
estimate the values of a dependent variable, instead of one independent
variable.
Objectives of multiple regression analysis are:
• To derive an equation which provides estimates of the dependent
variable from values of the two or more independent variables
• To obtain a measure of the error involved in using the regression
equation as a basis of estimation
• To obtain a measure of the proportion of variance in the dependent
variable accounted for or explained by the independent variables
The multiple regression equation explains the average relationship between the
given variables, and the relationship is used to estimate the dependent
variable. A regression equation is an equation for estimating a
dependent variable. The equation estimating the dependent variable X1 from the
independent variables X2, X3, ... is known as the regression equation of X1 on
X2, X3, ....
The regression equation, when three variables are involved, is given below:
\[ X_{1.23} = a_{1.23} + b_{12.3} X_2 + b_{13.2} X_3 \]
where, X1.23 is the estimated value of the dependent variable
X2 and X3 are independent variables.
a1.23 = (constant) the intercept made by the regression plane. It gives the
value of the dependent variable when all the independent variables
assume a value equal to zero.
b12.3 and b13.2 = partial regression coefficients or net regression coefficients.
b12.3 = measures the amount by which a unit change in X2 is expected to
affect X1 when X3 is held constant.
When deviations are taken from the actual means, the equation becomes:
\[ x_{1.23} = b_{12.3} \, x_2 + b_{13.2} \, x_3 \]
where
\[ x_1 = (X_1 - \bar{X}_1), \quad x_2 = (X_2 - \bar{X}_2), \quad x_3 = (X_3 - \bar{X}_3) \]
b12.3 and b13.2 can be obtained by solving the following equations.
\[ \sum x_1 x_2 = b_{12.3} \sum x_2^2 + b_{13.2} \sum x_2 x_3 \]
\[ \sum x_1 x_3 = b_{12.3} \sum x_2 x_3 + b_{13.2} \sum x_3^2 \]
σ 1.23
b12.3 =
σ 3.12
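Solving the two normal equations in deviation form is a 2 × 2 linear system, so the partial regression coefficients can be obtained with Cramer's rule. The sketch below is illustrative only: the function name is hypothetical, and the data are invented so that X1 = 1 + 2 X2 + 3 X3 holds exactly, which makes the recovered constants easy to verify.

```python
def fit_two_predictors(x1, x2, x3):
    """Fit X1 = a + b12_3 * X2 + b13_2 * X3 by solving the normal equations
    written in terms of deviations from the actual means."""
    n = len(x1)
    m1, m2, m3 = sum(x1) / n, sum(x2) / n, sum(x3) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    d3 = [v - m3 for v in x3]
    s12 = sum(a * b for a, b in zip(d1, d2))
    s13 = sum(a * b for a, b in zip(d1, d3))
    s23 = sum(a * b for a, b in zip(d2, d3))
    s22 = sum(a * a for a in d2)
    s33 = sum(a * a for a in d3)
    det = s22 * s33 - s23 ** 2           # zero if X2 and X3 are perfectly collinear
    b12_3 = (s12 * s33 - s13 * s23) / det
    b13_2 = (s13 * s22 - s12 * s23) / det
    a = m1 - b12_3 * m2 - b13_2 * m3     # intercept of the regression plane
    return a, b12_3, b13_2

# Hypothetical data constructed so that X1 = 1 + 2*X2 + 3*X3 exactly
X2 = [1, 2, 3, 4, 5]
X3 = [2, 1, 4, 3, 6]
X1 = [1 + 2 * u + 3 * v for u, v in zip(X2, X3)]

a, b12_3, b13_2 = fit_two_predictors(X1, X2, X3)
print(round(a, 6), round(b12_3, 6), round(b13_2, 6))
```

With real data the fit is not exact, and the residual spread is measured by the standard error of estimate.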

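In terms of the zero order correlation coefficients and the standard
deviations S1, S2 and S3, the regression equation of X1 on X2 and X3 is:
\[ (X_1 - \bar{X}_1) = \left( \frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2} \right) \left( \frac{S_1}{S_2} \right) (X_2 - \bar{X}_2) + \left( \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2} \right) \left( \frac{S_1}{S_3} \right) (X_3 - \bar{X}_3) \]
The regression equation of X3 on X2 and X1 is:
\[ (X_3 - \bar{X}_3) = \left( \frac{r_{23} - r_{13} r_{12}}{1 - r_{12}^2} \right) \left( \frac{S_3}{S_2} \right) (X_2 - \bar{X}_2) + \left( \frac{r_{13} - r_{23} r_{12}}{1 - r_{12}^2} \right) \left( \frac{S_3}{S_1} \right) (X_1 - \bar{X}_1) \]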

12.11 Reliability of Estimates


Reliability of estimates tests whether the estimated value obtained by applying
the regression equation is very close to the actual observed value. The
standard error is used to measure the closeness of the estimates derived
from the regression equation to the actual observed values.
The measure of reliability is an average of the deviations of the actual values
of the dependent variable from the estimates given by the regression equation.
Determining the accuracy of estimates from a multiple regression is known as
reliability of estimates. It is measured by the standard error of estimate.

Key Statistic
The standard error of estimate of X1 on X2 and X3 is given below:
\[ S_{1.23} = \sqrt{\frac{\sum (X_1 - X_{1\,est})^2}{N - 3}} \]
where
S1.23 = standard error of estimate of X1 on X2 and X3
X1 est = estimated value of X1 as calculated from the regression
equation

12.12 Application of Multiple Regression


Multiple regression analysis can be applied to test how factors such as export
elasticity, import elasticity and structural change (contribution of the
manufacturing sector towards GDP) influence employment. Here,
employment is the dependent variable.
Similarly, researchers can use multiple regression in their
research work where appropriate.

Self Assessment Questions


3. State whether the following statements are 'True' or 'False'.
i. The correlation coefficient is the geometric mean of the regression
coefficients.
ii. The regression lines pass through (X̄, Ȳ).
iii. byx = r . S.D of X / S.D of Y
iv. The larger the angle between the regression lines, the lower is
the correlation coefficient.

12.13 Summary
In this unit we studied the concept of correlation and regression and the
different types of correlation and regression.
We saw how regression helps us to study unknown variables with the help
of known variables. It also establishes reliability measure for estimated
values.
Regression analysis helps to quantify the dependence of one variable on
the other. Some of the regression types are simple and multiple regression,
linear and non linear regression.
Regression analysis is useful in business and economic scenarios in
decision making process.

12.14 Terminal Questions


1. Test the significance of the correlation coefficient for 'r' = 0.4 and
'r' = 0.9 when the number of observations is:
i. 10
ii. 100

2. Table 12.15 gives the marks obtained by 10 students in commerce and
statistics. Calculate the rank correlation.
Table 12.15: Marks of students obtained in commerce and statistics
Marks in Statistics 35 90 70 40 95 45 60 85 80 50
Marks in Commerce 45 70 65 30 90 40 50 75 85 60

3. Calculate Spearman's rank correlation coefficient between the series A
and B given in table 12.16.
Table 12.16: Series data of the terminal question 3
Series A 57 59 62 63 64 65 55 58 57
Series B 113 117 126 126 130 129 111 116 112

4. For the data in table 12.17, obtain the two lines of regression and
estimate the blood pressure when the age is 50 years.
Table 12.17: Data for the terminal question 4
Age (X) in yrs 56 42 72 39 63 47 52 49 40 42 68 60
BP (Y) 127 112 140 118 129 116 130 125 115 120 135 133

5. Table 12.18 displays the results that were worked out from scores in
statistics and mathematics in a certain examination.
Table 12.18: Results of scores in statistics and mathematics examination

                     Scores in Statistics (X)   Scores in Mathematics (Y)
Mean                 39.5                       47.5
Standard Deviation   10.8                       17.8

Karl Pearson's correlation coefficient between X and Y = 0.42. Find both the
regression lines. Use these lines to estimate the value of Y when X = 50 and
the value of X when Y = 30.

12.15 Answers to SAQs and TQs

Answers to Self Assessment Questions


1. i. Refer section 12.6
ii. Refer section 12.6
iii. Refer section 12.6
2. i. True ii. False iii. False iv. True
3. i. True ii. True iii. False iv. True

Answers to Terminal Questions


1. i. Non significant
ii. Highly significant
iii. Highly significant
iv. Highly significant
2. 0.903
3. 0.967
4. X = - 95 + 1.184Y
Y = 87.2 + 0.724X
5. X = 27.62 + 0.25Y
Y = 20.24 + 0.69X

12.16 References
 Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
 S. P. Gupta, Statistical Methods, (2006), Sultan Chand & Sons
 T. R. Jain, S. C. Aggarwal, Dr. R. K. Rana, Basic Statistics for
Economists, 2006-2007 Edition, V. K. Publications



Statistics for Management Unit 13

Unit 13 Business Forecasting


Structure:
13.1 Introduction
Learning objectives
13.2 Business Forecasting
Objectives of forecasting in business
Prediction, projection and forecasting
Characteristics of business forecasting
Steps in forecasting
13.3 Methods of Business Forecasting
Business barometers
Time series analysis
Extrapolation
Regression analysis
Modern econometric methods
Exponential smoothing method
13.4 Theories of Business Forecasting
Sequence or time-lag theory
Action and reaction theory
Economic rhythm theory
Specific historical analogy
Cross-cut analysis theory
13.5 Utility of Business Forecasting
Advantages of business forecasting
Limitations of business forecasting
13.6 Summary
13.7 Terminal Questions
13.8 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
13.9 References

13.1 Introduction
In the unit 12, „Simple Correlation and Regression‟, you have studied about
the techniques such as correlation and regression, which are used for

Sikkim Manipal University Page No. 318


Statistics for Management Unit 13

investigating the relationship between two or more variables. In this unit 13,
„Business Forecasting‟, we will discuss about business forecasting, the
methods available in forecasting, and the use of forecasting models in
business improvement processes.
The growing competition, rapidity of change in circumstances and the trend
towards automation demand that decisions in business are not based purely
on guesses and hunches but rather on a careful analysis of data concerning
the future course of events. The future is unknown to us. Yet every day we
are forced to make decisions involving future and therefore there is
uncertainty. Great risk is associated with business affairs. All businessmen
are forced to make forecast regarding business activities.
Success in business depends upon successful forecasts of business events.
In business or trade the importance of forecasting is so great, that when
someone enters into the business world, he really enters the profession of
forecasting. In recent times, considerable research has been conducted in
this field. Attempts are being made to make forecasting as scientific as
possible.
Business forecasting as such is not a new development. Every
businessman must forecast, even if his whole product is sold before
production. Forecasting has always been necessary. What is new is the
attempt to put forecasting on a scientific basis, that is, to forecast by
reference to past history and statistics rather than by pure intuition and
guess-work.
One of the most important tasks before businessmen and economists these
days is to make estimates for the future. For example, a businessman is
interested in finding out his likely sales next year or, for long-term
planning, in the next five or ten years, so that he can adjust his production
accordingly and avoid the possibility of either inadequate production to
meet the demand or unsold stocks.
Similarly, an economist is interested in estimating the likely population in the
coming years so that proper planning can be carried out with regard to jobs
for the people, food supply and so on. The first step in making estimates for
the future consists of gathering information from the past. In this connection
we usually deal with statistical data which are collected, observed or recorded
at successive intervals of time. Such data are generally referred to as a time
series. Thus, when we observe numerical data at different points of time, the
set of observations is known as a time series.
13.1.1 Learning objectives
By the end of this unit, you should be able to:
• Describe the meaning of business forecasting
• Distinguish between prediction, projection and forecast
• Describe the forecasting methods available
• Apply the forecasting theories in taking effective business decisions

13.2 Business Forecasting


Business forecasting refers to the analysis of past and present economic
conditions with the object of drawing inferences about probable future
business conditions. The process of making definite estimates of the future
course of events is referred to as forecasting, and the figure or statement
obtained from the process is known as a ‘forecast’. The future course of
events is rarely known. In order to be reasonably assured of the coming
course of events, help is taken of an organised system of forecasting. The
following are two aspects of scientific business forecasting.
Analysis of past economic conditions
For this purpose, the components of the time series are to be studied. The
secular trend will show how the series has been moving in the past and
what its future course is likely to be over a long period. The cyclic
fluctuations would reveal whether the business activity is subjected to boom
or depression. The seasonal fluctuations would indicate the seasonal
changes in the business activity.
Analysis of present economic conditions
The object of analysing present economic conditions is to study those
factors which affect the sequential changes expected on the basis of the
past conditions. Such factors include new inventions, changes in fashion,
changes in economic and political spheres, economic and monetary policies
of the Government, and war. These factors may affect and alter the duration
of the trade cycle. Therefore it is essential to keep in mind the present
economic conditions, since they have an important bearing on the probable
future tendency.


13.2.1 Objectives of forecasting in business


Forecasting is a part of human conduct. Businessmen also need to look to
the future. Success in business depends on correct predictions. In fact when
a man enters business, he automatically takes with it the responsibility for
attempting to forecast the future.
To a very large extent, his success or failure depends upon the ability
to forecast successfully the future course of events. Without some element
of continuity between past, present and future, there would be little
possibility of successful prediction. History is not likely to repeat itself
exactly, and we would hardly expect economic conditions next year or over
the next ten years to follow a clear-cut pattern. Yet past patterns
frequently prevail sufficiently to justify using the past as a basis for
predicting the future.
A businessman cannot afford to base his decisions on guesses. Forecasting
helps a businessman in reducing the areas of uncertainty that surround
management decision making with respect to costs, sales, production,
profits, capital investment, pricing, expansion of production, extension of
credit, development of markets, increase of inventories and curtailment of
loans. These decisions cannot be made off-hand. They are to be based on
present indications of future conditions.
However, it is impossible to forecast the future precisely. Some range of
error is always possible in a forecast. Statistical forecasting methods allow
us to use the mathematical theory of probability to measure the risk of
error in predictions.
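The idea of measuring the risk of error can be sketched in a few lines of
Python. The figures below are hypothetical, and the sketch assumes that past
forecast errors are roughly normally distributed, so that about 95% of errors
lie within 1.96 standard deviations of their mean. It is an illustration under
those assumptions, not a prescribed method.

```python
from statistics import mean, stdev

# Hypothetical past sales and the forecasts that had been made for them
actual = [102, 98, 110, 105, 99, 108]
forecast = [100, 101, 107, 103, 102, 105]

# Forecast error for each past period (actual minus forecast)
errors = [a - f for a, f in zip(actual, forecast)]
bias = mean(errors)      # average amount by which forecasts were too low
spread = stdev(errors)   # sample standard deviation of the errors


def forecast_interval(next_forecast, z=1.96):
    """Approximate 95% range for the next actual value (assumes normal errors)."""
    centre = next_forecast + bias
    return centre - z * spread, centre + z * spread


low, high = forecast_interval(106)
print(f"Point forecast 106, likely range {low:.1f} to {high:.1f}")
```

A wider range signals a riskier forecast; a manager can weigh that range, and
not only the single figure, when taking a decision.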
13.2.2 Prediction, projection and forecasting
A great amount of confusion seems to have grown up in the use of the words
‘forecast’, ‘prediction’ and ‘projection’.

Key Statistic
A prediction is an estimate based solely on past data of the series
under investigation. It is purely mechanical extrapolation.
A projection is a prediction where the extrapolated values are subject
to certain numerical assumptions.
A forecast is an estimate which relates the series in which we are
interested to external factors.

Forecasts are made by estimating future values of the external factors by
means of prediction, projection or forecast and from these values calculating
the estimate of the dependent variable.
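The three terms can be contrasted with a small Python sketch. All figures,
the assumed 5% growth rate and the choice of advertising spend as the
external factor are hypothetical and serve only to illustrate the
definitions.

```python
def fit_line(x, y):
    """Least-squares straight line y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


# Hypothetical yearly sales and an external factor (advertising spend)
sales = [100, 110, 120, 130, 140]
advertising = [10, 12, 14, 16, 18]
years = list(range(len(sales)))          # 0, 1, 2, 3, 4

# Prediction: mechanical extrapolation of the series' own past values
a, b = fit_line(years, sales)
prediction = a + b * 5

# Projection: extrapolation subject to a numerical assumption (5% growth)
assumed_growth = 0.05
projection = sales[-1] * (1 + assumed_growth)

# Forecast: relate sales to the external factor, whose future value
# must itself be estimated first (here assumed to be 21)
c, d = fit_line(advertising, sales)
advertising_next = 21
forecast = c + d * advertising_next

print(prediction, projection, forecast)
```

The three estimates differ because each rests on different information: the
series alone, the series plus an assumption, or the series plus an external
factor.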
13.2.3 Characteristics of business forecasting
Based on past and present conditions
Business forecasting is based on the past and present economic conditions
of the business. To forecast the future, various data, information and facts
concerning the past and present economic conditions of the business are
analysed.
Based on mathematical and statistical methods
The process of forecasting includes the use of statistical and mathematical
methods. By using these methods, the actual trend which may take place in
future can be forecast.
Period
Forecasts can be made for the long term, short term, medium term or any
specific period.
Estimation of future
Business forecasting aims to forecast probable future economic
conditions.
Scope
Forecasting can be physical as well as financial.


13.2.4 Steps in forecasting


The forecasting of business fluctuations consists of the following steps:
Understanding why changes in the past have occurred
One of the basic principles of statistical forecasting is that the forecaster
should use the data on past performance. The current rate and changes in
the rate constitute the basis of forecasting. Once they are known, various
mathematical techniques can develop projections from them. If an attempt is
made to forecast business fluctuations without understanding why past
changes have taken place, the forecast will be purely mechanical.
Forecasts of business fluctuations that are based solely upon the
application of mathematical formulae are subject to serious error.
Determining which phases of business activity must be measured
After understanding the reasons of occurrence of business fluctuations, it is
necessary to measure certain phases of business activity in order to predict
what changes will probably follow the present level of activity.
Selecting and compiling data to be used as measuring devices
There is an interdependent relationship between the selection of statistical
data and the determination of why business fluctuations occur. Statistical
data cannot be collected and analysed in an intelligent manner unless there
is a sufficient understanding of business fluctuations. It is important that
the reasons for business fluctuations be stated in such a manner that it is
possible to secure data that are related to the reasons.
Analysing the data
Lastly, the data are analysed in the light of the understanding of why
change occurs. For example, if it is reasoned that a certain combination of
forces will result in a given change, the statistical part of the problem is to
measure these forces from the data available and to draw conclusions on the
future course of action. The methods of drawing conclusions may be called
forecasting techniques.

13.3 Methods of Business Forecasting


Almost all businessmen make forecasts about the business conditions
related to their business. In recent years scientific methods of forecasting
have been developed. The base of scientific forecasting is statistics. To
handle the increasing variety of managerial forecasting problems, several
forecasting techniques have been developed. Forecasting techniques vary
from simple expert guesses to complex analysis of mass data. Each
technique has its special use, and care must be taken to select the correct
technique for a particular situation.
Before applying a method of forecasting, the following questions should be
answered:
1. What is the purpose of the forecast and how is it to be used?
2. What are the dynamics and components of the system for which the
forecast will be made?
3. How important is the past in estimating the future?
The following are the main methods of business forecasting.
i. Business barometers
ii. Time series analysis
iii. Extrapolation
iv. Regression analysis
v. Modern econometric methods
vi. Exponential smoothing method
13.3.1 Business barometers
Business indices are constructed to study and analyse business activities,
and future conditions are anticipated on the basis of these indices. As
business indices are indicators of future conditions, they are also known as
‘business barometers’ or ‘economic barometers’. With the help of these
business barometers the trend of fluctuations in business conditions is made
known, and by forecasting a decision can be taken relating to the problem.
A business barometer is constructed from series such as gross national
product, wholesale prices, consumer prices, industrial production, stock
prices and bank deposits. These quantities may be converted into relatives
on a certain base. The relatives so obtained may be weighted and their
average computed. The index thus arrived at is the business barometer.
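The construction just described (relatives on a base, then a weighted
average) can be sketched as follows. The component series, their values and
their weights are hypothetical; a real barometer would use published series
and weights chosen by the compiler.

```python
# Hypothetical components: name -> (base-period value, current value, weight)
components = {
    "industrial production": (200, 230, 4),
    "wholesale prices": (150, 165, 3),
    "bank deposits": (80, 100, 3),
}


def business_barometer(components):
    """Weighted average of relatives, with the base period equal to 100."""
    total_weight = sum(weight for _, _, weight in components.values())
    weighted_sum = sum(
        weight * (current * 100 / base)     # relative of each series
        for base, current, weight in components.values()
    )
    return weighted_sum / total_weight


index = business_barometer(components)
print(f"Composite index: {index:.1f}")
```

Here the relatives are 115, 110 and 125, so the composite index stands at
116.5 against a base of 100.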
There are three types of business barometers. They are barometers for:
i. General business activities
ii. Specific business or industry
iii. Individual business firm

Barometers relating to general business activities


Barometers relating to general business activities are also known as general
indices of business activity. They are weighted or composite indices of the
individual indices of business activity. With the help of a general index of
business activity, long-term trends and cyclical fluctuations in the economic
activities of a country are measured. But in some specific cases, the
long-term trends can be different from the general trends. These types of
index help in the formation of a country’s economic policies.
Business barometers for specific business or industry
These barometers are used as the supplement of general index of business
activity and these are constructed to measure the future variations in a
specific business or industry.
Business barometers for an individual business firm
This type of barometer is constructed to measure the expected variations in
a specific individual firm of an industry.
The table 13.1 displays the merits and demerits of business barometers.
Table 13.1: Merits and demerits of business barometers method
Merits | Demerits
The business barometer method is scientific and reliable and used by management for the purpose of various business decisions at different levels. | It is very difficult to construct indices of business activities.
Business barometer method helps in proper forecasting of future trends of a business. | In most of the cases, the business barometers provide inaccurate, incomplete and inconclusive forecasting due to index numbers prepared on the basis of incorrect and inadequate data.
The business barometers are the indicators of future business trends and help to forecast the speed of fluctuations. | The business barometers are the indicators of past conditions and the forecasting based on these conditions may be erroneous.
This method helps to find solutions of various business problems such as development of market, capital investment, exploration of new consumer market and so on. | Separate indices are calculated for individual industry and firm which are entirely different from general indices.


13.3.2 Time series analysis


Time series analysis is also used for business forecasting. Forecasting
through time series analysis is possible only when business data for several
years are available and reflect a definite trend and seasonal variation. By
time series analysis the long-term (secular) trend and the seasonal and
cyclical variations are ascertained, analysed and separated from the data of
the various years.
Table 13.2 lists the merits and demerits of time series analysis.
Table 13.2: Merits and demerits of time series analysis
Merits | Demerits
It is an easy method of forecasting. | This method is expensive, difficult and time-consuming.
By this method a comparative study of variations can be made. | This method deals with past data only.
Reliable results of forecasting are obtained as this method is based on a mathematical model. | This method can only be used when the data for several years are available.
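One simple way of separating the trend from the other variations is a moving
average. The sketch below uses hypothetical yearly figures and a three-year
centred moving average; both the data and the window length are assumptions
made for illustration.

```python
def moving_average(series, window=3):
    """Centred moving average for an odd window length."""
    half = window // 2
    return [
        sum(series[i - half:i + half + 1]) / window
        for i in range(half, len(series) - half)
    ]


# Hypothetical yearly sales figures (not taken from the text)
sales = [10, 14, 12, 16, 20, 18]

# Trend values exist for every year except the first and the last
trend = moving_average(sales, window=3)

# What remains after removing the trend: short-term variation
deviations = [actual - t for actual, t in zip(sales[1:-1], trend)]

print(trend)        # smoothed trend values
print(deviations)   # fluctuations around the trend
```

The trend rises steadily while the deviations swing above and below it, which
is the separation of components that the method aims at.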

13.3.3 Extrapolation
Extrapolation is the simplest method of business forecasting. By
extrapolation, a businessman finds out the possible trend of demand for his
goods and also the future price trends. The accuracy of extrapolation
depends on two factors:
i) Knowledge about the fluctuations of the figures
ii) Knowledge about the course of events relating to the problem under
consideration
Thus, there are two assumptions on which extrapolations are based:
i) There are no sudden jumps in figures from one period to another
ii) There is regularity in fluctuations and the rise and fall is uniform
In extrapolation, we assume that the variable will follow the established
pattern of growth. For the purpose of business forecasting, it is necessary to
determine accurately the appropriate trend curve and the values of its
parameters.


Some of these trend curves are explained below.


Arithmetic trend
The straight line arithmetic trend assumes that growth will be a constant
amount each year.
Semi-log trend
It assumes a constant percentage increase each year. As the annual
increment is constant in logarithm, this line will become a straight line when
drawn on semi-log paper.
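The difference between the arithmetic and the semi-log trend can be seen by
fitting a straight line once to the figures themselves and once to their
logarithms. The sales figures below are hypothetical and were chosen to grow
by 10% a year; the helper function fit_line is an illustrative name, not part
of any library.

```python
import math


def fit_line(x, y):
    """Least-squares straight line y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


years = [0, 1, 2, 3, 4]
sales = [100.0, 110.0, 121.0, 133.1, 146.41]   # hypothetical, 10% growth

# Arithmetic trend: constant amount of growth each year
a0, a1 = fit_line(years, sales)
arithmetic_next = a0 + a1 * 5

# Semi-log trend: straight line through the logarithms,
# that is, constant percentage growth each year
l0, l1 = fit_line(years, [math.log(s) for s in sales])
growth_rate = math.exp(l1) - 1
semilog_next = math.exp(l0 + l1 * 5)

print(f"Arithmetic trend for year 5: {arithmetic_next:.2f}")
print(f"Semi-log trend for year 5:   {semilog_next:.2f}")
print(f"Estimated yearly growth:     {growth_rate:.1%}")
```

For data that grow by a constant percentage, the arithmetic trend
underestimates the next value, while the semi-log trend recovers the growth
rate.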
Modified exponential curve
The curve is given by:
$y = ab^{x}$
This relationship is referred to as an exponential function. It assumes that
each increment of growth will be a constant percent of the previous one.
Logistic curve
This curve has both an upper asymptote and a lower asymptote. A curve of
this type is well suited to describe the growth of industries as they pass
through early periods of experimentation, rapid growth as the product is
perfected and economies of scale make possible price reductions. The
equation of the curve is given by:
$y = \dfrac{1}{ab^{x} + g}$ or $ab^{x} + g = \dfrac{1}{y}$
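The two asymptotes can be checked numerically. The parameter values below are
hypothetical; with a and g positive and b between 0 and 1, the curve rises
from near zero towards the upper limit 1/g.

```python
def logistic(x, a, b, g):
    """Logistic trend value y = 1 / (a * b**x + g)."""
    return 1.0 / (a * b ** x + g)


# Hypothetical parameters: upper asymptote 1/g = 100
a, b, g = 0.09, 0.5, 0.01

for x in (-50, 0, 5, 10, 50):
    print(x, round(logistic(x, a, b, g), 4))
```

At x = 0 the value is 1/(a + g) = 10; for large positive x it approaches 100,
and for large negative x it approaches 0.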

Gompertz curve
It is given by:
$y = ab^{c^{x}}$
In the logarithmic form, it is given by:
$\log y = \log a + c^{x} \log b$
To decide the curve to be used, it is helpful to obtain a scatter diagram of
the transformed variable.
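As an illustration of working with a transformed variable, the sketch below
takes logarithms of values that follow the standard Gompertz form
y = a * b**(c**x) and checks that successive differences of log y shrink in
the constant ratio c. The parameter values are hypothetical.

```python
import math


def gompertz(x, a, b, c):
    """Gompertz trend value y = a * b ** (c ** x)."""
    return a * b ** (c ** x)


# Hypothetical parameters: upper limit a = 1000, with 0 < b < 1 and 0 < c < 1
a, b, c = 1000.0, 0.1, 0.6

# Transformed variable: log y = log a + (c ** x) * log b
log_y = [math.log10(gompertz(x, a, b, c)) for x in range(6)]

# First differences of log y, and the ratios of successive differences
diffs = [log_y[i + 1] - log_y[i] for i in range(len(log_y) - 1)]
ratios = [diffs[i + 1] / diffs[i] for i in range(len(diffs) - 1)]

print([round(r, 4) for r in ratios])
```

If the ratios computed from real data are roughly constant, a Gompertz curve
is a reasonable candidate; if the differences themselves are roughly
constant, a semi-log straight line is simpler.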

The table 13.3 lists the merits and demerits of extrapolation method.

Table 13.3: Merits and demerits of extrapolation method
Merits | Demerits
This method is very useful to forecast the future demand and production. | This method can be used under its own assumptions only.
This method is widely used for the forecasting of business events because it is a simple method. | This method is not simple but technical, because of its mathematical formulation.
We get pure and reliable results by this method, because it is a mathematical method. | The selection of trend curve is very difficult.

13.3.4 Regression analysis


The regression approach offers many valuable contributions to the solution
of the forecasting problem. It is the means by which we select, from among
the many possible relationships between variables in a complex economy,
those which will be useful for forecasting.
A regression relationship may involve one predicted or dependent variable
and one independent variable (simple regression), or it may involve
relationships between the variable to be forecast and several independent
variables (multiple regression).
Statistical techniques to estimate regression equations are often fairly
complex and time-consuming. However, there are many computer programs
now available that estimate simple and multiple regressions quickly.
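A simple regression forecast needs only a few lines of code. The sketch below
fits a least-squares line to hypothetical price and demand figures and uses
it to estimate demand at a new price; the data and the function name
simple_regression are assumptions made for illustration.

```python
def simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data: unit price (independent) and units demanded (dependent)
price = [10, 12, 14, 16, 18]
demand = [95, 86, 74, 66, 54]

intercept, slope = simple_regression(price, demand)

# Forecast of demand if the price is raised to 20
demand_at_20 = intercept + slope * 20

print(f"demand = {intercept:.1f} + ({slope:.1f}) * price")
print(f"Forecast demand at price 20: {demand_at_20:.1f}")
```

The fitted slope of about -5.1 says that, in these figures, each unit rise in
price goes with a fall of about 5 units in demand.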
13.3.5 Modern econometric methods
Econometric techniques, which originated in the eighteenth century, have
recently gained in popularity for forecasting. The term „econometrics‟ refers
to the application of mathematical economic theories and statistical
procedures to economic data in order to verify economic theorems. Models
take the form of a set of simultaneous equations. The values of the
constants in such equations are supplied by a study of statistical time series,
and a large number of equations may be necessary to produce an adequate
model.
At the present time, most short-term forecasting uses only statistical
methods with little qualitative information. However, in the years to come
when most large companies develop and refine econometric models of their
major business, this tool of forecasting will become more popular.


The table 13.4 lists the merits and demerits of modern econometric
methods.
Table 13.4: Merits and demerits of modern econometric methods
Merits | Demerits
Accurate and reliable results are obtained under this method. | This method is difficult and complicated.
It is a scientific method where computer technology is used. | This method can be used only when adequate series of data is available.
This method explains in detail and in quantitative terms the way in which various aspects of the economy are interrelated. | It is very difficult to construct growth model for every business activity.

13.3.6 Exponential smoothing method


This method is often regarded as the best method of business forecasting as
compared to other methods. Exponential smoothing is a special kind of
weighted average in which exponentially greater weights are assigned to
more recent observations. It is found extremely useful in short-term
forecasting of inventories and sales.
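A minimal sketch of simple exponential smoothing is given below. The sales
figures and the smoothing constant alpha = 0.5 are hypothetical, and starting
the smoothed series at the first observation is one common convention among
several.

```python
def exponential_smoothing(series, alpha):
    """Smoothed value = alpha * latest observation + (1 - alpha) * previous smoothed value."""
    smoothed = [series[0]]          # assumed start: first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical monthly sales
sales = [20, 24, 22, 26]
alpha = 0.5

smoothed = exponential_smoothing(sales, alpha)
next_forecast = smoothed[-1]        # forecast for the next period

# Implied weights on the latest, previous and earlier observations
weights = [alpha * (1 - alpha) ** k for k in range(3)]

print(smoothed)
print(next_forecast)
print(weights)
```

The weights fall away geometrically with the age of the observation, so the
most recent data dominate the forecast.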

Selection of different methods of forecasting


The selection of an appropriate forecasting method depends on many
factors, such as:
• Context of the forecast
• Relevance and availability of historical data
• Degree of accuracy desired
• Time period for which forecasts are required
• Cost benefit of the forecast to the company
• Time available for making the analysis
The forecaster should use a technique that makes the best use of
available data. Where a company wishes to forecast with reference to a
particular product, it must consider the stage of the product’s life cycle.


13.4 Theories of Business Forecasting


There are a few theories that are followed while making business forecasts.
Some of them are:
i. Sequence or time-lag theory
ii. Action and reaction theory
iii. Economic rhythm theory
iv. Specific historical analogy
v. Cross-cut analysis theory
13.4.1 Sequence or time-lag theory
This is the most important theory of business forecasting. It is based on the
assumption that most of the business data have the lag and lead
relationships, that is, changes in business are successive and not
simultaneous. There is time-lag between different movements.

Example 1
When the government makes use of deficit financing, it leads to
inflationary pressures; the purchasing power of people goes up. Therefore,
wholesale prices and then retail prices start rising. With the rise in retail
prices, the cost of living goes up and with it there is a demand for
increased wages. Thus, one factor, that is, more money in circulation,
has affected various fields of economic activity not simultaneously but
successively.
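One way to look for such a lead and lag relationship in data is to correlate
one series with later values of another and see which time-lag gives the
closest agreement. The two series below are hypothetical and were constructed
so that the second follows the first by two periods.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def best_lag(leader, follower, max_lag):
    """Correlate leader[t] with follower[t + lag] for each lag; return the best lag."""
    scores = {}
    for lag in range(max_lag + 1):
        x = leader[:len(leader) - lag]
        y = follower[lag:]
        scores[lag] = correlation(x, y)
    return max(scores, key=scores.get), scores


# Hypothetical index numbers: money in circulation and retail prices
money = [1, 3, 2, 5, 4, 6, 8, 7]
retail_prices = [11, 13, 12, 16, 14, 20, 18, 22]

lag, scores = best_lag(money, retail_prices, max_lag=3)
print(f"Retail prices follow money supply with a lag of {lag} periods")
```

A high correlation at a given lag is only suggestive; whether one series
really causes the other still has to be judged from economic reasoning.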

The table 13.5 lists the merits and demerits of sequence or time-lag theory.
Table 13.5: Merits and demerits of sequence or time-lag theory
Merits | Demerits
This method is largely used for business forecasting because of the accuracy. | This method studies only the action not the reaction.
Though this theory is based on statistical techniques, yet it is easy to understand. | This method cannot be regarded as accurate because by using statistical techniques the results can be up to the truth but not an accurate one.
Time-interval between two events can be ascertained. |
Government can use this technique for the purpose of economic stability of the economy by exercising control over possible losses. |
13.4.2 Action and reaction theory
This theory is based on the following two assumptions.
• Every action has a reaction
• Magnitude of the original action influences the reaction
Thus, if the price of rice has gone up above a certain level in a certain
period, there is a likelihood that after some time it will go down below the
normal level. According to this theory, a certain level of business activity
is normal, and abnormal conditions above or below it cannot remain so for
ever. Thus, we find four phases of a business cycle. They are:
i. Prosperity
ii. Decline
iii. Depression
iv. Improvement
The table 13.6 lists the merits and demerits of action and reaction theory.
Table 13.6: Merits and demerits of action and reaction theory
Merits | Demerits
This is better than other theories. | The determination of normal level is very difficult.
By this theory more reliable results can be obtained because this theory gives attention to action and reaction of an event. | It is not necessary that reaction is equal to the action.

13.4.3 Economic rhythm theory


The basic assumption of this theory is that history repeats itself, and hence
that all economic and business events behave in a rhythmic order.
According to this theory, the speed and time of all business cycles are more
or less the same, and by using statistical and mathematical methods a trend
is obtained which represents a long-term tendency of growth or decline. This
is done on the assumption that the trend line denotes the normal growth or
decline of business events.
The table 13.7 lists the merits and demerits of economic rhythm theory.

Table 13.7: Merits and demerits of economic rhythm theory
Merits | Demerits
Forecasting is made on the basis of past conditions, hence they are more reliable. | The business events are not strictly periodic and prediction of business cycle on the basis of statistical method is not satisfactory.
This method is helpful in long-term forecasting. | Past conditions are given more weightage than the present conditions.

13.4.4 Specific historical analogy


‘History repeats itself’ is the main foundation of this theory. If conditions
are the same, whatever happened in the past under a set of circumstances is
likely to happen in future also. A time series relating to the data in question
is thoroughly scrutinised, and from it a period is selected in which
conditions were similar to those prevailing at the time of making the
forecast. The method is therefore largely dependent on past data. Table 13.8
lists the merits and demerits of specific historical analogy.
Table 13.8: Merits and demerits of specific historical analogy
Merits | Demerits
It is an easy method. | In this theory, the forecasting is based on guess work, not on a scientific method because the past and present conditions are rarely found to be similar.
As the future is forecasted on the basis of past business conditions, the forecasting is more reliable. | It is very difficult to select the past period with the same business conditions like present.

13.4.5 Cross-cut analysis theory


This theory proceeds on the analysis of the interplay of current economic
forces. In this method, the combined effects of various factors are not
studied. The effect of each factor is studied independently. Under this
theory, forecasting is made on the basis of analysis and interpretation of
present conditions, on the view that past events have little relevance to
present conditions. Table 13.9 lists the merits and demerits of cross-cut
analysis theory.

Table 13.9: Merits and demerits of cross-cut analysis theory
Merits | Demerits
Present conditions are preferred than past. | Independent analysis of individual facts is very difficult.
The effect of each factor is studied independently. | Past facts are equally important for the purpose of forecasting, but in this method no weightage is given to past facts.
Forecast is nearer to the accuracy as it is based on present conditions. | The forecasting made on the basis of this technique cannot be regarded as reliable.

13.5 Utility of Business Forecasting


Business forecasting occupies an important place in every field of the
economy. Business forecasting helps businessmen and industrialists to
form the policies and plans related to their activities. On the basis of
forecasting, the businessman can estimate the demand for the product, the
price of the product, the condition of the market and so on. Business
decisions can also be reviewed on the basis of business forecasting.
13.5.1 Advantages of business forecasting
Helpful in increasing profit and reducing losses
Every business is carried out with the purpose of earning maximum profits.
By forecasting the future price of the product and its demand, the
businessman can determine in advance the production cost, the volume of
production and the level of stock to be held. Thus, business forecasting is
regarded as the key to success in business.
Helpful in taking management decisions
Business forecasting provides the basis for management decisions,
because in present times the management has to take the decision in the
atmosphere of uncertainties. Also, the business forecasting explains the
future conditions and enables the management to select the best
alternative.
Useful to administration

On the basis of forecasting, the government can control the circulation of
money. It can also modify the economic, fiscal and monetary policies to
avoid the adverse effects of trade cycles. So, with the help of forecasting,
the government can control the expected fluctuations in future.
Basis for capital market
The business forecasting helps in estimating the requirement of capital,
position of stock exchange and the nature of investors.
Useful in controlling the business cycles
Trade cycles cause various disturbances in business such as sudden
changes in the price level, increase in the risk of business, increase in
unemployment and so on. By adopting systematic business forecasting,
businessmen and the government can handle and moderate the adverse
effects of trade cycles.
Helpful in achieving the goals
Business forecasting helps to achieve business goals through proper
planning of business improvement activities.
Facilitates control
By business forecasting, the tendency of black marketing, speculation,
uneconomic activities and corruption can be controlled.
Utility to society
With the help of business forecasting the entire society is also benefited
because the adverse effects of fluctuations in the conditions of business are
kept under control.
13.5.2 Limitations of business forecasting
Business forecasting cannot be perfectly accurate because of the
limitations mentioned below.
i. A forecast cannot be exact, because it concerns future events and
there is no guarantee that they will happen as expected.
ii. Business forecasting is generally carried out using statistical and
mathematical methods. But these methods cannot claim to make an
uncertain future certain.
iii. The underlying assumptions of business forecasting may not all be
satisfied simultaneously. In such a case, the results of forecasting will
be misleading.
iv. Forecasting cannot guarantee the elimination of errors and mistakes.
The managerial decision will be wrong if the forecasting is done in a
wrong way.
v. Factors responsible for economic changes are often difficult to discover
and to measure. Hence, business forecasting can become an unreliable
exercise.
vi. Business forecasting does not by itself evaluate business risks.
vii. Forecasting is made on the basis of past information and data and
relies on the assumption that economic events are repeated under the
same conditions. But there may be circumstances where these
conditions are not repeated.
viii. Forecasting is not a one-time exercise. In order to be effective, it
requires continuous attention.

Self Assessment Questions


1. State whether the following statements are ‘True’ or ‘False’.
i. Forecast is an estimate based solely on past data of the series
under investigation.
ii. In time series analysis method a comparative study of variations
can be made.
iii. In exponential smoothing, old observations are given increasing
exponential weightage.

13.6 Summary
In this unit, you have studied about the theory behind business forecasting
and the objectives of forecasting. The steps involved in forecasting the
trends and different forecasting methods available are also studied. Finally
we have ended the unit by explaining the advantages and limitations of
business forecasting.

13.7 Terminal Questions

1. What is business forecasting?
2. Explain the objectives of business forecasting.
3. Give the names of theories of business forecasting.
4. Explain the characteristics of business forecasting.
5. Differentiate between prediction, projection and forecasting.
6. Describe the limitations of business forecasting.
7. Give any two criteria that can be used for choosing the suitable method
of forecasting.
8. Critically examine the important theories of business forecasting.

13.8 Answers to SAQs and TQs

Answers to self assessment questions


1. i. False
ii. True
iii. False

Answers to terminal questions


1. Refer to section 13.2
2. Refer to section 13.2.1
3. Refer to section 13.4
4. Refer to section 13.2.3
5. Refer to section 13.2.2
6. Refer to section 13.5.2
7. Refer to section 13.3
8. Refer to section 13.4

13.9 References
• Richard I. Levin and David S. Rubin (2008), Statistics for Management,
Seventh Edition, PHI Learning Private Limited


Unit 14 Time Series Analysis


Structure:
14.1 Introduction
Learning Objectives
14.2 Time Series Analysis
14.3 Utility of the Time Series
14.4 Components of Time Series
Long term trend or secular trend
Seasonal variations
Cyclic variations
Random variations
14.5 Methods of Measuring Trend
Free hand or graphic method
Semi-average method
Method of moving averages
Method of least squares
14.6 Mathematical Models for Time Series
Additive model
Multiplicative model
14.7 Editing of Time Series
14.8 Measurement of Seasonal Variation
Seasonal average method
Seasonal variation through moving averages
Chain or link relative method
Ratio to trend method
14.9 Forecasting Methods Using Time Series
Mean forecast
Naive forecast
Linear trend forecast
Non-linear trend forecast
Forecasting with exponential smoothing
14.10 Summary
14.11 Terminal Questions
14.12 Answers to SAQs and TQs
Answers to Self Assessment Questions
Answers to Terminal Questions
14.13 References

14.1 Introduction
In Unit 13, ‘Business Forecasting’, you studied how business events can be
forecast and the different methods available for forecasting. In this unit,
‘Time Series Analysis’, you will study time series analysis, the components
of a time series and the forecasting methods that use time series.
A time series is a set of numerical values of a given variable listed at
successive intervals of time. That is, the data regarding the variable is listed
in chronological order. Usually the interval of time is taken as uniform.
Yearly production of wheat in the country, hourly temperature of a city,
bimonthly electricity bills are all examples of time series. Almost all the data
like industrial production, agricultural production, exports, imports, dairy
products can be arranged in chronological order.
14.1.1 Learning Objectives
By the end of this unit, you should be able to:
• Analyse the time series
• Describe different components of time series
• Describe the forecasting methods
• Apply time series analysis in business scenarios

14.2 Time Series Analysis


Given a time series, we wish to study the forces that influence its
variations and the behaviour of the phenomenon over the given period of
time. For example, consider the sales of TV sets (in thousands) by a
manufacturing company. Table 14.1 shows the number of TV sets sold from
1995 to 2000.
Table 14.1: Sales data of TV sets sold from 1995 to 2000
Year                                    1995  1996  1997  1998  1999  2000
Number of TV sets sold (in thousands)     12    14    16    12    10    18

We would like to analyse the above data and identify trends in the sales.
For example, the company would like to know why the sales dropped in 1998
and 1999, and why they then increased. That is, the company would like to
analyse the various forces that affect the sales.
There can be changes in the values of the variable recorded over different
points of time due to various forces. Analysing the effect of all such forces
on the values of the variable is generally known as the analysis of time
series. Broadly, there can be four types of changes in the values of the
variable as discussed below:
i) Changes which generally occur due to general tendency of the data
to increase or decrease
ii) Changes which occur due to change in climate, weather conditions,
festivals
iii) Changes which occur due to booms and depressions
iv) Changes which occur due to some unpredictable forces like floods,
famines, earthquakes

14.3 Utility of the Time Series


The following are the possible uses of the time series.
i. The comparative study of behaviour of the variable over different
periods of time can be done. The variable may be export figures,
quantity of industrial production and so on.
ii. Forecasting can be done using the time series. By studying the
variations and other behaviour of the variables over a sufficiently long
period of time, it may be possible to forecast the future behaviour of the
variables. However, such a forecast has meaning only if the period of
forecast is a normal period. For example, various five-year plans by the
Government of India are formulated by studying the time series and
forecasting.
iii. Study of the time series helps in analysing the past behaviour of the
variables. This helps in identifying the various forces that affect their
behaviour.

14.4 Components of Time Series


The behaviour of a time series over periods of time is called the movement
of the time series. The time series is classified into the following four
components:

i) Long term trend or secular trend


ii) Seasonal variations
iii) Cyclic variations
iv) Random variations
14.4.1 Long term trend or secular trend
This refers to the smooth or regular long term growth or decline of the
series. This movement can be characterised by a trend curve. If this curve is
a straight line, then it is called a trend line. If the variable is increasing over a
long period of time, then it is called an upward trend. If the variable is
decreasing over a long period of time, then it is called a downward trend. If
the variable moves upward or downwards along a straight line then the
trend is called a linear trend, otherwise it is called a non-linear trend.
14.4.2 Seasonal variations
Variations in a time series that are periodic in nature and occur regularly
over short periods of time during a year are called seasonal variations. By
definition, these variations are precise and can be forecasted.
The following are examples of seasonal variations in a time series.
i. The prices of vegetables drop down after rainy season or in winter
months and they go up during summer, every year.
ii. The prices of cooking oils reduce after the harvesting of oil seeds and
go up after some time.
14.4.3 Cyclic variations
The long-term oscillations that represent consistent rises and declines in the
values of the variable are called cyclic variations. Since these are long-term
oscillations in the time series, the period of oscillation is usually greater than
one year. The oscillations are about a trend curve or a trend line. The period
of one cycle is the time-distance between two successive peaks or two
successive troughs.
14.4.4 Random variations
Random variations are also called irregular movements. These are movements
that usually occur over brief periods of time, follow no pattern and are
unpredictable in nature. They do not have any regular period or time of
occurrence. Examples are the effects of national strikes, floods and
earthquakes. It is very difficult to study the behaviour of such movements
in a time series.

14.5 Methods of Measuring Trend


We will be studying the following methods of measuring the trend of a time
series:
i. Free hand or graphic methods
ii. Semi averages method
iii. Moving average method
iv. Method of least squares
14.5.1 Free hand or graphic method
This is the simplest method of drawing a trend curve. We plot the values of
the variable against time on a graph paper and join these points. The trend
line is then fitted by inspecting the graph of the time series. Fitting a trend
line by this method is arbitrary. The trend line is drawn such that the
numbers of fluctuations on either side are approximately the same. The
trend line should be a smooth curve.
The free hand method has the following disadvantages.
i. It depends on individual judgement
ii. It cannot be used for any predictions of trends, as drawing the trend
curve is arbitrary

Solved Problem 1: Find trend with the help of free hand curve method for
the data given in table 14.2:
Table 14.2: Production data from 1991 to 2001
Year Production Data (in Lakh ton)
1991 15
1992 18
1993 16
1994 22
1995 19
1996 24
1997 20
1998 28
1999 22
2000 30
2001 26

Solution: The figure 14.1 represents free hand curve of the production data
versus the time period. In the graph, we have taken production data values
on Y-axis and values of time on X-axis.

Fig. 14.1: Free hand curve for solved problem 1

14.5.2 Semi-average method


The procedure for fitting a linear trend by the semi-average method is as
follows:
i. When the number of years is even, the data of the time series are
divided into two equal parts. The items in each part are totalled and
the total is divided by the number of items to obtain the arithmetic
mean of each part. Each average is then centred on the period of time
from which it has been computed and plotted on the graph paper.
A straight line is drawn passing through these two points. This is the
required trend line.
ii. When the number of years is odd, the value of the middle year is
omitted to divide the time series into two equal parts. Then the
procedure described in ‘i’ is followed.
A trend value for any future year may be predicted by multiplying the
periodic increment by the number of years into the future that is desired
and adding the result to the last trend value listed in the series.
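
The semi-average procedure above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the data are the production figures of table 14.2 (11 years, so the middle year 1996 is left out).

```python
def semi_average_trend(years, values):
    """Return (slope, x1, y1): yearly increment and first semi-average point."""
    n = len(years)
    half = n // 2
    # With an odd number of years the middle year falls in neither half.
    first_years, first_vals = years[:half], values[:half]
    second_years, second_vals = years[n - half:], values[n - half:]

    # Each semi-average is centred on the mid-point of its own period.
    x1 = sum(first_years) / half
    y1 = sum(first_vals) / half
    x2 = sum(second_years) / half
    y2 = sum(second_vals) / half

    slope = (y2 - y1) / (x2 - x1)  # periodic (yearly) increment
    return slope, x1, y1


def trend_value(year, slope, x1, y1):
    """Trend value read off the straight line through the two semi-averages."""
    return y1 + slope * (year - x1)


years = list(range(1991, 2002))
production = [15, 18, 16, 22, 19, 24, 20, 28, 22, 30, 26]  # table 14.2

slope, x1, y1 = semi_average_trend(years, production)
print(round(slope, 2))                              # yearly increment
print(round(trend_value(2003, slope, x1, y1), 2))   # estimate for 2003
```

Here the two semi-averages are 18 (centred on 1993) and 25.2 (centred on 1999), so the trend rises by 1.2 per year.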

The merits and demerits of semi-average method are represented in
table 14.3.
Table 14.3: Merits and demerits of semi-average method
Merits | Demerits
The semi-average method is simple. | The method assumes a straight line relationship between the plotted points, regardless of whether such a relationship exists or not.
The trend line can be extended on either side in order to obtain past or future estimates. | This method has the in-built limitation of the arithmetic mean. It is not suitable in case of very low or very large extreme values.
This is an objective method, as anyone applying this method gets the same trend line. | There is no assurance that the influence of the cycle is eliminated.

14.5.3 Method of moving averages


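The method of moving averages is used for smoothing a time series. It
smooths out the fluctuations of the data by replacing each value with the
average of a fixed number of consecutive values.
When period of moving average is odd
To determine the trend by this method, the procedure is described in
figure 14.2. In brief, the values of the first period (for example, the
first three years for a 3-yearly moving average) are totalled and the total
is placed against the middle year of that period. The first value is then
dropped and the next value is added to obtain the next moving total, and so
on. Each moving total divided by the period gives the moving average, which
is the trend value for the middle year.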

Fig. 14.2: Procedure for determining the trend when moving average is odd

By plotting these trend values (if desired) you can obtain the trend curve
with the help of which you can determine the trend whether it is increasing
or decreasing. If needed, you can also compute short-term fluctuations by
subtracting the trend values from the actual values.

Solved Problem 2: Calculate the 3 yearly and 5 yearly averages of the data
in table 14.4.
Table 14.4: Production data from 1988 to 1997
Year                       1988  1989  1990  1991  1992  1993  1994  1995  1996  1997
Production (in lakh tons)    15    18    16    22    19    24    20    28    22    30

Solution: The table 14.5 displays the calculated values of 3 yearly and 5
yearly averages.
Table 14.5: Calculated values of 3 yearly and 5 yearly averages
Year  Production Y        3-yearly       3-yearly moving   Short-term fluctuation
      (thousand tonnes)   moving total   average Yc        (Y - Yc)
1988  21                  -              -                 -
1989  22                  66             22.00              0.00
1990  23                  70             23.33             -0.33
1991  25                  72             24.00              1.00
1992  24                  71             23.67              0.33
1993  22                  71             23.67             -1.67
1994  25                  74             24.67              0.33
1995  27                  78             26.00              1.00
1996  26                  -              -                 -
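
As a cross-check of the 3-yearly and 5-yearly averages asked for in Solved Problem 2, the calculation can be sketched in Python. This is a minimal sketch using the production figures of table 14.4; the function name `moving_average` is hypothetical, and years for which no moving average exists are shown as "-".

```python
def moving_average(values, period):
    """Moving averages for an odd period, placed against the middle time point."""
    if period % 2 == 0:
        raise ValueError("this sketch handles odd periods only")
    k = period // 2
    out = [None] * len(values)          # no value for the first and last k points
    for i in range(k, len(values) - k):
        window = values[i - k:i + k + 1]
        out[i] = sum(window) / period
    return out


def fmt(v):
    return "-" if v is None else f"{v:.2f}"


years = list(range(1988, 1998))
production = [15, 18, 16, 22, 19, 24, 20, 28, 22, 30]  # table 14.4

ma3 = moving_average(production, 3)
ma5 = moving_average(production, 5)
for year, y, m3, m5 in zip(years, production, ma3, ma5):
    print(year, y, fmt(m3), fmt(m5))
```

For these data the 3-yearly averages for 1989 to 1996 are 16.33, 18.67, 19.00, 21.67, 21.00, 24.00, 23.33 and 26.67, and the 5-yearly averages for 1990 to 1995 are 18.0, 19.8, 20.2, 22.6, 22.6 and 24.8.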

When period of moving averages is even


When period of moving average is even (such as 4 years), we compute the
moving averages by using the steps described in figure 14.3.

Fig. 14.3: Procedure for determining the trend when moving average is even
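
When the period is even, each moving average falls between two time points, so the averages are centred by averaging adjacent pairs. The following Python sketch illustrates this under that assumption; the function name is hypothetical and the data are again the production figures of table 14.4 with a 4-yearly period.

```python
def centred_moving_average(values, period):
    """Centred moving averages for an even period (None where undefined)."""
    if period % 2 != 0:
        raise ValueError("this sketch handles even periods only")
    n = len(values)
    # Step 1: ordinary moving averages; each falls between two time points.
    raw = [sum(values[i:i + period]) / period for i in range(n - period + 1)]
    # Step 2: centre them by averaging each adjacent pair.
    centred = [(raw[i] + raw[i + 1]) / 2 for i in range(len(raw) - 1)]
    # The first centred value lines up with index period // 2.
    k = period // 2
    return [None] * k + centred + [None] * k


years = list(range(1988, 1998))
production = [15, 18, 16, 22, 19, 24, 20, 28, 22, 30]  # table 14.4

cma4 = centred_moving_average(production, 4)
for year, value in zip(years, cma4):
    print(year, "-" if value is None else value)
```

With a 4-yearly period, no centred value exists for the first two and the last two years.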

The table 14.6 lists the merits and demerits of the moving averages method.

Table 14.6: Merits and demerits of moving averages method

Merits | Demerits
This is a simple method. | There is no functional relationship between the values and the time. Thus, this method is not helpful in forecasting and predicting the values on the basis of time.
This method is objective in the sense that anybody working on a problem with this method will get the same results. | There are no trend values for some years in the beginning and some in the end. For example, for a 5-yearly moving average, there will be no trend values for the first two years and the last two years.
This method is used for determining seasonal, cyclic and irregular variations besides the trend values. | In case of non-linear trend, the values obtained by this method are biased in one or the other direction.
This method is flexible enough to add more figures to the data because the entire calculations are not changed. | The period selection of moving average is a difficult task. Hence, great care has to be taken in period selection, particularly when there is no business cycle during that time.
If the period of moving averages coincides with the period of cyclic fluctuations in the data, such fluctuations are automatically eliminated. |
14.5.4 Method of least squares
Under this method, the trend curve is determined by fitting a mathematical
equation. This method is more accurate and precise and can be used even
for forecasting. We can fit either a straight line or a parabolic curve from the
given data by this method.
Key Statistic
Let ‘y’ denote the actual values and ‘yc’ the computed values of ‘y’
for a given value of ‘x’.
Let ‘y = a + bx’ be the straight line to be fitted for the trend. The
method of least squares finds the values of ‘a’ and ‘b’ such that the
sum of squares of the differences between the actual and computed
values of ‘y’ is least, that is,
$$\sum (y - y_c)^2 \text{ is least,}$$
where the condition
$$\sum (y - y_c) = 0$$
is satisfied. The line obtained by this method is known as the ‘line of
best fit’.
For a given time series data, to find a linear trend, the values of ‘a’ and ‘b’
are obtained by solving the normal equations:
$$\sum y = Na + b \sum x$$
$$\sum xy = a \sum x + b \sum x^2$$
where N is the number of pairs for which data are given. Here ‘a’ is the
intercept of the line on the y-axis and ‘b’ is the slope of the line. ‘b’ is
also known as the growth rate (if b > 0) or decline rate (if b < 0). ‘b’ gives
the change in the value of ‘y’ per unit change in the value of ‘x’.
Direct method
The procedure to be followed is described below.
i) Convert the years into natural numbers (1, 2, 3, ...), denote them by ‘x’
and find $\sum x$.
ii) Find the squares of the ‘x’ values and obtain $\sum x^2$.
iii) Multiply the x-values by the corresponding y-values and obtain $\sum xy$.
iv) Add the values of y to obtain $\sum y$.
v) Put these values in the two normal equations and solve for ‘a’ and ‘b’.
vi) Substitute these values of ‘a’ and ‘b’ in ‘y = a + bx’ and then find trend
values for various values of ‘x’.
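
The six steps of the direct method can be sketched in Python as follows. This is an illustrative sketch only: the function name is hypothetical, and the data are the yearly production figures of table 14.9 with the years 1990 to 1996 coded as x = 1, 2, ..., 7.

```python
def fit_linear_trend(x, y):
    """Solve the two normal equations for a and b in y = a + bx."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Normal equations:  sum_y  = n*a     + b*sum_x
    #                    sum_xy = a*sum_x + b*sum_x2
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


production = [80, 90, 92, 83, 94, 99, 92]    # table 14.9, 1990 to 1996
x = list(range(1, len(production) + 1))      # step i: years as 1, 2, 3, ...

a, b = fit_linear_trend(x, production)
trend = [a + b * xi for xi in x]             # step vi: trend values
print(a, b)     # 82.0 2.0
print(trend)    # [84.0, 86.0, 88.0, 90.0, 92.0, 94.0, 96.0]
```

With this coding of the years the fitted line is y = 82 + 2x, which gives the trend values 84, 86, ..., 96.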
Short cut method
Measure the variable ‘x’ from any point of time as origin, such as the
first year. The calculations are simplified when the mid-point in time is
taken as the origin, so that:
$$\sum x = 0$$
When $\sum x = 0$, the normal equations reduce to:
$$\sum y = Na, \quad \text{therefore } a = \frac{\sum y}{N}$$
$$\sum xy = b \sum x^2, \quad \text{therefore } b = \frac{\sum xy}{\sum x^2}$$

The merits and demerits of method of least squares are displayed in
table 14.7.
Table 14.7: Merits and demerits of direct method of least squares
Merits | Demerits
This method is a completely objective method. | It requires many calculations and is tedious and complicated.
This method gives the trend values for the entire time period. | If even a single item is added to the series, a new equation has to be formed.
This method can be used to forecast future trend because the trend line establishes a functional relationship between the value and the time. | Future forecasts made by this method are based only on trend values. Seasonal, cyclical or irregular variations are ignored.
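
The short cut method described earlier in this section, in which the mid-point in time is taken as the origin, can be sketched in Python as follows. The function name is hypothetical, the sketch assumes an odd number of years, and the data are the production figures of table 14.9 (origin at 1993).

```python
def fit_trend_shortcut(y):
    """Least squares line with the middle period as origin, so that sum(x) = 0."""
    n = len(y)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of periods")
    mid = n // 2
    x = [i - mid for i in range(n)]          # ..., -2, -1, 0, 1, 2, ...
    a = sum(y) / n                                               # a = sum(y) / N
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return a, b, x


production = [80, 90, 92, 83, 94, 99, 92]    # table 14.9, 1990 to 1996

a, b, x = fit_trend_shortcut(production)
trend = [a + b * xi for xi in x]
print(a, b)     # 90.0 2.0
print(trend)    # [84.0, 86.0, 88.0, 90.0, 92.0, 94.0, 96.0]
```

Only the intercept depends on the choice of origin: the slope and the trend values are the same as those obtained by the direct method.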

Non-linear trend
When the time series data do not conform to a linear trend, we fit a
non-linear trend. We do so by fitting a parabolic curve or another
non-linear curve by the method of least squares. For this we use an
equation of the form
$$y = a + bx + cx^2 + dx^3 + \dots + kx^n$$
which is known as a polynomial of degree ‘n’ in ‘x’, k ≠ 0.
Let the parabolic curve be
$$y = a + bx + cx^2$$
with the usual notation. The values of a, b and c can be determined by
solving the normal equations:
$$\sum y = Na + b \sum x + c \sum x^2$$
$$\sum xy = a \sum x + b \sum x^2 + c \sum x^3$$
$$\sum x^2 y = a \sum x^2 + b \sum x^3 + c \sum x^4$$
If we shift the origin to a suitable point such that $\sum x = 0$ (and
hence $\sum x^3 = 0$ for equally spaced periods), then the normal equations
reduce to:
$$\sum y = Na + c \sum x^2$$
$$\sum xy = b \sum x^2$$
$$\sum x^2 y = a \sum x^2 + c \sum x^4$$
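
The reduced normal equations for the parabolic trend can be solved directly. The Python sketch below is illustrative only: the function name is hypothetical, the series is made-up data generated from y = 10 + 2x + x² so that the result is easy to verify, and an odd number of equally spaced periods is assumed.

```python
def fit_parabolic_trend(y):
    """Fit y = a + b*x + c*x**2 with the middle period as origin."""
    n = len(y)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of periods")
    mid = n // 2
    x = [i - mid for i in range(n)]          # sum(x) = 0 and sum(x**3) = 0

    sum_y = sum(y)
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_x4 = sum(xi ** 4 for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2y = sum(xi ** 2 * yi for xi, yi in zip(x, y))

    b = sum_xy / sum_x2                       # from  sum_xy = b*sum_x2
    # Solve  sum_y = n*a + c*sum_x2  and  sum_x2y = a*sum_x2 + c*sum_x4
    c = (n * sum_x2y - sum_x2 * sum_y) / (n * sum_x4 - sum_x2 ** 2)
    a = (sum_y - c * sum_x2) / n
    return a, b, c


series = [10, 9, 10, 13, 18]   # hypothetical data: 10 + 2x + x**2 for x = -2..2
a, b, c = fit_parabolic_trend(series)
print(a, b, c)   # 10.0 2.0 1.0
```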
Self Assessment Questions


1. State whether the following statements are ‘True’ or ‘False’
i) ‘The prices of cooking oils reduce after the harvesting of oil seeds
and go up after some time’ is an example of cyclic variations in a
time series.
ii) The effect of national strikes, floods, earthquakes are examples of
random variations in time series.

14.6 Mathematical Models for Time Series


The following are the two models commonly used for the decomposition of a
time series into its components.
• Additive model
• Multiplicative model
Most of the time series relating to economic and business phenomena
conform to the multiplicative model. In practice, the additive model is
rarely used.
14.6.1 Additive model
Key Statistic
The additive model assumes that the observed value is the sum of the four
components of the time series, that is,
Y = T + S + C + I
where,
• Y = original data
• T = trend value
• S = seasonal component
• C = cyclical component
• I = irregular component
The additive model for decomposition of time series assumes that all the
four components of the time series operate independently of one another. It
also assumes that the behaviour of components is of an additive character.
It is to be noted that only absolute values are added or deducted from the
trend value to arrive at the observed value.

14.6.2 Multiplicative model

Key Statistic
The multiplicative model assumes that the observed value is obtained by
multiplying the trend (T) by the rates of three other components, that is,
Y = T × S × C × I
where,
• Y = original data
• T = trend value
• S = seasonal component
• C = cyclical component
• I = irregular component
The multiplicative model assumes that the components, although due to
different causes, are not necessarily independent and they can affect one
another. It also assumes that the behaviour of components is of
multiplicative character. It may be noted that except the value of trend, all
the other values on the right hand side are rates or index numbers.
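
The difference between the two models can be illustrated with a small Python sketch. All the figures below are hypothetical: in the additive case every component is an absolute amount, whereas in the multiplicative case only the trend is an absolute amount and the other components are rates (an index of 110 is written as 1.10).

```python
def additive(t, s, c, i):
    """Observed value under the additive model: Y = T + S + C + I."""
    return t + s + c + i


def multiplicative(t, s, c, i):
    """Observed value under the multiplicative model: Y = T x S x C x I."""
    return t * s * c * i


# Hypothetical components for one period.
y_add = additive(200, 20, -10, 4)               # absolute amounts
y_mul = multiplicative(200, 1.10, 0.95, 1.02)   # trend times three rates

print(y_add)             # 214
print(round(y_mul, 2))   # 213.18
```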

14.7 Editing of Time Series


It is necessary to make certain adjustments in the available data. Some
important adjustments are:
1. Time variation
When data are available on monthly basis, the effect of time variation needs
to be adjusted because all months of the year do not have the same number
of days. This adjustment is made by dividing each monthly total by the
number of days in that month to obtain the daily average. The daily average
is then multiplied by 365 / 12, which is the average number of days in a
month.
2. Population changes
Adjustment for population change becomes necessary when a variable is
affected by change in population. If we are studying National Income figures
such adjustment is necessary. In this case, adjustment is to divide the
income by the number of persons concerned. Then we can have per capita
income figures.

3. Price changes
Adjustment for price changes becomes necessary wherever we want to study
changes in real value. Current values are to be deflated by the ratio of
current prices to base year prices.
4. Comparability
In order to draw valid conclusions, the data which are being analysed
should be comparable. The analysis of time series involves data relating to
the past, which must be homogeneous and comparable. Therefore, efforts
should be made to make the data as homogeneous and comparable as possible.
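
The first three adjustments described in this section are simple arithmetic and can be sketched in Python. The function names and all the figures are hypothetical and serve only to illustrate the calculations.

```python
def calendar_adjust(monthly_total, days_in_month):
    """Time variation: daily average scaled to an average month of 365/12 days."""
    daily_average = monthly_total / days_in_month
    return daily_average * 365 / 12


def per_capita(total, population):
    """Population changes: express the total per person."""
    return total / population


def deflate(current_value, current_price, base_price):
    """Price changes: divide by the ratio of current prices to base year prices."""
    return current_value / (current_price / base_price)


# Hypothetical figures.
feb = calendar_adjust(2800, 28)    # February: 100 units per day
mar = calendar_adjust(3100, 31)    # March: also 100 units per day
print(round(feb, 2), round(mar, 2))      # both months become comparable

print(per_capita(1_000_000, 2_000))      # 500.0
print(deflate(1500, 125, 100))           # 1200.0 in base year prices
```

After the calendar adjustment, February and March show the same adjusted total because their daily averages are equal.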

14.8 Measurement of Seasonal Variation


In order to isolate and identify seasonal variations, we first eliminate as far
as possible the effect of trend, cyclical variations and irregular fluctuations
on the time series. The main methods of measuring seasonal variations are:
• Seasonal variation index or seasonal average method
• Seasonal variation through moving averages
• Chain or link relative method
• Ratio to trend method
Now we will discuss separately each of the methods of measuring seasonal
variation.
14.8.1 Seasonal average method
In the seasonal average method, the steps followed are described below.
i) The time series is arranged by years and months or quarters.
ii) Totals of each month or quarter over all the years are obtained.
iii) The average for each month or quarter is obtained. The average may
be mean or median. In general, we take mean if not specified
otherwise.
iv) Taking the average of the monthly (or quarterly) averages as 100, the
seasonal index for each month or quarter is calculated by the following
formula:
$$\text{Seasonal index for a month (or quarter)} = \frac{\text{Average for the month (or quarter)}}{\text{Average of monthly (or quarterly) averages}} \times 100$$
Symbolically, the seasonal index for the first term is given by:
$$I_1 = \frac{S_1}{\bar{S}} \times 100$$
where, S1 = average of the first term
$\bar{S}$ = average of all terms = $\sum S_j / k$
j = 1, 2, 3, 4, ..., k
k = 12 for monthly data
k = 4 for quarterly data
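
The steps of the seasonal average method can be sketched in Python. This is a minimal sketch: the function name is hypothetical, the mean is used as the average, and the data are the quarterly figures of table 14.10.

```python
def seasonal_indices(table):
    """Seasonal index of each season = season average / grand average x 100."""
    rows = list(table.values())
    k = len(rows[0])                                    # number of seasons
    # Steps ii and iii: average of each quarter over all the years.
    season_means = [sum(row[j] for row in rows) / len(rows) for j in range(k)]
    # Step iv: the average of the quarterly averages is taken as 100.
    grand_mean = sum(season_means) / k
    return [100 * s / grand_mean for s in season_means]


quarterly = {                      # table 14.10
    1995: [3.7, 4.1, 3.3, 3.5],
    1996: [3.7, 3.9, 3.6, 3.6],
    1997: [4.0, 4.1, 3.3, 3.1],
    1998: [3.3, 4.4, 4.0, 4.0],
}

indices = seasonal_indices(quarterly)
print([round(v, 2) for v in indices])   # [98.66, 110.74, 95.3, 95.3]
```

By construction the indices average to 100, so the four quarterly indices add up to 400.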
The merits and demerits of seasonal average method are listed in
table 14.8.
Table 14.8: Merits and demerits of seasonal average method

Merits | Demerits
This method is the simplest one. | Most economic time series have trends and therefore, the seasonal index computed by this method is really an index of trends and seasons.
This method is useful where no definite trend exists in the time series. | The simple averages method of isolating seasonal fluctuations in time series is based on the assumption that the series contains only the seasonal and irregular fluctuations.
 | This method does not give a true reflection of the normal seasonal variation. This is because it is obtained from the original data which are affected by not only seasonal movements but also by the remaining three components.
 | The effects of cycles of the original data are not eliminated by the process of averaging.

14.8.2 Seasonal variation through moving averages


The seasonal variation through moving averages method is also known as the
percentage of moving average method.
The steps involved in the computation of seasonal indices by this method
are described below.
i) The moving averages of the data are computed. If the data are
monthly, 12-monthly moving averages are computed; if they are
quarterly, 4-quarterly moving averages are computed. In both cases,
the periods of the moving averages are even. Hence, these moving
averages are to be centred.
ii) Under the additive model, the corresponding moving average is
deducted from each original value to find the short-term
fluctuations, which are given as:
Y – T = S + C + I
iii) By preparing a separate table, the monthly (or quarterly) short-term
fluctuations are added for each month (or quarter) over all the years
and their average is obtained. These averages are known as seasonal
variations for each month or quarter.
iv) If we want to isolate or measure irregular variations, the mean of the
respective month or quarter is deducted from the short-term
fluctuations.
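
Steps i to iii above, under the additive model, can be sketched in Python. The function name is hypothetical and the sketch assumes an even period (4 for quarterly data); the data are the quarterly figures of table 14.10 written as one continuous series.

```python
def seasonal_variation_additive(series, period):
    """Average short-term fluctuation (Y - T) for each season."""
    n = len(series)
    # Step i: moving averages of the given (even) period, then centred.
    raw = [sum(series[i:i + period]) / period for i in range(n - period + 1)]
    centred = [(raw[i] + raw[i + 1]) / 2 for i in range(len(raw) - 1)]
    k = period // 2            # first centred value belongs to index k

    # Step ii: short-term fluctuation Y - T, grouped by season.
    by_season = [[] for _ in range(period)]
    for offset, t in enumerate(centred):
        idx = offset + k
        by_season[idx % period].append(series[idx] - t)

    # Step iii: average the fluctuations of each season over the years.
    return [sum(d) / len(d) for d in by_season]


quarterly_series = [3.7, 4.1, 3.3, 3.5,    # 1995, table 14.10
                    3.7, 3.9, 3.6, 3.6,    # 1996
                    4.0, 4.1, 3.3, 3.1,    # 1997
                    3.3, 4.4, 4.0, 4.0]    # 1998

variation = seasonal_variation_additive(quarterly_series, 4)
print([round(v, 3) for v in variation])
```

A positive value indicates a quarter that is typically above the trend and a negative value a quarter that is typically below it.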
14.8.3 Chain or link relative method
The steps involved in the chain or link relative method are described below.
i) Each quarterly or monthly value is divided by the preceding quarterly
or monthly value and the result is multiplied by 100. These
percentages are known as Link Relatives of the seasonal values.
Thus:
$$\text{Link Relative} = \frac{\text{Current season's value}}{\text{Previous season's value}} \times 100$$
There is no Link Relative corresponding to the first season of the
first year.

ii) The mean of the link relatives for each season is computed over all the
years. The median can also be taken instead of the mean of the Link
Relatives.

iii) These average link relatives are converted into chain relatives. The
chain relative of the first season is taken as 100. For each later
season:
$$\text{Chain Relative of current season} = \frac{\text{Average Link Relative of current season} \times \text{Chain Relative of previous season}}{100}$$

iv) A second chain relative of the first season is computed on the basis
of the chain relative of the last season:
$$\text{Chain Relative of first quarter} = \frac{\text{Average Link Relative of first quarter} \times \text{Chain Relative of last quarter}}{100}$$
This chain relative may or may not be 100. It differs from 100
because of the secular trend. If it is 100, go to ‘step vi’; if it is
not 100, go to ‘step v’ and then go to ‘step vi’.
v) Compute the difference ‘d’ between the new chain relative of the first
season obtained in ‘step iv’ and the chain relative assumed as 100.
‘d’ is divided by the number of seasons and the resulting figure is
multiplied by 1, 2, 3 and the products are deducted respectively from
the chain relatives of the 2nd, 3rd and 4th quarters. These are called
corrected chain relatives.
vi) The seasonal indices are obtained when the corrected chain relatives
are expressed as percentages of their average.
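
The six steps of the link relative method can be sketched in Python as follows. This is an illustrative sketch with a hypothetical function name; it uses the mean of the link relatives and the quarterly data of table 14.10.

```python
def link_relative_indices(table):
    """Seasonal indices by the chain (link relative) method.

    table: list of yearly rows in time order, each with k seasonal values.
    """
    k = len(table[0])
    flat = [v for row in table for v in row]

    # Step i: link relatives; none exists for the very first value.
    links = [[] for _ in range(k)]
    for pos in range(1, len(flat)):
        links[pos % k].append(100 * flat[pos] / flat[pos - 1])

    # Step ii: average link relative of each season.
    avg_link = [sum(values) / len(values) for values in links]

    # Step iii: chain relatives, with the first season taken as 100.
    chain = [100.0]
    for j in range(1, k):
        chain.append(avg_link[j] * chain[j - 1] / 100)

    # Step iv: second chain relative of the first season.
    second_first = avg_link[0] * chain[-1] / 100

    # Step v: spread the discrepancy d evenly over the seasons.
    d = second_first - 100
    corrected = [chain[j] - j * d / k for j in range(k)]

    # Step vi: express corrected chain relatives as percentages of their mean.
    mean_corrected = sum(corrected) / k
    return [100 * c / mean_corrected for c in corrected]


quarterly = [                      # table 14.10
    [3.7, 4.1, 3.3, 3.5],
    [3.7, 3.9, 3.6, 3.6],
    [4.0, 4.1, 3.3, 3.1],
    [3.3, 4.4, 4.0, 4.0],
]
indices = link_relative_indices(quarterly)
print([round(v, 2) for v in indices])
```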
14.8.4 Ratio to trend method
The steps to determine seasonal indices by this method are as described
below.
i) Determine the trend values by the method of least squares.
ii) To find the ratio to trend, divide the original data by the
corresponding trend values and multiply these ratios by 100, that is,
$$\text{Ratio to Trend} = \frac{\text{Original Data}}{\text{Trend Value}} \times 100$$
iii) Calculate the arithmetic mean of the trend ratios obtained in
‘step ii’ for each quarter (or month).
iv) Finally, convert the average trend ratios into seasonal indices. For
this, add all the averages obtained in ‘step iii’ and find their general
average. Seasonal indices are calculated by using the following
formula:
$$\text{Seasonal Index} = \frac{\text{Quarterly Average}}{\text{General Average}} \times 100$$
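
The ratio to trend steps can be sketched in Python. This sketch makes one simplifying assumption that is not in the text: the straight line trend is fitted by least squares directly to the quarterly values, with the quarters numbered 1, 2, 3, and so on. The function name is hypothetical and the data are those of table 14.10.

```python
def ratio_to_trend_indices(table):
    """Seasonal indices by the ratio to trend method (linear trend)."""
    k = len(table[0])
    flat = [v for row in table for v in row]
    n = len(flat)
    t = list(range(1, n + 1))

    # Step i: linear trend y = a + b*t by the method of least squares.
    sum_t, sum_y = sum(t), sum(flat)
    sum_ty = sum(ti * yi for ti, yi in zip(t, flat))
    sum_t2 = sum(ti * ti for ti in t)
    b = (n * sum_ty - sum_t * sum_y) / (n * sum_t2 - sum_t ** 2)
    a = (sum_y - b * sum_t) / n

    # Step ii: ratio to trend for every observation.
    ratios = [100 * yi / (a + b * ti) for ti, yi in zip(t, flat)]

    # Step iii: arithmetic mean of the ratios of each quarter.
    season_avg = [sum(ratios[j::k]) / len(ratios[j::k]) for j in range(k)]

    # Step iv: express each quarterly average as a percentage of the general average.
    general_avg = sum(season_avg) / k
    return [100 * s / general_avg for s in season_avg]


quarterly = [                      # table 14.10
    [3.7, 4.1, 3.3, 3.5],
    [3.7, 3.9, 3.6, 3.6],
    [4.0, 4.1, 3.3, 3.1],
    [3.3, 4.4, 4.0, 4.0],
]
indices = ratio_to_trend_indices(quarterly)
print([round(v, 2) for v in indices])
```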

14.9 Forecasting Methods using Time Series


There are five forecasting methods using time series. They are:
1. Mean Forecast
2. Naive Forecast
3. Linear Trend Forecast
4. Non-Linear Trend Forecast
5. Forecasting with exponential smoothing
14.9.1 Mean forecast
It is the simplest method of forecasting, in which for the time period t we
forecast the value of the series to be equal to the mean of the series, that
is,
$$y_t = \bar{y}$$
In this method the trend and cyclic effects are not taken into account.

14.9.2 Naive forecast


In this method we forecast the value for the time period t to be equal to
the actual value observed in the previous period, that is, time period
(t – 1). This is given as:
$$y_t = y_{t-1}$$
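
The mean forecast and the naive forecast can be sketched in Python. The function names are hypothetical; the data are the TV sales of table 14.1 (in thousands), and the forecasts are for the year 2001.

```python
def mean_forecast(series):
    """Forecast for the next period = mean of all observed values."""
    return sum(series) / len(series)


def naive_forecast(series):
    """Forecast for the next period = last observed value."""
    return series[-1]


tv_sales = [12, 14, 16, 12, 10, 18]      # table 14.1, 1995 to 2000

print(round(mean_forecast(tv_sales), 2))   # 13.67
print(naive_forecast(tv_sales))            # 18
```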

14.9.3 Linear trend forecast


It is given by yt = a + bx, where x is to be found from the value of t, and
a and b are constants. This method is based on the least squares method,
where a linear relationship is obtained between time ‘x’ and the response
value ‘y’. The forecast is given as:
$$y_t = a + bx$$

14.9.4 Non-linear trend forecast


In this method a non-linear relationship between the time and the response
value is found by the method of least squares. The value of the forecast
‘yt’ for the time period ‘t’ is given as:
$$y_t = a + bx + cx^2$$
where the x-value is calculated from the value of ‘t’, and ‘a’, ‘b’ and ‘c’
are constants.

14.9.5 Forecasting with exponential smoothing


Exponential smoothing is the forecasting method in which the observation
values are constantly updated and used to revise a forecast. As the
observations get older, they get exponentially decreasing weights.
Exponential smoothing is of many types, such as single, double, triple
exponential smoothing.
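
Single exponential smoothing can be sketched in Python as follows. The text does not give the formula, so this sketch assumes the usual form, new forecast = α × latest observation + (1 – α) × previous forecast. The smoothing constant α = 0.5 and the choice of the first observation as the initial forecast are illustrative assumptions; the data are the TV sales of table 14.1.

```python
def exponential_smoothing(series, alpha):
    """Return forecasts F1, F2, ..., F(n+1) by single exponential smoothing."""
    forecasts = [series[0]]        # assumption: first forecast = first observation
    for y in series:
        # New forecast = alpha * latest observation + (1 - alpha) * old forecast.
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts


tv_sales = [12, 14, 16, 12, 10, 18]      # table 14.1, 1995 to 2000

forecasts = exponential_smoothing(tv_sales, alpha=0.5)
print(forecasts[-1])   # forecast for 2001: 14.8125
```

Because each forecast carries forward a fraction (1 – α) of the previous one, the weight given to an observation falls off geometrically with its age.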

Self Assessment Questions


2. Fill in the following blanks.
i) A set of numerical values observed at regular intervals of time is
called _______.
ii) Long term movements in time series are called ______.
iii) Variations that occur within a year are known as _______.
iv) Semi-Average Method is used to measure _________.
v) Method of Moving Averages does not show any _______
relationship.

14.10 Summary
In this unit, you have studied time series analysis and its utility. The
four components of a time series (secular trend, seasonal variations,
cyclic variations and random variations) are discussed with examples.
The methods of measuring trend (free hand, semi-average, moving averages
and least squares) are explained along with their merits and demerits.
You have also studied the additive and multiplicative models, the editing
of time series and the methods of measuring seasonal variation.
Finally, the five forecasting methods using time series are discussed.

14.11 Terminal Questions


1. What is meant by analysis of time series?
2. State the difference between seasonal variations and cyclical
fluctuations.
3. What is trend? State various methods of measuring it.
4. Explain the moving average method of measuring long term trend.

5. What are the components of time series? Bring out the significance of
moving average in analysing a time series and point out its limitations.
6. What is meant by secular trend? Discuss any two methods of isolating
trend values in a time series.
7. What is seasonal variation of a time series? Describe the various
methods you know to evaluate it and examine their relative merits.
8. Fit a straight line trend to the following data and find the trend values.
Table 14.9: Yearly production data
Year Production in 1000 kg
1990 80
1991 90
1992 92
1993 83
1994 94
1995 99
1996 92

9. Find seasonal values for the data in table 14.10.


Table 14.10: Data of terminal question 9
Year  Quarter I  Quarter II  Quarter III  Quarter IV
1995  3.7        4.1         3.3          3.5
1996  3.7        3.9         3.6          3.6
1997  4.0        4.1         3.3          3.1
1998  3.3        4.4         4.0          4.0

14.12 Answers to SAQs and TQs

Answers to Self Assessment Questions


1.
i) False
ii) True
2.
i) Time series
ii) Secular trend

iii) Seasonal variations


iv) Trend
v) Functional relationship

Answers to Terminal Questions


1. Refer section 14.2
2. Refer section 14.4.2 and section 14.4.3
3. Refer section 14.5
4. Refer section 14.5.3
5. Refer section 14.4 and section 14.5
6. Refer section 14.4
7. Refer section 14.8
8. The equation of the straight line, with the origin at 1993, is given as:
y = 90 + 2x
The trend values are 84, 86, 88, 90, 92, 94, 96.
9. The seasonal values obtained are 98.66, 110.74, 95.30, 95.30.

14.13 References
• Richard I. Levin, David S. Rubin (2008), Statistics for Management,
Seventh Edition, PHI Learning Private Limited.


Statistics for Management Unit 15

Unit 15 Index Numbers


Structure:
15.1 Introduction
Learning Objectives
15.2 Definition of an Index Number
Relative
Classification of index numbers
Base year and current year
Chief characteristics of index numbers
Main steps in the construction of index numbers
15.3 Methods of Computation of Index Numbers
Unweighted index numbers
Weighted index numbers
15.4 Tests for Adequacy of Index Number Formulae
15.5 Cost of Living Index Numbers or Consumer Price Index
Utility of consumer price index numbers
Assumptions of cost of living index number
Steps in construction of cost of living index numbers
15.6 Methods of Constructing Consumer Price Index
Aggregate expenditure method
Family budget method
Weighted average of price relatives
15.7 Limitations of Index Numbers
15.8 Utility and Importance of Index Numbers
15.9 Summary
15.10 Terminal Questions
15.11 Answers to SAQs and TQs
Answers to Self Assessment Questions
Answers to Terminal Questions
15.12 References

15.1 Introduction
In Unit 14, ‘Time Series Analysis’, you studied the definition and
components of time series and the different forecasting methods using time
series analysis. In this unit, ‘Index Numbers’, we will discuss the meaning
and definition of index numbers and the types of indices, along with
examples. You will also study different kinds of index numbers. Finally,
you will study the limitations of index numbers.
We know that most values change and therefore we may want to know how
much change has taken place over a period of time. For example, we may
want to know how much the prices of different items essential to a
household have increased or decreased so that necessary adjustments can
be made in the monthly budget. However, while price of a few items may
have increased, others may have decreased over a given period of time.
Consequently, in all such situations, an average measure needs to be
defined to compare such difference over a time period. Index numbers are
yardsticks for describing such differences. These differences may have to
do with the physical quantities of the goods, the prices of the commodities,
or such concepts as ‘efficiency’, ‘intelligence’ or ‘beauty’. The comparison
may be between the periods of time, between places, between categories
and so on.
We may have index numbers comparing the cost of living at different times
or in different localities or countries. Index numbers are also used to
compare the physical volume of production in different years, or the
efficiency of different government offices. However, we confine most of our
attention to
the construction of index numbers measuring changes over time.
15.1.1 Learning Objectives
By the end of this unit, you should be able to:
• Represent a data set in index number form
• Describe how much the economic variables have changed over time
• Describe three principal types of indices: price indices, quantity indices,
  and value indices
• Calculate various kinds of index numbers

15.2 Definition of an Index Number


An index number is a number which is used to measure the level of a
certain phenomenon as compared to the level of the same phenomenon at
some standard period. In other words, an index number is a number which
is used as a device for comparison between the price, quantity or value of a
group of articles in different situations, for example, at a certain place or
period of time and that of another place or period of time.

Key Statistic
An index number is a statistical measure which is designed to express
changes or differences in a variable or a group of related variables. It is
usually expressed in percentage form.

When a comparison is in respect of prices, it is called an index number of
prices; when it is in respect of physical quantities, it is called an index
number of quantities. Other index numbers are defined in a similar manner.
Index numbers are meant for comparing variations that arise from a
difference in situations, for example, a change of time or a change of place.
15.2.1 Relative
The value of a variable in a given year (or place) divided by the value of the
same variable in a specified year (or place) is called a relative. It is generally
expressed in percentage.
a. Price relative
The price of commodity in a given year expressed as a percentage of the
price of the same commodity in a specified year is called price relative.
Solved Problem 1: The price of a commodity in India in 2001 was Rs. 95
per kg and in 2000 it was Rs. 80 per kg. Calculate the price relative for the
year 2001.
Solution: The price relative for 2001, (using 2000 as base) is calculated as:
$$\text{Price relative for 2001} = \frac{95}{80} \times 100 = 118.75\%$$
Hence, the price relative for 2001 is 118.75 %.
b. Production relative
Let us understand production relative with an example.
Solved Problem 2: If the wheat production in India in 2002 was 5,82,000
metric tons and in 2004, it was 6,96,000 metric tons, then assuming the
production of 2002 as 100, calculate the production relative for 2004.


Solution: Let us take the production of 2002 as base.


$$\text{Production relative for 2004} = \frac{696000}{582000} \times 100 \approx 119.6\%$$
Hence, the production relative for 2004 is 119.6 %.
c. Quantity relative
The quantity (q1) of a commodity consumed in a given year expressed as a
percentage of the quantity (q0) of the same commodity consumed in a
specified year is called quantity relative. Thus,
$$\text{Quantity relative} = \frac{q_1}{q_0} \times 100$$

d. Value relative
If ‘p1’ and ‘q1’ are the price and quantity respectively of a commodity in a
given year, and ‘p0’ and ‘q0’ are the price and quantity respectively of the
same commodity in a specified year, then the value of the given year, ‘V1’,
and the value of the specified year, ‘V0’, are calculated as:
V1 = p1 q1
V0 = p0 q0
The value relative of the given year with respect to the specified year is
calculated as the ratio of ‘V1’ to ‘V0’, multiplied by 100. That is,
$$\text{Value relative} = \frac{V_1}{V_0} \times 100 = \frac{p_1 q_1}{p_0 q_0} \times 100$$

The overall change in price, production, quantity or value and so on, is
represented by these typical summaries which are known as relatives.
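The relatives defined above all share one computation: the value for the given year divided by the value for the specified (base) year, multiplied by 100. The following is a minimal Python sketch of that idea. The function names `relative` and `value_relative` are illustrative choices and do not come from this unit; the figures are those of Solved Problems 1 and 2.

```python
def relative(given, specified):
    """Return the relative of `given` to `specified`, as a percentage."""
    if specified == 0:
        raise ValueError("the specified (base) value must be non-zero")
    return given / specified * 100


def value_relative(p1, q1, p0, q0):
    """Value relative: (p1 * q1) / (p0 * q0) * 100."""
    return relative(p1 * q1, p0 * q0)


# Solved Problem 1: price Rs. 95 in 2001 against Rs. 80 in 2000.
price_rel = relative(95, 80)
# Solved Problem 2: wheat production in 2004 against 2002.
production_rel = relative(696000, 582000)

print(round(price_rel, 2))       # 118.75
print(round(production_rel, 1))  # 119.6
```

The same `relative` function serves for price, production and quantity relatives, because only the meaning of the two arguments changes.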
15.2.2 Classification of index numbers
There are various approaches for classification of index numbers. They are:
1. Based on variables
a. Price index: when the variable is price
b. Quantity index: when the variable is quantity
c. Value index: when the variable is value
d. Production index: when the variable is production


2. Based on retail or wholesale prices


a. Cost of living index number: where we use retail prices
b. Wholesale price index number: where we use wholesale prices
3. Based on weights
a. Simple (unweighted) index number
b. Weighted index number
4. Based on number of commodities
When the number of commodities is more than one, then we obtain a
single (combined) index number. This can be done in four ways:
a. Simple average of relatives
b. Weighted average of relatives
c. Simple aggregate
d. Weighted aggregate
15.2.3 Base year and current year
In the computation of an index number we require two years (or places).
The given year whose values are to be compared is called a current year (or
current period) and the specified year whose values are taken as standard
(for example, 100) is called a base year (base period).

Example 1
If the prices of 2005 are compared with the prices of 2004, then 2005 is
the current year and 2004 is the base year. The index number of 2005
based on 2004 is denoted by ‘P01’ for a price index or ‘Q01’ for a quantity
index, where subscript ‘0’ stands for the year 2004, and subscript ‘1’ stands
for the year 2005.

15.2.4 Chief characteristics of index numbers


1. Expressed in numbers
Index numbers express relative changes, such as an increase in production
or a fall in prices, in numerical form.
2. Expressed in percentage
Index numbers are expressed as percentages so as to show the extent of
relative change, where the value of the base is taken as 100, but the
percentage sign (%) is not used.


3. Relative measure
Index numbers measure changes which are not capable of direct
measurement.
4. Specialised averages
An index number is a special type of average, in general a weighted
average. It is special because, whereas a simple average is taken over
homogeneous data having the same unit of measurement, index numbers
average variables having different units of measurement.
5. Basis of Comparison
Index numbers by their very nature are comparative. They compare
changes over time or between places or similar categories.
15.2.5 Main steps in the construction of index numbers
Several problems are encountered in the construction of index numbers, and
each of the following steps needs to be considered carefully:
1. Purpose of index number
The steps which are taken in the construction of index numbers generally
depend on the purpose of the index number. Hence, the purpose of an
index number must be defined clearly and precisely. For example, the
purpose of the general wholesale price index number is to show the general
price level. On the other hand, the purpose of the consumer price index
number is to give an idea of the effect of changes in retail prices on the
cost of living of different classes of people.
2. Selection of base period
The base period of an index number is the period of time against which the
comparisons are made. There are three types of base periods.
i) Fixed base (a single period)
ii) Fixed base (an average of selected periods)
iii) Chain base
While selecting the base, we have to decide whether to use a fixed base or a
chain base.
Fixed base (a single period): In a fixed base (a single period), the base
period must be a normal period. By normal period, we mean that the period
must be free from all sorts of abnormalities or random causes such as
financial crises, floods, famines, earthquakes, labour strikes and wars. The
base period should be a period for which reliable figures are available. The
base period should not be too distant in the past.
Fixed base (an average of selected periods): When it is difficult to choose
just one single period as the normal, then a better choice will be an average
of several periods.
Chain base: If the comparisons are required from year to year, a system of
chain base is used. In this method, there is no fixed base for comparing
the values of subsequent years; instead, the value of each year is compared
with the value of the preceding year.
3. Selection of commodities
The following problems can occur while selecting the commodities.
• The first problem is deciding how many commodities to include, because it
  is not feasible to include all commodities. The purpose of the index
  number helps in deciding the number of commodities.
• Another problem is deciding which commodities are to be included. The
  commodities must be selected carefully so that each commodity:
  - represents the real tastes, habits and customs of the people.
  - is of a standard quality, with no significant variation in quality.
  - is easily recognisable and describable.
  - is tangible, that is, not a non-tangible commodity such as a personal
    service.
4. Selection of the representative prices
In the collection of price quotations we have to consider the following points:
• The method of quoting prices of the commodities
• The type of quotations - whether wholesale prices or retail prices
• The place from where the quotations are to be obtained
5. System of weighting
The term ‘weight’ refers to the relative importance of the different
commodities included in the construction of index numbers. There are two
methods of assigning weights. They are:

• Implicit method: In this method, a commodity is weighted implicitly by
  including several varieties of it in the index. Such weights are called
  implicit weights.
• Explicit method: In this method, the weights are laid down on the basis
  of some outward evidence of the importance of the commodities. One of the
  problems in the selection of appropriate weights is deciding what this
  evidence should be. Another problem with regard to the system of
  weighting is whether the weights should be fixed or fluctuating.
6. Selection of the average
To find composite index number we can use any average such as arithmetic
mean, geometric mean, harmonic mean, median and mode. The use of an
average depends on the relative merits and demerits of the various
averages. The average may be weighted or unweighted.
7. Selection of suitable formula
There are various formulae for computing index numbers so the selection of
a suitable formula also poses some problem. A particular formula is suitable
in a particular situation.
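The difference between a fixed base and a chain base, described under the selection of the base period above, can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, and the yearly prices are hypothetical figures, not data from this unit.

```python
def fixed_base_indices(values, base_position=0):
    """Index every period against one fixed base period (base = 100)."""
    base = values[base_position]
    return [value * 100 / base for value in values]


def chain_base_indices(values):
    """Index every period against the period just before it.

    The first period has no preceding period, so it is simply set to 100.
    """
    links = [current * 100 / previous
             for previous, current in zip(values, values[1:])]
    return [100.0] + links


# Hypothetical yearly prices of one commodity.
prices = [50, 60, 66, 55]

print([round(x, 2) for x in fixed_base_indices(prices)])
# [100.0, 120.0, 132.0, 110.0]
print([round(x, 2) for x in chain_base_indices(prices)])
# [100.0, 120.0, 110.0, 83.33]
```

With a fixed base, every figure answers "how does this year compare with the base year?", while with a chain base every figure answers "how does this year compare with last year?".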

15.3 Methods of Computation of Index Numbers


The various methods of constructing index numbers can be classified into
two groups. They are:
• Unweighted index numbers
• Weighted index numbers
In unweighted index numbers, each item is supposed to have the same
weight but in weighted index numbers the weights are assigned to various
items in accordance with their importance. The figure 15.1 illustrates the
further classification of methods of constructing index numbers.


Fig. 15.1: Methods of constructing index numbers


Unweighted index numbers can be further divided into two categories. They
are:
i) Simple aggregative method
ii) Simple average of relatives method.
Weighted index numbers can also be further divided into two categories.
They are:
i) Weighted aggregative method
ii) Weighted average of relatives method
15.3.1 Unweighted index numbers
Simple aggregative method
To construct a price index by simple aggregative method, we proceed by
doing the following:
i) Add the prices of all commodities in the current year, that is, find $\sum p_1$
ii) Add the prices of all commodities in the base year, that is, find $\sum p_0$
iii) Divide the total of current year prices by the total of base year prices
and multiply the quotient by 100, that is,
$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$$
where ‘P01’ is the simple price index number of the current year (1) based
on the base year (0).


The table 15.1 lists the merits and demerits of simple aggregative method.
Table 15.1: Merits and demerits of simple aggregative method
Merits:
• This is the simplest method of constructing index numbers.
• It is simple and easy to understand.
• It requires simple calculations.
Demerits:
• This method gives inappropriate results when the prices of different
  commodities are quoted in different units.
• Since weights are not used, this method does not give any consideration
  to the relative importance of commodities.
• Index number calculated by this method is unduly affected by high or low
  values.

Solved Problem 3: Find the simple aggregative price index from the data
displayed in table 15.2.
Table 15.2: Price of commodities for the years 2000 and 2004
Commodity   Unit            Price in Rs. per unit
                            2000      2004
A           One kilogram    10        15
B           One kilogram    40        30
C           One dozen       10        12
D           One litre       5         13
Total                       65        70

Solution: The price index number of 2004 is based on 2000. Using the
formula:
$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$$
where, $\sum p_1$ = total of prices in 2004 = 70
$\sum p_0$ = total of prices in 2000 = 65

Therefore,
$$P_{01} = \frac{70}{65} \times 100 \approx 107.7$$
65


This implies that the prices had increased by 7.7% in year 2004 as
compared to the year 2000.
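As a quick check of Solved Problem 3, the simple aggregative method can be written as a short Python sketch. The function name `simple_aggregative_index` is an illustrative choice; the prices are those of table 15.2.

```python
def simple_aggregative_index(current_prices, base_prices):
    """Total of current year prices over total of base year prices, times 100."""
    return sum(current_prices) / sum(base_prices) * 100


# Prices of commodities A, B, C and D from table 15.2.
prices_2000 = [10, 40, 10, 5]
prices_2004 = [15, 30, 12, 13]

p01 = simple_aggregative_index(prices_2004, prices_2000)
print(round(p01, 1))  # 107.7
```

Note that the four prices are added even though they refer to different units (kilogram, dozen, litre), which is exactly the weakness of this method.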

Self Assessment Question


1. Find out the price index number using simple aggregate method for the
data represented in table 15.3.
Table 15.3: Price of the commodities for years 2001 and 2002
Commodity   Price in Rs. per quintal
            Base year, 2001   Current year, 2002
Wheat       80                100
Rice        120               250
Gram        100               150
Pulses      200               300

Simple average of relatives method


To construct a price index by this method, we proceed by doing the
following:
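i) Obtain the price relative for each commodity, which is calculated as:
$$\text{Price relative for current year} = \frac{\text{Price of current year}}{\text{Price of base year}} \times 100, \quad \text{that is,} \quad R = \frac{P_1}{P_0} \times 100$$
ii) Calculate the arithmetic mean or the geometric mean of the price
relatives obtained in ‘step i’ and denote it by ‘P01’.
a. When arithmetic mean is used:
$$P_{01} = \frac{\sum \left( \frac{P_1}{P_0} \times 100 \right)}{N} = \frac{\sum R}{N}$$
b. When geometric mean is used:
$$P_{01} = \text{Antilog} \left( \frac{\sum \log R}{N} \right)$$
where ‘N’ is the number of commodities.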

Solved Problem 4: The prices of three different commodities for 2002 and
2003 are displayed in table 15.4a. The price given is per ton of the
commodity. Taking the year 2002 as base, calculate the price index using
the simple average of relatives method, with both the arithmetic mean and
the geometric mean.
Table 15.4a: Prices of commodities for 2002 and 2003
Commodity Corn Wheat Cocoa
Price in 2002 800 500 900
Price in 2003 880 480 940

Solution: The table 15.4b represents the calculated values for determining
price index.
Table 15.4b: Calculated values for determining price index
Commodity   Price in    Price in    Price relative,               log R
            2002, Po    2003, Pn    R = (Pn / Po) × 100
Corn        800         880         (880 / 800) × 100 = 110       2.0414
Wheat       500         480         (480 / 500) × 100 = 96        1.9823
Cocoa       900         940         (940 / 900) × 100 = 104.44    2.0189
Total       ΣPo = 2200  ΣPn = 2300  ΣR = 310.44                   Σlog R = 6.0426

i) Simple average of relatives method by using arithmetic mean:
$$P_{01} = \frac{\sum R}{N} = \frac{310.44}{3} = 103.48$$
ii) Simple average of relatives method by using geometric mean:
$$P_{01} = \text{Antilog} \left( \frac{\sum \log R}{N} \right) = \text{Antilog} \left( \frac{6.0426}{3} \right) = \text{Antilog}(2.0142) \approx 103.32$$
Hence, the price indices obtained by the simple average of relatives method
using the arithmetic mean and the geometric mean are 103.48 and 103.32
respectively.
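The two averages used in Solved Problem 4 can be reproduced with a small Python sketch. The function names are illustrative, and the geometric mean is computed with full-precision base-10 logarithms from the standard `math` module rather than rounded table values, so the last digits depend on that choice.

```python
import math


def price_relatives(current_prices, base_prices):
    """Price relative R = (P1 / P0) * 100 for each commodity."""
    return [p1 / p0 * 100 for p1, p0 in zip(current_prices, base_prices)]


def arithmetic_mean_index(relatives):
    """Simple average of relatives using the arithmetic mean."""
    return sum(relatives) / len(relatives)


def geometric_mean_index(relatives):
    """Simple average of relatives using the geometric mean.

    Computed as the antilog of the mean of log10(R).
    """
    mean_log = sum(math.log10(r) for r in relatives) / len(relatives)
    return 10 ** mean_log


# Corn, wheat and cocoa prices from table 15.4a.
relatives = price_relatives([880, 480, 940], [800, 500, 900])

print(round(arithmetic_mean_index(relatives), 2))  # 103.48
print(round(geometric_mean_index(relatives), 2))   # 103.32
```

The geometric mean is never larger than the arithmetic mean of the same relatives, which is why it is less influenced by one unusually high relative.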


The table 15.5 displays the merits and demerits of simple average of
relatives method.
Table 15.5: Merits and demerits of simple average of relatives method
Merits:
• It is not affected by units in which prices are quoted.
• It is not affected by absolute values of prices as prices are converted
  into price relatives.
• It gives equal importance to all items and extreme items do not unduly
  affect the index number.
• The index number calculated by this method satisfies the unit test.
Demerits:
• As it is an unweighted average the importance of all items is assumed to
  be the same.
• The index number constructed by this method does not satisfy all the
  criteria laid down for an ideal index.
• The index number is unduly influenced by high or low prices when
  arithmetic mean is used.
• More labour is involved if geometric mean is used.

15.3.2 Weighted index numbers


To overcome the weakness of the simple or unweighted method, we weight the
price of each commodity by a suitable factor, often the quantity or the
volume of the commodity sold during the base year. In other words, in
this method, appropriate weights are assigned to various commodities to
reflect their relative importance in the group. The weights can be
production figures, consumption figures or distribution figures.

Key Statistic
For the construction of the price index number quantity weights are used.
If ‘w’ is the weight attached to a commodity, then the price index is given
by:
$$\text{Price Index, } P_{01} = \frac{\sum P_1 w}{\sum P_0 w} \times 100$$

Weighted aggregative index number


In the weighted aggregative index numbers, the weights are assigned to
various items and the weighted aggregate of the prices are obtained.
Weights are assigned in various ways and the weighted aggregates are
used in different ways for the construction of index numbers.

Some of the important methods of constructing weighted aggregative index
numbers are described below.

Laspeyre’s price index


Laspeyre’s method is based on fixed weights of the base year. Base
year’s quantities are used as weights. The formula given by Laspeyre is
given below.
$$\text{Laspeyre's Price Index, } I_{01} = \frac{\sum P_1 Q_0}{\sum P_0 Q_0} \times 100$$
where, P1 = Current year price
P0 = Base year price
Q0 = Quantity used for weight in the base year
This index number has an upward bias: when prices increase, there is a
tendency to reduce the consumption of higher priced goods, but the base
year quantities do not reflect this reduction. This index number is widely
used in practical work.
The quantity index number using Laspeyre’s formula is given by:
$$Q_{01} = \frac{\sum Q_1 P_0}{\sum Q_0 P_0} \times 100$$

Paasche’s method
Paasche’s method is based on current year’s quantities. Current year’s
quantities are used as weights. Paasche’s Price Index is given by:
$$PP_{01} = \frac{\sum P_1 Q_1}{\sum P_0 Q_1} \times 100$$
where, P1 = Current year price; P0 = Base year price
Q1 = Current year quantity, which is taken as the weight.
This index number has a downward bias. This formula is not used frequently
in practice where the number of commodities is large.
The quantity index number using Paasche’s formula is given by:
$$PQ_{01} = \frac{\sum Q_1 P_1}{\sum Q_0 P_1} \times 100$$


Dorbish and Bowley’s method


This method is a combination of Laspeyre’s and Paasche’s methods. If we
find the arithmetic average of Laspeyre’s index and Paasche’s index, we
get the index suggested by Dorbish and Bowley. This index number takes into
account both the base year’s as well as the current year’s weights. Dorbish
and Bowley’s Price Index is given by:
$$DP_{01} = \frac{LP_{01} + PP_{01}}{2} = \frac{1}{2} \left[ \frac{\sum P_1 Q_0}{\sum P_0 Q_0} + \frac{\sum P_1 Q_1}{\sum P_0 Q_1} \right] \times 100$$
where ‘LP01’ is Laspeyre’s price index and ‘PP01’ is Paasche’s price index.
Taking the geometric mean of the same two indices, instead of the
arithmetic mean, gives Fisher’s ideal index:
$$FP_{01} = \left[ \frac{\sum P_1 Q_0}{\sum P_0 Q_0} \times \frac{\sum P_1 Q_1}{\sum P_0 Q_1} \right]^{1/2} \times 100$$

The table 15.6 displays the merits and demerits of weighted index number.
Table 15.6: Merits and demerits of weighted index number
Merits:
• It is free from bias, upward as well as downward.
• This formula takes into account both current year as well as base year
  prices and quantities.
• It satisfies both the ‘time reversal test’ as well as the ‘factor reversal
  test’. This is why it is called an ideal index number.
Demerits:
• This formula is difficult to interpret.
• It is not a practical index to compute because it is excessively
  laborious.
• It requires the prices and quantities for base year and current year.
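The weighted aggregative formulae discussed in this sub-section differ only in which quantities are used as weights and how the two resulting ratios are averaged. The Python sketch below puts them side by side, using the prices and quantities of table 15.9 from the Terminal Questions. The function names are illustrative, and the Fisher index is computed here as the geometric mean of the Laspeyre and Paasche indices, which is the standard definition of that index.

```python
import math


def aggregate(prices, quantities):
    """Sum of price * quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


def laspeyres_index(p0, q0, p1, q1):
    """Base year quantities as weights."""
    return aggregate(p1, q0) / aggregate(p0, q0) * 100


def paasche_index(p0, q0, p1, q1):
    """Current year quantities as weights."""
    return aggregate(p1, q1) / aggregate(p0, q1) * 100


def dorbish_bowley_index(p0, q0, p1, q1):
    """Arithmetic mean of the Laspeyre and Paasche indices."""
    return (laspeyres_index(p0, q0, p1, q1) + paasche_index(p0, q0, p1, q1)) / 2


def fisher_index(p0, q0, p1, q1):
    """Geometric mean of the Laspeyre and Paasche indices."""
    return math.sqrt(laspeyres_index(p0, q0, p1, q1) * paasche_index(p0, q0, p1, q1))


# Commodities A, B, C, D from table 15.9 (base year 1997, current year 2005).
p0, q0 = [10, 7, 5, 16], [12, 15, 24, 5]
p1, q1 = [12, 5, 9, 14], [15, 20, 20, 5]

for name, index in [("Laspeyre", laspeyres_index), ("Paasche", paasche_index),
                    ("Dorbish-Bowley", dorbish_bowley_index), ("Fisher", fisher_index)]:
    print(name, round(index(p0, q0, p1, q1), 2))
```

With these figures the Laspeyre index comes out near 118.8 and the Paasche index near 112.8, and both averages necessarily fall between the two.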

Quantity index numbers


The quantity index numbers measure the average change in quantities and
enable us to compare changes in the physical quantity of goods produced or
sold. These index numbers can also be simple or weighted. Quantity index
numbers can be easily obtained from price index numbers just by
interchanging P’s and Q’s in the formulae used for calculating the price
index numbers. The weighted average of relatives quantity index is given by:
$$\text{Quantity index} = \frac{\sum \left[ \left( \frac{Q_i}{Q_0} \times 100 \right) (Q_n P_n) \right]}{\sum Q_n P_n}$$
where,
• ‘Qi’ and ‘Q0’ are the quantities for the current and base period
  respectively
• ‘Pn’ and ‘Qn’ are the prices and quantities that determine the values
  that we use for weights.

Value index numbers


The value index numbers are very easy to calculate. Value is the product of
price and quantity. A simple value index number is equal to the value of the
current year divided by the value of the base year. If this ratio is
multiplied by 100 we get the value index number. The required formula is:
$$V = \frac{\sum P_1 Q_1}{\sum P_0 Q_0} \times 100$$
Equivalently, the simple value index number is given by:
$$V = \frac{\sum V_1}{\sum V_0} \times 100$$
where, V1 = value of the current year and V0 = value of the base year.


Such index numbers are not weighted, because they do not take into
account either the price or the quantity. These index numbers are not very
popular because the situation revealed by price and quantities are not fully
revealed by the values.
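A value index needs nothing more than the total value in each year. As a minimal sketch, the Python snippet below computes it for the prices and quantities of table 15.9 from the Terminal Questions; the function name `value_index` is an illustrative choice.

```python
def value_index(p0, q0, p1, q1):
    """Total current year value over total base year value, times 100."""
    current_value = sum(p * q for p, q in zip(p1, q1))
    base_value = sum(p * q for p, q in zip(p0, q0))
    return current_value / base_value * 100


# Commodities A, B, C, D from table 15.9 (base year 1997, current year 2005).
p0, q0 = [10, 7, 5, 16], [12, 15, 24, 5]
p1, q1 = [12, 5, 9, 14], [15, 20, 20, 5]

print(round(value_index(p0, q0, p1, q1), 2))  # 124.71
```

The result says that total expenditure on the four commodities rose by about a quarter, but it does not say how much of that rise came from prices and how much from quantities.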

15.4 Tests for Adequacy of Index Number Formulae


1. Unit test
This test requires that the formula should be independent of the units in
which prices and quantities are quoted. Except the simple aggregative index,
all the others satisfy this test.
2. Time reversal test
This test requires that the formula for calculating the index number should
give the same ratio between one period of comparison and the other, no
matter which of the two is taken as the base. Symbolically,
$$P_{01} \times P_{10} = 1$$
where ‘P01’ and ‘P10’ are expressed as ratios, that is, without the factor
100. This test is satisfied by Fisher’s ideal index, simple geometric mean
of price relatives, weighted geometric mean of price relatives and
Marshall-Edgeworth index number.
3. Factor reversal test
The formula should permit the interchange of price and quantity without
giving inconsistent results. Symbolically,
$$P_{01} \times Q_{01} = \frac{\sum P_1 Q_1}{\sum P_0 Q_0}$$
This test is satisfied by Fisher’s ideal index.
4. Circular test
It is an extension of time reversal test. The test requires that if an index
is constructed for the year ‘a’ on base year ‘b’, and for the year ‘b’ on the
base year ‘c’, we should get the same result as if we calculated directly for
the year ‘a’ on the base year ‘c’ without going through ‘b’. Symbolically,
$$P_{01} \times P_{12} \times P_{20} = 1$$
It is satisfied by index numbers with fixed weights by aggregate methods.
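The time reversal and factor reversal tests can be checked numerically. The Python sketch below does so for the Laspeyre formula and for Fisher's ideal index, using the prices and quantities of table 15.9 from the Terminal Questions. All function names are illustrative, and the indices are kept as plain ratios (without the factor 100) so that the test identities hold exactly as stated.

```python
import math


def aggregate(prices, quantities):
    """Sum of price * quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


def laspeyres_ratio(p0, q0, p1, q1):
    return aggregate(p1, q0) / aggregate(p0, q0)


def fisher_ratio(p0, q0, p1, q1):
    laspeyres = aggregate(p1, q0) / aggregate(p0, q0)
    paasche = aggregate(p1, q1) / aggregate(p0, q1)
    return math.sqrt(laspeyres * paasche)


def time_reversal_product(index, p0, q0, p1, q1):
    """P01 * P10; equals 1 when the time reversal test is satisfied."""
    return index(p0, q0, p1, q1) * index(p1, q1, p0, q0)


def factor_reversal_gap(index, p0, q0, p1, q1):
    """P01 * Q01 minus the value ratio; zero when the test is satisfied."""
    price_index = index(p0, q0, p1, q1)
    quantity_index = index(q0, p0, q1, p1)  # prices and quantities swap roles
    value_ratio = aggregate(p1, q1) / aggregate(p0, q0)
    return price_index * quantity_index - value_ratio


p0, q0 = [10, 7, 5, 16], [12, 15, 24, 5]
p1, q1 = [12, 5, 9, 14], [15, 20, 20, 5]

for name, index in [("Laspeyre", laspeyres_ratio), ("Fisher", fisher_ratio)]:
    print(name,
          round(time_reversal_product(index, p0, q0, p1, q1), 4),
          round(factor_reversal_gap(index, p0, q0, p1, q1), 4))
```

For this data the Fisher index gives a product of 1 and a gap of 0 (up to rounding), while the Laspeyre formula fails both tests.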

15.5 Cost of Living Index Numbers or Consumer Price Index


The ‘cost of living index’, also known as ‘consumer price index’ or ‘cost of
living price index’, is the country’s principal measure of price change. The
consumer price index helps us in determining the effect of rise and fall in
prices on different classes of consumers living in different areas.


Key Statistic
Cost of living price index measures average change over time in the
prices paid by the consumer of specific baskets of goods and services.
The cost of living price index numbers are designed to measure the
average change in the price paid by the ultimate consumers for specified
quantities of goods and services over a period of time

Different people consume different kinds of commodities, and the same
commodities in different proportions. The consumer price index number is
significant because the demand for higher wages is based on the cost of
living index, and the wages and salaries in most nations are adjusted
according to this index number.
The cost of living index does not measure the actual cost of living or the
fluctuations in the cost of living due to causes other than the change in
price level. Its object is to find out how much the consumers of a
particular class have to pay for a certain quantity of goods and services.
15.5.1 Utility of consumer price index numbers
The following are the uses of consumer price index numbers.
i) It is useful to measure the change in purchasing power of currency,
real income.
ii) It helps the government in formulating wage policy, price policy,
taxation and general economic policies.
iii) Market prices for particular kinds of goods and services are analysed
by consumer price index.
iv) Salaries and wages are fixed on the basis of the consumer price
index. So, it is very helpful in revising wages or dearness allowance.
15.5.2 Assumptions of cost of living index number
Cost of living index number is based on the following assumptions.
1. Similar needs
The needs of the people for whom this index number is constructed are the
same.

2. Same goods
The goods consumed in the base and current years remain unchanged.
3. No change in quantity of goods
It is also assumed that the quantity of goods consumed will remain the same
in the base year and the current year.
4. Price quotations are same
It is also assumed that the prices at different places are the same and that
they do not change frequently.
5. True on the average
Cost of living index numbers are true on the average.
6. Representative goods
The commodities included in the cost of living index number represent the
consumption of the class of people.
15.5.3 Steps in construction of cost of living index numbers
There are 5 steps involved in construction of cost of living index numbers.

Step 1: Select the class of people


Step 2: Define scope of the index
Step 3: Conduct family budget inquiry
Step 4: Obtain price quotations
Step 5: Prepare a frame or list of persons

15.6 Methods of Constructing Consumer Price Index


There are three methods for constructing consumer price index number.
They are:
• Aggregate expenditure method
• Family budget method
• Weighted average of price relatives
15.6.1 Aggregate expenditure method
Based on Laspeyre’s method, base year quantities are taken as weights
(w = Q0).
$$P_{01} = \frac{\sum P_1 Q_0}{\sum P_0 Q_0} \times 100 \quad \text{or} \quad \frac{\sum P_1 Q_1}{\sum P_0 Q_1} \times 100$$


15.6.2 Family budget method


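Family budget method or the method of weighted relatives is the method
where weights are the values (P0Q0) in the base year, often denoted by V.
If R = (P1 / P0) × 100 is the price relative of a commodity, then
$$P_{01} = \frac{\sum R V}{\sum V} = \frac{\sum \left( \frac{P_1}{P_0} \times 100 \right) P_0 Q_0}{\sum P_0 Q_0} = \frac{\sum P_1 Q_0}{\sum P_0 Q_0} \times 100$$
which is the same as the equation in sub-section 15.6.1.
15.6.3 Weighted average of price relatives
Let I = group index and W = weights. Then,
$$P_{01} = \frac{\sum I W}{\sum W}$$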
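The weighted average of group indices is a one-line computation. The Python sketch below applies it to the group indices and percentage expenditures of table 15.7 (Self Assessment Question 2), with the percentage expenditures used as weights. The function name `weighted_average_index` is an illustrative choice.

```python
def weighted_average_index(group_indices, weights):
    """Sum of (index * weight) divided by the sum of weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_total = sum(i * w for i, w in zip(group_indices, weights))
    return weighted_total / total_weight


# Food, clothing, fuel & lighting, housing, miscellaneous (table 15.7).
group_indices = [200, 175, 160, 225, 150]
weights = [50, 10, 12, 15, 13]

cpi = weighted_average_index(group_indices, weights)
print(round(cpi, 2))  # 189.95
```

Because food carries half of the total weight, its index of 200 pulls the overall figure well above the simple average of the five group indices.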

15.7 Limitations of Index Numbers


There is no doubt that the technique of index numbers is a very useful tool. But
there are certain limitations of index numbers which should be borne in mind.
The chief limitations are:
• Index numbers are not perfect. They are approximate values.
• There are difficulties in the construction of index numbers, arising from
  the selection of the base year, the items and the average, and from
  changes in habits.
• Sampling errors occur.
• Index numbers can also be manipulated.
• They have limited applications. An index number constructed for one
  purpose cannot be used for other purposes.
• Adequate and accurate data are often lacking.

Self Assessment Questions


2. The data in table 15.7 is related to workers in an industrial town.
Calculate consumer price index number.
Table 15.7: Price index and percentage expenditures of items
Item of consumption Price index P Percentage expenditure
Food 200 50
Clothing 175 10
Fuel & lighting 160 12
Housing 225 15
Miscellaneous 150 13


3. Shift the base of the index numbers to 1990 for the data in table 15.8.
Table 15.8: Index numbers corresponding to year
Year                       1982   1986   1990   1994   1998
Index number (base 1982)   100    140    200    260    320

15.8 Utility and Importance of Index Numbers


The primary purpose of index numbers is to measure relative temporal or
cross-sectional changes in a variable or a group of related variables which
are not capable of being directly measured. The greatest purpose of index
numbers has been to measure and compare the changes in prices and
purchasing power of money which have received great attention from
economists for many years.
Today, index numbers are not used for measuring price changes alone.
Factors like wages, employment, production, trade, demand, supply,
business conditions, industrial activity and financial problems are also
studied through this statistical device. Just as a barometer measures the
pressure of the atmosphere, index numbers measure the pressure of
economic behaviour, and thus index numbers are called economic
barometers.

Main uses of index numbers:


• Comparative study is made possible
• Simplifies data
• Provides guidelines to economic policy and in formulating decisions
• Measures purchasing power of money
• Change in cost of living
• National income calculations
• It is used as control by government
• Reveals trends and tendencies
• Useful in deflating
• Universal utility


15.9 Summary
In this unit, you have studied the concept of index numbers and their
classification into different types. The different index numbers in common
use, and the utility and importance of index numbers, are explained in a
simple way. You have also studied the limitations and uses of index numbers.

15.10 Terminal Questions


1. What is an index number? State its utility.
2. Discuss the problems of:
i) selection of the base year
ii) selection of weights in the construction of index numbers
3. What are the characteristics of an index number?
4. Construct Fisher’s ideal index for the data represented in table 15.9.
Table 15.9: Price of commodities for the years 1997 and 2005
Commodity   Base year 1997     Current year 2005
            Price    Qty       Price    Qty
A           10       12        12       15
B           7        15        5        20
C           5        24        9        20
D           16       5         14       5

5. The table 15.10 displays the price of commodities along with the weights
of respective commodities. Calculate index number for 2000 based on
year 1995.
Table 15.10: Price of commodities along with the weights
Commodity 1995 2000 Weights
A 13 8 6
B 15 22 5
C 249 185 4
D 228 259 1
E 497 448 2


15.11 Answers to Self Assessment Questions


1. For the data in table 15.3, we can calculate the price index number of
2002 based on 2001 by the simple aggregate method as:
$$I_{01} = \frac{\sum P_1}{\sum P_0} \times 100$$
where, $\sum P_1$ = total of prices in 2002 = 800
$\sum P_0$ = total of prices in 2001 = 500
Therefore,
$$I_{01} = \frac{800}{500} \times 100 = 160$$
This means that the prices have increased by 60% in 2002 as compared to
2001.
2. The table 15.11 displays the price of items along with the weighted
price.
Table 15.11: Price of items along with the weighted prices
Item              P      w (weight)   wP
Food              200    50           10000
Clothing          175    10           1750
Fuel & Lighting   160    12           1920
Housing           225    15           3375
Miscellaneous     150    13           1950
Total                    Σw = 100     ΣwP = 18995

Consumer price index number by family budget method is given by:
$$P_{01} = \frac{\sum wP}{\sum w} = \frac{18995}{100} = 189.95$$
Hence, the consumer price index number by family budget method is
189.95.
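3. The index number for 1990 with base 1982 is 200. Therefore,
$$\text{New index} = \frac{\text{Old index}}{200} \times 100$$
The index numbers with base 1990 are therefore 50, 70, 100, 130 and 160 for
the years 1982, 1986, 1990, 1994 and 1998 respectively.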
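Base shifting as used in this answer divides every old index number by the old index of the new base year and multiplies by 100. A minimal Python sketch follows, using the index numbers of table 15.8; the function name `shift_base` is an illustrative choice.

```python
def shift_base(indices_by_year, new_base_year):
    """Re-express index numbers so that `new_base_year` becomes 100."""
    new_base_value = indices_by_year[new_base_year]
    return {year: value * 100 / new_base_value
            for year, value in indices_by_year.items()}


# Index numbers with base 1982 from table 15.8.
old_indices = {1982: 100, 1986: 140, 1990: 200, 1994: 260, 1998: 320}

print(shift_base(old_indices, 1990))
# {1982: 50.0, 1986: 70.0, 1990: 100.0, 1994: 130.0, 1998: 160.0}
```

The ratios between any two years are unchanged by the shift; only the year that is set to 100 moves.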


Answers to Terminal Questions


1. Refer section 15.2 , section 15.5.1, section 15.8
2. Refer section 15.2.5
3. Refer section 15.2.4
4. The Fisher ideal index is approximately 115.76.
5. The required index number for the year 2000 is 92.17.

15.12 References
• Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
  Seventh Edition, PHI Learning Private Limited
• S. C. Gupta, (2008) Fundamentals of Statistics, Himalaya Publishing
  House
• U K Srivastava, G V Shenoy, S C Sharma, Quantitative Techniques for
  Management Decisions, Second Edition, New Age International

–––––––––––––––––––––
