MB0040
Board of Studies
Chairman, HOD Management & Commerce, SMU – DDE
Additional Registrar, SMU – DDE
Controller of Examination, SMU – DDE
Dr. T. V. Narasimha Rao, Adjunct Faculty & Advisor, SMU – DDE
Prof. K. V. Varambally, Director, Manipal Institute of Management, Manipal
Mr. Pankaj Khanna, Director HR, Fidelity Mutual Fund
Mr. Shankar Jagannathan, Former Group Treasurer, Wipro Technologies Limited
Mr. Abraham Mathew, Chief Financial Officer, Infosys BPO, Bangalore
Ms. Sadhna Dash, Ex-Senior Manager, HR, Microsoft India Corporation (Pvt.) Ltd.
Unit 1 Introduction
Structure:
1.1 Introduction to Statistics
Learning objectives
Importance of Statistics in modern business environment
1.2 Definition of Statistics
1.3 Scope and Applications of Statistics
1.4 Characteristics of Statistics
1.5 Functions of Statistics
1.6 Limitations of Statistics
1.7 Statistical Software
1.8 Summary
1.9 Terminal Questions
1.10 Answers to SAQs and TQs
Answers to Self Assessment Questions
Answers to Terminal Questions
1.11 References
1.1 Introduction
Welcome to the unit on Statistics. In this unit, you will study Statistics,
which deals with gathering, organising and analysing data.
Statistics plays an important role in almost every facet of human life. In the
business context, managers are required to justify decisions on the basis of
data. They need statistical models to support these decisions. Statistical skills
enable managers to collect, analyse and interpret data and make relevant
decisions. Statistical concepts and statistical thinking enable them to:
Solve problems in almost any domain
Support their decisions
Reduce guesswork
1.1.1 Learning objectives
By the end of this unit, you should be able to:
Describe the scope of Statistics
Distinguish between statistical data and non-statistical data
Caselet 1
Mr. Ravi, the new General Manager of a manufacturing company, is
concerned about the dwindling profits of the company. The Marketing
and Production Managers identify the reason as the guarantee period
given to customers, since the product has to be replaced if it fails within
the guarantee period. This replacement lowers the company's profits and
also causes loss of reputation. The General Manager is now thinking in
terms of reducing the percentage of units that fail within a year. This
means that he should take action to improve the life of the unit. After
preliminary studies he decides to:
I. Estimate the average life of the units and their variation.
II. Take action to improve the life.
III. Lower the replacement cost as much as possible.
As you can see, what the General Manager is doing here is using Statistics
to solve a problem and to increase profits.
Decision making is a key part of our day-to-day life. Even when we wish to
purchase a television, we like to know the price, quality, durability, and
maintainability of various brands and models before buying one. As you can
see, in this scenario we are collecting data and making an optimum
decision. In other words, we are using Statistics.
Again, suppose a company wishes to introduce a new product. It has to
collect data on market potential, consumer preferences, availability of raw
materials and the feasibility of producing the product. Hence, data
collection is the backbone of any decision making process.
[Figure: Statistics is divided into Descriptive Statistics and Inferential Statistics]
Caselet 2
In a firm, the Human Resources Manager (HR Manager) calculates
the average salary of employees in the production department.
The statistical data collected relates only to the production
department and does not give any information about other
departments of the firm. Here, the HR Manager is using
descriptive statistics. In this example, the HR Manager displays
the summarised numerical data in the form of tables, charts, and
diagrams, which also comes under descriptive statistics.
Caselet 3
In a firm, the Human Resources Manager (HR Manager) uses the
average salary of employees in the production department to
estimate the average salary of employees in all other departments
of the firm. Here, the HR Manager is using inferential statistics, as
the estimation of averages belongs to inferential statistics.
1. Collection of Data
Careful planning is needed while collecting data. Different methods
are used for collecting data, such as the census method and the
sampling method. The investigator has to take care while selecting
an appropriate collection method.
1 Agarwal B L (2006) Basic Statistics, 4th ed., pp. 1-2, New Age International Publishers
2 Agarwal B L (2006) Basic Statistics, 4th ed., p. 1, New Age International Publishers
3 Agarwal B L (2006) Basic Statistics, 4th ed., p. 2, New Age International Publishers
4 Agarwal B L (2006) Basic Statistics, 4th ed., p. 2, New Age International Publishers
Example 1
The pie-chart in figure 1.5 represents the sales figures of SPQ Company
for the year 2008.
3. Analysis of Data
The data presented has to be carefully analysed before any inference is
made from it. The analysis can use various measures, for example,
measures of central tendency, dispersion, correlation and regression.
Measures of central tendency quantify the middle of the distribution.
Measures computed for a population are parameters; measures computed
for a sample are statistics, which are estimates of the population
parameters. The three most common measures of the centre of a
distribution are the mean, the median and the mode.
Measures of dispersion quantify the spread of the distribution. Range,
interquartile range, mean absolute deviation and standard deviation are
four measures of dispersion.
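The measures named above can be tried out with a short Python sketch. It is only an illustration: the unit lives in the list are hypothetical values, not data from this unit, and the standard library `statistics` module is assumed to be available (Python 3.8 or later for `quantiles`).

```python
import statistics

# Hypothetical lives of nine units, in months (illustrative values only).
lives = [10, 12, 12, 13, 14, 15, 15, 15, 20]

# Measures of central tendency
mean = statistics.mean(lives)
median = statistics.median(lives)
mode = statistics.mode(lives)

# Measures of dispersion
value_range = max(lives) - min(lives)
q1, q2, q3 = statistics.quantiles(lives, n=4)          # quartile cut points
iqr = q3 - q1                                          # interquartile range
mad = sum(abs(x - mean) for x in lives) / len(lives)   # mean absolute deviation
population_sd = statistics.pstdev(lives)   # data treated as the whole population
sample_sd = statistics.stdev(lives)        # data treated as a sample

print(mean, median, mode)
print(value_range, iqr, mad, population_sd, sample_sd)
```

Note that the standard deviation has two versions: `pstdev` gives the population parameter, while `stdev` gives the sample statistic that estimates it.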
4. Interpretation of Data
The final step is to draw conclusions from the analysed data.
Interpretation requires a high degree of skill and experience. For
example, data can be interpreted easily from pie-charts.
Example 2
The pie-chart in figure 1.6 shows the monthly expenses of 'family A'.
From the pie-chart, we can infer that Prasad's family spent the maximum on
food and spent equal amounts on fuel and miscellaneous items.
Thus, Statistics contains the tools and techniques required for the collection,
presentation, analysis and interpretation of data. This definition is precise
and comprehensive.
The data in table 1.1a can be condensed and presented as in table 1.1b by
counting how often each rating occurs and preparing a frequency table. In
this example, the frequency table was prepared from bulk data consisting of
50 rating scores. The frequency table is in a condensed and simple form.
From the tabled data, we can easily interpret that for the regional movie,
most of the customers gave a 7 rating (that is, 11 customers). Only two
customers gave a rating of 1 for the regional movie, which means only two
out of 50 customers surveyed liked the regional movie the most.
Table 1.1b. Frequency table
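The condensation of raw rating scores into a frequency table can be sketched in Python as follows. The scores below are hypothetical (the 50 scores of table 1.1a are not reproduced here), and `frequency_table` is a helper name chosen only for this illustration.

```python
from collections import Counter

# Hypothetical rating scores on a 1-10 scale (not the data of table 1.1a).
ratings = [7, 5, 7, 8, 1, 6, 7, 9, 5, 7, 10, 3, 7, 6, 8, 1, 5, 7, 4, 6]

# Count how often each rating occurs.
frequency = Counter(ratings)

def frequency_table(counts):
    """Return (rating, frequency) rows sorted by rating."""
    return sorted(counts.items())

rows = frequency_table(frequency)
print("Rating  Frequency")
for rating, count in rows:
    print(f"{rating:>6}  {count:>9}")

# The rating given by the largest number of customers.
most_common_rating, most_common_count = frequency.most_common(1)[0]
```

The condensed table has one row per distinct rating, so the most frequent rating can be read directly instead of being searched for in the raw list.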
Example 3
The curves in figure 1.7 and figure 1.8 show the profits of CBA
Company and ZYX Company respectively, for the period from 1998 to
2008. The profits are plotted on the Y-axis and the timeline in years
on the X-axis. From the graphs, we can compare the profits of the two
companies and conclude that the profits of CBA Company in the
year 2008 are higher than those of ZYX Company.
The curve in figure 1.7 shows that the profits of CBA Company are
increasing, whereas the profit curve in figure 1.8 is constant for
ZYX Company from the middle of the period (1998-2008).
Fig. 1.7: Profits of CBA
Fig. 1.8: Profits of ZYX
Hence, visual representation of numerical data helps you to compare the
data with less effort and make effective decisions.
3. Statistics brings out trends and tendencies in the data
After data is collected, the trends and tendencies in the data can be
analysed by using the various concepts of Statistics.
4. Statistics brings out the hidden relations between variables
Statistical analysis helps in drawing inferences from data and brings out
the hidden relations between variables.
5. Decision making becomes easier
With the proper application of Statistics and statistical software packages to
the collected data, managers can take effective decisions, which can
increase the profits of a business.
Minitab
Minitab is a statistical software package that was designed especially for
the teaching of introductory statistics courses. It is our view that an easy-
to-use statistical software package is a vital and significant component of
such a course. This permits the student to focus on statistical concepts
and thinking rather than computations or the learning of a statistical
package. The main aim of any introductory statistics course should
always be the why of statistics rather than technical details that do little to
stimulate the majority of students or, in our opinion, do little to reinforce
the key concepts.
Source: http://www.minitab.com
SPSS
SPSS Inc. technology encapsulates advanced mathematical and
statistical expertise to extract predictive knowledge that when deployed
into existing processes makes them adaptive to improve outcomes.
Our Predictive Analytics Software will help you:
Capture all the information you need about people's attitudes and
opinions
Predict the outcomes of interactions before they occur
Act on your insights by embedding analytic results into business
processes
Source: http://www.spss.com
EViews
EViews is a statistical software tool, which offers academic researchers,
corporations, government agencies, and students access to powerful
statistical, forecasting, and modeling tools through an innovative,
easy-to-use object-oriented interface.
EViews is the ideal package for anyone who works with time series,
cross-section, or longitudinal data. EViews offers an extensive array of
powerful features for data handling, statistics and econometric analysis,
forecasting and simulation, data presentation, and programming. EViews
generates forecasts or model simulations, and produces high quality
graphs and tables.
Source: http://www.eviews.com/
1.8 Summary
The decision making process becomes more efficient with the help of
Statistics. Statistics deals with aggregates of facts. Statistics is applied in
all fields of our activities. Statistical interpretation requires skilled and
experienced statisticians. Statistical data is numerical or quantitative data,
not qualitative data.
1.11 References
B. L. Agarwal, (2006) Basic Statistics, Fourth Edition, New Age
International Publishers
Rand R. Wilcox, (2009) Basic Statistics – Understanding Conventional
Methods and Modern Insights, Oxford University Press
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-1.pdf
2.1 Introduction
In Unit 1, 'Introduction', you studied Statistics and its definition. You also
studied the broad divisions of Statistics. You now have an idea about what
Statistics is, the characteristics of Statistics and the limitations of
Statistics. In this unit 2, 'Statistical Survey', you will study the collection
and analysis of numerical data.
When the population is large, it is hard to conduct a survey. In such
situations, a sample is drawn and studied to determine the characteristics of
the entire population from which the sample was taken. The primary
purpose of conducting a sample survey is to obtain certain information about
the population and to draw or infer valid conclusions about the
characteristics of the population.
We can define the term 'survey' as a measurement tool, which is used to
gather people's opinions. Surveys differ in terms of purpose, field of study,
scope, and the source of information. Surveys are used by companies to
assess the level of satisfaction their customers feel, to find out what
products their customers choose and to determine which target population is
buying their products. All the following activities require collection and
analysis of data in a systematic manner.
Formulation of a theory such as “Tobacco Consumption Leads to
Cancer”
Framing of policies according to existing nature of a population
Finding the relationship between characteristics of units in the
population
In other words, a search for knowledge by analysing numerical data is
known as Statistical Survey or Statistical Investigation.
2.1.1 Learning objectives
By the end of this unit, you will be able to:
Recall the definition of Statistical survey
Describe the activities involved in planning of a Statistical survey
Recall the definition of terms used in Statistics
Differentiate between sample and population
Differentiate between quantitative and qualitative characteristics
Describe the various methods of collecting data
Distinguish between primary and secondary data
Identify the sources of primary and secondary data
2.1.2 Definition of statistical survey
A Statistical survey is a scientific process of collection and analysis of
numerical data. Statistical surveys are used to collect numerical information
about units in a population. Surveys involve asking questions to individuals.
[Figure: A Statistical Survey has two stages, Planning and Execution]
Key Statistic
A parameter is a characteristic of a population. A population can have
many parameters.
A statistic is a characteristic of a sample. A sample can have many statistics.
2.3.3 Sample
A sample is a part or subset of the population. By studying the sample, you
can predict the characteristics of the entire population from which the
sample is taken. A value that describes a characteristic of a sample is
known as a statistic.
If the population is large, it is hard to collect data. Hence, a part of the
population is chosen to study the characteristics of the entire population.
The size of the sample can never be larger than the size of the population.
Proper care must be taken while choosing the samples. In figure 2.3, a
sample of three consumers is drawn from the entire population of eight
consumers.
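The difference between a parameter and a statistic can be illustrated with a small Python sketch that mirrors the three-out-of-eight situation. The spend figures are hypothetical, and simple random sampling without replacement is an assumption made for the illustration; the text does not say how the three consumers were chosen.

```python
import random
import statistics

# Hypothetical monthly spend of the eight consumers in the population.
population = [120, 95, 150, 80, 110, 135, 100, 90]

# Parameter: a characteristic computed on the whole population.
population_mean = statistics.mean(population)

# Draw a simple random sample of three consumers without replacement.
rng = random.Random(42)   # fixed seed so that the run is repeatable
sample = rng.sample(population, 3)

# Statistic: the same characteristic computed on the sample only.
sample_mean = statistics.mean(sample)

print("Population mean (parameter):", population_mean)
print("Sample:", sample)
print("Sample mean (statistic):", sample_mean)
```

The sample mean is used as an estimate of the population mean; a different sample would, in general, give a different estimate.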
2.3.4 Quantitative characteristic
A characteristic which is numerically measurable is called a quantitative
characteristic.
Quantitative data is data expressing a certain quantity, amount or range.
Usually, there are measurement units associated with the data, for example,
the height of a person in metres.
2.3.5 Qualitative characteristic
A characteristic which is not numerically measurable is called a qualitative
characteristic. Qualitative data is data describing the attributes or properties
that an object possesses.
Let us understand the basic terminologies of Statistics with the help of a
caselet.
Caselet 1
Consider the survey of the average number of children below 16 years
in a ward of a municipality. The number of houses in the ward is finite.
Therefore, the population is finite. The objects are households. The
characteristic measured is the number of children below 16 years in a
household. It is measurable and hence quantitative. On the other hand,
in a survey to find the total number of blind people in a locality, the
characteristic 'blindness' is qualitative.
2.3.6 Variable
In a population, some characteristics remain the same for all units and some
others vary from unit to unit. The quantitative characteristic that varies from
unit to unit is called a variable. The qualitative characteristic that varies from
unit to unit is called an attribute.
A variable that assumes only some specified values in a given range is
known as a discrete variable. A variable that can assume all the values in
the range is known as a continuous variable. For example, the number of
children per family and the number of petals in a flower are discrete
variables. The height and weight of persons are continuous
variables.
Key Statistic
A sample which consists of the entire population is called a census.
Table 2.1 shows the merits and demerits of the direct personal observation
method.
Table 2.1: Merits and demerits of direct personal observation
Merits
1. We get the original data, which is more accurate and reliable.
2. Satisfactory information can be extracted by the investigator
through indirect questions.
3. Data is homogeneous and comparable.
4. Additional information can be gathered.
5. Misinterpretation of questions can be avoided.
Demerits
1. This method is more costly.
2. This method consumes more time.
3. This method cannot be used when the scope of investigation is wide.
4. Most of the data collected through this method is kept confidential.
Hence, there is a chance of leakage of data.
There are different types of questions that can be used in a questionnaire.
A questionnaire can have Contingency questions, Matrix questions, Closed
ended questions and Open ended questions. Let's have a look at each one
in detail.
Contingency questions are questions that are answered only if the
respondent gives a particular response to a previous question. This
avoids asking people questions that do not apply to them.
Matrix questions are questions which are placed one under the other,
forming a matrix. The response categories are placed along the top and
a list of questions is placed down the side. This makes efficient use of
page space and of the respondents' time.
Closed ended questions are those where the respondents' answers are
limited to a fixed set of responses. Usually scales are closed ended.
There are various types of closed ended questions.
Yes/no questions – here the respondents answer with “yes” or “no”. Some
of the examples are:
Multiple choices – here the respondents have several options from which to
choose. For example:
Example 1
The sun rises in which direction?
East [ ]
West [ ]
North [ ]
South [ ]
Example 2
Read the following statement and then indicate by a tick whether you
strongly agree, agree, disagree or strongly disagree with the statement.
“Tasks when organised and prioritised take less time to complete.”
1. Strongly Agree [ ]
2. Agree [ ]
3. Disagree [ ]
4. Strongly Disagree [ ]
Open ended questions are those questions for which the respondent
supplies their own answer without any fixed set of possible responses.
Examples of types of open ended questions include:
Sentence completion – In these, respondents complete an incomplete
sentence.
Example 3
Complete the sentence below.
“I like the management courses offered by Sikkim Manipal
University because ...”.
With secondary data, people have to compromise between what they want
and what they are able to find.
The differences between primary and secondary data are listed in
table 2.3.
Table 2.3: Differences between primary and secondary data
Primary Data                                              Secondary Data
1. Data is original and thus more accurate and reliable.  1. Data is not reliable.
2. Gathering data is expensive.                           2. Gathering data is cheap.
3. Data is not easily accessible.                         3. Data is easily accessible through internet or other resources.
4. Most of the data is homogeneous.                       4. Data is not homogeneous.
5. Collection of data requires more time.                 5. Collection of data requires less time.
6. Extra precautionary measures need not be taken.        6. Data needs extra care.
7. Data gives detailed information.                       7. Data may not be adequate.
2.6 Summary
A Statistical survey is a search for knowledge. There are two main stages in
any Statistical survey - planning and execution. Planning a Statistical survey
encompasses the following issues.
i) The nature of problem
ii) The objectives
iii) The scope
iv) Statistical units
v) The degree of accuracy
vi) The time period
vii) The source of information
viii) The organisation
The collected data should be edited for completeness, accuracy and
consistency, and then analysed and interpreted. A sample is a subset of
the population. A sample can never be larger than the population from
which it was taken.
A quantitative characteristic is a characteristic which is numerically
measurable; otherwise it is a qualitative characteristic. The quantitative
characteristic that varies from unit to unit is called a variable. The qualitative
characteristic that varies from unit to unit is called an attribute.
There are two categories of data - primary and secondary data. Primary
data is collected directly from the respondents whereas secondary data is
collected through agencies.
The various methods of collecting primary data are:
Direct personal observation
Indirect oral interview
Information through agencies
Information through mailed questionnaires
Information through schedule filled by investigators
Questionnaires must be structured well and must not be ambiguous. A
covering letter must be included along with the questionnaire. A pilot survey
is a beneficial method when prior information about the survey does not exist
or when the results of the survey are needed quickly.
2.9 References
B. L. Agarwal, (2006) Basic Statistics, Fourth Edition, New Age
International Publishers
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
Rand R. Wilcox, (2009) Basic Statistics – Understanding Conventional
Methods and Modern Insights, Oxford University Press
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-3.pdf
3.1 Introduction
In unit 2, 'Statistical Survey', you studied surveys and different
methods of collecting data. In this unit 3, 'Classification, Tabulation and
Presentation of Data', you will learn about the simplification of collected
Sikkim Manipal University Page No. 40
Statistics for Management Unit 3
data. You will also learn about some methods for graphical summarisation
of data that reveal certain patterns.
Collected data in raw form is voluminous and incomprehensible.
Therefore, it should be condensed and simplified for better
understanding and usefulness.
Classification is the first stage in simplification. It can be defined as a
systematic grouping of the units according to their common characteristics.
Each group is called a class.
For example, in a survey of industrial workers of a particular industry,
workers can be classified as unskilled, semi-skilled and skilled, each of
which forms a class.
3.1.1 Learning Objectives
By the end of this unit, you should be able to:
Describe the functions and methods of classification
Identify the parts of table
Describe the functions of tabulation
Calculate the frequency and frequency distribution for the data
Display the numerical data as graphical representation
Example 1
The data displayed in figure 3.1 is the number of students who have
secured more than 60% in various sub-modules of statistics. This can
be classified using the one-way classification method.
Two-way classification
Classification done according to two attributes or variables is known as two-
way classification.
Example 2
The data displayed in figure 3.2 is the classification, according to gender,
of students who have secured more than 60% in the respective sub-modules
of statistics. In the sub-module titled 'Basic Concepts', ten students got
more than 60%. Out of these ten students, four are males and six are females.
Manifold classification
Classification done according to more than two attributes or variables is
known as manifold classification.
Example 3
Figure 3.3 shows the classification of employees according to skill,
sex and education.
3.3 Tabulation
Tabulation follows classification. It is a logical or systematic listing of related
data in rows and columns. The row of a table represents the horizontal
arrangement of data and column represents the vertical arrangement of
data. The presentation of data in tables should be simple, systematic and
unambiguous.
Classification                              Tabulation
It is the basis for tabulation.             It is the basis for further analysis.
It is the basis for simplification.         It is the basis for presentation.
Data is divided into groups and             Data is listed according to a logical
sub-groups on the basis of similarities     sequence of related characteristics.
and dissimilarities.
(A = Under Graduate, B = Graduate, C = Post Graduate)
                 Age 20 – 40          Age 40 and Above
Departments      A      B      C      A      B      C      Total
Accounts         10     40     10     10     15     5      90
Finance          10     30     10     12     14     7      83
Personal         15     25     10     10     14     5      79
Production       10     30     10     8      12     6      76
Marketing        5      25     10     0      15     7      62
Total            50     150    50     40     70     30     390
                 Age
Departments      20 – 40      40 & above
Accounts         2.564        1.282
Finance          2.564        1.795
Personal         3.846        1.282
Production       2.564        2.051
Marketing        1.282        1.795
Total            12.820       8.205
Construction
Different types of tables under this classification of tables are:
1. Simple table
A simple table presents only one characteristic. Table 3.5 is a simple
table.
2. Complex table
A complex table presents two or more characteristics. Table 3.6 is a
complex table.
3. Cross-classified table
In a cross-classified table, the entries are classified in both directions. An
example of a cross-classified table is illustrated in table 3.7.
         Defects
Batch    Major    Minor
I        8        7
II       15       5
III      25       15
Total    48       27
Table 3.7: Population of a city according to age, sex and education during
2003 to 2005
                  Age: Educated                               Age: Not Educated
Years   Sex       Below 20 yrs  20 – 40  Above 40  Total     Below 20 yrs  20 – 40  Above 40  Total
2003    Male
        Female
2004    Male
        Female
2005    Male
        Female
Key Statistic
Class intervals are of two types: exclusive and inclusive. A class
interval that does not include the upper class limit is called an exclusive
type of class interval. A class interval that includes the upper class limit
is called an inclusive type of class interval.
In table 3.10, the class '0 – 9' includes the value '9'. In table 3.11, the class
'0 – 10' does not include the value '10'. If the value '10' occurs, it is
included in the class '10 – 20'.
Table 3.10: Inclusive type class interval
Marks Number of Students
0–9 15
10 – 19 20
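The rule for placing a value in an exclusive or an inclusive class can be sketched in Python. This is a minimal illustration that assumes whole-number marks and a class width of 10; the function names are chosen only for this example.

```python
def exclusive_class(value, width=10):
    """Class 'a - b' holds a <= value < b, so the upper limit is excluded."""
    lower = (value // width) * width
    return f"{lower} - {lower + width}"

def inclusive_class(value, width=10):
    """Class 'a - b' holds a <= value <= b (whole-number values assumed)."""
    lower = (value // width) * width
    return f"{lower} - {lower + width - 1}"

# A mark of 10 falls in '10 - 20' under the exclusive scheme,
# while a mark of 9 falls in '0 - 9' under the inclusive scheme.
print(exclusive_class(10))
print(inclusive_class(9))
```

Under the exclusive scheme a value equal to an upper limit always moves to the next class, so no value can belong to two classes.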
Key Statistic
If the class interval does not prescribe lower limit for first class or upper
limit for the last class, then it is known as open-end class interval.
Tally marks
Tally marks are used to construct a frequency table. A tally mark is a small
vertical line drawn against a class as soon as we observe a value belonging
to the class. Every fifth tally mark is drawn across the previous four for
easy counting. Table 3.15 represents the marks secured in mathematics by
the students of a class.
Example 4
From table 3.15, we can see that ten students got 90 marks in
mathematics, six students got 82 and seven got 75.
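The tallying procedure can be sketched in Python using the three frequencies quoted in Example 4. The text rendering of a bundle of five as `||||/` is a convention adopted only for this illustration, and the order of the raw marks is assumed.

```python
from collections import Counter

def tally(count):
    """Render a count as tally marks, bundling every five marks together."""
    bundles, rest = divmod(count, 5)
    groups = ["||||/"] * bundles + (["|" * rest] if rest else [])
    return " ".join(groups)

# Raw marks rebuilt from the frequencies stated in Example 4.
marks = [90] * 10 + [82] * 6 + [75] * 7

frequency = Counter(marks)
for mark in sorted(frequency, reverse=True):
    print(f"{mark:>5}  {tally(frequency[mark]):<12}  {frequency[mark]}")
```

Counting the bundles of five and then the remaining strokes gives the frequency of each class without recounting individual marks.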
Solution: The simple bar diagram in figure 3.5 shows the yield of paddy in
Karnataka.
Solved Problem 5: Create a multiple bar diagram for the data represented
in the table 3.17.
Table 3.17. Product A
Year           Cost of Manufacturing / Unit    Revenue / Unit
2002 – 2003    40                              70
2003 – 2004    45                              85
2004 – 2005    55                              90
Solution: The multiple bar diagram in figure 3.6 shows the cost and
revenue per unit.
Fig. 3.6: Multiple bar diagram showing the cost and revenue per unit
Key Statistic
It is easier to draw the bar diagram, if we first find the cumulative total for
each section.
360 R
A
T
Solved Problem 7: Draw a pie-diagram for the data in table 3.19, regarding
the expenses of Prasad's family and Krishna's family.
Table 3.19. Monthly expenses of two families
         Monthly Expenses of
Items    Prasad's Family    Krishna's Family
Food     2000               4000
Rent     1000               1500
Fuel     500                1000
Misc     500                1500
Total    4000               8000
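We draw two circles with radii 1.3 cm and 1.8 cm (where 1 cm = 50 units).
These radii are proportional to the square roots of the totals, since
the square roots of 4000 and 8000 are about 63 and 89 units.
The angles at the centre are determined and represented in table 3.20.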
         Angle at the centre (in degrees)
Items    Prasad's Family    Krishna's Family
Food     180                180
Rent     90                 67.5
Fuel     45                 45
Misc     45                 67.5
Total    360                360
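The angles and radii used in this solved problem can be checked with a short Python sketch. The expense figures come from table 3.19; the rule that the radius is proportional to the square root of the total is an inference that happens to reproduce the stated radii, and the function names are chosen only for this illustration.

```python
import math

# Monthly expenses from table 3.19.
expenses = {
    "Prasad": {"Food": 2000, "Rent": 1000, "Fuel": 500, "Misc": 500},
    "Krishna": {"Food": 4000, "Rent": 1500, "Fuel": 1000, "Misc": 1500},
}

def sector_angles(items):
    """Angle at the centre for each item: (item value / total) x 360."""
    total = sum(items.values())
    return {name: value / total * 360 for name, value in items.items()}

def radius_cm(items, units_per_cm=50):
    """Radius taken as the square root of the total, scaled at 1 cm = 50 units."""
    return math.sqrt(sum(items.values())) / units_per_cm

angles = {family: sector_angles(items) for family, items in expenses.items()}
radii = {family: round(radius_cm(items), 1) for family, items in expenses.items()}

print(angles)
print(radii)
```

Because the area of a circle grows with the square of its radius, taking the radius proportional to the square root of the total keeps the areas of the two pies proportional to the two totals.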
Solution: The figure 3.10 displays the histogram for the distribution of age
data.
We join the upper left corner of the highest rectangle to the upper left
corner of the rectangle adjacent on the right, and the upper right corner of
the highest rectangle to the upper right corner of the rectangle adjacent on
the left. From the point where these lines intersect, we draw a perpendicular
to the X-axis. The X-reading at that point gives the mode of the distribution.
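The graphical construction above has an algebraic equivalent: solving for the point where the two lines cross gives the mode as L + (f1 - f0) / (2 f1 - f0 - f2) x h, where L is the lower limit of the highest rectangle, h its width, f1 its frequency, and f0 and f2 the frequencies of the rectangles to its left and right. The symbols and the class values in the sketch below are introduced only for illustration, and equal class widths are assumed.

```python
def mode_from_histogram(lower, width, f_prev, f_modal, f_next):
    """X-reading where the two diagonals of the highest rectangle cross.

    lower, width : lower limit and width of the modal (highest) class
    f_prev, f_next : frequencies of the classes to its left and right
    f_modal : frequency of the modal class
    """
    d1 = f_modal - f_prev   # rise over the left neighbour
    d2 = f_modal - f_next   # rise over the right neighbour
    if d1 + d2 == 0:
        raise ValueError("no unique mode: neighbours are as high as the modal class")
    return lower + d1 / (d1 + d2) * width

# Hypothetical histogram: modal class 30-40 with frequency 20,
# neighbouring classes with frequencies 12 (left) and 16 (right).
mode = mode_from_histogram(30, 10, 12, 20, 16)
print(round(mode, 2))
```

The mode is pulled towards the taller neighbour: here the right-hand class is higher than the left-hand one, so the mode lies to the right of the class mid-point.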
If the widths of the rectangles are not equal, we make the areas of the
rectangles proportional to the frequencies and then draw the histogram.
This is explained in solved problem 9.
Solved Problem 9: Suppose we have the frequency distribution shown in
table 3.22a. Draw a histogram for the data.
Table 3.22a. Frequency distribution data for solved problem 9
Age Frequency
0-10 5
10-30 20
30-60 45
60-70 12
70-90 16
Solution: From table 3.22a, we can see that the class intervals are
unequal. Hence, the class intervals are made equal and the frequencies are
adjusted accordingly. For the class interval 10-30:
Divide the class interval into two equal class intervals
Calculate the adjusted frequency by dividing the frequency of that class
interval by 2
Similarly, follow the procedure for the other unequal class intervals. Then,
we can construct the histogram with the adjusted frequencies. Table 3.22b
represents the class intervals along with the adjusted frequencies.
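The adjustment described above can be sketched in Python with the data of table 3.22a. The base width of 10 (the narrowest class) is an assumption consistent with splitting the class 10-30 into two parts, and `adjusted_frequencies` is a name chosen only for this illustration.

```python
# Class intervals and frequencies from table 3.22a.
classes = [((0, 10), 5), ((10, 30), 20), ((30, 60), 45), ((60, 70), 12), ((70, 90), 16)]

def adjusted_frequencies(rows, base_width=10):
    """Divide each frequency by the number of base-width parts in its class."""
    adjusted = []
    for (lower, upper), frequency in rows:
        parts = (upper - lower) / base_width   # e.g. 10-30 has 2 parts
        adjusted.append(((lower, upper), frequency / parts))
    return adjusted

for (lower, upper), height in adjusted_frequencies(classes):
    print(f"{lower}-{upper}: {height}")
```

The adjusted frequency is the height of each rectangle, so the area (height times width) of every rectangle stays proportional to the original frequency.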
The figure 3.11 displays the histogram for the distribution of age data when
the class intervals are irregular.
3.6.4 Ogives
An ogive is obtained by drawing the graph of a cumulative frequency
distribution. Hence, ogives are also called cumulative frequency curves.
Solution: The figure 3.14 displays the ogive curve for the data related to
wage distribution of workers.
Key Statistic
With the help of an ogive, we can find all positional values of a
distribution. An ogive curve gives, at a glance, percentage of readings
that lie above or below a specified value.
3.7 Summary
For better understanding and usefulness, the collected data is classified in a
systematic manner according to common characteristics. Classification
simplifies and makes data more comprehensible and renders the data ready
for statistical analysis.
Classified data is tabulated in rows and columns for presentation, using
various types of classification. The tabulated data should be simple and
unambiguous, which should be understood and interpreted easily.
Frequency distribution is a special type of tabulation. In more concise form,
it brings out the salient features of the distribution.
Data presented in diagrammatic or graphical form is more appealing and
gives busy executives a rough idea of the situation.
Graphical data is visual representation of data in the form of line diagrams,
pie-charts, histograms, frequency polygons, frequency curves, or ogives.
50 72 61 64 72 62 61 56 75 55
52 71 54 64 71 64 59 59 70 54
60 60 57 57 66 68 60 62 68 54
62 65 58 64 65 60 60 67 58 56
70 62 60 68 64 62 59 69 52 58
2. A junior executive of XYZ Company has prepared a budget for a new
division of the company. The budget data is shown in table 3.24. The vice
president of the company wants to see a summary of the budget in
diagrammatic form. Prepare a pie diagram.
Table 3.24. Budget of XYZ Company
3. ABC Ice cream Company attempts to keep all ten of its flavours of ice
cream in stock at each of its stores. The in-charge of stores operations
collects data to the nearest half gallon on the daily amount of each
flavour.
i. Is the flavour classification discrete or continuous? Open or closed?
ii. Is the data collected qualitative or quantitative?
iii. Is the amount collected on each flavour discrete or continuous?
2. The table 3.28 displays the data required to construct the pie-chart
(figure 3.15) for the budget data of an XYZ Company.
Table 3.28. Budget of an XYZ Company
3.
i. Discrete and closed
ii. Flavour is qualitative. Volume is quantitative
iii. Continuous
4. The figure illustrates the histogram diagram for the data in terminal
question 4.
5. The figure 3.17 is the ogive curve for the data given in terminal
question 5.
i. 16% ii. 47%
3.10 References
B.L. Agarwal, (2006) Basic Statistics, Fourth Edition, New Age
International Publishers
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-5.pdf
4.1 Introduction
In Unit 3, ‘Classification, Tabulation and Presentation of Data’, you
studied data classification and the representation of data in tables and
graphs. In this Unit 4, ‘Measures Used to Summarise Data’, you will study
the measures used to summarise data, such as mean, median and mode.
Graphical representation is a good way to represent summarised data.
However, graphs provide only an overview and thus may not be usable for
further analysis. Hence, we use summary statistics, such as averages,
to analyse the data. Mass data, which is collected, classified, tabulated and
presented systematically, is analysed further to reduce it to a single
representative figure. This single figure is a measure found
at the central part of the range of all values. It is the one which represents the
entire data set. Hence, it is called the measure of central tendency.
In other words, the tendency of data to cluster around a figure which is in
central location is known as central tendency. Measure of central tendency
or average of first order describes the concentration of large numbers
around a particular value. It is a single value which represents all units.
4.1.1 Learning objectives
By the end of this unit, you will be able to:
Describe the concept of statistical average
Calculate arithmetic mean for discrete and continuous data
Calculate median and mode of data
Calculate quartiles, deciles and percentiles for the statistical data
Compute coefficient of variation for the statistical data
4.1.2 Objectives of statistical average
The statistical average or simply an average refers to the measure of middle
value of the data set. The objectives of statistical average are to:
Present mass data in a concise form
The mass data is condensed to make the data readable and to use it for
further analysis.
Facilitate comparison
It is difficult to compare two different sets of mass data. But we can
compare those two after computing the averages of individual data sets.
While comparing, the same measure of average should be used. It leads
Sikkim Manipal University Page No. 74
Statistics for Management Unit 4
Key Statistic
For discrete data, the arithmetic mean is given by:
$$\bar{X} = \frac{\sum X_i}{n}$$
Key Statistic
For discrete data with frequency, the arithmetic mean is given by:
$$\bar{X} = \frac{\sum f_i X_i}{\sum f_i}$$
Solved Problem 1: Find out the arithmetic mean of 15, 17, 22, 21, 19,
26, 20?
Solution: The arithmetic mean $\bar{X}$ is given by:
$$\bar{X} = \frac{15 + 17 + 22 + 21 + 19 + 26 + 20}{7} = \frac{140}{7} = 20$$
Therefore, the arithmetic mean is 20.
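The calculation above can be checked with a few lines of Python. This is only a sketch: the helper name `weighted_mean` is an illustrative choice, not part of the course material, and it implements the frequency form of the mean, sum(f·X) / sum(f).

```python
from statistics import mean

# Values from Solved Problem 1.
values = [15, 17, 22, 21, 19, 26, 20]

# Simple arithmetic mean: sum of the values divided by their count.
simple_mean = sum(values) / len(values)


def weighted_mean(xs, fs):
    """Arithmetic mean of values xs with frequencies fs: sum(f*x) / sum(f)."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)


print(simple_mean)
print(mean(values))  # the standard library gives the same figure
```

When every frequency is 1, `weighted_mean` reduces to the simple mean.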
Solved Problem 2: The data in table 4.1 shows the number of students with
respect to age. Calculate the arithmetic mean of the students’ ages.
Key Statistic
For continuous series, the arithmetic mean is given by:
$$\bar{X} = A + \frac{\sum fd}{\sum f} \times C.I.$$
where,
$$d = \frac{X - \text{Assumed Mean}}{\text{Width of Class Interval}}$$
C.I. is the width of class-interval
X is the mid value of the class
A is the Assumed Mean
Solved Problem 3: The table 4.2a shows the distribution of data of number
of students according to height. The table 4.2b shows the frequency table.
Find the arithmetic mean of the height of students.
Table 4.2a. Number of students with respect to height
Middle value (X)    f      d = (X - 155)/10    fd
145                 50     -1                  -50
155                 65      0                    0
165                 80      1                   80
175                 55      2                  110
Total               250                        140

$$\bar{X} = A + \frac{\sum fd}{\sum f} \times C.I. = 155 + \frac{140}{250} \times 10 = 155 + 5.6 = 160.6 \text{ cm}$$
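As a cross-check of the step-deviation method, the following sketch recomputes the mean height from the mid values and frequencies of the worked table. The function name `step_deviation_mean` and its parameter names are assumptions made for illustration.

```python
def step_deviation_mean(mids, freqs, assumed_mean, width):
    """Mean of grouped data: A + (sum(f*d) / sum(f)) * C.I., with d = (X - A) / C.I."""
    d = [(x - assumed_mean) / width for x in mids]
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    return assumed_mean + (sum_fd / sum(freqs)) * width


# Mid values and frequencies from Solved Problem 3.
mids = [145, 155, 165, 175]
freqs = [50, 65, 80, 55]

mean_height = step_deviation_mean(mids, freqs, assumed_mean=155, width=10)
print(round(mean_height, 2))
```

The choice of assumed mean does not change the result; any mid value can be used for `assumed_mean`.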
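$$n_1 = 30, \quad \bar{X}_1 = 158, \quad n_2 = 40, \quad \bar{X}_2 = 162$$
$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} = \frac{30 \times 158 + 40 \times 162}{30 + 40} \approx 160.28 \text{ cms}$$
The average height of the combined group is 160.28 cms.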
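Solved Problem 5: From solved problem 4, if you are given any 4 values
among $n_1$, $n_2$, $\bar{X}_1$, $\bar{X}_2$ and $\bar{X}$, you can find the fifth value. Suppose
$\bar{X}_1$ is not given. Then,
$$30 \bar{X}_1 + 40 \times 162 = 160.28 \times 70$$
$$30 \bar{X}_1 = 160.28 \times 70 - 6480$$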
Solution: In the table 4.3a, the values given for the column ‘number of
students’ are in cumulative frequency form. Now, we have to convert them
to a frequency distribution. The calculated values are shown in table 4.3b.
Class interval    Mid X    d     f      fd
0 – 10             5       –3     4     –12
10 – 20           15       –2    12     –24
20 – 30           25       –1     4      –4
30 – 40           35        0    45       0
40 – 50           45        1    20      20
50 – 60           55        2    12      24
60 – 70           65        3     3       9
Total                           100      13
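Mid X      d = (X – A)/C.I.    f         fd
90         –1                  8         –8
110 = A     0                  f          0
130         1                  26        26
150         2                  14        28
170         3                  10        30
Total                          58 + f    76

$$\bar{X} = A + \frac{\sum f_i d_i}{\sum f_i} \times C.I.$$
$$129 = 110 + \frac{76}{58 + f} \times 20 \quad \Rightarrow \quad 19 = \frac{76 \times 20}{58 + f}$$
that is,
$$19f + 1102 = 1520$$
$$f = 22$$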
Hence, the missing frequency is 22.
4.4 Median
Median of a set of values is the middle most value when
they are arranged in ascending order of magnitude. Median is denoted
by ‘M’. In case of discrete series without or with frequency, it is given by:
$$M = \left(\frac{n + 1}{2}\right)^{\text{th}} \text{ value}$$
Key Statistic
To solve problems on median:
i. Arrange the data in ascending order or descending order
ii. Make class-interval as exclusive type
Solved Problem 10: Find the median value of the following set of values
45, 32, 31, 46, 40, 28, 27, 37, 36, 41, 47, 50.
Solution: Arranging in ascending order, we get:
27, 28, 31, 32, 36, 37, 40, 41, 45, 46, 47, 50
we have, n = 12
$$\text{Median} = \left(\frac{12 + 1}{2}\right)^{\text{th}} \text{ value} = 6.5^{\text{th}} \text{ value} = \frac{37 + 40}{2} = 38.5$$
The median for the given set of values is 38.5.
Solved Problem 11: Find the median value for the data shown in table
4.6a.
Table 4.6a. Data for solved problem 11
X 12 16 10 14 17 20 15
f 4 9 3 5 4 2 10
X f Cumulative frequency
10 3 3
12 4 7
14 5 12
15 10 22
16 9 31
17 4 35
20 2 37
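$$\left(\frac{n + 1}{2}\right)^{\text{th}} \text{ value} = \left(\frac{37 + 1}{2}\right)^{\text{th}} \text{ value} = 19^{\text{th}} \text{ value}$$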
Therefore, the median, M is 15.
Key Statistic
In case of continuous series, median M is given by:
$$M = \text{Lower limit of median class} + \frac{n/2 - Cf_p}{f_c} \times C.I.$$
where,
Cfp = Cumulative frequency up to previous class
fc = frequency of class
C.I. = Width of class interval
Solved Problem 12: Find the median of the data in table 4.7a.
Table 4.7a. Distribution of weight data for solved problem 12
Weight    Frequency    Cumulative Frequency
30-35     10           10
35-40     15           25
40-45     40 (fc)      65
45-50     27           92
50-55     8            100
$$M = \text{Lower limit of median class} + \frac{(n/2) - Cf_p}{f_c} \times C.I.$$
where,
Lower limit of median class = 40
Cfp = Cumulative frequency up to previous class = 25
fc = frequency of class = 40
C.I. = Width of class interval = 5
$$\text{Median} = 40 + \frac{50 - 25}{40} \times 5 = 43.125$$
Hence, the median weight is 43.125 kg.
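The median formula for a continuous series can be sketched in Python as follows. The function name `grouped_median` is hypothetical, and the sketch assumes exclusive class intervals of equal width, as in the weight data above.

```python
def grouped_median(lower_limits, freqs, width):
    """Median = L + ((n/2 - Cfp) / fc) * C.I. for equal-width exclusive classes."""
    n = sum(freqs)
    cumulative = 0  # cumulative frequency up to the previous class (Cfp)
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= n / 2:  # first class whose cumulative frequency reaches n/2
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must be positive")


# Weight classes 30-35, 35-40, 40-45, 45-50, 50-55 from Solved Problem 12.
lower_limits = [30, 35, 40, 45, 50]
freqs = [10, 15, 40, 27, 8]

median_weight = grouped_median(lower_limits, freqs, width=5)
print(median_weight)
```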
Solved Problem 13: Find the missing frequency for the data shown in table
4.8a, given that its median is 34.
Table 4.8a. Data for solved problem 13
$$\text{Median} = L.L. + \frac{N/2 - Cf_p}{f_c} \times C.I.$$
$$34 = 30 + \frac{(61 + f)/2 - (13 + f)}{20} \times 10$$
$$4 = \frac{61/2 + f/2 - 13 - f}{2} = \frac{35/2 - f/2}{2}$$
$$16 = 35 - f$$
f = 19
Therefore, the missing frequency is 19.
4.4.1 Merits and demerits of median
The table 4.9 displays the merits and demerits of median.
Table 4.9. Merits and demerits of median
Merits                                        Demerits
It can be easily understood and computed.     It is not based on all values.
It is not affected by extreme values.         It is not capable of further algebraic treatment.
It can be determined graphically (Ogives).
4.5 Mode
Mode is the value which has the highest frequency and is denoted by Z.
The modal value is most useful for business people. For example, shoe and
readymade garment manufacturers would like to know the modal size of
people to plan their operations. For discrete data with or without frequency,
it is the value corresponding to the highest frequency.
Solved Problem 14: The following data relate to size of shoes. Find the
mode.
6, 7, 6, 8, 9, 9, 9, 10, 8, 7, 7, 9, 10, 9, 9, 9, 8, 8, 11
Solution: Arranging the data in ascending order, data obtained is shown in
table 4.10.
Table 4.10. Frequency table for data in solved problem 14
Size Frequency
6 2
7 3
8 4
9 7
10 2
11 1
Key Statistic
In case of continuous series, mode is given by:
$$\text{Mode} = L.L. + \frac{f_m - f_p}{2f_m - f_p - f_s} \times C.I.$$
Where,
L.L. = lower limit of modal class
fm = frequency of modal class
fp = frequency of previous class
fs = frequency of succeeding class
C.I = width of class interval
Solution: We note that the intervals are exclusive type and the highest
frequency is 25. Therefore, the corresponding interval is 1200-1400, which
is called modal class.
$$\text{Mode} = L.L. + \frac{f_m - f_p}{2f_m - f_p - f_s} \times C.I.$$
Where,
L.L. = lower limit of modal class = 1200
fm = frequency of modal class = 25
fp = frequency of previous class = 15
fs = frequency of succeeding class = 12
C.I = width of class interval = 200
Therefore, the mode is calculated as:
$$\text{Mode} = 1200 + \frac{25 - 15}{2 \times 25 - 15 - 12} \times 200 = 1200 + \frac{2000}{23} = 1286.96$$
Hence, the modal plinth area is 1286.96 square feet.
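The mode formula for a continuous series is easy to evaluate once the modal class is known. The sketch below takes the five quantities of the formula as inputs; the function and parameter names are illustrative assumptions.

```python
def grouped_mode(lower_limit, f_modal, f_previous, f_succeeding, width):
    """Mode = L.L. + ((fm - fp) / (2*fm - fp - fs)) * C.I."""
    return lower_limit + (f_modal - f_previous) / (2 * f_modal - f_previous - f_succeeding) * width


# Modal class 1200-1400 with fm = 25, fp = 15, fs = 12 and C.I. = 200.
modal_area = grouped_mode(1200, 25, 15, 12, 200)
print(round(modal_area, 2))
```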
Solved Problem 16: The distributions shown in table 4.12 are the average
monthly balances of customers in a nationalised bank. The mode of the
distribution is 119. Find the total number of customers surveyed.
Solution: Let the missing frequency be ‘f’. The mode is given to be 119.
The modal class is 100 – 150, with fm = f, fp = 123, fs = 82 and C.I = 50.
$$119 = 100 + \frac{f - 123}{2f - 123 - 82} \times 50 \quad \Rightarrow \quad 119 = 100 + \frac{f - 123}{2f - 205} \times 50$$
Key Statistic
The empirical relationship between mean, median and mode:
Mean – Mode = 3 (Mean – Median)
which is same as,
Mode = 3 Median – 2 Mean.
$$GM = \sqrt[N]{X_1^{f_1} \cdot X_2^{f_2} \cdots X_n^{f_n}}$$
where,
$$N = f_1 + f_2 + \cdots + f_n$$
iii. In case of continuous series,
Solution: The data in table 4.15b is obtained from the data in table 4.15a.
Key Statistic
Whenever data deal with rates, ratios, growth rates, and so on, the
geometric mean is the best measure.
Geometric mean is not defined if even one of the values is zero or
negative.
Key Statistic
For discrete series with frequency, the harmonic mean is given by:
$$H.M. = \frac{N}{\sum (f_i / x_i)}$$
where $f_i$ is the frequency of the value $x_i$ and $N = \sum f_i$.
Solved Problem 19: Calculate the harmonic mean of 9.7, 9.8, 9.5, 9.4, 9.7.
Solution: The harmonic mean (HM) is calculated as:
Table 4.16. Calculation of harmonic mean
X        f/X
9.7      0.1031
9.8      0.1020
9.5      0.1053
9.4      0.1064
9.7      0.1031
Total    0.5199
$$HM = \frac{5}{0.5199} = 9.6172$$
Therefore, the harmonic mean is 9.6172.
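The same harmonic mean can be computed in Python, either from the definition or with the standard library function `statistics.harmonic_mean`. This is a sketch only; because it keeps full precision instead of rounding each reciprocal to four decimals, its result may differ slightly in the later decimal places from the tabulated calculation.

```python
from statistics import harmonic_mean

# Values from Solved Problem 19 (each with frequency 1).
values = [9.7, 9.8, 9.5, 9.4, 9.7]

# Definition: number of values divided by the sum of their reciprocals.
hm_by_definition = len(values) / sum(1 / x for x in values)

# Standard library equivalent.
hm_library = harmonic_mean(values)

print(round(hm_by_definition, 3), round(hm_library, 3))
```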
Key Statistic
Quartiles: When distribution is divided into four equal portions, then we
get first quartile (Q1), second quartile (Q2 = Median) and third quartile
(Q3) as the positional averages.
For discrete series with or without frequency, Q1 and Q3 are given by:
$$Q_1 = \left(\frac{N + 1}{4}\right)^{\text{th}} \text{ value}$$
$$Q_3 = \left(\frac{3(N + 1)}{4}\right)^{\text{th}} \text{ value}$$
Table 4.17a. Distribution of weight of 1st standard students
Class Interval 13 - 18 18 - 20 20 - 21 21 - 22 22 - 23 23 - 25 25 – 30
Frequency 22 27 51 42 32 16 10
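Solution: The table 4.17b displays the cumulative frequency distribution of data
for solved problem 21.
Table 4.17b. Cumulative frequency distribution of data for solved problem 21
Class Interval    Frequency    Cumulative Frequency    Remark
13 - 18           22           22
18 - 20           27           49                      P20 class
20 - 21           51           100                     Q1 class and Q2 class
21 - 22           42           142                     D7 class
22 - 23           32           174                     Q3 class
23 - 25           16           190
25 - 30           10           200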
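N = 200
$$Q_1 = \left(\frac{N}{4}\right)^{\text{th}} \text{ value} = 50^{\text{th}} \text{ value}$$
$$Q_1 = 20 + \frac{50 - 49}{51} \times 1 = 20.02$$
$$Q_2 = \left(\frac{N}{2}\right)^{\text{th}} \text{ value} = 100^{\text{th}} \text{ value}$$
$$Q_2 = 20 + \frac{100 - 49}{51} \times 1 = 21$$
$$Q_3 = \left(\frac{3N}{4}\right)^{\text{th}} \text{ value} = 150^{\text{th}} \text{ value}$$
$$Q_3 = 22 + \frac{150 - 142}{32} \times 1 = 22.25$$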
Therefore the quartiles Q1, Q2, and Q3 are 20.02, 21 and 22.25.
Key Statistic
To find a decile, divide N by 10 and multiply by the required decile number.
Solved Problem 22: Find the 7th decile for the same data given in solved
problem 21.
Solution: The 7th decile is given by:
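$$D_7 = \left(\frac{7N}{10}\right)^{\text{th}} \text{ value} = \frac{7 \times 200}{10} = 140^{\text{th}} \text{ value}$$
$$D_7 = 21 + \frac{140 - 100}{42} \times 1 = 21.95$$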
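Quartiles, deciles and percentiles of a continuous series all use the same interpolation inside the class that contains the required position. The sketch below applies it to the class intervals and frequencies of Table 4.17a; the function name `grouped_quantile` and its parameters are assumptions for illustration.

```python
def grouped_quantile(lower_limits, widths, freqs, position):
    """Locate the class containing the given position and interpolate within it."""
    cumulative = 0  # cumulative frequency up to the previous class
    for lower, width, f in zip(lower_limits, widths, freqs):
        if cumulative + f >= position:
            return lower + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("position exceeds total frequency")


# Classes 13-18, 18-20, 20-21, 21-22, 22-23, 23-25, 25-30 (unequal widths).
lower_limits = [13, 18, 20, 21, 22, 23, 25]
widths = [5, 2, 1, 1, 1, 2, 5]
freqs = [22, 27, 51, 42, 32, 16, 10]
n = sum(freqs)

q1 = grouped_quantile(lower_limits, widths, freqs, n / 4)
q2 = grouped_quantile(lower_limits, widths, freqs, n / 2)
q3 = grouped_quantile(lower_limits, widths, freqs, 3 * n / 4)
d7 = grouped_quantile(lower_limits, widths, freqs, 7 * n / 10)
print(round(q1, 2), round(q2, 2), round(q3, 2), round(d7, 2))
```

Passing `k * n / 100` as the position gives the k-th percentile in the same way.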
42
Key Statistic
To find a percentile, divide N by 100 and multiply by the required
percentile number.
Solved Problem 23: For the solved problem 21, find the 20th percentile.
Solution: The 20th percentile is given by:
Key Statistic
Suppose the values x1, x2, …, xn are assigned the weights w1, w2, …, wn;
then their weighted average is given by:
$$\bar{X}_w = \frac{\sum W x}{\sum W}$$
and their weighted Geometric Mean is given by:
$$G_w = \text{antilog}\left(\frac{\sum W \log x}{\sum W}\right)$$
where, ‘W’ acts as frequency
Solved Problem 24: A professor assigns 5, 10, 10, 20, as weights for
assignments, presentations, first test and final test respectively. Moni and
Mani got the percentages in the above categories as shown in table 4.18.
Find the weighted percentage.
Table 4.18. Percentages of assignment weightages
4.10 Dispersion
Dispersion describes another characteristic of a distribution. Consider the two
distributions of weights of a product produced by two machines, shown in
table 4.19.
Table 4.19. Distribution of weights of a product
Machine A B
Sample size 1000 1000
Average weight 80 80
Minimum weight 20 40
Maximum weight 140 100
Machine ‘B’ produces products with weights much closer to the average
than Machine ‘A’. As a manufacturer or customer, we would choose
Machine ‘B’. In other words, we choose the machine whose spread is
smaller.
The property of deviations of values from the average is called dispersion or
variations. The degree of variations is found by the measures of variations.
They are:
1. Range (R)
2. Quartile Deviations (Q.D)
3. Mean Deviations (M.D)
4. Standard Deviations (S.D)
Merits                                             Demerits
It is easily understood and simple to calculate.   It is affected by extreme values.
It is rigidly defined.                             It is not based on all values. It uses extreme values only.
Range is used:
In Statistical Quality control
When the study does not require deep analysis
When data has no abnormal values
Solved Problem 25: Find the range of the following discrete series 26, 28,
28, 26, 28, 30, 27, 29, 26, 24
Solution: The range ‘R’ is calculated as:
R=H-L
where,
H: Highest value
L: Lowest value
R = 30 – 24 = 6
Therefore, the range of the given discrete series is 6.
Solved Problem 26: Find the range for the continuous series of data shown
in table 4.21.
Table 4.21. Frequency table for data of solved problem 26
Key Statistic
Range is not defined if the class intervals are open.
Key Statistic
1. Q3-Q1 is called inter quartile range.
2. Q3-Q1 gives the middle 50% of reading. Q3 and Q1 are also known
as upper and lower limit of middle 50% of readings.
3. Quartile range is not capable of further algebraic treatment.
Solved Problem 27: Compute the inter quartile range, Q.D and coefficient
of Q.D for the age distributions shown in table 4.22a.
Table 4.22a. Age distributions
Age (Years) 18 21 22 24 27 30 32
Frequency 7 13 20 36 14 8 2
Solution: The table 4.22b shows the cumulative frequency distributions for
the age distributions.
Table 4.22b. Cumulative frequency table for the age distributions
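$$Q_1 = \left(\frac{100 + 1}{4}\right)^{\text{th}} \text{ value} = 25.25^{\text{th}} \text{ value}$$
$$Q_1 = 22$$
$$Q_3 = \left(\frac{3(100 + 1)}{4}\right)^{\text{th}} \text{ value} = 75.75^{\text{th}} \text{ value}$$
$$Q_3 = 24$$
Therefore, the inter quartile range, Q3 – Q1 = 24 – 22 = 2 Yrs.
$$Q.D. = \frac{24 - 22}{2} = 1 \text{ year}$$
$$\text{Coefficient of Q.D.} = \frac{24 - 22}{24 + 22} = \frac{2}{46} = 0.0435$$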
The table 4.23 shows the merits and demerits of quartile deviations.
Table 4.23. Merits and demerits of quartile deviations
Merits                                    Demerits
It is easy to understand and to compute.  It is not based on all values.
It is rigidly defined.                    It is affected by sampling fluctuations.
It is not affected by extreme values.     It is not capable of further algebraic treatment.
$$M.D.(\bar{X}) = \frac{\sum |X - \bar{X}| f}{N}$$
In case of continuous series ‘X’ represents mid value of class-interval.
Similarly, we can have mean deviation from median or mode: $\bar{X}$ is replaced
by the median or mode in the above formula. The mean deviation from the
median is the least. This is known as the minimal property of mean deviation.
The corresponding relative measures are coefficient of mean deviation.
$$\text{Coefficient of } M.D.(\bar{X}) = \frac{M.D.(\bar{X})}{\bar{X}}$$
$$\text{Coefficient of } M.D.(\text{Median}) = \frac{M.D.(\text{Median})}{\text{Median}}$$
Solved Problem 28: Calculate mean deviation and also coefficient of mean
deviation using:
i) Mean
ii) Median
Compare the results.
Heights of plants (cms) 140, 147,143,145,144,150,142,141.
Solution: The frequency table for the data of solved problem 28 is
represented in table 4.24.
Table 4.24. Data for the solved problem 28
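$$\bar{X} = \frac{1152}{8} = 144 \text{ cms}$$
$$\text{Mean deviation from mean} = \frac{20}{8} = 2.5$$
$$\text{Coefficient of } M.D.(\bar{X}) = \frac{2.5}{144} = 0.01736$$
$$\text{Median} = \left(\frac{8 + 1}{2}\right)^{\text{th}} \text{ value} = 4.5^{\text{th}} \text{ value}$$
Median = 143 + 0.5(144 – 143) = 143.5 cms
$$\text{Mean deviation from median} = \frac{20}{8} = 2.5$$
$$\text{Coefficient of } M.D.(\text{Median}) = \frac{2.5}{143.5} = 0.01742$$
The mean deviation from median (2.5 cms) does not exceed the mean
deviation from mean (2.5 cms); for this data the two are equal.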
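The two mean deviations can be recomputed directly from the eight plant heights listed in the problem. The sketch below is illustrative only; `mean_deviation` is a hypothetical helper that averages the absolute deviations from whichever centre is supplied.

```python
from statistics import mean, median

# Heights of plants (cms) from Solved Problem 28.
heights = [140, 147, 143, 145, 144, 150, 142, 141]


def mean_deviation(values, centre):
    """Average of the absolute deviations of the values from the given centre."""
    return sum(abs(x - centre) for x in values) / len(values)


md_from_mean = mean_deviation(heights, mean(heights))
md_from_median = mean_deviation(heights, median(heights))

# Relative measures: mean deviation divided by the centre it was taken from.
coeff_from_mean = md_from_mean / mean(heights)
coeff_from_median = md_from_median / median(heights)

print(mean(heights), median(heights), md_from_mean, md_from_median)
```

By the minimal property, `md_from_median` can never exceed `md_from_mean`.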
Solved Problem 29: The data in table 4.25a is the distribution of employees
of a firm according to their efficiency. Find the mean deviation and
coefficient of mean deviation from:
i. Mean
ii. Median
Table 4.25a. Distribution of employees according to their efficiency
$$\bar{X} = 24$$
$$M.D.(\bar{X}) = \frac{160}{65} = 2.46$$
$$\left(\frac{N}{2}\right)^{\text{th}} \text{ value} = \frac{65}{2} = 32.5^{\text{th}} \text{ value}$$
Median class is 22 – 26
$$\text{Median} = 22 + \frac{32.5 - 20}{30} \times 4 = 22 + \frac{12.5 \times 4}{30} = 22 + \frac{50}{30} = 22 + 1.66 = 23.66$$
$$M.D.(\text{Median}) = \frac{168}{65} = 2.58$$
$$\text{Coefficient of } M.D.(\bar{X}) = \frac{2.46}{24} = 0.1025$$
$$\text{Coefficient of } M.D.(\text{from Median}) = \frac{2.58}{23.66} = 0.109$$
Therefore, the mean deviation and coefficient of mean deviation from mean
are 2.46 and 0.1025 respectively. The mean deviation and coefficient of
mean deviation from median are 2.58 and 0.109 respectively. The table
4.26 shows the merits and demerits of mean deviation.
Table 4.26. Merits and demerits of mean deviation
Merits                                              Demerits
It is based on all values.                          It is not capable of further algebraic treatment.
It is less affected by extreme values.              It does not take into account negative signs.
It is not affected much by sampling fluctuations.
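The standard deviation of a set of values is the positive square root of the mean
of the squared deviations of the values from their arithmetic mean. It is
denoted by ‘σ’ (sigma).
For discrete series without frequency it is given by:
$$\text{Variance} = \frac{\sum (X - \bar{X})^2}{N} \quad \ldots (A)$$
$$\sigma = \sqrt{\text{Variance}}$$
For series with frequency it is given by:
$$\text{Variance} = \frac{\sum (X - \bar{X})^2 f}{\sum f} \quad \ldots (B)$$
$$\sigma = \sqrt{\text{Variance}}$$
where ‘X’ is the mid value of the class interval for continuous series. In case of
grouped data, the alternative forms of (A) and (B) are as follows.
For (A)
$$\text{Variance} = \frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2$$
$$\sigma = \sqrt{\text{Variance}}$$
For (B)
$$\text{Variance} = \left[\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{\sum f}\right)^2\right] \times (C.I.)^2$$
$$\sigma = \sqrt{\text{Variance}}$$
Key Statistic
The square of standard deviation is called variance. It is denoted by σ².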
Solved Problem 30: The diastolic blood pressures of men are distributed
as shown in table 4.27a. Find the standard deviation and variance.
Table 4.27a. Distribution of diastolic blood pressures of men
Pressure(men) 78-80 80-82 82-84 84-86 86-88 88-90
No. of Men 3 15 26 23 9 4
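$$\sigma^2 = \left[\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{\sum f}\right)^2\right] \times (C.I.)^2$$
$$\sigma^2 = \left[\frac{122}{80} - \left(\frac{32}{80}\right)^2\right] \times (2)^2 = (1.525 - 0.16) \times 4 = 5.46 \text{ (mm)}^2 = \text{Variance}$$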
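The step-deviation form of the variance can be verified with a short Python sketch using the blood pressure distribution of Table 4.27a. The function name `grouped_variance` is hypothetical, and the assumed mean of 83 (mid value of the class 82-84) is a choice made here for illustration.

```python
from math import sqrt


def grouped_variance(mids, freqs, assumed_mean, width):
    """Variance = [sum(f*d^2)/N - (sum(f*d)/N)^2] * C.I.^2, with d = (X - A) / C.I."""
    n = sum(freqs)
    d = [(x - assumed_mean) / width for x in mids]
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    sum_fd2 = sum(f * di ** 2 for f, di in zip(freqs, d))
    return (sum_fd2 / n - (sum_fd / n) ** 2) * width ** 2


# Mid values of the classes 78-80, 80-82, ..., 88-90 and their frequencies.
mids = [79, 81, 83, 85, 87, 89]
freqs = [3, 15, 26, 23, 9, 4]

variance = grouped_variance(mids, freqs, assumed_mean=83, width=2)
std_dev = sqrt(variance)  # standard deviation is the positive square root
print(round(variance, 2), round(std_dev, 2))
```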
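$$\text{Variance} = \frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)}{n_1 + n_2}; \quad \sigma = \sqrt{\text{Variance}}$$
where $d_1 = \bar{X} - \bar{X}_1$ and $d_2 = \bar{X} - \bar{X}_2$,
$\bar{X}$ being the combined mean of the n1 and n2 values.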
The table 4.28 shows the merits and demerits of standard deviation.
Table 4.28. Merits and demerits of standard deviation
Merits                                                   Demerits
It is rigidly defined.                                   It is difficult to understand.
It is based on all values.                               It gives undue weightage for extreme values.
It is capable of further algebraic treatment.            It cannot be calculated for classes with open end interval.
It is not very much affected by sampling fluctuations.
Solved Problem 31: The average weight of 100 apples from area “A” is
150gms with standard deviation of 10gms. Similarly the average weight of
200 apples from area “B” is 200gms with standard deviation of 15gms. Find
the combined standard deviation.
Solution: Given that:
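Series A
X        d = X - 260    d²
192      -68            4624
288       28             784
236      -24             576
229      -31             961
184      -76            5776
260        0               0
384      124           15376
291       31             961
330       70            4900
243      -17             289
Total    +37           34247
$$\bar{X} = 260 + \frac{37}{10} = 263.7$$
$$\sigma^2 = \frac{34247}{10} - \left(\frac{37}{10}\right)^2 = (58.4)^2$$
$$\sigma = 58.4$$
$$CV\% = \frac{58.4}{263.7} \times 100 = 22.15\%$$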
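Series B
X        X²
31        961
48       2304
13        169
51       2601
38       1444
43       1849
50       2500
36       1296
47       2209
82       6724
Total    439      22057
$$\bar{X} = \frac{439}{10} = 43.9$$
$$\sigma^2 = \frac{22057}{10} - \left(\frac{439}{10}\right)^2 = 2205.7 - (43.9)^2 = 2205.7 - 1927.21 = 278.49$$
$$\sigma = \sqrt{278.49} = 16.68802$$
$$CV\% = \frac{16.69}{43.9} \times 100 \approx 38.02\%$$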
The series A is more stable, since the CV for series A (22.15) is less than
the CV for series B (38.02).
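The comparison of the two series can be reproduced with the standard library. In this sketch the series A values are recovered from the deviations d = X - 260 in the worked table, and `cv_percent` is an illustrative helper that uses the population standard deviation (`statistics.pstdev`), which matches the divide-by-N formula used in this unit.

```python
from statistics import mean, pstdev

# Series A: values recovered as 260 + d from the worked table.
series_a = [192, 288, 236, 229, 184, 260, 384, 291, 330, 243]
# Series B: values as tabulated.
series_b = [31, 48, 13, 51, 38, 43, 50, 36, 47, 82]


def cv_percent(values):
    """Coefficient of variation in percent: (S.D. / Mean) * 100."""
    return pstdev(values) / mean(values) * 100


cv_a = cv_percent(series_a)
cv_b = cv_percent(series_b)

# The series with the smaller coefficient of variation is the more stable one.
more_stable = "A" if cv_a < cv_b else "B"
print(round(cv_a, 2), round(cv_b, 2), more_stable)
```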
iv. C.V % can be used to compare the variability of two sets of data
measuring the same characteristics.
4.13 Summary
The measures of central tendency and measures of dispersion summarise
mass data in terms of its two important features.
i. With respect to nature of data to cluster around a central value
ii. With respect to their spread from their central value
Arithmetic mean is defined as the sum of all values divided by number of
values.
Median of a set of values is the middle most value when the values are
arranged in the ascending order of magnitude.
Mode is the value which has the highest frequency.
The measures of variations are:
i. Range (R)
ii. Quartile Deviations ( Q.D)
iii. Mean Deviations (M.D)
iv. Standard Deviations (S.D)
Coefficient of variation is a relative measure expressed in percentage and is
defined as:
$$CV \text{ in } \% = \frac{S.D.}{\text{Mean}} \times 100$$
4.14 Terminal Questions
1. In an office there are 84 employees. Their salaries in Indian rupees are
as given in table 4.30. Find the mean salary per day.
Table 4.30. Salaries of 84 employees
Expenditure (Rs.)    10 - 20   20 - 30   30 - 40   40 - 50   50 - 60   60 - 70   70 - 80
No. of Smokers       23        44        35        12        9         3         2
3. The average price/kg of Grade “A” tea is Rs.120 and that of grade “B”
tea is Rs.100. A trader mixes them and sells the mixture for Rs.115.
Find proportion of grade A and grade B in the mixture.
4. For the distribution shown in table 4.32, find the median and mode.
Table 4.32. Distribution data for terminal question 4
% Marks 0 - 10 10 - 20 20 - 30 30 - 40 40 - 50 50 - 60 60 - 70
No. of 4 9 19 20 18 7 80
Smokers
7. Find the quartile deviation and the coefficient of quartile deviation for
the data shown in table 4.35.
Table 4.35. Distribution data for terminal question 7
8. Given sum of upper and lower quartiles as 122 and their difference as
23; find the quartile deviation of the series.
9. If C.V% = 22 and S.D = 4. Find the mean.
10. The table 4.36 shows the distribution of age at the time of first delivery
of 65 women. Find mean deviation from mean and median.
Table 4.36. Distribution of age at the time of first delivery of 65 women
Age 18 – 22 22 – 26 26 – 30 30 – 34 34 – 38
Frequency 20 30 11 3 1
11. Read the data given below and find the combined mean, S.D and
coefficient of variation.
n1 = 15          n2 = 20
$\bar{X}_1$ = 40     $\bar{X}_2$ = 50
σ1 = 3           σ2 = 5
12. Mean and Standard deviation of lengths of tails of 8 rats were found to
be 4.7 cm and 0.8 cm respectively. However, one reading was taken as
3.6 cm instead of 6.3 cm; find the corrected mean and standard
deviation.
4. 34
5. 116.7 cm
6. 123.33
7. Q.D = 11.07
Coefficient of Q.D = 0.338
8. 11.5
9. 18.18
10. 2.462
11. Combined Mean = 45.7
Combined S.D = 6.53,
C.V in % = 14.29
12. Corrected Mean = 5.0375 cm
Corrected S.D = 0.8336 cm
4.16 References
B. L. Agarwal, (2006) Basic Statistics, Fourth Edition, New Age
International Publishers
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
Unit 5 Probabilities
Structure:
5.1 Introduction
Learning objectives
Definition of probability
Basic terminology used in probability theory
Approaches to probability
5.2 Rules of Probability
Addition rule
Multiplication rule
5.3 Conditional Probability
5.4 Steps Involved in Solving Problems on Probability
5.5 Bayes’ Probability
5.6 Random Variables
5.7 Summary
5.8 Terminal Questions
5.9 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
5.10 References
5.1 Introduction
In the unit 4, ‘Measures used to Summarise Data’, you have studied about
the measures of central tendency and measures of dispersion. In this unit 5,
‘Probabilities’, you will study about the ways of minimising the uncertainty
involved in our day to day lives by using probability theory.
Every human activity has an element of uncertainty. Uncertainty affects the
decision making process. In your daily lives, you very often use the word
‘probably’, like, probably it may rain today; probably the share price may go
up in the next week. Therefore, there is a need to handle uncertainty
systematically and scientifically.
Mathematicians and statisticians developed a separate field of mathematics
and named it as ‘Probability Theory’. The theory of probability helps us to
make wiser decisions by reducing the degree of uncertainty.
Key Statistic
The probability of event A [denoted P(A)], must lie within the interval
from 0 to 1.
Random Experiment
When the outcome of an experiment cannot be predicted, then it is called
random experiment or stochastic experiment
Sample Space
The sample space of a random experiment is the set of all its
possible outcomes and is denoted by ‘S’.
Example 1
In tossing of coins, the outcomes are head and tail. The head is denoted
as ‘H’ and the tail as ‘T’. In tossing two coins, the sample space ‘S’ is
given by:
$$S = \{HH, HT, TH, TT\}$$
The number of outcomes is denoted by n(S). Here,
$$n(S) = 4$$
Key Statistic
If the number of outcomes is finite then it is called a finite sample space,
otherwise it is called an infinite sample space.
Event
Events may be a single outcome or combination of outcomes. Event is a
subset of sample space.
Example 2
In tossing a coin, getting a head (event A) is a single outcome. Therefore,
$$P(A) = \frac{1}{2}$$
In tossing two fair coins, for getting exactly one head (event A) the possible
combinations of outcomes are HT and TH. The sample space consists of
HH, HT, TH, and TT. Therefore,
$$P(A) = \frac{2}{4} = \frac{1}{2}$$
Example 3
In tossing an unbiased coin, getting head and tail are equally likely.
Example 4
In tossing a coin, if head falls, it prevents the occurrence of tail and vice
versa.
Independent events
Two events are said to be independent of each other if the occurrence of
one is not affected by the occurrence of other or does not affect the
occurrence of the other.
Example 5
Consider tossing of three fair coins as shown in figure 5.2. Then,
S = { HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}
Let:
A be the event of getting three heads
B be the event of getting two heads
C be the event of getting one head
D be the event of not getting a head
$$P(A) = \lim_{n \to \infty} \frac{m}{n}$$
Example 6
A sales manager may like to know the probability that he will exceed the
target for product A or product B. Sometimes, he would like to know the
probability that the sales of product A and B will exceed the target. The
first type of probability is answered by addition rule. The second type of
probability is answered by multiplication rule.
Fig. 5.4: AB for two dependent events A and B
Fig. 5.5: AB for two independent events A and B
Fig. 5.6: ABC for three dependent events A, B and C
iv) If A1, A2, A3, …, An are ‘n’ mutually exclusive and exhaustive events
then the probability of occurrence of at least one of them is given by:
$$P(A_1 \cup A_2 \cup \dots \cup A_n) = P(A_1) + P(A_2) + \dots + P(A_n)$$
5.2.2 Multiplication rule
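If ‘A’ and ‘B’ are two independent events then the probability of occurrence
of ‘A’ and ‘B’ is given by:
$$P(A \cap B) = P(A) \cdot P(B)$$
The conditional probability of ‘B’ given ‘A’ is given by:
$$P(B/A) = \frac{P(A \cap B)}{P(A)}$$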
For any bivariate distribution, there exist two marginal distributions and
‘m + n’ conditional distributions, where ‘m’ and ‘n’ are the number of
classifications/characteristics studied on two variables.
Example 7
Consider the example of a librarian who analysed the type of visitors and
their choice of library section. The data is represented in table 5.1a.
Table 5.1a: Bivariate distribution
iii) The table 5.1d represents the distribution of people in sections given
that they are under graduate. Therefore, it is a conditional
distribution.
Level of education    News paper    Magazine    Novels    Subjects    Total
Under graduate        50            100         120       50          320
Thus for any bivariate distribution having ‘m’ and ‘n’ classifications there
exist two marginal distributions and ‘m + n’ conditional distributions. In
this case there are 3 + 4 = 7 conditional distributions.
Solved Problem 1: Calculate nCr for the following values of ‘n’ and ‘r’:
i. n = 10 and r = 2
ii. n = 16 and r = 3
Solution:
$${}^{10}C_2 = \frac{10 \times 9}{1 \times 2} = 45$$
$${}^{16}C_3 = \frac{16 \times 15 \times 14}{1 \times 2 \times 3} = 560$$
The value of 16C3 is 560.
Key Statistic
nCr = nCn-r
nC0 = nCn = 1
0! = 1
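These combination values and identities can be checked in Python. The sketch below computes nCr from the factorial definition (the helper name `n_c_r` is illustrative) and compares it with the standard library function `math.comb`, available from Python 3.8.

```python
from math import comb, factorial


def n_c_r(n, r):
    """nCr = n! / (r! * (n - r)!), computed from the factorial definition."""
    return factorial(n) // (factorial(r) * factorial(n - r))


print(n_c_r(10, 2), n_c_r(16, 3))

# Symmetry: choosing r items is the same as leaving out n - r items.
print(comb(16, 3) == comb(16, 16 - 3))

# Boundary cases: there is exactly one way to choose nothing or everything.
print(n_c_r(16, 0), n_c_r(16, 16), factorial(0))
```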
$$S = \{H, T\}, \quad n(S) = 2$$
$$A = \{H\}, \quad n(A) = 1$$
$$P(A) = \frac{n(A)}{n(S)} = \frac{1}{2}$$
Therefore, the probability of getting a head when a coin is tossed is 0.5.
Solved Problem 4: What is the probability of getting two heads when 3
coins are tossed and what is the probability of getting at least two heads?
Solution:
i) Let ‘A’ be the event of getting two heads.
$$S = \{HHH, HHT, HTH, THH, TTH, THT, HTT, TTT\}, \quad n(S) = 8$$
$$A = \{HHT, HTH, THH\}, \quad n(A) = 3$$
$$P(A) = \frac{n(A)}{n(S)} = \frac{3}{8}$$
Therefore, the probability of getting two heads when three coins are
tossed is 3/8.
ii) Let ‘A’ be the event of getting at least two heads.
$$A = \{HHH, HHT, HTH, THH\}, \quad n(A) = 4$$
$$P(A) = \frac{4}{8} = \frac{1}{2}$$
Therefore, the probability of getting at least two heads when three coins are
tossed is 1/2.
Solved Problem 5: What is the probability of getting a sum of ‘Nine’ when
two dice are thrown?
Solution: Let ‘A’ be the event of getting a sum of ‘Nine’.
$$n(S) = 6^2 = 36$$
$$A = \{(6,3), (3,6), (4,5), (5,4)\}, \quad n(A) = 4$$
$$P(A) = \frac{4}{36} = \frac{1}{9}$$
Therefore, the probability of getting a sum of ‘Nine’ when two dice are
thrown is 1/9.
Solved Problem 6: What is the probability of getting at least a sum of ‘nine’
when two dice are thrown?
Solution:
Let ‘A’ be the event of getting at least a sum of nine.
$$n(S) = 6^2 = 36$$
A is the combination of the mutually exclusive events of getting a sum of 9
or 10 or 11 or 12.
$$A = \{(6,3), (3,6), (5,4), (4,5), (6,4), (4,6), (5,5), (6,5), (5,6), (6,6)\}, \quad n(A) = 10$$
$$P(A) = \frac{10}{36} = \frac{5}{18}$$
Therefore, the probability of getting at least a sum of ‘nine’ when two dice
are thrown is 5/18.
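Both dice results can be confirmed by listing the sample space and counting the favourable outcomes. The sketch below does this with `itertools.product`; the helper name `probability` is an assumption for illustration, and exact fractions are used so the answers appear in the same form as in the solutions.

```python
from fractions import Fraction
from itertools import product

# Sample space for two dice: all ordered pairs (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))


def probability(event):
    """Classical probability: favourable outcomes divided by total outcomes."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))


p_sum_nine = probability(lambda outcome: sum(outcome) == 9)
p_at_least_nine = probability(lambda outcome: sum(outcome) >= 9)
print(len(sample_space), p_sum_nine, p_at_least_nine)
```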
Solved Problem 7: A number is selected at random from the numbers 1 to
30. What is the probability that:
i. It is divisible by either 3 or 7
ii. It is divisible by 5 or 13
Solution:
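i) Let ‘A’ be the event of selecting a number divisible by 3. Let ‘B’ be the
event of selecting a number divisible by 7.
$$n(S) = {}^{30}C_1 = 30$$
$$A = \{3, 6, 9, 12, 15, 18, 21, 24, 27, 30\}, \quad n(A) = 10$$
$$B = \{7, 14, 21, 28\}, \quad n(B) = 4$$
$$A \cap B = \{21\}, \quad n(A \cap B) = 1$$
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{10}{30} + \frac{4}{30} - \frac{1}{30} = \frac{13}{30}$$
ii) Let ‘A’ be the event of selecting a number divisible by 5. Let ‘B’ be the
event of selecting a number divisible by 13.
$$n(S) = {}^{30}C_1 = 30$$
$$A = \{5, 10, 15, 20, 25, 30\}, \quad n(A) = 6$$
$$B = \{13, 26\}, \quad n(B) = 2$$
$$P(A \cup B) = P(A) + P(B) = \frac{6}{30} + \frac{2}{30} = \frac{8}{30} = \frac{4}{15}$$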
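$$n(S) = {}^{15}C_5 = \frac{15 \times 14 \times 13 \times 12 \times 11}{1 \times 2 \times 3 \times 4 \times 5} = 3003$$
$$n(A) = {}^{5}C_2 \times {}^{4}C_1 \times {}^{6}C_2 = \frac{5 \times 4}{1 \times 2} \times 4 \times \frac{6 \times 5}{1 \times 2} = 10 \times 4 \times 15 = 600$$
$$P(A) = \frac{600}{3003}$$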
Therefore, the probability that the committee will contain 2 scientists,
1 engineer and 2 accountants is 600/3003.
$$\frac{3}{8} + \frac{2}{5} - \frac{3}{8} \times \frac{2}{5} = \frac{15 + 16 - 6}{40} = \frac{25}{40} = \frac{5}{8}$$
Therefore, the probability that at least one of the persons hit the target is
5/8.
Solved Problem 10: The probabilities that drivers A, B and C will drive
home safely after consuming liquor are 2/5, 3/7 and 3/4, respectively. What
is the probability that all of them will drive home safely after consuming liquor?
Solution: Let ‘A’ be the event of driver ‘A’ driving safely after consuming
liquor. Let ‘B’ be the event of driver ‘B’ driving safely after consuming liquor.
Let ‘C’ be the event of driver ‘C’ driving safely after consuming liquor.
$$\text{Given } P(A) = \frac{2}{5}, \quad P(B) = \frac{3}{7}, \quad P(C) = \frac{3}{4}$$
The events A, B, and C are independent. Therefore,
$$P(A \cap B \cap C) = P(A) \cdot P(B) \cdot P(C) = \frac{2}{5} \times \frac{3}{7} \times \frac{3}{4} = \frac{9}{70}$$
Therefore, the probability that all the drivers will drive home safely after
consuming liquor is 9/70.
Solved Problem 11: The probabilities that ‘A’ and ‘B’ will tell the truth are
2/3 and 4/5 respectively. What is the probability that:
i) They agree with each other
ii) They contradict each other while giving a testimony in the court
Solution:
i) Let ‘A’ be the event of A telling truth. Let ‘B’ be the event of B telling
truth.
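$$\text{Given } P(A) = \frac{2}{3}, \quad P(A^c) = 1 - P(A) = \frac{1}{3}$$
$$P(B) = \frac{4}{5}, \quad P(B^c) = \frac{1}{5}$$
Both will agree if they both tell the truth or both lie, that is,
$$(A \cap B) \text{ or } (A^c \cap B^c)$$
They are mutually exclusive. Therefore,
$$P[(A \cap B) \cup (A^c \cap B^c)] = P(A \cap B) + P(A^c \cap B^c) = P(A) P(B) + P(A^c) P(B^c)$$
$$= \frac{2}{3} \times \frac{4}{5} + \frac{1}{3} \times \frac{1}{5} = \frac{9}{15} = \frac{3}{5}$$
since the events A and B are independent.
Therefore, the probability that both A and B agree with each other is 3/5.
ii) They will contradict if A tells the truth and B lies, or B tells the truth and A
lies, that is,
$$(A \cap B^c) \text{ or } (A^c \cap B)$$
Since they are mutually exclusive,
$$P[(A \cap B^c) \cup (A^c \cap B)] = P(A \cap B^c) + P(A^c \cap B) = P(A) P(B^c) + P(A^c) P(B)$$
$$= \frac{2}{3} \times \frac{1}{5} + \frac{1}{3} \times \frac{4}{5} = \frac{6}{15} = \frac{2}{5}$$
since they are independent.
Therefore, the probability that A and B contradict each other is 2/5.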
Solved Problem 12: A box contains five red and four blue similar shaped
balls. Two balls are drawn at random from the box. Find the probability that
both of them are red if:
i. the balls are drawn together
ii. the balls are drawn one after the other, with replacement
iii. the balls are drawn one after the other, without replacement
Solution:
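i) Let ‘A’ be the event of drawing two red balls together.
$$n(S) = {}^{9}C_2 = \frac{9 \times 8}{1 \times 2} = 36$$
$$n(A) = {}^{5}C_2 = \frac{5 \times 4}{1 \times 2} = 10$$
$$P(A) = \frac{10}{36} = \frac{5}{18}$$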
Therefore, the probability that both of them are red if the balls are drawn
together is 5/18.
ii) Let ‘A’ be the event of drawing a red ball in the first draw. Let ‘B’ be the
event of drawing a red ball in the second draw. The required probability
is given by:
$$P(A \cap B) = P(A) \cdot P(B) = \frac{5}{9} \times \frac{5}{9} = \frac{25}{81}$$
Therefore, the probability that both of them are red if the balls are drawn
one after the other, with replacement, is 25/81.
iii) Let ‘A’ be the event of drawing red ball in the first draw. Let ‘B’ be the
event of drawing red ball in the second draw. Since the first ball is not
replaced, the sample space changes for second draw. Therefore the
required probability is given by:
$P(A \cap B) = P(A) \cdot P(B/A) = \frac{5}{9} \cdot \frac{4}{8} = \frac{5}{18}$
Therefore, the probability that both of them are red if the balls are drawn
one after the other, without replacement, is 5/18.
Solved Problem 13: Box I contains 5 red and 6 blue balls. Box II contains 6
red and 4 blue balls. A ball is drawn at random from box I and is transferred
to box II. Now from box II a ball is drawn at random. What is the probability
that it is red?
Solution: A ball drawn from box I and transferred to box II could be either
red or blue. Let ‘A’ be the event of drawing a red ball from box I. Let ‘B’ be
the event of drawing a blue ball from box I. Let ‘C’ be the event of drawing
red ball from box II.
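The required events are $(A \cap C)$ or $(B \cap C)$.
$P(C) = P(A) \cdot P(C/A) + P(B) \cdot P(C/B) = \frac{5}{11} \cdot \frac{7}{11} + \frac{6}{11} \cdot \frac{6}{11} = \frac{35}{121} + \frac{36}{121} = \frac{71}{121}$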
Therefore, the required probability is 71/121.
Solved Problem 14: The probabilities that component A and component B
of a machine will fail are 0.09 and 0.06 respectively. The machine will fail if
any one of them fails. Find the probability that it will fail?
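The event ‘B’ is made up of four mutually exclusive and exhaustive events:
$B = (A_1 \cap B) \cup (A_2 \cap B) \cup (A_3 \cap B) \cup (A_4 \cap B)$
$P(B) = \sum_{i} P(A_i \cap B)$ ..... (1) [by using the Law of Marginal Probability]
We know that:
$P(A_1 \cap B) = P(A_1) \cdot P(B/A_1)$ ..... (2) [by the Law of Conditional
Probability for dependent events]
$P(A_1/B) = \frac{P(A_1 \cap B)}{P(B)} = \frac{P(A_1) \cdot P(B/A_1)}{\sum_{i} P(A_i) \cdot P(B/A_i)}$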
In general, Bayes’ theorem states that if A1, A2, ..., An are ‘n’
mutually exclusive and exhaustive events and ‘B’ is an event common to all
of them, then:
$P(A_1/B) = \frac{P(A_1) \cdot P(B/A_1)}{\sum_{i=1}^{n} P(A_i) \cdot P(B/A_i)}$
2. It is possible to incorporate latest information.    2. It is not possible to do so.
3. It is possible to incorporate cost aspects.          3. It is not possible in this case.
.
and i
i
i
The required probabilities are calculated and represented in the table 5.3.
Table 5.3: Required probabilities for the data in solved problem 16
Event   Prior         Conditional    Joint            Posterior
Ai      Probability   Probability    Probability      Probability
        P(Ai)         P(B/Ai)        P(Ai ∩ B)        P(Ai/B)
A1      0.40          0.10           0.0400           0.0400/0.1425 = 0.2807
A2      0.35          0.15           0.0525           0.0525/0.1425 = 0.3684
A3      0.25          0.20           0.0500           0.0500/0.1425 = 0.3509
Total   1.00                         P(B) = 0.1425    1.0000
Solved Problem 17: A factory has three machines M1, M2 and M3. They
produce 4,000, 10,000 and 6,000 products per day respectively. From past
records, it is known that M1, M2 and M3 produce 5%, 4% and 8% defectives.
A product is selected at random from the day’s production and is found to
be defective. What is the probability that it was not produced by machine M3?
Solution: Let us have the following:
Let ‘A1’ be the event that the product was produced by M1
Let ‘A2’ be the event that the product was produced by M2
Let ‘A3’ be the event that the product was produced by M3
Let ‘B’ be the event that the product is defective.
Then we are given:
$P(A_1) = \frac{4000}{20000} = 0.20$
$P(A_2) = \frac{10000}{20000} = 0.50$
$P(A_3) = \frac{6000}{20000} = 0.30$
P(B/A1) = 0.05, P(B/A2) = 0.04, P(B/A3) = 0.08
The above information is represented in table 5.4.
Table 5.4: Required probabilities for the data in solved problem 17
Event   Prior         Conditional    Joint            Posterior
Ai      Probability   Probability    Probability      Probability
        P(Ai)         P(B/Ai)        P(Ai ∩ B)        P(Ai/B)
A1      0.2           0.05           0.010            0.010/0.054 = 0.1852
A2      0.5           0.04           0.020            0.020/0.054 = 0.3704
A3      0.3           0.08           0.024            0.024/0.054 = 0.4444
Total   1.00                         P(B) = 0.054     1.0000
Required probability = $1 - P(A_3/B)$ = 1 – 0.4444 = 0.5556
Hence, the required probability is 0.5556.
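The prior, joint and posterior columns of a Bayes table can be checked with a few lines of Python. The sketch below is illustrative only: the function name `bayes_table` and the variable names are assumptions, and the figures are the ones given for the three machines in solved problem 17.

```python
def bayes_table(priors, likelihoods):
    """Return (joint, evidence, posterior) for mutually exclusive,
    exhaustive events A_i and a common event B."""
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(A_i and B)
    evidence = sum(joint)                                  # P(B)
    posterior = [j / evidence for j in joint]              # P(A_i / B)
    return joint, evidence, posterior


# Daily output of M1, M2, M3 and their defective rates (solved problem 17)
outputs = [4000, 10000, 6000]
defect_rates = [0.05, 0.04, 0.08]

total = sum(outputs)
priors = [n / total for n in outputs]                      # P(A_i)
joint, p_defective, posterior = bayes_table(priors, defect_rates)

# Probability that a defective product was NOT produced by M3
p_not_m3 = 1 - posterior[2]

print(round(p_defective, 3))
print([round(p, 4) for p in posterior])
print(round(p_not_m3, 4))
```

The same function reproduces any table of this kind once the priors and conditional probabilities are supplied.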
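No. of Heads (Xi)    P(Xi)
3                    1/8
2                    3/8
1                    3/8
0                    1/8
Total                1
$Var(X) = E[X - E(X)]^2 = E(X^2) - [E(X)]^2 = \sum X_i^2 P(X_i) - \left[\sum X_i P(X_i)\right]^2$
Where, $E(X^2) = \sum X_i^2 P(X_i)$
Its standard deviation is:
$S.D = \sqrt{Var(X)} = \sqrt{E(X^2) - [E(X)]^2}$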
Solved Problem 18: A random variable takes the values -3, -2, 1, 0, 4, 6
with probabilities 1/12, 2/12, 3/12, 4/12, 1/12, 1/12 respectively. Find its
mean or expected value and variance?
Solution: The table 5.6 represents the values required to calculate
expectation and variance for the data in solved problem 18.
Table 5.6: Required values for calculating mean and variance for the data in
solved problem 18
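$E(X) = \sum X_i P(X_i) = \frac{6}{12} = \frac{1}{2}$
$Var(X) = E(X^2) - [E(X)]^2 = 6 - \frac{1}{4} = \frac{23}{4}$
$S.D = \sqrt{\frac{23}{4}} \approx 2.4$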
Hence, the mean, variance and standard deviation are 0.5, 5.75 and 2.4.
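The expectation and variance in solved problem 18 can be verified with exact fractions. This is a minimal sketch; the variable names are assumptions, and the values and probabilities are those stated in the problem.

```python
from fractions import Fraction
from math import sqrt

# Values of the random variable and their probabilities (solved problem 18)
values = [-3, -2, 1, 0, 4, 6]
probs = [Fraction(k, 12) for k in (1, 2, 3, 4, 1, 1)]

# A valid probability distribution must sum to 1
assert sum(probs) == 1

mean = sum(x * p for x, p in zip(values, probs))               # E(X)
second_moment = sum(x * x * p for x, p in zip(values, probs))  # E(X^2)
variance = second_moment - mean ** 2                           # E(X^2) - [E(X)]^2
std_dev = sqrt(variance)

print(mean, second_moment, variance, round(std_dev, 1))
```

Working with `Fraction` keeps 1/2 and 23/4 exact, so rounding enters only when the standard deviation is taken.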
Solved Problem 19: Mr. Arun and Mr. Bandari play a game. If Mr. Arun
picks up an even number from 1 to 6, Mr. Bandari will pay him double the
amount equal to picked up number. If Mr. Arun picks up an odd number then
he has to pay amount equal to double the picked up number. What is Mr.
Arun’s expectation?
Solution: Let Xi be the random variable and P(Xi) be its probability. The
probabilities are indicated in table 5.7.
Table 5.7. Required values for calculating mean and variance for the data in
solved problem 19
Solved Problem 20: The table 5.8 displays the distribution of random
variable X. Find the following probabilities:
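i) $P(X_i \ge 3)$
ii) $P(X_i = 0)$
iii) $P(1 \le X_i \le 3)$
iv) $P(X_i \ge 4)$
Xi       -3    -2    0     1     2     3     4    5
P(Xi)    K     2K    2K    3K    3K    2K    K    K
Since the probabilities must add up to 1:
K + 2K + 2K + 3K + 3K + 2K + K + K = 1
15K = 1, so K = 1/15
i) $P(X_i \ge 3) = P(X_i = 3) + P(X_i = 4) + P(X_i = 5) = 2K + K + K = 4K = \frac{4}{15}$
ii) $P(X_i = 0) = 2K = \frac{2}{15}$
iii) $P(1 \le X_i \le 3) = P(X_i = 1) + P(X_i = 2) + P(X_i = 3) = 3K + 3K + 2K = 8K = \frac{8}{15}$
iv) $P(X_i \ge 4) = P(X_i = 4) + P(X_i = 5) = K + K = 2K = \frac{2}{15}$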
5.7 Summary
Probability plays an important role in decision making process. The basic
definitions and approaches were explained with examples. The real life
situations where you can apply different rules of probability are also
explained with examples.
10. A recently developed car has two important components A and B. The
probability of failure of A and B are 0.2 and 0.1. What is the probability
that the car will fail?
11. The probability that a football player will play on ordinary ground is 0.6
and on green turf is 0.4. The probability that he will get a knee injury
when playing on ordinary ground is 0.07 and on green turf is 0.04.
Given that he got a knee injury, what is the probability that it was due
to play on ordinary ground?
12. Find the E(X) and Var(X) for the distribution of a random variable, X
represented in table 5.9.
Table 5.9: Distribution of a random variable
Xi -3 -1 1 2 4 6 8
P(Xi) K K 2K 3K 2K 2K K
5. 3/4
6. 5/8
7. 2/3
8. 544/625
9. 0.92
10. 0.28
11. 21/29
12. E(X) = 8/3, Var(X) = 80/9
5.10 References
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
6.1 Introduction
In Unit 5, ‘Probabilities’, we studied the basic concepts of probability
theory and applied the rules of probability to problems drawn from real life
situations. We ended that unit with the concept of random variables. In this
unit, ‘Theoretical Distributions’, we will discuss the probability
distributions of random variables, both discrete and continuous. Before
studying this unit, you should revise the concept of random variables
covered in the previous unit.
Individuals and companies generate data that often resemble certain
theoretical distributions. Many characteristics of these theoretical
distributions have been derived mathematically, and we can use them for a
quick analysis of the observed distributions.
The examples of observed distributions are:
i. Number of male children in a family
ii. Number of defectives produced per production run
iii. Number of employees drawing salary in some brackets
The theoretical distributions are formed under certain assumptions. They
are classified into two types:
i. Discrete probability distributions
ii. Continuous probability distributions
The figure 6.1 shows the two groups of theoretical distributions.
Example 1
When a fair coin is tossed as shown in figure 6.2, the outcome is
either head or tail. The variable ‘X’ assumes ‘1’ or ‘0’.
Key Statistic
The mean and variance of a Bernoulli distribution are ‘p’ and ‘pq’
respectively.
Key Statistic
The binomial probability distribution is given by:
$(q + p)^n = q^n + {}^{n}C_{1}\, q^{n-1} p^{1} + {}^{n}C_{2}\, q^{n-2} p^{2} + \dots + p^n$
where the successive terms of the expansion give the probability of 0,
1, 2, ..., n successes.
The mean and variance of the distribution are ‘np’ and ‘npq’ respectively,
where, ‘n’ and ‘p’ are its parameters. This distribution is a unimodal
distribution. For fixed ‘n’ or ‘p’, as ‘p’ or ‘n’ increases, the distribution shifts
from left to right.
Key Statistic
The mean and variance of a binomial distribution are ‘np’ and ‘npq’
respectively, where, ‘n’ and ‘p’ are its parameters.
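The binomial distribution is $(q + p)^n = \left(\frac{1}{2} + \frac{1}{2}\right)^6$
i) The probability that the tosses will result in exactly two heads is given
by:
$P(X = 2) = {}^{6}C_{2} \left(\frac{1}{2}\right)^{6-2} \left(\frac{1}{2}\right)^{2} = \frac{6 \times 5}{1 \times 2} \times \frac{1}{2^4} \times \frac{1}{2^2} = \frac{15}{64}$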
Therefore, the probability that the tosses will result in exactly two
heads is 15/64.
ii) The probability that the tosses will result in at least five heads is given
by:
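$P(X \ge 5) = P(X = 5) + P(X = 6) = {}^{6}C_{5} \left(\frac{1}{2}\right)^{6-5} \left(\frac{1}{2}\right)^{5} + {}^{6}C_{6} \left(\frac{1}{2}\right)^{6}$
$= 6 \left(\frac{1}{2}\right)^{6} + \left(\frac{1}{2}\right)^{6} = \frac{7}{64}$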
Therefore, the probability that the tosses will result in at least five
heads is 7/64.
iii) The probability that the tosses will result in at most two heads is given
by:
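$P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2)$
$= \left(\frac{1}{2}\right)^{6} + {}^{6}C_{1} \left(\frac{1}{2}\right)^{6-1} \left(\frac{1}{2}\right)^{1} + {}^{6}C_{2} \left(\frac{1}{2}\right)^{6-2} \left(\frac{1}{2}\right)^{2}$
$= \frac{1}{64} + \frac{6}{64} + \frac{15}{64} = \frac{22}{64} = \frac{11}{32}$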
Therefore, the probability that the tosses will result in at most two
heads is 11/32.
iv) The probability that the tosses will result in not greater than one head
is given by:
$P(X \le 1) = P(X = 0) + P(X = 1) = \frac{1}{64} + \frac{6}{64} = \frac{7}{64}$
Therefore, the probability that the tosses will result in not greater
than one head is 7/64.
v) The probability that the tosses will result in at least one head is
given by:
$P(X \ge 1) = 1 - P(X = 0) = 1 - \left(\frac{1}{2}\right)^{6} = 1 - \frac{1}{64} = \frac{63}{64}$
Therefore, the probability that the tosses will result in at least one head
is 63/64.
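The five results above can be cross-checked by evaluating the binomial terms directly. The sketch below assumes six tosses of a fair coin, as the expansion (1/2 + 1/2)^6 indicates; the helper name `binomial_pmf` and the variable names are illustrative.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 6, Fraction(1, 2)
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

exactly_two = pmf[2]               # i)   P(X = 2)
at_least_five = pmf[5] + pmf[6]    # ii)  P(X >= 5)
at_most_two = sum(pmf[:3])         # iii) P(X <= 2)
at_most_one = sum(pmf[:2])         # iv)  P(X <= 1)
at_least_one = 1 - pmf[0]          # v)   P(X >= 1)

# The mean of the distribution should equal np
mean = sum(x * pmf[x] for x in range(n + 1))

print(exactly_two, at_least_five, at_most_two, at_most_one, at_least_one, mean)
```

Changing `n` and `p` adapts the same check to the other binomial problems in this unit.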
The graph shown in figure 6.3 illustrates the binomial distribution obtained
for different values of ‘x’.
Solution: Let ‘A’ be the event of employee contracting the disease. Given
that:
$p = 0.2$
$q = 1 - 0.2 = 0.8$
$n = 5$
The binomial distribution is $(q + p)^n = (0.8 + 0.2)^5$
i) The probability that none of the employees get the disease is given by:
$P(X = 0) = (0.8)^5 = 0.3277$
Therefore, the probability that none of the employees get the disease
is 0.3277.
ii) The probability that exactly two employees will get the disease is given
by:
$P(X = 2) = {}^{5}C_{2}\,(0.8)^3 (0.2)^2 = 10 \times 0.512 \times 0.04 = 0.2048$
Therefore, the probability that exactly two employees will get the
disease is 0.2048.
iii) The probability that more than four employees will get the disease is
given by:
$P(X > 4) = P(X = 5) = (0.2)^5 = 0.00032$
Therefore, the probability that more than four employees will get the
disease is 0.00032.
Solved Problem 3: The probability that a bomb dropped on a bridge will hit
it is 0.5. Eight bombs are dropped on the bridge. The bridge will be
destroyed if two or more bombs hit it. Find the probability that:
i) All bombs hit the bridge
ii) The bridge is destroyed
Solution: Let the probability that the bomb will hit the bridge be p. Given
that:
$p = 0.5$ and $n = 8$
$q = 1 - 0.5 = 0.5$
i) The probability that all the bombs hit the bridge is given by:
$P(X = 8) = (0.5)^8 = \left(\frac{1}{2}\right)^{8} = \frac{1}{256}$
Therefore, the probability that all the bombs hit the bridge is 1/256.
ii) Bridge is destroyed if two or more bombs fall on it. The required
probability is given by:
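$P(X \ge 2) = 1 - [P(X = 0) + P(X = 1)]$
$= 1 - \left[\left(\frac{1}{2}\right)^{8} + {}^{8}C_{1} \left(\frac{1}{2}\right)^{8}\right] = 1 - \left[\frac{1}{256} + \frac{8}{256}\right] = \frac{247}{256}$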
Therefore, the probability that the bridge is destroyed is 247/256.
Solved Problem 4: An engineering graduate student randomly guesses at
eight multiple-choice questions. There are four possible answers for every
question. However, there is only one correct answer. Assuming that all
questions are independent to each other, find the probability that the student
guesses five correct answers.
Solution: From the data given in the solved problem 4, we can say that the
experiment is a binomial experiment because of the following reasons.
There are fixed number of events or trials (8 questions)
Probability of success in case of each question (probability of guessing
correct answer) is 0.25.
It is given that the trials (questions) are independent to each other.
There are only two possible outcomes on each question (guessing
correct answer or guessing incorrect answer).
Let X denote the number of correct guesses.
Then X is a binomial random variable with,
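$n = 8, \quad p = 0.25, \quad q = 0.75, \quad x = 5$
$P(X = 5) = {}^{8}C_{5}\,(0.25)^5 (0.75)^3 = 56 \times (0.25)^5 \times (0.75)^3 = 0.0231$
So, the probability that the graduate student guesses five correct answers is
0.0231.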
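$p = \frac{2}{5}$ and $N = 625$ (the number of packets)
$q = 1 - \frac{2}{5} = \frac{3}{5}$
The binomial distribution is $(q + p)^n = \left(\frac{3}{5} + \frac{2}{5}\right)^5$
$P(X = 1) = {}^{5}C_{1} \left(\frac{3}{5}\right)^{5-1} \left(\frac{2}{5}\right)^{1} = 5 \times \frac{81}{625} \times \frac{2}{5} = \frac{162}{625}$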
The expected number of packets to contain exactly one leaking sachet is
given by:
$625 \times P(X = 1) = 625 \times \frac{162}{625} = 162$
Hence, the expected number of packets to contain exactly one leaking
sachet is 162.
$P(x < 4) = {}^{5}C_{0}(0.8)^5(0.2)^0 + {}^{5}C_{1}(0.8)^4(0.2)^1 + {}^{5}C_{2}(0.8)^3(0.2)^2 + {}^{5}C_{3}(0.8)^2(0.2)^3$
$= (0.8)^5 + 5\,(0.8)^4 (0.2) + 10\,(0.8)^3 (0.2)^2 + 10\,(0.8)^2 (0.2)^3$
= 0.32768 + 0.4096 + 0.2048 + 0.0512
= 0.99328
Therefore, the values for P(x=3) and P(x<4) are 0.0512 and 0.99328
respectively.
Solved Problem 7: Bring out the fallacy, if any, in the following statement
on binominal distribution.
‘The mean of a binomial distribution is 4 and its variance is 5’.
Solution: Given that:
$np = 4$ (Mean) ..... (1)
$npq = 5$ (Variance) ..... (2)
Dividing (2) by (1):
$\frac{npq}{np} = \frac{5}{4}$ ..... (3)
$q = 5/4$
Since q > 1, which is impossible for a probability, the statement ‘The mean
of a binomial distribution is 4 and its variance is 5’ is wrong.
Solved Problem 8: Find the probability that X = 3 for a binomial distribution
whose mean is 3 and variance is 2.
Solution: Given that:
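$np = 3$ (Mean) ..... (1)
$npq = 2$ (Variance) ..... (2)
Dividing (2) by (1):
$q = \frac{2}{3}$, so $p = 1 - q = \frac{1}{3}$
Substituting the value of p in equation (1), you get the value of ‘n’ as:
$n = 9$
The binomial distribution is $(q + p)^n = \left(\frac{2}{3} + \frac{1}{3}\right)^9$
Therefore, the probability that X = 3 is given by:
$P(X = 3) = {}^{9}C_{3} \left(\frac{2}{3}\right)^{6} \left(\frac{1}{3}\right)^{3} = \frac{1792}{6561}$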
Hence, the probability that X = 3 for a binomial distribution is 1792/6561.
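The reasoning used in solved problems 7 and 8 (divide the variance by the mean to get q, then recover p and n) can be sketched in code. The function name `binomial_params` is hypothetical, and the check that n must be a whole number is an added assumption about what counts as a valid binomial distribution.

```python
from fractions import Fraction
from math import comb


def binomial_params(mean, variance):
    """Recover (n, p) from mean = np and variance = npq.

    Raises ValueError when no binomial distribution fits, for example
    when q = variance / mean is not a valid probability.
    """
    q = Fraction(variance) / Fraction(mean)
    if not 0 <= q < 1:
        raise ValueError(f"q = {q} is not a valid probability")
    p = 1 - q
    n = Fraction(mean) / p
    if n.denominator != 1:
        raise ValueError(f"n = {n} is not a whole number of trials")
    return int(n), p


# Solved problem 8: mean 3, variance 2
n, p = binomial_params(3, 2)
p_x_equals_3 = comb(n, 3) * p ** 3 * (1 - p) ** (n - 3)
print(n, p, p_x_equals_3)

# Solved problem 7: mean 4, variance 5 cannot be binomial
try:
    binomial_params(4, 5)
    fallacy_detected = False
except ValueError:
    fallacy_detected = True
print(fallacy_detected)
```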
Key Statistic
The probability distribution of a Poisson random variable X is given by:
$P(X = x) = \frac{e^{-m}\, m^{x}}{x!}$
where,
x varies from ‘0’ to infinity
$e = 2.71828$, the base of natural logarithm
m = mean number of successes in the given time interval
The mean and variance of the distribution are both ‘m’. Its standard
deviation is $\sqrt{m}$, and ‘m’ is called the parameter of the distribution.
Key Statistic
The mean of the Poisson distribution is also given by:
$m = np$
where, ‘p’ is the probability of success and ‘n’ is the number of trials.
v) The probability of success should be very small and ‘n’ should be large
such that ‘np’ is a constant m [Generally, p < 0.1 and n > 10]
6.5.2 Real life examples of Poisson variate
Some of the real life examples of Poisson variate are:
i) Number of accidents in any traffic circle
ii) Number of incoming telephone calls at an exchange per minute
iii) Number of radio-active particles emitted by substances
iv) Number of defects in a product
v) Number of micro-organisms developed during a period
6.5.3 Recurrence relation
Key Statistic
Recurrence relation between successive terms of a Poisson
expansion is given by:
$T_x = \frac{m}{x}\, T_{x-1}$
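The recurrence relation gives a convenient way to build Poisson probabilities term by term: start from P(X = 0) = e^(-m) and multiply by m/x to obtain each next term. The sketch below is illustrative; the function name `poisson_probabilities` is an assumption, and m = 4 is used only as a sample parameter.

```python
from math import exp


def poisson_probabilities(m, max_x):
    """Return [P(X=0), ..., P(X=max_x)] using T_x = (m / x) * T_(x-1)."""
    probs = [exp(-m)]                      # T_0 = e^(-m)
    for x in range(1, max_x + 1):
        probs.append(probs[-1] * m / x)    # T_x from T_(x-1)
    return probs


probs = poisson_probabilities(4, 2)
p_none = probs[0]                 # P(X = 0)
p_at_least_one = 1 - probs[0]     # P(X >= 1)
p_at_most_two = sum(probs)        # P(X <= 2)

print(round(p_none, 5), round(p_at_least_one, 5), round(p_at_most_two, 3))
```

Because the code carries e^(-4) at full precision, the last decimal place may differ slightly from hand calculations that round e^(-4) to 0.01832 first.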
$m = np = 2000 \times 0.002 = 4$
Therefore, the required probabilities are calculated as follows:
i. The probability that none catches fire is given by:
$P(X = 0) = \frac{e^{-m}\, m^{0}}{0!} = e^{-4} = 0.01832$
Therefore, the probability that none of the houses catches fire is
0.01832.
ii. The probability that at least one catches fire is given by:
$P(X \ge 1) = 1 - P(X = 0) = 1 - 0.01832 = 0.98168$
Therefore, the probability that at least one house catches fire is
0.98168.
iii. The probability that not more than 2 houses catches fire is given by:
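$P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2) = \frac{e^{-m} m^{0}}{0!} + \frac{e^{-m} m^{1}}{1!} + \frac{e^{-m} m^{2}}{2!}$
$= e^{-m}\left(1 + m + \frac{m^2}{2}\right) = e^{-4}\left(1 + 4 + \frac{16}{2}\right) = e^{-4} \times 13 = 0.01832 \times 13 = 0.2382$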
Therefore, the probability that not more than 2 houses catches fire is
0.2382.
Solved Problem 10: One percent of bulbs manufactured by a firm are
expected to be defective. A carton contains 200 bulbs. Find the probability
that the carton contains 3 or more defective bulbs?
Solution: Given that:
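The probability that a bulb is defective, $p = 0.01$
$n = 200$
$m = np = 200 \times 0.01 = 2$
The probability that the carton contains 3 or more defective bulbs is given
by:
$P(X \ge 3) = 1 - [P(X = 0) + P(X = 1) + P(X = 2)]$
$= 1 - \left[\frac{e^{-2}\, 2^{0}}{0!} + \frac{e^{-2}\, 2^{1}}{1!} + \frac{e^{-2}\, 2^{2}}{2!}\right] = 1 - e^{-2}(1 + 2 + 2)$
$= 1 - 0.13534 \times 5 = 1 - 0.6767 = 0.3233$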
Therefore, the probability that the carton contains 3 or more defective bulbs
is 0.3233.
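Solved problem 10 uses the Poisson distribution as an approximation to a binomial distribution with large n and small p. The sketch below, with illustrative helper names, compares the Poisson answer with the exact binomial value for the same carton of 200 bulbs.

```python
from math import comb, exp, factorial


def poisson_pmf(x, m):
    """P(X = x) for a Poisson variate with mean m."""
    return exp(-m) * m ** x / factorial(x)


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial variate with parameters n and p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 200, 0.01
m = n * p   # Poisson mean, m = np

# P(X >= 3) = 1 - [P(0) + P(1) + P(2)] under each model
poisson_tail = 1 - sum(poisson_pmf(x, m) for x in range(3))
binomial_tail = 1 - sum(binomial_pmf(x, n, p) for x in range(3))

print(round(poisson_tail, 4), round(binomial_tail, 4))
```

The two values agree closely, which is why the approximation is acceptable when p is small and n is large.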
Solved Problem 11: On an average, there are three mistakes on a page of
a book. The book contains 200 pages. What is the probability that a
randomly selected page has exactly one mistake?
Solution: Given that $m = 3$, the required probability is calculated as:
$P(X = 1) = \frac{e^{-3}\, 3^{1}}{1!} = 0.04979 \times 3 = 0.14937$
Hence, the probability that a randomly selected page has exactly one
mistake is 0.14937
Solved Problem 12: A sales representative of RSR Insurance Company
sells 3 insurance policies on an average in a week. Using the Poisson law,
calculate the probability that in a given week, the salesman will sell:
i. some life insurance policies
ii. two or more policies but less than 4 policies
Solution: In this problem, it is given that the mean ‘m’ is 3.
i. Some policies mean that salesman selling one or more insurance
policies. Hence, P(X>0) must be found out first which is equal to 1
minus P(X=0)
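$P(X > 0) = 1 - P(X = 0)$
Calculating P(X=0) using the Poisson distribution formula:
$P(X = x) = \frac{e^{-m}\, m^{x}}{x!}$
$P(X = 0) = \frac{e^{-3}\, 3^{0}}{0!} = 4.9787 \times 10^{-2}$
$P(X > 0) = 1 - P(X = 0) = 1 - 0.0498 = 0.9502$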
The probability that the salesman of RSR Insurance Company will sell
some life insurance policies is 95.02%.
ii. To find the probability of the salesman selling two or more but fewer
than four policies, we have to find the value of P(2≤X<4).
$P(2 \le X < 4) = P(X = 2) + P(X = 3)$
$P(2 \le X < 4) = \frac{e^{-3}\, 3^{2}}{2!} + \frac{e^{-3}\, 3^{3}}{3!}$
$P(2 \le X < 4) = 0.4482$
The probability that the salesman of RSR Insurance Company will sell
two or more policies but less than four policies is 44.82%.
Solved Problem 13: From the data given in solved problem 11, how many
pages would you expect to be free from mistakes?
Solution: Given that:
$m = 3$ and $n = 200$
$P(X = 0) = e^{-3} = 0.04979$
Expected number of pages to be free from mistakes is given by:
$n \times P(X = 0) = 200 \times 0.04979 = 9.958 \approx 10$ pages
Expected number of pages to be free from mistakes is approximately 10
pages.
Solved Problem 14: If X is a Poisson variate such that P(X = 1) = P(X = 2),
find P(X = 0).
Solution: Let ‘m’ be the parameter of the distribution, and P(X = 1) =
P(X = 2)
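$\frac{e^{-m}\, m^{1}}{1!} = \frac{e^{-m}\, m^{2}}{2!}$
$\frac{m}{1} = \frac{m^2}{2}$
$2m = m^2$, so $m = 2$ (since $m \ne 0$)
$P(X = 0) = e^{-2} = 0.13534$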
Limits         Area %
μ ± 1σ         68.2
μ ± 1.96σ      95
μ ± 2σ         95.4
μ ± 3σ         99.7
Key Statistic
The Normal distribution is the limiting form of binomial distribution.
$f(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} Z^{2}}$
Key Statistic
Any Normal distribution can be converted into a Standard Normal
distribution by the transformation:
$Z = \frac{x - \mu}{\sigma}$
where,
‘Z’ is called the Standard Normal variate, which gives the number of
standard deviations from x to the mean of the distribution
x is the value of the random variable X
μ is the mean of the distribution of the random variable X
σ is the standard deviation of this distribution
Key Statistic
The mean of Standard Normal distribution is ‘0’ and the standard
deviation is ‘1’.
Solved Problem 15: The weight of Cocavito packs packed by the filling
machine follows a normal distribution with mean weight of 500 gms and
standard deviation of 10 gms. A pack is selected at random. What is the
probability that:
i. The pack’s weight will exceed 515 gms?
ii. The pack’s weight will lie within 480 to 520 gms?
iii. The pack’s weight will be less than 480 gms or greater than 520
gms?
If 10,000 packs are supplied, how many packs will be rejected, given that
480 gms and 520 gms are lower and upper limit for acceptance?
Solution: To solve this problem we will draw the normal curve as
shown in figure 6.5.
i. The probability that the packs weight will exceed 515 gms is given by:
$P(X > 515) = 0.5 - P(500 < X < 515)$
$= 0.5 - P\left(\frac{500 - 500}{10} < Z < \frac{515 - 500}{10}\right) = 0.5 - P(0 < Z < 1.5) = 0.5 - 0.4332 = 0.0668$
Therefore, the probability that the packs weight will exceed 515 gms is
0.0668.
Note: Mean divides it into two equal portions and
ii. The probability that the pack’s weight lie within 480 to 520 gms is
given by:
$P(480 < X < 520) = P(480 < X < 500) + P(500 < X < 520)$
$= P\left(\frac{480 - 500}{10} < Z < 0\right) + P\left(0 < Z < \frac{520 - 500}{10}\right)$
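The three parts of solved problem 15, and the number of rejected packs, can be evaluated with the normal distribution class in Python's standard library. This is a sketch with illustrative variable names; it uses the exact cumulative distribution, so results may differ in the fourth decimal place from values read off a printed Z table.

```python
from statistics import NormalDist

# Pack weights: mean 500 gms, standard deviation 10 gms (solved problem 15)
weights = NormalDist(mu=500, sigma=10)

p_over_515 = 1 - weights.cdf(515)                 # i)   P(X > 515)
p_within = weights.cdf(520) - weights.cdf(480)    # ii)  P(480 < X < 520)
p_outside = 1 - p_within                          # iii) P(X < 480 or X > 520)

# Packs outside the 480 to 520 gms limits are rejected
packs_supplied = 10_000
expected_rejected = packs_supplied * p_outside

print(round(p_over_515, 4), round(p_within, 4), round(p_outside, 4))
print(round(expected_rejected))
```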
Solved Problem 16: The sales volume of 1000 retail outlets of a soap
company follows Normal distribution. 20% of retail outlets sell less than 50
units per day and 15% of them sells 200 unit and above. Find:
i. The mean and standard deviation of the sales volume
ii. The expected number of retail outlets that sells units between 50 and
148 units
Solution: Let ‘m’ and ‘σ’ be the mean and standard deviation. The
given information can be represented in a graph as shown in figure
6.7.
Given that:
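$P(50 < X < m) = 0.30$
$P\left(\frac{50 - m}{\sigma} < Z < 0\right) = 0.30$, so $\frac{50 - m}{\sigma} = -0.84$
or $50 - m = -0.84\,\sigma$
$m - 50 = 0.84\,\sigma$ ..... (1)
And
$P(m < X < 200) = 0.35$
$P\left(0 < Z < \frac{200 - m}{\sigma}\right) = 0.35$, so $\frac{200 - m}{\sigma} = 1.04$
$200 - m = 1.04\,\sigma$ ..... (2)
Adding (1) and (2):
$150 = 1.88\,\sigma$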
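The two conditions in solved problem 16 (20% of outlets below 50 units, 15% at 200 units and above) can be solved numerically. The sketch below obtains the Z values from the inverse normal CDF instead of a printed table, so its figures differ slightly from those based on the rounded values 0.84 and 1.04; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()            # mean 0, standard deviation 1

# P(X < 50) = 0.20 and P(X < 200) = 1 - 0.15 = 0.85
z_low = std_normal.inv_cdf(0.20)     # about -0.84
z_high = std_normal.inv_cdf(0.85)    # about  1.04

# (50 - m) / sigma = z_low and (200 - m) / sigma = z_high
sigma = (200 - 50) / (z_high - z_low)
mean = 50 - z_low * sigma

# ii) Expected number of the 1000 outlets selling between 50 and 148 units
sales = NormalDist(mu=mean, sigma=sigma)
expected_outlets = 1000 * (sales.cdf(148) - sales.cdf(50))

print(round(mean, 2), round(sigma, 2), round(expected_outlets))
```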
6.7 Summary
Quick analysis of observed data can be done if it is identified with the
theoretical distribution. The probabilities associated with random variate of
the distribution help us to know the chances of occurrence of several events
within specified values. We can also extend the solution to the cost aspects.
Binomial distribution is applied when you run a finite series of independent
Bernoulli trials and the probability of success remains the same for every
trial. In this distribution, ‘1’ represents the occurrence of success and ‘0’
represents the occurrence of failure.
Poisson distribution is a unimodal distribution with mean ‘m’ and standard
deviation $\sqrt{m}$. This distribution is the limiting form of binomial
distribution as ‘n’ tends to infinity.
Normal distribution is a continuous probability distribution with probability
density function f(x) given by:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}$
6.10 References
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
7.1 Introduction
In Unit 6, ‘Theoretical Distributions’, you have studied about both
discrete and continuous random variables along with the probability
Example 1
In the statistical survey aimed at determining average per capita income
of the people in the city, all earning individuals in the city form the
population.
Population
Sample
Key Statistic
The standard deviation of sampling distribution of any statistic is called
standard error of that statistic. It is denoted as ‘S’ and is given by:
$S^2 = \frac{\sum fX^2}{\sum f} - \left(\frac{\sum fX}{\sum f}\right)^2$
Let us understand about each of the error types and the factors causing
those errors.
Sampling errors
The sample results are bound to differ from population results, since a
sample is only a small portion of the population. Such errors are also known
as inherent errors and cannot be avoided entirely; it is not worthwhile to
try to eliminate them completely. These errors may be due to the following
factors:
Faulty selection of sample
Substitution of units to be studied
Faulty demarcation of sampling units
Error due to bias in estimation
However, the sampling errors follow random or chance variations and tend
to cancel out each other on averaging.
Non-sampling errors
Non-sampling errors are attributed to factors that can be controlled and
eliminated by suitable actions. It is worthwhile to eliminate these errors.
They are due to the following factors:
Faulty planning, faulty definitions
Defective methods of interviewing
Personal bias of investigator
Lack of trained and qualified investigators
Respondents’ failure to answer
Improper coverage
Compiling errors
Publication errors
Biased errors
These errors arise in both the census and sampling methods. They occur due
to personal bias of the investigator and the instruments used for measuring.
They are also due to faulty collection of data, respondent’s bias and bias
due to non-response. Biased errors have a tendency to grow with sample
size. Therefore, they are also known as cumulative errors. The magnitude of
biased errors is directly proportional to the sample size.
Unbiased errors
The errors that are due to over-estimation and under-estimation such that
they are equal are known as unbiased errors. They are also known as
compensatory errors. They do not increase with sample size.
7.6.1 Measures of statistical errors
Key Statistic
Absolute error is the difference between true value ‘t’ and the observed
value ‘a’. Symbolically, absolute error ‘AE’ is represented as:
$AE = t - a$
It is independent of magnitude of the actual value.
Key Statistic
Relative error is the ratio of the absolute error to the actual value. It is
symbolically represented as:
$RE = \frac{AE}{a} = \frac{t - a}{a}$
It provides a degree of error for comparison purposes between different
sets of data.
Example 2
The items produced by factories located at three cities ‘X’, ‘Y’ and ‘Z’ are
200, 300 and 500 respectively. We wish to draw a sample of 20 items
under proportional stratified sampling. We number the units from 0 to 999.
Then we refer to a random number table and select the numbers as
represented in table 7.4.
Table 7.4: Stratified random sampling
For Factory X    (200/1000) × 20 = 4
For Factory Y    (300/1000) × 20 = 6
For Factory Z    (500/1000) × 20 = 10
Total = 20
For first factory sample units selected are 174, 192, 069, 156.
For second factory sample units selected are 287, 432, 444, 482, 302,
254.
For third factory sample units selected are 854, 772, 733, 741, 822, 853,
570, 802, 629, 525.
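The proportional allocation in Example 2 can be sketched in Python. The function names and the use of a seeded pseudo-random generator in place of a random number table are assumptions made for illustration; the units drawn will therefore differ from those listed above.

```python
import random


def proportional_allocation(stratum_sizes, sample_size):
    """Allocate the sample to strata in proportion to stratum size.

    Rounding is used, so for awkward proportions the parts may not
    add up exactly to sample_size.
    """
    total = sum(stratum_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}


def draw_stratified_sample(stratum_sizes, sample_size, seed=0):
    """Number the units consecutively from 0 and sample within each stratum."""
    rng = random.Random(seed)   # stands in for a random number table
    allocation = proportional_allocation(stratum_sizes, sample_size)
    sample, start = {}, 0
    for name, size in stratum_sizes.items():
        units = range(start, start + size)   # unit numbers of this stratum
        sample[name] = sorted(rng.sample(units, allocation[name]))
        start += size
    return sample


factories = {"X": 200, "Y": 300, "Z": 500}
allocation = proportional_allocation(factories, 20)
sample = draw_stratified_sample(factories, 20)
print(allocation)
print(sample)
```

With these sizes, factory X covers units 0 to 199, factory Y covers 200 to 499 and factory Z covers 500 to 999, matching the numbering used in the example.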
Systematic sampling
This design is recommended if we have a complete list of sampling units
arranged in some systematic order such as geographical, chronological or
alphabetical order.
Suppose the population size is ‘N’. The population units are serially
numbered ‘1’ to ‘N’ in some systematic order and we wish to draw a sample
of ‘n’ units. Then we divide units from ‘1’ to ‘N’ into ‘n’ groups such that
each group has ‘K’ units.
This implies ‘nK = N’ or ‘K = N/n’. From the first group, we select a unit at
random. Suppose the unit selected is the 6th unit. Thereafter we select
every Kth unit after it, that is, units 6 + K, 6 + 2K and so on. If ‘K’ is 20,
‘n’ is 5 and ‘N’ is 100, then the units selected are 6, 26, 46, 66, 86.
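The selection rule for systematic sampling is short enough to express directly in code. The sketch below is illustrative: the function name `systematic_sample` and its parameters are assumptions, and it supposes that N is an exact multiple of n.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every K-th unit, K = N / n, after a random start in 1..K."""
    k = population_size // sample_size          # sampling interval K
    if start is None:
        start = random.Random(seed).randint(1, k)
    if not 1 <= start <= k:
        raise ValueError("start must lie in the first group, 1..K")
    return [start + i * k for i in range(sample_size)]


# N = 100, n = 5, so K = 20; the random start is fixed at unit 6
selected = systematic_sample(100, 5, start=6)
print(selected)
```

Leaving `start` unset draws the first unit at random from the first group, which is how the design is meant to be used in practice.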
The table 7.5 displays the merits and demerits of systematic sampling.
Table 7.5: Merits and demerits of systematic sampling
Merits                                    Demerits
1. Very easy to operate and easy to       1. In many cases we do not get an
   check.                                    up-to-date list.
2. It saves time and labour.              2. It gives biased results if a periodic
                                             feature exists in the data.
Cluster sampling
The total population is divided into recognisable sub-divisions, known as
clusters, such that within each cluster the units are more heterogeneous and
between clusters they are homogeneous. The units are selected from each
cluster by suitable sampling techniques. The figure 7.7 represents
cluster sampling, where each packet of candy forms a cluster.
Multi-stage sampling
The total population is divided into several stages. The sampling process is
carried out through several stages. It is represented as in figure 7.8.
Example 3
We want to select 1000 colleges from southern states. In the first stage,
we may select any three states. In the second stage, we may select some
districts in each selected state. In the third stage, we may select the
colleges in each district. We may adopt any sampling technique at each stage.
The table 7.6 displays the merits and demerits of multi-stage sampling.
Table 7.6: Merits and demerits of multi stage sampling
Merits                                    Demerits
Greater flexibility in sampling method    Estimates are less accurate
Existing division can be used             Investigator should have knowledge of the
                                          entire population that will be sampled
if the population size is less. The table 7.7 displays the merits and demerits
of judgement sampling.
Table 7.7: Merits and demerits of judgement sampling
Merits                                         Demerits
1. Most useful for small population            1. It is not a scientific method.
2. Most useful to study some unknown           2. It has a risk of investigator’s
   traits of a population some of whose           bias being introduced.
   characteristics are known.
3. Helpful in solving day-to-day problems.
Convenience sampling
The sample units are selected according to convenience of the investigator.
It is also called “chunk” which refers to the fraction of the population being
investigated which is selected neither by probability nor by judgment.
Moreover, a list or framework should be available for the selection of the
sample. It is used to make pilot studies. However, there is a high chance of
bias being introduced.
Quota sampling
It is a type of judgment sampling. Under this design, quotas are set up
according to some specified characteristic such as age groups or income
groups. From each group a specified number of units are sampled
according to the quota allotted to the group. Within the group the selection
of sample units depends on personal judgment. It has a risk of personal
prejudice and bias entering the process. This method is often used in public
opinion studies.
7.7.3 Caselet on types of sampling
Caselet
Read the information and answer the questions.
You have been given 5 boxes of biscuits. There are orange, brown and
yellow colour biscuits. You are asked to sample the biscuits. The target
population here is all of the biscuits and the sampling unit is the biscuit.
Answer the following questions.
i) How would you apply simple random sampling?
ii) How would you apply stratified sampling?
iii) How would you apply cluster sampling?
Key Statistic
The formula used for calculating the sample size for finite population is
given by:
$Z = \frac{P - P_s}{\sqrt{\frac{N - n}{N - 1} \cdot \frac{PQ}{n}}}$ (For finite population)
Key Statistic
The formula used for calculating the sample size for infinite population is
given by:
$Z = \frac{P - P_s}{\sqrt{\frac{PQ}{n}}}$ (For infinite population)
where,
Z = value according to the degree of accuracy desired
P = Population value
Ps = Sample value, which implies that P - Ps is the error desired in the
result
Q = 1 – P
n = Sample size.
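Solving the two relations above for n gives working formulas for the sample size. Writing E = P - Ps for the desired error, the infinite-population case gives n = Z²PQ/E², and the finite-population case gives n = Z²PQN / [(N - 1)E² + Z²PQ]. The sketch below applies both; the function names and the input figures (Z = 1.96, P = 0.5, E = 0.05, N = 2000) are hypothetical values chosen only to illustrate the calculation.

```python
from math import ceil


def sample_size_infinite(z, p, error):
    """n = Z^2 * P * Q / E^2, rounded up to the next whole unit."""
    q = 1 - p
    return ceil(z ** 2 * p * q / error ** 2)


def sample_size_finite(z, p, error, population_size):
    """n = Z^2 * P * Q * N / ((N - 1) * E^2 + Z^2 * P * Q), rounded up."""
    q = 1 - p
    zpq = z ** 2 * p * q
    n = zpq * population_size / ((population_size - 1) * error ** 2 + zpq)
    return ceil(n)


# Hypothetical inputs: 95% confidence (Z = 1.96), P = 0.5, error of 0.05
n_infinite = sample_size_infinite(1.96, 0.5, 0.05)
n_finite = sample_size_finite(1.96, 0.5, 0.05, population_size=2000)
print(n_infinite, n_finite)
```

The finite-population correction always reduces the required sample, and the two results converge as N grows large.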
Key Statistic
The formula used for calculating the sample size for finite population,
when population mean and sample mean are given, is:
$Z = \frac{\mu - \mu_s}{\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}}$ (For finite population)
Key Statistic
The formula used for calculating the sample size for infinite population,
when population mean and sample mean are given, is:
$Z = \frac{\mu - \mu_s}{\sigma / \sqrt{n}}$ (For infinite population)
where,
μ = Population mean
μs = Sample mean
σ = Standard deviation of population
n = Sample size
N = Size of population
Key Statistic
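The formula used for calculating the sample size, when the standard error
of the mean is given, is:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
where,
$\sigma_{\bar{x}}$ = Standard error of the mean (standard deviation of the sample means)
σ = Population standard deviation
n = Sample size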
vi) If the mean of a certain population is 20, it is likely that most of the
sample means will be 20.
vii) Any sampling distribution can be totally described by its mean and
standard deviation.
viii) Sampling from infinite population and from a finite population with
replacement results in:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
ix) The central limit theorem assures that the sampling distribution of
mean is always normal.
x) Stratified sampling is used when each group considered is more
homogeneous within itself and the groups are heterogeneous among themselves.
7.10 Summary
There are two methods of studying the characteristics of population, census
and sampling. The various advantages of sampling and the various errors
that could prop up in using these methods were explained.
Mainly, there are two methods of sampling namely; probability sampling and
non-probability sampling. The merits and demerits of each sampling method
were explained. We discussed the procedure for determining sample size.
We concluded the chapter with the importance of central limit theorem.
7.13 References
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited.
Unit 8 Estimation
Structure:
8.1 Introduction
Learning objectives
8.2 Reasons for Making Estimates
8.3 Making Statistical Inference
8.4 Types of Estimates
Point estimate
Interval estimate
8.5 Criteria of a Good Estimator
Unbiasedness
Efficiency
Consistency
Sufficiency
8.6 Point Estimates
8.7 Interval Estimates
Case study on calculating estimates
Making the interval estimate
8.8 Interval Estimates and Confidence Intervals
Interval estimates of the mean of large samples
Interval estimates of the proportion of large samples
Interval estimates using the Student’s ‘t’ distribution
8.9 Determining the Sample Size in Estimation
8.10 Summary
8.11 Terminal Questions
8.12 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
8.13 References
8.1 Introduction
In the unit 7, „Sampling and Sampling Distributions‟, you have studied about
sampling design and different theories of sampling. The sampling errors in
the sampling distributions are also studied. In this unit 8, „Estimation‟, you
will study about estimation and different types of estimation. You will also
study about calculation of confidence intervals of the population mean when
the standard deviation is unknown. Finally, you will study the methods to
calculate the sample size if the confidence levels are given.
Everyone makes estimates. When you are ready to cross a street, you
estimate the speed of any car that is approaching, the distance between you
and that car, and your own speed. Having made these quick estimates, you
decide whether to wait, walk, or run. With the knowledge of inferential
statistics, you can make estimates about a population using random
samples drawn from that population.
Learning objectives
By the end of this unit, you should be able to:
Distinguish between a point estimate and an interval estimate
Calculate the confidence interval
Describe the types of estimations
Describe interval estimates and confidence intervals
Calculate the sample size if the confidence intervals are given
Example 1
Suppose, we choose a sample of a given size and must decide whether
to use the sample mean or the sample median to estimate the
population mean.
If we calculate the standard error of the sample mean and find it to be
1.05, and then calculate the standard error of the sample median and
find it to be 1.6, we would say that the sample mean is a more efficient
estimator of the population mean, because its standard error is smaller.
It makes sense that an estimator with a smaller standard error (with less
variation) will have more chance of producing an estimate nearer to the
population parameter under consideration.
8.5.3 Consistency
A statistic is a consistent estimator of a population parameter if, as the sample
size increases, it becomes almost certain that the value of the statistic
comes very close to the value of the population parameter. If an estimator is
consistent, it becomes more reliable with large samples.
8.5.4 Sufficiency
An estimator is sufficient if it makes so much use of the information in the
sample that no other estimator could extract from the sample any additional
information about the population parameter being estimated.
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
Example 2
Table 8.1 displays the results of a sample of 35 boxes of
bolts.
Table 8.1: Results of samples of 35 boxes of bolts (bolts per box)
101 103 112 102 98 97 93
105 100 97 107 93 94 97
97 100 110 106 110 103 99
93 98 106 100 112 105 100
114 97 110 102 98 112 99
$$\bar{X} = \frac{\sum X}{n} = \frac{3570}{35} = 102$$
Thus, using the sample mean $\bar{X}$ as the estimator, we have a point
estimate of the population mean 'µ'.
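The point estimate above can be checked with a short script. This is only a minimal sketch: the list `bolts` is typed in from Table 8.1, and the variable names are illustrative rather than part of the original text. It also evaluates the sample variance using the divisor n - 1.

```python
from statistics import variance

# Bolts per box for the 35 sampled boxes (Table 8.1).
bolts = [
    101, 103, 112, 102, 98, 97, 93,
    105, 100, 97, 107, 93, 94, 97,
    97, 100, 110, 106, 110, 103, 99,
    93, 98, 106, 100, 112, 105, 100,
    114, 97, 110, 102, 98, 112, 99,
]

n = len(bolts)
x_bar = sum(bolts) / n        # point estimate of the population mean
s_squared = variance(bolts)   # sample variance, divides by n - 1

print(n, sum(bolts), x_bar)   # 35 3570 102.0
print(round(s_squared, 2))
```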
Key Statistic
An interval estimate describes a range of values within which a
population parameter is likely to lie.
If we select and plot a large number of sample means from a population, the
distribution of these means will approximate to normal curve. Furthermore,
the mean of the sample means will be the same as the population mean.
8.7.1 Case study on calculating estimates
Case Study
The marketing research director needs an estimate of the average life in
months of car batteries his company manufactures. We select a random
sample of 200 batteries with a mean life of 36 months. If we use the
point estimate of the sample mean 'x̄' as the best estimator of the
population mean 'µ', we would report that the mean life of the company's
batteries is 36 months.
The director also asks for a statement about the uncertainty that is likely
to accompany this estimate, that is, a statement about the range within
which the unknown population mean is likely to lie. To provide such a
statement, we need to find the standard error of the mean. Our sample
size of 200 is large enough that we can apply the central limit theorem.
Suppose, we have already estimated the standard deviation of the
population of the batteries and reported that it is 10 months.
Using this standard deviation, we can calculate the standard error of the
mean by using the formula $\sigma_{\bar{x}} = \sigma / \sqrt{n}$.
We find the standard error $\sigma_{\bar{x}} = 10 / \sqrt{200}$ to be 0.707 months.
(Cont. on topic ‘Making the interval estimate’)
Case Study
(Cont. from topic ‘Interval Estimates’)
We can tell the director that our estimate of the life of the company's
batteries is 36 months, and the standard error that accompanies this
estimate is 0.707. In other words, the actual mean life for all the batteries
may lie somewhere in the interval estimate of 35.293 to 36.707 months.
This is helpful but insufficient information for the director.
Next, we need to calculate the chance that the actual life will lie in this
interval or in other intervals of different widths that we might choose,
such as $\pm 2\sigma_{\bar{x}}$ (2 × 0.707), $\pm 3\sigma_{\bar{x}}$ (3 × 0.707),
and so on.
The probability is 0.955 that the mean of a sample size of 200 will be
within ±2 standard errors of the population mean. It can be stated
differently as 95.5 percent of all the sample means are within ±2
standard errors from the population mean 'µ'. The population mean 'µ' will
be located within ±2 standard errors from the sample mean 95.5 percent
of the time.
Hence, we can now report to the director that the best estimate of the
life of the company's batteries is 36 months, and we are 68.3 percent
confident that the life lies in the interval from 35.293 to 36.707
months ($36 \pm 1\sigma_{\bar{x}}$).
Similarly, we are 95.5 percent confident that the life falls within the
interval of 34.586 to 37.414 months ($36 \pm 2\sigma_{\bar{x}}$), and we are 99.7 percent
confident that battery life falls within the interval of 33.879 to 38.121
months ($36 \pm 3\sigma_{\bar{x}}$).
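The three intervals reported to the director can be reproduced as follows. This is a minimal sketch using the figures from the case study (mean 36 months, standard deviation 10 months, sample size 200); the names `levels` and `intervals` are illustrative only.

```python
import math

x_bar = 36.0   # sample mean life in months
sigma = 10.0   # population standard deviation in months
n = 200        # sample size

std_error = sigma / math.sqrt(n)

# Confidence levels associated with 1, 2 and 3 standard errors (from the text).
levels = {1: 68.3, 2: 95.5, 3: 99.7}

intervals = {}
for k, level in levels.items():
    lower = x_bar - k * std_error
    upper = x_bar + k * std_error
    intervals[k] = (round(lower, 3), round(upper, 3))
    print(f"{level}% confident: {lower:.3f} to {upper:.3f} months")
```

Widening the interval from one to three standard errors raises the confidence level, at the cost of a less precise statement about the mean life.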
Key Statistic
The probability that we associate with an interval estimate is called the
confidence level.
This probability indicates how confident we are that the interval estimate will
include the population parameter. A higher probability means more
confidence. In estimation, the most commonly used confidence levels are 90
percent, 95 percent, and 99 percent, but we are free to apply any
confidence level. The confidence interval is the range of the estimate we are
making.
Example 3
If we report that we are 90 percent confident that the mean of the
population of incomes of people in a certain community will lie between
Rs. 8,000 and Rs. 24,000, then the range Rs. 8,000 - Rs. 24,000 is our
confidence interval.
Thus, confidence limits are the upper and lower limits of the confidence
interval. In this case, $\bar{x} + 1.64\,\sigma_{\bar{x}}$ is called the upper confidence limit (UCL)
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \times \sqrt{\frac{N - n}{N - 1}}$$
and also the sample size 'n' is greater than five percent of the population
size 'N', that is,
$$\frac{n}{N} > 0.05$$
Similarly, we can modify the formula for the standard deviation of the
binomial distribution, $\sqrt{npq}$, which measures the standard deviation in the
number of successes. To change the number of successes to the proportion
of successes, we divide $\sqrt{npq}$ by n and get $\sqrt{pq/n}$. Therefore, the
standard error of the proportion is given by:
$$\text{S.E.} = \sqrt{pq / n}$$
Therefore, the interval estimate at the 99% level of confidence is 0.4 ± 2.58
(0.057), that is, from 0.253 to 0.547.
Hence, the proportion of the total population of employees who wish to
establish their own retirement plans lies between 0.253 and 0.547.
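The arithmetic of this interval can be sketched as below. The sample proportion (0.4), the standard error (0.057) and the 'z' value (2.58) are taken as stated in the text; the variable names are illustrative.

```python
p_hat = 0.4        # sample proportion from the example
std_error = 0.057  # standard error of the proportion, as stated in the text
z = 2.58           # 'z' value for the 99% confidence level

margin = z * std_error
lower, upper = p_hat - margin, p_hat + margin
print(f"{lower:.3f} to {upper:.3f}")
```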
8.8.3 Interval estimates using the Student’s ‘t’ distribution
So far, the sample sizes we examined were all larger than 30. This is
not always the case. This section addresses how to handle estimates where the normal
distribution is not the appropriate sampling distribution, that is, how to make
estimates when the population standard deviation is unknown and the sample size is 30 or less. For
example, we may have data from only 10 weeks, or sample sizes less than 30.
Fortunately, another distribution exists that is appropriate in these
cases. It is called the 't' distribution. Early theoretical work on 't' distributions
was done by W. S. Gosset in the early 1900s. Gosset was
employed by the Guinness Brewery in Dublin, Ireland, which did not permit
employees to publish research findings under their own names. So Gosset
adopted the pen name 'Student' and published under that name.
Sikkim Manipal University Page No. 209
Statistics for Management Unit 8
Key Statistic
We can define degrees of freedom as the number of values we can
choose freely. We will use degrees of freedom when we select a „t‟
distribution to estimate a population mean, and we will use „n-1‟ degrees
of freedom, where „n‟ is the sample size.
Key Statistic
In any estimation problem in which the sample size is 30 or less and the
standard deviation of the population is unknown and the underlying
population can be assumed to be normal or approximately normal, use
the „t‟ distribution.
distribution table. The „t‟ table is more compact and shows areas and „t‟
values for only a few percentages (10, 5, 2, and 1 Percent). Because there
is a different „t‟ distribution for each number of degrees of freedom, a more
complete table would be quite lengthy. Although, we can conceive of the
need for a more complete table.
A second difference in the 't' table is that it does not focus on the chance
that the population parameter being estimated will fall within our confidence
interval. Instead, it measures the chance that the population parameter we
are estimating will not be within our confidence interval (that is, it will lie
outside the confidence interval).
If we are making an estimate at the 90 percent confidence level, we would
look in the 't' table under the 0.10 column (100 percent – 90 percent = 10
percent). This 0.10 chance of error is symbolised by the Greek letter
alpha 'α'. We would find the appropriate 't' values for confidence intervals of
95 percent, 98 percent, and 99 percent under the columns headed 0.05,
0.02, and 0.01, respectively. A third difference in using the 't' table is that we
must specify the degrees of freedom with which we are dealing. Suppose
we make an estimate at the 90 percent confidence level with a sample size
of 14, which gives 13 degrees of freedom. We then look under the 0.10 column until
we encounter the row labelled 13. Like a 'z' value, the 't' value of 1.771
shows that if we mark off plus and minus 1.771 $\hat{\sigma}_{\bar{x}}$ (estimated standard errors
of $\bar{x}$) on either side of the mean, the area under the curve between these
two limits will be 90 percent, and the area outside these limits (the chance of
error) will be 10 percent.
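The lookup procedure described above (column = chance of error, row = degrees of freedom) can be sketched as a small function. The dictionary `T_TABLE` is a hypothetical two-row excerpt of a standard two-tailed 't' table, included only for illustration; a full table has one row for each number of degrees of freedom.

```python
# Excerpt of a two-tailed 't' table: rows are degrees of freedom,
# columns are the chance of error (alpha = 1 - confidence level).
T_TABLE = {
    9:  {0.10: 1.833, 0.05: 2.262, 0.02: 2.821, 0.01: 3.250},
    13: {0.10: 1.771, 0.05: 2.160, 0.02: 2.650, 0.01: 3.012},
}

def t_value(sample_size, confidence_percent):
    """Look up 't' for a given sample size and confidence level."""
    dof = sample_size - 1                                # degrees of freedom
    alpha = round((100 - confidence_percent) / 100, 2)   # chance of error
    return T_TABLE[dof][alpha]

print(t_value(14, 90))   # sample of 14 at the 90 percent level
```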
Self Assessment Questions
1. XY Pizza has developed quite a business in Bangalore by delivering
pizza orders promptly. It guarantees that its pizzas will be delivered in
30 minutes or less from the time the order was placed, and if the
delivery is late, the pizza is free. The time that it takes to deliver each
pizza order that is on time is recorded in the Pizza Time Book (PTB),
and the delivery time for those pizzas that are delivered late is recorded
as 30 minutes in the PTB. A sample of 12 random entries from the PTB
is listed in table 8.2.
$$z\,\sigma_{\bar{x}} = 500$$
At the 95% level of confidence, we know from the 'z' table that 'z' is 1.96.
Therefore,
$$1.96\,\sigma_{\bar{x}} = 500$$
$$\sigma_{\bar{x}} = 500 / 1.96 = 255$$
Now, if the standard error of the mean is 255, that leads us to:
$$\sigma_{\bar{x}} = \sigma / \sqrt{n} = 255$$
Since 'σ' is 1500, we can find 'n', that is:
$$1500 / \sqrt{n} = 255$$
Therefore,
$$n = \left(\frac{1500}{255}\right)^2 = 34.6$$
This implies that 'n' should be at least 35 (34.6 rounded up) if the university wants to
achieve the precision it requires for the survey.
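The same calculation can be written as a small function. This is a sketch under the figures used above (standard deviation 1500, allowed error 500, 'z' of 1.96); the function name `required_sample_size` is illustrative. Because it does not round the standard error to 255 first, the unrounded value is marginally below 34.6, but the rounded-up sample size is the same.

```python
import math

def required_sample_size(sigma, margin, z):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the margin."""
    return math.ceil((z * sigma / margin) ** 2)

n = required_sample_size(sigma=1500, margin=500, z=1.96)
print(n)
```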
8.10 Summary
In this unit 8, we have discussed point estimates and interval
estimates. These estimates are the foundations for inferential statistics in
estimation and hypothesis testing, which we will be discussing in the next
unit. In this unit, you have studied the concept of confidence levels and the
concept of making estimates when the sample sizes are small and large.
You have studied how to calculate the sample size when we know
the level of accuracy we want for the estimate. We have also
discussed that if the sample size is less than 30 and the population standard
deviation is not known, we use the Student's 't' distribution for estimation.
5.
i) 2.052
ii) 2.998
iii) 1.782
iv) 2.262
8.14 References
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited.
9.1 Introduction
In unit 8, 'Estimation', you studied the estimation of population
parameters from samples and the methods of estimation. In this unit 9, 'Testing of Hypothesis
in Case of Large and Small Samples', you will study hypotheses,
assumptions and the testing of hypotheses. Estimation is about estimating
population parameters from a sample and finding confidence intervals for them.
A hypothesis is an opinion about a population parameter that may or
may not be true. Hypothesis testing is helpful in decision making. Before
starting this unit, refresh the concepts you have studied on sampling and
estimation.
Hypothesis testing begins with an assumption, called a hypothesis that we
make about a population parameter. We assume a certain value for a
population mean. To test the validity of our assumption, we gather sample
data and determine the difference between the hypothesised value and the
actual value of the sample mean. Then we judge whether the difference is
significant.
The smaller the difference, the greater the likelihood that our hypothesised
value for the mean is correct. The larger the difference, the smaller the
likelihood that our hypothesised value for the mean is correct. Unfortunately,
the difference between the hypothesised population parameter and the
actual statistic is more often neither so large that we automatically reject our
hypothesis nor so small that we just as quickly accept it. So in hypothesis
testing, as in most significant real-life decisions, clear-cut solutions are the
exception, not the rule.
9.1.1 Learning objectives
By the end of this unit, you should be able to:
Describe the basic concepts of hypothesis testing
Describe the different test statistics available
Identify the test for a given problem
Identify the type of errors preferred
9.1.2 Assumptions
Although hypothesis testing sounds like some formal statistical term
completely unrelated to business decision making, in fact managers
propose and test hypotheses all the time. For example, "if we drop the price
of this car model by Rs. 1,500, we will sell 50,000 cars this year" is a
hypothesis. To test this hypothesis, total car sales till the end of the year
have to be counted.
Managerial hypotheses are based on intuition; the marketplace decides
whether the manager's intuitions were correct. Hypothesis testing is about
making inferences about a population from only a small sample. The bottom
line in hypothesis testing is when we ask ourselves (and then decide)
whether a population like the one we think this is would be likely to produce a
sample like the one we are looking at.
Example 1
We want to test the hypothesis that the population mean is equal to 500.
We would symbolise it as follows and read it as,
The null hypothesis is that the population mean = 500, written as
$H_0: \mu = 500$
The term „null hypothesis‟ arises from earlier agricultural and medical
applications of statistics. In order to test the effectiveness of a new fertiliser
or drug, the tested hypothesis (the null hypothesis) was that it had no effect,
that is, there was no difference between treated and untreated samples. If
we use a hypothesised value of a population mean in a problem, we would
represent it symbolically as '$\mu_{H_0}$'. This is read as 'the hypothesised value of
the population mean'.
If our sample results fail to support the null hypothesis, we must conclude
that something else is true. Whenever we reject the hypothesis, the
conclusion we do accept is called the alternative hypothesis and is
symbolised H1 (“H sub-one”).
For the null hypothesis $H_0: \mu = 200$, we will consider three alternative
hypotheses:
$H_1: \mu \neq 200$ (population mean is not equal to 200)
$H_1: \mu > 200$ (population mean greater than 200)
$H_1: \mu < 200$ (population mean less than 200)
9.2.2 Interpreting the level of significance
The purpose of hypothesis testing is not to question the computed value of
the sample statistic but to make a judgment about the difference between
that sample statistic and a hypothesised population parameter.
The next step after stating the null and alternative hypotheses is to decide
what criterion to use for deciding whether to accept or reject the null
hypothesis. If we assume the hypothesis is correct, then the significance
level will indicate the percentage of sample means that is outside certain
limits (In estimation, the confidence level indicates the percentage of sample
means that falls within the defined confidence limits).
9.2.3 Hypotheses are accepted and not proved
Even if our sample statistic does fall in the non-shaded region (the region
shown in figure 9.1 that makes up 95 percent of the area under the curve),
this does not prove that our null hypothesis (H0) is true; it simply does not
provide statistical evidence to reject it. Why? It is because the only way in
which the hypothesis can be accepted with certainty is for us to know the
population parameter; unfortunately, this is not possible.
Therefore, whenever we say that we accept the null hypothesis, we actually
mean that there is not sufficient statistical evidence to reject it. Use of the
term accept, instead of do not reject, has become standard. It means that
when sample data do not cause us to reject a null hypothesis, we behave as
if that hypothesis is true.
Test result says | Hypothesis is True | Hypothesis is False
Accept | Correct decision | Type II error
Reject | Type I error | Correct decision
You have to remember one more rule when testing the hypothesised values
of a mean. As in estimation, use the finite population multiplier whenever the
population is finite in size, sampling is done without replacement, and the
sample is more than five percent of the population.
Table 9.2: Conditions for using the normal and ‘t’ distributions in
testing hypothesis about means
Case Study
Assume that a manufacturer of light bulbs wants to produce bulbs with a
mean life of:
$\mu_{H_0} = 1000$ hours
Therefore, he rejects the null hypothesis if the mean life of bulbs in the
sample is either too far above 1,000 hours or too far below 1,000 hours.
A left-tailed test is one of two kinds of one-tailed tests. As you have probably
guessed by now, the other kind of one-tailed test is a right-tailed test (or an
upper-tailed test). An upper-tailed test is used when the hypotheses are
$H_0: \mu = \mu_{H_0}$ and $H_1: \mu > \mu_{H_0}$. Only values of the sample mean that are significantly above the
hypothesised population mean will cause us to reject the null hypothesis in
favour of the alternative hypothesis. This is called an upper-tailed test as
shown in figure 9.3, because the rejection region is in the upper tail of the
distribution of the sample mean.
Test No. | Description of Test | Test Statistic | Notes
1 | Test for specified proportion – infinite population | $Z = \dfrac{P - P_s}{(PQ/n)^{1/2}}$ | P = Population proportion; Ps = Sample proportion; Q = 1 – P; n = Sample size
2 | Test for specified proportion – finite population | $Z = \dfrac{P - P_s}{(PQ/n)^{1/2}\,[(N - n)/(N - 1)]^{1/2}}$ | P = Population proportion; Ps = Sample proportion; Q = 1 – P; n = Sample size; N = Population size
3 | Test between proportions – different populations | $Z = \dfrac{P_1 - P_2}{(P_1 Q_1/n_1 + P_2 Q_2/n_2)^{1/2}}$ | P1 = first sample proportion; P2 = second sample proportion; Q1 = 1 – P1; Q2 = 1 – P2; n1 = first sample size; n2 = second sample size
4 | Test between proportions – same population | $Z = \dfrac{P_1 - P_2}{[PQ\,(1/n_1 + 1/n_2)]^{1/2}}$ | P1 = first sample proportion; P2 = second sample proportion; Q = 1 – P; n1 = first sample size; n2 = second sample size
5 | Test for specified mean – infinite population | $Z = \dfrac{\mu - \mu_s}{\sigma/\sqrt{n}}$ | µ = Population mean; µs = Sample mean; σ = Population S.D. We can also use the sample S.D. (s) in case the population S.D. is not given
6 | Test for specified mean – finite population | $Z = \dfrac{\mu - \mu_s}{(\sigma/\sqrt{n})\,[(N - n)/(N - 1)]^{1/2}}$ | µ = Population mean; µs = Sample mean; σ = Population S.D.
1/ 2
Mean – Z 1 1
1/ n1 1/ n 2 1/ n1 1/ n 2
same
population
5. Test
$$|Z_{cal}| = \frac{|13000 - 14500|}{420} = 3.57$$
study carried out 2 years ago showed that 5% of the households would buy
the brand then. At 2% level of significance, should the company conclude
that there is an increased interest in the extra spicy flavor?
Solution: The procedure followed is explained in steps.
1. Null hypothesis Ho: P = Ps
Alternate hypothesis HA: P < Ps (one-tailed test)
2. Level of significance 2%, Ztab = 2.05
3. Test statistic
$$Z = \frac{P - P_s}{(PQ/n)^{1/2}}$$
4. Given P = 0.05, Ps = 335/6000 = 0.05583, n = 6000, Q = 1 – P = 0.95
$$(PQ/n)^{1/2} = \left(\frac{0.05 \times 0.95}{6000}\right)^{1/2} = 0.0028$$
5. Test
$$|Z_{cal}| = \frac{|0.05 - 0.05583|}{0.0028} = 2.08$$
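The test for a specified proportion can be sketched in a few lines. Note the assumption: the count of 335 households is not quoted directly but is what the sample proportion 0.05583 used in the test step implies for n = 6000. Because the script does not round the standard error to 0.0028 before dividing, it gives a 'Z' value slightly below the 2.08 obtained by hand; the decision is unchanged.

```python
import math

P = 0.05          # proportion from the earlier study (hypothesised value)
n = 6000          # households surveyed
buyers = 335      # assumption: implied by Ps = 0.05583 in the test step
Ps = buyers / n   # sample proportion
Q = 1 - P

std_error = math.sqrt(P * Q / n)
z_cal = abs(P - Ps) / std_error
z_tab = 2.05      # one-tailed table value at the 2% level (from the text)

print(round(std_error, 4), round(z_cal, 2))
print("reject Ho" if z_cal > z_tab else "accept Ho")
```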
$$Z = \frac{P - P_s}{(PQ/n)^{1/2}\,[(N - n)/(N - 1)]^{1/2}}$$
$$\frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{100}} = 0.5$$
5. Test
$$|Z_{cal}| = \frac{|200 - 201.3|}{0.5} = \frac{13}{5} = 2.60$$
$$t = \frac{\bar{X} - \mu}{S}\,\sqrt{n}$$
where, $S = \left[\dfrac{\sum (x - \bar{x})^2}{n - 1}\right]^{1/2}$
$$f(t) = C\left(1 + \frac{t^2}{\nu}\right)^{-(\nu + 1)/2}$$
where,
C = Constant required to make the area under the curve equal to
unity.
ν = n – 1, Degree of Freedom.
4. The value of 't' ranges from −∞ to +∞
5. 'ν' is called the parameter of the distribution
6. It is symmetrical about mean
7. Its mean is zero
8. Variance of the distribution is greater than one.
9. It has larger areas at the tails compared to normal distribution and
lower height at the mean.
10. It tends to a normal distribution as n → ∞.
9.7.1 Uses of ‘t’ test
The „t‟ test is used:
To test a specified value.
To test the differences between values (independent sample).
As a paired „t‟ – test (dependent sample)
To construct confidence interval for the estimates
Table 9.6 displays the description of tests in the case of small samples, where
'n' is a variable and the population standard deviation is not known.
Test No. | Description of Test | Test Statistic | Notes
1 | Test for specified value – infinite population (D.O.F n – 1) | $t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$, where $S^2 = \dfrac{\sum (x - \bar{X})^2}{n - 1}$ | $\bar{X}$ = Sample mean; µ = Population mean
2 | Test for specified value – finite population (D.O.F n – 1) | $t = \dfrac{\bar{X} - \mu}{(S/\sqrt{n})\,[(N - n)/(N - 1)]^{1/2}}$ | N = Population size
3 | Test between values – independent samples (D.O.F n1 + n2 – 2) | $t = \dfrac{\bar{X} - \bar{Y}}{S\,(1/n_1 + 1/n_2)^{1/2}}$, where $S^2 = \dfrac{\sum (x - \bar{X})^2 + \sum (y - \bar{Y})^2}{n_1 + n_2 - 2}$ or $S^2 = \dfrac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 2}$ | $\bar{X}$ = first sample mean; $\bar{Y}$ = second sample mean
X | d = X – 48 | d²
45 | -3 | 9
49 | +1 | 1
50 | +2 | 4
49 | +1 | 1
44 | -4 | 16
52 | +4 | 16
48 | 0 | 0
45 | -3 | 9
46 | -2 | 4
45 | -3 | 9
Total | -7 | 69
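$$\bar{X} = A + \frac{\sum d}{n} = 48 + \frac{-7}{10} = 47.3$$
$$S = \left[\frac{1}{n - 1}\left(\sum d^2 - \frac{(\sum d)^2}{n}\right)\right]^{1/2}$$
$$S = \left[\frac{1}{9}\left(69 - \frac{(-7)^2}{10}\right)\right]^{1/2} = (7.12)^{1/2} = 2.67$$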
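The assumed-mean shortcut used in the table can be checked against the direct definitions of the mean and the standard deviation. This is a minimal sketch: the list `x` holds the ten observations from the table, `A = 48` is the assumed mean, and the remaining names are illustrative.

```python
import math
from statistics import mean, stdev

x = [45, 49, 50, 49, 44, 52, 48, 45, 46, 45]
A = 48                      # assumed mean used in the table
d = [xi - A for xi in x]    # deviations from the assumed mean
n = len(x)

x_bar = A + sum(d) / n
S = math.sqrt((sum(di ** 2 for di in d) - sum(d) ** 2 / n) / (n - 1))

print(sum(d), sum(di ** 2 for di in d))   # -7 69
print(round(x_bar, 1), round(S, 2))

# Cross-check against the direct definitions.
assert math.isclose(x_bar, mean(x))
assert math.isclose(S, stdev(x))
```

The shortcut gives the same results as the direct formulas because subtracting a constant from every observation shifts the mean by that constant and leaves the spread unchanged.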
5. Test
$$|t_{cal}| = \frac{|47.3 - 50.0|}{0.8362} = 3.23$$
6. Conclusion: Since tcal (3.23) > ttab (2.262), Ho is rejected.
$$\frac{S}{\sqrt{n}} = \frac{\sqrt{10.3}}{\sqrt{8}} = 1.135$$
5. Test
$$|t_{cal}| = \frac{|15.8 - 17.5|}{1.135} = 1.498$$
6. Conclusion: Since tcal (1.498) < ttab (3.36), Ho is accepted
It can be considered as a random sample.
Solved Problem 9: Treatment 'A' gave brightness index for a substance on
5 randomly selected samples as 60, 41, 48, 39, 42. Treatment 'B' gave the
same index on another 7 randomly selected samples as 56, 42, 38, 69, 68,
64, 62. At 5% level of significance, can we conclude that treatment 'B'
increases the brightness?
Solution: The steps followed are described below.
1. Null hypothesis Ho: $\bar{X}_1 = \bar{X}_2$
2. Alternate hypothesis HA: $\bar{X}_1 < \bar{X}_2$ (one-tailed test)
3. Level of significance 5% and D.O.F 5 + 7 – 2 = 10, ttab = 2.228
4. Test statistic
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S\,(1/n_1 + 1/n_2)^{1/2}}$$
5. Given that:
Table 9.8. Frequency table for treatment 'A'
X | d = X – 46 | d²
60 | +14 | 196
41 | -5 | 25
48 | +2 | 4
39 | -7 | 49
42 | -4 | 16
Total 230 | 0 | 290
Table 9.9. Frequency table for treatment 'B'
X | d = X – 57 | d²
56 | -1 | 1
42 | -15 | 225
38 | -19 | 361
69 | +12 | 144
68 | +11 | 121
64 | +7 | 49
62 | +5 | 25
Total 399 | 0 | 926
Table 9.8 and table 9.9 show the frequency table data for
treatment 'A' and treatment 'B' respectively.
$$S^2 = \frac{1}{n_1 + n_2 - 2}\left[\sum (X_1 - \bar{X}_1)^2 + \sum (X_2 - \bar{X}_2)^2\right] = \frac{1}{10}\,(290 + 926) = 121.6$$
$$S = \sqrt{121.6} = 11.03, \qquad S\,(1/5 + 1/7)^{1/2} = 11.03 \times (0.3429)^{1/2} \approx 6.457$$
6. Test
$$|t_{cal}| = \frac{|46 - 57|}{\sqrt{121.6}\,(1/5 + 1/7)^{1/2}} = \frac{11.0}{6.457} = 1.7$$
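The pooled two-sample calculation can be reproduced as follows. This is a sketch under one assumption: the sample values are taken as those consistent with the table totals of 230 (five values) and 399 (seven values) and with the sums of squares 290 and 926. The helper `sum_sq_dev` is an illustrative name.

```python
import math

a = [60, 41, 48, 39, 42]           # treatment 'A' (sum 230)
b = [56, 42, 38, 69, 68, 64, 62]   # treatment 'B' (sum 399)

def sum_sq_dev(values):
    """Sum of squared deviations from the sample mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

n1, n2 = len(a), len(b)
mean_a, mean_b = sum(a) / n1, sum(b) / n2
pooled_var = (sum_sq_dev(a) + sum_sq_dev(b)) / (n1 + n2 - 2)
std_error = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
t_cal = abs(mean_a - mean_b) / std_error

print(mean_a, mean_b, round(pooled_var, 1), round(t_cal, 2))
```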
Retail Outlets 1 2 3 4 5 6
Sales before campaign 50 48 31 42 28 53
Sales after campaign 56 55 30 45 29 58
Solution: The table 9.10b shows the frequency table calculated for the
sales data before and after campaign.
Table 9.10b. Frequency table for the sales data before and after campaign
$$\bar{d} = \frac{\sum d}{n} = \frac{21}{6} = 3.5$$
$$S^2 = \frac{1}{n - 1}\left[\sum d^2 - \frac{(\sum d)^2}{n}\right]$$
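The paired calculation can be sketched from the sales figures in the table. The differences are taken as sales after the campaign minus sales before it, which reproduces the total of 21 shown above; the 't' statistic at the end uses the standard paired form (mean difference divided by its standard error), which is an assumption about how the solution continues.

```python
import math

before = [50, 48, 31, 42, 28, 53]
after = [56, 55, 30, 45, 29, 58]

d = [a - b for a, b in zip(after, before)]   # change per outlet
n = len(d)
d_bar = sum(d) / n
s_squared = (sum(x * x for x in d) - sum(d) ** 2 / n) / (n - 1)
t_cal = d_bar / math.sqrt(s_squared / n)     # assumed paired 't' statistic

print(sum(d), d_bar, s_squared, round(t_cal, 2))
```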
9.8 Summary
In this unit 9, we have defined what is meant by hypothesis and studied the
procedure for testing of hypothesis. We have defined what is meant by
significance level and types of errors. We have also seen different types of
tests, two tailed and one tailed. You have also studied under what
circumstances these tests are done and also the steps involved in
identifying the test.
We discussed the four tests available for small samples. These tests can be
used for small samples (n ≤ 30) whose population standard
deviations are not known. The different tests are illustrated with solved
problems.
1. Twenty households out of 1000 were using Brand „A‟ toothpaste. The
company increased the price of the brand. In a survey, they found that
only 12 households out of 1000 are using it now. Can we conclude at
5% level of significance that proportion of users has decreased?
2. A drill drills holes with standard deviation of depth 0.03cms. It is adjusted
to drill holes of depth 5.5cm. For 50 holes drilled, the mean depth is
5.503cm. Test at 5% level of significance whether the adjustment is
correct.
3. Out of 80 batteries produced by a process I, three were found to be
defective. Another sample of 130 produced by process II, two were
found to be defective. Test whether the proportion of defectives in two
processes differs, using 1% level of significance.
4. The table 9.11 displays the data related to mean weight of a product.
Test whether there is a significant difference in means of the plants.
9. Table 9.13 displays the results related to the memory capacity of 10
students before and after training. Test at 5% level of significance
whether training is effective.
Roll No | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Before Training | 1 | 14 | 11 | 8 | 7 | 10 | 3 | 0 | 5 | 6
After Training | 1 | 16 | 10 | 7 | 5 | 12 | 10 | 2 | 3 | 8
9.11 References
Structure:
10.1 Introduction
Learning objectives
10.2 Chi-Square as a Test of Independence
Characteristics of χ² test
Degrees of freedom
Restrictions in applying χ² test
Practical applications of χ² test
Levels of significance
Steps in solving problems related to Chi-Square test
Interpretation of Chi-Square values
10.3 Chi-Square Distribution
Properties of χ² distribution
Conditions for applying the Chi-Square test
Uses of χ² test
10.4 Applications of Chi-Square test
Tests for independence of attributes
Test of goodness of fit
Test for specified variance
10.5 Summary
10.6 Terminal Questions
10.7 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
10.8 References
10.1 Introduction
In unit 9, ‘Testing of Hypothesis for Large and Small Samples’, we
discussed how to test hypotheses using data from either one or two
samples. We used one-sample tests to determine whether a mean or a
proportion was significantly different from a hypothesised value. In the two-
sample tests, we examined the difference between either two means or two
proportions, and we tried to learn whether this difference was significant.
The χ² test is very widely used for research purposes in behavioral and
social sciences, including business research.
It is defined as:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where, ‘O’ is the observed frequency and ‘E’ is the expected frequency.
Key Statistic
The observed frequencies are the frequencies obtained from the
observation, which are sample frequencies.
The expected frequencies are the calculated frequencies.
Example 1
Suppose, we are asked to write any four numbers, then we will have all
the numbers of our choice. If a restriction is applied or imposed to the
choice that the sum of these numbers should be 50; then the freedom of
choice would be reduced to three only and so the degrees of freedom
would now be 3.
Key Statistic
The Chi-Square distribution has only one parameter, that is, degrees of
freedom.
Key Statistic
The results of Chi-Square test cannot be accurate if the cell frequencies
in a contingency table are less than 5.
Key Statistic
The Chi-Square curve will be on the positive side of X-axis because the
Chi-Square values are always positive.
Key Statistic
If X1, X2, …, Xn are ‘n’ independent random variables following the
normal distribution with mean ‘µ’ and standard deviation ‘σ’ respectively,
then the χ² variate is given by:
$$\chi^2 = \left(\frac{x_1 - \mu}{\sigma}\right)^2 + \left(\frac{x_2 - \mu}{\sigma}\right)^2 + \dots + \left(\frac{x_n - \mu}{\sigma}\right)^2$$
It is the sum of the squares of ‘n’ independent standard normal variates,
following the χ² distribution with ‘n’ degrees of freedom.
6. The expected frequency of any item or cell must not be less than 5. If it is, the
frequencies of adjacent items or cells should be pooled together in order
to make it more than 5.
7. The data should be expressed in original units for convenience of
comparison and the given distribution should not be replaced by relative
frequencies or proportions.
8. This test is used only for drawing inferences through test of the
hypothesis, so it cannot be used for estimation of parameter value.
10.3.3 Uses of χ² test
The χ² test is used broadly to:
Test goodness of fit for one way classification or for one variable only
Test independence or interaction for more than one row or column in the
form of a contingency table concerning several attributes
Test population variance ‘σ²’ through confidence intervals suggested by
χ² test
$$\nu = (\text{Number of rows} - 1)(\text{Number of columns} - 1)$$
where, ‘ν’ is the degree of freedom.
The test statistic value does not change if the order of the rows or
columns is interchanged. Also the value does not change even if the
rows and columns are interchanged.
Solved Problem 1: Calculate the degrees of freedom for a contingency
table with three rows and two columns.
Solution: The degrees of freedom, denoted by ‘ν’, is calculated as:
$$\nu = (\text{Number of rows} - 1)(\text{Number of columns} - 1) = (3 - 1)(2 - 1) = 2$$
Hence, a contingency table with three rows and two columns has two
degrees of freedom.
Solved Problem 2: The table 10.1a gives the production in three shifts and
the number of defective goods that turned out in three weeks. Test at 5%
level of significance whether weeks and shifts are independent.
Solution: Table 10.1b displays the observed and expected values
required to calculate χ².
Table 10.1b. Observed and expected values for data of solved problem 2
Observed Value (O) | Expected Value (E) | (O – E)² | (O – E)²/E
15 | 40 x 60/150 = 16 | 1 | 0.0625
20 | 50 x 60/150 = 20 | 0 | 0.0000
25 | 60 x 60/150 = 24 | 1 | 0.0417
5 | 40 x 30/150 = 8 | 9 | 1.1250
10 | 50 x 30/150 = 10 | 0 | 0.0000
15 | 60 x 30/150 = 12 | 9 | 0.7500
20 | 40 x 60/150 = 16 | 16 | 1.0000
20 | 50 x 60/150 = 20 | 0 | 0.0000
20 | 60 x 60/150 = 24 | 16 | 0.6667
χ² = 3.6459
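The expected values and the χ² total in Table 10.1b can be reproduced from the observed counts alone. This is a minimal sketch: the 3 x 3 layout of `observed` is an assumption that follows the order in which the values are listed in the table (three groups of three), and it yields the same marginal totals (40, 50, 60 and 60, 30, 60) used in the expected-value column. The unrounded total is 3.6458; the hand calculation sums rounded terms and gives 3.6459.

```python
observed = [
    [15, 20, 25],
    [5, 10, 15],
    [20, 20, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total   # expected frequency
        chi_square += (o - e) ** 2 / e

# Degrees of freedom: (rows - 1) x (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi_square, 4), dof)
```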
Solution: The table 10.2a displays the information given in solved problem
3 in a tabulated form.
Table 10.2a. Data related to solved problem 3
Other States Urban Rural Total
Visited 400 100 500
Not Visited 200 300 500
Total 600 400 1000
Table 10.2b displays the observed and expected values for the
calculation of χ².
Table 10.2b. Observed and expected values for data of solved problem 3
considered statistical model. These test results are helpful to know whether
the samples are drawn from identical distributions or not. The degrees of
freedom is ‘n-1’ and the expected value is equal to the average of the
observed values.
Solved Problem 4: A personnel manager is interested in trying to determine
whether absenteeism is greater on one day of the week than on another day
of the week. He has the record for the past years. Test whether
absenteeism is uniformly distributed over the week.
Table 10.3a. Comparison of data about absenteeism
Days of Week | Monday | Tuesday | Wednesday | Thursday | Friday
Number of absentees | 66 | 57 | 54 | 48 | 75
Solution: If the absenteeism is uniformly distributed over the week, then the
expected number of absentees per day is given by:
$$\frac{66 + 57 + 54 + 48 + 75}{5} = 60$$
Table 10.3b represents the calculated expected values required for
calculation of χ² for the data related to solved problem 4.
Table 10.3b. Observed and expected values for calculation of χ² for solved
problem 4
Observed Value (O) | Expected Value (E) | (O – E)² | (O – E)²/E
66 | 60 | 36 | 0.6000
57 | 60 | 9 | 0.1500
54 | 60 | 36 | 0.6000
48 | 60 | 144 | 2.4000
75 | 60 | 225 | 3.7500
χ²cal = 7.5000
5. Conclusion: Since χ²cal (7.5) < χ²tab (9.49), ‘Ho’ is accepted. Hence,
absenteeism and days of week are independent.
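The goodness-of-fit calculation for the absenteeism data can be sketched as below. The table value 9.49 is taken from the conclusion above; the variable names are illustrative.

```python
observed = [66, 57, 54, 48, 75]            # Monday to Friday
expected = sum(observed) / len(observed)   # uniform distribution over the week

chi_square = sum((o - expected) ** 2 / expected for o in observed)
chi_tab = 9.49                             # table value quoted in the text

print(expected, round(chi_square, 4))
print("reject Ho" if chi_square > chi_tab else "accept Ho")
```

Since the calculated value is below the table value, the sketch reports that the hypothesis of uniform absenteeism is accepted.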
Solved Problem 5: According to a theory in Genetics, the proportion of beans
of types A, B, C and D in a generation should be 9:3:3:1. In an experiment
with 1600 beans, the frequencies of beans of types A, B, C and D were
observed to be 882, 313, 287 and 118 respectively. Does the result support
the theory?
Solution: The steps followed for calculation of Chi-Square are described
below.
1. Null hypothesis ‘Ho’: The result supports theory
Alternate hypothesis ‘HA’: The result does not support theory
2. Level of Significance is 5% and D.O.F = (4 – 1) = 3, so χ²tab = 7.81
3. Test Statistic:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
4. By Null hypothesis, E = Total No. x Corresponding ratio.
Table 10.4 displays the observed and expected values for calculation of
χ² for solved problem 5.
Table 10.4: Observed and expected values for calculation of χ² for solved
problem 5
Observed Value (O)   Expected Value (E)    (O – E)²   (O – E)²/E
882                  1600 x 9/16 = 900     324        0.36
313                  1600 x 3/16 = 300     169        0.56
287                  1600 x 3/16 = 300     169        0.56
118                  1600 x 1/16 = 100     324        3.24
χ²cal                                                 4.72
$$\chi^2 = \frac{\sum (X - \bar{X})^2}{\sigma_0^2} = \frac{nS^2}{\sigma_0^2}$$
If the calculated value lies between ‘K1’ and ‘K2’ then ‘H0’ is accepted. ‘K1’
and ‘K2’ values are read from the table.
Solved Problem 6: The standard deviation of heights of plants is known
to be 2 cm. Eight randomly selected plants have heights 172, 156, 154,
163, 170, 169, 170 and 164 cm. Test whether the sample standard
deviation differs significantly from this value.
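Taking deviations d = X – 160, we get Σd = 38 and Σd² = 502, so that:
$$S^2 = \frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2 = \frac{502}{8} - \left(\frac{38}{8}\right)^2 = 40.1875$$
$$nS^2 = 8 \times 40.1875 = 321.5$$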
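The arithmetic for the specified-variance test can be sketched in Python as follows. This assumes the test statistic has the form nS²/σ0², with S² computed using divisor n; the function name `variance_test_statistic` is hypothetical.

```python
def variance_test_statistic(sample, sigma0):
    """Return (n * S^2 / sigma0^2, S^2), where S^2 uses divisor n."""
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / n   # sample variance, divisor n
    return n * s2 / sigma0 ** 2, s2


heights = [172, 156, 154, 163, 170, 169, 170, 164]   # data of solved problem 6
statistic, s2 = variance_test_statistic(heights, sigma0=2)
print(s2, statistic)
```

The resulting statistic would then be compared with the table values ‘K1’ and ‘K2’ for n – 1 = 7 degrees of freedom.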
10.5 Summary
Chi-Square test is a non-parametric test. It is used to test the independence
of attributes, goodness of fit and specified variance. The Chi-Square test
does not require any assumptions regarding the shape of the population
distribution from which the sample was drawn.
Chi-Square test assumes that samples are drawn at random and external
forces, if any, act on them in equal magnitude.
Chi-Square distribution is a family of distributions. For every degree of
freedom, there will be one chi-square distribution.
An important criterion for applying the Chi-Square test is that the sample
size should be very large. None of the theoretical expected values
calculated should be less than five.
The important applications of Chi-Square test are the tests for
independence of attributes, the test of goodness of fit and the test for
specified variance.
1. Treatments ‘X’ and ‘Y’ were each given to 400 items of a material to
enhance its strength. 80 items gained strength with treatment ‘X’
and 20 gained strength with treatment ‘Y’. Does the gain in strength
depend on the treatment?
2. Table 10.6 gives the liking for a particular model of car among
different age groups.
Table 10.6: Data related to terminal question 2
                                        AGE
                        Below 20   20 – 39   40 – 59   60 and above   Total
Persons who liked Car   140        80        40        20             280
Disliked Car            60         50        30        80             220
Total                   200        130       70        100            500
3. The demand for a particular spare part was found to vary from day to
day. In a sample study, the information represented in table 10.7 was
obtained. Test the hypothesis that the number demanded depends upon
the day.
Table 10.7: Data related to terminal question 3
Days                Mon    Tue    Wed    Thur   Fri    Sat
Quantity Demanded   1124   1125   1110   1120   1126   1115
10.8 References
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited.
S. P. Gupta, Statistical Methods, (2006), Sultan Chand & Sons.
Structure:
11.1 Introduction
Learning objectives
11.2 Analysis of Variance (ANOVA)
11.3 Assumptions for F-test
Objectives of ANOVA
ANOVA table
Assumptions for study of ANOVA
11.4 Classification of ANOVA
ANOVA table in one-way ANOVA
Two way classifications
11.5 Summary
11.6 Terminal Questions
11.7 Answers to SAQs and TQs
Answers to self assessment questions
Answers to terminal questions
11.8 References
11.1 Introduction
In Unit 10, ‘Chi-Square’, you studied the characteristics and properties
of the Chi-Square distribution. We also discussed how to find the
Chi-Square test results for a given sampling distribution, and how to
calculate Chi-Square values for either rejecting or not rejecting the null
hypothesis. In this Unit 11, ‘F-Distribution and Analysis of Variance
(ANOVA)’, we will discuss the purpose of analysis of variance and how to
conduct the F-test.
In the previous unit, you studied that the Chi-Square test is used to test
differences between sample proportions and to infer whether they come from
the same population distribution. When we want to compare the means of
more than two populations, we use the analysis of variance.
Key Statistic
The technique of analysis of variance is referred to as ANOVA.
Initially the technique was applied in the field of Zoology and Agriculture, but
in a later stage, it was applied to other fields also. In analysis of variance,
the degree of variance between two or more data as well as the factors
contributing towards the variance is studied.
In fact, Analysis of Variance is the classification and cross-classification of
statistical data with the view of testing whether the means of specific
classification differ significantly or whether they are homogeneous.
The Analysis of Variance is a method of splitting the total variation of data
into constituent parts which measure different sources of variations.
The total variation is split up into the following two components.
Variation within the subgroups of the samples
Variation between the subgroups of the samples
Hence, the total variance is the sum of variance between the samples and
the variance within the samples. After obtaining the above two variations,
these are tested for their significance by F-test which is also known as
variance ratio test.
The ‘F’ statistic is defined as F = S1² / S2², where S1² > S2². It is used
to test differences between variances, that is, whether two populations can
be considered to have the same variance or not. As you studied in Unit 10,
to test a specified variance we use the χ²-test. The sample variances
‘S1²’ and ‘S2²’ are calculated as:
$$S_1^2 = \frac{1}{n_1 - 1}\sum (X - \bar{X})^2 \quad\text{and}\quad S_2^2 = \frac{1}{n_2 - 1}\sum (Y - \bar{Y})^2$$
where,
‘n1’ is the size of the first sample
‘n2’ is the size of the second sample
X̄ and Ȳ denote the sample means of the random variables ‘X’ and ‘Y’
respectively
The F-test is also known as the variance ratio test. It has two degrees of
freedom, one for the numerator of the ratio and another for the
denominator. They are represented by:
ν1 = n1 – 1 and ν2 = n2 – 1
where, ‘ν1’ and ‘ν2’ are the degrees of freedom in the numerator and
denominator respectively.
Solution: The tables 11.1b and 11.1c represent the frequency table
required for the calculation of sample means for the data given for two
different methods.
Table 11.1b. Required values of the method I to calculate sample mean
X       d = X - 22   d²
27      5            25
23      1            1
16      -6           36
20      -2           4
26      4            16
22      0            0
Total   2            82
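$$S_1^2 = \frac{1}{n_1 - 1}\left[\sum d^2 - \frac{(\sum d)^2}{n_1}\right] = \frac{1}{5}\left[82 - \frac{4}{6}\right] = 16.266$$
$$S_2^2 = \frac{1}{n_2 - 1}\left[\sum d^2 - \frac{(\sum d)^2}{n_2}\right] = \frac{1}{6}\left[136 - \frac{16}{7}\right] = 22.286$$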
1. Null hypothesis ‘Ho’: σ1² = σ2², that is, the sample variances of two
5. Conclusion: Since Fcal (1.37) < Ftab (4.95), ‘H0’ is accepted. Hence, there
is no significant difference.
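As a cross-check, the variance ratio for this solved problem can be recomputed from the deviation sums quoted in the solution (Σd = 2, Σd² = 82, n = 6 for method I; (Σd)² = 16, Σd² = 136, n = 7 for method II). The following Python sketch uses a hypothetical helper name, and the sign of Σd for method II is assumed, since only its square enters the formula.

```python
def variance_from_deviations(sum_d, sum_d2, n):
    """Sample variance (divisor n - 1) from deviations about an assumed mean."""
    return (sum_d2 - sum_d ** 2 / n) / (n - 1)


s1_sq = variance_from_deviations(sum_d=2, sum_d2=82, n=6)    # method I
s2_sq = variance_from_deviations(sum_d=4, sum_d2=136, n=7)   # method II

# The larger variance goes in the numerator of the F ratio.
f_cal = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)
f_tab = 4.95   # table value quoted in the text

print(round(s1_sq, 3), round(s2_sq, 3), round(f_cal, 2))
print("reject Ho" if f_cal > f_tab else "do not reject Ho")
```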
11.3.1 Objectives of ANOVA
The objectives of ANOVA are to:
1. Obtain a measure of the total variation between or among the
components
2. Find a measure of variation between or among the components. Then,
the significance of difference between the variations in two series or
more may be measured
In other words, with the help of the technique of ANOVA we can test the
hypothesis that the means of all the components constituting a population
are equal to the mean of the population or that the samples have come from
the same population.
11.3.2 ANOVA table
Key Statistic
A table showing the source of variance, the sum of squares, degrees of
freedom, mean square (variance) and the formula for the F-ratio is
known as ANOVA table.
Key Statistic
The means of the samples will not be the same if the variation between
the samples is large when compared to the variance within each
group.
where,
SST = Total Sum of the Squares
SSC = Sum of the Squares of the columns
SSE = Sum of the squares of the Error
MSC = Variance between samples
MSE = Variance within the samples
You have studied in previous unit that a Chi-square distribution depends on
degrees of freedom. It has only one degree of freedom. But the F-
distribution has a pair of degrees of freedom. One is number of degrees of
Key Statistic
The number of degrees of freedom in numerator of the F ratio is
calculated as:
Degrees of freedom in numerator = (Number of samples – 1) = k – 1
where, ‘k’ is the number of samples taken.
Key Statistic
The number of degrees of freedom in denominator of the F ratio is
calculated as:
Degrees of freedom in denominator = N – k
where, ‘N’ is total number of values in all samples combined and ‘k’ is
the number of samples taken.
Solution: The table 11.4b displays the calculated totals of the yield per acre
for each of the four varieties of treatment used on 5 trial plots.
Table 11.4b. Calculated totals of the yield per acre of each of the four
treatments
                     Treatment
Plot No.   1 (X1)   2 (X2)   3 (X3)   4 (X4)
1.         42       48       68       80
2.         50       66       52       94
3.         62       68       76       78
4.         34       78       64       82
5.         52       70       70       66
Total      240      330      330      400
$$F = \frac{MSC}{MSE} = \frac{860}{103.5} = 8.3$$
The table value of ‘F’, at 5% level of significance for DF (3, 16), is 3.24
which is less than the calculated value of ‘F’. Therefore, the null hypothesis
is rejected. Hence, the treatments do not have the same effect.
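The one-way ANOVA quantities for this problem can be reproduced with a short Python sketch. The function name `one_way_anova` and the dictionary keys are illustrative choices, not part of the text; the table value 3.24 is taken from the text.

```python
def one_way_anova(samples):
    """One-way ANOVA; `samples` holds one list of observations per treatment."""
    values = [v for sample in samples for v in sample]
    n_total, k = len(values), len(samples)
    correction = sum(values) ** 2 / n_total                 # T^2 / N
    sst = sum(v ** 2 for v in values) - correction          # total sum of squares
    ssc = sum(sum(s) ** 2 / len(s) for s in samples) - correction  # between samples
    sse = sst - ssc                                         # within samples
    msc = ssc / (k - 1)                                     # numerator d.o.f. = k - 1
    mse = sse / (n_total - k)                               # denominator d.o.f. = N - k
    return {"SSC": ssc, "SSE": sse, "MSC": msc, "MSE": mse, "F": msc / mse}


yields = [
    [42, 50, 62, 34, 52],   # treatment 1
    [48, 66, 68, 78, 70],   # treatment 2
    [68, 52, 76, 64, 70],   # treatment 3
    [80, 94, 78, 82, 66],   # treatment 4
]
result = one_way_anova(yields)
print(result)
print("reject Ho" if result["F"] > 3.24 else "do not reject Ho")
```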
11.4.2 Two way classifications
In the two way classification, observations are classified into groups on the
basis of two criteria.
Procedure for carrying out the two way analysis of variance
1. a) Assume the means of all columns are equal. That is, the effects of all
factors in the first kind of treatment are equal.
$$\mu_1 = \mu_2 = \mu_3 = \dots = \mu_c$$
b) Assume the means of all rows are equal. That is, the effects of all
factors in the second kind of treatment are equal.
$$\mu_1 = \mu_2 = \mu_3 = \mu_4 = \dots = \mu_r$$
2. Compute the sum of all values ‘T’.
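3. Find SST = Sum of squares of all observations – T²/N
4. Find SSC as:
$$SSC = \frac{(\sum x_1)^2}{n_1} + \frac{(\sum x_2)^2}{n_2} + \frac{(\sum x_3)^2}{n_3} + \frac{(\sum x_4)^2}{n_4} + \dots + \frac{(\sum x_n)^2}{n_n} - \frac{T^2}{N}$$
where Σx1, Σx2, Σx3, … are column totals.
5. Find SSR as:
$$SSR = \frac{(\sum x_{j1})^2}{n_1} + \frac{(\sum x_{j2})^2}{n_2} + \frac{(\sum x_{j3})^2}{n_3} + \frac{(\sum x_{j4})^2}{n_4} + \dots + \frac{(\sum x_{jn})^2}{n_n} - \frac{T^2}{N}$$
where Σxj1, Σxj2, Σxj3, … are row totals.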
Solved Problem 4: Three varieties of crops ‘A’, ‘B’, ‘C’ are tested in a
randomised block design with four replications. The yields are given in table
11.6a. Test at 0.05 level of significance whether there is difference between
replications. Test also whether varieties differ significantly.
             Replications
Variety   1    2    3    4
A         6    4    8    6
B         7    6    6    9
C         8    5    10   9
The table 11.6b. represents the totals of yields of three crops tested with
four replications.
Table 11.6b. Totals of yields of three crops tested with four replications
             Replications
Variety   1    2    3    4    Total
A         6    4    8    6    24
B         7    6    6    9    28
C         8    5    10   9    32
Total     21   15   24   24   84
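$$\text{Correction factor} = \frac{T^2}{N} = \frac{84^2}{12} = 588$$
SST = sum of squares of all values – T²/N
= 6² + 7² + 8² + 4² + 6² + 5² + 8² + 6² + 10² + 6² + 9² + 9² – 588 = 36
SST = 36
For columns, SSC is calculated as:
$$SSC = \frac{(\sum x_1)^2}{n_1} + \frac{(\sum x_2)^2}{n_2} + \frac{(\sum x_3)^2}{n_3} + \frac{(\sum x_4)^2}{n_4} + \dots + \frac{(\sum x_n)^2}{n_n} - \frac{T^2}{N}$$
$$= \frac{21^2}{3} + \frac{15^2}{3} + \frac{24^2}{3} + \frac{24^2}{3} - 588 = \frac{1818}{3} - 588 = 18$$
$$MSC = \frac{SSC}{(c - 1)} = \frac{18}{3} = 6$$
For rows, SSR is calculated as:
$$SSR = \frac{(\sum x_{j1})^2}{n_1} + \frac{(\sum x_{j2})^2}{n_2} + \frac{(\sum x_{j3})^2}{n_3} + \frac{(\sum x_{j4})^2}{n_4} + \dots + \frac{(\sum x_{jn})^2}{n_n} - \frac{T^2}{N}$$
$$= \frac{24^2}{4} + \frac{28^2}{4} + \frac{32^2}{4} - 588 = \frac{2384}{4} - 588 = 8$$
Hence, SSR = 8.
$$MSR = \frac{SSR}{(r - 1)} = \frac{8}{2} = 4$$
SSE = SST – SSC – SSR = 36 – 18 – 8 = 10
$$MSE = \frac{SSE}{(r - 1)(c - 1)} = \frac{10}{6} = 1.667$$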
The table 11.6c represents the ANOVA table for data of solved problem 4.
Between columns
Degrees of Freedom (3,6), Table value of ‘F’ = 4.757 at α = 0.05
Calculated value of ‘F’ = 3.6 < Table value of ‘F’
Therefore, we accept the hypothesis that there is no significant difference
between replications.
Between rows
Degrees of freedom (2,6), Table value of ‘F’ = 5.143
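The sums of squares and F ratios for this randomised block problem can be cross-checked with the Python sketch below. It assumes a two-way layout with one observation per cell; the function name `two_way_anova` and the dictionary keys are hypothetical.

```python
def two_way_anova(table):
    """Two-way ANOVA with one observation per cell; `table` is a list of rows."""
    r, c = len(table), len(table[0])
    n = r * c
    total = sum(sum(row) for row in table)
    correction = total ** 2 / n                              # T^2 / N
    sst = sum(v ** 2 for row in table for v in row) - correction
    col_totals = [sum(row[j] for row in table) for j in range(c)]
    ssc = sum(t ** 2 for t in col_totals) / r - correction   # between columns
    ssr = sum(sum(row) ** 2 for row in table) / c - correction  # between rows
    sse = sst - ssc - ssr                                    # residual
    mse = sse / ((r - 1) * (c - 1))
    return {
        "SST": sst, "SSC": ssc, "SSR": ssr, "SSE": sse,
        "F_columns": (ssc / (c - 1)) / mse,
        "F_rows": (ssr / (r - 1)) / mse,
    }


# Rows are varieties A, B, C; columns are replications 1 to 4.
yields = [
    [6, 4, 8, 6],
    [7, 6, 6, 9],
    [8, 5, 10, 9],
]
anova = two_way_anova(yields)
print(anova)
```

Each F ratio is then compared with the table value for its own pair of degrees of freedom, (c – 1, (r – 1)(c – 1)) for columns and (r – 1, (r – 1)(c – 1)) for rows.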
$$\text{Correction factor} = \frac{T^2}{N} = \frac{56^2}{9} = 348.44$$
$$\frac{10^2}{3} + \frac{3^2}{3} + \frac{43^2}{3} - \frac{T^2}{N}$$
= 33.3 + 3 + 616.33 – 348.44 = 304.22
Degrees of freedom = (3-1) = 2
$$\text{Sum of squares between salesmen} = \frac{29^2}{3} + \frac{19^2}{3} + \frac{8^2}{3} - \frac{T^2}{N}$$
The calculated value of FR is less than the table value of F, that is,
(2.38<6.94). Hence, there is no significant difference between salesmen
performance.
11.5 Summary
ANOVA is a statistical technique used to evaluate the differences between
three or more sample means. It helps us judge whether the samples come
from populations having the same mean or not.
ANOVA is classified into one way ANOVA and two way ANOVA.
ANOVA is a parametric test, as it assumes normality of the population
distributions and deals with means.
Employee 1 15 17 14 12
Employee 2 12 10 13 17
Employee 3 11 14 13 15 12
Employee 4 13 12 12 14 10 9
2. Four makes of bulbs were tested for their length of life (in ‘000 hours)
and the data obtained are displayed in table 11.9. Test whether the lengths
of their lives are significantly different.
Table 11.9. Four different makes of bulbs with their length of life
Machines Workmen
I II III IV V
1 46 48 36 35 40
2 40 42 38 40 44
3 49 54 46 48 51
4 38 45 34 35 41
4. The percentage sugar content of Tobacco in two samples was
represented in table 11.11. Test whether their population variances are
same.
Table 11.11. Percentage sugar content of Tobacco in two samples
11.8 References
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
12.1 Introduction
In Unit 11, ‘F-Distribution and Analysis of Variance (ANOVA)’, you
studied the F-test, which is used to test the hypothesis of the
equality of two variances. You also studied ANOVA, which is
used to test the differences in several means. In this Unit 12, ‘Simple
Correlation and Regression’, we will discuss techniques such as
correlation and regression, used for investigating the relationship between
two or more variables.
Both correlation and regression are used to measure the strength of
relationships between variables. The following statistical tools measure the
relationship between the variables analysed in social science research.
1. Correlation
a. Simple correlation: In simple correlations, the relationships between
two variables are studied.
b. Partial correlations: In partial correlations, the relationships of any
two variables are studied, keeping all others constant.
c. Multiple correlations: In multiple correlations, the relationships
between variables are studied simultaneously.
2. Regression
a. Simple regression: In simple regression, we study the relationship
between only two variables at a time, in which one variable is
independent and the other is dependent.
b. Multiple regression: In this, we study the relationship between more
than two variables at a time, in which one variable is dependent and
others are independent variables.
3. Association of attributes
Correlation measures the relationship (positive, negative or perfect)
between two variables. Regression analysis considers the relationship
between variables and estimates the value of one variable from the
known value of another. Association of attributes attempts to ascertain
the extent of association between two attributes.
12.1.1 Learning objectives
By the end of this unit, you should be able to:
Calculate the coefficient for partial and multiple correlation
Distinguish between parametric and non parametric measures of
correlation
12.2 Correlation
When two or more variables move in sympathy with each other, they are
said to be correlated. If both variables move in the same direction, they
are said to be positively correlated. If the variables move in opposite
directions, they are said to be negatively correlated. If they move
haphazardly, there is no correlation between them. Correlation analysis
deals with the following.
Measuring the relationship between variables.
Testing the relationship for its significance.
Giving confidence interval for population correlation measure.
12.2.1 Causation and correlation
The correlation between two variables may be due to the following causes.
Due to small sample sizes
Correlation may be present in sample and not in population.
Due to a third factor
Correlation between yield of rice and tea may be due to a third factor -
‘rain’.
12.2.2 Types of correlation
The following are the three types of correlation.
i. Positive or Negative
ii. Simple, Partial and Multiple
iii. Linear and Non-linear
Positive and negative correlations: If both the variables (X and Y) vary
in the same direction, that is, if variable Y increases when variable X
increases and decreases when variable X decreases, then they are said to
be positively correlated. If the given variables vary in opposite
directions, so that one variable decreases when the other increases, then
they are said to be negatively correlated. In other words, the variables
are negatively correlated if there is an inverse relationship between them.
Simple, partial and multiple correlations: In simple correlation,
relationships between two variables are studied. In partial and multiple
If the dots lie close to a straight line that runs from left bottom to right top,
then the variables are said to be positively correlated. The figure 12.2
represents the scatter diagram for positively correlated variables.
If the dots lie exactly on a straight line that runs from left top to right bottom
then the variables are said to be perfectly or exactly negatively correlated.
The figure 12.3 represents the scatter diagram for the perfectly negatively
correlated variables.
If the dots lie very close to a straight line that runs from left top to right
bottom then the variables are said to be negatively correlated. The figure
12.4 represents the scatter diagram for the negatively correlated
variables.
If the dots lie all over the graph paper then the variables have zero
correlation. The figure 12.5 represents the scatter diagram of the
variables with zero correlation.
A scatter diagram shows only the direction in which the variables are
related; it does not give any quantitative measure for comparison between
data sets.
Key Statistic
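Karl Pearson’s correlation coefficient is defined as:
$$\text{i)}\quad r = \frac{\sum xy}{N \sigma_x \sigma_y} \qquad \text{(A)}$$
where x = X – X̄ and y = Y – Ȳ are deviations from the means, and
$$\sigma_x^2 = \frac{\sum (X - \bar{X})^2}{N} \quad\text{and}\quad \sigma_y^2 = \frac{\sum (Y - \bar{Y})^2}{N}$$
where, ‘N’ is the number of paired observations and Σxy/N is called
covariance of ‘x’ and ‘y’.
Key Statistic
The other forms of Karl Pearson’s correlation coefficient formula are:
$$\text{ii)}\quad r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} \qquad \text{(B)}$$
$$r = \frac{N\sum XY - \sum X \sum Y}{\left[N\sum X^2 - (\sum X)^2\right]^{1/2}\left[N\sum Y^2 - (\sum Y)^2\right]^{1/2}} \qquad \text{(C)}$$
$$r = \frac{N\sum dx\,dy - \sum dx \sum dy}{\left[N\sum dx^2 - (\sum dx)^2\right]^{1/2}\left[N\sum dy^2 - (\sum dy)^2\right]^{1/2}} \qquad \text{(D)}$$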
X 20 16 12 8 4
Y 22 14 4 12 8
Solution: The table 12.1b displays the sums calculated for the data
represented in table 12.1a.
Table 12.1b: Sums related to solved problem 1
X         Y         X²          Y²          XY
20        22        400         484         440
16        14        256         196         224
12        4         144         16          48
8         12        64          144         96
4         8         16          64          32
ΣX = 60   ΣY = 60   ΣX² = 880   ΣY² = 904   ΣXY = 840
1
Source: Aggarwal. Y. P, Statistical Methods, Sterling Publishers Pvt Ltd., New
Delhi, 1998, p.131)
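Solution: Applying the formula for ‘r’ and substituting the respective values
from the table we get r as:
$$r = \frac{N\sum XY - \sum X \sum Y}{\left[N\sum X^2 - (\sum X)^2\right]^{1/2}\left[N\sum Y^2 - (\sum Y)^2\right]^{1/2}}$$
$$r = \frac{5(840) - (60)(60)}{\sqrt{5(880) - (60)^2}\,\sqrt{5(904) - (60)^2}} = \frac{600}{\sqrt{800 \times 920}}$$
r = 0.70
Hence, Karl Pearson’s correlation coefficient is 0.70.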
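The raw-sums form of the formula used in this solution can be sketched in Python as follows. The function name `pearson_r` is hypothetical, and the sketch uses only the standard library.

```python
from math import sqrt


def pearson_r(x, y):
    """Karl Pearson's r from raw sums: form (C) of the formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


x = [20, 16, 12, 8, 4]
y = [22, 14, 4, 12, 8]
r = pearson_r(x, y)
print(round(r, 2))
```

Because the formula works on raw sums, no deviations from the mean need to be tabulated first, which is convenient when the means are not whole numbers.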
Solved Problem 2: Calculate Karl Pearson‟s Coefficient of Correlation from
the data displayed in table 12.2a.
Table 12.2a: Data related to index of production and number of unemployed
Year                   1985   1986   1987   1988   1989   1990   1991   1992
Index of Production    100    102    104    107    105    112    103    99
Number of unemployed   15     12     13     11     12     12     19     26
Solution: The table 12.2b displays the sums required for calculation of Karl
Pearson‟s correlation coefficient.
Table 12.2b: Sums related to data given in solved problem 2
Year    Index of         x = X – X̄   x²          No. of           y = Y – Ȳ   y²          xy
        Production (X)                           unemployed (Y)
1985    100              -4          16          15               0           0           0
1986    102              -2          4           12               -3          9           +6
1987    104              0           0           13               -2          4           0
1988    107              +3          9           11               -4          16          -12
1989    105              +1          1           12               -3          9           -3
1990    112              +8          64          12               -3          9           -24
1991    103              -1          1           19               +4          16          -4
1992    99               -5          25          26               +11         121         -55
Total   ΣX = 832         Σx = 0      Σx² = 120   ΣY = 120         Σy = 0      Σy² = 184   Σxy = -92
X̄ = 832/8 = 104 and Ȳ = 120/8 = 15
$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{-92}{\sqrt{120 \times 184}} = -0.619$$
X 50 60 58 47 49 33 65 43 46 68
Y 48 65 50 48 55 58 63 48 50 70
Solution: The table 12.3b displays the frequency table of the data related to
solved problem 3.
Table 12.3b: Frequency table data for solved problem 3
Solved Problem 4: In a bivariate data on „x‟ and „y‟, variance of „x‟ = 49,
variance of „y‟ = 9 and covariance (x,y) = -17.5. Find coefficient of
correlation between „x‟ and „y‟.
Solution: We know that:
$$r = \frac{\sum xy}{N \sigma_x \sigma_y}$$
Given Σxy/N = –17.5, σx = √49 = 7 and σy = √9 = 3,
$$r = \frac{-17.5}{7 \times 3} = -0.833$$
Hence, there is a highly negative correlation.
Solved Problem 5: Ten observation in Weight (x) and Height (y) of a
particular age group gave the following data.
Σx = 56   Σy = 138   Σx² = 1357   Σy² = 2136   Σxy = 836
Find „r‟.
Solution: We know that:
$$r = \frac{N\sum xy - \sum x \sum y}{\left[N\sum x^2 - (\sum x)^2\right]^{1/2}\left[N\sum y^2 - (\sum y)^2\right]^{1/2}}$$
$$P.E. = 0.6745 \times \frac{1 - r^2}{\sqrt{n}}$$
where, „r‟ is measured from sample of size „n‟.
Probable error is used to:
i) Interpret the value of „r‟,
If r < P.E., then it is not at all significant.
If r > 6 P.E., then ‘r’ is highly significant.
If P.E. < r < 6 P.E., we cannot say anything about the significance of
‘r’.
ii) Construct confidence limits within which the population correlation
coefficient ‘ρ’ is expected to lie.
12.4.1 Conditions under which probable error can be used
The following are some conditions under which probable error (P.E) can be
used.
1. Samples should be drawn from a normal population
2. The value of „r‟ must be determined from sample values
3. Samples must have been selected at random
Solved Problem 6: If r = 0.6 and N = 64, then:
a) Interpret ‘r’
b) Find the limits within which ‘ρ’ is supposed to lie.
Solution:
$$P.E. = 0.6745 \times \frac{1 - (0.6)^2}{\sqrt{64}} = 0.054$$
a) 6 P.E. = 6 × 0.054 = 0.324
Since r (0.6) > 6 P.E. (0.324), it is highly significant.
b) Limits for population ‘ρ’:
ρ = 0.6 ± 0.054
0.546 ≤ ρ ≤ 0.654
Hence, the limits within which ‘ρ’ lies are 0.546 and 0.654.
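The probable-error calculation above can be sketched in a few lines of Python. The function name `probable_error` is hypothetical; the constant 0.6745 and the 6 P.E. rule are taken from the text.

```python
from math import sqrt


def probable_error(r, n):
    """Probable error of a correlation coefficient r from a sample of size n."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


r, n = 0.6, 64
pe = probable_error(r, n)
highly_significant = r > 6 * pe          # rule quoted in the text
lower, upper = r - pe, r + pe            # limits for the population coefficient

print(round(pe, 3), highly_significant, round(lower, 3), round(upper, 3))
```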
Key Statistic
Spearman’s Rank correlation coefficient is defined as:
$$\rho = 1 - \frac{6\sum D^2}{N^3 - N}$$
where, D is the difference between ranks assigned to the variables.
Value of ‘ρ’ lies between ‘-1’ and ‘+1’ and its interpretation is same as
that of Karl Pearson’s correlation coefficient.
There are four types of problems. The table 12.4 represents the type of
problems involved in calculating rank correlation coefficient.
Table 12.4: Types of problems
Competitor 1 2 3 4 5 6 7
Judge I 5 6 4 3 2 7 1
Judge II 6 4 5 1 2 7 3
Solution: The table 12.5b represents the data of solved problem 7.
Solved Problem 9: The table 12.7a represents the sales statistics of six
sales representatives in two different localities. Find whether there is a
relationship between buying habits of the people in the localities.
Table 12.7a: Sales data of six representatives
Representative 1 2 3 4 5 6
Locality I 70 40 65 110 60 20
Locality II 70 30 80 100 90 20
Solution: The table 12.7b represents the calculated values of correlation
coefficient of data in solved problem 9.
Solved Problem 10: Find rank correlation coefficient for the data displayed
in table 12.8a.
Table 12.8a: Scores of student in test I and test II
Student A B C D E F G H I J
Score on Test I 20 30 22 28 32 40 20 16 14 18
Score on Test II 32 32 48 36 44 48 28 20 24 28
Solution: The table 12.8b displays the required data for calculating the
correlation coefficient.
Table 12.8b: Ranks of test I and test II
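$$\rho = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3) + \frac{1}{12}(m_4^3 - m_4)\right]}{N^3 - N}$$
$$= 1 - \frac{6\left[24 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2)\right]}{10(10^2 - 1)}$$
$$= 1 - \frac{6\left[24 + 0.5 + 0.5 + 0.5 + 0.5\right]}{10 \times 99}$$
$$= 1 - \frac{156}{10 \times 99} = 0.8424$$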
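The rank correlation with tied scores can be cross-checked with the Python sketch below. It assumes that tied scores receive the average of the ranks they occupy and that the tie correction (m³ – m)/12 is added to ΣD² before multiplying by 6; the helper names are hypothetical.

```python
def average_ranks(scores):
    """Rank scores in descending order; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def tie_correction(scores):
    """Sum of (m^3 - m) / 12 over every group of m tied scores."""
    counts = {}
    for s in scores:
        counts[s] = counts.get(s, 0) + 1
    return sum((m ** 3 - m) / 12 for m in counts.values() if m > 1)


def spearman_rho(x, y):
    """Spearman's rank correlation with the correction for tied ranks."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = tie_correction(x) + tie_correction(y)
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)


test_1 = [20, 30, 22, 28, 32, 40, 20, 16, 14, 18]
test_2 = [32, 32, 48, 36, 44, 48, 28, 20, 24, 28]
rho = spearman_rho(test_1, test_2)
print(round(rho, 4))
```

When there are no ties the correction is zero and the function reduces to the basic formula 1 – 6ΣD²/(N³ – N).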
Testing of correlation
„t‟ test is used to test correlation coefficient.
Example 1
The table 12.9 displays the height and weight of a random sample of
six adults.
Table 12.9: Height and weight of six adults
Key Statistic
Partial correlation is denoted by the symbol ‘r12.3’, that is, the correlation
between variables 1 and 2 keeping the 3rd variable constant.
$$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$
where,
r12.3 = Partial correlation between variables 1 and 2 keeping 3rd
constant
r12 = correlation between variables 1 and 2
r13 = correlation between variables 1 and 3
r23 = correlation between variables 2 and 3
Similarly,
$$r_{13.2} = \frac{r_{13} - r_{12}\, r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} \quad\text{and}\quad r_{23.1} = \frac{r_{23} - r_{12}\, r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}}$$
$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^2}}$$
$$R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{13}^2}}$$
Source: Gupta S.P, Statistical Method, 2006, Sultan Chand & Sons, New Delhi.
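The partial and multiple correlation formulas can be evaluated with a short Python sketch. The function names and the three zero-order coefficients below are hypothetical illustrations, not data from the text.

```python
from math import sqrt


def partial_r(r_ab, r_ac, r_bc):
    """Partial correlation between a and b, keeping c constant."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def multiple_r(r12, r13, r23):
    """Multiple correlation R1.23 of variable 1 on variables 2 and 3."""
    return sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))


# Hypothetical zero-order coefficients, chosen only for illustration.
r12, r13, r23 = 0.7, 0.6, 0.5

r12_3 = partial_r(r12, r13, r23)    # variables 1 and 2, keeping 3 constant
r13_2 = partial_r(r13, r12, r23)    # variables 1 and 3, keeping 2 constant
R1_23 = multiple_r(r12, r13, r23)
print(round(r12_3, 4), round(r13_2, 4), round(R1_23, 4))
```

The same `partial_r` function gives r13.2 and r23.1 by permuting its arguments, which mirrors the symmetry of the three formulas.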
Similarly, alternative formulas for R1.24 and R1.34 can be computed. The
following formula can be used to determine a multiple correlation coefficient
with three independent variables.
$$R_{1.234} = \sqrt{1 - (1 - r_{14}^2)(1 - r_{13.4}^2)(1 - r_{12.34}^2)}$$
Solved Problem 11: The following are the zero order correlation
coefficients.
$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^2}} = 0.986$$
12.8 Regression
According to M. M. Blair, regression is defined as “the measure of the
average relationship between two or more variables in terms of the original
units of the data”³.
Correlation analysis attempts to study the relationship between the two
variables ‘x’ and ‘y’. Regression analysis attempts to predict the average ‘x’
for a given ‘y’. In regression, it is attempted to quantify the dependence of
one variable on the other. For example, if there are two variables ‘x’ and ‘y’
and ‘y’ depends on ‘x’, then the dependence is expressed in the form of
equations.
12.8.1 Regression analysis
Regression analysis is used to estimate the values of the dependent
variable from the values of the independent variables. It also gives a
measure of the error involved in using the regression line as a basis for
estimation. Regression coefficients are used to calculate the correlation
coefficient, and the square of the correlation coefficient measures the
degree of association that prevails between the two given variables.
12.8.2 Regression lines
For a set of paired observations there exist two straight lines. The line
drawn in such a way that the sum of vertical deviations is zero and the sum
of their squares is minimum is called the regression line of ‘y’ on ‘x’. It is
used to estimate ‘y’ values for given ‘x’ values. The line drawn in such a way
that the sum of horizontal deviations is zero and the sum of their squares is
minimum is called the regression line of ‘x’ on ‘y’. It is used to estimate ‘x’
values for given ‘y’ values. The smaller the angle between these lines, the
higher is the correlation between the variables. The regression lines always
intersect at (X̄, Ȳ).
The regression lines have equation,
i) The regression equation of „y‟ on „x‟ is given by:
b yx
3
T. R. Jain, S. C. Aggarwal, Dr. R. K. Rana, Basic Statistics for Economists, 2006-
2007 Edition, V. K. Publications.
b yx .b xy 1
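The product of regression coefficients is always less than or equal to 1,
that is,
$$b_{yx} \cdot b_{xy} \le 1$$
If ‘byx’ is negative, then ‘bxy’ is also negative and ‘r’ is negative.
They can also be expressed as:
$$b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x} \quad\text{and}\quad b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y}$$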
It is an absolute measure
The differences between correlation and regression coefficient are listed in
table 12.10.
Solved Problem 12: Find regression equation from the data represented in
table 12.11a. Then calculate correlation coefficient.
Table 12.11a: Data of ages of wife and husband
Age of Husband 18 19 20 21 22 23 24 25 26 27
Age of Wife 17 17 18 18 19 19 19 20 21 22
Solution: The table 12.11b represents the data required for calculation of
correlation and regression coefficients.
Table 12.11b: Data required for calculation of correlation and regression
coefficients
Age of husband (x)   dx = x-22   dx²   Age of wife (y)   dy = y-19   dy²   dx dy
18                   -4          16    17                -2          4     8
19                   -3          9     17                -2          4     6
20                   -2          4     18                -1          1     2
21                   -1          1     18                -1          1     1
22                   0           0     19                0           0     0
23                   1           1     19                0           0     0
24                   2           4     19                0           0     0
25                   3           9     20                1           1     3
26                   4           16    21                2           4     8
27                   5           25    22                3           9     15
Total 225            5           85    190               0           24    43
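$$\bar{X} = \frac{225}{10} = 22.5 \qquad \bar{Y} = \frac{190}{10} = 19$$
Regression equation of Y on X is given by:
$$b_{yx} = \frac{N\sum dx\,dy - \sum dx \sum dy}{N\sum dx^2 - (\sum dx)^2} = \frac{10 \times 43 - (5)(0)}{10 \times 85 - (5)^2} = \frac{430}{825} = 0.521$$
$$Y - \bar{Y} = b_{yx}(X - \bar{X})$$
Y – 19 = 0.521 (X – 22.5)
Y = 0.521 X + 7.2775
Regression equation of X on Y is:
$$b_{xy} = \frac{N\sum dx\,dy - \sum dx \sum dy}{N\sum dy^2 - (\sum dy)^2} = \frac{10 \times 43 - (5)(0)}{10 \times 24 - (0)^2} = \frac{43}{24} = 1.792$$
X – 22.5 = 1.792 (Y – 19)
X = 1.792 Y – 11.548
$$r = \sqrt{0.521 \times 1.792} = 0.966$$
Hence, the correlation coefficient ‘r’ is 0.966.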
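The two regression coefficients and the correlation coefficient for this problem can be recomputed directly from the raw ages with the Python sketch below. The function name `regression_coefficients` is hypothetical. The intercepts it gives differ slightly from those in the solution because the solution rounds the slopes to three decimals first.

```python
from math import sqrt


def regression_coefficients(x, y):
    """Return (mean_x, mean_y, b_yx, b_xy, r) for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    b_yx = sxy / sxx                      # slope of the line of y on x
    b_xy = sxy / syy                      # slope of the line of x on y
    r = sqrt(b_yx * b_xy)                 # magnitude of r
    if sxy < 0:
        r = -r                            # r takes the sign of the coefficients
    return mean_x, mean_y, b_yx, b_xy, r


husband = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
wife = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]
mean_x, mean_y, b_yx, b_xy, r = regression_coefficients(husband, wife)

print(f"Y = {b_yx:.3f} X + {mean_y - b_yx * mean_x:.3f}")
print(f"X = {b_xy:.3f} Y + {mean_x - b_xy * mean_y:.3f}")
print(round(r, 3))
```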
Solved Problem 13: In a correlation study, we have the data represented in
table 12.12. Find the two regression equations.
Table 12.12: Data about series X and series Y
                          Series X   Series Y
Mean                      65         67
Standard Deviation        2.5        3.5
Correlation coefficient   0.8
Solution:
$$Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x}(X - \bar{X})$$
$$Y - 67 = (0.8)\frac{3.5}{2.5}(X - 65)$$
Y – 67 = 1.12 (X – 65)
Y = 1.12 X – 5.8
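When only the means, standard deviations and correlation coefficient are known, both regression lines follow from byx = r·σy/σx and bxy = r·σx/σy. A minimal Python sketch, with a hypothetical function name, is given below using the figures from table 12.12.

```python
def regression_lines_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Return ((slope, intercept) of Y on X, (slope, intercept) of X on Y)."""
    b_yx = r * sd_y / sd_x
    b_xy = r * sd_x / sd_y
    y_on_x = (b_yx, mean_y - b_yx * mean_x)
    x_on_y = (b_xy, mean_x - b_xy * mean_y)
    return y_on_x, x_on_y


(y_slope, y_intercept), (x_slope, x_intercept) = regression_lines_from_summary(
    mean_x=65, mean_y=67, sd_x=2.5, sd_y=3.5, r=0.8
)
print(f"Y = {y_slope:.2f} X + ({y_intercept:.2f})")
print(f"X = {x_slope:.4f} Y + ({x_intercept:.3f})")
```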
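$$S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}}$$
also
$$S_{xy} = \sigma_x \sqrt{1 - r^2}$$
also
$$S_{xy} = \sqrt{\frac{\sum X^2 - a\sum X - b\sum XY}{N}}$$
$$S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}}$$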
Solved Problem 14: The table 12.13 displays the results that were worked
out from scores in Statistics and Mathematics in a certain examination.
X 12 4 20 8 16
Y 18 22 10 16 14
Solution: The table 12.14b displays the values required for obtaining the
regression equations.
X̄ = (12 + 4 + 20 + 8 + 16) / 5 = 12 = mean of X
Ȳ = (18 + 22 + 10 + 16 + 14) / 5 = 16 = mean of Y
X       Y     X – X̄       Y – Ȳ       (X – X̄)²   (Y – Ȳ)²   (X – X̄)(Y – Ȳ)
              = X - 12    = Y - 16
12      18    0           2           0          4          0
4       22    -8          6           64         36         -48
20      10    8           -6          64         36         -48
8       16    -4          0           16         0          0
16      14    4           -2          16         4          -8
Total                                 160        80         -104
$$b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{-104}{160} = -0.65$$
and
$$b_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2} = \frac{-104}{80} = -1.3$$
X1 2 b1.23 X2 3 b13.2 X3
σ 1.23
b12.3 =
σ 3.12
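$$(X_1 - \bar{X}_1) = \frac{r_{12} - r_{13}\, r_{23}}{1 - r_{23}^2} \cdot \frac{S_1}{S_2}\,(X_2 - \bar{X}_2) + \frac{r_{13} - r_{12}\, r_{23}}{1 - r_{23}^2} \cdot \frac{S_1}{S_3}\,(X_3 - \bar{X}_3)$$
Regression equation of X3 on X2 and X1 is:
$$(X_3 - \bar{X}_3) = \frac{r_{23} - r_{13}\, r_{12}}{1 - r_{12}^2} \cdot \frac{S_3}{S_2}\,(X_2 - \bar{X}_2) + \frac{r_{13} - r_{23}\, r_{12}}{1 - r_{12}^2} \cdot \frac{S_3}{S_1}\,(X_1 - \bar{X}_1)$$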
Key Statistic
Standard error of estimate of X1 on X2 and X3 is given below:
$$S_{1.23} = \sqrt{\frac{\sum (X_1 - X_{1est})^2}{N - 3}}$$
Where
S1.23 = Standard error of estimate of X1 on X2 and X3
X1est = Estimated value of X1 as calculated from the regression
equations
12.13 Summary
In this unit we studied the concept of correlation and regression and the
different types of correlation and regression.
We saw how regression helps us to study unknown variables with the help
of known variables. It also establishes reliability measure for estimated
values.
Regression analysis helps to quantify the dependence of one variable on
the other. Some of the regression types are simple and multiple regression,
linear and non linear regression.
Regression analysis is useful in business and economic scenarios in
decision making process.
4. For the data in table 12.17, obtain the two lines of regression and
estimate the blood pressure when the age is 50 yrs.
Table 12.17: Data for the terminal question 4
Age (X) in yrs   56    42    72    39    63    47    52    49    40    42    68    60
BP (Y)           127   112   140   118   129   116   130   125   115   120   135   133
5. The table 12.18 displays the results that were worked out from scores in
statistics and mathematics in a certain examination.
Table 12.18: Results of scores in statistics and mathematics examination
Karl Pearson‟s correlation coefficient between X and Y = 0.42. Find both the
regression lines. Use these lines to estimate the value of Y when X = 50 and
the value of X when Y = 30.
12.16 References
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
S. P. Gupta, Statistical Methods, (2006), Sultan Chand & Sons
T. R. Jain, S. C. Aggarwal, Dr. R. K. Rana, Basic Statistics for
Economists, 2006-2007 Edition, V. K. Publications
13.1 Introduction
In Unit 12, ‘Simple Correlation and Regression’, you studied
techniques such as correlation and regression, which are used for
investigating the relationship between two or more variables. In this Unit 13,
‘Business Forecasting’, we will discuss business forecasting, the
methods available in forecasting, and the use of forecasting models in
business improvement processes.
The growing competition, rapidity of change in circumstances and the trend
towards automation demand that decisions in business are not based purely
on guesses and hunches but rather on a careful analysis of data concerning
the future course of events. The future is unknown to us. Yet every day we
are forced to make decisions involving future and therefore there is
uncertainty. Great risk is associated with business affairs. All businessmen
are forced to make forecast regarding business activities.
Success in business depends upon successful forecasts of business events.
In business or trade the importance of forecasting is so great, that when
someone enters into the business world, he really enters the profession of
forecasting. In recent times, considerable research has been conducted in
this field. Attempts are being made to make forecasting as scientific as
possible.
Business forecasting as such is not a new development. Every
businessman must forecast; even if his whole product is sold before
production. Forecasting has always been necessary. What is new in the
attempt to put forecasting on a scientific basis is to forecast by reference to
past history and statistics rather than by pure intuition and guess-work.
One of the most important tasks before businessmen and economists these
days is to make estimates for the future. For example, a businessman is
interested in finding out his likely sales next year or, for long-term planning, in
the next five or ten years, so that he can adjust his production accordingly and
avoid the possibility of either inadequate production to meet the demand or
unsold stocks.
Similarly, an economist is interested in estimating the likely population in the
coming years so that proper planning can be carried out with regard to jobs
for the people, food supply and so on. The first step in making estimates for the
future consists of gathering information from the past. In this connection we
usually deal with statistical data which are collected, observed or recorded
at successive intervals of time. Such data are generally referred to as a time
series. Thus, when we observe numerical data at different points of time, the
set of observations is known as a time series.
13.1.1 Learning objectives
By the end of this unit, you should be able to:
Describe the meaning of business forecasting
Distinguish between prediction, projection and forecast
Describe the forecasting methods available
Apply the forecasting theories in taking effective business decisions
Key Statistic
A prediction is an estimate based solely on past data of the series
under investigation. It is purely mechanical extrapolation.
A projection is a prediction where the extrapolated values are subject
to certain numerical assumptions.
A forecast is an estimate which relates the series in which we are
interested to external factors.
13.3.3 Extrapolation
Extrapolation is the simplest method of business forecasting. By
extrapolation, a businessman finds out the possible trend of demand of his
goods and also about the future price trends. The accuracy of extrapolation
depends on two factors:
i) Knowledge about the fluctuations of the figures
ii) Knowledge about the course of events relating to the problem under
consideration
Thus, there are two assumptions on which extrapolations are based:
i) There are no sudden jumps in figures from one period to another
ii) There is regularity in fluctuations and the rise and fall is uniform
In extrapolation, we assume that the variable will follow the established
pattern of growth. For the purpose of business forecasting, it is necessary to
determine accurately the appropriate trend curve and the values of its parameters.
Gompertz curve
It is given by:
$$y_c = a b^{c^{x}}$$
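To illustrate extrapolation along a fitted trend curve, the sketch below evaluates a Gompertz curve in the commonly used form y = a * b^(c^x). The function name `gompertz` and the parameter values are hypothetical, chosen only to show how the trend rises towards the ceiling a; in practice the parameters would have to be estimated from past data.

```python
def gompertz(x, a, b, c):
    """Trend value of a Gompertz curve y = a * b**(c**x).

    With 0 < b < 1 and 0 < c < 1 the curve rises towards the ceiling `a`.
    """
    return a * b ** (c ** x)


# Hypothetical parameters: ceiling of 1000 units, slow start, gradual saturation.
a, b, c = 1000.0, 0.05, 0.7

# Extrapolate the trend for periods 0, 2, 4, ..., 10.
for period in range(0, 11, 2):
    print(period, round(gompertz(period, a, b, c), 1))
```

At x = 0 the curve starts at a * b, and as x grows the exponent c^x shrinks towards zero, so the trend value approaches a without exceeding it.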
The table 13.3 lists the merits and demerits of extrapolation method.
The table 13.4 lists the merits and demerits of modern econometric
methods.
Table 13.4: Merits and demerits of modern econometric methods
Merits | Demerits
Accurate and reliable results are obtained under this method. | This method is difficult and complicated.
It is a scientific method where computer technology is used. | This method can be used only when adequate series of data is available.
This method explains in detail and in quantitative terms the way in which various aspects of the economy are interrelated. | It is very difficult to construct growth model for every business activity.
Example 1
When the government makes use of deficit financing, it leads to inflationary
pressures; the purchasing power of people goes up. Therefore, the
wholesale prices and the retail prices start rising. With the rise in retail
prices, the cost of living goes up and with it there is a demand for
increased wages. Thus, one factor, that is, more money in circulation,
has affected various fields of economic activity not simultaneously but
successively.
The table 13.5 lists the merits and demerits of sequence or time-lag theory.
Table 13.5: Merits and demerits of sequence or time-lag theory
Merits | Demerits
This method is largely used for business forecasting because of its accuracy. | This method studies only the action, not the reaction.
Though this theory is based on statistical techniques, it is easy to understand. | This method cannot be regarded as accurate, because statistical techniques give results that are close to the truth but not exact.
Time-interval between two events can be ascertained. |
Government can use this technique for the purpose of economic stability of the economy by exercising control over possible losses. |
13.4.2 Action and reaction theory
This theory is based on the following two assumptions.
Every action has a reaction
Magnitude of the original action influences the reaction
Thus, if the price of rice has gone up above a certain level in a certain
period, there is a likelihood that after some time it will go down below the
normal level. Thus, according to this theory a certain level of business
activity is normal or abnormal; conditions cannot remain so for ever. Thus,
we find four phases of a business cycle. They are:
i. Prosperity
ii. Decline
iii. Depression
iv. Improvement
The table 13.6 lists the merits and demerits of action and reaction theory.
Table 13.6: Merits and demerits of action and reaction theory
Merits | Demerits
This is better than other theories. | The determination of normal level is very difficult.
By this theory more reliable results can be obtained because this theory gives attention to action and reaction of an event. | It is not necessary that reaction is equal to the action.
13.6 Summary
In this unit, you have studied about the theory behind business forecasting
and the objectives of forecasting. The steps involved in forecasting the
trends and different forecasting methods available are also studied. Finally
we have ended the unit by explaining the advantages and limitations of
business forecasting.
13.9 References
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
14.1 Introduction
In the unit 13, ‘Business Forecasting’ you have studied about the ways of
forecasting business events successfully. You also studied about the
different methods available for forecasting. In this unit 14, ‘Time Series
Analysis’, you will study about the time series analysis and different
components of time series. You will also study about the forecasting
methods using time series.
A time series is a set of numerical values of a given variable listed at
successive intervals of time. That is, the data regarding the variable is listed
in chronological order. Usually the interval of time is taken as uniform.
Yearly production of wheat in the country, hourly temperature of a city,
bimonthly electricity bills are all examples of time series. Almost all the data
like industrial production, agricultural production, exports, imports, dairy
products can be arranged in chronological order.
14.1.1 Learning Objectives
By the end of this unit, you should be able to:
Analyse the time series
Describe different components of time series
Describe the forecasting methods
Apply time series analysis in business scenarios
We would like to analyse the above data and give some trends about the
sales. For example, the company would like to know as to why the sales
dropped in 1998 and 1999, and then why the sales increased. That is, the
company would like to analyse the various forces that affect the sales.
There can be changes in the values of the variable recorded over different
points of time due to various forces. Analysing the effect of all such forces
on the values of the variable is generally known as the analysis of time
series. Broadly, there can be four types of changes in the values of the
variable as discussed below:
i) Changes which generally occur due to general tendency of the data
to increase or decrease
ii) Changes which occur due to change in climate, weather conditions,
festivals
iii) Changes which occur due to booms and depressions
iv) Changes which occur due to some unpredictable forces like floods,
famines, earthquakes
Solved Problem 1: Find trend with the help of free hand curve method for
the data given in table 14.2:
Table 14.2: Production data from 1991 to 2001
Year Production Data (in Lakh ton)
1991 15
1992 18
1993 16
1994 22
1995 19
1996 24
1997 20
1998 28
1999 22
2000 30
2001 26
Solution: The figure 14.1 represents free hand curve of the production data
versus the time period. In the graph, we have taken production data values
on Y-axis and values of time on X-axis.
Fig. 14.2: Procedure for determining the trend when moving average is odd
By plotting these trend values (if desired) you can obtain the trend curve
with the help of which you can determine the trend whether it is increasing
or decreasing. If needed, you can also compute short-term fluctuations by
subtracting the trend values from the actual values.
Solved Problem 2: Calculate the 3 yearly and 5 yearly averages of the data
in table 14.4.
Table 14.4: Production data from 1988 to 1997
Year 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
Production (in Lakh ton) 15 18 16 22 19 24 20 28 22 30
Solution: The table 14.5 displays the calculated values of 3 yearly and 5
yearly averages.
Table 14.5: Calculated values of 3 yearly and 5 yearly averages
Year | Production Y (Thousand Tonnes) | 3-yearly moving totals | 3-yearly moving averages Yc | Short-term fluctuations (Y - Yc)
1988 | 21 | - | - | -
1989 | 22 | 66 | 22.00 | 0
1990 | 23 | 70 | 23.33 | -0.33
1991 | 25 | 72 | 24.00 | 1.00
1992 | 24 | 71 | 23.67 | 0.33
1993 | 22 | 71 | 23.67 | -1.67
1994 | 25 | 74 | 24.67 | 0.33
1995 | 27 | 78 | 26.00 | 1.00
1996 | 26 | - | - | -
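The moving-average procedure can also be sketched in code. The example below is a minimal illustration, not the textbook's own working: it computes centred 3-yearly and 5-yearly moving averages for the production data of table 14.4. The helper names `centred_moving_average` and `show` are hypothetical.

```python
def centred_moving_average(values, period):
    """Return odd-period moving averages centred on the middle year.

    The first and last period // 2 positions have no trend value (None).
    """
    if period % 2 == 0:
        raise ValueError("this sketch handles odd periods only")
    half = period // 2
    averages = [None] * len(values)
    for mid in range(half, len(values) - half):
        window = values[mid - half: mid + half + 1]
        averages[mid] = sum(window) / period
    return averages


def show(value):
    """Format a trend value, using '-' where none exists."""
    return "-" if value is None else f"{value:.2f}"


# Production data of table 14.4 (1988 to 1997, in lakh tons).
years = list(range(1988, 1998))
production = [15, 18, 16, 22, 19, 24, 20, 28, 22, 30]

avg3 = centred_moving_average(production, 3)
avg5 = centred_moving_average(production, 5)

print("Year  Y  3-yearly  5-yearly")
for year, y, t3, t5 in zip(years, production, avg3, avg5):
    print(year, y, show(t3), show(t5))
```

Note that a 3-yearly average leaves one year without a trend value at each end of the series, while a 5-yearly average leaves two.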
Fig. 14.3: Procedure for determining the trend when moving average is even
The table 14.6 lists the merits and demerits of the moving averages method.
Merits | Demerits
This is a simple method. | No functional relationship between the values and the time. Thus, this method is not helpful in forecasting and predicting the values on the basis of time.
This method is objective in the sense that anybody working on a problem with this method will get the same results. | No trend values for some years in the beginning and some in the end. For example, for a 5-yearly moving average, there will be no trend values for the first two years and the last two years.
This method is used for determining seasonal, cyclic and irregular variations besides the trend values. | In case of non-linear trend, the values obtained by this method are biased in one or the other direction.
This method is flexible enough to add more figures to the data because the entire calculations are not changed. | The period selection of moving average is a difficult task. Hence, great care has to be taken in period selection, particularly when there is no business cycle during that time.
If the period of moving averages coincides with the period of cyclic fluctuations in the data, such fluctuations are automatically eliminated. |
14.5.4 Method of least squares
Under this method, the trend curve is determined by fitting a mathematical
equation. This method is more accurate and precise and can be used even
for forecasting. We can fit either a straight line or a parabolic curve from the
given data by this method.
Key Statistic
Let ‘y’ be the actual values and ‘yc’ be the computed values of ‘y’
for a given value of ‘x’.
Let ‘y = a + bx’ be a straight line to be fitted for trend. Finding the values
of ‘a’ and ‘b’ such that the sum of squares of differences of the actual
and computed values of ‘y’ is least, that is,
$$\sum (y - y_c)^2 \text{ is least,}$$
where the condition
$$\sum (y - y_c) = 0$$
is satisfied, is known as the method of least squares. The line obtained by the method
is known as the ‘line of best fit.’
Sikkim Manipal University Page No. 347
Statistics for Management Unit 14
For a given time series data, to find a linear trend, the values of ‘a’ and ‘b’
are obtained by the normal equations.
$$\sum y = Na + b \sum x$$
$$\sum xy = a \sum x + b \sum x^2$$
where, N is the number of pairs for which data are given. Here ‘a’ is the
intercept of the line on the y-axis and ‘b’ is the slope of the line. ‘b’ is also
known as the growth rate (if b > 0) or decline rate (if b < 0); it gives the change
in the value of ‘y’ per unit change in the value of ‘x’.
Direct method
The procedure to be followed is described below.
i) Convert the years into natural numbers (1, 2, 3, ...), denote them by ‘x’
and find $\sum x$.
ii) Find the squares of the ‘x’ values and obtain $\sum x^2$.
iii) Multiply the x-values with the corresponding y-values and obtain $\sum xy$.
iv) Add the values of y to obtain $\sum y$.
v) Put these values in the two normal equations and solve for ‘a’ and ‘b’.
vi) Substitute these values of ‘a’ and ‘b’ in ‘y = a + bx’ and then find trend
values for various values of ‘x’.
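The steps of the direct method can be sketched as follows. This is an illustrative implementation only: the function name `linear_trend` is hypothetical, and the production figures of table 14.2 are reused purely as sample data.

```python
def linear_trend(y_values):
    """Fit y = a + b*x by the direct method, with x = 1, 2, 3, ..."""
    n = len(y_values)
    x_values = list(range(1, n + 1))
    sum_x = sum(x_values)
    sum_y = sum(y_values)
    sum_x2 = sum(x * x for x in x_values)
    sum_xy = sum(x * y for x, y in zip(x_values, y_values))
    # Solve the two normal equations:
    #   sum_y  = n*a     + b*sum_x
    #   sum_xy = a*sum_x + b*sum_x2
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Production data of table 14.2 (1991 to 2001), used here as sample input.
production = [15, 18, 16, 22, 19, 24, 20, 28, 22, 30, 26]

a, b = linear_trend(production)
trend = [a + b * x for x in range(1, len(production) + 1)]
print(f"y = {a:.3f} + {b:.3f}x")
print([round(t, 2) for t in trend])
```

Because the line is fitted by least squares, the deviations of the actual values from the trend values sum to zero, which gives a quick check on the arithmetic.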
Short cut method
Measure the variable ‘x’ from any point of time as origin, such as the first year; the
calculations are simplified, however, when the mid-point in time is taken as origin
so that:
$$\sum x = 0$$
When $\sum x = 0$, the normal equations reduce to:
$$\sum y = Na, \quad \text{therefore } a = \frac{\sum y}{N}$$
$$\sum xy = b \sum x^2, \quad \text{therefore } b = \frac{\sum xy}{\sum x^2}$$
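A minimal sketch of the short cut method is given below. It assumes an odd number of equally spaced periods, so that taking the middle period as origin gives x = ..., -2, -1, 0, 1, 2, ... The function name `linear_trend_shortcut` is hypothetical and the table 14.2 figures serve only as sample data.

```python
def linear_trend_shortcut(y_values):
    """Fit y = a + b*x with the origin at the mid-point, so that sum(x) = 0."""
    n = len(y_values)
    if n % 2 == 0:
        raise ValueError("this sketch handles an odd number of periods only")
    half = n // 2
    x_values = list(range(-half, half + 1))
    sum_xy = sum(x * y for x, y in zip(x_values, y_values))
    sum_x2 = sum(x * x for x in x_values)
    a = sum(y_values) / n      # a = sum(y) / N
    b = sum_xy / sum_x2        # b = sum(xy) / sum(x^2)
    return a, b


# Production data of table 14.2 (1991 to 2001); the origin is 1996.
production = [15, 18, 16, 22, 19, 24, 20, 28, 22, 30, 26]

a, b = linear_trend_shortcut(production)
print(f"y = {a:.3f} + {b:.3f}x  (x measured from the middle year)")
```

The slope b is the same as with the direct method; only the intercept a changes, because it now gives the trend value at the middle year rather than at x = 0 of the natural-number scale.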
Non-linear trend
When the time series data do not conform to a linear trend, we
obtain a non-linear trend. We do so by fitting a parabolic curve or another
non-linear curve by the method of least squares. For this we use an equation of
the form
$$y = a + bx + cx^2 + dx^3 + \dots + kx^n$$
which is known as a polynomial of degree ‘n’ in ‘x’, k ≠ 0.
Let the parabolic curve be
$$y = a + bx + cx^2$$
with usual notations. The values of a, b, and c can be determined by solving
the normal equations:
$$\sum y = Na + b \sum x + c \sum x^2$$
$$\sum xy = a \sum x + b \sum x^2 + c \sum x^3$$
$$\sum x^2 y = a \sum x^2 + b \sum x^3 + c \sum x^4$$
If we change the origin to a suitable point, such that $\sum x = 0$ (and hence
$\sum x^3 = 0$ for equally spaced periods centred on the origin), then the
normal equations reduce to:
$$\sum y = Na + c \sum x^2$$
$$\sum xy = b \sum x^2$$
$$\sum x^2 y = a \sum x^2 + c \sum x^4$$
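The reduced normal equations for the parabolic trend can be solved directly, as the following sketch shows. It assumes an odd number of equally spaced periods with the origin at the middle period; the function name `parabolic_trend` is hypothetical and the table 14.2 figures are used only as sample data.

```python
def parabolic_trend(y_values):
    """Fit y = a + b*x + c*x**2 with the origin at the middle period.

    With x = ..., -1, 0, 1, ... both sum(x) and sum(x**3) are zero,
    so the reduced normal equations apply.
    """
    n = len(y_values)
    if n % 2 == 0 or n < 3:
        raise ValueError("this sketch needs an odd number of periods (3 or more)")
    half = n // 2
    xs = list(range(-half, half + 1))
    sum_y = sum(y_values)
    sum_x2 = sum(x ** 2 for x in xs)
    sum_x4 = sum(x ** 4 for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, y_values))
    sum_x2y = sum(x ** 2 * y for x, y in zip(xs, y_values))

    b = sum_xy / sum_x2
    # Remaining two equations in a and c, solved by Cramer's rule:
    #   sum_y   = n*a      + c*sum_x2
    #   sum_x2y = a*sum_x2 + c*sum_x4
    det = n * sum_x4 - sum_x2 ** 2
    a = (sum_y * sum_x4 - sum_x2 * sum_x2y) / det
    c = (n * sum_x2y - sum_x2 * sum_y) / det
    return a, b, c


# Production data of table 14.2 (1991 to 2001); the origin is 1996.
production = [15, 18, 16, 22, 19, 24, 20, 28, 22, 30, 26]
a, b, c = parabolic_trend(production)
print(f"y = {a:.3f} + {b:.3f}x + {c:.4f}x^2")
```

A value of c close to zero, as for this sample data, suggests that the straight line already describes the trend adequately.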
Key Statistic
The multiplicative model assumes that the observed value is obtained by
multiplying the trend (T) by the rates of three other components, that is,
Y = T × S × C × I
where,
Y = original data
T = trend value
S = seasonal component
C = cyclical component
I = irregular component
The multiplicative model assumes that the components, although due to
different causes, are not necessarily independent and they can affect one
another. It also assumes that the behaviour of components is of
multiplicative character. It may be noted that except the value of trend, all
the other values on the right hand side are rates or index numbers.
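As a small worked illustration of the multiplicative model, the sketch below combines a trend value with seasonal, cyclical and irregular components expressed as index numbers (100 meaning no effect). The function name and all figures are hypothetical.

```python
def multiplicative_value(trend, seasonal, cyclical, irregular):
    """Y = T x S x C x I, with S, C and I given as index numbers (100 = neutral)."""
    return trend * (seasonal / 100) * (cyclical / 100) * (irregular / 100)


# Hypothetical quarter: trend of 200 units, a season 10% above normal,
# a cycle 5% below normal and no irregular disturbance.
y = multiplicative_value(trend=200.0, seasonal=110.0, cyclical=95.0, irregular=100.0)
print(round(y, 2))
```

Only the trend carries the units of the original data; the other three components merely scale it up or down.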
3. Price changes
Adjustment for price changes becomes necessary wherever we have real
value changes. Current values are to be deflated by the ratio of current
prices to base year prices.
4. Comparability
In order to reach valid conclusions, the data being analysed should
be comparable. The analysis of time series
involves data relating to the past, which must be homogeneous and
comparable. Therefore, efforts should be made to make the data as homogeneous and
comparable as possible.
Merits | Demerits
This method is the simplest one. | Most economic time series have trends and therefore, the seasonal index computed by this method is really an index of trends and seasons.
ii) The mean of the link relatives for each season is computed over all the
years. Median can also be taken instead of mean of the Link
Relatives.
iii) These average link relatives are converted into chain relatives. The
chain relative of the first quarter is taken as 100.
$$\text{Chain Relative of current year} = \frac{\text{Average Link Relative of current year} \times \text{Chain Relative of previous year}}{100}$$
iv) The second chain relative of the first quarter is computed on the basis of the chain
relative for the last quarter:
$$\text{Chain Relative of first quarter} = \frac{\text{Average Link Relative of the first quarter} \times \text{Chain Relative of the last quarter}}{100}$$
This chain relative may or may not be 100. It is not equal to 100 due
to secular trend. If it is 100, go to ‘step vi’, if it is not 100, go to ‘step
v’ and then go to ‘step vi’.
v) Compute the difference ‘d’ between the new chain relative of the first
quarter obtained in ‘step iv’ and the chain relative assumed as 100. ‘d’ is divided
by the number of seasons and the resulting figure is multiplied by
1, 2, 3 and the products are deducted respectively from the chain
relatives of the 2nd, 3rd, and 4th quarters. These are called corrected chain
relatives.
vi) The seasonal indices are obtained when the corrected chain relatives
are expressed as percentages of their average.
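Steps iii to vi above can be sketched in code as follows. The sketch starts from average link relatives that are assumed to have been computed already (one per season, the first season's figure being relative to the last season of the preceding year). The function name and the sample figures are hypothetical.

```python
def seasonal_indices_from_link_relatives(avg_link_relatives):
    """Convert average link relatives into seasonal indices."""
    n = len(avg_link_relatives)

    # Step iii: chain relatives, taking the first season as 100.
    chain = [100.0]
    for link in avg_link_relatives[1:]:
        chain.append(link * chain[-1] / 100)

    # Step iv: second chain relative of the first season, from the last season.
    new_first = avg_link_relatives[0] * chain[-1] / 100

    # Step v: spread the discrepancy d evenly over the seasons.
    d = new_first - 100
    corrected = [cr - i * d / n for i, cr in enumerate(chain)]

    # Step vi: express the corrected chain relatives as percentages of their mean.
    mean = sum(corrected) / n
    return [cr / mean * 100 for cr in corrected]


# Hypothetical average link relatives for quarters 1 to 4.
indices = seasonal_indices_from_link_relatives([80.0, 120.0, 110.0, 95.0])
print([round(i, 2) for i in indices])
```

By construction the seasonal indices average 100, so for quarterly data they add up to 400.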
14.8.4 Ratio to trend method
The steps to determine seasonal indices by this method are as described
below.
i) Determine the trend values by the method of least squares.
ii) To find ratio to trend, divide the original data by the corresponding
trend values and multiply these ratios by 100, that is,
$$\text{Ratio to Trend} = \frac{\text{Original Data}}{\text{Trend Value}} \times 100$$
iii) Calculate the Arithmetic Mean of the Trend Ratios obtained in
‘step ii’ for each quarter.
iv) Finally, all the trend ratios are converted into seasonal indices. For
this, add all averages obtained in ‘step iii’ and find their general
average. Seasonal indices are calculated by using the following
formula:
$$\text{Seasonal Index} = \frac{\text{Quarterly Average}}{\text{General Average}} \times 100$$
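Steps ii to iv of the ratio to trend method can be sketched as below. The trend values are assumed to have been obtained beforehand (step i); the function name and both data sets are hypothetical and serve only to show the arithmetic.

```python
def ratio_to_trend_indices(original, trend):
    """Seasonal indices from yearly rows of quarterly data and trend values."""
    n_seasons = len(original[0])

    # Step ii: ratio to trend for every quarter of every year.
    ratios = [[o / t * 100 for o, t in zip(o_row, t_row)]
              for o_row, t_row in zip(original, trend)]

    # Step iii: arithmetic mean of the trend ratios, quarter by quarter.
    quarterly_avg = [sum(row[q] for row in ratios) / len(ratios)
                     for q in range(n_seasons)]

    # Step iv: adjust so that the indices average 100.
    general_avg = sum(quarterly_avg) / n_seasons
    return [avg / general_avg * 100 for avg in quarterly_avg]


# Hypothetical quarterly sales for two years and their trend values.
sales = [[30, 40, 36, 34], [34, 52, 50, 44]]
trend_values = [[32, 34, 36, 38], [40, 42, 44, 46]]

indices = ratio_to_trend_indices(sales, trend_values)
print([round(i, 2) for i in indices])
```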
14.10 Summary
In this unit, you have studied about business forecasting. The different steps
involved in forecasting are discussed in a simple manner.
The four different components of time series are discussed. The concept of
time series analysis is discussed next with examples. Action and reaction
theory is explained with its merits and demerits in a simple manner.
In this unit, you have also studied about the method of least squares with
merits and demerits discussed in detail.
The five types of forecasting methods using time series are discussed in
detail.
5. What are the components of time series? Bring out the significance of
moving average in analysing a time series and point out its limitations.
6. What is meant by secular trend? Discuss any two methods of isolating
trend values in a time series.
7. What is seasonal variation of a time series? Describe the various
methods you know to evaluate it and examine their relative merits.
8. Find a straight line trend to the following data and find trend value.
Table 14.9: Yearly production data
Year Production in 1000 kg
1990 80
1991 90
1992 92
1993 83
1994 94
1995 99
1996 92
14.13 References
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited.
15.1 Introduction
In the unit 14, ‘Time Series Analysis’, you have studied about the definition
and components of time series. You have also studied about different
forecasting methods using time series analysis. In this unit 15, ‘Index
Key Statistic
An index number is a statistical measure which is designed to express
changes or differences in a variable or a group of related variables. It is
usually expressed in percentage form.
d. Value relative
If ‘p1’ and ‘q1’ are the price and quantity respectively for a commodity in a
given year and ‘p0’ and ‘q0’ are the price and quantity respectively
of the same commodity in a specified year, then the value of the given
year, ‘V1’, and the value of the specified year, ‘V0’, are calculated as:
V1 = p1 q1
V0 = p0 q0
The value relative of the given year with respect to the specified year is
calculated as the ratio of ‘V1’ to ‘V0’, multiplied by 100.
That is,
$$\text{Value relative} = \frac{V_1}{V_0} \times 100 = \frac{p_1 q_1}{p_0 q_0} \times 100$$
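A short sketch of the value relative calculation follows. The function name and the prices and quantities are hypothetical.

```python
def value_relative(p0, q0, p1, q1):
    """Value relative of the given year (p1, q1) on the specified year (p0, q0)."""
    v0 = p0 * q0   # value in the specified (base) year
    v1 = p1 * q1   # value in the given year
    return v1 / v0 * 100


# Hypothetical commodity: Rs. 20 x 50 units in the base year,
# Rs. 25 x 60 units in the given year.
relative = value_relative(p0=20, q0=50, p1=25, q1=60)
print(relative)
```

A value relative above 100 means the value of the commodity has risen compared with the specified year, whether through price, quantity or both.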
Example 1
If the prices of 2005 are compared with the prices of 2004, then 2005 is
the current year and 2004 is the base year. The index number of 2005
based on 2004, is denoted by ‘Q01’ or ‘P01’, where subscript ‘0’ stands for
the year 2004, and subscript ‘1’ stands for the year 2005.
3. Relative measure
Index numbers measure changes which are not capable of direct
measurement.
4. Specialised averages
An index number represents a special case of an average, in general a weighted
average. It is a special type of average because, whereas in a simple
average the data are homogeneous and have the same unit of measurement,
index numbers average variables having different units of measurement.
5. Basis of Comparison
Index numbers by their very nature are comparative. They compare
changes over time or between places or similar categories.
15.2.5 Main steps in the construction of index numbers
Many problems are encountered in the steps involved in the construction of
index numbers, and each of them has to be considered carefully:
1. Purpose of index number
The steps which are taken in the construction of index numbers generally
depend on the purpose of the index number. Hence, the purpose of an
index number must be defined clearly and precisely. For example, the
purpose of the general wholesale price index number is to
know the general price level. On the other hand, the purpose of the
consumer price index number is to give an idea of the effect of the change
in retail prices on the cost of living of classes of people.
2. Selection of base period
The base period of an index number is the period of time against which the
comparisons are made. There are three types of base periods.
i) Fixed base (a single period)
ii) Fixed base (an average of selected periods)
iii) Chain base
While selecting the base, we have to decide whether to use a fixed base
or a chain base.
Fixed base (a single period): In a fixed base (a single period), the base
period must be a normal period. By normal period, we mean that the period
must be free from all sorts of abnormalities or random causes such as
financial crises, floods, famines, earthquakes, labour strikes or wars. The
base period should be a period for which reliable figures are available. The
base period should not be too distant in the past.
Fixed base (an average of selected periods): When it is difficult to choose
just one single period as the normal, then a better choice will be an average
of several periods.
Chain base: If the comparisons are required from year to year, a system of
chain base is used. In this method, there is no fixed base for comparing
the values of subsequent years; instead, the value of each year is compared with
the value of the preceding year.
3. Selection of commodities
The following problems can occur while selecting the commodities.
The first problem is to decide the number of commodities, because it is not
feasible to include all commodities. The purpose of the index number helps in
deciding the number of commodities.
Another problem is to decide which commodities are to be included.
A careful selection of the commodities must be made in such a way that each commodity:
Represents the real tastes, habits and customs of the people.
Is of a standard quality, with no significant variation in quality.
Is easily recognisable and describable.
Is not a non-tangible commodity such as personal service.
4. Selection of the representative prices
In the collection of price quotations we have to consider the following points:
The method of quoting prices of the commodities
The type of quotations - whether wholesale prices or retail prices
The place from where the quotations are to be obtained
5. System of weighting
The term ‘weight’ refers to the relative importance of the different
commodities included in the construction of index numbers. There are two
methods of assigning weights. They are:
The table 15.1 lists the merits and demerits of simple aggregative method.
Table 15.1: Merits and demerits of simple aggregative method
Merits | Demerits
This is the simplest method of constructing index numbers. | This method gives inappropriate results when the prices of different commodities are quoted in different units.
It is simple and easy to understand. | Since weights are not used, this method does not give any consideration to the relative importance of commodities.
It requires simple calculations. | Index number calculated by this method is unduly affected by high or low values.
Solved Problem 3: Find the simple aggregative price index from the data
displayed in table 15.2.
Table 15.2: Price of commodities for the years 2000 and 2004
Commodity | Unit | Price in 2000 (Rs. per unit) | Price in 2004 (Rs. per unit)
A | One kilogram | 10 | 15
B | One kilogram | 40 | 30
C | One dozen | 10 | 12
D | One litre | 5 | 13
Total | | 65 | 70
Solution: The price index number of 2004 is based on 2000. Using the
formula:
$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$$
Where, $\sum p_1$ = total of prices in 2004 = 70
$\sum p_0$ = total of prices in 2000 = 65
Therefore,
$$P_{01} = \frac{70}{65} \times 100 = 107.7$$
This implies that the prices had increased by 7.7% in year 2004 as
compared to the year 2000.
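The same calculation can be reproduced with a few lines of code. The sketch below uses the prices of table 15.2; the function name `simple_aggregative_index` is hypothetical.

```python
def simple_aggregative_index(base_prices, current_prices):
    """P01 = (sum of current prices / sum of base prices) x 100."""
    return sum(current_prices.values()) / sum(base_prices.values()) * 100


# Prices from table 15.2 (Rs. per unit).
prices_2000 = {"A": 10, "B": 40, "C": 10, "D": 5}
prices_2004 = {"A": 15, "B": 30, "C": 12, "D": 13}

p01 = simple_aggregative_index(prices_2000, prices_2004)
print(round(p01, 1))
```

Because the prices are simply added, commodity B (quoted at Rs. 40 and Rs. 30) dominates the totals, which illustrates why this method is sensitive to the units in which prices are quoted.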
Solved Problem 4: The prices of three different commodities for 2002 and
2003 are displayed in table 15.4a. The price given is per each ton of the
commodity. Taking the year 2002 as base, calculate the price index by
using the simple average of relatives method by using both arithmetic mean
and geometric mean.
Table 15.4a: Prices of commodities for 2002 and 2003
Commodity Corn Wheat Cocoa
Price in 2002 800 500 900
Price in 2003 880 480 940
Solution: The table 15.4b represents the calculated values for determining
price index.
Table 15.4b: Calculated values for determining price index
Commodity | Price in 2002, Po | Price in 2003, Pn | Price Relative R = (Pn / Po) × 100 | log R
Corn | 800 | 880 | (880 / 800) × 100 = 110 | 2.04
Wheat | 500 | 480 | (480 / 500) × 100 = 96 | 1.98
Cocoa | 900 | 940 | (940 / 900) × 100 = 104.44 | 2.02
Total | ΣPo = 2200 | ΣPn = 2300 | ΣR = 310.44 | Σlog R = 6.04
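From the price relatives in the table, the index by the simple average of relatives method is the arithmetic mean (ΣR / N) or the geometric mean (antilog of Σlog R / N) of the relatives. The sketch below carries out both calculations for the data of table 15.4a; the variable names are hypothetical.

```python
import math

# Prices from table 15.4a (per ton).
base = {"Corn": 800, "Wheat": 500, "Cocoa": 900}       # 2002
current = {"Corn": 880, "Wheat": 480, "Cocoa": 940}    # 2003

# Price relative of each commodity: R = (Pn / Po) x 100.
relatives = [current[name] / base[name] * 100 for name in base]
n = len(relatives)

# Index using the arithmetic mean of the relatives.
am_index = sum(relatives) / n

# Index using the geometric mean: antilog of the mean of log R.
gm_index = 10 ** (sum(math.log10(r) for r in relatives) / n)

print(round(am_index, 2), round(gm_index, 2))
```

The geometric mean is never larger than the arithmetic mean, so the second index is slightly lower than the first.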
The table 15.5 displays the merits and demerits of simple average of
relatives method.
Table 15.5: Merits and demerits of simple average of relatives method
Merits | Demerits
It is not affected by units in which prices are quoted. | As it is an unweighted average the importance of all items is assumed to be the same.
It is not affected by absolute values of prices as prices are converted into price relatives. | The index number constructed by this method does not satisfy all the criteria laid down for an ideal index.
It gives equal importance to all items and extreme items do not unduly affect the index number. | The index number is unduly influenced by high or low prices when arithmetic mean is used.
The index number calculated by this method satisfies the unit test. | More labour is involved if geometric mean is used.
Key Statistic
For the construction of the price index number quantity weights are used.
If ‘w’ is the weight attached to a commodity, then the price index is given
by:
$$\text{Price Index } P_{01} = \frac{\sum P_1 w}{\sum P_0 w} \times 100$$
$$Q_{01} = \frac{\sum Q_1 P_1}{\sum Q_0 P_0} \times 100$$
Paasche’s method
Paasche’s method is based on current year’s quantities. Current year’s
quantities are used as weights. Paasche’s Price Index is given by:
$$PP_{01} = \frac{\sum P_1 Q_1}{\sum P_0 Q_1} \times 100$$
Where, P1 = Current year price; P0 = Base year price
Q1 = Current year quantity, which is taken as the weight.
This index number has a downward bias. This formula is not used frequently
in practice where the number of commodities is large.
Quantity index number using Paasche’s formula is given by:
$$PQ_{01} = \frac{\sum Q_1 P_1}{\sum Q_0 P_1} \times 100$$
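$$DP_{01} = \frac{LP + PP_{01}}{2} = \frac{1}{2}\left[\frac{\sum P_1 Q_0}{\sum P_0 Q_0} \times 100 + \frac{\sum P_1 Q_1}{\sum P_0 Q_1} \times 100\right]$$
$$\sqrt{LP \times PP_{01}} = \left[\frac{\sum P_1 Q_0}{\sum P_0 Q_0} \times \frac{\sum P_1 Q_1}{\sum P_0 Q_1}\right]^{1/2} \times 100$$
Where, ‘LP’ is Laspeyre’s price index and ‘PP01’ is Paasche’s price index.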
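The weighted aggregative formulas can be compared side by side with a short sketch. It computes the base-weighted (Laspeyre's) and current-weighted (Paasche's) price indices, their arithmetic mean and their geometric mean (the latter is widely known as Fisher's ideal index). The function names and the two-commodity data set are hypothetical.

```python
import math


def weighted_aggregate(prices, quantities):
    """Sum of price x quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


def price_indices(p0, q0, p1, q1):
    """Return (Laspeyre's, Paasche's, arithmetic-mean, geometric-mean) price indices."""
    laspeyres = weighted_aggregate(p1, q0) / weighted_aggregate(p0, q0) * 100
    paasche = weighted_aggregate(p1, q1) / weighted_aggregate(p0, q1) * 100
    mean_index = (laspeyres + paasche) / 2
    geometric_index = math.sqrt(laspeyres * paasche)
    return laspeyres, paasche, mean_index, geometric_index


# Hypothetical base-year and current-year prices and quantities.
p0, q0 = [2, 5], [10, 4]
p1, q1 = [3, 6], [8, 5]

lp, pp, dp, fp = price_indices(p0, q0, p1, q1)
print(round(lp, 2), round(pp, 2), round(dp, 2), round(fp, 2))
```

Both combined indices lie between the Laspeyre's and Paasche's values, which is why they are regarded as compromises between base-year and current-year weighting.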
The table 15.6 displays the merits and demerits of weighted index number
Table 15.6: Merits and demerits of weighted index number
Merits | Demerits
It is free from bias, upward as well as downward. | This formula is difficult to interpret.
This formula takes into account both current year as well as base year prices and quantities. | It is not a practical index to compute because it is excessively laborious.
It satisfies both the ‘time reversal test’ as well as the ‘factor reversal test’. This is why it is called an ideal index number. | It requires the prices and quantities for base year and current year.
$$\text{Quantity index} = \frac{\sum \left[\left(\frac{Q_i}{Q_0} \times 100\right) Q_n P_n\right]}{\sum Q_n P_n}$$
where,
‘Qi’ and ‘Q0’ are the quantities for the current and base period
respectively
‘Qn’ and ‘Pn’ are the quantities and prices that determine the values that we
use for weights.
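A minimal sketch of this weighted average of relatives quantity index is shown below. The function name and the data are hypothetical; for simplicity the weights are taken as base-period values (Qn = Q0, with hypothetical prices Pn), which is only one possible choice.

```python
def weighted_relatives_quantity_index(q_current, q_base, q_weight, p_weight):
    """Quantity index = sum[(Qi/Q0 x 100) x (Qn x Pn)] / sum(Qn x Pn)."""
    weights = [q * p for q, p in zip(q_weight, p_weight)]
    relatives = [qi / q0 * 100 for qi, q0 in zip(q_current, q_base)]
    return sum(r * w for r, w in zip(relatives, weights)) / sum(weights)


# Hypothetical quantities for two commodities, with value weights
# built from base-period quantities and prices.
q_base = [10, 4]
q_current = [8, 5]
p_weight = [2, 5]

index = weighted_relatives_quantity_index(q_current, q_base, q_base, p_weight)
print(index)
```

Here the first commodity's quantity fell by 20% and the second rose by 25%; since both carry equal value weights, the index shows a net rise of 2.5%.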
$$V = \frac{\sum P_1 Q_1}{\sum P_0 Q_0} \times 100$$
$$V = \frac{\sum V_1}{\sum V_0} \times 100$$
Key Statistic
Cost of living price index measures average change over time in the
prices paid by the consumer of specific baskets of goods and services.
The cost of living price index numbers are designed to measure the
average change in the price paid by the ultimate consumers for specified
quantities of goods and services over a period of time
2. Same goods
The goods consumed in the base and current years remain unchanged.
3. No change in quantity of goods
It is also assumed that the quantity of goods consumed will remain same in
the base year and current year.
4. Price quotations are same
It is also assumed that the prices at different places are same and they do
not change frequently.
5. True on the average
Cost of living index numbers are true on the average.
6. Representative goods
The commodities included in the cost of living index number represent the
consumption of the class of people.
15.5.3 Steps in construction of cost of living index numbers
There are 5 steps involved in construction of cost of living index numbers.
3. Shift the base of the index numbers to 1990 for the data in table 15.8.
Table 15.8: Index numbers corresponding to year
Year 1982 1986 1990 1994 1998
Index number (base 1982) 100 140 200 260 320
15.9 Summary
In this unit, you have studied about the concept of index numbers, and
classification of index numbers into different types. The different index
numbers that are formally available and the utility and importance of index
numbers are explained in a simple way. You have also studied the
limitations and uses of index numbers.
5. The table 15.10 displays the price of commodities along with the weights
of respective commodities. Calculate index number for 2000 based on
year 1995.
Table 15.10: Price of commodities along with the weights
Commodity 1995 2000 Weights
A 13 8 6
B 15 22 5
C 249 185 4
D 228 259 1
E 497 448 2
15.12 References
Richard I. Levin, David S. Rubin, (2008) Statistics for Management,
Seventh Edition, PHI Learning Private Limited
S. C. Gupta, Fundamentals of Statistics, 2008, Himalaya Publishing
House
U K Srivastava, G V Shenoy, S C Sharma, Quantitative Techniques for
Management Decisions, Second edition, New Age International