
BUSINESS STATISTICS

BBA LLB
By The_Lawgical_World

SYLLABUS
Unit – V: Correlation Analysis: Scatter diagram, Positive and negative
correlation, limits for coefficient of correlation, Karl Pearson's
coefficient of correlation, Spearman's Rank correlation. Regression
Analysis: Concept, least square fit of a linear regression, two lines of
regression, properties of regression, properties of regression
coefficients (Simple problems only). Time Series Analysis:
Components, Models of Time Series – Additive, Multiplicative and
Mixed models; Trend analysis – Free hand curve, Semi averages,
moving averages, Least Square methods (Simple problems only).


Correlation Analysis:
Introduction:
Statistical methods of measures of central tendency, dispersion,
skewness and kurtosis are helpful for the purpose of comparison and
analysis of distributions involving only one variable i.e., univariate
distributions. However, describing the relationship between two or more
variables is another important part of statistics.
The statistical methods of Correlation and Regression are helpful in
knowing the relationship between two or more variables which may be
related in some way, like the interest rate of bonds and the prime interest
rate; advertising expenditure and sales; income and consumption; crop
yield and fertilizer used; heights and weights, and so on.
Correlation
Correlation is a measure of association between two or more variables.
When two or more variables vary in sympathy, so that movements in one
tend to be accompanied by corresponding movements in the other
variable(s), they are said to be correlated.
“The correlation between variables is a measure of the nature and degree
of association between the variables”.
As a measure of the degree of relatedness of two variables, correlation is
widely used in exploratory research when the objective is to locate
variables that might be related in some way to the variable of interest.
The degree of relationship between the variables under consideration is
measured through correlation analysis, and the measure of correlation is
called the correlation coefficient. The degree of relationship is expressed
by a coefficient which ranges from –1 to +1 (–1 ≤ r ≤ +1). The direction
of change is indicated by the sign. Correlation analysis thus enables us to
have an idea about the degree and direction of the relationship between
the two variables under study.


Definitions:
“Correlation is an analysis of the covariation between two or more
variables.” —A.M. Tuttle
“Correlation analysis contributes to the understanding of economic
behaviour, aids in locating the critically important variables on which
others depend, may reveal to the economist the connections by which
disturbances spread and suggest to him the paths through which
stabilising forces may become effective.”—W.A. Neiswanger
Types of Correlation
The correlation is a statistical tool which studies the relationship
between two variables and correlation analysis involves various methods
and techniques used for studying and measuring the extent of the
relationship between the two variables.
(a) POSITIVE AND NEGATIVE CORRELATION: If the values of
the two variables deviate in the same direction i.e., if the increase in the
values of one variable results, on an average, in a corresponding increase
in the values of the other variable or if a decrease in the values of one
variable results, on an average, in a corresponding decrease in the values
of the other variable, correlation is said to be positive or direct.
Some examples of series of positive correlation are : (i) Heights and
weights. (ii) The family income and expenditure on luxury items. (iii)
Amount of rainfall and yield of crop (up to a point). (iv) Price and
supply of a commodity and so on.
On the other hand, correlation is said to be negative or inverse if the
variables deviate in the opposite direction i.e., if the increase (decrease)
in the values of one variable results, on the average, in a corresponding
decrease (increase) in the values of the other variable.


Some examples of negative correlation are the series relating to: (i) price
and demand of a commodity, (ii) volume and pressure of a perfect gas,
(iii) sale of woollen garments and the day temperature, and so on.
(b) LINEAR AND NON-LINEAR CORRELATION: The correlation
between two variables is said to be linear if corresponding to a unit
change in one variable, there is a constant change in the other variable
over the entire range of the values. For example, let us consider the
following data:
x 1 2 3 4 5
y 5 7 9 11 13
Thus for a unit change in the value of x, there is a constant change, viz., 2,
in the corresponding values of y. Mathematically, the above data can be
expressed by the relation y = 2x + 3. In general, two variables x and y are
said to be linearly related, if there exists a relationship of the form
y = a + bx
between them. But we know that y = a+bx is the equation of a straight
line with slope ‘b’ and which makes an intercept ‘a’ on the y-axis.
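To make the idea concrete, here is a minimal Python sketch (illustrative only) that checks the constant first differences in the table above and recovers the relation y = 2x + 3 from the data:

```python
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]

# Constant first differences in y indicate a linear relation y = a + bx.
diffs = [y[i + 1] - y[i] for i in range(len(y) - 1)]
assert len(set(diffs)) == 1        # every difference equals 2

b = diffs[0] / (x[1] - x[0])       # slope: change in y per unit change in x
a = y[0] - b * x[0]                # intercept, recovered from any one point
print(f"y = {b:.0f}x + {a:.0f}")   # prints: y = 2x + 3
```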
Methods of Correlation
Several measures of correlation are available, the selection of which
depends mostly on the level of data being analyzed. Ideally, researchers
would like to solve for the population coefficient of correlation.
However, because researchers virtually always deal with sample data,
this section introduces a widely used sample coefficient of correlation, r.
This measure is applicable only if both variables being analyzed have at
least an interval level of data.
The commonly used methods for studying the correlation between two
variables are:
(i) Scatter diagram method.
(ii) Karl Pearson's coefficient of correlation (Covariance method).
(iii) Two-way frequency table (Bivariate correlation method).
(iv) Rank method.
(v) Concurrent deviations method.
I. Scatter diagram
Scatter diagram is one of the simplest ways of diagrammatic
representation of a bivariate distribution and provides one of the simplest
tools for ascertaining the correlation between two variables. This method
is also known as Dotogram or Dot diagram. Under this method, both the
variables are plotted on graph paper by putting dots. The diagram so
obtained is called a "Scatter Diagram".
In simple terms, a scatter diagram is a graph of observed plotted points
where each point represents the values of X and Y as a coordinate. It
portrays the relationship between these two variables graphically.
Suppose we are given n pairs of values (x1, y1), (x2, y2), …, (xn, yn) of
two variables X and Y. For example, if the variables X and Y denote the
height and weight respectively, then the pairs (x1, y1), (x2, y2), …, (xn,
yn) may represent the heights and weights (in pairs) of n individuals.
These n points may be plotted as dots (.) on the x-axis and y-axis in the
xy-plane. (It is customary to take the dependent variable along the y-axis
and independent variable along the x-axis.) The diagram of dots so
obtained is known as scatter diagram. From scatter diagram we can form
a fairly good, though rough, idea about the relationship between the two
variables.
The following points may be borne in mind in interpreting the scatter
diagram regarding the correlation between the two variables:


(i) If the points are very dense i.e., very close to each other, a fairly good
amount of correlation may be expected between the two variables. On
the other hand, if the points are widely scattered, a poor correlation may
be expected between them.
(ii) If the points on the scatter diagram reveal any trend (either upward
or downward), the variables are said to be correlated and if no trend is
revealed, the variables are uncorrelated.
(iii) If there is an upward trend rising from lower left hand corner and
going upward to the upper right hand corner, the correlation is positive
since this reveals that the values of the two variables move in the same
direction. If, on the other hand, the points depict a downward trend from
the upper left hand corner to the lower right hand corner, the correlation
is negative since in this case the values of the two variables move in the
opposite directions.
(iv) In particular, if all the points lie on a straight line starting from the
left bottom and going up towards the right top, the correlation is perfect
and positive, and if all the points lie on a straight line starting from left
top and coming down to right bottom, the correlation is perfect and
negative.
The following diagrams of the scattered data depict different forms of
correlation.
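As an illustration of how such a diagram is drawn, here is a minimal Python sketch; the paired data are hypothetical and matplotlib is assumed to be available:

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations: heights (cm) and weights (kg) of 8 people.
heights = [150, 155, 160, 163, 168, 172, 176, 180]
weights = [52, 55, 58, 61, 66, 70, 74, 79]

# Independent variable along the x-axis, dependent variable along the y-axis.
plt.scatter(heights, weights)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatter diagram")
plt.show()   # dots rising from lower left to upper right: positive correlation
```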


II. Karl Pearson’s coefficient of correlation (Covariance Method)


A mathematical method for measuring the intensity or the magnitude of
the linear relationship between two variable series was suggested by Karl
Pearson (1857–1936), a great British biometrician and statistician, and is
by far the most widely used method in practice. Karl Pearson's measure,
known as the Pearsonian correlation coefficient between two variables
(series) X and Y, usually denoted by r(X, Y), rxy or simply r, is a
numerical measure of the linear relationship between them and is defined
as the ratio of the covariance between X and Y, written as Cov(x, y), to
the product of the standard deviations of X and Y. Symbolically,

r = Cov(x, y) / (σx · σy)
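The definition translates directly into code. The following is a minimal Python sketch of the covariance method, using only the standard library; the sample data are illustrative:

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's r = Cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# The perfectly linear data used earlier (y = 2x + 3) give r = +1.
print(pearson_r([1, 2, 3, 4, 5], [5, 7, 9, 11, 13]))   # 1.0 (up to rounding)
```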


Properties of Correlation Coefficient


Property I. Limits for Correlation Coefficient
The Pearsonian correlation coefficient cannot exceed 1 numerically. In
other words, it lies between –1 and +1. Symbolically,
–1 ≤ r ≤ 1
Property II. Correlation coefficient is independent of the change of
origin and scale.
Mathematically, if X and Y are the given variables and they are
transformed to the new variables U and V by the change of origin and
scale viz.,
u = (x – A)/h and v = (y – B)/k; h > 0, k > 0,
where A, B, h and k are constants; then the correlation coefficient
between x and y is the same as the correlation coefficient between u and
v, i.e.,
r(x, y) = r(u, v) ⇒ rxy = ruv
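Property II can be verified numerically: transform x and y by arbitrary origins and positive scales and recompute r. A short sketch reusing the pearson_r function from the sketch above, with arbitrarily chosen constants A, B, h and k:

```python
# Reusing pearson_r from the previous sketch.
x = [12, 15, 19, 24, 30]
y = [40, 38, 35, 30, 26]

A, B, h, k = 20, 33, 2.0, 5.0            # arbitrary constants, h > 0, k > 0
u = [(xi - A) / h for xi in x]
v = [(yi - B) / k for yi in y]

print(pearson_r(x, y))                   # the two printed values coincide,
print(pearson_r(u, v))                   # i.e. r(x, y) = r(u, v)
```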


Property III. Two independent variables are uncorrelated but the
converse is not true.
Property IV: r(aX + b, cY + d) = (ac / |ac|) · r(X, Y),
where |ac| is the modulus (absolute) value of the product ac.
Property V. If the variables x and y are connected by the linear equation
ax + by + c = 0, then the correlation coefficient between x and y is (+1)
if the signs of a and b are different and (–1) if the signs of a and b are
alike.
Symbolically, if ax + by + c = 0, then
r = r(x, y) = +1, if a and b are of opposite signs;
r = r(x, y) = –1, if a and b are of the same sign.
Assumptions Underlying Karl Pearson’s Correlation Coefficient:
Pearsonian correlation coefficient r is based on the following
assumptions:
(i) The variables X and Y under study are linearly related.
(ii) Each of the variables (series) is being affected by a large number of
independent contributory causes of such a nature as to produce a normal
distribution.
(iii) The forces so operating on each of the variable series are not
independent of each other but are related in a causal fashion.
III. CORRELATION IN BIVARIATE FREQUENCY TABLE
If in a bivariate distribution the data are fairly large, they may be
summarised in the form of a two-way table. Here for each variable, the
values are grouped into various classes (not necessarily the same for
both the variables), keeping in view the same considerations as in the
case of univariate distribution.

For example, if there are m classes for the X-variable series and n
classes for the Y-variable series, then there will be m × n cells in the
two-way table. By going through the different pairs of the values (x, y) and
using tally marks we can find the frequency for each cell and thus obtain
the so-called bivariate frequency table, in which f(x, y) denotes the
frequency of the pair (x, y), fx the marginal frequency of x and fy the
marginal frequency of y.
The formula for computing the correlation coefficient between X and Y
for the bivariate frequency table is

r = [N ∑ f(x, y)·xy – (∑ fx·x)(∑ fy·y)] / √{[N ∑ fx·x² – (∑ fx·x)²][N ∑ fy·y² – (∑ fy·y)²]}

where N is the total frequency. If there is no confusion we may use the
formula

r = [N ∑ f·xy – (∑ f·x)(∑ f·y)] / √{[N ∑ f·x² – (∑ f·x)²][N ∑ f·y² – (∑ f·y)²]}
where the frequency f used for the product xy is nothing but f (x, y) and
the frequency f used in the sums ∑fx and ∑fy are respectively the
frequencies of x and y, viz., fx and fy, as explained above. If we change
the origin and scale in X and Y by transforming them to the new
variables U and V by
u = (x – A)/h and v = (y – B)/k; h > 0, k > 0,
where h and k are the widths of the x-classes and y-classes respectively
and A and B are constants, then by Property II of r, we have
r(x, y) = r(u, v), so that r may be computed from the (u, v) values.
IV. Spearman’s Rank correlation.


RANK CORRELATION METHOD
Charles Edward Spearman, a British psychologist, developed a formula
in 1904 which consists in obtaining the correlation coefficient between
the ranks of n individuals in the two attributes under study.
When it is not possible to measure the variables quantitatively, owing to
the absence of numerical facts, ranking figures are used. These ranks are
determined according to the size of the data. Using ranks rather than
actual observations gives the coefficient of rank correlation. This measure
is especially useful for attributes such as honesty, beauty, skill, wisdom, etc.
Spearman's rank correlation coefficient, usually denoted by ρ (rho), is
given by the formula

ρ = 1 – 6∑d² / [n(n² – 1)]

where d is the difference between the pair of ranks of the same
individual in the two characteristics and n is the number of pairs.
The Spearman’s rank correlation formula is derived from the Pearson
product moment formula and utilizes the ranks of the n pairs instead of
the raw data. The value of d is the difference in the ranks of each pair.
The process begins by the assignment of ranks within each group. The
difference in ranks between each group (d) is calculated by subtracting
the rank of a member of one group from the rank of its associated
member of the other group. The differences (d) are then squared and
summed. The number of pairs in the groups is represented by n.
Computation of Rank Correlation Coefficient.
The method of computing Spearman's rank correlation coefficient ρ
differs under the following situations:
(i) When actual ranks are given.
(ii) When ranks are not given
CASE (I) — WHEN ACTUAL RANKS ARE GIVEN
In this situation the following steps are involved:
(i) Compute d, the difference of ranks.
(ii) Compute d².
(iii) Obtain the sum ∑d².
(iv) Use the formula ρ = 1 – 6∑d² / [n(n² – 1)] to get the value of ρ.
CASE (II)—WHEN RANKS ARE NOT GIVEN:
Spearman’s rank correlation formula can also be used even if we are
dealing with variables which are measured quantitatively, i.e., when the
actual data but not the ranks relating to two variables are given. In such a
case we shall have to convert the data into ranks. The highest (smallest)
observation is given the rank 1. The next highest (next lowest)
observation is given rank 2 and so on. It is immaterial in which way
(descending or ascending) the ranks are assigned. However, the same
approach should be followed for all the variables under consideration.
REPEATED RANKS
In case of attributes if there is a tie i.e., if any two or more individuals
are placed together in any classification w.r.t. an attribute or if in case of
variable data there is more than one item with the same value in either or
both the series, then Spearman’s formula for calculating the rank
correlation coefficient breaks down, since in this case the variables X
[the ranks of individuals in characteristic A (1st series)] and Y [the ranks
of individuals in characteristic B (2nd series)] do not take the values
from 1 to n and consequently x̄ ≠ ȳ, while in proving the formula we had
assumed that x̄ = ȳ.
In this case, common ranks are assigned to the repeated items. These
common ranks are the arithmetic mean of the ranks which these items
would have got if they were different from each other and the next item
will get the rank next to the rank used in computing the common rank.
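The whole procedure can be sketched in a few lines of Python: ranks are assigned from the raw values (rank 1 to the highest), repeated items receive the mean of the ranks they would otherwise occupy, and the basic formula is then applied. The data are hypothetical, and the correction term sometimes added to ∑d² for ties is omitted for brevity:

```python
def avg_ranks(values):
    """Rank 1 for the largest value; tied values share the mean of their ranks."""
    order = sorted(values, reverse=True)
    return [order.index(v) + (1 + order.count(v)) / 2 for v in values]

def spearman_rho(x, y):
    """rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), with average ranks for ties."""
    rx, ry = avg_ranks(x), avg_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical marks awarded by two judges; 40 is repeated in the first list.
marks_1 = [75, 40, 52, 65, 40, 88]
marks_2 = [70, 38, 60, 60, 35, 90]
print(spearman_rho(marks_1, marks_2))   # about 0.97
```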
V. METHOD OF CONCURRENT DEVIATIONS
This is a very casual method of determining the correlation between two
series, used when we are not very serious about its precision. It is based on
the signs of the deviations (i.e., direction of the change) of the values of
the variable from its preceding value and does not take into account the
exact magnitude of the values of the variables. Thus, we put a plus (+)
sign, minus (–) sign or equality (=) sign for the deviation if the value of
the variable is greater than, less than or equal to the preceding value
respectively. The deviations in the values of two variables are said to be
concurrent if they have the same sign, i.e., either both deviations are
positive or both are negative or both are equal. The formula used for
computing the correlation coefficient r by this method is given by

rc = ± √[ ± (2c – n) / n ]
where c is the number of pairs of concurrent deviations and n is the
number of pairs of deviations. In the above formula, the plus/minus sign
to be taken inside and outside the square root is of fundamental
importance: if (2c – n) is positive, the plus sign is taken both inside and
outside the root, and if it is negative, the minus sign is taken both inside
and outside, so that the quantity under the root remains positive.
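A small Python sketch of the method (hypothetical data); note how the sign of (2c – n) is applied both inside and outside the root, as required above:

```python
from math import sqrt, copysign

def concurrent_r(x, y):
    # Direction of change from the preceding value: +1, -1 or 0 (no change).
    dx = [(b > a) - (b < a) for a, b in zip(x, x[1:])]
    dy = [(b > a) - (b < a) for a, b in zip(y, y[1:])]
    n = len(dx)                                   # pairs of deviations
    c = sum(a == b for a, b in zip(dx, dy))       # concurrent deviations
    t = (2 * c - n) / n
    return copysign(sqrt(abs(t)), t)              # same sign inside and outside

series_a = [60, 55, 50, 56, 30, 70, 40, 35, 80, 80, 75]
series_b = [65, 40, 35, 75, 63, 80, 35, 20, 80, 60, 60]
print(concurrent_r(series_a, series_b))           # about +0.77 here
```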
Regression Analysis:


Regression literally means "return" or "go back". In the 19th century,
Francis Galton first used the term regression in his paper "Regression
towards Mediocrity in Hereditary Stature" for the study of hereditary
characteristics. Use of regression in modern times is not limited to
hereditary characteristics only but it is widely used for the study of
expected dependence of one variable on the other. Therefore, the method
by which best probable values of unknown data of a variable are
calculated for the known values of the other variable is called regression.
Regression helps in forecasting, decision making and in studying two or
more variables in economic field. It also shows the direction, quality and
degree of correlation.
Regression analysis is the process of constructing a mathematical model
or function that can be used to predict or determine one variable by
another variable or other variables. The most elementary regression
model is called simple regression or bivariate regression involving two
variables in which one variable is predicted by another variable.
In simple regression, the variable to be predicted is called the dependent
variable and is designated as y. The predictor is called the independent
variable, or explanatory variable, and is designated as x. In simple
regression analysis, only a straight-line relationship between two
variables is examined. Nonlinear relationships and regression models
with more than one independent variable can be explored by using
multiple regression models.
INDEPENDENT AND DEPENDENT VARIABLES:
Simple regression involves only two variables; one variable is predicted
by another variable. The variable to be predicted is called the dependent
variable. The predictor is called the independent variable, or explanatory
variable. For example, when we are trying to predict the demand for
television sets on the basis of population growth, we are using the
demand for television sets as the dependent variable and the population
growth as the independent or predictor variable. The decision as to
which variable is which sometimes causes problems.
LINEAR AND NON-LINEAR REGRESSION
If the given bivariate data are plotted on a graph, the points so obtained
on the scatter diagram will more or less concentrate round a curve,
called the ‘curve of regression’. Often such a curve is not distinct and is
quite confusing and sometimes complicated too. The mathematical
equation of the regression curve, usually called the regression equation,
enables us to study the average change in the value of the dependent
variable for any given value of the independent variable.
If the regression curve is a straight line, we say that there is linear
regression between the variables under study. The equation of such a
curve is the equation of a straight line, i.e., a first-degree equation in the
variables x and y.
In case of linear regression, the values of the dependent variable increase
by a constant absolute amount for a unit change in the value of the
independent variable. However, if the curve of regression is not a
straight line, the regression is termed as curved or non-linear regression.
The regression equation will be a functional relation between x and y
involving terms in x and y of degree higher than one, i.e., involving
terms of the type x2, y2, xy, etc. However, in this chapter we shall
confine our discussion to linear regression between two variables only.
Lines of Regression:
Line of regression is the line which gives the best estimate of one
variable for any given value of the other variable. In case of two
variables x and y, we shall have two lines of regression; one of y on x
and the other of x on y.
Regression line is that line which gives the best estimate of dependent
variable for any given value of the independent variable. If we take the
case of two variables X and Y, we shall have two regression lines: the
regression of X on Y and the regression of Y on X.
Regression Line X on Y: In this formulation, Y is the independent and X
the dependent variable, and the best expected value of X is calculated
corresponding to the given value of Y.
Regression Line Y on X: Here Y is the dependent and X the independent
variable, and the best expected value of Y is estimated corresponding to
the given value of X.
An important reason for having two regression lines is that they are
drawn on the least squares assumption, which stipulates that the sum of
squares of the deviations from the different points to the line is minimum.
The deviations of the points from the line of best fit can be measured
in two ways – vertical, i.e., parallel to the Y-axis, and horizontal, i.e.,
parallel to the X-axis.
For minimizing the total of the squares separately, it is essential to have
two regression lines.
Single line of Regression: When there is perfect positive or perfect
negative correlation between the two variables (r = ±1) the regression
lines will coincide or overlap and will form a single regression line in
that case.
Derivation of Line of Regression of y on x.
Let (x1, y1), (x2, y2), …, (xn, yn), be n pairs of observations on the two
variables x and y under study.
Let y = a + bx be the line of regression (best fit) of y on x. For any given
point Pi(xi, yi) in the scatter diagram, the error of estimate or residual as
given by the line of best fit is PiHi, where Hi is the point on the line
vertically below (or above) Pi. The x-coordinate of Hi is the same as that
of Pi, viz., xi, and since Hi lies on the line, its y-coordinate is (a + bxi).
Hence, the error of estimate for Pi is given by PiHi = yi – (a + bxi).


This is the error (parallel to the y-axis) for the ith point. We will have
such errors for all the points on scatter diagram. For the points which lie
above the line, the error would be positive and for the points which lie
below the line, the error would be negative.


Least square fit of a linear regression:


The "least squares" method is a form of mathematical regression
analysis used to determine the line of best fit for a set of data, providing
a visual demonstration of the relationship between the data points. Each
point of data represents the relationship between a known independent
variable and an unknown dependent variable.
The least squares method provides the overall rationale for the
placement of the line of best fit among the data points being studied. The
most common application of this method, which is sometimes referred to
as "linear" or "ordinary", aims to create a straight line that minimizes the
sum of the squares of the errors that are generated by the results of the
associated equations, i.e., the squared residuals resulting from differences
between the observed values and the values anticipated based on that
model.
This method of regression analysis begins with a set of data points to be
plotted on an x- and y-axis graph. An analyst using the least squares
method will generate a line of best fit that explains the potential
relationship between independent and dependent variables.
In regression analysis, dependent variables are illustrated on the vertical
y-axis, while independent variables are illustrated on the horizontal x-
axis. These designations will form the equation for the line of best fit,
which is determined from the least squares method.
In contrast to a linear problem, a non-linear least squares problem has no
closed solution and is generally solved by iteration. The discovery of the
least squares method is attributed to Carl Friedrich Gauss, who
discovered the method in 1795.
COEFFICIENTS OF REGRESSION
Let us consider the line of regression of y on x, viz., y = a + bx.
The coefficient ‘b’ which is the slope of the line of regression of y on x
is called the coefficient of regression of y on x. It represents the
increment in the value of the dependent variable y for a unit change in
the value of the independent variable x. In other words, it represents the
rate of change of y w.r.t. x. For notational convenience, the slope b, i.e.,
coefficient of regression of y on x is written as byx.
Similarly, in the regression equation of x on y, viz., x = A + B y, the
coefficient B represents the change in the value of dependent variable x
for a unit change in the value of independent variable y and is called the
coefficient of regression of x on y. For notational convenience, it is
written as bxy.
Notations
byx = Coefficient of regression of y on x.
bxy = Coefficient of regression of x on y.


Properties of regression:
Properties of the Regression Lines
• Regression coefficients are independent of the change of origin but not
of scale. If the variables x and y are changed to u and v respectively,
where u = (x – a)/p and v = (y – c)/q (p and q being positive constants),
then byx = (q/p)·bvu and bxy = (p/q)·buv.
• The two lines of regression, y on x and x on y, intersect at the point
(x̄, ȳ), the means of x and y; this point is the solution of both regression
equations.
• The correlation coefficient between the two variables x and y is the
geometric mean of the two regression coefficients, and its sign is the
common sign of the two coefficients. So, if the regression coefficients
are byx and bxy, the correlation coefficient is
r = ± √(byx · bxy)
Thus, if both the coefficients are negative, r is negative, and if both are
positive, r is positive.
• The regression constant a0 is equal to the y-intercept of the regression
line y = a0 + a1x, where a0 and a1 are the regression parameters.
Properties of regression coefficients
1. It is denoted by b.
2. It is expressed in terms of original unit of data.
3. Between two variables (say x and y), two values of regression
coefficient can be obtained. One will be obtained when we consider x as
independent and y as dependent and the other when we consider y as
independent and x as dependent. The regression coefficient of y on x is
represented as byx and that of x on y as bxy.
4. Both regression coefficients must have the same sign. If byx is
positive, bxy will also be positive and vice versa.
5. If one regression coefficient is greater than unity, then the other
regression coefficient must be less than unity.
6. The geometric mean of the two regression coefficients is equal to the
coefficient of correlation: r = ± √(byx · bxy).
7. The arithmetic mean of the two regression coefficients is equal to or
greater than the coefficient of correlation:
(byx + bxy)/2 ≥ r
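These properties are easy to verify numerically. Below is a minimal Python sketch with made-up data that computes both regression coefficients and checks properties 4 to 7:

```python
from math import sqrt

x = [2, 4, 6, 8, 10]                     # made-up data
y = [5, 7, 8, 11, 12]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

byx = sxy / sxx                          # regression coefficient of y on x
bxy = sxy / syy                          # regression coefficient of x on y
r = sxy / sqrt(sxx * syy)                # correlation coefficient

print(byx * bxy <= 1)                            # so both cannot exceed unity
print(abs(abs(r) - sqrt(byx * bxy)) < 1e-12)     # |r| is the GM of byx and bxy
print((byx + bxy) / 2 >= abs(r))                 # AM of the coefficients >= |r|
```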
Regression analysis has wide applications in the field of genetics and
breeding as given below:
1. It helps in finding out a cause and effect relationship between two or
more plant characters.
2. It is useful in determining the important yield contributing characters.
3. It helps in the selection of elite genotypes by indirect selection for
yield through independent characters.
4. It also helps in predicting the performance of selected plants in the
next generation.


CORRELATION ANALYSIS Vs. REGRESSION ANALYSIS


1. Correlation literally means the relationship between two or more
variables which vary in sympathy so that the movements in one tend to
be accompanied by the corresponding movements in the other(s). On the
other hand, regression means stepping back or returning to the average
value and is a mathematical measure expressing the average relationship
between the two variables.
2. Correlation coefficient rxy between two variables x and y is a measure
of the direction and degree of the linear relationship between the two
variables, and it is mutual and symmetric, i.e., ryx = rxy; it is immaterial
which of x and y is the dependent variable and which is the independent
variable. Regression analysis, on the other hand, consists in establishing
the functional relationship between the two variables and then using this
relationship to predict or estimate the value of the dependent variable for
any given value of the independent variable. It also reflects the nature of
the variables, i.e., which is the dependent and which the independent
variable. Regression coefficients are not symmetric in x and y, i.e.,
byx ≠ bxy.
3. Correlation need not imply cause and effect relationship between the
variables under study.
However, regression analysis clearly indicates the cause-and-effect
relationship between the variables. The variable corresponding to cause
is taken as independent variable and the variable corresponding to effect
is taken as dependent variable.

4. Correlation coefficient rxy is a relative measure of the linear
relationship between x and y and is independent of the units of
measurement. It is a pure number lying between ± 1.
On the other hand, the regression coefficients, byx and bxy are absolute
measures representing the change in the value of the variable y(x), for a
unit change in the value of the variable x(y). Once the functional form of
the regression curve is known, by substituting the value of the independent
variable we can obtain the value of the dependent variable, and this value
will be in the units of measurement of that variable.
5. There may be nonsense correlation between two variables which is
due to pure chance and has no practical relevance, e.g., the correlation
between the size of shoe and the intelligence of a group of individuals.
There is no such thing as nonsense regression.
6. Correlation analysis is confined only to the study of linear relationship
between the variables and, therefore, has limited applications.
Regression analysis has much wider applications as it studies linear as
well as non-linear relationships between the variables.
Time Series Analysis:
A time series is an arrangement of statistical data in a chronological
order, i.e., in accordance with its time of occurrence. It reflects the
dynamic pace of movements of a phenomenon over a period of time.
Most of the series relating to Economics, Business and Commerce, e.g.,
the series relating to prices, production and consumption of various
commodities; agricultural and industrial production, national income and
foreign exchange reserves; investment, sales and profits of business
houses; bank deposits and bank clearings, prices and dividends of shares
in a stock exchange market, etc., are all time series spread over a long
period of time.
Accordingly, time series have an important and significant place in
Business and Economics, and basically most of the statistical techniques
for the analysis of time series data have been developed by economists.
However, these techniques can also be applied for the study of
behaviour of any phenomenon collected chronologically over a period of
time in any discipline relating to natural and social sciences, though not
directly related to economics or business.
Components of Time Series:
If the values of a phenomenon are observed at different periods of time,
the values so obtained will show appreciable variations or changes.
These fluctuations are due to the fact that the value of the phenomenon
is affected not by a single factor but due to the cumulative effect of a
multiplicity of factors pulling it up and down. However, if the various
forces were in a state of equilibrium, the time series would remain
constant.
The various forces affecting the values of a phenomenon in a time series
may be broadly classified into the following four categories, commonly
known as the components of a time series, some or all of which are
present (in a given time series) in varying degrees.
(a) Secular Trend or Long-term Movement (T).
(b) Periodic Movements or Short-term Fluctuations:
(i) Seasonal Variations (S),
(ii) Cyclical Variations (C).
(c) Random or Irregular Variations (R or I).
The value (y) of a phenomenon observed at any time (t) is the net effect
of the interaction of the above components.
(a) Secular Trend: The general tendency of the time series data to
increase or decrease or stagnate during a long period of time is called the
secular trend or simple trend. This phenomenon is usually observed in
most of the series relating to Economics and Business, e.g., an upward

tendency is usually observed in time series relating to population,
production and sales of products, prices, incomes, money in circulation,
etc., while a downward tendency is noticed in the time series relating to
deaths, epidemics, etc., due to advancement in medical technology,
improved medical facilities, better sanitation, diet, etc.
According to Simpson and Kafka: “Trend, also called secular or long-
term trend, is the basic tendency of a series to grow or decline over a
period of time. The concept of trend does not include short-range
oscillations, but rather the steady movement over a long time.”
Uses of Trend.
(i) The study of the data over a long period of time enables us to have a
general idea about the pattern of the behaviour of the phenomenon under
consideration. This helps in business forecasting and planning future
operations.
(ii) By isolating trend values from the given time series, the short-term
and irregular movements can be understood.
(iii) Trend analysis enables us to compare two or more time series over
different periods of time and draw important conclusions about them.
Short-Term Variations. In addition to the long-term movements there
are inherent in most of the time series, a number of forces which repeat
themselves periodically or almost periodically over a period of time and
thus prevent the smooth flow of the values of the series in a particular
direction. Such forces give rise to the so-called short-term variations
which may be classified into the following two categories:
(i) Seasonal Variations (S), and (ii) Cyclical Variations (C).
(i) Seasonal Variations (S). These variations in a time series are due to
the rhythmic forces which operate in a regular and periodic manner over
a span of less than a year, i.e., within a period of 12 months, and have the
same or almost the same pattern year after year.


Thus, seasonal variations in a time series will be there, if the data are
recorded quarterly (every three months), monthly, weekly, daily, hourly,
and so on. Although in each of the above cases, the amplitudes of the
seasonal variations are different, all of them have the same period, viz.,
1 year. Thus, in a time series data where only annual figures are given,
there are no seasonal variations. Most of economic time series are
influenced by seasonal swings, e.g., prices, production and consumption
of commodities; sales and profits in a departmental store; bank clearings
and bank deposits, etc.
The seasonal variations may be attributed to the following two causes:
(i) Those resulting from natural forces and (ii) Those resulting from
man-made conventions.
(ii) Cyclical Variations (C). The oscillatory movements in a time series
with period of oscillation greater than one year are termed as cyclical
variations. These variations in a time series are due to ups and downs
recurring after a period greater than one year. The cyclical fluctuations,
though more or less regular, are not necessarily uniformly periodic, i.e.,
they may or may not follow exactly similar patterns after equal intervals
of time. One complete period which normally lasts from 7 to 9 years is
termed as a ‘cycle’. These oscillatory movements in any business
activity are the outcome of the so-called ‘Business Cycles’ which are the
four-phased cycles comprising prosperity (boom), recession, depression
and recovery from time to time. These booms and depressions in any
business activity follow each other with steady regularity and the
complete cycle from the peak of one boom to the peak of next boom
usually lasts from 7 to 9 years. Most of the economic and business
series, e.g., series relating to production, prices, wages, investments, etc.,
are affected by cyclical upswings and downswings.
The study of cyclical variations is of great importance to business
executives in the formulation of policies aimed at stabilising the level of
business activity. A knowledge of the cyclic component enables a

businessman to have an idea about the periodicity of the booms and
depressions and accordingly he can take timely steps for maintaining
stable market for his product.
(c) Random or Irregular Variations. Mixed up with cyclical and
seasonal variations, there is inherent in every time series another factor
called random or irregular variations. These fluctuations are purely
random and are the result of such unforeseen and unpredictable forces
which operate in absolutely erratic and irregular manner. Such variations
do not exhibit any definite pattern and there is no regular period or time
of their occurrence, hence they are named irregular variations. These
powerful variations are usually caused by numerous non-recurring
factors like floods, famines, wars, earthquakes, strikes and lockouts,
epidemics, revolution, etc., which behave in a very erratic and
unpredictable manner. Normally, they are short-term variations but
sometimes their effect is so intense that they may give rise to new
cyclical or other movements. Irregular variations are also known as
episodic fluctuations and include all types of variations in a time series
data which are not accounted for by trend, seasonal and cyclical
variations.
Because of their absolutely random character, it is not possible to isolate
such variations and study them exclusively, nor can we forecast or
estimate them precisely. The best that can be done about such variations
is to obtain their rough estimates (from past experience) and accordingly
make provisions for such abnormalities during normal times in business.
ANALYSIS OF TIME SERIES
The time series analysis consists of:
(i) Identifying or determining the various forces or influences whose
interaction produces the variations in the time series.
(ii) Isolating, studying, analysing and measuring them independently,
i.e., by holding other things constant.
The time series analysis is of great importance not only to a businessman
or an economist but also to people working in various disciplines in
natural, social and physical sciences. Some of its uses are enumerated
below:
(i) It enables us to study the past behaviour of the phenomenon under
consideration, i.e., to determine the type and nature of the variations in
the data.
(ii) The segregation and study of the various components is of
paramount importance to a businessman in the planning of future
operations and in the formulation of executive and policy decisions.
(iii) It helps to compare the actual current performance or
accomplishments with the expected ones (on the basis of the past
performances) and analyse the causes of such variations, if any.
(iv) It enables us to predict or estimate or forecast the behaviour of the
phenomenon in future which is very essential for business planning.
(v) It helps us to compare the changes in the values of different
phenomena at different times or places, etc.
Models of Time Series:
The following are the two models commonly used for the decomposition
of a time series into its components.
(i) Additive Model or Decomposition by Additive Hypothesis.
(ii) Multiplicative Model or Decomposition by Multiplicative
Hypothesis.
(i) Additive Model or Decomposition by Additive Hypothesis.
According to the additive model, the time series can be expressed as:
Y = T + S + C + I … (i)
or more precisely, Yt = Tt + St + Ct + It

where Yt is the time series value at time t, and Tt, St, Ct and It
represent the trend, seasonal, cyclical and random variations at time t. In
this model S = St, C = Ct and I = It are absolute quantities which can take
positive and negative values, so that their effects cancel over the relevant
periods: the St values sum to zero over a year, the Ct values sum to zero
over a complete cycle, and the It values sum to zero in the long run.
The additive model assumes that all the four components of the time
series operate independently of each other so that none of these
components has any effect on the remaining three.
This implies that the trend, however, fast or slow, it may be, has no
effect on the seasonal and cyclical components; nor do seasonal swings
have any impact on cyclical variations and conversely. However, this
assumption is not true in most of the economic and business time series
where the four components of the time series are not independent of
each other.
(ii) Multiplicative Model or Decomposition by Multiplicative
Hypothesis: Keeping the above points, in view, most of the economic
and business time series are characterised by the following classical
multiplicative model:
Y = T × S × C × I --- (i)
or more precisely, Yt = Tt × St × Ct × It
This model assumes that the four components of the time series are due
to different causes but they are not necessarily independent and they can
affect each other.
In this model S, C and I are not viewed as absolute amounts but rather as
relative variations. Except for the trend component T, the other
components S, C and I are expressed as rates or indices fluctuating
above or below 1 such that the geometric means of all the S = St values

in a year, C = Ct values in a cycle or I = It values in a long-term period
are unity.
Taking logarithm of both sides in (i), we get
log Y = log T + log S + log C + log I
which is nothing but the additive model fitted to the logarithms of the
given time series values.
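The equivalence can be demonstrated for a single observation: multiplying the components, or adding their logarithms and exponentiating, gives the same value. A toy Python sketch with made-up component values:

```python
from math import exp, log

# Made-up components for one observation: T is in the units of the series,
# while S, C and I are indices fluctuating about 1.
T, S, C, I = 120.0, 1.10, 0.95, 1.02

y_multiplicative = T * S * C * I
y_log_additive = exp(log(T) + log(S) + log(C) + log(I))

print(y_multiplicative)   # 127.908
print(y_log_additive)     # identical: the additive model on the logarithms
```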
Mixed models:
In addition to the additive and multiplicative models discussed above,
the components in a time series may be combined in a large number of
other ways. The different models, defined under different assumptions,
will yield different results. Some of the mixed models resulting from
different combinations of additive and multiplicative models are given
below:
Y = TCS + I … (i)
Y = TC + SI … (ii)
Y = T + SCI … (iii)
Y = T + S + CI … (iv)
Trend analysis:
The following are the four methods which are generally used for the
study and measurement of the trend component in a time series.
(i) Graphic (or Free-hand Curve Fitting) Method.
(ii) Method of Semi-Averages.
(iii) Method of Curve Fitting by the Principle of Least Squares.
(iv) Method of Moving Averages
Free hand curve


This is the simplest and the most flexible method of estimating the
secular trend. It consists in first obtaining a historigram (the graph of the
time series) by plotting the time series values on graph paper and then
drawing a free-hand smooth curve through these points so that it
accurately reflects the long-term tendency of the data. The smoothing of
the curve eliminates the other components, viz., seasonal, cyclical and
random variations.
In order to obtain proper trend line or curve, the following points may be
borne in mind:
(i) It should be smooth.
(ii) The number of points above the trend curve/line should be more or
less equal to the number of points below it.
(iii) The sum of the vertical deviations of the given points above the
trend line should be approximately equal to the sum of vertical
deviations of the points below the trend line so that the total positive
deviations are more or less balanced against total negative deviations.
(iv) The sum of the squares of the vertical deviations of the given points
from the trend line/curve should be the minimum possible.
(v) If the cycles are present in the data then the trend line should be so
drawn that:
(a) It has equal number of cycles above and below it.
(b) It bisects the cycles so that the areas of the cycles above and
below the trend line are approximately same.
(vi) The minor short-term fluctuations or abrupt and sudden variations
may be ignored.
Merits:
(i) It is a very simple and time-saving method and does not require any
mathematical calculations.

(ii) It is a very flexible method in the sense that it can be used to
describe all types of trend – linear as well as non-linear.
Demerits:
(i) The strongest objection to this method is that it is highly subjective in
nature. The trend curve so obtained will very much depend on the
personal bias and judgement of the investigator handling the data and
consequently different persons will obtain different trend curves for the
same set of data.
(ii) It does not help to measure trend.
(iii) Because of the subjective nature of the free-hand trend curve, it will
be dangerous to use it for forecasting or making predictions.
Semi averages
As compared with the graphic method, this method has a more objective
approach. In this method, the whole time series data is classified into
two equal parts w.r.t. time.
For example, if we are given the time series values for 10 years from
1985 to 1994 then the two equal parts will be the data corresponding to
periods 1985 to 1989 and 1990 to 1994.
However, in case of odd number of years, the two equal parts are
obtained on omitting the value for the middle period.
Thus, for example, for the data for 9 years from 1990 to 1998, the two
parts will be the data for years 1990 to 1993 and 1995 to 1998, the value
for the middle year, viz., 1994 being omitted.
Having divided the given series into two equal parts, we next compute
the arithmetic mean of time-series values for each half separately. These
means are called semi-averages. Then these semi-averages are plotted as
points against the middle point of the respective time periods covered by
each part. The line joining these points gives the straight-line trend
fitting the given data.
Merits:
(i) An obvious advantage of this method is its objectivity in the sense
that it does not depend on personal judgement and everyone who uses
this method gets the same trend line and hence the same trend values.
(ii) It is easy to understand and apply as compared with the moving
average or the least square methods of measuring trend.
(iii) The line can be extended both ways to obtain future or past
estimates.
Limitations:
(i) This method assumes the presence of linear trend (in the time series
values) which may not exist.
(ii) The use of arithmetic mean (for obtaining semi-averages) may also
be questioned because of its limitations. Accordingly, the trend values
obtained by this method and the predicted values for future are not
precise and reliable.
Example: Apply the method of semi-averages for determining trend of
the following data and estimate the value for 2000:
Years: 1993 1994 1995 1996 1997 1998
Sales: 20 24 22 30 28 32
If the actual figure of sales for 2000 is 35,000 units, how do you account
for the difference between the figures you obtain and the actual figures
given to you?
Solution: Here n = 6 (even), and hence the two parts will be 1993 to
1995 and 1996 to 1998.
CALCULATIONS FOR TREND BY SEMI-AVERAGES

Year    Sales ('000 units)    Semi-total    Semi-average
1993          20
1994          24                  66         66/3 = 22
1995          22
1996          30
1997          28                  90         90/3 = 30
1998          32
Here the semi-average 22 is to be plotted against the mid-year of the first
part, i.e., 1994, and the semi-average 30 against the mid-year of the
second part, viz., 1997. The straight line joining these two points gives
the required trend line, which can be extended to read off the estimate
for 2000.
Least Square methods (Simple problems only).


Method of Curve Fitting by the Principle of Least Squares. The principle
of least squares provides us an analytical or mathematical device to
obtain an objective fit to the trend of the given time series. Most of the
data relating to economic and business time series conform to definite
laws of growth or decay and accordingly in such a situation analytical
trend fitting will be more reliable for forecasting and predictions. This
technique can be used to fit linear as well as non-linear trends.
Fitting of Linear Trend: Let the straight-line trend between the given
time-series values (y) and time (t) be given by the equation:

y = a + bt.
Then for any given time t, the estimated value ye of y as given by this
equation is:
ye = a + bt
The principle of least squares consists in estimating the values of a and b
in the above equation so that the sum of the squares of the errors of
estimate,

E = ∑(y – ye)² = ∑(y – a – bt)²,

is minimum. Setting the partial derivatives of E with respect to a and b
equal to zero gives, on simplification, the normal equations or least
square equations for estimating a and b:
∑y = na + b∑t;  ∑ty = a∑t + b∑t²,
where n is the number of time series pairs (t, y).
Merits and Limitations of Trend Fitting by Principle of Least
Squares
Merits:
The method of least squares is the most popular and widely used method
of fitting mathematical functions to a given set of observations. It has the
following advantages:
(i) Because of its analytical or mathematical character, this method
completely eliminates the element of subjective judgement or personal
bias on the part of the investigator.
(ii) Unlike the method of moving averages, this method enables us to
compute the trend values for all the given time periods in the series.


(iii) The trend equation can be used to estimate or predict the values of
the variable for any period t in future or even in the intermediate periods
of the given series and the forecasted values are also quite reliable.
(iv) The curve fitting by the principle of least squares is the only
technique which enables us to obtain the rate of growth per annum, for
yearly data, if linear trend is fitted.
Demerits:
(i) The most serious limitation of the method is the determination of the
type of the trend curve to be fitted, viz., whether we should fit a linear or
a parabolic trend or some other more complicated trend curve.
Assumptions about the type of trend to be fitted might introduce some
bias.
(ii) The addition of even a single new observation necessitates all the
calculations to be done afresh which is not so in the case of moving
average method.
(iii) This method requires more calculations and is quite tedious and
time consuming as compared with other methods. It is rather difficult for
a non-mathematical person (layman) to understand and use.
(iv) Future predictions or forecasts based on this method are based only
on the long-term variations, i.e., trend and completely ignore the
cyclical, seasonal and irregular fluctuations.
(v) It cannot be used to fit growth curves (Modified exponential curve,
Gompertz curve and Logistic curve) to which most of the economic and
business time series conform. The discussion, however, is beyond the
scope of this book.
Example: Fit a linear trend to the following data by the least squares
method. Verify that ∑ (y – ye) = 0, where ye is the corresponding trend
value of y.
Year: 1990 1992 1994 1996 1998
Production: 18 21 23 27 16
(in ’000 units)
Also estimate the production for the year 1999.
Solution:
Here n = 5 i.e., odd. Hence, we shift the origin to the middle of the time
period viz., the year 1994.
Let x = t – 1994 …(i)
Let the trend line of y (production) on x be:
y = a + bx (Origin 1994) …(ii)
COMPUTATION OF STRAIGHT-LINE TREND

Year (t)   x = t – 1994    y     x²     xy    ye = 21 + 0·1x   y – ye
1990           –4          18    16    –72        20·6          –2·6
1992           –2          21     4    –42        20·8          +0·2
1994            0          23     0      0        21·0          +2·0
1996            2          27     4     54        21·2          +5·8
1998            4          16    16     64        21·4          –5·4
Total           0         105    40      4       105·0           0

From the normal equations, since ∑x = 0, we get a = ∑y/n = 105/5 = 21
and b = ∑xy/∑x² = 4/40 = 0·1, so that the fitted trend line is
ye = 21 + 0·1x (origin: 1994) …(iii)
Putting x = –4, –2, 0, 2 and 4 in (iii), we obtain the trend values (ye) for
the years 1990, 1992, …, 1998 respectively.
The difference (y – ye) is calculated in the last column of the table. We
have:
∑ (y – ye) = –2·6 + 0·2 + 2·0 + 5·8 – 5·4 = 8 – 8 = 0, as required.


Estimated Production for 1999. Taking t = 1999 in (i), we get x = 1999 –
1994 = 5.
Substituting x = 5 in (iii), the estimated production for 1999 is given by:
(ye)1999 = 21 + 0·1 × 5 = 21 + 0·5 = 21·5 thousand units.
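The computation above can be checked in a few lines of Python; because the origin is shifted so that ∑x = 0, the normal equations uncouple into a = ∑y/n and b = ∑xy/∑x²:

```python
years = [1990, 1992, 1994, 1996, 1998]
y = [18, 21, 23, 27, 16]                     # production in '000 units

x = [t - 1994 for t in years]                # shift origin to 1994, so sum(x) = 0
n = len(y)
a = sum(y) / n                                                        # 105/5 = 21
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)   # 4/40 = 0.1

ye = [a + b * xi for xi in x]                # trend values 20.6, 20.8, ..., 21.4
print(sum(yi - yei for yi, yei in zip(y, ye)))   # 0 (up to rounding error)
print(a + b * (1999 - 1994))                 # 21.5, the 1999 estimate in '000 units
```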
Moving averages
The method of moving averages is a very simple and flexible method of
measuring trend. It consists in obtaining a series of moving averages
(arithmetic means) of successive overlapping groups or sections of the
time series. The averaging process smoothens out fluctuations and the
ups and downs in the given data.
The moving average is characterised by a constant known as the period
or extent of the moving average. Thus, the moving average of period ‘m’
is a series of successive averages (A.M.’s) of m overlapping values at a
time, starting with 1st, 2nd, 3rd value and so on.
Thus, for the time series values y1, y2, y3, y4, y5, … for different time
periods, the moving average (M.A.) values of period 'm' are given by:

(y1 + y2 + … + ym)/m, (y2 + y3 + … + ym+1)/m, (y3 + y4 + … + ym+2)/m, …
Case (i) When Period is Odd.


If the period 'm' of the moving average is odd, then the successive
values of the moving averages are placed against the middle values of
the corresponding time intervals. For example, if m = 5, the first moving
average value is placed against the middle period, i.e., the 3rd; the
second M.A. value is placed against the 4th time period, and so on.
Case (ii). When Period is Even.
If the period ‘m’ of the M.A. is even, then there are two middle periods
and the M.A. values are placed in between the two middle periods of the
time intervals it covers.
Obviously, in this case, the M.A. values will not coincide with a period
of the given time series and an attempt is made to synchronise them with
the original data by taking a two-period average of the moving averages
and placing them in between the corresponding time periods.
This technique is called centering and the corresponding moving average
values are called centred moving averages. In particular, if the period m
= 4, the first moving average value is placed against the middle of 2nd
and 3rd time intervals; the second moving average value is placed in
between 3rd and 4th time periods and so on.
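A minimal Python sketch of both cases, using hypothetical values; centering a 4-period moving average is just a 2-period average of successive 4-period averages:

```python
def moving_avg(y, m):
    """Simple moving averages of period m over successive overlapping groups."""
    return [sum(y[i:i + m]) / m for i in range(len(y) - m + 1)]

y = [21, 22, 23, 25, 24, 22, 25, 26, 27, 26]

# Period 3 (odd): each average is placed against the middle of its 3 periods.
print(moving_avg(y, 3))

# Period 4 (even): centre by averaging successive pairs of 4-period averages,
# so that each centred value coincides with an actual time period.
ma4 = moving_avg(y, 4)
print([(a + b) / 2 for a, b in zip(ma4, ma4[1:])])
```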

Merits and Demerits of Moving Average Method


Merits
1. This method does not require any mathematical complexities and is
quite simple to understand and use as compared with the principle of
least squares method.
2. Unlike the ‘free hand curve’ method, this method does not involve
any element of subjectivity since the choice of the period of moving

average is determined by the oscillatory movements in the data and not
by the personal judgement of the investigator.
3. Unlike the method of trend fitting by principle of least squares, the
moving average method is quite flexible in the sense that a few more
observations may be added to the given data without affecting the trend
values already obtained. The addition of some new observations will
simply result in some more trend values at the end.
4. The oscillatory movements can be completely eliminated by choosing
the period of the M.A. equal to, or a multiple of, the period of the cyclic
movement in the given series.
5. In addition to the measurement of trend, the method of moving
averages is also used for measurement of seasonal, cyclical and irregular
fluctuations.
Limitations:
1. An obvious limitation of the moving average method is that we cannot
obtain the trend values for all the given observations. We have to forego
the trend values for some observations at both the extremes (i.e., in the
beginning and at the end) depending on the period of the moving
average. For example, for a moving average of period 5, 7 and 9, we
lose the trend values for the first and last 2, 3 and 4 values respectively.
2. Since the trend values obtained by moving average method cannot be
expressed by any functional relationship, this method cannot be used for
forecasting or predicting future values which is the main objective of
trend analysis.
3. The selection of the period of moving average is very important and is
not easy to determine particularly when the time series does not exhibit
cycles which are regular in period and amplitude. In such a case the
moving average will not completely eliminate the oscillatory movements
and consequently the moving average values will not represent a true
picture of the general trend.
4. In case of non-linear trend, which is generally the case in most of
economic and business time series, the trend values given by the moving
average method are biased and they lie either above or below the true
sweep of the data.
