Lecture Part A

Uploaded by

molepe6687

History

A history must start somewhere, but history has no beginning. There has always been a
temptation by many writers to begin a history of Statistics with references to the endeavours
in the ancient world to record information about states.

The first use of the word appears in a work by the Italian historian Girolamo Ghilini, who in
1589 referred to an account of the civile, politica, statistica e militare scienza. The words
statist, statistics and statistical are believed to derive from the Latin word status, which
means the political state of a nation. Other views exist, however. One of them is that the word
statistics originated from either the Italian statista or the German Statistik. Both statista
and Statistik, like the Latin status, refer to the political state of a nation.

The word statist appeared for the first time in Shakespeare's Hamlet in 1602. After that the
word was used in Cymbeline (1610/1611) and Paradise Regained (1671). The German professor
Gottfried Achenwall used the word Statistik for the first time in his book Abriss der
Europäischen Reiche in 1749.

The word statistics appeared for the first time in Baron J. F. von Bielfeld's book The
Elements of Universal Erudition (1770). In that book, a chapter is titled Statistics and the
term is defined as: The science that teaches what is the political arrangement of all
the modern states of the known world. In 1787, E. A. W. Zimmermann, a German professor of
Natural Philosophy, defined statistics in his book A Political Survey of the Present State of
Europe as: A brief description of important characteristics of a state.

The evolution of statistics has a long history. Perhaps the earliest use of statistics was when
ancient rulers or heads of different regions counted the number of effective warriors
they had or the number they would need to defeat their enemy, or when they estimated
how much tax could be collected from their subjects. In later times, statistics was used to
report death rates in the Great Plague of London and in the study of natural resources. In the
Indian subcontinent, good records on land, revenue, agriculture etc. were maintained during
Akbar's reign (1556-1605).

We have a series of well-known landmarks in the history of political description. These
works are often referred to as statistical, and their authors used the word to describe the
methods of their studies. The origin of modern statistics cannot be traced back earlier than
A.D. 1660. The basic ideas of modern statistics seem to have arisen independently at
different times in Western Europe. This evolution of modern statistics can be divided
into four phases (Islam, 1993):

1. Pre-history period (1660 - 1750)
2. Development of formal statistical methods (1750 - 1820)
3. Socialization of statistics (1820 - 1900) and
4. Modern era of statistics (1900 + )

Definition of Statistics: It is almost impossible to capture statistics in one compact definition.
Different statisticians have defined statistics from different perspectives, and the word
is used in two different senses.

A. In the singular sense: statistics refers to the principles, formulae and methods
through which numerical facts are expressed.
B. In the plural sense: statistics means the numerical facts themselves, that is, the
quantified affairs of our day-to-day life.

Connor's (1937) definition: Statistics are the measurements, enumerations or estimates of
natural or social phenomena, systematically arranged so as to exhibit their interrelationships.

Fisher's (1947) definition: The science of statistics is essentially a branch of applied
mathematics and may be regarded as mathematics applied to observational data.

Croxton and Cowden's (1948) definition: Statistics is the subject of collection, presentation
and analysis of numerical data.

Yule and Kendall's (1950) definition: Statistics means quantitative data, which are affected
to a marked extent by multiplicity of causes.

Definition: Statistics is concerned with scientific methods for collecting, organizing,
summarizing, analyzing and presenting sample data from a specified population of interest.

Data: The raw material of statistics, consisting of numbers or observations usually obtained by
some process of counting or measurement, is known as data. There are two types of data.

1. Primary data
2. Secondary data

Primary data: Primary data are measurements observed and recorded as part of an original
study. When the data required for a particular study can be found neither in the internal records
of the enterprise nor in published sources, it may become necessary to collect original data,
i.e. to conduct a first-hand investigation.

Secondary Data: Secondary data are observations that are already available, i.e. data
which have already been collected and analyzed by someone else. A researcher who uses
secondary data has to look into the various sources from which they can be obtained, but
is not confronted with the problems usually associated with the collection of original data.
Secondary data may be either published or unpublished.

Collection of Primary Data: There are several methods of collecting primary data,
particularly in surveys and descriptive research. The important ones are:

1. Observation method
2. Interview method
3. Through questionnaires
4. Through schedules
5. Other methods

1. Observation Method: The observation method is the most commonly used method,
especially in studies relating to the behavioral sciences. Under this method, the
information is sought through the investigator's own direct observation, without asking
the respondent.

Merits:

• The main advantage of this method is that subjective bias is eliminated
• The information obtained relates to what is currently happening
• It is not complicated by past behavior, future intentions or attitudes

Demerits:

• If the data are collected by someone other than the investigator, wrong information may be recorded
• Observing remote areas requires more time and cost

2. Interview Method: This method can be carried out through personal interviews and, where
possible, through telephone interviews.

a) Personal interviews: The personal interview method requires a person, known as the
interviewer, to ask questions of other persons, generally in face-to-face contact.
At times the interviewee may also ask certain questions and the interviewer responds
to these, but usually the interviewer initiates the interview and collects the
information.

Merits:

• More information, and in greater depth, can be obtained
• The interviewer can overcome resistance through personal skill
• Personal information can also be obtained easily under this method

Demerits:

• It is a very expensive method, especially when a large and geographically widespread
sample is taken.
• This method is relatively more time consuming, especially when the sample is large.

b) Telephone interviews: This method of collecting information consists of contacting
respondents by telephone. It is not a very widely used method, but it plays an
important part in surveys of individuals, particularly in developed regions.

Merits:

• It is more flexible than the mailing method
• It is faster than other methods
• No field staff is required

Demerits:

• Surveys are restricted to respondents who have telephone facilities
• Questions have to be short and to the point; probes are too difficult to handle

3. Collection of data through questionnaires: This method is widely used by private individuals,
research workers, private and public organizations and even governments. A questionnaire
is sent (usually by post or email) to the persons concerned with a request to
answer the questions and return it. Respondents are expected to read and understand
the questions and write their replies in the space provided on the questionnaire itself.
They have to answer the questions on their own.

Merits:

• The cost is low even when the universe is large and geographically widespread
• It is free from interviewer bias; answers are in the respondents' own words.

Demerits:

• It can be used only when respondents are educated and cooperative
• It is difficult to know whether willing respondents are truly representative.
• This method is likely to be the slowest of all.

4. Collection of data through schedules: This method is very much like the
collection of data through questionnaires. The small difference is that schedules
(proformas containing a set of questions) are filled in by enumerators who are specially
appointed for the purpose. The enumerators go to the respondents with the schedules, put
the questions to them in the order listed in the proforma, and record the
replies in the space provided.

Merits:

• The data are collected accurately and to the point
• The non-response rate is minimal

Demerits:

• When the sample area is large, this method is very time consuming
• It is costly because enumerators with schedules must be maintained
across a wide geographic area

5. Some other methods of data collection:

i. Warranty cards: Warranty cards are usually postcard-sized cards used by
dealers of consumer durables to collect information regarding their products. The
information sought is printed in the form of questions on the warranty card, which is
placed inside the package along with the product, with a request to the consumer to fill
in the card and post it back to the dealer.

ii. Use of mechanical devices: Mechanical devices have been widely used to
collect information by indirect means. The eye camera, pupilometric camera,
motion picture camera and audiometer are the principal devices developed so far. They are
commonly used by modern big business houses, mostly in the developed world, for the
purpose of collecting the required information.

iii. Projective techniques: Projective techniques (sometimes called
indirect interviewing techniques) for the collection of data have been developed by
psychologists to use the projections of respondents for inferring underlying
motives.

Collection of Secondary Data: Secondary data may be either published or unpublished
data.

1. Collection of published data:

• Various publications of the central, state and local governments
• Various publications of foreign governments
• Various publications of UN organizations
• Technical and trade journals
• Books, magazines and newspapers
• Reports and publications of various associations connected with business and
industry, banks, stock exchanges etc.
• Reports prepared by research scholars, universities, economists etc.
• Public records and statistics, historical documents and other sources of
published information.

2. Collection of unpublished data: Unpublished data may be found in:

• Diaries
• Letters
• Unpublished biographies
• Autobiographies

Applications of statistical tools in various processing stages of textile production: Statistical
tools have various applications in the textile sector. These are discussed below.

1. Yarn Production: Cotton yarn production involves several stages. When
fibers are mixed and processed through the blow room, within-lap and between-lap variations are
studied by computing the mean, SD and CV, while lap rejection and production control are studied
with p and x̄ charts. Averages are used to find the hank of sliver in carding, draw frame and
combing, the average hank of roving in the roving frame, and the average count at the ring frame.
A spinning mill generally uses the 'average count' as its count specification if it is producing
4–5 counts. The weaving section, on the other hand, uses the 'resultant count', which is
the harmonic mean of the counts produced. Control charts are used extensively in every
process of yarn production (for example, process control with respect to thin places,
neps, etc.). Probability distributions such as the Poisson, Weibull and binomial are very
useful for various problems in spinning, in particular for understanding end
breakage. In the ring spinning section, several ring bobbins are collected and tested for
CSP, and the differences between and within bobbins are studied using the 'range'
method. In the cone winding section, process control can be checked using either a control
chart for averages or a chart for the number of defectives.

2. Fabric Production: Designs of experiments such as the Latin square design or randomized
block design can be used to identify the effect of different size ingredients on warp breakages
on different looms in fabric formation. Most suiting fabric constructions involve the
use of double yarn, whose count is obtained from the harmonic mean of the component counts.
Poisson and normal distributions can be applied to warp breakages in the loom shed. Using
statistical techniques, the interference loss in the loom shed can also be studied. Various weaving
parameters such as loom speed, reed and pick can be correlated with the corresponding fabric
properties and interpreted in terms of loom parameters. Control charts are also used to study
the control of process and product quality in fabric production. For example, np and p charts
are used for the selection of defective cones in pirn winding from a lot (fixed population) or
in a production shift. The width of the cloth and its control can be understood with x̄ charts,
and defects per unit length and their control with c charts. The testing process
includes determination of the average tensile strength (and single thread strength) and the
corresponding CV%.

3. Chemical Processing and Garment Production: The scope of statistics here is very wide. For
example, the effect of n washes (under identical conditions) on a particular property of m
fabrics can be found by either tests of significance or analysis of variance.
Similarly, the effect of different detergents on fabric types can be investigated by two-way
analysis of variance, and the effect of sewing conditions on different types of fabrics
can be studied by ANOVA. In garment production, the control of measurements and their
distribution can be well understood with control and polar charts.

4. Fiber Production: Measures of central tendency such as the process average give an idea of
the average staple length of fiber produced in a continuous or batch-wise process. The coefficient
of variation (CV) of the process indicates the degree of process control. Analysis
of time series is helpful in estimating future production based on past records.
Measures of dispersion such as the standard deviation and CV are useful in comparing the
performance of two or more fiber-producing units or processes. Significance tests can also be
applied to investigate whether a significant difference exists between batches in their means or
standard deviations. Analysis of variance can be applied to study the effect of the parameters
of fiber production and of methods of polymer dissolving.

5. Textile Testing of Fiber, Yarn and Fabrics: Analysis of results in textile testing is
meaningless without the application of statistical tools. In other words, every experiment in
textile testing includes the use of statistical tools such as calculation of the average,
computation of SD and CV, and application of tests of significance (t-test, z-test and F-test)
or analysis of variance (one-way, two-way or design of experiments). Populations can be studied
well with the normal, binomial or Poisson distributions. Random sampling errors are used in studying
the population mean and SD at the 95% and 99% levels of confidence. The geometric mean,
used to find the overall flexural rigidity (Go), plays an important role in
fabric selection for garment manufacture.

Special mention should be made of the determination of fiber length by the Baer sorter, where all
the measures of central tendency and dispersion (mean length, modal length, quartile deviation,
etc., in the form of a frequency distribution) are computed to assess the cotton
sample under consideration for its potential in yarn manufacture. The Balls sledge sorter,
on the other hand, uses a weight distribution from which the mean and SD are computed. In the case
of cotton fibres, the development of cell wall thickening, commonly referred to as "maturity",
can be determined well using the normal distribution and confidence intervals.
Several properties are tested for different packages produced from the same material or from
the same frame by applying significance tests. The effect of instruments and variables for
different types of samples can be studied well using ANOVA. Fabric
properties tested on a single instrument or on different instruments can be understood using
design of experiments. In one research application, the low-stress mechanical properties of
nearly 1000 fabrics were studied by principal component analysis (biplot).
Measures of dispersion such as the coefficient of variation and percentage mean
deviation are widely used in evenness measurement.

Variable: A measurable characteristic which can vary within a certain range. Example:
family size.

Types of variable: There are mainly two types of variable.

1. Qualitative – Religion, color of fabric etc.
2. Quantitative – No. of students in ten departments of BUTEX, time etc.

A quantitative variable may be subdivided into two categories.

a) Discrete – Size of collar
b) Continuous – Weight of students in a class

Measurement scale: There are four types of measurement scale.

i. Nominal – Class room no. of a department
ii. Ordinal – Floor no. of a multistoried building
iii. Interval – Temperature
iv. Ratio – Height of a tree

Frequency Distribution
Definition: A frequency distribution is a listing of a data set which divides the data into different
classes and gives a count of the number of observations in each class.

According to Croxton and Cowden, "Frequency distribution is a statistical table which shows
the set of all distinct values of the variables arranged in order of magnitude, either
individually or in groups with their corresponding frequencies side by side."

Example: The profits (in Tk. lakhs) of 30 Bangladeshi companies for the year 2016-2017
are given below:

28 16 23 37 35 49 63 65 55 45
58 57 69 30 32 35 42 37 42 48
53 49 65 39 48 67 25 29 58 40

Classify the above data taking a suitable class interval.

Solution: Here,
Highest value, HV = 69
Lowest value, LV = 16
Range, R = HV - LV = 69 - 16 = 53
Total number of observations, N = 30

Let us determine the suitable class interval with the help of the following formula:

$$ i = \frac{\text{Range}}{1 + 3.322 \log N} = \frac{53}{1 + 3.322 \log 30} = 8.97 \approx 9 $$

Since values like 3, 7, 9 etc. should be avoided as class intervals, we take 10 as the class
interval and 15-25 as the first class.

Frequency Distribution of the profits:

Profit (Tk. lakhs)    Tally        No. of companies
15-25                 II           2
25-35                 IIIII        5
35-45                 IIIII III    8
45-55                 IIIII I      6
55-65                 IIIII        5
65-75                 IIII         4
Total                              30
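
The steps of this example can be checked with a short script. This is a minimal sketch, not part of the original lecture: the variable names are illustrative, and it assumes exclusive classes (lower limit included, upper limit excluded), which is the reading under which the table's counts are reproduced.

```python
import math

profits = [28, 16, 23, 37, 35, 49, 63, 65, 55, 45,
           58, 57, 69, 30, 32, 35, 42, 37, 42, 48,
           53, 49, 65, 39, 48, 67, 25, 29, 58, 40]

n = len(profits)
data_range = max(profits) - min(profits)

# Class-interval rule used in the text: i = Range / (1 + 3.322 log10 N)
suggested_width = data_range / (1 + 3.322 * math.log10(n))

# Rounded choices made in the text: width 10, first class 15-25
width, start = 10, 15
classes = [(lo, lo + width) for lo in range(start, max(profits) + 1, width)]

# Assumption: a value equal to an upper limit belongs to the next class
freq = {c: sum(c[0] <= x < c[1] for x in profits) for c in classes}

print(f"suggested width = {suggested_width:.2f}")
for (lo, hi), f in freq.items():
    print(f"{lo}-{hi}: {f}")
```

Changing `width` and `start` shows how the shape of the table depends on the chosen class interval.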

Presentation of frequency distribution: There are two types of presentation of a frequency
distribution. They are:
1. Graphical presentation
2. Diagrammatic presentation

The histogram, frequency polygon, frequency curve and ogive are graphical
presentations; the pie-chart and bar diagram are diagrammatic presentations.

Histogram: A histogram, or column diagram, is a suitable graph for presenting the frequency
distribution of a continuous series.

In a histogram, the class intervals and their corresponding frequencies are
plotted along the horizontal and vertical axes respectively.

Histograms are drawn in two ways: i) histograms having equal class intervals
and ii) histograms having unequal class intervals.

Example: The marks of 100 students of a statistics course are given. Represent the data by a
histogram (having equal class intervals).

Marks No of students
30 - 40 8
40 - 50 12
50 - 60 30
60 - 70 25
70 - 80 15
80 - 90 10
Total 100

Example: The marks of 100 students of a statistics course are given. Represent the data by a
histogram (having unequal class intervals).

Marks No of students
30 - 40 8
40 - 60 42
60 - 70 25
70 - 80 15
80 - 90 10
Total 100
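
When class intervals are unequal, bar heights are conventionally drawn in proportion to the frequency density (frequency divided by class width), so that the area of each bar represents its frequency. The lecture does not spell this step out; the sketch below assumes that standard convention and uses illustrative variable names.

```python
# (lower limit, upper limit, frequency) for the unequal-interval example
classes = [(30, 40, 8), (40, 60, 42), (60, 70, 25), (70, 80, 15), (80, 90, 10)]

# Frequency density = frequency / class width (assumed convention)
densities = [f / (hi - lo) for lo, hi, f in classes]

for (lo, hi, f), d in zip(classes, densities):
    print(f"{lo}-{hi}: width={hi - lo}, frequency={f}, density={d:.2f}")

# The bar areas (width x density) recover the original frequencies
areas = [(hi - lo) * d for (lo, hi, _), d in zip(classes, densities)]
```

Note that the 40-60 class holds the most students (42) but is not the tallest bar once its double width is taken into account.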

Frequency polygon: The principle of drawing a frequency polygon is that the frequency of each
class interval is assumed to be spread evenly over the interval, so that the mid-point of the
class interval represents the class.

A frequency polygon can be drawn by joining the mid-points of the tops of the bars in a
histogram with straight lines.

Ogive curve: An ogive (pronounced O-jive) is a graph that represents the cumulative
frequencies for the classes in a frequency distribution; it is a continuous frequency curve.
To plot an ogive, we need the class boundaries and the cumulative frequencies. For grouped data,
the ogive is formed by plotting the cumulative frequency against the upper boundary of each class.
For ungrouped data, the cumulative frequency is plotted on the y-axis against the data values
on the x-axis.

Example: The marks of 100 students of a statistics course are given. Represent the data by a
frequency polygon and an ogive.

Marks No of students
30 - 40 8
40 - 50 12
50 - 60 30
60 - 70 25
70 - 80 15
80 - 90 10
Total 100
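
The plotting coordinates for both graphs follow directly from the definitions above. The sketch below only computes the points (class mid-points for the frequency polygon, upper boundaries with cumulative frequencies for the ogive) and leaves the drawing to any plotting tool; the variable names are illustrative.

```python
from itertools import accumulate

classes = [(30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90)]
freqs = [8, 12, 30, 25, 15, 10]

# Frequency polygon: (class mid-point, frequency)
polygon_points = [((lo + hi) / 2, f) for (lo, hi), f in zip(classes, freqs)]

# Ogive ("less than" type): (upper class boundary, cumulative frequency)
cumulative = list(accumulate(freqs))
ogive_points = [(hi, cf) for (_, hi), cf in zip(classes, cumulative)]

print(polygon_points)
print(ogive_points)
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the table.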

Pie-chart: The pie-chart, also known as a pie diagram, is a useful device for presenting categorical
data. Data other than categorical can also be used for constructing a pie-chart after
suitable and meaningful classification or grouping of the data.

The pie-chart consists of a circle sub-divided into sectors whose areas are proportional to the
various parts into which the whole quantity is divided.

Example: We are given the monthly expenditure of a family on different items. Construct a
suitable diagram to represent the expenditures.
Items Expenditure (Tk. thousand)
Food 30
Clothing 25
Residence 20
Education 20
Medicine 15
Others 10
Total 120
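
Because sector areas are proportional to the parts, each item's central angle is its share of the total multiplied by 360 degrees. A minimal sketch of that arithmetic for the expenditure table (the dictionary name is illustrative):

```python
expenditure = {
    "Food": 30, "Clothing": 25, "Residence": 20,
    "Education": 20, "Medicine": 15, "Others": 10,
}

total = sum(expenditure.values())

# Central angle of each sector in degrees: value / total * 360
angles = {item: value * 360 / total for item, value in expenditure.items()}

for item, angle in angles.items():
    print(f"{item}: {angle:.0f} degrees")
```

The angles must add up to 360 degrees, which serves as a check before drawing the sectors.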

Bar diagram: A bar diagram or bar graph is a diagram that presents categorical data
with rectangular bars whose heights or lengths are proportional to the values they represent.
The bars can be plotted vertically or horizontally. A vertical bar diagram is sometimes called
a column chart.

Example: The numbers of ODIs played by six Asian countries in 2017 are given. Draw a bar
diagram.
Country No. of ODI
Afghanistan 10
Bangladesh 15
Hong Kong 05
India 30
Pakistan 20
Sri Lanka 25

Measures of Central Tendency
Central Tendency: In a representative sample, the values of a series of data tend to
cluster around a certain point, usually at the center of the series. This tendency is called
central tendency, and its numerical measures are called the measures of central tendency.
Different Measures of Central Tendency: The following are the important measures of
central tendency which are generally used in business and industry.
1. Arithmetic mean
2. Geometric mean
3. Harmonic mean
4. Median
5. Mode
1. Arithmetic mean: The arithmetic mean, often simply referred to as the mean, is the total of the
values of a set of observations divided by the total number of observations. It is denoted by
AM or $\bar{X}$.

For Ungrouped Data: If $x_1, x_2, \ldots, x_n$ represent the values of n items or
observations, the arithmetic mean, denoted by $\bar{X}$, is defined as:

$$ \bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} $$

Example: The monthly income (in Tk.) of 10 persons working in a firm is as follows:
14870 14930 15020 14460 14750 14920 15720 15160 14680 14890
Find the average monthly income.
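
The ungrouped formula translates directly into code. A minimal sketch for the income example, with illustrative variable names:

```python
incomes = [14870, 14930, 15020, 14460, 14750,
           14920, 15720, 15160, 14680, 14890]

# Arithmetic mean = sum of observations / number of observations
total = sum(incomes)
mean_income = total / len(incomes)

print(f"Total = {total}, n = {len(incomes)}, mean = {mean_income}")
```

For this data the total is Tk. 149400, so the average monthly income is Tk. 14940.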
For Grouped Data: If $x_1, x_2, \ldots, x_n$ represent the values (or class mid-points) of the
observations, $f_1, f_2, \ldots, f_n$ are their respective frequencies and $N = f_1 + f_2 + \cdots + f_n$
is the total number of observations, then the arithmetic mean, denoted by $\bar{X}$, is defined as:

$$ \bar{X} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum_{i=1}^{n} f_i x_i}{N} $$

Example: The following are the figures of profits earned by 1400 companies during 2016-2017.

Profits (Tk. lakhs)   20-40   40-60   60-80   80-100   100-120   120-140   140-160
No. of companies      500     300     280     120      100       80        20

Calculate the average profit for all companies. (By both the Direct and Short-Cut Methods)
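
Both methods can be sketched in a few lines. The direct method applies the grouped formula to the class mid-points. The lecture names the short-cut method without defining it, so the version below is an assumption: it uses the common step-deviation form with an assumed mean A = 90 and class width h = 20, both chosen here purely for illustration.

```python
classes = [(20, 40), (40, 60), (60, 80), (80, 100),
           (100, 120), (120, 140), (140, 160)]
freqs = [500, 300, 280, 120, 100, 80, 20]

mids = [(lo + hi) / 2 for lo, hi in classes]
N = sum(freqs)

# Direct method: mean = sum(f * x) / N
direct_mean = sum(f * x for f, x in zip(freqs, mids)) / N

# Short-cut (step-deviation) method, assumed form:
#   d = (x - A) / h,  mean = A + h * sum(f * d) / N
A, h = 90, 20  # assumed mean and class width (illustrative choices)
d = [(x - A) / h for x in mids]
shortcut_mean = A + h * sum(f * di for f, di in zip(freqs, d)) / N

print(f"direct = {direct_mean:.2f}, short-cut = {shortcut_mean:.2f}")
```

Both routes give about Tk. 60.57 lakhs; the short-cut method only changes the arithmetic, not the result.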

Merits of AM:
• It is rigidly defined and easy to calculate.
• It is easy to understand and suitable for algebraic treatment.
• It is based on all the observations.
• Two or more sets of data can be compared very easily using their respective means.

Demerits of AM:
• It is very badly affected by extreme values.
• If there is an open-end class, the AM cannot be found.
• It cannot be used for qualitative data.

Uses of AM:
• It is widely used in almost all areas of economics, business, social science etc.
• It is extensively used for business forecasting and time series analysis.
• It is suitable for further algebraic treatment.
• It is easily used for comparing two or more sets of data.
2. Geometric mean: The geometric mean of n non-zero positive observations is the n-th root
of their product. It is usually denoted by GM or G.
For ungrouped data: Let $x_1, x_2, \ldots, x_n$ be non-zero positive observations in a series
of data. The geometric mean can then be written as:

$$ G = (x_1 \cdot x_2 \cdots x_n)^{1/n} $$

$$ G = \operatorname{Antilog}\left( \frac{\sum_{i=1}^{n} \log x_i}{n} \right) \quad \text{[by simplification]} $$

Example: Calculate the geometric mean from the following data:

6.5 169.0 11.0 112.5 14.2 75.0 35.5 215.0
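
The two forms of the formula (n-th root of the product, and antilog of the mean logarithm) can be compared numerically. This is a minimal sketch using base-10 logarithms; the variable names are illustrative.

```python
import math

data = [6.5, 169.0, 11.0, 112.5, 14.2, 75.0, 35.5, 215.0]
n = len(data)

# Form 1: n-th root of the product of the observations
gm_product = math.prod(data) ** (1 / n)

# Form 2: antilog of the mean of the logarithms (base 10)
mean_log = sum(math.log10(x) for x in data) / n
gm_log = 10 ** mean_log

print(f"GM (product form) = {gm_product:.2f}")
print(f"GM (log form)     = {gm_log:.2f}")
```

Both forms agree at roughly 42.7. The log form is the one to prefer for long series, since the raw product can overflow or underflow.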

For grouped data: If $x_1, x_2, \ldots, x_n$ are taken as the mid-points of the various classes,
$f_1, f_2, \ldots, f_n$ are the class frequencies in a frequency distribution and $N = \sum_{i=1}^{n} f_i$,
then the geometric mean G can be expressed as follows:

$$ G = \left( x_1^{f_1} \, x_2^{f_2} \cdots x_n^{f_n} \right)^{1/N} $$

$$ G = \operatorname{Antilog}\left( \frac{\sum_{i=1}^{n} f_i \log x_i}{N} \right) \quad \text{[by simplification]} $$

Example: Calculate the geometric mean for the following distribution.

Weight (in kg)   100-104   105-109   110-114   115-119   120-124
Frequency        24        30        45        65        72
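
The grouped formula can be evaluated with the log form. The sketch below assumes the class mid-points are the averages of the stated class limits (102, 107, and so on); the variable names are illustrative.

```python
import math

classes = [(100, 104), (105, 109), (110, 114), (115, 119), (120, 124)]
freqs = [24, 30, 45, 65, 72]

# Assumption: mid-point = average of the stated class limits
mids = [(lo + hi) / 2 for lo, hi in classes]
N = sum(freqs)

# G = antilog( sum(f * log x) / N ), using base-10 logarithms
weighted_log_sum = sum(f * math.log10(x) for f, x in zip(freqs, mids))
gm_grouped = 10 ** (weighted_log_sum / N)

# Arithmetic mean of the same distribution, for comparison
am_grouped = sum(f * x for f, x in zip(freqs, mids)) / N

print(f"GM = {gm_grouped:.2f} kg, AM = {am_grouped:.2f} kg")
```

The geometric mean comes out slightly below the arithmetic mean, as it always does for positive data that are not all equal.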

Merits of GM:
• It is not much affected by extreme values.
• It is suitable for arithmetic and algebraic treatment.
• It is less affected by sampling fluctuation.

Demerits of GM:
• If any $x_i = 0$, the geometric mean cannot be defined.
• If the product of the $x_i$ is negative, it cannot be defined.
• It is difficult to calculate.

Uses of GM:
• The geometric mean is used to calculate averages of ratios and percentages.
• It is also used for computing the average rate of increase or decrease.
• It is useful for the construction of index numbers.
3. Harmonic Mean: The harmonic mean of a set of n non-zero observations in a series is the
reciprocal of the arithmetic mean of the reciprocals of the observations. It is usually denoted
by HM.
For ungrouped data: Let $x_1, x_2, \ldots, x_n$ be non-zero observations in a series of data.
Then the harmonic mean is

$$ HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} $$

Example: Calculate the harmonic mean of the following series of monthly expenditure of a
batch of students:

Tk. 125 130 75 10 45 5 0.5 0.4 500 150
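
A minimal sketch of the ungrouped formula for this example follows. It also shows the arithmetic mean for contrast; the variable names are illustrative.

```python
expenditure = [125, 130, 75, 10, 45, 5, 0.5, 0.4, 500, 150]
n = len(expenditure)

# HM = n / sum(1 / x_i)
reciprocal_sum = sum(1 / x for x in expenditure)
hm = n / reciprocal_sum

# Arithmetic mean of the same data, for comparison
am = sum(expenditure) / n

print(f"sum of reciprocals = {reciprocal_sum:.4f}")
print(f"HM = {hm:.4f}, AM = {am:.2f}")
```

The harmonic mean is about Tk. 2.06, far below the arithmetic mean, because the two very small values (0.5 and 0.4) dominate the sum of reciprocals.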


For grouped data: If $x_1, x_2, \ldots, x_n$ are taken as the mid-points of the various classes,
$f_1, f_2, \ldots, f_n$ are the class frequencies in a frequency distribution and $N = \sum_{i=1}^{n} f_i$,
then the harmonic mean HM can be expressed as follows:

$$ HM = \frac{N}{\frac{f_1}{x_1} + \frac{f_2}{x_2} + \cdots + \frac{f_n}{x_n}} = \frac{N}{\sum_{i=1}^{n} \frac{f_i}{x_i}} $$

Example: The following data are the marks of 30 students:

Marks       0-10   10-20   20-30   30-40   40-50
Frequency   5      7       12      4       2

Calculate the harmonic mean.
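
The grouped formula can be sketched the same way, using class mid-points as the representative values. Variable names are illustrative.

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 7, 12, 4, 2]

mids = [(lo + hi) / 2 for lo, hi in classes]
N = sum(freqs)

# HM = N / sum(f_i / x_i)
weighted_reciprocal_sum = sum(f / x for f, x in zip(freqs, mids))
hm_grouped = N / weighted_reciprocal_sum

print(f"sum of f/x = {weighted_reciprocal_sum:.4f}")
print(f"HM = {hm_grouped:.2f} marks")
```

For this distribution the harmonic mean is about 14.25 marks.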

Merits of HM:
• It uses all the figures in the data, like the AM and GM.
• It gives more weight to smaller values, i.e. it has a downward bias.
• It measures rates of change and can be adapted to problems involving time and certain
types of rates and ratios.
• It is not much affected by a single large extreme value.

Demerits of HM:
• It is sometimes difficult to calculate, especially when the number of items is large.
• It is not very easy to understand.
• If any observation is zero, the harmonic mean cannot be defined.
• It assigns too much weight to the smaller items and has limited scope.

Uses of HM:
• The harmonic mean is used when the observations are expressed in terms of rates,
speeds, prices etc.

4. Median: The median is the middle-most observation when the observations or set of values of
a particular study are arranged in (ascending or descending) order of magnitude. That is, half
of the observations in a set of data are lower than it and half of the observations are greater
than it.
For ungrouped data: Let $x_1, x_2, \ldots, x_n$ be a series of n observations arranged in
order of magnitude. Then the median is

When n is odd: $$ Me = \left( \frac{n+1}{2} \right)\text{th observation} $$

When n is even: $$ Me = \frac{\left( \frac{n}{2} \right)\text{th observation} + \left( \frac{n}{2} + 1 \right)\text{th observation}}{2} $$

Example: From the following data of wages of 7 workers, compute the median wage.

Wages (in Rs.) 1600 1650 1580 1690 1660 1606 1640
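
The odd and even cases can be written as one small function. This is an illustrative sketch; the helper name `median` is not from the lecture, and Python's 0-based indexing means the k-th observation is `ordered[k - 1]`.

```python
def median(values):
    """Median of ungrouped data using the odd/even rule from the text."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)-th observation
        return ordered[(n + 1) // 2 - 1]
    # n even: average of the (n/2)-th and (n/2 + 1)-th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


wages = [1600, 1650, 1580, 1690, 1660, 1606, 1640]
median_wage = median(wages)
print(sorted(wages))
print(f"Median wage = Rs. {median_wage}")
```

With 7 workers the median is the 4th value of the sorted list, Rs. 1640.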

For grouped data: For grouped data the median is obtained from the following formula:

$$ Me = L + \frac{\frac{N}{2} - p.c.f}{f_m} \times c $$

Where,
L = Lower limit of the median class
N = Total number of observations
p.c.f = Cumulative frequency of the class preceding the median class
$f_m$ = Frequency of the median class
c = Class interval of the median class
Example: 1000 workers are working in an industrial establishment. Their ages are classified as
follows:
Age (yrs)        10-20   20-30   30-40   40-50   50-60
No. of workers   120     225     260     240     155
Calculate the median age.
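
The grouped formula can be sketched as a function that first locates the median class (the first class whose cumulative frequency reaches N/2) and then interpolates within it. The function name and argument layout are illustrative, not from the lecture.

```python
def grouped_median(classes, freqs):
    """Median of grouped data: Me = L + (N/2 - pcf) / f_m * c."""
    N = sum(freqs)
    pcf = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(classes, freqs):
        if pcf + f >= N / 2:          # this is the median class
            c = upper - lower         # class interval
            return lower + (N / 2 - pcf) / f * c
        pcf += f
    raise ValueError("frequencies must be positive")


ages = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
workers = [120, 225, 260, 240, 155]

median_age = grouped_median(ages, workers)
print(f"Median age = {median_age:.2f} years")
```

Here N/2 = 500 falls in the 30-40 class (cumulative frequencies 120, 345, 605), so Me = 30 + (500 - 345)/260 × 10, about 35.96 years.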

Merits of Me:
• The median is easy to calculate and understand
• It is not affected by extreme values
• It can be calculated when there is an open-end class
• It can also be located graphically

Demerits of Me:
• The median is not based on all the items of the series
• It cannot be calculated by a simple mathematical method
• It is not suitable for further algebraic and arithmetic treatment
• It is much affected by sampling fluctuation

Uses of Me:
• The median is very useful when observations are qualitative
• It is very helpful when there is an open-end class
• It is usually used for data sets with extreme values
• It is useful for comparing two or more sets of qualitative data

5. Mode: The mode is the observation which occurs most frequently in a data set. It is denoted
by Mo.
For ungrouped data: To determine the mode, count the number of times each value
occurs; the value which occurs the maximum number of times is the modal
value.
Example: Find the mode of the series: 2 5 2 5 7 8 5
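
Counting occurrences is all the ungrouped rule requires. The sketch below uses `collections.Counter` from the standard library and returns every value that attains the maximum count, so that bi-modal data are not silently reduced to one value; the variable names are illustrative.

```python
from collections import Counter

series = [2, 5, 2, 5, 7, 8, 5]

# Count how many times each value occurs
counts = Counter(series)
highest = max(counts.values())

# Every value that reaches the highest count is a mode
modes = sorted(value for value, c in counts.items() if c == highest)

print(dict(counts))
print(f"Mode(s): {modes}, occurring {highest} times")
```

In this series 5 occurs three times, more than any other value, so the mode is 5.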
For grouped data: In case of grouped data the following formula is used for calculating the mode

$Mo = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times C$

Where,
L = Lower limit of the modal class
$\Delta_1$ = The difference between the frequency of the modal class and that of the pre-modal class.
$\Delta_2$ = The difference between the frequency of the modal class and that of the post-modal class.
C = Size of the modal class.

Example: The following data relate to the sales of 100 companies:

Sales (Tk. lakhs)   20-40  40-60  60-80  80-100  100-120  120-140
No. of companies    14     25     30     15      10       6

Calculate the modal sales.
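
The grouped-mode formula can be sketched in Python as below. The function name `grouped_mode` is hypothetical, and the sketch assumes a single modal class with equal class widths; if all frequencies were equal the denominator would be zero and the formula would not apply.

```python
def grouped_mode(lower_limits, frequencies, class_width):
    """Mo = L + d1 / (d1 + d2) * C, taking the class with the highest frequency."""
    m = max(range(len(frequencies)), key=lambda k: frequencies[k])
    f_modal = frequencies[m]
    # a missing neighbouring class is treated as having frequency 0 (assumption)
    f_prev = frequencies[m - 1] if m > 0 else 0
    f_next = frequencies[m + 1] if m < len(frequencies) - 1 else 0
    delta1 = f_modal - f_prev
    delta2 = f_modal - f_next
    return lower_limits[m] + delta1 / (delta1 + delta2) * class_width


sales_lower = [20, 40, 60, 80, 100, 120]
companies = [14, 25, 30, 15, 10, 6]
modal_sales = grouped_mode(sales_lower, companies, class_width=20)
print(modal_sales)  # 65.0
```

The modal class is 60-80, so Mo = 60 + 5/(5 + 15) × 20 = Tk. 65 lakhs.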

Merits of Mo:
• Mode is easily understandable and commonly used
• It is not affected by extreme values
• It can be calculated when the distribution has open-end intervals
• It can also be located graphically

Demerits of Mo:
• Mode is indefinite
• The combined mode of two or more groups cannot be calculated
• It is not suitable for further algebraic and arithmetic treatment
• Mode is not clearly defined in case of bi-modal or multimodal distribution.

Uses of Mo:
• Many times we have data without any particular numerical values, for example,
observations on religion, economic status, sex etc. In these situations the mode is most
useful
• It is widely used in stock exchange, business, weather forecasting etc.

Comparison among Arithmetic mean, Median and Mode:

1. Arithmetic mean, median and mode are easy to understand and easy to
calculate
2. Arithmetic mean is based upon all observation but median and mode are not
3. Arithmetic mean is amenable to algebraic treatment but the median and mode
are not easy for algebraic treatment
4. Arithmetic mean cannot be calculated from the distribution with open class
but median and mode can be calculated from the distribution with open classes
5. Arithmetic mean is affected very much by extreme values, whereas median and
mode are not affected by extreme values.

Question: What should be kept in mind in the selection of an average?

Answer: The following two considerations should be kept in mind in the selection of an
average:

1. The type of data available: Are they badly skewed (avoid the mean), gappy around
the middle (avoid the median) or unequal in class-interval width (avoid the mode)?

2. The concept of the typical value required by the problem: Is a composite average of
all absolute or relative values needed (arithmetic mean or geometric mean)? Is a
middle value needed (median) or the most common value (mode)?

Question: “Which average should we use?” or “Which is the best average to be used?”

Answer: We cannot regard any of the measures of central tendency as the best in all
circumstances. As we face different situations and different purposes, no single
measure of central tendency can serve in every case. The reasons are discussed
below.

Arithmetic mean: The arithmetic mean is the most popular and widely used measure of
central tendency. But in the following cases arithmetic mean should not be used:

1. In distributions with open-end intervals
2. When an average rate of growth or change over a period of time is required
3. When the observations form a geometric progression, i.e. 1, 2, 4, 8, 16 etc.
4. When averaging rates (i.e. speed, fluctuations in the price of articles etc.)
5. When there are very large and very small values of observations, the arithmetic
mean would be seriously misleading on account of the undue influence of extreme
values.

Geometric mean: The geometric mean is typically used in averaging index numbers, rates of
change, ratios and other sets of data expressed in percentage form.

Harmonic mean: Harmonic mean is useful in problems in which values of a variable are
compared with a constant quantity of another variable, i.e. rates, time, distance covered within
a certain time and quantities purchased or sold per unit etc.

Median: The median is generally the best average in open-end grouped distributions. In case
of price distribution or income distribution very high or very low values would cause the
mean to be higher or lower than most “common values”. The median may be more
representative to use in describing the mass of the data.

Mode: Generally speaking, the significance of mode lies in the fact that it can be used to
describe qualitative data. The mode can be used in problems involving the expression of
preference where quantitative measurements are not possible. If we want to compare
consumer preferences for different kinds of products or different kinds of advertising, we can
compare the modal preferences expressed by different groups of people but we cannot
calculate the median or mean.

General Limitations of an Average:

1. Since an average is a single value representing a group of values, it must be properly
interpreted; otherwise, there is every possibility of jumping to a wrong conclusion. This
can be best illustrated with the help of a story. A person had to cross a river from one
bank to the other. He was not aware of the depth of the river, so he enquired of
another man, who told him that the average depth of the water was 160 cm. The man was
175 cm tall and he thought that he could very easily cross the river because all the time he
would be above the water level. So he started. In the beginning the water was
very shallow, but as he reached the middle the water was 500 cm deep and he lost his
life. The man drowned because he had the misconception that average depth means
uniform depth throughout. But it is not so. An average represents a group of values
and lies somewhere between the two extreme values.

2. At times an average may give absurd results. For example, if we are calculating
average size of a family we may get a value 4.8. But that is impossible as persons
cannot be in fractions. However we should remember that it is an average value
representing the entire group of families.
3. An average may give us a value that does not exist in the data. For example, the
arithmetic mean of 100, 300, 250, 50, 100 is 800/5 = 160, a value that does not exist in
the data.

4. Measures of central value fail to give us any idea about the formation of the series.
Two or more series may have the same central value but may differ widely in composition.
For example, observe the following two series:

Series A: 150 170 190 210 280

Series B: 300 500 20 78 102

In both series, the average $\bar{X}$ = 200.

Question: Describe the empirical relationship between mean, median and mode.

Answer: Now we shall describe the relationship among mean, median and mode for
symmetrical distribution and asymmetrical distribution.

For symmetrical distribution: The values of mean, median and mode coincide.

i.e. Mean = Mode = Median

For asymmetrical distribution:

a. Positively skewed: Mean > Median > Mode

b. Negatively skewed: Mode > Median > Mean

Karl Pearson has expressed the relationship as follows:

Mode = 3Median - 2Mean

Question: The mean depends on all the observations but the median does not. Explain.

Answer: As distinct from the arithmetic mean which is calculated from the value of every
observation in the series, the median is what is called a positional average. The term
‘position’ refers to the place of a value in a series. The place of the median in a series is such
that an equal number of observations lie on either side of it. For example, if the income of
five persons is Tk.1000, 1200, 1500, 1600, 1800 then the median income would be Tk. 1500.
Replacing either or both of the first two values with any numbers of 1500 or less,
and/or replacing either or both of the last two values with any numbers of 1500 or more,
would not affect the value of the median, which would remain 1500.

In contrast, in case of arithmetic mean, the change in value of single observation would cause
the value of the mean to be changed.

Weighted Arithmetic Mean (WAM): One of the limitations of the arithmetic mean is that it
gives equal importance to all the observations. But there are cases where the relative
importance of the different observations is not the same. When that is so, we compute the
weighted arithmetic mean.

The term ‘weight’ stands for the relative importance of the different observations. The
formula for computing the weighted arithmetic mean is

$\bar{X}_w = \frac{\sum WX}{\sum W}$

Where,
$\bar{X}_w$ = Weighted arithmetic mean
X = Variable
W = Weights attached to the variable X.

Weighted mean is especially useful in the problems relating to the construction of index
numbers and standardized birth and death rates.

Example: Suppose that the nearby Wendy’s Restaurant sold medium, large and Biggie-sized
soft drinks for $0.90, $1.25 and $1.50 respectively. Of the last 10 drinks sold, 3 were medium,
4 were large and 3 were Biggie-sized. Find the mean price of the last 10 drinks sold.

Solution: We know, $\bar{X}_w = \frac{\sum WX}{\sum W}$

$\bar{X}_w = \frac{3(\$0.90) + 4(\$1.25) + 3(\$1.50)}{3 + 4 + 3} = \frac{\$12.20}{10} = \$1.22$

Combined mean: If we have the arithmetic means and numbers of observations of two or more
related groups, we can compute the combined average of these groups by applying the
following formula.

$\bar{X}_c = \frac{N_1\bar{X}_1 + N_2\bar{X}_2}{N_1 + N_2}$

Where,
$\bar{X}_c$ = Combined mean of the two groups
$\bar{X}_1$ = Arithmetic mean of the first group
$\bar{X}_2$ = Arithmetic mean of the second group
$N_1$ = Number of observations in the first group
$N_2$ = Number of observations in the second group

If we have to find out the combined mean of three series, the above formula can be extended
as follows:

$\bar{X}_c = \frac{N_1\bar{X}_1 + N_2\bar{X}_2 + N_3\bar{X}_3}{N_1 + N_2 + N_3}$
Example: There are two branches of a company employing 100 and 80 persons respectively.
If arithmetic mean of the monthly salaries paid by two branches are tk.1570 and tk.1750
respectively, find the arithmetic mean of the salaries of the employees of the company as
a whole.

Solution: We should compute the combined mean.

Given, $N_1 = 100$, $\bar{X}_1 = 1570$, $N_2 = 80$ and $\bar{X}_2 = 1750$
The formula is

$\bar{X}_c = \frac{N_1\bar{X}_1 + N_2\bar{X}_2}{N_1 + N_2} = \frac{100 \times 1570 + 80 \times 1750}{100 + 80} = \frac{297000}{180} = 1650$
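
The weighted mean and the combined mean are the same computation: the combined mean is a weighted mean of the group means with the group sizes as weights. A short Python sketch, where the function name `weighted_mean` is an assumption chosen for illustration, reproduces both worked examples above.

```python
def weighted_mean(values, weights):
    """Sum of W*X divided by the sum of W."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Soft drink example: prices weighted by the number of drinks sold
mean_price = weighted_mean([0.90, 1.25, 1.50], [3, 4, 3])

# Combined mean example: branch means weighted by the number of employees
combined_salary = weighted_mean([1570, 1750], [100, 80])

print(round(mean_price, 2), combined_salary)  # 1.22 1650.0
```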

Theorem: For n non-zero positive observations, AM ≥ GM ≥ HM.

Example: Find the arithmetic mean, geometric mean and harmonic mean from the following
data.
Class 120-140 140-160 160-180 180-200 200-220 220-240
Frequency 23 35 60 40 25 17

and hence show that, AM > GM > HM.
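
A numerical check of the inequality for the table above can be sketched in Python. The sketch assumes the usual frequency-weighted forms for grouped data, with class mid-points standing in for the observations: AM = Σfx/N, GM = exp(Σf ln x / N) and HM = N / Σ(f/x).

```python
import math

midpoints = [130, 150, 170, 190, 210, 230]   # mid-points of 120-140, ..., 220-240
frequencies = [23, 35, 60, 40, 25, 17]
n_total = sum(frequencies)                    # N = 200

am = sum(f * x for x, f in zip(midpoints, frequencies)) / n_total
gm = math.exp(sum(f * math.log(x) for x, f in zip(midpoints, frequencies)) / n_total)
hm = n_total / sum(f / x for x, f in zip(midpoints, frequencies))

print(am, round(gm, 2), round(hm, 2))
print(am > gm > hm)  # True
```

The arithmetic mean is 35200/200 = 176; the geometric and harmonic means come out progressively smaller because the mid-points are not all equal.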

Example: The ages of workers are given below:

Age             20-24  25-29  30-34  35-39  40-44  45-49  50-54  55-59  60-64
No. of workers  40     60     200    180    150    110    175    60     25

Calculate the mean, median and mode from the above frequency distribution. Also comment
on the skewness.

Example: The expenditure of 1000 families is given as under:

Expenditure (Tk. in thousand)  40-60  60-80  80-100  100-120  120-140
No. of families                50     f1     500     f2       50

The median and mean for the distribution are both Tk. 88. Calculate the missing frequencies
f1 and f2.

Measures of Dispersion
A measure of central tendency represents only one of the important characteristics of a set of
data or series. From the measures of central tendency, we can get knowledge only about the
central values. They do not provide us the highest or lowest value i.e. the range or how the
data are scattered. For this kind of information about any series we have to use the measures
of dispersion.

A measure of dispersion serves two main purposes:

1. It is one of the most important quantities used to characterize a frequency distribution.

2. It affords a basis of comparison between two or more frequency distributions.

In the study of dispersion we observe that various distributions may have exactly the same
average but substantial differences in variability. For example, consider two students
who obtained the following marks in a certain examination.

Subject    Physics  Chemistry  Mathematics  Statistics
Student A  48       50         50           52
Student B  10       20         80           90

The two distributions are certainly not identical though their averages are the same (50 marks
each). The difference lies in the dispersion of their marks. From the above data it is apparent
that student A's marks are all at or near the average, whereas student B's marks are either
very high or very low.
Measure of dispersion: The measurement of the scatter of the values of a data set among
themselves is called a measure of dispersion or variation.
Different Measures of Dispersion: There are several methods of measuring dispersion.
These measures can be divided into two groups:
1. Absolute measure
2. Relative measure

Absolute measure: Absolute measures of variation are expressed in the same statistical unit
in which the original data are given such as rupees, kilograms, tonnes etc. These values may be
used to compare the variation in two or more than two distributions provided the variables are
expressed in the same units and have almost the same average value.

Following are the absolute measures of variation or dispersion.


1. Range
2. Quartile Deviation
3. Mean Deviation
4. Standard Deviation
Relative Measures: A measure of relative variation is the ratio of a measure of absolute
variation to an average. Relative measures may also be used to compare the relative accuracy
of data.

Following are the relative measures of variation:
a) Co-efficient of Range
b) Co-efficient of Quartile Deviation
c) Co-efficient of Mean Deviation
d) Co-efficient of Variation

Range: The range of a set of observations is the difference between two extreme values, i.e.
the difference between the maximum and minimum values. Therefore, it indicates the limits
within which all the observations fall. Thus, Range = Highest Value – Lowest Value

For ungrouped data: Let us consider a set of observations $x_1, x_2, \dots, x_n$ in which H
is the maximum and L is the minimum. Then range, R = H - L

Example: Find out the range of the set of observations -9, -3, 0, 13, 4

For grouped data: In this case, the range is the difference between the upper boundary of
the highest class and the lower boundary of the lowest class. That is, $R = X_H - X_L$

Example: Determine the range from the following frequency distributions.


Salary (Tk.) 1500-1600 1600-1700 1700-1800 1800-1900 1900-2000
No. of workers 150 250 300 200 100

Uses of Range:
• It is time saving and widely used in industrial quality control and weather forecasting.
• Variations in stock exchange can be studied by range.

Quartile Deviation: The quartile deviation (Q.D.) is another type of range obtained from the
quartiles. It is obtained by dividing the difference between the upper quartile (Q3) and the lower
quartile (Q1) by 2. Therefore,

$Q.D. = \frac{\text{Upper quartile} - \text{Lower quartile}}{2} = \frac{Q_3 - Q_1}{2}$

The term (Q3 – Q1) is known as the interquartile range and the quartile deviation $\frac{Q_3 - Q_1}{2}$ is
also known as the semi-interquartile range.

Quartiles: Quartiles are those values in a set of observations arranged in order of magnitude
(ascending or descending) which divide the total observations into four equal parts.

For ungrouped data: Let $x_1, x_2, \dots, x_n$ be a series of n observations arranged in
order of magnitude. Then the quartiles are

When n is odd, $Q_i = \frac{i(n+1)}{4}$th observation ; i = 1, 2, 3.

When n is even, $Q_i = \frac{\left(\frac{in}{4}\right)\text{th observation} + \left(\frac{in}{4}+1\right)\text{th observation}}{2}$ ; i = 1, 2, 3.

Example: The Automobile Association checks the prices of gasoline before many holiday
weekends. Listed below are the self-service prices for a sample of 8 retail outlets during the
May 2014 Memorial Day Weekend.

40 22 60 30 45 66 70 55
Determine the quartile deviation.

For grouped data: For grouped data the quartiles are obtained from the following formula,

$Q_i = L + \frac{\frac{iN}{4} - p.c.f}{f_q} \times c$ ; i = 1, 2, 3

Where,
L = Lower limit of the quartile class
p.c.f = Cumulative frequency of the class preceding the quartile class
f_q = Frequency of the quartile class
c = Class interval of the quartile class

Example: Calculate the quartile deviation from the following data:

Class      Less than 10  10-15  15-20  20-25  25-30  30 or more
Frequency  2             6      7      10     3      1
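
The grouped quartile formula can be sketched in Python for the table above. The function name `grouped_quartile` is hypothetical. The first class is open-ended, so its lower limit is entered as `nan` (an assumption made for this sketch); this is harmless here because neither Q1 nor Q3 falls in an open class.

```python
import math


def grouped_quartile(i, lower_limits, frequencies, class_width):
    """Q_i = L + (i*N/4 - p.c.f) / f_q * c for i = 1, 2, 3."""
    total = sum(frequencies)       # N
    position = i * total / 4       # i*N/4
    cumulative = 0                 # cumulative frequency before the current class
    for lower, freq in zip(lower_limits, frequencies):
        if cumulative + freq >= position:
            return lower + (position - cumulative) / freq * class_width
        cumulative += freq
    raise ValueError("empty frequency table")


lower_limits = [math.nan, 10, 15, 20, 25, 30]   # "Less than 10" has no lower limit
frequencies = [2, 6, 7, 10, 3, 1]

q1 = grouped_quartile(1, lower_limits, frequencies, class_width=5)
q3 = grouped_quartile(3, lower_limits, frequencies, class_width=5)
quartile_deviation = (q3 - q1) / 2
print(q1, q3, quartile_deviation)  # 14.375 23.375 4.5
```

With N = 29, Q1 lies in the class 10-15 and Q3 in the class 20-25, giving Q.D. = (23.375 - 14.375)/2 = 4.5.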

Uses of Q.D.:
• The quartile deviation can be used as a rough measure of dispersion which is superior
to range

Mean Deviation: Mean deviation is the mean of absolute deviations of the items from an
average like mean, median or mode. Normally, we consider the arithmetic mean as the
average.
For ungrouped data: If $x_1, x_2, \dots, x_n$ is a set of n observations or values, then the
mean deviation is defined as:

i. Mean Deviation about mean, $M.D.(\bar{x}) = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$ ; $\bar{x}$ = Arithmetic Mean

ii. Mean Deviation about median, $M.D.(Me) = \frac{\sum_{i=1}^{n} |x_i - Me|}{n}$ ; Me = Median

iii. Mean Deviation about mode, $M.D.(Mo) = \frac{\sum_{i=1}^{n} |x_i - Mo|}{n}$ ; Mo = Mode

Example: Find out the mean deviation about mean / median / mode from the following data:
6 9 4 2 11 4 7 14 4
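
The three mean deviations for this data set can be sketched in Python using the standard `statistics` module for the averages. The helper name `mean_deviation` is an assumption made for illustration.

```python
import statistics


def mean_deviation(values, center):
    """Mean of the absolute deviations of the values from a chosen average."""
    return sum(abs(x - center) for x in values) / len(values)


data = [6, 9, 4, 2, 11, 4, 7, 14, 4]

md_mean = mean_deviation(data, statistics.mean(data))      # about the mean 61/9
md_median = mean_deviation(data, statistics.median(data))  # about the median 6
md_mode = mean_deviation(data, statistics.mode(data))      # about the mode 4

print(round(md_mean, 3), md_median, round(md_mode, 3))  # 3.086 3.0 3.222
```

The mean deviation about the median (3.0) is the smallest of the three, which is a general property of the median.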

For grouped data: If $x_1, x_2, \dots, x_n$ occur with frequencies $f_1, f_2, \dots, f_n$
respectively, then the mean deviation can be written as:

i. Mean Deviation about mean, $M.D.(\bar{x}) = \frac{\sum_{i=1}^{n} f_i |x_i - \bar{x}|}{N}$ ; $\bar{x}$ = Arithmetic Mean

ii. Mean Deviation about median, $M.D.(Me) = \frac{\sum_{i=1}^{n} f_i |x_i - Me|}{N}$ ; Me = Median

iii. Mean Deviation about mode, $M.D.(Mo) = \frac{\sum_{i=1}^{n} f_i |x_i - Mo|}{N}$ ; Mo = Mode

Example: The weights of 100 students of level 2 term 2 are given below:

Weight (Kg)      45-50  50-55  55-60  60-65  65-70  70-75
No. of students  5      10     25     30     20     10

Compute mean deviation from mean, median and mode.

Uses of M.D.:
• It is used in certain economic and anthropological studies.
• It is often sufficient when an informal measure of dispersion is required.
Standard Deviation: The standard deviation is the positive square root of the mean of the
squared deviations of a set of values from their mean. It is generally denoted by σ or SD.
For ungrouped data: Let $x_1, x_2, \dots, x_n$ be a set of values; then their standard
deviation can be calculated as:

By direct method, $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$

$\sigma = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2}$ [by simplification]

By short-cut method, $\sigma = \sqrt{\frac{\sum_{i=1}^{n} d_i^2}{n} - \left(\frac{\sum_{i=1}^{n} d_i}{n}\right)^2}$ ; Here, $d_i = x_i - A$ ; A = Assumed Mean

Example: A sample of households that subscribe to the Grameen Phone company revealed
the following numbers of calls received last week. Determine the standard deviation of
number of calls received: 16 12 14 15 18
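
The direct and short-cut methods can be compared on this data with a short Python sketch. The function names and the assumed mean A = 14 are choices made for illustration; any assumed mean gives the same result.

```python
import math


def sd_direct(values):
    """Square root of the mean squared deviation from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def sd_shortcut(values, assumed_mean):
    """Short-cut method using deviations d_i = x_i - A from an assumed mean A."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    return math.sqrt(sum(di ** 2 for di in d) / n - (sum(d) / n) ** 2)


calls = [16, 12, 14, 15, 18]
sigma_direct = sd_direct(calls)
sigma_shortcut = sd_shortcut(calls, assumed_mean=14)
print(sigma_direct, sigma_shortcut)  # 2.0 2.0
```

The mean is 15, the squared deviations sum to 20, and so σ = √(20/5) = 2 calls. Note that these formulas divide by n, as in the text, not by n - 1.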
For grouped data: If $x_1, x_2, \dots, x_n$ occur with frequencies $f_1, f_2, \dots, f_n$
respectively, then the standard deviation can be calculated as:

By direct method, $\sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{N}}$ ; Here, $N = \sum_{i=1}^{n} f_i$

$\sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{N} - \left(\frac{\sum_{i=1}^{n} f_i x_i}{N}\right)^2}$ [by simplification]

By short-cut method, $\sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i d_i^2}{N} - \left(\frac{\sum_{i=1}^{n} f_i d_i}{N}\right)^2} \times C$ ; Here, $d_i = \frac{x_i - A}{C}$ ;

A = Assumed Mean, C = Class interval

Example: The profits of a sample of 60 companies are listed below:

Profits (Lakh Tk)  30-34  35-39  40-44  45-49  50-54  55-59
No. of companies   7      15     20     10     5      3

Calculate the standard deviation.

Uses of S.D.:
• It is useful for calculating the skewness, co-efficient of variation and so forth.
• It measures the consistency of data.
• It is used to compute the standard normal variate.
• It is used for testing the reliability of measures calculated from samples.

Co-efficient of Variation: Co-efficient of variation is 100 times the ratio of the standard
deviation (σ) to the arithmetic mean ($\bar{x}$). It is denoted by C.V. and written as:

$C.V. = \frac{\sigma}{\bar{x}} \times 100$ ; $\bar{x} \neq 0$
Example: Compute the co-efficient of variation from the following data:

Monthly Income   3000-6000  6000-9000  9000-12000  12000-15000  15000-18000  18000-21000
No. of families  25         30         40          22           18           5

Example: The run scores of two cricketers for 10 innings are given below:

A  114  45  0   31  75  102  198  8   0   7
B  15   25  18  30  11  4    23   21  31  22

Which of the two is the more consistent batsman?
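
Consistency can be compared through the co-efficient of variation: the batsman with the smaller C.V. is the more consistent. A minimal Python sketch, with the hypothetical helper name `coefficient_of_variation` and the standard deviation computed with divisor n as in the formulas above:

```python
import math


def coefficient_of_variation(values):
    """C.V. = (standard deviation / arithmetic mean) * 100, with divisor n."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sd / mean * 100


runs_a = [114, 45, 0, 31, 75, 102, 198, 8, 0, 7]
runs_b = [15, 25, 18, 30, 11, 4, 23, 21, 31, 22]

cv_a = coefficient_of_variation(runs_a)
cv_b = coefficient_of_variation(runs_b)
print(round(cv_a, 2), round(cv_b, 2))
print("B" if cv_b < cv_a else "A", "is more consistent")
```

Batsman A has the higher mean (58 against 20) but a C.V. of roughly 106%, while B's C.V. is roughly 40%, so B is the more consistent of the two.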

Uses of C.V.:
• It is suitable when two or more distributions are in different units.
• It is suitable when the means of the distributions are widely different although the units are the same.

Moments, Skewness and Kurtosis
Moments: Moments are popularly used to describe the characteristics of a distribution. The
Greek letter µ (read as mu) is generally used to denote the moments. There are two types of
moments.
1. Central moments
2. Raw moments

For ungrouped data: The r-th moment of a variable X about the arithmetic mean $\bar{X}$ is given
by

$\mu_r = \frac{1}{n}\sum (X - \bar{X})^r$

The r-th moment of a variable X about any arbitrary point A is given by

$\mu_r' = \frac{1}{n}\sum (X - A)^r$

For grouped data: The r-th moment of a variable X about the arithmetic mean $\bar{X}$ is given by

$\mu_r = \frac{1}{N}\sum f(X - \bar{X})^r$

The r-th moment of a variable X about any arbitrary point A is given by

$\mu_r' = \frac{1}{N}\sum f(X - A)^r$

For different values of r, we shall get different moments. Thus if r = 1, we will get the first
moment, if we put r = 2, we will get the second moment and so on.
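Moments about mean: Moments about the mean are also called central moments.

For ungrouped data:

$\mu_1 = \frac{1}{n}\sum (X - \bar{X})$, $\mu_2 = \frac{1}{n}\sum (X - \bar{X})^2$

$\mu_3 = \frac{1}{n}\sum (X - \bar{X})^3$, $\mu_4 = \frac{1}{n}\sum (X - \bar{X})^4$

For grouped data:

$\mu_1 = \frac{1}{N}\sum f(X - \bar{X})$, $\mu_2 = \frac{1}{N}\sum f(X - \bar{X})^2$

$\mu_3 = \frac{1}{N}\sum f(X - \bar{X})^3$, $\mu_4 = \frac{1}{N}\sum f(X - \bar{X})^4$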

Note: The first moment about the origin gives the mean, the second central moment gives the
variance, the third central moment indicates Skewness and the fourth central moment indicates kurtosis.

Finding Central moments from moments about Arbitrary point: With the help of the following
relationships, moments about an arbitrary point can be converted to moments about the mean:

$\mu_1 = 0$

$\mu_2 = \mu_2' - (\mu_1')^2$

$\mu_3 = \mu_3' - 3\mu_1'\mu_2' + 2(\mu_1')^3$

$\mu_4 = \mu_4' - 4\mu_1'\mu_3' + 6(\mu_1')^2\mu_2' - 3(\mu_1')^4$
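
These conversion relationships can be checked numerically. The Python sketch below computes the moments about an arbitrary point A, converts them, and compares the result with central moments computed directly. The function names, the sample data and the choice A = 5 are assumptions made only for illustration.

```python
import math


def moment_about(values, r, point):
    """r-th moment of ungrouped data about a given point."""
    return sum((x - point) ** r for x in values) / len(values)


def central_from_raw(m1, m2, m3, m4):
    """Convert moments about an arbitrary point into central moments."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    return 0.0, mu2, mu3, mu4


data = [6, 9, 4, 2, 11, 4, 7, 14, 4]
A = 5  # arbitrary point (hypothetical choice)

raw = [moment_about(data, r, A) for r in (1, 2, 3, 4)]
converted = central_from_raw(*raw)

mean = sum(data) / len(data)
direct = [moment_about(data, r, mean) for r in (1, 2, 3, 4)]

for c, d in zip(converted, direct):
    print(round(c, 4), round(d, 4))
```

Each printed pair should agree up to floating-point rounding, whatever value of A is chosen.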

Skewness: The term “Skewness” refers to lack of symmetry or departure from symmetry, i.e.,
when a distribution is not symmetrical (or is asymmetrical) it is called a skewed
distribution. The measures of Skewness indicate the difference between the manner in which
the observations are distributed in a particular distribution and that of a symmetrical (or
normal) distribution.
In a symmetrical distribution the values of mean, median and mode are equal and the two
tails are of equal length. In a skewed distribution the values differ.

Figure: Symmetrical Distribution

There are two types of Skewness.


i. Positive
ii. Negative
Positive Skewness: If the value of mean is greater than the mode, Skewness is said to be
positive. Hence the right tail is longer than the left tail. In this case, mean > median > mode.

Figure: Positively Skewed Distribution
Negative Skewness: If the value of mode is greater than mean, Skewness is said to be
negative. Hence the left tail is longer than the right tail. In this case, mean < median < mode.

Figure: Negatively Skewed Distribution

Coefficients of Skewness: For comparing two series, we do not calculate the absolute
measures but the relative measures, called the co-efficients of Skewness, which are
pure numbers, i.e., independent of the units of measurement. The following are two important
methods of measuring relative Skewness:
1. Karl Pearson’s coefficient of Skewness: Karl Pearson’s coefficient of Skewness or
Pearsonian coefficient of Skewness is given by the formula:

$Sk_P = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$

2. Coefficient of Skewness based on moments: The coefficient of Skewness based on
moments is given by

$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$

The value of $\beta_1$ will be zero for a perfectly symmetrical distribution. Instead of $\beta_1$, Karl
Pearson suggested $\sqrt{\beta_1}$ to be used as a measure of skewness, where

$\gamma_1 = \sqrt{\beta_1} = \frac{\mu_3}{\sqrt{\mu_2^3}}$

Example: The following data relate to the marks of 40 students:

Marks               0-5  5-10  10-15  15-20  20-25  25-30
Number of Students  3    5     10     12     6      4

Calculate the coefficient of Skewness.
Kurtosis: In Statistics, kurtosis refers to the degree of flatness or peakedness in the region
about the mode of a frequency curve. The degree of kurtosis of a distribution is measured
relative to the peakedness of a normal curve.
Karl Pearson, in 1905 introduced three broad patterns of peakedness which are illustrated in
the following diagram:
1. If a curve is more peaked than the normal curve, it is called “leptokurtic”.
2. An intermediate peaked curve which is neither flat topped nor peaked is known as
normal or “mesokurtic”.
3. If a curve is more flat-topped than the normal curve, it is called “platykurtic”.

Measures of Kurtosis: Kurtosis is measured by $\beta_2$ or $\gamma_2$,

where $\beta_2 = \frac{\mu_4}{\mu_2^2}$ and $\gamma_2 = \beta_2 - 3$.

For a normal or mesokurtic curve $\beta_2 = 3$ and $\gamma_2 = 0$.

For a leptokurtic curve $\beta_2 > 3$ and $\gamma_2 > 0$.

For a platykurtic curve $\beta_2 < 3$ and $\gamma_2 < 0$.


Example: From the following data calculate kurtosis.

Age (years)       15-25  25-35  35-45  45-55  55-65  65-75
No. of employees  8      22     30     25     10     5
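
The moment-based measures of Skewness and kurtosis for a grouped table can be sketched in Python as follows. Class mid-points stand in for the observations, and the function names are assumptions made for illustration. The sketch is applied to the age table above.

```python
import math


def central_moment(midpoints, frequencies, r):
    """r-th central moment of grouped data, using class mid-points."""
    n_total = sum(frequencies)
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n_total
    return sum(f * (x - mean) ** r for x, f in zip(midpoints, frequencies)) / n_total


def shape_coefficients(midpoints, frequencies):
    """Return (beta1, gamma1, beta2, gamma2) from the central moments."""
    mu2 = central_moment(midpoints, frequencies, 2)
    mu3 = central_moment(midpoints, frequencies, 3)
    mu4 = central_moment(midpoints, frequencies, 4)
    beta1 = mu3 ** 2 / mu2 ** 3
    gamma1 = mu3 / math.sqrt(mu2 ** 3)
    beta2 = mu4 / mu2 ** 2
    gamma2 = beta2 - 3
    return beta1, gamma1, beta2, gamma2


age_midpoints = [20, 30, 40, 50, 60, 70]
employees = [8, 22, 30, 25, 10, 5]
beta1, gamma1, beta2, gamma2 = shape_coefficients(age_midpoints, employees)

kind = "leptokurtic" if gamma2 > 0 else "platykurtic" if gamma2 < 0 else "mesokurtic"
print(round(gamma1, 3), round(beta2, 3), kind)
```

A positive value of γ1 indicates positive Skewness and a negative value negative Skewness; the sign of γ2 places the curve in one of the three kurtosis patterns.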

Correlation and Regression
Correlation: Correlation is a statistical technique which measures and analyses the degree or
extent to which two or more variables fluctuate with reference to one another.
Correlation thus denotes the interdependence among variables. The degree is expressed by
a coefficient which ranges between -1 and +1. The direction of change is indicated by the + or –
sign.

Correlation thus expresses the relationship through a relative measure of change and it has
nothing to do with the units in which the variables are expressed.

Types of Correlation: Correlation is described or classified in several different ways. Two of
the most important are:
1. Simple correlation
2. Multiple correlation

Simple correlation: When only two variables are studied for calculating correlation then it is
known as simple correlation.
Example: The correlation between height and weight of a series.
Multiple correlation: When three or more variables are studied for calculating correlation it
is known as multiple correlation.
Example: If we consider the correlation between the academic results of a student and his/her
family size, family income then this problem can be termed as multiple correlation.

Different types of simple correlation: Simple correlation can be of five types on basis of the
nature of the interrelationship between two variables. Such as:
i. Perfect positive (direct) correlation
ii. Partial positive (direct) correlation
iii. Perfect negative (inverse) correlation
iv. Partial negative (inverse) correlation
v. Zero correlation
These are described below.
Perfect positive correlation: If the two variables change in the same direction, i.e. if an
increase (or decrease) in one variable results in a corresponding increase (or decrease) in the other,
and the rates of change of both variables are equal, then the correlation between
the two variables is called perfect positive correlation. In this case, the value of the
correlation coefficient between the two variables is 1, i.e. r = 1.
Example: The relation between radius and circumference is a perfect positive correlation.
Partial positive correlation: If the two variables change in the same direction, i.e. if an
increase (or decrease) in one variable results in a corresponding increase (or decrease) in the
other, but the rates of change of the two variables are not equal, then the relationship
between the two variables is called partial positive correlation. In this case, 0 < r < 1.
Example: The relation between income and expenditure is a partial positive correlation.
Perfect negative correlation: If the two variables change in opposite directions, i.e. if an
increase (or decrease) in one variable results in a corresponding decrease (or increase) in the other,
and the rates of change of both variables are equal, then the correlation between
the two variables is called perfect negative correlation. In this case, the value of the
correlation coefficient between the two variables is -1, i.e. r = -1.
Example: The relation between the pressure and the volume of a gas is perfect negative
correlation.
Partial negative correlation: If the two variables change in opposite directions, i.e. if an
increase (or decrease) in one variable results in a corresponding decrease (or increase) in the
other, but the rates of change of the two variables are not equal, then the relationship
between the two variables is called partial negative correlation. In this case, -1 < r < 0.
Example: The relation between price and demand of a commodity is partial negative
correlation.
Zero correlation: If the changes are independent, i.e. if an increase (or decrease) in one variable
results in no corresponding change in the other, then the relation between the two
variables is called zero correlation. In this case, r = 0.
Example: There is no correlation between the color of a sharee and the intelligence of the girl who wears it.

Methods of Determining Correlation: We shall consider the following most commonly used
methods.

1. Karl Pearson’s coefficient of correlation


2. Spearman’s Rank-correlation coefficient.

Karl Pearson’s coefficient of correlation: Of the several mathematical methods of
measuring correlation, Karl Pearson’s method, popularly known as the Pearsonian
coefficient of correlation, is the most widely used in practice. The coefficient of correlation is
denoted by the symbol r. If the two variables under study are X and Y, the following formula
suggested by Karl Pearson can be used for measuring the degree of relationship.

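$r_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left\{\sum (x - \bar{x})^2\right\}\left\{\sum (y - \bar{y})^2\right\}}}$

$r_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left\{\sum x^2 - \frac{(\sum x)^2}{n}\right\}\left\{\sum y^2 - \frac{(\sum y)^2}{n}\right\}}}$ [by simplification]

The value of the coefficient of correlation as obtained by the above formula shall always lie
between -1 and +1.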

Example: The following data show the ages of husbands and wives of 10 married couples.

Husband 36 72 37 36 51 50 47 50 37 41
Wife 35 67 33 35 50 46 47 42 36 41

Calculate the coefficient of correlation.

Example2: The following data consist of observations for the weights of 10 different
automobiles (in 1000 pounds) and the corresponding fuel consumptions (gallons per 100
miles).

Weight (x) 3.4 3.8 4.1 2.2 2.6 2.9 2.0 2.7 1.9 3.4
Fuel Consumption (y) 5.5 5.9 6.5 3.3 3.6 4.6 2.9 3.6 3.1 4.9

We would like to find out how y is correlated to x.
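
Both forms of Karl Pearson's formula can be sketched in Python and applied to the automobile data of Example 2. The function names are assumptions made for illustration; the two functions should return the same value up to rounding.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pearson_r_simplified(xs, ys):
    """The same coefficient from the simplified (raw sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    numerator = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    denominator = math.sqrt(
        (sum(x * x for x in xs) - sum_x ** 2 / n)
        * (sum(y * y for y in ys) - sum_y ** 2 / n)
    )
    return numerator / denominator


weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = pearson_r(weight, fuel)
r_simplified = pearson_r_simplified(weight, fuel)
print(round(r, 4), round(r_simplified, 4))  # 0.9766 0.9766
```

The value r ≈ 0.98 indicates a strong partial positive correlation: heavier automobiles tend to consume more fuel.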

Spearman’s Rank Correlation coefficient: The association between two series of ranks is
called rank correlation. The method of ascertaining the coefficient of correlation by ranks was
devised by Charles Edward Spearman in 1904. This method is especially useful when
the actual magnitudes or item values are not given and only their ranks in the series are
known. Spearman’s rank correlation coefficient, usually denoted by ρ (rho), is given by the
formula:

$\rho = 1 - \frac{6\sum d_i^2}{n^3 - n} = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$

Where $d_i$ stands for the difference between the pair of ranks and n the number of paired
observations.

The value of Spearman’s rank correlation coefficient ranges between -1 and +1. When ρ is +1,
there is perfect concordance between the rankings and the ranks are in the same direction.
When ρ is -1, the agreement is again perfect but the ranks are in the
opposite direction.

Example: Two managers are asked to rank a group of employees in order of potential for
eventually becoming top managers. The rankings are as follows:

Employee               A   B  C  D  E  F  G  H  I  J
Ranked by Manager I    10  2  1  4  3  6  5  8  7  9
Ranked by Manager II   9   4  2  3  1  5  6  8  7  10

Compute the coefficient of rank correlation and comment on the value.
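
Spearman's formula can be sketched in Python for the two managers' rankings. The function name `spearman_rho` is hypothetical, and the sketch assumes there are no tied ranks, as in this example.

```python
def spearman_rho(ranks_a, ranks_b):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), assuming no tied ranks."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


manager_1 = [10, 2, 1, 4, 3, 6, 5, 8, 7, 9]
manager_2 = [9, 4, 2, 3, 1, 5, 6, 8, 7, 10]

rho = spearman_rho(manager_1, manager_2)
print(round(rho, 4))  # 0.9152
```

Here Σd² = 14, so ρ = 1 - 84/990 ≈ 0.92, which indicates close agreement between the two managers' rankings.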

Example: Calculate the rank correlation coefficient for the following data of marks of 2 tests
given to candidates for a clerical job:

Preliminary test: 92 89 87 86 83 77 71 63 53 50
Final test : 86 83 91 77 68 85 52 82 37 57

Uses of correlation:

i. Economic theory and business studies deal with relationships between variables such as price and
quantity demanded, advertising expenditure and sales, promotion measures etc. Correlation
analysis helps in deriving precisely the degree and direction of such
relationships.

ii. The concepts of regression are also based upon the measure of correlation.

Regression: Regression is a mathematical measure expressing the average relationship
between two or more variables in terms of the original units of the data.

In a regression analysis there are two types of variables. The variable whose value is influenced or
is to be predicted is called the dependent, regressed, predicted or explained variable, and
the variable which influences the values or is used for prediction is called the independent
variable, regressor, predictor or explanatory variable.

Example: The relationships between two variables can be considered between say rainfall
and agricultural production, price of an output and the overall cost of product, consumer
expenditure and disposable income.

Types of regression: There are two types of regression:

1. Simple Regression
2. Multiple Regression

Simple Regression: When the dependent variable is estimated from only one independent
variable, it is called simple regression.

Regression equation (for simple regression): Regression equations are algebraic expressions
of the regression lines. Since there are two regression lines, the regression equation of X on Y
describes the variation in the values of X for given changes in Y, and the regression
equation of Y on X describes the variation in the values of Y for given changes in X.

The regression equation of Y on X is expressed as follows:

Y = a + bX

The regression equation of X on Y is expressed as follows:

X = c + dY

Multiple Regression: When there is more than one independent variable, it is called
multiple regression.

Regression equation (for multiple regression): The equation for multiple regression with n
independent variables X1, X2, ..., Xn is given below:

Y = a + b1X1 + b2X2 + ... + bnXn

Regression line: If the variables in a bivariate distribution are related, the points in the
scatter diagram will cluster around some curve called the “curve of regression”. If the
curve is a straight line, it is called the line of regression and there is said to be linear
regression between the variables; otherwise the regression is said to be curvilinear. The line
of regression is the line which gives the best estimate of the value of one variable for any
specific value of the other variable.

Estimation of parameters of simple linear regression model:

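1. For regression model: Y = a + bX

$$ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \quad \text{[by simplification]} $$

$$ a = \bar{Y} - b\bar{X} = \frac{\sum y}{n} - b\,\frac{\sum x}{n} $$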

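2. For regression model: X = c + dY

$$ d = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}} \quad \text{[by simplification]} $$

$$ c = \bar{X} - d\bar{Y} = \frac{\sum x}{n} - d\,\frac{\sum y}{n} $$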
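
The step marked "by simplification" in both models can be filled in as follows. This is a sketch of the standard algebra, using only \bar{x} = \sum x / n and \bar{y} = \sum y / n:

```latex
\begin{align*}
\sum (x - \bar{x})(y - \bar{y})
  &= \sum xy - \bar{x}\sum y - \bar{y}\sum x + n\,\bar{x}\,\bar{y} \\
  &= \sum xy - \frac{\sum x \sum y}{n} - \frac{\sum x \sum y}{n} + \frac{\sum x \sum y}{n} \\
  &= \sum xy - \frac{\sum x \sum y}{n}, \\[6pt]
\sum (x - \bar{x})^2
  &= \sum x^2 - 2\bar{x}\sum x + n\,\bar{x}^2
   = \sum x^2 - \frac{\left(\sum x\right)^2}{n}.
\end{align*}
```

The same expansion with y in place of x gives the denominator used for d.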

Example: The data on sales and promotion expenditure for a product over 10 years are given
below:

Sales (Tk. lakhs)               8  10   9  12  10  11  12  13  14  15
Promotion exp. (Tk. thousand)   2   2   3   4   5   5   5   6   7   8

Estimate the regression equations. Also find:

a) Sales when promotion expenditure is Tk. 10 thousand

b) Promotion expenditure when sales target is Tk. 20 lakhs.
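
One way to work this example is sketched below in Python, using the simplified estimation formulas given earlier. Sales is taken as Y and promotion expenditure as X. The function name fit_line is only illustrative.

```python
# Data copied from the example above.
sales = [8, 10, 9, 12, 10, 11, 12, 13, 14, 15]   # Y, Tk. lakhs
promotion = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8]       # X, Tk. thousand


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    slope = s_xy / s_xx
    intercept = sum(y) / n - slope * sum(x) / n
    return intercept, slope


a, b = fit_line(promotion, sales)   # regression of Y on X: Y = a + bX
c, d = fit_line(sales, promotion)   # regression of X on Y: X = c + dY

sales_at_10 = a + b * 10            # part (a): X = 10 thousand
promotion_at_20 = c + d * 20        # part (b): Y = 20 lakhs

print(f"Y = {a:.4f} + {b:.4f} X")   # Y = 6.6870 + 1.0028 X
print(f"X = {c:.4f} + {d:.4f} Y")   # X = -4.5946 + 0.8153 Y
print(f"{sales_at_10:.2f}")         # 16.71
print(f"{promotion_at_20:.2f}")     # 11.71
```

Under these formulas, the estimated sales are about Tk. 16.71 lakhs when promotion expenditure is Tk. 10 thousand, and the estimated promotion expenditure is about Tk. 11.71 thousand for a sales target of Tk. 20 lakhs. Both values lie outside the observed range of the data, so they are extrapolations.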


Coefficient of Determination: One very convenient and useful way of interpreting the value
of the coefficient of correlation between two variables is to use its square, which is called
the coefficient of determination. The coefficient of determination thus equals r².

For example, if r = 0.9, then r² = 0.81, which means that 81% of the variation in the
dependent variable is explained by the independent variable.
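
The "variation explained" reading can be checked numerically. The Python sketch below reuses the sales and promotion data from the earlier example and assumes the product-moment (Pearson) correlation and the fitted line Y = a + bX. It compares r² with 1 - SSE/SST, where SSE is the sum of squared residuals and SST the total variation in Y. These two names are introduced here for illustration.

```python
# Data from the sales and promotion example (Y = sales, X = promotion).
sales = [8, 10, 9, 12, 10, 11, 12, 13, 14, 15]
promotion = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8]

n = len(sales)
mean_x = sum(promotion) / n
mean_y = sum(sales) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(promotion, sales))
s_xx = sum((x - mean_x) ** 2 for x in promotion)
s_yy = sum((y - mean_y) ** 2 for y in sales)   # SST: total variation in Y

r = s_xy / (s_xx * s_yy) ** 0.5                # coefficient of correlation
b = s_xy / s_xx                                # slope of Y on X
a = mean_y - b * mean_x                        # intercept of Y on X

# SSE: variation in Y left unexplained by the fitted line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(promotion, sales))
explained_share = 1 - sse / s_yy

print(round(r, 4))                # 0.9042
print(round(r ** 2, 4))           # 0.8176
print(round(explained_share, 4))  # 0.8176
```

For this data r is about 0.90, so roughly 82% of the variation in sales is accounted for by the linear relationship with promotion expenditure.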

Comparison of correlation analysis with regression analysis:

1. Correlation: The correlation coefficient rxy is a relative measure of the linear
   relationship between X and Y and is independent of the units of measurement. It is a
   pure number lying between -1 and +1.
   Regression: The regression coefficients byx (bxy) are absolute measures representing the
   change in the value of the variable Y (X) for a unit change in the variable X (Y).

2. Correlation: Correlation analysis has limited applications, as it is confined to the
   study of the linear relationship between the variables.
   Regression: Regression analysis studies linear as well as non-linear relationships
   between the variables and therefore has much wider applications.

3. Correlation: The correlation coefficient is symmetric, i.e. rxy = ryx.
   Regression: The regression coefficients are not symmetric in X and Y, i.e. in general
   bxy ≠ byx.

4. Correlation: The range of r is -1 ≤ r ≤ 1.
   Regression: The range of byx (bxy) is -∞ < byx (bxy) < ∞.

Comments on coefficient of correlation:

Values of r          Comments
r = -1               Perfect negative correlation
-1 < r ≤ -0.8        Higher degree of negative correlation
-0.8 < r < -0.2      Moderate degree of negative correlation
-0.2 ≤ r < 0         Lower degree of negative correlation
r = 0                Zero correlation
0 < r ≤ 0.2          Lower degree of positive correlation
0.2 < r < 0.8        Moderate degree of positive correlation
0.8 ≤ r < 1          Higher degree of positive correlation
r = 1                Perfect positive correlation
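
The table above can be expressed as a small lookup function. This Python sketch simply encodes the cut-off values 0.2 and 0.8 from the table. The function name describe_correlation is hypothetical.

```python
def describe_correlation(r):
    """Return the comment from the table above for a correlation coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "Zero correlation"

    sign = "positive" if r > 0 else "negative"
    size = abs(r)

    if size == 1:
        return f"Perfect {sign} correlation"
    if size >= 0.8:
        degree = "Higher"
    elif size > 0.2:
        degree = "Moderate"
    else:
        degree = "Lower"
    return f"{degree} degree of {sign} correlation"


print(describe_correlation(0.9))    # Higher degree of positive correlation
print(describe_correlation(-0.5))   # Moderate degree of negative correlation
```

The cut-offs are a convention of these notes and not a universal standard, so other texts may use different boundaries.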
