Lecture Part A
A history must start somewhere, but history has no beginning. There has always been a
temptation by many writers to begin a history of Statistics with references to the endeavours
in the ancient world to record information about states.
The first use of the word appears in a work by the Italian historian Girolamo Ghilini, who in
1589 referred to an account of the civile, politica, statistica e militare scienza. It is believed
that the words: statist, statistics and statistical originated from the Latin word status which
means the political state of a nation. However, there are other views. One of these is
that the word statistics originated from either the Italian statista or the German statistik. Note that
both statista and statistik, like the Latin status, mean the political state of a nation.
The word statist appeared for the first time in Shakespeare’s Hamlet in 1602. After that the
word was found in the books, Cymbeline (1610 / 1611), and Paradise Regained (1671).
The German professor Gottfried Achenwall used the word statistik for the first time in his book
Abriss der Europäischen Reiche in 1749.
The word statistics appeared for the first time in Baron J. F. Von Bielfeld’s book The
Elements of Universal Erudition (1770). In that book, a chapter is titled as Statistics and the
term statistics is defined as: The science that teaches what is the political arrangement of all
the modern states of the known world. In 1787, a German professor of Natural Philosophy, E.
A. W. Zimmermann, in his book A Political Survey of the Present State of Europe, defined
statistics as: A brief description of important characteristics of a state.
The evolution of statistics has a long history. Perhaps the earliest use of statistics was when
the ancient rulers or heads of different regions used to count the number of effective warriors
that they had or the number they would need to defeat their enemy, or when they estimated
how much tax could be collected from their subjects. In later times, statistics was used to
report death rates in the great London Plague and in the study of natural resources. In our
Indian subcontinent, good records on land, revenue, agriculture etc. were maintained during
Akbar’s reign (1556-1605).
Definition of Statistics: It is almost impossible to capture statistics in one compact definition.
Different statisticians have defined statistics from different perspectives, and the word is
used in two different senses, as the definitions below show.
Croxton and Cowden’s (1948) definition: Statistics is the subject of collection, presentation
and analysis of numerical data.
Yule and Kendall’s (1950) definition: Statistics means quantitative data, which are affected
to a marked extent by multiplicity of causes.
Data: The raw material of statistics, consisting of numbers or observations usually obtained by
some process of counting or measurement, is known as data. There are two types of data.
1. Primary data
2. Secondary data
Primary data: Primary data are measurements observed and recorded as part of an original
study. When the data required for a particular study can be found neither in the internal records
of the enterprise nor in published sources, it may become necessary to collect original data,
i.e. to conduct a first-hand investigation.
Secondary Data: Secondary data means the observations that are already available i.e. they
refer to the data which have already been collected and analyzed by someone else. When the
researcher utilizes secondary data, then he has to look into various sources from where he can
obtain them. In this case he is certainly not confronted with the problems that are usually
associated with the collection of original data. Secondary data may either be published data
or unpublished.
Collection of Primary Data: There are several methods of collecting primary data,
particularly in surveys and descriptive research. Important ones are:
1. Observation method
2. Interview method
3. Through questionnaires
4. Through schedules
5. Other methods
1. Observation Method: The observation method is the most commonly used method
especially in studies relating to behavioral sciences. Under the observation method, the
information is sought by way of investigator’s own direct observation without asking from
the respondent.
Merits:
Demerits:
If the data are collected by someone other than the investigator, wrong information may be recorded.
Observing a remote area requires more time and cost.
2. Interview Method: This method can be used through personal interviews and if possible
through telephone interviews.
Merits:
Demerits:
It is a very expensive method, especially when a large and geographically widespread
sample is taken.
This method is relatively more time consuming, especially when the sample is large.
Merits:
Demerits:
Demerits:
4. Collection of data through schedules: This method of data collection is very much like the
collection of data through questionnaires, with the small difference that schedules
(proformas containing a set of questions) are filled in by enumerators who are specially
appointed for the purpose. The enumerators go to the respondents with the schedules, put to
them the questions from the proforma in the order in which they are listed, and record the
replies in the space provided for them in the proforma.
Merits:
Demerits:
When the sample area is large, this method is very time consuming.
This type of collection is costly because enumerators with schedules must be maintained
across a wide geographic area.
i. Warranty cards: Warranty cards are usually postal sized cards which are used by
dealers of consumer durables to collect information regarding their products. The
information sought is printed in the form of questions on the warranty cards which is
placed inside the package along with the product with a request to the consumer to fill
in the card and post it back to the dealer.
ii. Use of mechanical devices: The use of mechanical devices has been widely made to
collect information by way of indirect means. Eye camera, Pupil metric camera,
Motion picture camera and Audiometer are the principal devices so far developed and
commonly used by modern big business houses, mostly in the developed world for the
purpose of collecting the required information.
iii. Projective Techniques: Projective techniques (or what are sometimes called as
indirect interviewing techniques) for the collection of data have been developed by
psychologists to use projections of respondents for inferring about underlying
motives.
Collection of Secondary Data: Secondary data may either be published data or unpublished
data.
Applications of statistical tools in various processing stages of textile production: There are
various applications of statistical tools in textile sectors. Those are discussed below.
1. Yarn Production: Several stages are involved in cotton yarn production. When
fibers are mixed and processed through the blow room, within-lap and between-lap variations are
studied by computing the mean, SD and CV, while lap rejection and production control are studied with
p and x̄ charts. Averages are used to find the hank of sliver in carding, draw frame and
combing, the average hank of roving in the roving frame, and the average count at the ring frame.
Generally a spinning mill uses the ‘average count’ as the count specification if it is producing 4–5
counts. The weaving section, on the other hand, uses the ‘resultant count’, which is simply
the harmonic mean of the counts produced. Control charts are used extensively in
every process of yarn production (for example, process control with respect to thin places,
neps, etc.). Probability distributions such as the Poisson, Weibull and binomial are applied to
various problems in spinning and are very useful for understanding end
breakage. In the ring spinning section several ring bobbins are collected and tested for
CSP, and the differences between bobbins and within bobbins are studied using the ‘range’
method. In the cone winding section, process control can be checked using either a control
chart for averages or a chart for the number of defectives.
3. Chemical processing and Garment Production: The scope of statistics is unlimited. For
example the effect of n number washes (identical conditions) on m fabrics on a particular
fabric property can be easily found by either tests of significance or analysis of variance.
Similarly the effect of different detergents on fabric types can be investigated by two-way
analysis of variance. Similarly different types of fabrics and the effect of sewing conditions
can be studied by ANOVA. In garment production the control of measurements and its
distribution can be well understood by control and polar charts.
4. Fiber Production: Measures of central tendency like process average gives an idea about
average staple length of fiber produced in a continuous or batch wise process. Coefficient of
variation (CV) of the process signifies about the process control. On the other hand, analysis
of time series is helpful in estimating the future production based on the past records.
Measures of dispersion such as standard deviation and CV are useful in comparing the
performance of two or more fiber-producing units or processes. Significance tests can also be
applied to investigate whether significant difference exists between the batches for means or
standard deviations. Analysis of variance can be applied for studying the effect of parameters
of fiber production and methods of polymer dissolving.
5. Textile Testing of Fiber, Yarn and Fabrics: Analysis of results in textile testing without the
application of statistical tools would be meaningless. In other words, every experiment in
textile testing includes the use of statistical tools such as calculation of averages, computation of SD
and CV, and application of tests of significance (t-test, z-test and F-test) or analysis of variance
(one way, two way or design of experiments). Populations can be studied well using the
normal, binomial or Poisson distributions. Random sampling errors are used in studying
the population mean and SD at the 95% and 99% levels of confidence. Application of the
geometric mean for finding the overall flexural rigidity, Go, has an important role in
fabric selection for garment manufacture.
Special mention is made of the determination of fiber length by the Baer sorter, where all the
measures of central tendency and dispersion (mean length, modal length, quartile deviation,
etc., in the form of a frequency distribution) are computed to understand the cotton
sample under consideration and to test its potential in yarn manufacture. The
Balls sledge sorter, on the other hand, uses a weight distribution from which the mean and SD are computed. In the case
of cotton fibres, the development of cell wall thickening, commonly referred to as “maturity”,
can be determined well using the normal distribution and confidence intervals.
Several properties are tested for different packages produced from the same material or from
the same frame by applying significance tests. The effect of instruments and variables for
different types of samples can be studied well using ANOVA. All the fabric
properties tested on a single instrument or on different instruments can be understood using
design of experiments. In one research application, the low-stress
mechanical properties of nearly 1000 fabrics were studied by principal component
analysis and a biplot. Measures of dispersion such as the coefficient of variation and percentage mean
deviation are widely used in evenness measurement.
Variable: A measurable characteristic which can vary within a certain range. For example,
family size.
Frequency Distribution
Definition: A frequency distribution is a listing of a data set which divides the data into different
classes and gives a count of the number of observations in each class.
According to Croxton and Cowden, “Frequency distribution is a statistical table which shows
the set of all distinct values of the variables arranged in order of magnitude, either
individually or in groups with their corresponding frequencies side by side.”
Example: The profits (in lakhs of Tk.) of 30 Bangladeshi companies for the year 2016 –
2017 are given below:
28 16 23 37 35 49 63 65 55 45
58 57 69 30 32 35 42 37 42 48
53 49 65 39 48 67 25 29 58 40
Let us determine a suitable class interval with the help of the following formula:
$$i = \frac{\text{Range}}{1 + 3.322 \log N} = \frac{53}{1 + 3.322 \log 30} = 8.97 \approx 9$$
Here Range = 69 − 16 = 53 and N = 30.
Since values like 3, 7, 9 etc. should be avoided, we will take 10 as the class interval and
let the first class be 15-25.
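The calculation above can be checked with a short Python sketch. It assumes that each class includes its lower limit and excludes its upper limit (so 25 falls in 25-35); the notes do not state this convention, and the variable names are illustrative only.

```python
import math

# Profits (in lakhs of Tk.) of the 30 companies from the example above.
profits = [28, 16, 23, 37, 35, 49, 63, 65, 55, 45,
           58, 57, 69, 30, 32, 35, 42, 37, 42, 48,
           53, 49, 65, 39, 48, 67, 25, 29, 58, 40]

n = len(profits)
data_range = max(profits) - min(profits)            # 69 - 16 = 53
suggested_width = data_range / (1 + 3.322 * math.log10(n))

# Class width and starting point chosen in the text.
width, start = 10, 15

# Count observations in each class [lower, upper); the upper limit is
# excluded, which is an assumption about the convention used.
classes = []
lower = start
while lower <= max(profits):
    upper = lower + width
    freq = sum(1 for x in profits if lower <= x < upper)
    classes.append((lower, upper, freq))
    lower = upper

print(f"suggested class interval = {suggested_width:.2f}")
for lower, upper, freq in classes:
    print(f"{lower}-{upper}: {freq}")
```

Under that convention the six classes 15-25 to 65-75 receive 2, 5, 8, 6, 5 and 4 companies, which add up to 30.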
Presentation of frequency distribution: There are two types of presentation of frequency
distribution. They are:
1. Graphical presentation
2. Diagrammatic presentation
Histogram, frequency polygon, frequency curve and ogive curve are graphical
presentations, and pie-chart and bar diagram are diagrammatic presentations.
Histogram: A histogram or column diagram is a suitable graph for presenting the frequency
distribution of a continuous series.
In a histogram, the class intervals and their corresponding frequencies are
plotted along the horizontal and vertical axes respectively.
Histograms are drawn in two ways: i) Histogram having equal class intervals
and ii) Histogram having unequal class intervals
Example: The marks of 100 students of statistics course are given. Represent the data by
histogram (having equal class interval).
Marks No of students
30 - 40 8
40 - 50 12
50 - 60 30
60 - 70 25
70 - 80 15
80 - 90 10
Total 100
Example: The marks of 100 students of statistics course are given. Represent the data by
histogram (having unequal class interval).
Marks No of students
30 - 40 8
40 - 60 42
60 - 70 25
70 - 80 15
80 - 90 10
Total 100
Frequency polygon: The principle of drawing a frequency polygon is that the frequency of each
class-interval is assumed to be spread evenly over the interval, so that the mid-point of the class
interval represents the class.
A frequency polygon can be drawn by joining the mid-points of the tops of the bars in a
histogram with straight lines.
Ogive curve: An Ogive (pronounced O-Jive) is a graph that represents the cumulative
frequencies for the classes in a frequency distribution and it is a continuous frequency curve.
To plot an ogive, we need class boundaries and the cumulative frequencies. For grouped data,
ogive is formed by plotting the cumulative frequency against the upper boundary of the class.
For ungrouped data, cumulative frequency is plotted on the y-axis against the data which is
on the x-axis.
Example: The marks of 100 students of statistics course are given. Represent the data by
frequency polygon and Ogive.
Marks No of students
30 - 40 8
40 - 50 12
50 - 60 30
60 - 70 25
70 - 80 15
80 - 90 10
Total 100
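As a rough sketch, the points needed for the frequency polygon and the ogive of this table can be computed as follows. It assumes a "less than" ogive plotted at the upper class boundaries; the variable names are illustrative, and the actual plotting is left to any graphing tool.

```python
from itertools import accumulate

# Marks distribution from the example above: (lower, upper, frequency).
marks = [(30, 40, 8), (40, 50, 12), (50, 60, 30),
         (60, 70, 25), (70, 80, 15), (80, 90, 10)]

# Frequency polygon: each class frequency is plotted at the class mid-point.
polygon_points = [((lo + hi) / 2, f) for lo, hi, f in marks]

# Ogive ("less than" type): cumulative frequency at the upper boundary.
cumulative = list(accumulate(f for _, _, f in marks))
ogive_points = [(hi, cf) for (_, hi, _), cf in zip(marks, cumulative)]

print(polygon_points)
print(ogive_points)
```

The last cumulative frequency equals the total of 100 students, which is a quick check that no class was missed.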
Pie-chart: The pie-chart, also written pie chart, is a useful device for presenting categorical
data. Data other than categorical can also be employed for constructing a pie-chart after
suitable and meaningful classification or grouping of the data.
The pie-chart consists of a circle sub-divided into sectors, whose areas are proportional to the
various parts into which the whole quantity is divided.
Example: We are given the expenditure per month of a family on different items. Construct a
suitable diagram to represent the expenditures.
Items Expenditure (Tk. In thousand)
Food 30
Clothing 25
Residence 20
Education 20
Medicine 15
Others 10
Total 120
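To draw the pie-chart by hand, each item is converted to a sector angle in proportion to its share of the total. The following minimal sketch computes those angles for the table above; the dictionary name is illustrative.

```python
# Monthly expenditure (Tk. in thousand) from the example above.
expenditure = {"Food": 30, "Clothing": 25, "Residence": 20,
               "Education": 20, "Medicine": 15, "Others": 10}

total = sum(expenditure.values())

# Sector angle = (item value / total) * 360 degrees.
angles = {item: 360 * value / total for item, value in expenditure.items()}

for item, angle in angles.items():
    print(f"{item}: {angle:.0f} degrees")
```

The angles add up to 360 degrees, so the sectors fill the whole circle.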
Bar diagram: A bar diagram or bar graph is a diagram or graph that presents categorical data
with rectangular bars with heights or lengths proportional to the values that they represent.
The bars can be plotted vertically or horizontally. A vertical bar diagram is sometimes called
a column chart.
Example: The number of ODIs played by six Asian countries in 2017 is given. Draw a bar
diagram.
Country No. of ODI
Afghanistan 10
Bangladesh 15
Hong Kong 05
India 30
Pakistan 20
Sri Lanka 25
Measures of Central Tendency
Central Tendency: In a representative sample, the values of a series of data tend to
cluster around a certain point, usually at the center of the series. This tendency is called central
tendency, and its numerical measures are called the measures of central tendency.
Different Measures of Central Tendency: The following are the important measures of
central tendency which are generally used in business and industry.
1. Arithmetic mean
2. Geometric mean
3. Harmonic Mean
4. Median
5. Mode
1. Arithmetic mean: The arithmetic mean, often simply referred to as the mean, is the total of the
values of a set of observations divided by the total number of observations. It is denoted by
AM or $\bar{X}$.
$$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Example: The monthly income (in Tk’s) of 10 persons working in a firm is as follows:
14870 14930 15020 14460 14750 14920 15720 15160 14680 14890
Find average monthly income.
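A minimal sketch of this calculation in Python, applying the formula for ungrouped data directly (the variable names are illustrative):

```python
# Monthly incomes (in Tk.) of the 10 persons from the example above.
incomes = [14870, 14930, 15020, 14460, 14750,
           14920, 15720, 15160, 14680, 14890]

# Arithmetic mean = sum of the observations / number of observations.
average_income = sum(incomes) / len(incomes)

print(average_income)
```

The incomes sum to 149400, so the average monthly income comes to Tk. 14940.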
For Grouped Data: If $x_1, x_2, \ldots, x_n$ represent the values (or class mid-points) of N observations
and their respective frequencies are $f_1, f_2, \ldots, f_n$, then the arithmetic mean, denoted
by $\bar{X}$, is defined as:
$$\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum_{i=1}^{n} f_i x_i}{N}$$
where $N = \sum_{i=1}^{n} f_i$.
Example: The following are the figures of profits earned by 1400 companies during 2016-
2017.
Profits (Tk. lakhs) 20-40 40-60 60-80 80-100 100-120 120-140 140-160
No. of companies 500 300 280 120 100 80 20
Calculate the average profits for all companies. (By both Direct and Short-Cut Method)
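Both methods can be sketched in a few lines of Python. The short-cut method here is taken to mean the usual assumed-mean form, mean = A + Σfd / N with d = x − A; the choice A = 90 is an arbitrary assumption, and any other value gives the same answer.

```python
# Profit classes (Tk. lakhs) and number of companies from the example above.
classes = [(20, 40), (40, 60), (60, 80), (80, 100),
           (100, 120), (120, 140), (140, 160)]
freqs = [500, 300, 280, 120, 100, 80, 20]

midpoints = [(lo + hi) / 2 for lo, hi in classes]
N = sum(freqs)

# Direct method: mean = sum(f * x) / N
mean_direct = sum(f * x for f, x in zip(freqs, midpoints)) / N

# Short-cut method: mean = A + sum(f * d) / N, where d = x - A.
# A = 90 is an assumed mean chosen only for illustration.
A = 90
mean_shortcut = A + sum(f * (x - A) for f, x in zip(freqs, midpoints)) / N

print(round(mean_direct, 2), round(mean_shortcut, 2))
```

Both routes give an average profit of about Tk. 60.57 lakhs, since Σfx = 84800 and N = 1400.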
Merits of AM:
It is rigidly defined and easy to calculate.
Easy to understand and easy for algebraic treatment.
It is based on all the observations.
Two or more sets of data can be compared very easily using their respective means.
Demerits of AM:
It is very badly affected by extreme values.
If there is open-end class we cannot find out the AM.
It cannot be used for qualitative data.
Uses of AM:
It is widely used almost in all the areas in economics, business, social science etc.
It is extensively used for business forecasting and time series analysis.
It is suitable for further algebraic treatment.
It is easily used for comparing two or more sets of data.
2. Geometric mean: The geometric mean of n positive observations is the n-th root
of their product. It is usually denoted by GM or G.
For ungrouped data: Let $x_1, x_2, \ldots, x_n$ be positive observations in a series
of data. Then the geometric mean can be written as:
$$G = (x_1 x_2 \cdots x_n)^{1/n}$$
$$G = \operatorname{antilog}\left(\frac{\sum_{i=1}^{n} \log x_i}{n}\right) \quad \text{[by simplification]}$$
Example: Calculate the geometric mean from the following data:
For grouped data: If $x_1, x_2, \ldots, x_n$ are taken as the mid-points of the various classes and $f_1,
f_2, \ldots, f_n$ are the class frequencies in a frequency distribution, then the geometric mean
G can be expressed as follows:
$$G = \left(x_1^{f_1} x_2^{f_2} \cdots x_n^{f_n}\right)^{1/N}$$
$$G = \operatorname{antilog}\left(\frac{\sum_{i=1}^{n} f_i \log x_i}{N}\right) \quad \text{[by simplification]}$$
Example: Calculate geometric mean for the following distribution.
Merits of GM:
It is not much affected by extreme values.
Suitable for arithmetic and algebraic treatment.
Geometric mean is less affected by sampling fluctuation.
Demerits of GM:
If any Xi = 0, the geometric mean cannot be defined.
If the product of the Xi is negative, it cannot be defined.
It is difficult to calculate.
Uses of GM:
The geometric mean is used to calculate averages of ratios and percentages.
It is also used for computing average rate of increase or decrease.
It is useful for the construction of index number.
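The logarithmic form of the geometric mean can be sketched as below. The function name and the sample values are hypothetical (the data for the examples in these notes are not available); the optional frequencies cover the grouped case.

```python
import math

def geometric_mean(values, freqs=None):
    """Geometric mean via logarithms; every frequency defaults to 1."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs strictly positive values")
    total = sum(freqs)
    log_sum = sum(f * math.log10(x) for x, f in zip(values, freqs))
    return 10 ** (log_sum / total)      # the antilog step

# Hypothetical data, not taken from the lecture.
print(geometric_mean([2, 4, 8]))                  # cube root of 64
print(geometric_mean([10, 100], freqs=[1, 3]))    # (10 * 100**3) ** (1/4)
```

Raising an error for zero or negative values mirrors the demerits listed above: the geometric mean is not defined in those cases.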
3. Harmonic Mean: The harmonic mean of a set of n non-zero observations in a series is the
reciprocal of the arithmetic mean of the reciprocals of the observations. It is usually denoted
by HM.
For ungrouped data: Let $x_1, x_2, \ldots, x_n$ be non-zero observations in a series of data.
Then the harmonic mean is
$$HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
Example: Calculate the Harmonic mean of the following series of monthly expenditure of a
batch of students:
Merits of HM:
It uses all figures in the data like AM and GM.
It provides more weight for smaller values. i.e. it has downward bias.
It measures rates of change and can be adapted to problems involving time and certain
types of rates and ratios.
It is not affected by single extreme values of items.
Demerits of HM:
Sometimes it is difficult to calculate, especially when the number of items is large.
It is not too easy to understand.
If any observation is zero then the harmonic mean cannot be defined.
It assigns too much weight to the smaller items and has limited scope.
Uses of HM:
The harmonic mean is used when the observations are expressed in terms of rates,
speeds, prices etc.
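A small sketch shows why the harmonic mean suits rates such as speeds. The speeds are hypothetical values, assumed to be travelled over two equal distances; the function name is illustrative.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals; undefined if any value is zero."""
    if any(x == 0 for x in values):
        raise ValueError("harmonic mean is undefined when a value is zero")
    return len(values) / sum(1 / x for x in values)

# Hypothetical speeds (km/h) over two equal distances.
speeds = [40, 60]

hm = harmonic_mean(speeds)
am = sum(speeds) / len(speeds)
print(hm, am)
```

Here the harmonic mean is about 48 km/h, which is the true average speed for the whole trip, while the arithmetic mean of 50 km/h overstates it.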
4. Median: Median is the middle most observation when the observations or set of values of
a particular study are arranged in (ascending or descending) order of magnitude. That is, half
of the observations in a set of data are lower than it and half of the observations are greater
than it.
For ungrouped data: Let $x_1, x_2, \ldots, x_n$ be a series of n observations arranged in
order of magnitude. Then the median is:
When n is odd, $Me = \left(\frac{n+1}{2}\right)$th observation
When n is even, $Me = \dfrac{\left(\frac{n}{2}\right)\text{th observation} + \left(\frac{n}{2}+1\right)\text{th observation}}{2}$
Example: From the following data of wages of 7 workers, compute the median wage.
Wages (in Rs.) 1600 1650 1580 1690 1660 1606 1640
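The odd and even cases of the rule above can be sketched as one small function (the function name is illustrative):

```python
def median(values):
    """Middle observation after sorting; mean of the two middle ones if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the (n + 1)/2-th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # n/2-th and (n/2 + 1)-th

# Wages (in Rs.) of the 7 workers from the example above.
wages = [1600, 1650, 1580, 1690, 1660, 1606, 1640]
print(median(wages))
```

Sorted, the wages are 1580, 1600, 1606, 1640, 1650, 1660, 1690, so the fourth value, Rs. 1640, is the median wage.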
For grouped data: For grouped data we may get the median by the following formula,
$$Me = L + \frac{\frac{N}{2} - p.c.f.}{f_m} \times c$$
Where,
L = Lower limit of the median class
N = Total number of observations
p.c.f. = Cumulative frequency of the class preceding the median class
$f_m$ = Frequency of the median class
c = Class interval of the median class
Example: 1000 workers are working in an industrial establishment. Their age is classified as
follows:
Age (yrs) 10-20 20-30 30-40 40-50 50-60
No. of workers 120 225 260 240 155
Calculate the median age.
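A minimal sketch of the grouped-median formula applied to this table follows. The median class is taken as the first class whose cumulative frequency reaches N/2; the function and variable names are illustrative.

```python
# Age classes (years) and number of workers from the example above.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [120, 225, 260, 240, 155]

def grouped_median(classes, freqs):
    """Me = L + (N/2 - p.c.f.) / f_m * c for the median class."""
    N = sum(freqs)
    cumulative = 0                      # cumulative frequency before the class
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f >= N / 2:     # first class that reaches N/2
            return lower + (N / 2 - cumulative) / f * (upper - lower)
        cumulative += f

median_age = grouped_median(classes, freqs)
print(round(median_age, 2))
```

With N/2 = 500, the median class is 30-40 (preceding cumulative frequency 345, class frequency 260), which gives a median age of about 35.96 years.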
Merits of Me:
Median is easy to calculate and understand
It is not affected by extreme values
It can be calculated when there exists open interval
It is also located graphically
Demerits of Me:
Median is not based on all the items of the series
It cannot be calculated by simple mathematical method
It is not suitable for further algebraic and arithmetic treatment
It is affected much by sampling fluctuation
Uses of Me:
Median is very useful when observations are qualitative
It is very helpful when there is an open-end interval
It is usually used in the data set with extreme values
It is useful for comparing two or more sets of qualitative data
5. Mode: Mode is that observation which occurs most frequently in a data set. It is denoted
by Mo
For ungrouped data: To determine the mode, count the number of times each value
occurs; the value that occurs the maximum number of times is the modal
value.
Example: Find the mode of the series: 2 5 2 5 7 8 5
For grouped data: In the case of grouped data the following formula is used for calculating the mode:
$$Mo = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times C$$
Where,
L = Lower limit of the modal class
$\Delta_1$ = The difference between the frequency of the modal class and the pre-modal class.
$\Delta_2$ = The difference between the frequency of the modal class and the post-modal class.
C = Size of the modal class.
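Both cases can be sketched in Python. The ungrouped series is the one from the example above; for the grouped case the sketch reuses the marks distribution of 100 students given earlier in these notes, and treats a missing neighbouring class as having frequency zero, which is an assumption.

```python
from collections import Counter

# Ungrouped data: the value with the highest count is the mode.
series = [2, 5, 2, 5, 7, 8, 5]
mode_value, count = Counter(series).most_common(1)[0]

def grouped_mode(classes, freqs):
    """Mo = L + d1 / (d1 + d2) * C for the class with the highest frequency."""
    m = freqs.index(max(freqs))
    lower, upper = classes[m]
    pre = freqs[m - 1] if m > 0 else 0                 # pre-modal frequency
    post = freqs[m + 1] if m < len(freqs) - 1 else 0   # post-modal frequency
    d1 = freqs[m] - pre
    d2 = freqs[m] - post
    return lower + d1 / (d1 + d2) * (upper - lower)

# Marks distribution of 100 students from the earlier histogram example.
mark_classes = [(30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90)]
mark_freqs = [8, 12, 30, 25, 15, 10]

modal_mark = grouped_mode(mark_classes, mark_freqs)
print(mode_value, round(modal_mark, 2))
```

The value 5 occurs three times, so it is the mode of the series. For the marks, the modal class is 50-60 with Δ1 = 18 and Δ2 = 5, giving a mode of about 57.83.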
Merits of Mo:
Demerits of Mo:
Mode is indefinite
The combined mean cannot be calculated
It is not suitable for further algebraic and arithmetic treatment
Mode is not clearly defined in case of bi-modal or multimodal distribution.
Uses of Mo:
Many times we have the data without any particular numerical values. For example,
the observations on religion, economic status, sex etc. In these situations mode is most
useful
It is widely used in the stock exchange, business, weather forecasting etc.
1. Arithmetic mean, median and mode are easy to understand and easy to
calculate
2. Arithmetic mean is based upon all observation but median and mode are not
3. Arithmetic mean is amenable to algebraic treatment but the median and mode
are not easy for algebraic treatment
4. Arithmetic mean cannot be calculated from the distribution with open class
but median and mode can be calculated from the distribution with open classes
5. Arithmetic mean is affected very much by extreme values, whereas median and
mode are not affected by extreme values.
Answer: The following two considerations should be kept in mind in the selection of an
average:
1. The type of data available: Are they badly skewed (avoid the mean), gappy around
the middle (avoid the median) or unequal in class-interval width (avoid the mode)?
2. The concept of the typical value required by the problem: Is a composite average of
all absolute or relative values needed (arithmetic mean or geometric mean)? Is a
middle value needed (median) or the most common value (mode)?
Question: “Which average should we use”? or “ Which is the best average to be used”?
Answer: We cannot regard any of the measures of central tendency as the best in all
circumstances. Because different situations and purposes call for different measures, no
single measure of central tendency suits every case. The reasons are discussed
below.
Arithmetic mean: The arithmetic mean is the most popular and widely used measure of
central tendency. But in the following cases arithmetic mean should not be used:
4. When averaging rates (i.e. speed, fluctuations in the price of articles etc)
5. When there are very large and very small values of observations arithmetic
mean would be seriously misleading on account of undue influence of extreme
values.
Geometric mean: The geometric mean is typically used in averaging index numbers, rates of
change, ratios and other sets of data expressed in percentage form.
Harmonic mean: Harmonic mean is useful in problems in which values of a variable are
compared with a constant quantity of another variable i.e. rates, time, distance covered within
a certain time and quantities purchased or sold per unit etc.
Median: The median is generally the best average in open-end grouped distributions. In case
of price distribution or income distribution very high or very low values would cause the
mean to be higher or lower than most “common values”. The median may be more
representative to use in describing the mass of the data.
Mode: Generally speaking, the significance of mode lies in the fact that it can be used to
describe qualitative data. The mode can be used in problems involving the expression of
preference where quantitative measurements are not possible. If we want to compare
consumer preferences for different kinds of products or different kind of advertising, we can
compare the modal preferences expressed by different groups of people but we cannot
calculate the median or mean.
2. At times an average may give absurd results. For example, if we are calculating
average size of a family we may get a value 4.8. But that is impossible as persons
cannot be in fractions. However we should remember that it is an average value
representing the entire group of families.
3. An average may give us a value that does not exist in the data. For example, the
arithmetic mean of 100, 300, 250, 50, 100 is 800/5=160, a value that does not exist in
the data.
4. Measures of central value fail to give us any idea about the formation of the series.
Two or more series may have the same central value but may differ widely in composition.
For example, observe the following two series:
Series A: 150 170 190 210 280
Question: Describe the empirical relationship between mean, median and mode.
Answer: Now we shall describe the relationship among mean, median and mode for
symmetrical distribution and asymmetrical distribution.
For symmetrical distribution: The values of mean, median and mode coincide.
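For asymmetrical distribution: Mean > Median > Mode when the distribution is positively skewed, and Mode > Median > Mean when it is negatively skewed.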
Question: The mean depends on all the observations but the median does not. Explain.
Answer: As distinct from the arithmetic mean which is calculated from the value of every
observation in the series, the median is what is called a positional average. The term
‘position’ refers to the place of a value in a series. The place of the median in a series is such
that an equal number of observations lie on either side of it. For example, if the income of
five persons is Tk.1000, 1200, 1500, 1600, 1800 then the median income would be Tk. 1500.
Changing any or both of the first two values with any other numbers with value 1500 or less
and/or changing any of the last two values to any other values with values of 1500 and more,
would not affect the value of the median which would remain 1500.
In contrast, in case of arithmetic mean, the change in value of single observation would cause
the value of the mean to be changed.
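The contrast can be illustrated with a short sketch. The original incomes are those of the example above; the changed values (500 and 5000) are hypothetical and chosen only to respect the conditions described in the answer.

```python
from statistics import mean, median

# Incomes (Tk.) of the five persons from the example above.
incomes = [1000, 1200, 1500, 1600, 1800]

# Hypothetical changes: the lowest value is lowered and the highest raised.
changed = [500, 1200, 1500, 1600, 5000]

print(mean(incomes), median(incomes))   # mean and median before the change
print(mean(changed), median(changed))   # only the mean moves
```

The median stays at Tk. 1500 in both cases, while the mean moves from 1420 to 1960, because the mean uses the value of every observation and the median uses only the position of the middle one.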
Weighted Arithmetic Mean (WAM): One of the limitations of the arithmetic mean is that it
gives equal importance to all the observations. But there are cases where the relative
importance of the different observations is not the same. When that is so, we compute the
weighted arithmetic mean.
The term ‘weight’ stands for the relative importance of the different observations. The
formula for computing the weighted arithmetic mean is
$$\bar{X}_w = \frac{\sum WX}{\sum W}$$
Where,
$\bar{X}_w$ = The weighted arithmetic mean
X = Variable
W = Weights attached to the variable X.
Weighted mean is especially useful in the problems relating to the construction of index
numbers and standardized birth and death rates.
Example: Suppose that the nearby Wendy’s Restaurant sold medium, large and Biggie-sized
soft drinks for $0.90, $1.25 and $1.50 respectively. Of the last 10 drinks sold, 3 were medium,
4 were large and 3 were Biggie-sized. Find the mean price of the last 10 drinks sold.
Solution: We know, $\bar{X}_w = \dfrac{\sum WX}{\sum W}$
$$\bar{X}_w = \frac{3(\$0.90) + 4(\$1.25) + 3(\$1.50)}{3 + 4 + 3} = \frac{\$12.20}{10} = \$1.22$$
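The same weighted mean can be checked with a few lines of Python, using the number of drinks sold at each price as the weights (variable names are illustrative):

```python
# Prices ($) and numbers sold for medium, large and Biggie-sized drinks.
prices = [0.90, 1.25, 1.50]
weights = [3, 4, 3]

# Weighted mean = sum(W * X) / sum(W)
weighted_mean = sum(w * x for w, x in zip(weights, prices)) / sum(weights)

print(round(weighted_mean, 2))
```

The simple average of the three prices would be about $1.22 here as well only by coincidence of the symmetric weights; with different sales counts the two would differ.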
Combined mean: If we have the arithmetic means and the numbers of observations of two or more
related groups, we can compute the combined average of these groups by applying the
following formula.
$$\bar{X}_c = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$$
Where,
$\bar{X}_c$ = Combined mean of the two groups
$\bar{X}_1$ = Arithmetic mean of the first group
$\bar{X}_2$ = Arithmetic mean of the second group
$N_1$ = Number of observations in the first group
$N_2$ = Number of observations in the second group
If we have to find the combined mean of three series, the above formula can be extended
as follows:
$$\bar{X}_c = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2 + N_3 \bar{X}_3}{N_1 + N_2 + N_3}$$
Example: There are two branches of a company employing 100 and 80 persons respectively.
If arithmetic mean of the monthly salaries paid by two branches are tk.1570 and tk.1750
respectively, find the arithmetic mean of the salaries of the employees of the company as
a whole.
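A minimal sketch of the combined-mean formula, written so that it works for any number of groups (the function name is illustrative):

```python
def combined_mean(sizes, means):
    """Combined mean = sum(N_i * mean_i) / sum(N_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# Two branches from the example above: 100 and 80 employees,
# with mean monthly salaries of Tk. 1570 and Tk. 1750.
company_mean = combined_mean([100, 80], [1570, 1750])
print(company_mean)
```

The total salary bill is 157000 + 140000 = 297000 for 180 employees, so the mean salary for the company as a whole is Tk. 1650.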
Theorem: For n positive observations, $AM \ge GM \ge HM$.
Example: Find the arithmetic mean, geometric mean and harmonic mean from the following
data.
Class 120-140 140-160 160-180 180-200 200-220 220-240
Frequency 23 35 60 40 25 17
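The three means for this table can be computed with the grouped formulas given earlier, using class mid-points as the values. This is a sketch for checking a hand calculation; the variable names are illustrative.

```python
import math

# Classes and frequencies from the table above.
classes = [(120, 140), (140, 160), (160, 180),
           (180, 200), (200, 220), (220, 240)]
freqs = [23, 35, 60, 40, 25, 17]

midpoints = [(lo + hi) / 2 for lo, hi in classes]
N = sum(freqs)

am = sum(f * x for f, x in zip(freqs, midpoints)) / N
gm = math.exp(sum(f * math.log(x) for f, x in zip(freqs, midpoints)) / N)
hm = N / sum(f / x for f, x in zip(freqs, midpoints))

print(round(am, 2), round(gm, 2), round(hm, 2))
```

The arithmetic mean is 176 (Σfx = 35200, N = 200), and the printed values illustrate the ordering AM ≥ GM ≥ HM stated in the theorem.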
Age            20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64
No. of workers 40    60    200   180   150   110   175   60    25
Calculate mean, median and mode from the following frequency distribution. Also comment
on the skewness.
The median and mean for the distribution are both Tk. 88. Calculate the missing frequencies
f1 and f2.
Measures of Dispersion
A measure of central tendency represents only one of the important characteristics of a set of
data or series. From the measures of central tendency, we can get knowledge only about the
central values. They do not provide us the highest or lowest value i.e. the range or how the
data are scattered. For this kind of information about any series we have to use the measures
of dispersion.
In the study of dispersion we observe that various distributions may have exactly the same
average but differ substantially in variability. For example, let us consider two students
who obtained the following marks in a certain examination.
The two distributions are certainly not identical though their averages are the same. The
difference lies in the dispersion of their marks. From the above data it is apparent
that student A’s marks are all average or near average, whereas student B’s marks are
either very high or very low.
Measure of dispersion: The measurement of the scatter of the values of a data set among
themselves is called a measure of dispersion or variation.
Different Measures of Dispersion: There are several methods of measuring dispersion.
These measures can be divided into two groups:
1. Absolute measure
2. Relative measure
Absolute measure: Absolute measures of variation are expressed in the same statistical unit
in which the original data are given, such as rupees, kilograms, tonnes etc. These values may be
used to compare the variation in two or more distributions provided the variables are
expressed in the same units and have almost the same average value.
Following are the relative measures of variation:
a) Co-efficient of Range
b) Co-efficient of Quartile Deviation
c) Co-efficient of Mean Deviation
d) Co-efficient of Variation
Range: The range of a set of observations is the difference between two extreme values, i.e.
the difference between the maximum and minimum values. Therefore, it indicates the limits
within which all the observations fall. Thus, Range = Highest Value – Lowest Value
For ungrouped data: Let us consider a set of observations x1, x2, ..., xn, where H
is the maximum and L is the minimum. Then range, R = H - L
Example: Find out the range of the set of observations -9, -3, 0, 13, 4
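The range of these observations can be verified with a few lines of Python; the variable names are illustrative only.

```python
observations = [-9, -3, 0, 13, 4]

highest = max(observations)      # H = 13
lowest = min(observations)       # L = -9
value_range = highest - lowest   # R = H - L

print(value_range)  # 22
```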
For grouped data: In this case, the range is the difference between the upper boundary of
the highest class and the lower boundary of the lowest class. That is, R = XH - XL
Uses of Range:
It is time saving and widely used in industrial quality control and weather forecasting.
Variations in stock exchange can be studied by range.
Quartile Deviation: The quartile deviation (Q.D.) is another type of range obtained from the
quartiles. It is obtained by dividing the difference between the upper quartile (Q3) and the lower
quartile (Q1) by 2. Therefore,
\[ \text{Q.D.} = \frac{\text{Upper quartile} - \text{Lower quartile}}{2} = \frac{Q_3 - Q_1}{2} \]
The term (Q3 – Q1) is known as the interquartile range, and the quartile deviation
\(\frac{Q_3 - Q_1}{2}\) is also known as the semi-interquartile range.
Quartiles: Quartiles are those values in a set of observations arranged in order of magnitude
(ascending or descending) which divide the total observations into four equal parts.
For ungrouped data: Let x1, x2, ..., xn be a series of n observations arranged in
order of magnitude. Then the quartiles are:
When n is odd,
\[ Q_i = \frac{(n+1)\,i}{4}\text{th observation} ; \quad i = 1, 2, 3. \]
When n is even,
\[ Q_i = \frac{\left(\frac{n}{4} i\right)\text{th observation} + \left(\frac{n}{4} i + 1\right)\text{th observation}}{2} ; \quad i = 1, 2, 3. \]
Example: The Automobile Association checks the prices of gasoline before many holiday
weekends. Listed below are the self-service prices for a sample of 8 retail outlets during the
May 2014 Memorial Day Weekend.
40 22 60 30 45 66 70 55
Determine the quartile deviation.
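A minimal Python sketch of this example follows, using the rule for even n given above. It only handles the case where n·i/4 is a whole number (true here, since n = 8); the function name `quartile_even_n` is an illustrative choice, not something defined in the notes.

```python
def quartile_even_n(sorted_values, i):
    """Q_i for even n: mean of the (n*i/4)th and (n*i/4 + 1)th observations."""
    n = len(sorted_values)
    if n % 2 != 0 or (n * i) % 4 != 0:
        raise ValueError("this sketch needs even n with n*i/4 a whole number")
    k = n * i // 4                     # 1-based position of the first observation
    return (sorted_values[k - 1] + sorted_values[k]) / 2


prices = sorted([40, 22, 60, 30, 45, 66, 70, 55])

q1 = quartile_even_n(prices, 1)
q3 = quartile_even_n(prices, 3)
quartile_deviation = (q3 - q1) / 2

print(q1, q3, quartile_deviation)  # 35.0 63.0 14.0
```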
For grouped data: For grouped data we may get the quartiles by the following formula,
\[ Q_i = L + \frac{\frac{iN}{4} - \text{p.c.f}}{f_q} \times c ; \quad i = 1, 2, 3 \]
Where,
L = Lower limit of the quartile class
p.c.f = Preceding cumulative frequency to the quartile class
fq = Frequency of the quartile class
c = Class interval of the quartile class
Uses of Q.D.:
The quartile deviation can be used as a rough measure of dispersion which is superior
to range
Mean Deviation: Mean deviation is the mean of absolute deviations of the items from an
average like mean, median or mode. Normally, we consider the arithmetic mean as the
average.
For ungrouped data: If x1, x2, ..., xn is a set of n observations or values, then the
mean deviation is defined as:
i. Mean Deviation about mean,
\[ \text{M.D.}(\bar{x}) = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} ; \quad \bar{x} = \text{Arithmetic Mean} \]
ii. Mean Deviation about median,
\[ \text{M.D.}(Me) = \frac{\sum_{i=1}^{n} |x_i - Me|}{n} ; \quad Me = \text{Median} \]
iii. Mean Deviation about mode,
\[ \text{M.D.}(Mo) = \frac{\sum_{i=1}^{n} |x_i - Mo|}{n} ; \quad Mo = \text{Mode} \]
Example: Find out the mean deviation about mean / median / mode from the following data:
6 9 4 2 11 4 7 14 4
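The three mean deviations for this data set can be computed as sketched below. The helper `mean_deviation` is a name introduced here for illustration; the averages come from Python's standard `statistics` module.

```python
from statistics import mean, median, mode


def mean_deviation(values, centre):
    """Mean of the absolute deviations of the values from the chosen average."""
    return sum(abs(x - centre) for x in values) / len(values)


data = [6, 9, 4, 2, 11, 4, 7, 14, 4]

md_mean = mean_deviation(data, mean(data))      # about the arithmetic mean
md_median = mean_deviation(data, median(data))  # about the median (6)
md_mode = mean_deviation(data, mode(data))      # about the mode (4)

print(round(md_mean, 4), md_median, round(md_mode, 4))
```

For this data the deviation about the median is the smallest of the three, which is a general property of the median.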
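For grouped data: If x1, x2, ..., xn occur with frequencies f1, f2, ..., fn respectively and
\(N = \sum_{i=1}^{n} f_i\), then the mean deviation is defined as:
i. Mean Deviation about mean,
\[ \text{M.D.}(\bar{x}) = \frac{\sum_{i=1}^{n} f_i |x_i - \bar{x}|}{N} ; \quad \bar{x} = \text{Arithmetic Mean} \]
ii. Mean Deviation about median,
\[ \text{M.D.}(Me) = \frac{\sum_{i=1}^{n} f_i |x_i - Me|}{N} ; \quad Me = \text{Median} \]
iii. Mean Deviation about mode,
\[ \text{M.D.}(Mo) = \frac{\sum_{i=1}^{n} f_i |x_i - Mo|}{N} ; \quad Mo = \text{Mode} \]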
Example: The weights of 100 students of level 2 term 2 are given below:
Uses of M.D.:
It is used in certain economic and anthropological studies.
It is often sufficient when an informal measure of dispersion is required.
Standard Deviation: The standard deviation is the positive square root of the mean of the
squared deviations of a set of values from their mean. It is generally denoted by σ or SD.
For ungrouped data: Let x1, x2, ..., xn be a set of values; then their standard
deviation can be calculated as:
By direct method,
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2} \quad \text{[by simplification]} \]
By short-cut method,
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} d_i^2}{n} - \left(\frac{\sum_{i=1}^{n} d_i}{n}\right)^2} ; \quad \text{Here, } d_i = x_i - A ; \; A = \text{Assumed Mean} \]
Example: A sample of households that subscribe to the Grameen Phone company revealed
the following numbers of calls received last week. Determine the standard deviation of
number of calls received: 16 12 14 15 18
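The sketch below works this example with all three forms of the formula and shows that they agree. It follows the formulas of these notes, which divide by n; the function names and the assumed mean A = 14 are illustrative choices.

```python
import math


def sd_direct(values):
    """Direct method: square root of the mean squared deviation from the mean."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / n)


def sd_simplified(values):
    """Simplified form: sqrt(sum(x^2)/n - (sum(x)/n)^2)."""
    n = len(values)
    return math.sqrt(sum(x * x for x in values) / n - (sum(values) / n) ** 2)


def sd_shortcut(values, assumed_mean):
    """Short-cut method with d_i = x_i - A."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    return math.sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)


calls = [16, 12, 14, 15, 18]

sigma_direct = sd_direct(calls)
sigma_simplified = sd_simplified(calls)
sigma_shortcut = sd_shortcut(calls, assumed_mean=14)

print(sigma_direct, sigma_simplified, sigma_shortcut)  # each equals 2.0
```

The choice of the assumed mean A does not change the result; it only makes hand calculation easier.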
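For grouped data: If x1, x2, ..., xn occur with frequencies f1, f2, ..., fn
respectively, then the standard deviation can be calculated as:
By direct method,
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{N}} ; \quad \text{Here, } N = \sum_{i=1}^{n} f_i \]
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{N} - \left(\frac{\sum_{i=1}^{n} f_i x_i}{N}\right)^2} \quad \text{[by simplification]} \]
By short-cut method,
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i d_i^2}{N} - \left(\frac{\sum_{i=1}^{n} f_i d_i}{N}\right)^2} \times C ; \quad \text{Here, } d_i = \frac{x_i - A}{C} \]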
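As a check on the grouped formulas, the sketch below applies the direct and short-cut methods to the class/frequency table given earlier in these notes (classes 120-140 to 220-240). It assumes that x_i is the class mid-point and that C is the common class width; the notes do not state either explicitly here, so both are assumptions, as is the assumed mean A = 170.

```python
import math

# Assumption: mid-points of the classes 120-140, ..., 220-240 represent x_i.
mid = [130, 150, 170, 190, 210, 230]
freq = [23, 35, 60, 40, 25, 17]
N = sum(freq)

# Direct method.
x_bar = sum(f * x for f, x in zip(freq, mid)) / N
sigma_direct = math.sqrt(sum(f * (x - x_bar) ** 2 for f, x in zip(freq, mid)) / N)

# Short-cut method with d_i = (x_i - A) / C.
A = 170   # assumed mean (any convenient value works)
C = 20    # assumption: C is the common class width
d = [(x - A) / C for x in mid]
sum_fd = sum(f * v for f, v in zip(freq, d))
sum_fd2 = sum(f * v * v for f, v in zip(freq, d))
sigma_shortcut = math.sqrt(sum_fd2 / N - (sum_fd / N) ** 2) * C

print(x_bar, round(sigma_direct, 4), round(sigma_shortcut, 4))
```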
Uses of S.D.:
\[ \text{C.V.} = \frac{\sigma}{\bar{x}} \times 100 ; \quad \bar{x} \neq 0 \]
Example: Compute the co-efficient of variation from the following data:
Example: The run scores of two cricketers for 10 innings are given below:
Uses of C.V.:
It is suitable when two or more distributions are in different units.
It is also suitable when the means of the distributions differ widely even though the units are the same.
Moments, Skewness and Kurtosis
Moments: Moments are popularly used to describe the characteristics of a distribution. The
Greek letter µ (read as mu) is generally used to denote the moments. There are two types of
moments.
1. Central moments
2. Raw moments
For ungrouped data: The r-th moment of a variable X about the arithmetic mean \(\bar{X}\) is given
by
\[ \mu_r = \frac{1}{n} \sum (X - \bar{X})^r \]
The r-th moment of a variable X about any arbitrary point A is given by
\[ \mu_r' = \frac{1}{n} \sum (X - A)^r \]
For grouped data: The r-th moment of a variable X about the arithmetic mean \(\bar{X}\) is given by
\[ \mu_r = \frac{1}{N} \sum f (X - \bar{X})^r \]
The r-th moment of a variable X about any arbitrary point A is given by
\[ \mu_r' = \frac{1}{N} \sum f (X - A)^r \]
For different values of r, we shall get different moments. Thus if r = 1, we will get the first
moment, if we put r = 2, we will get second moment and so on.
Moments about mean: Moments about the mean are also called central moments.
Note: The first moment about the origin gives the mean, the second central moment gives the
variance, the third central moment relates to skewness and the fourth central moment to kurtosis.
Finding central moments from moments about an arbitrary point: With the help of the following
relationships, moments about an arbitrary point can be converted to moments about the mean:
\[ \mu_1 = 0 \]
\[ \mu_2 = \mu_2' - (\mu_1')^2 \]
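These relationships can be checked numerically. The sketch below reuses the nine observations from the mean deviation example and an arbitrary point A = 5 (both chosen here purely for illustration). The third-order identity μ3 = μ3' − 3μ2'μ1' + 2(μ1')³ is the standard one and is included as an assumption, since it is not legible in this part of the notes.

```python
def moment_about(values, r, point):
    """r-th moment of the values about the given point."""
    return sum((x - point) ** r for x in values) / len(values)


data = [6, 9, 4, 2, 11, 4, 7, 14, 4]
x_bar = sum(data) / len(data)
A = 5  # arbitrary point, chosen only for illustration

# Moments about the arbitrary point A (raw moments).
raw1 = moment_about(data, 1, A)
raw2 = moment_about(data, 2, A)
raw3 = moment_about(data, 3, A)

# Central moments computed directly from the definition.
mu1 = moment_about(data, 1, x_bar)
mu2 = moment_about(data, 2, x_bar)
mu3 = moment_about(data, 3, x_bar)

# Central moments recovered from the raw moments.
mu2_from_raw = raw2 - raw1 ** 2
mu3_from_raw = raw3 - 3 * raw2 * raw1 + 2 * raw1 ** 3  # standard identity (assumed)

print(abs(mu1) < 1e-9, abs(mu2 - mu2_from_raw) < 1e-9, abs(mu3 - mu3_from_raw) < 1e-9)
```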
Skewness: The term “Skewness” refers to lack of symmetry or departure from symmetry, e.g.,
when a distribution is not symmetrical (or is asymmetrical) it is called a skewed
distribution. The measures of Skewness indicate the difference between the manner in which
the observations are distributed in a particular distribution and the manner in which they are
distributed in a symmetrical (or normal) distribution.
In a symmetrical distribution the values of mean, median and mode are alike. Hence the two
tails are of equal length. In a skewed distribution the values differ.
Figure: Positively Skewed Distribution
Negative Skewness: If the value of mode is greater than mean, Skewness is said to be
negative. Hence the left tail is longer than the right tail. In this case, mean < median < mode.
Coefficients of Skewness: For comparing two series, we do not calculate the absolute
measures but the relative measures, called the co-efficients of Skewness, which are
pure numbers, i.e., independent of the units of measurement. The following are two important
methods of measuring relative Skewness:
1. Karl Pearson’s coefficient of Skewness: Karl Pearson’s coefficient of Skewness or
Pearsonian coefficient of Skewness is given by the formula:
\[ S.K_P = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \]
\[ \beta_1 = \frac{\mu_3^2}{\mu_2^3} \]
The value of β1 will be zero for a perfectly symmetrical distribution. Instead of β1, Karl
Pearson suggested γ1 to be used as a measure of skewness, where
\[ \gamma_1 = \sqrt{\beta_1} = \frac{\mu_3}{\sqrt{\mu_2^3}} \]
Where \(\beta_2 = \frac{\mu_4}{\mu_2^2}\) and \(\gamma_2 = \beta_2 - 3\).
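A short sketch of how β1, γ1, β2 and γ2 might be computed from the central moments is given below. The data set is the nine observations used in the mean deviation example; the function names are illustrative. Note that γ1 is computed as μ3/√(μ2³), so it carries the sign of μ3.

```python
import math


def central_moment(values, r):
    """r-th moment of the values about their arithmetic mean."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** r for x in values) / len(values)


def shape_coefficients(values):
    """Return (beta1, gamma1, beta2, gamma2) based on central moments."""
    mu2 = central_moment(values, 2)
    mu3 = central_moment(values, 3)
    mu4 = central_moment(values, 4)
    beta1 = mu3 ** 2 / mu2 ** 3
    gamma1 = mu3 / math.sqrt(mu2 ** 3)   # signed square root of beta1
    beta2 = mu4 / mu2 ** 2
    gamma2 = beta2 - 3
    return beta1, gamma1, beta2, gamma2


data = [6, 9, 4, 2, 11, 4, 7, 14, 4]
beta1, gamma1, beta2, gamma2 = shape_coefficients(data)
print(beta1, gamma1, beta2, gamma2)

# A perfectly symmetrical data set gives beta1 = 0.
symmetric = shape_coefficients([1, 2, 3, 4, 5])
print(symmetric[0])
```

For the sample data γ1 is positive, which agrees with mean > median > mode for those observations.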
Correlation and Regression
Correlation: Correlation is a statistical technique which measures and analyses the degree or
extent to which two or more variables fluctuate with reference to one another.
Correlation thus denotes the interdependence amongst variables. The degree is expressed by
a coefficient which ranges between -1 and +1. The direction of change is indicated by the + or –
sign.
Correlation thus expresses the relationship through a relative measure of change and it has
nothing to do with the units in which the variables are expressed.
Simple correlation: When only two variables are studied for calculating correlation then it is
known as simple correlation.
Example: The correlation between height and weight of a series.
Multiple correlation: When three or more variables are studied for calculating correlation it
is known as multiple correlation.
Example: If we consider the correlation between the academic results of a student and his/her
family size, family income then this problem can be termed as multiple correlation.
Different types of simple correlation: Simple correlation can be of five types on the basis of the
nature of the interrelationship between the two variables:
i. Perfect positive (direct) correlation
ii. Partial positive (direct) correlation
iii. Perfect negative (inverse) correlation
iv. Partial negative (inverse) correlation
v. Zero correlation
These are described below.
Perfect positive correlation: If the two variables change in the same direction, i.e. if an
increase (or decrease) in one variable results in a corresponding increase (or decrease) in the other,
and the rates of change of both variables are equal, then the correlation between
the two variables is called perfect positive correlation. In this case, the value of the
correlation coefficient between the two variables is 1, i.e. r = 1.
Example: The relation between radius and circumference is a perfect positive correlation.
Partial positive correlation: If the two variables change in the same direction, i.e. if an
increase (or decrease) in one variable results in a corresponding increase (or decrease) in the
other, but the rates of change of the two variables are not equal, then the relationship
between the two variables is called partial positive correlation. In this case, 0 < r < 1.
Example: The relation between income and expenditure is a partial positive correlation.
Perfect negative correlation: If the two variables change in opposite directions, i.e. if an
increase (or decrease) in one variable results in a corresponding decrease (or increase) in the other,
and the rates of change of both variables are equal, then the correlation between
the two variables is called perfect negative correlation. In this case, the value of the
correlation coefficient between the two variables is -1, i.e. r = -1.
Example: The relation between the pressure and the volume of a gas is commonly given as an
example of perfect negative correlation.
Partial negative correlation: If the two variables change in opposite directions, i.e. if an
increase (or decrease) in one variable results in a corresponding decrease (or increase) in the
other, but the rates of change of the two variables are not equal, then the relationship
between the two variables is called partial negative correlation. In this case, -1 < r < 0.
Example: The relation between price and demand of a commodity is partial negative
correlation.
Zero correlation: If the changes are independent, i.e. if an increase (or decrease) in one variable
produces no corresponding change in the other, then the relation between the two
variables is called zero correlation. In this case, r = 0.
Example: There is no correlation between the colour of a sharee and the intelligence of the woman who wears it.
Methods of Determining Correlation: We shall consider the following most commonly used
methods.
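\[ r_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left\{\sum (x - \bar{x})^2\right\}\left\{\sum (y - \bar{y})^2\right\}}} \]
\[ r_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left\{\sum x^2 - \frac{(\sum x)^2}{n}\right\}\left\{\sum y^2 - \frac{(\sum y)^2}{n}\right\}}} \quad \text{[by simplification]} \]
The value of the coefficient of correlation obtained by the above formula always lies
between -1 and +1.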
Example: The following data show the ages of husbands and wives of 10 married couples.
Husband 36 72 37 36 51 50 47 50 37 41
Wife 35 67 33 35 50 46 47 42 36 41
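The coefficient for this example can be computed with either form of the formula, as in the following sketch. The function names are illustrative; both functions implement the two expressions for r given above and should agree up to rounding.

```python
import math


def r_from_deviations(x, y):
    """Correlation coefficient from deviations about the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_simplified(x, y):
    """Correlation coefficient from raw sums (the simplified form)."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / math.sqrt(sxx * syy)


husband = [36, 72, 37, 36, 51, 50, 47, 50, 37, 41]
wife = [35, 67, 33, 35, 50, 46, 47, 42, 36, 41]

r_dev = r_from_deviations(husband, wife)
r_raw = r_simplified(husband, wife)
print(round(r_dev, 4), round(r_raw, 4))  # both about 0.97
```

A value this close to +1 indicates a high degree of positive correlation between the ages of husbands and wives.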
Example2: The following data consist of observations for the weights of 10 different
automobiles (in 1000 pounds) and the corresponding fuel consumptions (gallons per 100
miles).
Weight (x) 3.4 3.8 4.1 2.2 2.6 2.9 2.0 2.7 1.9 3.4
Fuel Consumption (y) 5.5 5.9 6.5 3.3 3.6 4.6 2.9 3.6 3.1 4.9
Spearman’s Rank Correlation coefficient: The association between two series of ranks is
called rank correlation. The method of ascertaining the coefficient of correlation by ranks was
devised by Charles Edward Spearman in 1904. This method is especially useful when
the actual magnitudes or item values are not given and only their ranks in the series are
known. Spearman’s rank correlation coefficient, usually denoted by ρ (rho), is given by the
formula:
\[ \rho = 1 - \frac{6 \sum d_i^2}{n^3 - n} = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]
Where d stands for the difference between the pair of ranks and n for the number of paired
observations.
The value of Spearman’s rank correlation coefficient ranges between -1 and +1. When ρ is +1,
there is a perfect concordance between the rankings and the ranks are in the same direction.
When ρ is -1, there is also a perfect concordance between the rankings but the ranks are in the
opposite direction.
Example: Two managers are asked to rank a group of employees in order of potential for
eventually becoming top managers .The rankings are as follows:
Employee A B C D E F G H I J
Ranked by manager I 10 2 1 4 3 6 5 8 7 9
Ranked by Manager II 9 4 2 3 1 5 6 8 7 10
Compute the coefficient of rank correlation and comment on the value.
Example: Calculate the rank correlation coefficient for the following data of marks of 2 tests
given to candidates for a clerical job:
Preliminary test: 92 89 87 86 83 77 71 63 53 50
Final test : 86 83 91 77 68 85 52 82 37 57
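Both rank correlation examples can be worked with the sketch below. The helper `ranks_descending` is an illustrative name; it assigns rank 1 to the highest mark and assumes there are no tied values, which holds for the marks above.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def ranks_descending(scores):
    """Rank 1 for the highest score. Assumes no ties."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


# Example 1: rankings by the two managers (employees A to J).
manager_1 = [10, 2, 1, 4, 3, 6, 5, 8, 7, 9]
manager_2 = [9, 4, 2, 3, 1, 5, 6, 8, 7, 10]
rho_managers = spearman_rho(manager_1, manager_2)

# Example 2: marks in the two tests are converted to ranks first.
preliminary = [92, 89, 87, 86, 83, 77, 71, 63, 53, 50]
final = [86, 83, 91, 77, 68, 85, 52, 82, 37, 57]
rho_tests = spearman_rho(ranks_descending(preliminary), ranks_descending(final))

print(round(rho_managers, 4), round(rho_tests, 4))  # 0.9152 0.7333
```

The two managers agree closely in their rankings, while the agreement between the preliminary and final tests is more moderate.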
Uses of correlation:
i. Economic theory and business studies deal with relationships between variables such as price and
quantity demanded, advertising expenditure and sales promotion measures, etc.
Correlation analysis helps in deriving precisely the degree and direction of such
relationships.
ii. The concepts of regression are also based upon the measure of correlation.
In a regression analysis there are two types of variables. The variable whose value is influenced or
is to be predicted is called the dependent variable (also the regressed, predicted or explained variable), and
the variable which influences the values or is used for prediction is called the independent
variable (also the regressor, predictor or explanatory variable).
Example: The relationships between two variables can be considered between say rainfall
and agricultural production, price of an output and the overall cost of product, consumer
expenditure and disposable income.
Simple Regression: When the dependent variable is estimated from only one
independent variable, it is called simple regression.
Regression equation (for simple regression): Regression equations are algebraic expressions
of the regression lines. Since there are two regression lines, the regression equation of X on Y
is used to describe the variation in the values of X for given changes in Y, and the regression
equation of Y on X is used to describe the variation in the values of Y for given changes in X.
The regression equation of Y on X is expressed as follows:
Y = a + bX
The regression equation of X on Y is expressed as follows:
X = c + dY
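The two regression lines can be illustrated with the automobile weight and fuel consumption data given earlier in this chapter. The notes do not state in this passage how a, b, c and d are obtained; the sketch below assumes the usual least-squares estimates (slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²), so treat the fitting method as an assumption rather than the method of the notes.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # X, in 1000 pounds
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # Y, gallons per 100 miles

a, b = least_squares_line(weight, fuel)   # Y on X:  Y = a + bX
c, d = least_squares_line(fuel, weight)   # X on Y:  X = c + dY

print(f"Y = {a:.3f} + {b:.3f} X")
print(f"X = {c:.3f} + {d:.3f} Y")
print(f"b * d = {b * d:.3f}")  # equals r squared for least-squares lines
```

Under this assumption the product of the two slopes equals r², which ties the regression coefficients back to the correlation coefficient.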
Multiple Regression: When there is more than one independent variable, it is called
multiple regression.
Regression equation (for multiple regression): The equation for multiple regression is given
below:
Yi = a + b1X1 + b2X2 + ... + bnXn
Regression line: If the variables in a bivariate distribution are related, we will find that the points
in the scatter diagram cluster around some curve called the “curve of regression”. If the
curve is a straight line, it is called the line of regression and there is said to be linear
regression between the variables; otherwise the regression is said to be curvilinear. The line of
regression is the line which gives the best estimate of the value of one variable for any
specific value of the other variable.
Example: The data of sales and promotion expenditure on a product for 10 years are given
below
Note: If the value of r = 0.9, r² will be 0.81, and this would mean that 81% of the variation in the
dependent variable has been explained by the independent variable.
Correlation versus Regression
1. Correlation: The correlation co-efficient rxy is a relative measure of the linear relationship between X and Y and is independent of the units of measurement. It is a pure number lying between -1 and +1.
   Regression: The regression co-efficients byx (bxy) are absolute measures representing the change in the value of the variable Y (X) for a unit change in the variable X (Y).
2. Correlation: Correlation analysis has limited applications as it is confined only to the study of the linear relationship between the variables.
   Regression: Regression analysis studies linear as well as non-linear relationships between the variables and therefore has much wider applications.
3. Correlation: The correlation coefficient is symmetric, i.e. rxy = ryx.
   Regression: The regression co-efficients are not symmetric in X and Y, i.e. bxy ≠ byx.
4. Correlation: The range of r is -1 ≤ r ≤ 1.
   Regression: The range of byx (bxy) is -∞ < byx (bxy) < ∞.
Comments on coefficient of correlation
Values of r Comments
r = -1 Perfect negative correlation
-1 < r ≤ -0.8 Higher degree of negative correlation
-0.8 < r < -0.2 Moderate degree of negative correlation
-0.2 ≤ r < 0 Lower degree of negative correlation
r=0 Zero correlation
0 < r ≤ 0.2 Lower degree of positive correlation
0.2 < r < 0.8 Moderate degree of positive correlation
0.8 ≤ r < 1 Higher degree of positive correlation
r=1 Perfect positive correlation
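The table above can be turned into a small lookup function. The sketch below is one possible encoding; it relies on the fact that the table is symmetric about zero, so only the magnitude and the sign of r are needed. The function name `comment_on_r` is illustrative.

```python
def comment_on_r(r):
    """Return the comment from the table for a correlation coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "Zero correlation"
    sign = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        return f"Perfect {sign} correlation"
    if size >= 0.8:
        return f"Higher degree of {sign} correlation"
    if size > 0.2:
        return f"Moderate degree of {sign} correlation"
    return f"Lower degree of {sign} correlation"


print(comment_on_r(0.9))    # Higher degree of positive correlation
print(comment_on_r(-0.5))   # Moderate degree of negative correlation
```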