Fy Bba Unit 1
Introduction:
The views commonly held about statistics are numerous, but often incomplete. It has different meanings to different
people depending largely on its use. For example, (i) for a cricket fan, statistics refers to numerical information or data
relating to the runs scored by a cricketer; (ii) for an environmentalist, statistics refers to information on the quantity of
pollution released into the atmosphere by all types of vehicles in different cities; (iii) for the census department,
statistics consists of information about the birth rate per thousand and the sex ratio in different states; (iv) for a share
broker, statistics is the information on changes in share prices over a period of time; and so on.
A definition of statistics is given by Croxton and Cowden, who define statistics in the singular sense. This definition
also treats statistics as statistical method. According to Croxton and Cowden, statistics may be defined as the science of
collection, presentation, analysis and interpretation of numerical data.
This definition has pointed out four stages of statistical investigation, to which one more stage ‘organization of data’
rightly deserves to be added. Accordingly, statistics may be defined as the science of collecting, organizing, presenting,
analyzing, and interpreting numerical data for making better decisions.
Types of data:
The collected data are of two types (i) Qualitative data (ii) Quantitative data
Qualitative Data:
Data classified according to some qualitative phenomenon that is not capable of quantitative measurement, such as
honesty, beauty, employment, intelligence, occupation, sex or literacy, are termed qualitative
data. The qualitative phenomena under study are known as Attributes.
For example,
(i) The population is divided into two classes according to the presence or absence of an attribute: male and female,
honest and dishonest, employed and unemployed, beautiful and not beautiful.
(ii) The population is classified into more than two classes. For the attribute “Intelligence”, the various classes may be,
say, genius, very intelligent, average intelligent, below average and dull.
(iii) The population is first classified by sex into two classes, males and females. Each of these is then classified by
smoking habit into smokers and non-smokers, and each of the resulting four classes is further classified with respect to
a third attribute, religion, into two classes, Hindu and non-Hindu.
Quantitative data:
Data classified on the basis of a phenomenon that is capable of quantitative measurement, such as age, height,
weight, prices, production, income, expenditure, sales or profits, are called quantitative data. The quantitative
phenomenon under study is known as a Variable.
Variables are of two kinds: (i) Continuous variable. (ii) Discrete variable (Discontinuous variable).
(i) Those variables which can take all the possible values (integral as well as fractional) in a given specified range are
termed as continuous variables.
For example, the age of students in a school (Nursery to Higher Secondary) is a continuous variable because age
can take all possible values (as it can be measured to the nearest fraction of time : years, months, days, minutes,
seconds, etc.), in a certain range, say, from 3 years to 20 years.
More precisely a variable is said to be continuous if it is capable of passing from any given value to the next value
by infinitely small gradations.
(ii) On the other hand those variables which cannot take all the possible values within a given specified range are
termed as discrete (discontinuous) variables.
For example, family size (members in a family), the population of a city, the number of accidents on the road, the
number of typing mistakes per page and so on.
Scales of measurements:
The data are categorized using different scales of measurements. Each level of measurement scale has specific
properties that determine the various use of statistical analysis. There are four different scales of measurement. The
data can be defined as being one of the four scales. The four types of scales are:
1. Nominal Scale
2. Ordinal Scale
3. Interval Scale
4. Ratio Scale
1. Nominal Scale: 1st Level of Measurement
• Definition:
Nominal Scale, also called the categorical variable scale, is defined as a scale used for labeling variables into
distinct classifications and doesn’t involve a quantitative value or order. This scale is the simplest of the four
variable measurement scales. Calculations done on these variables will be futile as there is no numerical value of
the options.
It is a foundation of quantitative research and the most basic of the research scales.
The sequence in which subgroups are listed makes no difference as there is no relationship among subgroups. A
subgroup of nominal scale with only two categories (e.g. male/female) is called “dichotomous.”
• Nominal Scale Data and Analysis:
There are two primary ways in which nominal scale data can be collected:
(i) By asking an open-ended question, the answers to which can be coded to a number or label
decided by the researcher.
(ii) By including a multiple choice question in which the answers
are labeled.
In both cases, the gathered data are analyzed using percentages or the mode, i.e., the most
common answer received for the question. A single question can have more than one mode, since
two equally common favorites can exist in a target population.
• Nominal Scale Examples:
Nominal scale is often used in research surveys and questionnaires where only variable labels hold significance.
(1) For instance, a customer survey asking, “Which brand of smart phones do you prefer?”
Options: “Apple”- 1 , “Samsung”-2, “OnePlus”-3.
In this survey question, only the names of the brands are significant for the researcher conducting
consumer research. There is no need for any specific order for these brands. However, while capturing
nominal data, researchers conduct analysis based on the associated labels.
In the above example, when a survey respondent selects Apple as their preferred brand, the data entered
and associated will be “1”. This helps in quantifying and answering the final question: how many
respondents selected Apple, how many selected Samsung, how many went for OnePlus, and which
one is the highest.
(2) What is your Gender?
Options: “Male”-1, “Female”-2
(3) What is your Political preference?
Options: 1- Independent, 2- Democrat, 3- Republican
(4) Where do you live?
Options: 1- Suburbs, 2- City, 3- Town
Summary of Levels of Measurement
Offers: Nominal Ordinal Interval Ratio
Frequency distribution:
The organization of the data pertaining to a quantitative phenomenon involves the following four stages:
(1) The set or series of individual observations - unorganized (raw) or organized (arrayed) data.
(2) Discrete or ungrouped frequency distribution.
(3) Grouped frequency distribution.
(4) Continuous frequency distribution
(1) Array. A better presentation of the above raw data would be to arrange them in an ascending or descending order
of magnitude which is called the ‘arraying’ of the data. However, this presentation (arraying), though better than
the raw data does not reduce the volume of the data.
(2) Discrete or ungrouped frequency distribution: A much better way of representing the data is to express it
in the form of a discrete or ungrouped frequency distribution, where we count the number of times each value of the
variable occurs in the data. This is facilitated through the technique of tally bars. If the variable takes values in
a wide (large) range, the data still remain unwieldy and need further processing for statistical analysis.
Example
The following data show the total number of overtime hours worked in each of 30 consecutive weeks by machinists in a
machine shop. The data are displayed in raw form:
91 89 88 89 90 92 93 88 87 85 88 93 91 93 91
93 92 88 92 90 93 84 93 84 91 93 85 91 89 92
Represent the above information by appropriate frequency distribution.
Solution:
Variable (X): Number of overtime hours per week
Frequency (𝑓): Number of weeks, N= no. of weeks = 30
Maximum observation: 93, Minimum observation: 84
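The counting step for this example can be sketched in Python. This is only an illustration of the tally technique described above: the variable names are chosen here, and the tally column simply prints one bar per occurrence.

```python
from collections import Counter

# Weekly overtime hours from the example above (30 weeks).
overtime_hours = [
    91, 89, 88, 89, 90, 92, 93, 88, 87, 85, 88, 93, 91, 93, 91,
    93, 92, 88, 92, 90, 93, 84, 93, 84, 91, 93, 85, 91, 89, 92,
]

# Count how many times each distinct value occurs (the frequency f).
frequency = Counter(overtime_hours)

print("X (hours)  Tally        f")
for value in sorted(frequency):
    f = frequency[value]
    print(f"{value:<10} {'|' * f:<12} {f}")
print(f"Total      {'':<12} {sum(frequency.values())}")
```

The frequencies add up to N = 30, which is a quick check that no observation was missed while counting.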
(3) Grouped frequency distribution: If the identity of the units about which information is collected is not
relevant, nor is the order in which the observations occur, then the first real step is to classify the data into
different classes (or class intervals) by dividing the entire range of the values of the variable into a suitable number
of groups, called classes, and then recording the number of observations in each class. The length of a class
interval is called the width of the class. The two values specifying a class are called the class limits; the larger
value is called the upper class limit and the smaller value is called the lower class limit. Here the classes are of
inclusive form, so both the upper and the lower limit are included in the respective class. This type of class is
generally used for a discrete variable.
Example
A computer company received a rush order for as many home computers as could be shipped during a six-week
period. Company records provide the following daily shipments:
22 65 65 67 55 50 65 77 73 30 62 54 48 65 79 60 63 45 51 68 79
83 33 41 49 28 55 61 65 75 55 75 39 87 45 50 66 65 59 25 35 53
Represent the above information by appropriate frequency distribution.
Solution:
Variable: Number of computers shipped per day (discrete variable, make inclusive classes)
Frequency: Number of days
Maximum observation: 87; Minimum observation: 22
N = total no. of days during six weeks = 42
Number of classes: k = 6, since $2^5 = 32 < 42$ and $2^6 = 64 > 42$.
Class interval $= \frac{87 - 22}{6} \approx 11$
Grouped frequency distribution
of computers shipped per day during the six-week period in the computer company
Classes    Tally marks          No. of days
22-32      ||||                 4
33-43      ||||                 4
44-54      ||||| ||||           9
55-65      ||||| ||||| ||||     14
66-76      ||||| |              6
77-87      |||||                5
Total                           42
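The construction of these inclusive classes can be sketched in Python as follows. The sketch assumes, as in the solution above, that k is the smallest integer with 2^k ≥ N, that the class width is the range divided by k and rounded up, and that the first class starts at the minimum observation; the variable names are illustrative.

```python
import math

# Daily shipments from the example above (42 days).
shipments = [
    22, 65, 65, 67, 55, 50, 65, 77, 73, 30, 62, 54, 48, 65, 79, 60, 63, 45, 51, 68, 79,
    83, 33, 41, 49, 28, 55, 61, 65, 75, 55, 75, 39, 87, 45, 50, 66, 65, 59, 25, 35, 53,
]

n = len(shipments)

# Smallest k such that 2**k >= n (the rule used in the solution above).
k = 1
while 2 ** k < n:
    k += 1

lowest, highest = min(shipments), max(shipments)
width = math.ceil((highest - lowest) / k)  # (87 - 22) / 6 rounded up

# Inclusive classes: both limits belong to the class.
classes = [(lowest + i * width, lowest + (i + 1) * width - 1) for i in range(k)]
counts = [sum(lower <= x <= upper for x in shipments) for lower, upper in classes]

for (lower, upper), count in zip(classes, counts):
    print(f"{lower}-{upper}: {count}")
print("Total:", sum(counts))
```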
(4) Continuous frequency distribution: While dealing with a continuous variable it is not desirable to present the
data in a grouped frequency distribution with classes like 0-9, 10-19, 20-29, etc., because this classification does not
take into consideration the observations between 9 and 10, 19 and 20, and so on. In such situations one should form
continuous class intervals like 0-10, 10-20, 20-30, etc. The presentation of the data in continuous classes with
corresponding frequencies is known as a continuous frequency distribution.
Example
The following data represent the annual family expenses (in thousands of rupees) on food items in a city.
13.8 14.1 14.7 15.2 16.8 15.6 14.9 16.7 19.2 14.9 14.9 14.9 15.2 15.9
15.2 14.8 14.8 19.1 14.6 18.0 14.9 14.2 14.1 15.3 15.5 18.0 17.2 17.2
14.1 14.5 18.0 14.4 14.2 14.6 14.2 14.8
Represent the above information by appropriate frequency distribution.
Solution:
Variable(X): Annual family expense in thousands of rupees
(Continuous Variable, so make exclusive classes)
Frequency (𝑓): Number of families
N = total no. of families = 36, Minimum Observation: 13.8; Maximum Observation: 19.2.
Number of classes: k = 6, since $2^5 = 32 < 36$ and $2^6 = 64 > 36$.
Class interval $= \frac{19.2 - 13.8}{6} \approx 1$
• Relative frequency distribution:
The frequency of each class can also be expressed as a fraction or a percentage of the total. These are known as relative
frequencies. In other words, a relative frequency is the class frequency expressed as a ratio of the total frequency.
$\text{Relative frequency} = \frac{\text{Class frequency}}{\text{Total frequency}}$
For Example:
Consumption of electricity   Number of    Relative
(in Kilowatt)                factories    frequency
20-30                        18           0.18
30-40                        18           0.18
40-50                        25           0.25
50-60                        22           0.22
60-70                        17           0.17
Total                        100          1
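As a small illustration, the relative frequencies in this table can be reproduced in Python by dividing each class frequency by the total frequency (the list names are chosen here for the sketch):

```python
# Class labels and frequencies from the electricity consumption table above.
classes = ["20-30", "30-40", "40-50", "50-60", "60-70"]
factories = [18, 18, 25, 22, 17]

total = sum(factories)

# Relative frequency = class frequency / total frequency.
relative = [f / total for f in factories]

for label, f, r in zip(classes, factories, relative):
    print(f"{label:<8} {f:>4} {r:>6.2f} {r * 100:>5.0f}%")
print(f"{'Total':<8} {total:>4} {sum(relative):>6.2f}")
```

The relative frequencies always add up to 1 (or 100%), whatever the total frequency is.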
(a) Less than type: Here the cumulative frequencies (C.F.) are obtained by adding successive class frequencies from top
to bottom. Each cumulative frequency is stated with reference to the upper limit of its class.
For example:
(b) More than type: Here the cumulative frequencies are obtained by adding class frequencies from bottom to top.
Each cumulative frequency is stated with reference to the lower limit of its class.
For example:
Consumption of electricity   Number of        More than (C.F.)
(in Kilowatt)                factories (f)
20-30                        18               More than 20 = 100
30-40                        18               More than 30 = 82
40-50                        25               More than 40 = 64
50-60                        22               More than 50 = 39
60-70                        17               More than 60 = 17
                                              More than 70 = 0
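Both types of cumulative frequency can be computed with a running sum. The sketch below uses the electricity consumption data from the table above; the list names are illustrative.

```python
from itertools import accumulate

# Class limits and frequencies from the electricity consumption table.
lower_limits = [20, 30, 40, 50, 60]
upper_limits = [30, 40, 50, 60, 70]
factories = [18, 18, 25, 22, 17]

# Less than type: add the frequencies from top to bottom.
less_than = list(accumulate(factories))

# More than type: add the frequencies from bottom to top.
more_than = list(accumulate(reversed(factories)))[::-1]

for upper, cf in zip(upper_limits, less_than):
    print(f"Less than {upper} = {cf}")
for lower, cf in zip(lower_limits, more_than):
    print(f"More than {lower} = {cf}")
```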
Graphical presentation of data:
In Frequency distribution graphs data are presented by
(1) Histograms,
(2) Frequency Polygons,
(3) Frequency Curves,
(4) Ogives.
(1) Histograms
- It consists of a number of vertical rectangles that are adjacent to one another.
- For drawing a histogram, class intervals are taken on the X-axis and frequency density on the Y-axis, so that the area
of each rectangle represents the frequency of that class. In the case of equal class intervals, for simplicity,
frequencies are taken on the Y-axis.
- Histograms can’t be constructed for frequency distributions with open end classes unless we assume that the
magnitude of the first open class is same as that of the succeeding (second) class and the magnitude of the
last open class is same as that of the preceding (i.e., last but one) class.
- The purpose of drawing histogram is to locate Mode (measure of central tendency) graphically and to
comment about the nature of frequency distribution whether it is positively skewed, negatively skewed or
symmetric.
The technique of constructing histogram is as follows:
1. For ungrouped frequency distribution: Erect a vertical line at each value of the variable, with height
equal to the corresponding frequency.
For example:
The following data shows the number of accidents sustained by 314 drivers of a public utility company over a
period of five years.
No. of accident 0 1 2 3 4 5 6 7 8 9 10 11
No. of drivers 82 44 68 41 25 20 13 7 5 4 3 2
Frequency distribution is nearer to symmetry. Mode can be between 40 and 50.
Frequency distribution is negatively skewed. Mode can be between 20 and 25.
(4) Cumulative Frequency Distribution (Ogives)
- Sometimes it is preferable to present data in a cumulative frequency (Cf) distribution or simply a
distribution which shows the cumulative number of observations below the upper boundary (limit) of
each class in the given frequency distribution. For example, at a time we are interested in knowing how
many workers of a factory earn less than Rs. 700 per month or how many workers earn more than Rs.
1,000 per month, percentage of students who have failed etc. To answer these questions it is necessary
to add the frequencies. When frequencies are added they are called cumulative frequencies. Then a table
of cumulative frequencies is drawn, which when plotted on a graph paper is called the cumulative
frequency curve or more popularly known as 'Ogive'.
- A cumulative frequency distribution is of two types: (i) more than type and (ii) less than type.
- Less than cumulative frequency: In the less than method we start with upper limits of class and go on
adding the frequencies. When these frequencies are plotted we get a rising curve.
- More than cumulative frequency: Here, we start with lower limit and go on subtracting the frequencies
of each class. When these frequencies are plotted a decreasing curve will be obtained.
- The median can be located graphically using an ogive.
For example:
The following table gives the distribution of monthly income of 600 families in a certain city.
Measures of Central Tendency
One of the important objectives of statistical analysis is to determine various numerical measures which describe the
inherent characteristics of a frequency distribution. The first of such measures is average. The averages are the
measures which condense a huge unwieldy set of numerical data into single numerical values which are representative
of the entire distribution. The numerical value (also called the central value) around which most of the
other observations in the data set show a tendency to cluster or group is called the central tendency.
• Requisites of an Ideal measure of Central Tendency
An ideal measure of central tendency should satisfy the following requirements:
1. It should be rigidly defined: The definition of an average should be rigid so that there must be uniformity in its
interpretation by different users or investigators.
2. It should be easy to understand and calculate: The value of an average should be calculated by using simple
methods without reducing its accuracy and other advantages.
3. It should be based on all observations: Since it represents the entire data set, it must be computed using all the
observations.
4. It should be suitable for further mathematical treatment: This means that, if the averages of certain groups are known,
then it is possible to calculate their combined average without knowing the actual observations in all groups. For
example, it should be possible to determine the average production in a particular year from the average
production in each month of the year.
5. It should be affected as little as possible by fluctuations of sampling: This means that it should have sampling
stability. That is, the values of an average calculated from various independent random samples of the same size
from a given population should not vary much from one another.
6. It should not be affected much by extreme observations: The value of an average should not be affected by
very small or very large observations in the given data.
1. Mathematical Averages
a) Arithmetic Mean
i. Simple Mean:
It is the quantity obtained by dividing the sum of all observations by the total number of observations. If X is the
variable involved, then the arithmetic mean of X is abbreviated as A.M. of X and denoted by $\bar{x}$.
(a) Raw data (data without any statistical treatment): If $x_1, x_2, \ldots, x_n$ are n observations of random variable X,
then the arithmetic mean (or mean) is denoted by $\bar{x}$ and is given by
$\bar{x} = \frac{\sum x}{n}$
(b) Discrete ungrouped frequency distribution:
If $x_1, x_2, \ldots, x_n$ are n distinct observations of discrete variable X with frequencies $f_1, f_2, \ldots, f_n$ respectively, then
$\bar{x} = \frac{\sum f x}{N}$, where $N = \sum f$.
Merits:
1. It is rigidly defined.
2. It is easy to understand.
3. It is simple to compute.
4. It is based upon all the observations.
5. It is capable of further algebraic treatment (i.e., it is possible to find a combined mean).
6. It is least affected by sampling fluctuations.
Demerits:
1. It is very much affected by extreme values. (i.e., too high and too low values).
2. It can’t be calculated when end classes are open-ended.
3. It can’t be located graphically.
4. The mean cannot be calculated for qualitative characteristics such as intelligence, honesty, beauty, or
loyalty.
Properties of Arithmetic Mean (or Mean):
1. The sum of the deviations of all observations from their arithmetic mean is always zero,
i.e., $\sum (x - \bar{x}) = 0$ for raw data, and
$\sum f (x - \bar{x}) = 0$ for frequency data.
2. The sum of squared deviations of all the observations is minimum when the deviations are taken about their arithmetic
mean,
i.e., $\sum (x - \bar{x})^2 < \sum (x - A)^2$ for raw data, and
$\sum f (x - \bar{x})^2 < \sum f (x - A)^2$ for frequency data.
Here A is any value except $\bar{x}$.
3. If each individual observation in the data is replaced by a constant, then the mean is the constant itself. That is, if
$x_i = c$ for all $i$, then $\bar{x} = c$.
4. $\sum x = n\bar{x}$ for raw data, and
$\sum f x = N\bar{x}$ for frequency data.
5. The arithmetic mean depends on both change of origin and change of scale.
That is, if a fixed number is subtracted from (or added to) each observation, then the mean is
diminished (or increased) by the same number, and if each observation is divided (or multiplied) by a fixed number, then
the mean is divided (or multiplied) by the same number,
i.e., if $Y = a + bX$ then $\bar{Y} = a + b\bar{X}$.
6. If $\bar{x}_1$ and $\bar{x}_2$ are the arithmetic means of two groups of $N_1$ and $N_2$ observations, then the combined
mean of these two groups can be computed by
$\bar{x}_{12} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2}$
This can be generalized in the same way to more than two groups of observations having
different arithmetic means:
$\bar{x}_c = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2 + \ldots + N_k \bar{x}_k}{N_1 + N_2 + \ldots + N_k}$
7. The arithmetic mean of the first $n$ natural numbers is $\frac{n + 1}{2}$.
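These properties can be checked numerically. The sketch below uses two small hypothetical groups of observations (they are not data from these notes) and verifies the zero sum of deviations, the minimum sum of squared deviations, the effect of a change of origin and scale, the combined mean, and the mean of the first n natural numbers.

```python
from statistics import mean

# Two hypothetical groups of observations, chosen only for illustration.
group1 = [4, 8, 6]
group2 = [10, 12, 14, 16]

m1, m2 = mean(group1), mean(group2)

# Property 1: deviations from the mean add up to zero.
deviation_sum = sum(x - m1 for x in group1)

# Property 2: squared deviations about the mean are smaller than about any other value A.
A = 5  # any value other than the mean
ss_about_mean = sum((x - m1) ** 2 for x in group1)
ss_about_A = sum((x - A) ** 2 for x in group1)

# Property 5: if Y = a + bX then mean(Y) = a + b * mean(X).
a, b = 2, 3
mean_y = mean(a + b * x for x in group1)

# Property 6: combined mean from the group sizes and group means only.
n1, n2 = len(group1), len(group2)
combined = (n1 * m1 + n2 * m2) / (n1 + n2)

# Property 7: mean of the first n natural numbers is (n + 1) / 2.
n = 10
mean_first_n = mean(range(1, n + 1))

print(deviation_sum, ss_about_mean, ss_about_A, mean_y, combined, mean_first_n)
```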
ii. Weighted Arithmetic Mean:
In the computation of the simple arithmetic average the assumption is that all the items in the distribution are of equal
importance. However, in practice, it is possible to come across situations where the relative importance of the items
of the distribution is not the same. In such cases, due weightage is given to the various items and a weighted mean is
computed. For example, if it is desired to have an idea of the change in the cost of living of a certain group of
people, then the simple arithmetic average of the prices of the commodities consumed by the people will not do,
as all commodities are not equally important; e.g., items like wheat, rice, pulses, fuel, housing, lighting, etc. are
more important than cigarettes, confectionery, cosmetics, etc. Hence, different items should be assigned weights
according to their relative importance for the computation of the mean, which will then be a weighted mean.
Let $w_1, w_2, \ldots, w_n$ be the weights assigned to the variable values $x_1, x_2, \ldots, x_n$ respectively. Then the
weighted arithmetic mean, usually denoted by $\bar{x}_w$, is given by
$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \ldots + w_n x_n}{w_1 + w_2 + \ldots + w_n} = \frac{\sum w x}{\sum w}$
In the case of a frequency distribution, if $f_1, f_2, \ldots, f_n$ are the frequencies of the variable values
$x_1, x_2, \ldots, x_n$ respectively, then the weighted arithmetic mean is given by
$\bar{x}_w = \frac{w_1 (f_1 x_1) + w_2 (f_2 x_2) + \ldots + w_n (f_n x_n)}{w_1 + w_2 + \ldots + w_n} = \frac{\sum w f x}{\sum w}$
The weighted arithmetic mean should be used
1. When the importance of all the numerical values in the given data set is not equal.
2. When the frequencies of various classes are widely varying.
3. When there is a change either in the proportion of numerical values or in the proportion of their frequencies.
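A short Python sketch of the raw-data formula $\bar{x}_w = \sum wx / \sum w$ follows. The prices and weights are hypothetical values made up for the illustration; they only show how the weighted mean can differ from the simple mean.

```python
# Hypothetical prices of three commodities and their importance weights.
prices = [30.0, 45.0, 120.0]
weights = [5, 3, 2]

# Weighted mean: sum of (weight * value) divided by the sum of the weights.
weighted_mean = sum(w * x for w, x in zip(weights, prices)) / sum(weights)

# Simple mean for comparison: every item counts equally.
simple_mean = sum(prices) / len(prices)

print("Weighted mean:", weighted_mean)
print("Simple mean:  ", simple_mean)
```

Here the cheapest item carries the largest weight, so the weighted mean falls below the simple mean.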
b) Geometric Mean:
The geometric mean is the nth root of the product of n observations.
G.M. $= \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$ ; for raw data
G.M. $= \sqrt[N]{x_1^{f_1} \cdot x_2^{f_2} \cdots x_n^{f_n}}$ ; for discrete grouped data
G.M. $= \sqrt[N]{m_1^{f_1} \cdot m_2^{f_2} \cdots m_n^{f_n}}$ ; for data in class form, where $m_i$ represents the mid value of the ith class
(ii) If any one of the observations is negative, then the G.M. may be imaginary.
Application:
(i) The concept of G.M. is used in the construction of Index number.
(ii) Since G.M. ≤ A.M., therefore G.M. is useful in those cases where smaller observations are to be given
importance. Such cases usually occur in social and economic areas of study.
(iii) The G.M. of a data set is useful in estimating the average rate of growth in the initial value of an
observation per unit period. For example, it is useful in finding the percentage increase in sales, profit,
production, population, and so on.
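The use of the G.M. for an average rate of growth can be sketched as follows. The three yearly growth rates (10%, 20% and 32%) are hypothetical, and each rate is first converted to a growth factor before the geometric mean is taken.

```python
import math

# Hypothetical yearly sales growth of 10%, 20% and 32%, written as growth factors.
growth_factors = [1.10, 1.20, 1.32]
n = len(growth_factors)

# Geometric mean: nth root of the product of the n growth factors.
gm = math.prod(growth_factors) ** (1 / n)

# Average growth rate per year, in percent.
average_growth_rate = (gm - 1) * 100

# Arithmetic mean of the same factors, for comparison (G.M. <= A.M.).
am = sum(growth_factors) / n

print(f"G.M. of growth factors: {gm:.4f}")
print(f"Average growth rate:    {average_growth_rate:.2f}% per year")
print(f"A.M. of growth factors: {am:.4f}")
```

Growing at the G.M. rate every year gives the same final value as the three actual rates applied one after another, which is why the G.M. rather than the A.M. is used here.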
c) Harmonic Mean:
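Harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the given observations.
H.M. $= \frac{1}{\frac{1}{n}\left(\frac{1}{x_1} + \frac{1}{x_2} + \ldots + \frac{1}{x_n}\right)} = \frac{n}{\sum \frac{1}{x}}$ ; for raw data
H.M. $= \frac{1}{\frac{1}{N}\left(\frac{f_1}{x_1} + \frac{f_2}{x_2} + \ldots + \frac{f_n}{x_n}\right)} = \frac{N}{\sum \frac{f}{x}}$ ; for frequency data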
Note:
i) If any observation is Zero then H.M is not defined.
ii) Harmonic mean is especially useful in averaging rates and ratios where the time factor is variable and the act
being performed stays the same, e.g., for finding the average speed of vehicles, typists, etc.
Application: The harmonic mean is particularly useful for computation of average rates and ratios. Such rates and
ratios are generally used to express relations between two different types of measuring units that can be
expressed reciprocally.
- Relationship among A.M., G.M. and H.M.
For any set of positive observations, its A.M., G.M. and H.M. are related to each other by the relationship
𝐴𝑀 ≥ 𝐺𝑀 ≥ 𝐻𝑀
The sign of ‘=’ holds if and only if all the observations are identical.
Note: (i) If the observations in a data set take the values $a, ar, ar^2, \ldots, ar^{n-1}$, each with single frequency,
then $(G.M.)^2 = A.M. \times H.M.$
(ii) If a variable assumes only two values, then $(G.M.)^2 = A.M. \times H.M.$
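The relationship among the three means can be verified numerically. In the sketch below the two speeds and the geometric progression are hypothetical examples, and the three helper functions are written here directly from the raw-data formulas.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))

def harmonic_mean(values):
    return len(values) / sum(1 / x for x in values)

# Hypothetical speeds (km/h) over two equal distances: the H.M. gives the average speed.
speeds = [40, 60]
am, gm, hm = arithmetic_mean(speeds), geometric_mean(speeds), harmonic_mean(speeds)
print(f"A.M. = {am:.2f}, G.M. = {gm:.2f}, H.M. = {hm:.2f}")

# Hypothetical geometric progression a, ar, ar^2, ar^3 with a = 3 and r = 2.
progression = [3, 6, 12, 24]
am_p = arithmetic_mean(progression)
gm_p = geometric_mean(progression)
hm_p = harmonic_mean(progression)
print(f"(G.M.)^2 = {gm_p ** 2:.4f}, A.M. x H.M. = {am_p * hm_p:.4f}")
```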
2. Averages of Position:
(a) Mode: Mode is the value which occurs most frequently in a set of observation and around which the other items
of the set cluster densely.
For example, if a sandwich shop sells 10 different types of sandwiches, the mode would represent the
most popular sandwich.
(a) Raw data: M0 = that value of the variable which occurs most frequently in the data set.
(b) Discrete ungrouped frequency distribution: M0 = that value of variable which corresponds to highest
frequency.
(c) Continuous frequency distribution : First find modal class that is the class having highest frequency.
$M_0 = L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times c$
Where L: lower limit or boundary of modal class
f 0 : Frequency of class above modal class
f1 : Frequency of modal class
f 2 : Frequency of class below modal class
c : Class width of modal class
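As an illustration, the formula can be applied in Python to the electricity consumption distribution used earlier in these notes (classes 20-30 to 60-70). Treating a missing neighbouring class as having frequency zero is an assumption of this sketch.

```python
# Electricity consumption distribution used earlier in these notes.
lower_limits = [20, 30, 40, 50, 60]
frequencies = [18, 18, 25, 22, 17]
c = 10  # class width

# Modal class: the class with the highest frequency.
m = frequencies.index(max(frequencies))

L = lower_limits[m]
f1 = frequencies[m]
# Assumption of this sketch: a missing neighbouring class has frequency 0.
f0 = frequencies[m - 1] if m > 0 else 0
f2 = frequencies[m + 1] if m < len(frequencies) - 1 else 0

mode = L + (f1 - f0) * c / (2 * f1 - f0 - f2)
print("Modal class:", f"{L}-{L + c}")
print("Mode:", mode)
```

Here the modal class is 40-50 with f1 = 25, f0 = 18 and f2 = 22, so the mode is 40 + (7/10) × 10 = 47.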
Mode is especially useful in finding the most popular size in studies relating to marketing, trade, business and
industry. It is the appropriate average to be used to find the ideal size e.g., in business forecasting, in the
manufacture of shoes or readymade garments, in sales, in production, etc.
If two or more values occur the same (highest) number of times, then two or more modes exist. If the data have
only one mode the distribution is said to be uni-modal, if they have two modes it is said to be bi-modal, and if
they have more than two modes it is said to be multi-modal.
Merits:
1. It is easy to calculate, easy to understand.
2. It is not affected by extreme values.
3. It can be determined in open-end classes.
4. It can be represented graphically by Histogram.
5. It is the most suitable average for finding the ideal size. For example, it is used for comparing consumer
preferences for various types of products, say soaps, cigarettes, toothpastes or other products, and in the
manufacture of readymade garments, shoes, etc.
Demerits:
1. It is not based on all the observations.
2. It is not capable of further algebraic treatment.
3. As compared to the mean and the median, it is affected to a greater extent by sampling fluctuations.
(b) Median: The median is that value of variable which divides the group in two equal parts, one part comprising
all the values greater and the other, all values less than median. Since its value depends on the position
occupied by a value in the frequency distribution it is also known as positional measure of central tendency.
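(a) Raw data: First arrange the observations in ascending (increasing) order.
$M_e$ = value of the $\left(\frac{n + 1}{2}\right)$th observation ; if n is odd
$M_e = \frac{\left(\frac{n}{2}\right)\text{th observation} + \left(\frac{n}{2} + 1\right)\text{th observation}}{2}$ ; if n is even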
(b) Discrete ungrouped frequency distribution: First find the cumulative frequency (C.F) less than type.
M e = that value of variable which corresponds to C.F. just greater than (N/2)
(c) Continuous frequency distribution:
- First find C.F less than type.
- Find the median class, i.e., the class having C.F just greater than (N/2).
$M_e = L + \frac{\frac{N}{2} - C_f}{f} \times c$
Where, L : Lower limit or boundary of median class
C f : Cumulative frequency of class above median class
f : Frequency of median class
c : Class width of median class
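The steps above can be sketched in Python using the electricity consumption distribution from earlier in these notes. The sketch picks the first class whose cumulative frequency reaches N/2; when a cumulative frequency equals N/2 exactly, this gives the same value as taking the next class.

```python
from itertools import accumulate

# Electricity consumption distribution used earlier in these notes.
lower_limits = [20, 30, 40, 50, 60]
frequencies = [18, 18, 25, 22, 17]
c = 10  # class width

N = sum(frequencies)
cumulative = list(accumulate(frequencies))  # less than type C.F.

# Median class: first class whose C.F. reaches N/2.
m = next(i for i, cf in enumerate(cumulative) if cf >= N / 2)

L = lower_limits[m]
Cf = cumulative[m - 1] if m > 0 else 0  # C.F. of the class above the median class
f = frequencies[m]

median = L + (N / 2 - Cf) * c / f
print("Median class:", f"{L}-{L + c}")
print("Median:", median)
```

With N = 100 the median class is 40-50 (C.F. 61), so the median is 40 + ((50 − 36)/25) × 10 = 45.6.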
Note:
Other positional measures
1. Quartiles
- The values which divide the given data into four equal parts are known as Quartiles.
- There will be three such points Q1, Q2 and Q3, Such that Q1 ≤ Q2 ≤ Q3.
- Quartiles divide a rank-ordered data set into four equal parts. There are three quartiles called, first
quartile, second quartile and third quartile. The second quartile (Q 2) is equal to the median. The first
quartile is also called lower quartile and is denoted by Q 1. The third quartile is also called upper quartile
and is denoted by Q3.
- The lower quartile Q1 is a point which has 25% observations less than it and 75% observations are above it.
- The upper quartile Q3 is a point with 75% observations below it and 25% observations above it.
(a) Raw Data (Quartile for Individual Observations):
If x1, x2, …, xn are n observations of random variable X, then
$Q_1$ = value of the $\left(\frac{n + 1}{4}\right)$th observation
$Q_2$ = value of the $2\left(\frac{n + 1}{4}\right)$th observation
$Q_3$ = value of the $3\left(\frac{n + 1}{4}\right)$th observation
(b) Discrete ungrouped frequency distribution (when the data follow a discrete set of values grouped by
size):
If x1, x2, …, xn are n distinct observations of random variable X with frequencies f1, f2, …, fn respectively, then
$Q_i$ = value of the $i\left(\frac{N + 1}{4}\right)$th observation, where i = 1, 2, 3.
(c) Continuous frequency distribution (When data arranged in tabular form containing different groups):
If L1 – U1, L2 – U2, …, Ln – Un are n exhaustive and exclusive classes of random variable X with
frequencies f1, f2, …, fn respectively, then
➢ First find C.F less than type and find the ith quartile class, i.e., the class having C.F just greater than [i(N/4)].
Then the ith quartile is given by $Q_i = L + \frac{\frac{iN}{4} - C_f}{f} \times c$, where i = 1, 2, 3.
Where, L: Lower limit or boundary of i th quartile class
Cf : Cumulative frequency of class above i th quartile class
f : Frequency of i th quartile class
c : Class width of i th quartile class
2. Deciles
The deciles are the partition values which divide the set of observations into ten equal parts. We have nine
deciles, denoted respectively by D1, D2, …, D9.
The first decile D1 is a point which has 10% of the observations below it.
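(a) Raw Data: If x1, x2, …, xn are n observations of random variable X, then the ith decile is given by
$D_i$ = value of the $i\left(\frac{n + 1}{10}\right)$th observation, where i = 1, 2, …, 9.
Therefore, $D_1$ = value of the $\left(\frac{n + 1}{10}\right)$th observation (first decile),
$D_2$ = value of the $2\left(\frac{n + 1}{10}\right)$th observation (second decile),
…………
$D_9$ = value of the $9\left(\frac{n + 1}{10}\right)$th observation (ninth decile).
For a continuous frequency distribution, the ith decile lies in the ith decile class, i.e., the class having C.F just
greater than [i(N/10)], and is given by
$D_i = L + \frac{\frac{iN}{10} - C_f}{f} \times c$, where i = 1, 2, …, 9,
with L, $C_f$, f and c defined for the ith decile class in the same way as for the quartile class. For example,
$D_2 = L + \frac{\frac{2N}{10} - C_f}{f} \times c$
…………
$D_9 = L + \frac{\frac{9N}{10} - C_f}{f} \times c$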
3. Percentiles: Percentiles divide the series into hundred equal parts.
There are ninety-nine percentiles, P1, P2, …, P99, such that
P1 ≤ P2 ≤ … ≤ P99. Pi has i% of the items less than it.
(a) Raw data: First arrange the observations in ascending (increasing) order.
$P_i$ = value of the $i\left(\frac{n + 1}{100}\right)$th observation, i = 1, 2, …, 99
(b) Discrete ungrouped frequency distribution: First find the cumulative frequency (C.F) less than type.
Pi = that value of variable which corresponds to C.F. just greater than [i(N/100)], i= 1,2,…,99.
(c) Continuous frequency distribution: First find C.F less than type.
Find the ith percentile class, i.e., the class having C.F just greater than [i(N/100)].
$P_i = L + \frac{\frac{iN}{100} - C_f}{f} \times c$, i = 1, 2, …, 99
Where, L : Lower limit or boundary of i th percentile class
C f : Cumulative frequency of class above i th percentile class
f : Frequency of i th percentile class
c : Class width of i th percentile class
- Relation between median, Quartiles, Deciles and percentiles.
Median = Q2 = D5 = P50
Q1 = P25, Q3 = P75
D1 = P10, D2 = P20, … D9 = P90
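These relations can be checked with one general function for a continuous frequency distribution. The function name `partition_value` and its parameters are introduced here only for illustration; it applies the common form L + ((iN/parts − Cf)/f) × c to the electricity consumption distribution used earlier in these notes, with parts = 2, 4, 10 or 100 for the median, quartiles, deciles or percentiles.

```python
from itertools import accumulate

def partition_value(i, parts, lower_limits, frequencies, width):
    """i-th partition value when the data are divided into `parts` equal parts."""
    N = sum(frequencies)
    target = i * N / parts
    cumulative = list(accumulate(frequencies))  # less than type C.F.
    # First class whose C.F. reaches the target position.
    k = next(j for j, cf in enumerate(cumulative) if cf >= target)
    Cf = cumulative[k - 1] if k > 0 else 0
    return lower_limits[k] + (target - Cf) * width / frequencies[k]

# Electricity consumption distribution used earlier in these notes.
lower_limits = [20, 30, 40, 50, 60]
frequencies = [18, 18, 25, 22, 17]
width = 10

median = partition_value(1, 2, lower_limits, frequencies, width)
q1, q2, q3 = (partition_value(i, 4, lower_limits, frequencies, width) for i in (1, 2, 3))
d1 = partition_value(1, 10, lower_limits, frequencies, width)
d5 = partition_value(5, 10, lower_limits, frequencies, width)
p10, p25, p50, p75 = (
    partition_value(i, 100, lower_limits, frequencies, width) for i in (10, 25, 50, 75)
)

print(f"Median = {median:.2f}, Q2 = {q2:.2f}, D5 = {d5:.2f}, P50 = {p50:.2f}")
print(f"Q1 = {q1:.2f}, P25 = {p25:.2f}; Q3 = {q3:.2f}, P75 = {p75:.2f}")
print(f"D1 = {d1:.2f}, P10 = {p10:.2f}")
```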
➢ Summary of various location measures or the measure of central tendency
Property                              Arithmetic mean   Median                         Mode               Geometric mean    Harmonic mean
1. Rigidly defined                    Yes               Yes                            Not very           Yes               Yes
2. Based on all values of series      Yes               No                             No                 Yes               Yes
3. Easy to calculate and understand   Yes               Quite                          Quite              Difficult         Difficult
4. Amenable to algebraic treatment    Yes               No                             No                 Yes               Yes
5. Effect of sample variations        Stable            Moderate                       Moderate           Moderate          Moderate
6. Effect of extreme values           Large             None                           None               Very low          Very low
7. Most useful in                     General purpose   Least net discomfort problem   Typifying series   Averaging rates   Averaging ratios
Measures Of Dispersion
One important characteristic of a distribution is its central tendency, which gives one single value that represents the
entire data. Another important characteristic of a distribution is the dispersion of the data. Dispersion
means scatteredness, spread or variation of the observations. The averages alone cannot adequately describe
a set of observations, unless all the observations are the same. It is necessary to describe the variability or
dispersion of the observations. In two or more distributions the central value may be the same,
but there can still be wide disparities in the formation of the distributions. The extent to which the individual
observations differ on an average from mean or any other measure of central value is called measure of dispersion
or measure of variation. As these measures give an average of the differences of the observations included in a
group from an average of these items, they are also known as “averages of second order”. Note that Measures of
central values are, therefore, called the “averages of first order”.
Definition of Dispersion: According to Spiegel – “The degree to which numerical data tend to spread about an
average value is called the variation or dispersion of the data.”
Different Measures of Dispersion
For the study of dispersion, we need some measures which show whether the dispersion is small or large. There
are two types of measure of dispersion which are:
a) Absolute Measure of Dispersion
b) Relative Measure of Dispersion
a) Absolute Measures of Dispersion
These measures give us an idea about the amount of dispersion in a set of observations. They give the answers in
the same units as the units of the original observations. When the observations are in kilograms, the absolute
measure is also in kilograms. If we have two sets of observations, we cannot always use the absolute measures to
compare their dispersion. We shall explain later as to when the absolute measures can be used for comparison of
dispersion in two or more than two sets of data.
The absolute measures which are commonly used are:
1. The Range (R)
2. The Quartile Deviation (Q.D)
3. The Mean Deviation (M.D)
4. The Standard deviation (S.D) and Variance
b) Relative Measure of Dispersion
These measures are calculated for the comparison of dispersion in two or more than two sets of observations.
These measures are free of the units in which the original data are measured. If the original data are in dollars or
kilometers, we do not use these units with the relative measure of dispersion. These measures are a sort of ratio and
are called coefficients. Each absolute measure of dispersion can be converted into its relative measure.
Thus the relative measures of dispersion are:
1. Coefficient of Range or Coefficient of Dispersion.
2. Coefficient of Quartile Deviation or Quartile Coefficient of Dispersion.
3. Coefficient of Mean Deviation or Mean Deviation of Dispersion.
4. Coefficient of Variation (C.V.)
Absolute measures vs. relative measures of variation
Measures of dispersion may be either absolute or relative. Absolute measures of dispersion are expressed in the
same statistical unit in which the original data are given such as rupees, kilograms, kilometers etc. These values
may be used to compare the variations in two distributions provided the variables are expressed in the same units
and are of the same average size. If the two sets of data are expressed in different units, such as
quintals of sugar versus tonnes of sugarcane, or if the average size is very different, such as managers’ salaries versus
workers’ salaries, the absolute measures of dispersion are not comparable. In such cases measures of relative
dispersion should be used.
A relative measure of dispersion is the ratio of an absolute measure of dispersion to an appropriate average. It is
called a coefficient of dispersion, because “coefficient” means a pure number that is independent of the unit of
measurement. It should be remembered that while computing the relative dispersion the average used as base
should be the same one from which the absolute deviations were measured.
1. Range
In any statistical series, the difference between the largest and the smallest values is called the range.
Thus
Range (R) = L – S
where L = largest value of the observations
S = smallest value of the observations
Merits of mean deviation
1. It is easy to calculate and simple to understand.
2. It is less affected by the extreme values of the variable.
3. It is based on all observations in the distribution.
Demerits of mean deviation
1. It ignores the negative signs of the deviations and treats them as positive, which is not justified mathematically.
2. It is not a satisfactory measure when the deviations are taken from the mode.
3. It is not suitable when the class intervals are of the open-end type.
4. The mean deviation cannot be used in statistical inference.
2. It is not capable of further mathematical treatment.
3. It is greatly affected by the fluctuations in the sampling.
4. Standard Deviation
Note: Standard deviation is the best measure of variation because:
a. Its value is based on all the observations in a set of data.
b. It is the only measure of variation capable of algebraic treatment.
c. It is less affected by sampling fluctuations than other measures of variation.
d. It has a definite relationship with the area under the symmetric curve of a frequency
distribution.
Coefficient of Variation (C. V.)
To compare the variation (dispersion) of two different series, a relative measure of standard deviation must be
calculated. This is known as the coefficient of variation or the coefficient of s.d. Its formula is
$C.V. = \dfrac{s.d.}{\bar{x}} \times 100$
Thus it is defined as the ratio of the s.d. to the mean, expressed as a percentage.
Remark: The C.V. is used to compare the consistency or variability of two or more series. The
higher the C.V., the higher the variability; the lower the C.V., the higher the consistency of the data.
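As a small illustration of how the C.V. is used to compare consistency, the following Python sketch computes it for two series. The scores are hypothetical, the function name `coefficient_of_variation` is chosen here for illustration, and the s.d. is assumed to be the one with divisor N (`statistics.pstdev`).

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """C.V. = (s.d. / mean) * 100, with the s.d. computed using divisor N."""
    return pstdev(values) / mean(values) * 100


# Hypothetical scores of two batsmen with the same mean (50)
batsman_a = [40, 50, 60]
batsman_b = [10, 50, 90]

cv_a = coefficient_of_variation(batsman_a)
cv_b = coefficient_of_variation(batsman_b)

# The series with the lower C.V. is the more consistent one
more_consistent = "A" if cv_a < cv_b else "B"
print(round(cv_a, 2), round(cv_b, 2), more_consistent)
```

Both series have the same mean, so the means alone cannot separate them; the smaller C.V. of batsman A marks him as the more consistent of the two.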
Measures of skewness:
Introduction
Voluminous raw data cannot be easily understood; hence, we calculate measures of central tendency to
obtain a representative figure. From the measures of variability, we can know whether most of the items of the
data are close to or away from this central value. But these averages and measures of variation are not
enough to describe the data sufficiently. Another aspect of the data is its symmetry, which
is studied through "skewness".
The literal meaning of skewness is ‘lack of symmetry’. The study of skewness helps to give an idea about the shape of the
curve which can be drawn from the given frequency distribution. If the frequency curve of the distribution is
not a symmetric bell-shaped curve but is stretched more to one side than the other, it is called a skewed distribution.
A frequency distribution for which the curve has a longer tail towards the right is said to be positively skewed, and if the
longer tail lies towards the left, it is said to be negatively skewed.
- Symmetric distribution: For a symmetric distribution the curve falls at the same rate on both sides of the highest peak. Thus
the frequency curve has tails of equal length on either side of the mean. For such a curve
Mean = Median = Mode
- Positively skewed distribution: For a positively skewed distribution the curve rises rapidly, reaches the maximum and
falls slowly. In other words, if the frequency curve has a longer tail to the right, the distribution is known as a positively
skewed distribution, and for a positively skewed distribution
Mean > Median > Mode.
- Negatively skewed distribution: For a negatively skewed distribution the curve rises slowly, reaches its maximum and
falls rapidly. In other words, if the frequency curve has a longer tail to the left, the distribution is known as a negatively
skewed distribution, and for a negatively skewed distribution
Mean < Median < Mode.
Measures of Skewness
A measure which gives the extent of asymmetry is known as a measure of ‘skewness’. Measures of
skewness are categorized into two types.
(i) Absolute measures (ii) Relative measures
(i) Absolute Measures:
Sk = (Mean – Mode)
Sk = (Mean –Median)
Sk = Q3+ Q1 – 2 Median
Absolute measures are of limited practical use because they are expressed in the units of measurement and hence cannot be
used to compare distributions measured in different units. Even when the
units are the same, one may come across distributions which have more or less identical absolute
measures but which vary widely in their measures of central tendency and dispersion.
(ii) Relative Measures
For comparing the skewness of two or more distributions, we compute relative measures of skewness, called
coefficients of skewness, which are pure numbers independent of the units of measurement.
1. Karl Pearson’s Coefficient of skewness:
$Sk = \dfrac{\text{Mean} - \text{Mode}}{s.d.} = \dfrac{\bar{x} - M_o}{s.d.}$
If the mode is not uniquely defined, then
$Sk = \dfrac{3(\text{Mean} - \text{Median})}{s.d.} = \dfrac{3(\bar{x} - M_e)}{s.d.}$
2. Bowley’s coefficient of skewness: In the case of an open-end distribution, Bowley’s coefficient of skewness is used.
$Sk = \dfrac{Q_3 + Q_1 - 2M_e}{Q_3 - Q_1}$
If,
Sk > 0 then the distribution is positively skewed
Sk < 0 then the distribution is negatively skewed
Sk = 0 then the distribution is symmetric.
The limits for Bowley’s coefficient of skewness are −1 ≤ Sk ≤ +1.
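The two relative measures can be written as short Python functions. This is a minimal sketch: the function names are chosen here for illustration, and the inputs (mean, mode, median, s.d., quartiles) are assumed to have been computed already.

```python
def pearson_skewness(mean, mode, sd):
    """Karl Pearson's coefficient: (Mean - Mode) / s.d."""
    return (mean - mode) / sd


def pearson_skewness_from_median(mean, median, sd):
    """Alternative form when the mode is ill-defined: 3(Mean - Median) / s.d."""
    return 3 * (mean - median) / sd


def bowley_skewness(q1, median, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2 Median) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


def describe(sk):
    """Read off the direction of skewness from the sign of the coefficient."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetric"


# Illustrative values: mean 47, mode 32, median 42, s.d. 30
sk_pearson = pearson_skewness(47, 32, 30)
# Illustrative quartiles: Q1 = 10, median = 20, Q3 = 25
sk_bowley = bowley_skewness(10, 20, 25)
print(sk_pearson, describe(sk_pearson))
print(sk_bowley, describe(sk_bowley))
```

With these illustrative values the Pearson coefficient is 0.5 (positively skewed), while the quartile-based coefficient is about -0.33 (negatively skewed); the two examples use unrelated data.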
Uses of Skewness
1. It helps in finding out the nature and degree of concentration, i.e. whether it is in the higher or the lower values.
2. The empirical relationship between mean, median and mode, Mean – Mode = 3(Mean – Median), is based on the
assumption of a moderately skewed distribution. The measure of skewness shows to what extent
such an empirical relationship holds good.
3. It helps in knowing whether the distribution is normal or not. Many statistical measures are based on the
assumption of a normal distribution.
Moments
Introduction:
Moment is a familiar mechanical term for the measure of a force with reference to its tendency to produce
rotation. In statistics, moments are used to describe the various characteristics of a frequency distribution such as
central tendency, variation, skewness and kurtosis.
Different types of moments:
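(1) Central Moment:
The arithmetic mean of the various powers of the deviations of the observations from the arithmetic mean
of a distribution is called a moment of the distribution. These moments about the mean are called "central moments".
Symbolically, the rth moment about the A.M. ($\bar{x}$) is termed the rth central moment, denoted by $\mu_r$ and defined as
$\mu_r = \dfrac{1}{N}\sum (x - \bar{x})^r$ for raw data
$\mu_r = \dfrac{1}{N}\sum f (x - \bar{x})^r$ for frequency data
Remark : $\mu_0 = 1$
First four central moments are
$\mu_1 = \dfrac{1}{N}\sum (x - \bar{x})$ for raw data, $\mu_1 = \dfrac{1}{N}\sum f (x - \bar{x})$ for frequency data; in both cases $\mu_1 = 0$
$\mu_2 = \dfrac{1}{N}\sum (x - \bar{x})^2$ for raw data, $\mu_2 = \dfrac{1}{N}\sum f (x - \bar{x})^2$ for frequency data; $\mu_2$ is the variance
$\mu_3 = \dfrac{1}{N}\sum (x - \bar{x})^3$ for raw data, $\mu_3 = \dfrac{1}{N}\sum f (x - \bar{x})^3$ for frequency data
$\mu_4 = \dfrac{1}{N}\sum (x - \bar{x})^4$ for raw data, $\mu_4 = \dfrac{1}{N}\sum f (x - \bar{x})^4$ for frequency data
(2) Raw Moment:
The rth moment about an arbitrary point A is termed the rth raw moment, denoted by $\mu'_r$ and defined as
$\mu'_r = \dfrac{1}{N}\sum (x - A)^r$ for raw data
$\mu'_r = \dfrac{1}{N}\sum f (x - A)^r$ for frequency data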
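First four raw moments are
$\mu'_1 = \dfrac{1}{N}\sum (x - A)$ for raw data, $\mu'_1 = \dfrac{1}{N}\sum f (x - A)$ for frequency data
$\mu'_2 = \dfrac{1}{N}\sum (x - A)^2$ for raw data, $\mu'_2 = \dfrac{1}{N}\sum f (x - A)^2$ for frequency data
$\mu'_3 = \dfrac{1}{N}\sum (x - A)^3$ for raw data, $\mu'_3 = \dfrac{1}{N}\sum f (x - A)^3$ for frequency data
$\mu'_4 = \dfrac{1}{N}\sum (x - A)^4$ for raw data, $\mu'_4 = \dfrac{1}{N}\sum f (x - A)^4$ for frequency data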
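(3) Moments about origin:
The rth moment about the origin (A = 0), written $\nu_r$ here, is defined as
$\nu_r = \dfrac{1}{N}\sum x^r$ for raw data
$\nu_r = \dfrac{1}{N}\sum f x^r$ for frequency data
$\nu_0 = 1$
First four moments about origin are
$\nu_1 = \dfrac{1}{N}\sum x$ for raw data, $\nu_1 = \dfrac{1}{N}\sum f x$ for frequency data; in both cases $\nu_1 = \bar{x}$
$\nu_2 = \dfrac{1}{N}\sum x^2$ for raw data, $\nu_2 = \dfrac{1}{N}\sum f x^2$ for frequency data
$\nu_3 = \dfrac{1}{N}\sum x^3$ for raw data, $\nu_3 = \dfrac{1}{N}\sum f x^3$ for frequency data
$\nu_4 = \dfrac{1}{N}\sum x^4$ for raw data, $\nu_4 = \dfrac{1}{N}\sum f x^4$ for frequency data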
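The three kinds of moments differ only in the point from which deviations are taken, which the following Python sketch makes explicit. The helper names `moment` and `central_moment` and the sample data are assumptions made for illustration; for frequency data the class mid-values are used as x.

```python
def moment(values, r, about=0.0, freqs=None):
    """r-th moment about the point `about`: (1/N) * sum of f * (x - about)^r.
    With freqs=None every value is given frequency 1 (raw data)."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    return sum(f * (x - about) ** r for x, f in zip(values, freqs)) / n


def central_moment(values, r, freqs=None):
    """r-th moment about the arithmetic mean."""
    mean = moment(values, 1, 0.0, freqs)  # first moment about the origin
    return moment(values, r, mean, freqs)


# Raw data
raw = [1, 2, 3, 4, 5]
print(moment(raw, 2))            # second moment about the origin
print(moment(raw, 2, about=2))   # second raw moment about A = 2
print(central_moment(raw, 2))    # second central moment (variance)

# Frequency data: mid-values of classes 10-20, ..., 50-60 with frequencies
mids = [15, 25, 35, 45, 55]
freqs = [18, 20, 30, 22, 10]
print(central_moment(mids, 2, freqs), central_moment(mids, 3, freqs))
```

Setting `about=0` gives the moments about the origin, any other fixed value of `about` gives raw moments about A, and `central_moment` uses the mean itself.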
Property of Moments
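1. Central moments are independent of change of origin but dependent on change of scale.
Let $d = \dfrac{x - \bar{x}}{h}$, where h is the class width; then the first four moments about the mean are:
$\mu_1 = h \cdot \dfrac{\sum f d}{N}$
$\mu_2 = h^2 \cdot \dfrac{\sum f d^2}{N}$
$\mu_3 = h^3 \cdot \dfrac{\sum f d^3}{N}$
$\mu_4 = h^4 \cdot \dfrac{\sum f d^4}{N}$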
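2. Central moments in terms of raw moments
Since $\bar{x} - A = \mu'_1$, we have:
$\mu_1 = 0$
$\mu_2 = \dfrac{1}{N}\sum (x - \bar{x})^2$
$= \dfrac{1}{N}\sum \left[(x - A) - (\bar{x} - A)\right]^2$
$= \dfrac{1}{N}\sum (x - A)^2 - 2(\bar{x} - A)\dfrac{1}{N}\sum (x - A) + (\bar{x} - A)^2$
$= \mu'_2 - 2\mu'_1\mu'_1 + {\mu'_1}^2$
$\mu_2 = \mu'_2 - {\mu'_1}^2$
$\mu_3 = \dfrac{1}{N}\sum (x - \bar{x})^3$
$= \dfrac{1}{N}\sum \left[(x - A) - (\bar{x} - A)\right]^3$
$= \dfrac{1}{N}\sum (x - A)^3 - 3(\bar{x} - A)\dfrac{1}{N}\sum (x - A)^2 + 3(\bar{x} - A)^2\dfrac{1}{N}\sum (x - A) - (\bar{x} - A)^3$
$= \mu'_3 - 3\mu'_2\mu'_1 + 3\mu'_1{\mu'_1}^2 - {\mu'_1}^3$
$\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2{\mu'_1}^3$
$\mu_4 = \dfrac{1}{N}\sum (x - \bar{x})^4$
$= \dfrac{1}{N}\sum \left[(x - A) - (\bar{x} - A)\right]^4$
$= \dfrac{1}{N}\sum (x - A)^4 - 4(\bar{x} - A)\dfrac{1}{N}\sum (x - A)^3 + 6(\bar{x} - A)^2\dfrac{1}{N}\sum (x - A)^2 - 4(\bar{x} - A)^3\dfrac{1}{N}\sum (x - A) + (\bar{x} - A)^4$
$= \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2{\mu'_1}^2 - 4\mu'_1{\mu'_1}^3 + {\mu'_1}^4$
$\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2{\mu'_1}^2 - 3{\mu'_1}^4$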
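The conversion from raw moments to central moments is a direct substitution, sketched below in Python. The function name `central_from_raw` is an assumption made here; the arguments are the first four raw moments about any point A (A = 0 gives moments about the origin).

```python
def central_from_raw(m1, m2, m3, m4):
    """Convert the first four raw moments (about any point A) into the
    first four central moments."""
    mu1 = 0                                   # always zero
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu1, mu2, mu3, mu4


# Illustrative values: moments 1, 4, 10 and 46 about the origin
print(central_from_raw(1, 4, 10, 46))  # (0, 3, 0, 27)
```

Here the variance is 3, the third central moment is 0 (no skewness) and the fourth is 27, so that $\mu_4 / \mu_2^2 = 3$.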
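3. Raw moments in terms of central moments
$\mu'_1 = \bar{x} - A$
$\mu'_2 = \dfrac{1}{N}\sum (x - A)^2$
$= \dfrac{1}{N}\sum \left[(x - \bar{x}) + (\bar{x} - A)\right]^2$
$= \dfrac{1}{N}\sum (x - \bar{x})^2 + 2(\bar{x} - A)\dfrac{1}{N}\sum (x - \bar{x}) + (\bar{x} - A)^2$
$\mu'_2 = \mu_2 + {\mu'_1}^2$, since $\sum (x - \bar{x}) = 0$
$\mu'_3 = \dfrac{1}{N}\sum (x - A)^3$
$= \mu_3 + 3\mu_2\mu'_1 + 3\mu_1{\mu'_1}^2 + {\mu'_1}^3$
$\mu'_3 = \mu_3 + 3\mu_2\mu'_1 + {\mu'_1}^3$
$\mu'_4 = \dfrac{1}{N}\sum (x - A)^4$
$= \mu_4 + 4\mu_3\mu'_1 + 6\mu_2{\mu'_1}^2 + 4\mu_1{\mu'_1}^3 + {\mu'_1}^4$
$\mu'_4 = \mu_4 + 4\mu_3\mu'_1 + 6\mu_2{\mu'_1}^2 + {\mu'_1}^4$
Thus,
$\bar{x} = \mu'_1 + A$
$\mu'_2 = \mu_2 + {\mu'_1}^2$
$\mu'_3 = \mu_3 + 3\mu_2\mu'_1 + {\mu'_1}^3$
$\mu'_4 = \mu_4 + 4\mu_3\mu'_1 + 6\mu_2{\mu'_1}^2 + {\mu'_1}^4$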
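The reverse conversion, from central moments back to raw moments, can be sketched in the same way. The function name `raw_from_central` is an assumption made here; `m1` stands for the first raw moment, i.e. the mean minus A.

```python
def raw_from_central(m1, mu2, mu3, mu4):
    """Convert central moments into the first four raw moments about A,
    where m1 = mean - A is the first raw moment."""
    m2 = mu2 + m1 ** 2
    m3 = mu3 + 3 * mu2 * m1 + m1 ** 3
    m4 = mu4 + 4 * mu3 * m1 + 6 * mu2 * m1 ** 2 + m1 ** 4
    return m1, m2, m3, m4


# Illustrative values: mean 10 (taking A = 0), variance 16,
# third central moment 64, fourth central moment 1024
print(raw_from_central(10, 16, 64, 1024))  # (10, 116, 1544, 23184)
```

Taking A = 0 as in this example yields the moments about the origin.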
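Karl Pearson defined the following four coefficients, based upon the first four moments about the mean:
$\beta_1 = \dfrac{\mu_3^2}{\mu_2^3}$, $\gamma_1 = \sqrt{\beta_1}$, $\beta_2 = \dfrac{\mu_4}{\mu_2^2}$, $\gamma_2 = \beta_2 - 3$
These coefficients give information about the shape of the curve obtained from the frequency distribution.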
Kurtosis
The word kurtosis comes from a Greek word meaning "bulginess". In statistics it is the degree of flatness or ‘peakedness’ in the region of
the mode of a frequency curve. It is measured relative to the ‘peakedness’ of the normal curve. It tells us the extent to
which a distribution is more peaked or flat-topped than the normal curve.
If the curve is more peaked than the normal curve, it is called "leptokurtic". In this case the items are more clustered about
the mode.
If the curve is more flat-topped than the normal curve, it is called "platykurtic". The normal curve itself is known as
"mesokurtic".
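A moment-based check of peakedness can be sketched in Python as follows. The sketch assumes the usual convention that the ratio $\beta_2 = \mu_4 / \mu_2^2$ equals 3 for the normal curve, so that values above 3 indicate a leptokurtic curve and values below 3 a platykurtic one; this criterion, the function names and the sample data are assumptions made here for illustration.

```python
def central_moment(values, freqs, r):
    """r-th central moment of frequency data (values are class mid-values)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n


def shape_coefficients(values, freqs):
    """Return (beta1, beta2, kurtosis type) from the central moments."""
    mu2 = central_moment(values, freqs, 2)
    mu3 = central_moment(values, freqs, 3)
    mu4 = central_moment(values, freqs, 4)
    beta1 = mu3 ** 2 / mu2 ** 3   # moment measure of skewness
    beta2 = mu4 / mu2 ** 2        # moment measure of kurtosis
    if abs(beta2 - 3) < 1e-9:
        shape = "mesokurtic"
    elif beta2 > 3:
        shape = "leptokurtic"
    else:
        shape = "platykurtic"
    return beta1, beta2, shape


# Illustrative data: mid-values of classes 10-20, ..., 50-60
mids = [15, 25, 35, 45, 55]
freqs = [18, 20, 30, 22, 10]
beta1, beta2, shape = shape_coefficients(mids, freqs)
print(round(beta1 ** 0.5, 4), round(beta2, 3), shape)
```

For this data $\beta_2$ is about 2.047, which is below 3, so the curve is flatter than the normal curve.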
30. A candidate obtains the following percentages in an examination. English 46%, Mathematics 67%, Sanskrit
72%, Economics 58%, Political science 53%. It is agreed to give double weights to marks in English and
Mathematics as compared to other subjects. What is the average mark?
31. Calculate the mean, median, mode, quartiles, third decile and 82nd percentile of the following data, which relate to
the service time (in minutes) per customer for 7 customers at a railway reservation counter: 3.5, 4.5,
3, 3.8, 5.0, 5.5, 4
32. Calculate the median and mode of the following data that relates to the number of patients examined per
hour in the outpatient ward (OPD) in a hospital: 10, 12, 15, 20, 12, 24, 17, 18
33. The mean monthly salary paid to all employees in a company is Rs.16000. The mean monthly salaries paid to
technical and non-technical employees are Rs.18000 and Rs. 12,000 respectively. Determine the percentage of
technical and non-technical employees in the company.
34. Given the following frequency distribution with some missing frequencies:
Class 10-20 20-30 30-40 40-50 50-60 60-70 70-80
frequency 185 ---- 34 180 136 ---- 50
If the total frequency is 685 and the median is 42.6, find the missing frequencies.
35. Find the missing information in the following table:
A B C Combine
Number 10 8 ---- 24
Mean 20 ---- 6 15
36. The following table gives the weekly wages in rupees of workers in certain commercial organization. The
frequency of the class interval 49-52 is missing.
Weekly wages(Rs.) 40-43 43-46 46-49 49-52 52-55
No. of workers 31 58 60 ---- 27
It is known that the mean of the above frequency distribution is Rs. 47.2. Find the missing frequency.
37. Calculate mean, mode and median from the following data of the heights (in inches) of a group of students: 61,
62, 63, 61, 63, 64, 60, 65, 63, 64, 65, 65, 66, 64
Now suppose that a group of students whose heights are 60, 66, 59, 68, 67 and 70 inches is added to the
original group. Find the mean, mode and median of the combined group.
38. An incomplete frequency distribution is given below.
Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70 Total
Frequency 4 16 - - - 6 4 230
Find the three missing frequencies of the table, given that median = 33.5 and mode = 34. Also calculate the
mean using the empirical relation between mean, median and mode.
39. The average daily wage of all workers in a factory is Rs. 444. If the average daily wages paid to male and female
are Rs. 480 and Rs.360 respectively, find the percentage of male and female workers employed by the factory.
(Ans: 70% and 30%)
40. Calculate the missing frequency for the following data given that the mode is Rs. 68.44
Earnings (Rs.) 66-67 67-68 68-69 69-70 70-71 71-72
No.of persons. 15 24 --- 20 14 11
(Ans : 40)
41. Find the missing figures:
a) Mean = ? (3 Median – Mode )
b) Mean – Mode = ? (Mean – Median)
c) Median = Mode + ?(Mean – Median)
d) Mode = Mean - ? (Mean – Median)
42. Doctors X and Y measured the systolic blood pressure of two groups of men, all of the same age, and the results were:
Doctors No. of Men Mean Systolic Blood Pressure S.D
X 113 159 mm 22.4 mm
Y 121 149 mm 20.0 mm
Find the mean and S.D of the two groups taken together.
43. From the following table compute the missing values:
Sub group Number A.M Variance
I - 25 9
II 250 - 16
III 300 15 -
Combined 750 16 51.73
[Ans: N1 = 200, A.M of sub group II = 10, variance of sub group III = 25]
44. The following table gives the distribution of wages in the two branches of a factory:
Monthly Number of workers
wages (Rs) Branch A Branch B
100-150 167 63
150-200 207 93
200-250 253 157
250-300 205 105
300-350 168 82
a) Find mean and standard deviation for the two branches for the wages separately.
b) Which branch pays higher average wages?
c) Which branch has greater variability in wages in relation to the average wages?
d) What is the average monthly wage of the factory as a whole?
e) What is the variation of wages of all the workers in the two branches A and B taken together?
45. The following table gives the distribution of income of households based on hypothetical data:
Income (Rs.) Percentage of Income (Rs.) Percentage of
households households
Under 100 7.2 500-599 14.9
100-199 11.7 600-699 10.4
200-299 12.1 700-999 9.0
300-399 14.8 1000 and above 4.0
400-499 15.9
Compute a suitable measure of dispersion. Also find its relative measure.
46. Calculate an appropriate Karl Pearson’s coefficient of skewness from the following data.
Classes Frequency Classes Frequency
40-60 25 10-15 6
30-40 15 5-10 4
20-30 12 3-5 3
15-20 8 0-3 2
(Hint: Since the classes are of unequal width, the median-based coefficient should be computed.)
(ANS: Mean =31.13, median =31.67, S.D = 16.06 and Sk = -0.1)
47. The following facts are gathered before and after an industrial dispute.
Before dispute After dispute
No. of workers employed 515 500
Mean wage Rs. 49.5 Rs. 52.7
Median Wage Rs. 52.80 Rs. 50.00
Variance of wage (Rs.)² 121.00 (Rs.)² 144.00
Compare the position before and after the dispute in respect of
(i) Total wages (ii) modal wages (iii) standard deviation (iv) skewness
(Ans:
Before dispute After dispute
Total wages Rs. 25492.50 Rs. 26849.75
Modal wages Rs. 59.4 Rs. 44.50
C.V 22.22 22.74
Skewness -0.90 0.69)
48. By using the quartiles, find a measure of skewness for the following distribution.
Annual sales (Rs. ‘000) No. of firms Annual sales (Rs. ‘000) No. of firms
Less than 20 30 Less than 70 644
Less than 30 225 Less than 80 650
Less than 40 465 Less than 90 665
Less than 50 580 Less than 100 680
Less than 60 634
(ANS : Q1 = 27.18, Q3 = 43.90, Median = 34.79, skewness 0.0903)
49. Calculate the first four moments about mean for the following distribution.
Also calculate beta coefficients, and comment upon the nature of skewness and Kurtosis.
Profit (Rs. In lakh) 10-20 20-30 30-40 40-50 50-60
Number of companies 18 20 30 22 10
(ANS: µ2 = 152.04, µ3 = 21.312, µ4 = 47327.51, Sk = 0.0114, kurtosis = 2.047)
50. Karl Pearson’s measure of skewness of a distribution is 0.5. The median and mode of the distribution are
respectively, 42 and 32. Find (i) mean (ii) S.D. (iii) Coefficient of variation. (ANS: 47, 30, 63.83)
51. The first three moments of a distribution about the value 3 of a variable are 2, 10 and 30 respectively. Comment
upon the nature of the distribution.
52. The following measures were computed for a frequency distribution:
Mean = 50, coefficient of Variation = 35% and Karl Pearson's Coefficient of Skewness = - 0.25.
Compute Standard Deviation, Mode and Median of the distribution. (ANS: 17.5, 54.375, 51.45833)
53. If the first quartile is 142 and the semi-inter quartile range is 18, find the median assuming the distribution to
be symmetrical. (Ans: 160)
54. In a frequency distribution the coefficient of skewness based on quartile is 0.6. If the sum of the upper and
lower quartile is 100 and median is 38. Find the value of upper and lower quartile.(ANS: 70,30)
55. In a distribution the difference of the two quartiles is 15, their sum is 35 and the median is 20. Find the
coefficient of skewness. (ANS: -0.33)
56. Find coefficient of skewness from the following data and show which section is more skewed.
Income(Rs.) 55-58 58-61 61-64 64-67 67-70
Section A 12 17 23 18 11
Section B 20 22 25 13 4
(ANS: Sk(A) = -0.0061, Sk(B) = -0.06, Section B is more skewed)
57. The first four moments of a distribution about the origin are 1, 4, 10 and 46. Obtain the first four central moments and
comment upon the nature of the distribution. (ANS: mean = 1, s.d. = 1.732, central moments = 0, 3, 0, 27, sk = 0,
kurtosis = 3, distribution is symmetric and mesokurtic, hence normal)
58. If β1 = +1, β2 = 4 and variance = 9, find the values of µ3 and µ4 and comment upon the nature of the distribution. (ANS:
27, 324)
59. For a mesokurtic distribution the first moment about 7 is 23 and the second moment about origin is 1000. Find
the coefficient of variation and the fourth moment about mean. (Ans: 33.33, 30000)
60. For a distribution, the mean is 10, variance is 16, β1 is +1 and β2 is 4. Obtain the first four moments about the origin.
Comment upon the nature of the distribution. (ANS: 10, 116, 1544, 23184)
61. The following data are given to an economist for the purpose of economic analysis. The data refer to the length
of certain type of batteries,
N = 100, Σfd = 50, Σfd² = 1970, Σfd³ = 2948 and Σfd⁴ = 86752, in which d = (X − 48)
Do you think distribution is platykurtic? Also comment on skewness. (ANS: β2=2.214)
62. Give any three measures of skewness of a frequency distribution. Explain briefly with suitable diagrams the
term skewness.
63. Distinguish between Skewness and Kurtosis.
64. Explain briefly how the measures of skewness and kurtosis can be used in describing a frequency distribution.
65. Define moments. “A frequency distribution can be described almost completely by the first four moments and
two measures based on moments.” Examine the statement.