SFB Module I 2019
Course Instructor
Anil Chandra
Module I: Descriptive Statistics
1. Data, Types of Data, Variables & Constants
1.2 Qualitative Data: represents the properties, classification, names or labels OR data
which cannot be measured or quantified
Examples:
- the hair color of a person (red, blond, brown, black)
- Division scored by the students (remember division scored is a classification and not a
measurable quantity)
Continuous Variables: A variable that can take any value between two given values, i.e. one
that is continuous over a certain interval, is called a continuous variable, and its values
are called continuous data.
E.g., the height of the students of a class (say class XII).
The height of students can range from 4 feet (or maybe lower) to 6 feet (or maybe higher),
so the height of a student can take any value between these two values.
2. Population/Universe and Sample, Parameter & Statistics:
Population/Universe:
- The term population/universe refers to a collection of people or objects that share common
observable characteristics.
For example, a population could be all of the people who live in your city, all of the students
enrolled in a particular university, or all of the people who are afflicted by a certain disease
(e.g., all women diagnosed with breast cancer during the last five years).
- Generally, researchers are interested in particular characteristics of a population, not
the characteristics that define the population but rather such attributes as height,
weight, gender, age, heart rate, and systolic or diastolic blood pressure.
Sample:
Subset of a population, or a part taken out from the population is called a sample.
Given below is a Venn diagram that represents a population and its sample.
[Figure: Venn diagram with the Sample (S) drawn inside the Population (P)]
3. Introduction to Statistics
3.1 What is Statistics?
Statistics: it is a branch of science that deals with the:
a) Collection, organization and interpretation of data (Descriptive Statistics)
b) Drawing an inference about population(s) from given sample(s): (Inferential Statistics)
Adapted from the book “Statistics for the Utterly Confused” by Lloyd Jaisingh
Limitations of Statistics
1. Statistics does not deal with individual measurements: Since statistics deals with
aggregates of facts, the study of individual measurements lies outside the scope of
statistics. Data are statistical when they relate to the measurement of masses. E.g., the
wages earned by an individual worker have no role to play in statistics; however, the
data on the wages of all the workers are used for statistical interpretation.
2. Statistical results are true only on an average: The conclusions obtained statistically
are not universally true; they are true only under certain conditions.
3. Statistics deals with quantitative characteristics: Statistics are numerical
statements of facts. Characteristics which cannot be expressed in numbers are
incapable of statistical analysis. Thus qualitative characteristics like honesty,
efficiency, intelligence, blindness and deafness cannot be studied directly. We must
assign certain quantitative scales of measurement so that these characteristics can be
studied statistically.
4. Statistics can be misused: One of the major drawbacks of statistics is that it can be
misused. This misuse can arise for many reasons. For example, if statistical
conclusions are based on incomplete information, one may arrive at a false conclusion.
E.g., to measure attitude or opinion towards a subject we have to construct a scale.
First we need to understand the types of data and the appropriate scales of measurement.
Interval Scale
Provides information about order
• -20º C to + 20º C
Possesses equal intervals
• 10º C, 20º C, 30º C
Difference between values are well defined
• 20º C – 10º C = 10º C
• 40º C – 30º C = 10º C
Has no absolute zero
• 0º C does not indicate no temperature
Other Examples
• TIME OF DAY on a 12-hour clock
Ratio Scale
Provides information about order
• 20 cm to 120 cm
Possesses equal intervals
• 10 cm, 20 cm …
Differences between values are well defined
• 100 cm – 90 cm = 10 cm
• 20 cm – 10 cm = 10 cm
Has absolute zero
• 0 cm means no height
Other examples
Years of experience, no. of children
Ordinal Scale:
Provides information about order (natural order)
Differences between values are not logically defined, but values can be compared as
greater than or less than one another
Examples:
Satisfaction, Happiness, Discomfort,
Rank orders
Nominal Scale
Used only for labeling variables (no order, and no concept of a difference between values
or of less than/greater than)
4. Descriptive Statistics
4.1 Descriptive Statistics
4.1.1 Collection of data: Collection of data can be classified into two forms:
1. Primary data: Data collected by the experimenter himself/herself
2. Secondary Data: Data collected from other sources
There are many different ways of collecting data, some of the methods are:
Interviews
Questionnaires and Surveys
Observations
Focus Groups
Case Studies
Documents and Records.
4.1.2 Organization of data:
Data can be organized into the following ways:
a) Ordered Array: representation of data in the increasing order of its magnitude is
called an Ordered Array (used for quantitative data)
E.g. 1. Given below are the marks in Statistics scored by 10 students in a class:
90, 87, 85, 90, 92, 94, 56, 73, 75, 75
The Ordered Array of above data will be:
56, 73, 75, 75, 85, 87, 90, 90, 92, 94
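As a minimal sketch (Python is used here only for illustration, and the variable names are not part of these notes), the ordered array of the marks above can be obtained by sorting:

```python
# Marks in Statistics scored by 10 students (from the example above)
marks = [90, 87, 85, 90, 92, 94, 56, 73, 75, 75]

# An ordered array is the data arranged in increasing order of magnitude
ordered_array = sorted(marks)
print(ordered_array)
```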
Frequency Distribution
In statistics, a frequency distribution is a list, table or graph that displays the
frequency of various outcomes in a sample. Each entry in the table contains the
frequency or count of the occurrences of values within a particular group or interval,
and in this way, the table summarizes the distribution of values in the sample.
30-40 13
40-50 04
50-60 09
60-70 03
Total 50
In Example b the marks scored are given in the form called Class Intervals (CI).
00 – 10 is a class interval whose lowest value is 0 and whose highest included value is 9; it
means that this class interval contains the number of students who have scored marks
from 0 to 9.
Creating a Grouped Frequency Distribution
1. Find the largest and smallest values.
2. Compute the Range = Maximum - Minimum.
3. Select the number of classes desired. This is usually between 5 and 20.
4. Find the class width by dividing the range by the number of classes and rounding up. There
are two things to be careful of here. You must round up, not off: normally 3.2 would round
to 3, but in rounding up it becomes 4. If the range divided by the number of classes gives
an integer value (no remainder), then you can either add one to the number of classes or add
one to the class width. Sometimes you're locked into a certain number of classes because of
the instructions. The Bluman text fails to mention the case when there is no remainder.
5. Pick a suitable starting point less than or equal to the minimum value. You will be able to
cover "the class width times the number of classes" values, and you need to cover one more
value than the range. Follow this rule and you'll be okay: the starting point plus the number
of classes times the class width must be greater than the maximum value. Your starting point
is the lower limit of the first class. Continue to add the class width to this lower limit to get
the rest of the lower limits.
6. To find the upper limit of the first class, subtract one from the lower limit of the second
class. Then continue to add the class width to this upper limit to find the rest of the upper
limits.
7. Find the boundaries by subtracting 0.5 units from the lower limits and adding 0.5 units to
the upper limits. The boundaries are also half-way between the upper limit of one class and
the lower limit of the next class. Depending on what you're trying to accomplish, it may not
be necessary to find the boundaries.
8. Tally the data.
9. Find the frequencies.
10. Find the cumulative frequencies. Depending on what you're trying to accomplish, it may not
be necessary to find the cumulative frequencies.
11. If necessary, find the relative frequencies and/or relative cumulative frequencies.
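The procedure above can be sketched in Python as follows. This is an illustrative sketch only: it assumes integer data, takes the minimum value as the starting point, and handles the "no remainder" case by adding one to the class width. The function name `grouped_frequency` and the reuse of the ten marks from the ordered-array example are choices made here, not part of the notes.

```python
import math

def grouped_frequency(data, num_classes):
    """Grouped frequency table for integer data, following the steps above."""
    smallest, largest = min(data), max(data)
    data_range = largest - smallest
    # Round the class width UP; if there is no remainder, add one to the width
    width = math.ceil(data_range / num_classes)
    if data_range % num_classes == 0:
        width += 1
    rows, cumulative = [], 0
    for k in range(num_classes):
        lower = smallest + k * width          # starting point = minimum value
        upper = lower + width - 1             # one less than the next lower limit
        frequency = sum(1 for x in data if lower <= x <= upper)
        cumulative += frequency
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "frequency": frequency,
            "cumulative": cumulative,
            "relative": frequency / len(data),
        })
    return rows

marks = [90, 87, 85, 90, 92, 94, 56, 73, 75, 75]
table = grouped_frequency(marks, num_classes=5)
for row in table:
    print(row)
```

With 5 classes the range 94 - 56 = 38 gives a width of 8 (7.6 rounded up), so the classes run 56-63, 64-71, 72-79, 80-87 and 88-95.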
A line graph, also known as a line chart, is a type of chart used to visualize the value of
something over time. For example, a finance department may plot the change in the amount
of cash the company has on hand over time.
The line graph consists of a horizontal x-axis and a vertical y-axis. Most line graphs only
deal with positive number values, so these axes typically intersect near the bottom of the
y-axis and the left end of the x-axis. The point at which the axes intersect is usually the
origin (0, 0). Each axis is labeled with a data type. For example, the x-axis could be days,
weeks, quarters, or years, while the y-axis shows revenue in dollars.
Data points are plotted and connected by a line in a "dot-to-dot" fashion.
The x-axis is also called the independent axis because its values do not depend on anything.
For example, time is always placed on the x-axis since it continues to move forward
regardless of anything else. The y-axis is also called the dependent axis because its values
depend on those of the x-axis: at this time, the company had this much money. The result is
that the line of the graph always progresses in a horizontal fashion and each x value only has
one y value (the company cannot have two amounts of money at the same time).
More than one line may be plotted in the same axis as a form of comparison. For example,
you could create a line graph comparing the amount of money held by each branch office
with a separate line for each office. In this case each line would have a different color,
identified in a legend.
The line graph is a powerful visual tool for marketing, finance, and other areas. It is also
useful in laboratory research, weather monitoring, or any other function involving a
correlation between two numerical values. If two or more lines are on the chart, it can be
used as a comparison between them.
i) Histogram:
- A histogram is a graphical display of tabulated frequencies of continuous data (Quantitative
data on both X and Y Axis), which are shown as bars.
- It shows what proportion of cases fall into each of several categories. The categories are
usually specified as non-overlapping intervals of some variable.
- The categories (bars) must be adjacent.
- The intervals (or bands, or bins) are generally of the same size.
- The width of the bars, marked along the X axis, is also important, along with their height.
[Figure: Histogram (x-axis: Marks scored; y-axis: No. of students)]
[Figure: No. of students plotted against the class midpoints of Marks scored (5, 15, 25, 35, 45, 55, 65)]
[Figure: Cumulative frequency curve (x-axis: Marks scored, class midpoints 5 to 65; y-axis: Cumulative frequency)]
d. Diagrammatic representation
b) Bar diagrams:
- A bar diagram is a chart with rectangular bars with lengths proportional to the values that
they represent.
- Bar charts are used for comparing two or more values that were taken over time or on
different conditions, usually on small data sets.
- The bars can be drawn horizontally or vertically; a bar diagram can also be used to emphasize a point of view.
- It is generally used to depict the discrete data set.
Bar diagrams are of following types:
1. Simple bar diagram: A simple bar diagram is used to depict only one variable. As one
bar represents only one figure, there are as many bars as the number of figures.
Eg. Oxygen consumption in cc/kg/h in different months in half year in a species of fish was
obtained as below:
Months          Jan   Feb   Mar   Apr   May   June
O2 (cc/kg/h)    67    74    84    85    100   105
The simple bar diagram of the above data is depicted below.
[Figure: Simple bar diagram of O2 consumption for Jan to June (x-axis: Months; y-axis: 0 to 120)]
Multiple bar diagrams: When a comparison between two or more related variables has to
be made, multiple bar diagrams are preferred. The technique of plotting a multiple bar
diagram is given below:
[Figure: Multiple bar diagram comparing East, West and North over four quarters (y-axis: 0 to 90)]
         1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East     20.4      27.4      90        20.4
West     30.6      38.6      34.6      31.6
North    45.9      46.9      45        43.9
Difference between Histogram and Bar diagram
Basis for comparison: Spaces
Histogram: Bars touch each other, hence there are no spaces between bars.
Bar graph: Bars do not touch each other, hence there are spaces between bars.
Pie Charts (Source: https://www.smartdraw.com/pie-chart/):
A pie chart is a circular chart divided into wedge-like sectors, illustrating proportion. Each
wedge represents a proportionate part of the whole, and the total value of the pie is always
100 percent.
Pie charts can make the size of portions easy to understand at a glance. They're widely used
in business presentations and education to show the proportions among a large variety of
categories including expenses, segments of a population, or answers to a survey
Type of Organization    No. of Students Placed    Angle
JRF/SRF                 10                        360° × 10/50 = 72°
Pharma Industry         12                        360° × 12/50 = 86.4°
Food Industry           25                        360° × 25/50 = 180°
Government Job          3                         360° × 3/50 = 21.6°
TOTAL                   50                        360°
[Figure: Pie chart "Type of Organization" with sectors JRF/SRF (10), Pharma Industry (12), Food Industry (25) and Government Job (3)]
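The sector angles for a pie chart follow from angle = 360° × (count / total). A minimal Python sketch using the placement counts from the table above (the dictionary name `placements` is illustrative only):

```python
# Number of students placed in each type of organization (from the table above)
placements = {
    "JRF/SRF": 10,
    "Pharma Industry": 12,
    "Food Industry": 25,
    "Government Job": 3,
}

total = sum(placements.values())

# Each sector's angle is its share of the whole circle (360 degrees)
angles = {name: 360 * count / total for name, count in placements.items()}

for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```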
Assignments:
1) Define Histogram and Bar Diagram. Give the difference between them.
2) Define Primary & Secondary Data. Differentiate between discrete variable and continuous
variables by giving appropriate examples
3) Define Sample & Population. Write a short note on descriptive and inferential statistics.
Write short notes on data scales/scaling techniques
5. Measures of Central Tendency – Mean, Mode & Median
Mean:
a) Arithmetic Mean ($\bar{X}$)
i) Arithmetic Mean for Ungrouped Data: If x1, x2, x3, …, xn is a set of n values then
$\bar{X} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum X}{n}$
Find the Arithmetic Mean (AM) of: 20, 21, 20, 24, 21, 24, 26, 22, 23, 21 (Note here n = 10)
AM = (20 + 21 + 20 + 24 + 21 + 24 + 26 + 22 + 23 + 21)/10 = 222/10 = 22.2 (Answer)
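The same calculation as a short Python sketch (variable names are illustrative):

```python
# Ungrouped data from the example above (n = 10)
values = [20, 21, 20, 24, 21, 24, 26, 22, 23, 21]

# Arithmetic mean = sum of the values divided by the number of values
arithmetic_mean = sum(values) / len(values)
print(arithmetic_mean)
```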
ii) Arithmetic Mean for Discrete Grouped Data. Also known as the “weighted
Arithmetic mean”
X x1 x2 x3 ………….. xn
Frequency (f) f1 f2 f3 …………. fn
Eg., Rate of respiration of 50 fishes in a species and their frequencies are given below.
Calculate the mean of this experimental data
Rate of Resp 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80
Frequency 3 11 7 4 10 5 7 3
Here rate of Respiration is the variable given in the form of class interval. So we will use the
mid point of each class interval to calculate the mean
Mid point of Class Interval 1-10 is 5.5 (= m1)
Mid point of Class Interval 11-20 is 15.5 (= m2)
Mid point of Class Interval 21-30 is 25.5 (= m3) and so on
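The rest of this calculation can be sketched in Python, assuming the usual weighted-mean formula Mean = ∑fm / ∑f, where m is the mid point of each class interval (the formula itself is not reproduced in the text above, so treat it as an assumption of this sketch):

```python
# Class intervals and frequencies for the rate of respiration of 50 fishes
class_intervals = [(1, 10), (11, 20), (21, 30), (31, 40),
                   (41, 50), (51, 60), (61, 70), (71, 80)]
frequencies = [3, 11, 7, 4, 10, 5, 7, 3]

# Mid point of each class interval: 5.5, 15.5, 25.5, ...
midpoints = [(low + high) / 2 for low, high in class_intervals]

# Weighted arithmetic mean: sum of f*m divided by sum of f
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
grouped_mean = sum_fm / sum(frequencies)
print(sum_fm, grouped_mean)
```

Under this assumption ∑fm = 1925 and the mean is 1925/50 = 38.5.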
Merits and Limitations of Arithmetic Mean
Merits:
1. Simplest to understand
2. Easy to Compute
3. Each item is used in calculation
4. Defined by rigid mathematical formula
5. Can be subjected to further algebraic treatment
6. It is relatively reliable, which means that it does not vary too much when repeated
samples are taken from one and same population
Limitations:
1. Since the value of mean is dependent on each and every item of the series, extreme
items, i.e. values which are very large or very small compared to most of the values in
the group unduly affect the value of average
2. In a distribution with open end classes the value of mean cannot be calculated without
making assumptions about the lower/upper limits of the class intervals
3. The average is not always a good measure of central tendency. It is a good measure
only when the frequency distribution follows a typical bell shaped curve. The average
is not a good measure of central tendency for U-shaped distribution (example –
failure rate of electronic components) or markedly skewed distribution like income
distribution or price distribution
Median
A median of a distribution is defined as the value of that variable which divides the total
frequency into two equal parts when the series is arranged in ascending or descending order
of magnitude
Eg. To find the Median of the given data set of 10 observations viz.
{20, 21, 20, 24, 21, 24, 26, 22, 23, 21}
First arrange it in ascending order of magnitude OR Ordered Array
{20, 20, 21, 21, 21, 22, 23, 24, 24, 26}
Here n = 10 (an even value), so Median = average of the (n/2)th value and the next value =
average of the 5th and 6th values = average of 21 and 22 = (21+22)/2 = 43/2 = 21.5
Eg. To find the Median of the given data set of 9 observations viz.
{20, 21, 20, 24, 21, 24, 26, 22, 23}
First arrange it into an Ordered Array viz.
{20, 20, 21, 21, 22, 23, 24, 24, 26}
Here n = 9 (odd), so Median = (n+1)/2 th value = 5th value = 22
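Both cases (even and odd n) can be checked with a small Python sketch. The function name `median` is illustrative; Python's standard library offers `statistics.median` with the same behaviour.

```python
def median(values):
    """Median of ungrouped data via the ordered array."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2 th value (index `middle` when counting from 0)
        return ordered[middle]
    # Even n: average of the (n/2)th value and the next value
    return (ordered[middle - 1] + ordered[middle]) / 2

even_case = median([20, 21, 20, 24, 21, 24, 26, 22, 23, 21])  # n = 10
odd_case = median([20, 21, 20, 24, 21, 24, 26, 22, 23])       # n = 9
print(even_case, odd_case)
```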
Median for Discrete Series: In a discrete series items are first arranged into an Ordered
Array and their respective frequencies are written against respective items.
Eg., Determine median of the following data
X 5 6 7 8 9 10
f 2 4 8 10 15 25
Looking at the cumulative frequency table we observe that the 32nd value is 9 and 33rd value
is also 9 so Median = 9
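A minimal Python sketch of reading the median off the cumulative frequencies for this discrete series (the helper names `item_at_position` and `discrete_median` are illustrative):

```python
from itertools import accumulate

def item_at_position(values, frequencies, position):
    """Value held by the item at a 1-based position in the ordered series."""
    for value, cf in zip(values, accumulate(frequencies)):
        if position <= cf:
            return value
    raise ValueError("position exceeds the total frequency")

def discrete_median(values, frequencies):
    n = sum(frequencies)
    if n % 2 == 1:
        return item_at_position(values, frequencies, (n + 1) // 2)
    # Even n: average of the (n/2)th and (n/2 + 1)th items
    lower = item_at_position(values, frequencies, n // 2)
    upper = item_at_position(values, frequencies, n // 2 + 1)
    return (lower + upper) / 2

x = [5, 6, 7, 8, 9, 10]
f = [2, 4, 8, 10, 15, 25]
cumulative = list(accumulate(f))
print(cumulative, discrete_median(x, f))
```

Here n = 64, the cumulative frequencies are 2, 6, 14, 24, 39, 64, and both the 32nd and 33rd items fall where X = 9.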
For data grouped in class intervals, the median cannot be found directly by the method given above.
Here n = ∑f
First we prepare the cumulative frequency (cf) table; the revised table will be:
CI 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80
Frequency 3 15 2 8 11 4 1 6
cf 3 18 20 28 39 43 44 50
Here n = 50
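Median class = the class in which the (n+1)/2 th value lies = 51/2 th value = 25.5th value
Looking at the cf table, the 25.5th value lies in the class interval 31-40, which will be our
median class.
Therefore,
L1 = 31 (lower limit of the median class); n = 50; c = 20 (cumulative frequency of the class
preceding the median class); fm = 8 (frequency of the median class); h = 10 (class width)
$\text{Median} = L_1 + \frac{n/2 - c}{f_m} \times h = 31 + \frac{50/2 - 20}{8} \times 10 = 37.25$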
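The grouped-median calculation can be sketched in Python as below. The sketch follows the worked example in using the lower class limit (31) rather than the class boundary (30.5) as L1, and it assumes the interpolation formula Median = L1 + ((n/2 − c)/fm) × h; the function name `grouped_median` is illustrative.

```python
def grouped_median(lower_limits, frequencies, width):
    """Median of data grouped into equal-width class intervals."""
    n = sum(frequencies)
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, freq in zip(lower_limits, frequencies):
        # Median class: the class in which the (n+1)/2 th value lies
        if cumulative + freq >= (n + 1) / 2:
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("frequencies must not be empty")

lower_limits = [1, 11, 21, 31, 41, 51, 61, 71]
frequencies = [3, 15, 2, 8, 11, 4, 1, 6]
median_value = grouped_median(lower_limits, frequencies, width=10)
print(median_value)
```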
Merits and Limitations of Median
Merits of median
(i) If found directly it represents an actual item
(ii) It eliminates the effects of extreme items, since they are not taken into its calculation,
and hence it is a good measure of central tendency in the case of markedly skewed
distributions like income or price distributions
(iii) The values of only the middle items are required to be known
(iv) It can be found even for the data which is qualitative
(v) Median is most suitable for expressing qualitative data
Limitations of Median
(i) It may not be representative when the distribution is irregular
(ii) It cannot be located when the items are grouped. It can only be estimated in this case.
(iii) The data must be kept in ascending or descending order which involves considerable
work if number of items is large
(iv) It is a positional value only and is not based on every value of the distribution
(v) Median is affected by sampling fluctuations
Quartiles
The procedure for computing quartiles is the same as the median
Q) From the following data compute the value of upper and lower quartiles:
Marks Below 10 10-20 20-40 40-60 60-80 Above 80
No. of 8 10 22 25 10 5
students
$Q_1 = L_1 + \frac{N/4 - c.f.}{f} \times h$
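A hedged Python sketch for the question above, assuming the same interpolation idea as for the median: Q = L1 + ((kN/4 − c.f.)/f) × h with k = 1 for the lower quartile and k = 3 for the upper quartile. The open-ended classes "Below 10" and "Above 80" are given nominal limits of 0-10 and 80-100 purely so that the code runs; neither quartile falls in them, so this choice does not affect the result. The worked solution is not reproduced in these notes, so the numbers below come from this sketch only.

```python
def grouped_quantile(lower_limits, widths, frequencies, fraction):
    """Interpolated quantile (e.g. fraction=0.25 for Q1) of grouped data."""
    n = sum(frequencies)
    target = fraction * n
    cumulative = 0  # cumulative frequency before the current class
    for lower, width, freq in zip(lower_limits, widths, frequencies):
        if cumulative + freq >= target:
            return lower + (target - cumulative) / freq * width
        cumulative += freq
    raise ValueError("fraction must lie between 0 and 1")

# Marks: Below 10, 10-20, 20-40, 40-60, 60-80, Above 80
lower_limits = [0, 10, 20, 40, 60, 80]     # 0 is a nominal lower limit
widths = [10, 10, 20, 20, 20, 20]          # last width is nominal
students = [8, 10, 22, 25, 10, 5]

q1 = grouped_quantile(lower_limits, widths, students, 0.25)
q3 = grouped_quantile(lower_limits, widths, students, 0.75)
print(round(q1, 2), round(q3, 2))
```

Under these assumptions N = 80, the lower quartile falls in the class 20-40 (about 21.82) and the upper quartile in the class 40-60 (56).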
Mode
Mode of a frequency distribution is defined as “that value of the variable for which the
frequency is maximum”
Where
L1 = Lower limit of modal class
f0 = frequency of the class preceding modal class
f1 = frequency of the modal class
f2 = frequency of the class succeeding the modal class
h = Width of the class interval of modal class
Here, since the class interval 35-39 has the maximum frequency, it will be our modal class
As per the formula
L1 = 35; f0 = 2; f1 = 6; f2 = 3; h = 5
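The formula itself is not reproduced in the text above. Assuming the usual textbook formula for the mode of grouped data, Mode = L1 + ((f1 − f0)/(2f1 − f0 − f2)) × h, the substitution can be sketched in Python (the function name `grouped_mode` is illustrative):

```python
def grouped_mode(l1, f0, f1, f2, h):
    """Mode of grouped data from the modal class and its two neighbours.

    l1: lower limit of the modal class
    f0: frequency of the class preceding the modal class
    f1: frequency of the modal class
    f2: frequency of the class succeeding the modal class
    h:  width of the modal class
    """
    return l1 + (f1 - f0) / (2 * f1 - f0 - f2) * h

mode_value = grouped_mode(l1=35, f0=2, f1=6, f2=3, h=5)
print(round(mode_value, 2))
```

With the values above this gives 35 + (4/7) × 5, i.e. about 37.86.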
Merits and Limitations of Mode
Merits of Mode
(i) It avoids the effects of extreme values
(ii) Often it can be ascertained by mere inspection
(iii) Only the values occurring with high frequencies are required to be known. All values
need not be known
Limitations of Mode
(i) It is not well defined and therefore is rarely used for higher life science research
(ii) Arithmetic explanation of mode is not possible
(iii) Sometimes it is indefinite
(iv) It becomes difficult to find a good measure of location in multi-modal distributions
(v) It is not based on all the observations of a series
6. Measure of variation
Mean Deviation:
The mean deviation is also known as average deviation. It is the average difference between
the items in a distribution from the median or mean of that series. Theoretically there is an
advantage in taking the deviation from median because sum of the deviations of items from
median is minimum when signs are ignored. However, in practice the arithmetic mean is
more frequently used in calculating the value of average deviation and this is the reason why
it is more commonly called “mean deviation”.
X             10   11   12   13   14   15   16   17   18
f             1    2    3    4    5    4    3    2    1     ∑f = 25
fX            10   22   36   52   70   60   48   34   18    ∑fX = 350
Mean = ∑fX/∑f = 350/25 = 14
X − X̄ = X − 14    -4   -3   -2   -1   0    1    2    3    4
|X − X̄|           4    3    2    1    0    1    2    3    4
f|X − X̄|          4    6    6    4    0    4    6    6    4     ∑f|X − X̄| = 40
$\text{Mean deviation} = \frac{\sum f\,|X - \bar{X}|}{\sum f} = \frac{40}{25} = 1.6$
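The table above can be reproduced with a short Python sketch (variable names are illustrative):

```python
# Discrete frequency distribution from the example above
x = [10, 11, 12, 13, 14, 15, 16, 17, 18]
f = [1, 2, 3, 4, 5, 4, 3, 2, 1]

n = sum(f)                                         # total frequency
mean = sum(fi * xi for fi, xi in zip(f, x)) / n    # sum(fX) / sum(f)

# Mean deviation about the mean: average of f * |X - mean|
sum_abs_dev = sum(fi * abs(xi - mean) for fi, xi in zip(f, x))
mean_deviation = sum_abs_dev / n
print(mean, sum_abs_dev, mean_deviation)
```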
Standard Deviation
Standard deviation may be defined as “the square root of the arithmetic mean of the squares
of deviations from the arithmetic mean”
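$\sigma = \sqrt{\frac{\sum (X - \bar{x})^2}{n}}$ ….. (1)
where $\bar{x} = \text{Mean} = \frac{\sum X}{n}$
Formula (1) can be further simplified into an easier and more popular form, viz.
$\sigma = \sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2}$ ….. (2)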
For all the questions of Standard Deviation it is convenient to use the Formula (2)
Eg., Haemoglobin percent g/100 ml of liver fed Wallago attu was recorded as 23, 22, 20,
24, 16, 17, 18, 19 and 21. Calculate the Standard Deviation
X 23 22 20 24 16 17 18 19 21 ∑ X = 180
X2 529 484 400 576 256 289 324 361 441 ∑ X2 = 3660
$\sigma = \sqrt{\frac{3660}{9} - (20)^2} = \sqrt{406.67 - 400} = \sqrt{6.67} \approx 2.58$
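Formula (2) applied to the haemoglobin data can be checked with the Python sketch below; `statistics.pstdev` from the standard library computes the same population standard deviation and is used here only as a cross-check.

```python
import math
import statistics

# Haemoglobin percent (g/100 ml) from the example above
x = [23, 22, 20, 24, 16, 17, 18, 19, 21]
n = len(x)

sum_x = sum(x)                       # 180
sum_x2 = sum(v * v for v in x)       # 3660

# Formula (2): sigma = sqrt( sum(X^2)/n - (sum(X)/n)^2 )
sigma = math.sqrt(sum_x2 / n - (sum_x / n) ** 2)
print(round(sigma, 2), round(statistics.pstdev(x), 2))
```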
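$\sigma = \sqrt{\frac{\sum fX^2}{\sum f} - \left(\frac{\sum fX}{\sum f}\right)^2}$ ….. (3)
Standard Deviation for continuous grouped data:
$\sigma = \sqrt{\frac{\sum fm^2}{\sum f} - \left(\frac{\sum fm}{\sum f}\right)^2}$ ….. (4)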
Please note that if we try to solve this problem using the normal method, the values of X are
large. To find the Standard Deviation we have to find X^2 and then ∑X^2, which will be even
larger values. The Step Deviation Method simplifies the values of X.
In the Step Deviation Method, we subtract a suitably chosen constant from all the values of X.
Now the question arises: how to choose this constant? An easy method is to take the average of
the maximum and minimum values of X. In the above question the minimum and maximum values of
X are 101 and 110 respectively, the average of which is 211/2 = 105.5. To keep our
calculations simple we shall choose 105 instead of 105.5.
Always remember that in the step deviation method a suitable constant is subtracted only from
the variable (values of X or m). The frequencies remain unchanged.
Applying the Step Deviation Method, i.e. subtracting 105 from each value of X, we get a new
variable which we name X1. Our modified table shall be
X1 = X − 105   -4   -3   -2   -1   0    1    2    3    4    5
f              2    4    6    8    10   8    6    4    2    2     ∑f = 52
X1^2           16   9    4    1    0    1    4    9    16   25
fX1            -8   -12  -12  -8   0    8    12   12   8    10    ∑fX1 = 10
fX1^2          32   36   24   8    0    8    24   36   32   50    ∑fX1^2 = 250
In formula (3) of Standard Deviation we shall use X1 instead of X; the formula then becomes
$\sigma = \sqrt{\frac{\sum fX_1^2}{\sum f} - \left(\frac{\sum fX_1}{\sum f}\right)^2}$
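The Python sketch below illustrates why the step deviation method works: subtracting a constant from every value leaves the standard deviation unchanged. It assumes the original values of X run from 101 to 110 (inferred from the stated minimum, maximum and X − 105 row); the function name `grouped_sd` is illustrative.

```python
import math

def grouped_sd(values, frequencies):
    """Formula (3): sqrt( sum(f*X^2)/sum(f) - (sum(f*X)/sum(f))^2 )."""
    n = sum(frequencies)
    mean_of_squares = sum(f * v * v for f, v in zip(frequencies, values)) / n
    mean = sum(f * v for f, v in zip(frequencies, values)) / n
    return math.sqrt(mean_of_squares - mean ** 2)

x = list(range(101, 111))               # assumed original values 101..110
f = [2, 4, 6, 8, 10, 8, 6, 4, 2, 2]
x1 = [v - 105 for v in x]               # step deviations: -4 .. 5

sigma_step = grouped_sd(x1, f)          # small numbers, easy by hand
sigma_direct = grouped_sd(x, f)         # large numbers, same answer
print(round(sigma_step, 3), round(sigma_direct, 3))
```

Both routes give a standard deviation of about 2.184.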
Next we shall find Standard Deviation for continuous grouped data given below
Question) Ovary weight of 50 fishes and their frequency is given in class interval (CI),
tabulated below. Find Standard Deviation
CI 2-3 3-4 4-5 5-6 6-7
Frequency 6 13 11 8 12
Here first we shall find the mid point (m) of each class interval
CI 2-3 3-4 4-5 5-6 6-7
Frequency 6 13 11 8 12
M 2.5 3.5 4.5 5.5 6.5
Now we shall apply the step deviation method, i.e. subtract a suitable constant from the
values of m. Here the minimum and maximum values of m are 2.5 and 6.5 respectively, and their
average is 4.5. Looking at the values, it also seems appropriate to subtract 4.5 from each value of m.
Our modified table will be
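CI              2-3    3-4    4-5    5-6    6-7
Frequency (f)   6      13     11     8      12     ∑f = 50
m               2.5    3.5    4.5    5.5    6.5
m1 = m – 4.5    -2     -1     0      1      2
m1^2            4      1      0      1      4
fm1             -12    -13    0      8      24     ∑fm1 = 7
fm1^2           24     13     0      8      48     ∑fm1^2 = 93
In formula (4) of Standard Deviation we shall use m1 instead of m; the formula then becomes
$\sigma = \sqrt{\frac{\sum fm_1^2}{\sum f} - \left(\frac{\sum fm_1}{\sum f}\right)^2}$
Upon calculation, $\sigma = \sqrt{\frac{93}{50} - \left(\frac{7}{50}\right)^2} = \sqrt{1.86 - 0.0196} = \sqrt{1.8404} \approx 1.36$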
Variance
Square of standard deviation is called Variance i.e. Variance = σ2
Quartile Deviation (QD)
$QD = \frac{Q_3 - Q_1}{2}$
Q3 – Q1 = Interquartile Range
Questions to calculate QD
Discrete series
Eg) Compute the coefficient of quartile deviation from the following data
Marks 10 20 30 40 50 80
No. of students 4 7 15 8 7 2
Cumulative frequency (cf) 4 11 26 34 41 43
Q1 = size of (N+1)/4 th item = 11th item = 20; Q3 = size of 3(N+1)/4 th item = 33rd item = 40
QD = (Q3 – Q1)/2 = (40 – 20)/2 = 10
Coefficient of QD = (Q3 – Q1)/(Q3 + Q1) = (40 – 20)/(40 + 20) = 0.333
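The same steps as a Python sketch. It relies on (N+1)/4 being a whole number for this data (N = 43), so no interpolation between items is needed; the helper name `item_at` is illustrative.

```python
from itertools import accumulate

marks = [10, 20, 30, 40, 50, 80]
students = [4, 7, 15, 8, 7, 2]

def item_at(position):
    """Value of the item at a 1-based position, read off the cumulative frequencies."""
    for value, cf in zip(marks, accumulate(students)):
        if position <= cf:
            return value
    raise ValueError("position exceeds the total frequency")

n = sum(students)                  # N = 43
q1 = item_at((n + 1) // 4)         # 11th item
q3 = item_at(3 * (n + 1) // 4)     # 33rd item

quartile_deviation = (q3 - q1) / 2
coefficient_of_qd = (q3 - q1) / (q3 + q1)
print(q1, q3, quartile_deviation, round(coefficient_of_qd, 3))
```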
Merits and Limitations of Quartile Deviation
Merits
1. In certain respects it is superior to the range as a measure of dispersion
2. It has a special utility in measuring variation in case of open-end distributions or one
in which data may be ranked
3. It is also useful in erratic skewed distributions, where the other measures of variations
may be warped by extreme values
Limitations
1. Quartile deviation ignores 50% items i.e. the first 25% and the last 25%. As the value
of quartile deviation does not depend upon every item of the series it cannot be
regarded as a good method of measuring dispersion
2. It is not capable of mathematical manipulation
3. Its value is very much affected by sampling fluctuations
4. It does not show scatter of data around average, rather it shows a distance on scale
Example) There are two branches of an establishment employing 100 and 80 persons
respectively. If the arithmetic mean of monthly salaries paid by the two branches are Rs. 275
and Rs. 225 respectively, find the arithmetic mean of the salaries of the employees of the
establishment as a whole.
Sol:
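The worked solution is not reproduced here. A sketch under the standard combined-mean formula, combined mean = (n1·x̄1 + n2·x̄2)/(n1 + n2), which is assumed rather than quoted from these notes:

```python
# Two branches: number of employees and mean monthly salary (Rs.)
n1, mean1 = 100, 275
n2, mean2 = 80, 225

# Combined mean weights each branch mean by its number of employees
combined_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
print(round(combined_mean, 2))
```

Under that formula the mean salary for the establishment as a whole is 45500/180, about Rs. 252.78.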
Example) For a group of 50 boys the mean score and the standard deviation of the scores in a
test are 59.5 and 8.38 respectively while for a group of 40 girls the mean score and the
standard deviation of the scores in the same test are 54 and 8.23 respectively. Find the mean
and standard deviation of the combined group of 90 students.
Group 1 (boys) Group 2 (girls)
No. 50 (n1) 40 (n2)
Arithmetic mean 59.5 (x1) 54.0 (x2)
S.D. 8.38 (σ1) 8.23 (σ2)
σ = √76.58 = 8.75
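The intermediate steps leading to √76.58 are not shown above. The Python sketch below assumes the standard combined-variance formula σ² = [n1(σ1² + d1²) + n2(σ2² + d2²)]/(n1 + n2), where d1 and d2 are the differences between each group mean and the combined mean; it reproduces the stated result.

```python
import math

# Group sizes, means and standard deviations from the example above
n1, mean1, sd1 = 50, 59.5, 8.38   # boys
n2, mean2, sd2 = 40, 54.0, 8.23   # girls

combined_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)

# Deviation of each group mean from the combined mean
d1 = mean1 - combined_mean
d2 = mean2 - combined_mean

combined_variance = (n1 * (sd1 ** 2 + d1 ** 2) + n2 * (sd2 ** 2 + d2 ** 2)) / (n1 + n2)
combined_sd = math.sqrt(combined_variance)
print(round(combined_mean, 2), round(combined_variance, 2), round(combined_sd, 2))
```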
-------------------------------------------------------------------------------------------------------------
Coefficient of Variation
The measures of dispersion - Range, Mean Deviation, Standard deviation etc. are expressed
in same unit as the original observations, and are called Absolute measures of variability. So
they can not be used for comparing the variability of two or more distributions given in
different units. In order to meet such situations, we use Coefficient of Variation to compare
the variability in two different distributions.
Eg., The scores of two batsmen, A and B, in ten innings during a certain season are as under:
A 32 28 47 63 71 39 10 60 96 14
B 19 31 48 53 67 90 10 62 40 80
Find which of the batsman is more consistent in scoring
For cricketer A
Mean = 46
Standard deviation = 25.5 (please calculate)
Therefore, CV = 100 X 25.5/46 = 55
For cricketer B
Mean = 50
Standard deviation = 24.4
Therefore, CV = 100 X 24.4/50 = 49
Since batsman B has the lower CV, B is the more consistent in scoring.
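The comparison can be reproduced with the Python sketch below, using CV = 100 × standard deviation / mean and the population standard deviation (`statistics.pstdev`), which matches the figures quoted above.

```python
import statistics

scores = {
    "A": [32, 28, 47, 63, 71, 39, 10, 60, 96, 14],
    "B": [19, 31, 48, 53, 67, 90, 10, 62, 40, 80],
}

# Coefficient of variation (in percent) for each batsman
cv = {
    name: 100 * statistics.pstdev(runs) / statistics.mean(runs)
    for name, runs in scores.items()
}

# The batsman with the smaller CV is the more consistent scorer
more_consistent = min(cv, key=cv.get)
print({name: round(value, 1) for name, value in cv.items()}, more_consistent)
```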
Assignments:
1) If the interest paid on each of three different sums of money yielding 5%, 6% and
8% simple interest per annum respectively is the same, what is the average yield
percent on the total sum invested? (Hint: Find which mean to use AM, HM or GM)
2) Calculate mean and the standard deviation of the following frequency distribution
Variable 5 10 15 20 25 30 40 45 50 60
Frequency 2 4 6 6 10 10 10 6 4 2
3) Find the Standard deviation of first ten natural numbers. (Hint: First ten natural
numbers are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, you have to find their standard deviation)
4) From the given data set, state which series is more variable:
Variable 10 – 20 20 – 30 30 – 40 40 – 50 50 – 60 60 – 70
Series A 10 18 32 40 22 18
Series B 18 22 40 32 18 10
5) In a series of adults the mean blood pressure was 135 mmHg with standard deviation 10
mmHg. In the same series mean height was 170 cm with standard deviation 6 cm. Which
character shows greater variation?
7) The median and mode of the following frequency distribution are known to be 27 and 26
respectively. Find the values of a & b.
Values 0-10 10-20 20-30 30-40 40-50
Frequency 3 a 20 12 b
8) The mean monthly salary paid to all employees in a certain company was Rs. 500. The
mean monthly salaries paid to male and female employees were 520 and 420 rupees
respectively. Obtain the percentage of male to female employees in the company.
9) Find median, mode and standard deviation of the data given below:
CI 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80
F 3 11 7 4 10 5 7 3
10) Find the standard deviation by step deviation method for the following data on the age of
patients suffering from pulmonary disease.
Age (in years) 0-10 10-20 20-30 30-40 40-50 50-60 60-70
No. of patients 6 14 10 8 1 3 8
11) Calculate Quartile Deviation: Find the interquartile range and Coefficient of quartile
deviation from the following data
7. Skewness, Moments & Kurtosis:
Skewness
Skewness refers to asymmetry, or lack of symmetry, in the shape of a frequency distribution,
i.e. when a distribution is not symmetrical it is called a skewed distribution.
Tests of Skewness
Skewness is present if:
1. The values of mean, median and mode do not coincide
2. The data, when plotted on a graph, do not give the normal bell-shaped curve
Measures of Skewness
1. Karl Pearson’s Coefficient of Skewness
$Sk_P = \frac{\text{Mean} - \text{Mode}}{\sigma}$
$Sk_P = \frac{3\,(\text{Mean} - \text{Median})}{\sigma}$
MOMENTS
Moments about mean (central moments)
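First moment about mean
$\mu_1 = \frac{\sum (X - \bar{X})}{N} = 0$ (ungrouped data); $\mu_1 = \frac{\sum f(X - \bar{X})}{N} = 0$ (grouped data)
Second moment about mean
$\mu_2 = \frac{\sum (X - \bar{X})^2}{N} = \sigma^2$ (ungrouped data); $\mu_2 = \frac{\sum f(X - \bar{X})^2}{N} = \sigma^2$ (grouped data)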
Conversion of Moments about Arbitrary Origin into Moments about Mean
$\mu_1 = 0$
$\mu_2 = \mu_2' - (\mu_1')^2$
$\mu_3 = \mu_3' - 3\mu_1'\mu_2' + 2(\mu_1')^3$
$\mu_4 = \mu_4' - 4\mu_1'\mu_3' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4$
Q) Find the first four moments about the mean in the following distribution:
Solution:
Calculation of Moments
Height 60-62 63-65 66-68 69-71 72-74 TOTAL
mi 61 64 67 70 73
Frequency 5 18 42 27 8 100
d = (m-67)/3 -2 -1 0 1 2
d^2 4 1 0 1 4
d^3 -8 -1 0 1 8
d^4 16 1 0 1 16
fd -10 -18 0 27 16 15
fd^2 20 18 0 27 32 97
fd^3 -40 -18 0 27 64 33
fd^4 80 18 0 27 128 253
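The remaining steps of this solution can be sketched in Python. The sketch assumes the arbitrary origin A = 67 and class width h = 3 (the values for which d takes the tabulated values -2 to 2), computes the moments about A as µr' = h^r × ∑f·d^r / N, and then applies the conversion formulas given earlier. Variable names are illustrative.

```python
midpoints = [61, 64, 67, 70, 73]
frequencies = [5, 18, 42, 27, 8]
origin, width = 67, 3          # assumed arbitrary origin A and class width h

n = sum(frequencies)
d = [(m - origin) // width for m in midpoints]     # -2, -1, 0, 1, 2

# Column totals of the table: sum of f*d^r for r = 1..4
sums = [sum(f * di ** r for f, di in zip(frequencies, d)) for r in range(1, 5)]

# Moments about the arbitrary origin: mu_r' = h^r * sum(f*d^r) / N
m1, m2, m3, m4 = [width ** r * s / n for r, s in zip(range(1, 5), sums)]

# Conversion to moments about the mean
mu1 = 0
mu2 = m2 - m1 ** 2
mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
mu4 = m4 - 4 * m1 * m3 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
print(sums, round(mu2, 4), round(mu3, 4), round(mu4, 4))
```

Under these assumptions the moments about the mean come out as approximately 0, 8.5275, -2.6933 and 199.3759.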
Kurtosis
Kurtosis refers to the degree of flatness or peakedness of the curve of a frequency
distribution. In other words, measures of kurtosis tell us the extent to which a distribution is
more peaked or more flat-topped than a normal curve (the normal curve is discussed later)
Simple Example
Q) Calculate Skewness & Kurtosis of the given data
X 2 3 4 5 6
f 1 3 7 3 1
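$\mu_1 = \frac{0}{15} = 0$
$\mu_2 = \frac{14}{15} = 0.9333$
$\mu_3 = \frac{0}{15} = 0$
$\mu_4 = \frac{38}{15} = 2.533$
Hence we get
$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{0^2}{(0.9333)^3} = 0$ (Distribution is symmetric and not skewed)
$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{2.533}{(0.9333)^2} = 2.91$ (Distribution is platykurtic since the value is less than 3)
$Sk_B = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$
$Sk_B = \frac{5 + 3 - 2 \times 4}{5 - 3} = 0$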
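A Python sketch of this example. It assumes the frequencies are 1, 3, 7, 3, 1 (so that they total 15, the divisor used in the moments above); the helper name `central_moment` is illustrative.

```python
x = [2, 3, 4, 5, 6]
f = [1, 3, 7, 3, 1]      # assumed frequencies, total 15

n = sum(f)
mean = sum(fi * xi for fi, xi in zip(f, x)) / n

def central_moment(r):
    """r-th moment about the mean: sum(f * (X - mean)^r) / N."""
    return sum(fi * (xi - mean) ** r for fi, xi in zip(f, x)) / n

mu2, mu3, mu4 = central_moment(2), central_moment(3), central_moment(4)

beta1 = mu3 ** 2 / mu2 ** 3     # 0 for a symmetric distribution
beta2 = mu4 / mu2 ** 2          # compared with 3 (mesokurtic)
print(round(mu2, 4), mu3, round(mu4, 3), beta1, round(beta2, 2))
```

A value of β2 below 3 indicates a platykurtic distribution, which agrees with the conclusion above.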
Some examples
Q 1) The standard deviation of a symmetric distribution is 3. What must be the value of
the fourth moment about the mean in order that the distribution be mesokurtic?
For a mesokurtic distribution, $\beta_2 = 3$
$\sigma = 3$, hence $\mu_2 = \sigma^2 = 9$
$\beta_2 = \mu_4/\mu_2^2$, hence $3 = \mu_4/81$ OR $\mu_4 = 243$
Thus the fourth moment about the mean must be 243 in order that the distribution be mesokurtic
Q2) If the first four moments about the value 5 are equal to -4, 22, -117 and 560, determine
the corresponding moments about the mean
Conversion of Moments about Arbitrary Origin into Moments about Mean
$\mu_2 = \mu_2' - (\mu_1')^2$
$\mu_3 = \mu_3' - 3\mu_1'\mu_2' + 2(\mu_1')^3$
$\mu_4 = \mu_4' - 4\mu_1'\mu_3' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4$
Given that µ1’ = -4, µ2’ = 22, µ3’ = -117, µ4’ = 560, substituting these values in the above
equations we get
µ2 = 6, µ3 = 19, µ4 = 32
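The substitution can be verified with a short Python sketch. It assumes the second moment about 5 is 22, the value for which the conversion formulas give the stated answers 6, 19 and 32; the function name `central_from_raw` is illustrative.

```python
def central_from_raw(m1, m2, m3, m4):
    """Convert moments about an arbitrary origin into moments about the mean."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu2, mu3, mu4

# Moments about the value 5 (second moment assumed to be 22)
mu2, mu3, mu4 = central_from_raw(-4, 22, -117, 560)

# The first moment about 5 also locates the mean: mean = 5 + mu1'
mean = 5 + (-4)
print(mu2, mu3, mu4, mean)
```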
Note: Also refer to class notes for additional questions
References:
1. Introductory Biostatistics for the Health Sciences Modern Applications Including
Bootstrap - Michael R. Chernick, Robert H. Friis. Wiley-Interscience
2. Statistics for Anthropology Second Edition - Loren Madrigal. Cambridge University
Press
3. Statistical Methods – SP Gupta and Archana Gupta. Sultan Chand & Sons
4. Statistics for the utterly confused - Lloyd Jaisingh. McGraw Hill Education
5. Grubbs, F.E. 1979. Procedures for detecting outlying observations. In Army Statistics
Manual DARCOM-P706-103, Chapter 3. U.S. Army Research and Development Center,
Aberdeen Proving Ground, MD 21005.
6. American Public Health Association, Standard Methods for the Examination of Water
and