Notes of B.Stats
The word “statistics” is derived from the Latin word “Status” or the Italian word “Statista”,
both of which mean “Political State” or a government. The word statistics was originally
applied only to such facts and figures as were required by the state for official purposes, i.e. data
related to population and property.
The application of statistics was very limited, but rulers and kings needed information about the lands,
agriculture, commerce and population of their states to assess their military potential, their wealth,
taxation and other aspects of government.
Statistics: Definition
As a singular, “By statistics we mean a science that deals with collection, classification,
presentation, analysis and drawing valid inference from numerical facts or data”.
Statistics: Limitation
Discrete Data: A discrete variable takes only distinct and integer values (analogous to 'counting').
Example: number of defective items, number of students absent in statistics class.
Continuous data: A continuous variable takes any value on a range of real numbers (analogous
to measurement). Example: height of first year students, time spent on studying at home.
Ex. (2) For each of the following indicate if a discrete or a continuous random variable provides the
best definition.
(a) number of defective items in a sample of 20 items from a large shipment
(b) yearly income for a family
(c) change in price of a share of IBM common stock in a month
(d) number of errors detected in a corporation's accounts
(e) number of claims on a medical insurance policy in a particular year
(f) amount of oil imported into India in a month
(g) questions answered correctly in 50-objective question examination
(i) number of nonproductive hours in an 8-hour workday.
Scale of Measurement: There are four generally used scales of measurement, listed here from weakest
to strongest.
• Nominal scale: In the nominal scale of measurement, numbers are used simply as labels
of groups or classes. Example: gender of respondent, defectiveness of items.
• Ordinal scale: In the ordinal scale of measurement, data elements may be ordered
according to their relative size or quality. Example: Ranking of brands.
• Interval Scale: In the Interval scale of measurement the value of zero is assigned
arbitrarily and therefore we cannot take ratios of two measurements. But we can take ratios
of intervals. Example: time of day, temperature.
• Ratio Scale: If two measurements are in ratio scale, then we can take ratios of those
measurements. The zero in this scale is an absolute zero. Example: Money, Weight.
Ex (1) A survey by an electric company contains questions on the following, describe the scales of
measurement for the variables implicit in 11 items.
Primary data
Data that is collected by a researcher from first-hand sources, using methods like surveys,
interviews, or experiments. It is collected with the research project in mind, directly from primary
sources.
Secondary data
Data collected by someone else for some other purpose (but being utilized by the investigator for
his own purpose). The term is used in contrast with the term primary data. Secondary data is data
gathered from studies, surveys, or experiments that have been run by other people or for other
research.
Advantages
Primary data
• The investigator collects data specific to the problem under study.
• There is no doubt about the quality of the data collected (for the investigator).
• If required, it may be possible to obtain additional data during the study period.
Secondary data
Disadvantages
Primary data
• The investigator has to contend with all the hassles of data collection
• Ensuring the data collected is of a high standard
• Cost of obtaining the data is often the major expense in studies
Secondary data
• The investigator cannot decide what is collected (if specific data about something is
required, for instance).
• One can only hope that the data is of good quality
• Obtaining additional data (or even clarification) about something is not possible (most often)
Sources of data
Primary sources
Provide raw information and first-hand evidence. Examples include interview transcripts, statistical
data, and works of art. A primary source gives you direct access to the subject of your research.
Following are the methods of obtaining the primary data.
• Observation
• Experiment
• Interviews
• Survey
Secondary sources
Provide second-hand information and commentary from other researchers. Examples include
journal articles, reviews, and academic books. A secondary source describes, interprets, or
synthesizes primary sources.
• Research journal
• Government publication
• Trade and business magazines
• Any other publication.
Primary sources are more credible as evidence, but good research uses both primary and
secondary sources.
Census and Sampling
• Population: The group of objects (subjects) to which the researcher intends to generalize the results of the study.
• Sample: A subset (part) of the population which the researcher investigates.
• Census: The method of statistical enumeration where all members of the population are
studied. A population refers to the set of all observations under concern.
• Sampling: The technique of selecting individual members or a subset of the population to
make inferences from them or estimate characteristics of the whole population.
Advantage of Census:
• Accuracy (In terms of sampling error)
Advantage of sampling:
• Time
• Money
• Accuracy (In terms of quality of data)
• Scope
Average: Introduction
Main Value: One of the objectives of the analysis of data is to get one single value which can
describe the characteristics of the entire mass of the data and which can be considered
representative of the entire data. A value satisfying this criterion is the central value or
“average”.
Central Tendency: The average is the representative or typical value of the data. It usually lies
somewhere near the center of the group, and that is why averages are termed measures of
central tendency or central value.
Comparison: A large volume of data cannot be easily understood or remembered, so a single
value summarizing the prominent features of the data, such as the average, can be used. If two or more
sets of data are to be compared, it is not practical to compare each and every item. So we
require one figure, such as an average, representing the entire data in a condensed form. Thus averages
facilitate comparisons.
Definition
Arithmetic Mean: The most widely used measure of location or central tendency is the Arithmetic
Mean. It is defined as the sum of the observations divided by the number of observations.
Median: When all the observations of a variable are arranged in ascending (or descending)
order, the middle observation is known as the median.
Mode: It is the most frequently occurring observation in the data, i.e. the most common or most
fashionable value, if it exists.
Average for ungrouped data
Example (1) A random sample of 22 business economists were asked to predict the percentage
growth in the consumer price index number over the next year. The forecasts were:
3.6 3.1 3.9 3.7 3.5 3.7 3.4
3.0 3.6 3.4 3.1 2.9 3.0 4.0 2.8
3.8 4.2 2.5 3.1 3.9 2.9 2.6
Find the sample mean.
Example (2): The following data represent the number of days it took 7 individuals to quit smoking
after completing a course designed for this purpose. What is sample median?
1 100 5 2 8 3 7
Example (3): A sample of 12 senior executives found the following results for percentage of total
compensation derived from bonus payments. Find the sample median.
15.8 7.3 28.4 18.2 15.0 24.7
13.1 10.2 29.3 34.7 16.9 25.3
Example (4) The following are the sizes of the last 8 dresses sold at a women's boutique. What is
the sample mode?
8 10 6 4 10 12 14 10
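The definitions of mean, median and mode can be checked on Examples (1) to (4) with a short sketch. It assumes Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median, mode

# Example (1): forecasts of percentage growth in the consumer price index
forecasts = [3.6, 3.1, 3.9, 3.7, 3.5, 3.7, 3.4,
             3.0, 3.6, 3.4, 3.1, 2.9, 3.0, 4.0, 2.8,
             3.8, 4.2, 2.5, 3.1, 3.9, 2.9, 2.6]
# Example (2): days taken to quit smoking
days_to_quit = [1, 100, 5, 2, 8, 3, 7]
# Example (3): percentage of compensation derived from bonus payments
bonus_share = [15.8, 7.3, 28.4, 18.2, 15.0, 24.7,
               13.1, 10.2, 29.3, 34.7, 16.9, 25.3]
# Example (4): sizes of the last 8 dresses sold
dress_sizes = [8, 10, 6, 4, 10, 12, 14, 10]

sample_mean = mean(forecasts)       # sum of observations / number of observations
median_days = median(days_to_quit)  # middle value of the ordered data (odd n)
median_bonus = median(bonus_share)  # average of the two middle values (even n)
mode_size = mode(dress_sizes)       # most frequently occurring value

print(round(sample_mean, 2), median_days, round(median_bonus, 2), mode_size)
```

Note that the single extreme value 100 in Example (2) does not move the median, whereas it would pull the mean far above most of the observations.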
Average for grouped data (discrete case)
Example (5) Mr. XYZ is the quality control manager of ABC Electrical Limited. To check the quality of
the switches, he selects 30 switches at random from the lot and observes the following number of defects in
the 30 switches. Find the mean, median and mode.
Class (No of defects) Frequency
0 2
1 8
2 10
3 6
4 4
Mean: Here the variable X assumes separate, distinct values $x_1, x_2, x_3, \dots, x_k$ with the
corresponding frequencies $f_1, f_2, f_3, \dots, f_k$.
Then the Arithmetic Mean is
\[
\bar{X} = \frac{\text{Sum of the values}}{\text{No. of values}}
        = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \cdots + f_k x_k}{f_1 + f_2 + f_3 + \cdots + f_k}
        = \frac{\sum f x}{n}
\]
where $n = \sum f = f_1 + f_2 + f_3 + \cdots + f_k$.
Wait and think: Mean
• It is based on all observations; hence it is a better representative of the data.
• It can only be calculated for data on an interval or higher scale.
• It is affected by extreme values.
Median: First calculate the less than type cumulative frequency. The median is then given as the
value of the variable for which the cumulative frequency is at or exceeds $\frac{n+1}{2}$, starting from the top,
where $n$ represents the total number of observations.
Wait and think: Median
• It is not affected by extreme values.
• It is not based on all observations.
• It can only be calculated for data on an ordinal or higher scale.
Mode: Here mode can be obtained as the value of the variable with the maximum frequency.
Wait and think: Mode
• It is not affected by extreme values.
• It is not based on all observations.
• It can be calculated for data on a nominal or higher scale.
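The three rules for a discrete frequency distribution can be sketched on the data of Example (5). This is a minimal illustration, not a prescribed method; the variable names are assumptions made for the example.

```python
from itertools import accumulate

# Example (5): number of defects and the number of switches showing it
defects = [0, 1, 2, 3, 4]
frequency = [2, 8, 10, 6, 4]

n = sum(frequency)

# Mean: sum(f * x) / n
mean_defects = sum(f * x for f, x in zip(frequency, defects)) / n

# Median: first value whose less-than cumulative frequency is at or exceeds (n + 1) / 2
cumulative = list(accumulate(frequency))
position = (n + 1) / 2
median_defects = next(x for x, cf in zip(defects, cumulative) if cf >= position)

# Mode: value with the maximum frequency
mode_defects = defects[frequency.index(max(frequency))]

print(round(mean_defects, 2), median_defects, mode_defects)
```

Here the cumulative frequencies are 2, 10, 20, 26, 30 and the median position is 15.5, so the median is the value 2, whose cumulative frequency (20) is the first to reach that position.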
Example (6): “Computer Today” reported on home technology and its usage by persons aged
12 and older. The following data are the hours of personal computer usage during one week for a
sample of 50 persons. Calculate the mean, median and mode.
Class interval (computer usage in hours) Frequency
0-3 5
3-6 28
6-9 8
9-12 6
12-15 3
Mean: Here the variable X assumes the representative values (mid values or class marks) $x_1, x_2, x_3, \dots, x_k$
of the class intervals, with the corresponding frequencies $f_1, f_2, f_3, \dots, f_k$.
Where
𝒍𝟏 - lower limit of Median class
𝒍𝟐 - upper limit of Median class
𝒇- frequency of Median class
𝒄. 𝒇. – cumulative frequency of pre-Median class
Mode: First identify the modal class (the class interval which contains the mode) as the
class interval for which the frequency is maximum. Then the mode is given by
\[
\text{Mode} = l_1 + \frac{(l_2 - l_1)(f_1 - f_0)}{2 f_1 - f_0 - f_2}
\]
where
$l_1$ - lower limit of the modal class
$l_2$ - upper limit of the modal class
$f_0$ - frequency of the pre-modal class
$f_1$ - frequency of the modal class
$f_2$ - frequency of the post-modal class
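As a rough check, the grouped mean (using class marks) and the mode formula can be applied to the computer usage table of Example (6). The variable names below are illustrative, and treating a missing neighbouring class as having frequency 0 is an assumption of this sketch.

```python
# Example (6): weekly computer usage in hours, grouped into classes
lower_limits = [0, 3, 6, 9, 12]
upper_limits = [3, 6, 9, 12, 15]
frequency = [5, 28, 8, 6, 3]

n = sum(frequency)

# Mean: use the class marks (mid values) as representative values
class_marks = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]
grouped_mean = sum(f * x for f, x in zip(frequency, class_marks)) / n

# Mode: locate the modal class, then apply
# mode = l1 + (l2 - l1)(f1 - f0) / (2*f1 - f0 - f2)
m = frequency.index(max(frequency))
f1 = frequency[m]
f0 = frequency[m - 1] if m > 0 else 0                   # assumed 0 if no earlier class
f2 = frequency[m + 1] if m < len(frequency) - 1 else 0  # assumed 0 if no later class
l1, l2 = lower_limits[m], upper_limits[m]
grouped_mode = l1 + (l2 - l1) * (f1 - f0) / (2 * f1 - f0 - f2)

print(round(grouped_mean, 2), round(grouped_mode, 2))
```

The mode works out to about 4.6, which agrees with the answer quoted for this data set later in these notes.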
Example (7): During 3 hours at Heathrow airport, 55 aircraft arrived late. The number of minutes
they were late is shown in the frequency table below. Calculate the mean, median and mode.
Minutes late No. of aircraft
0-10 27
10-20 10
20-30 7
30-40 5
40-50 4
50-60 2
Combined Arithmetic Mean
Example (8) The average wage for 50 male workers is Rs.17630/- & the average wage for 40
female workers is Rs.14540/- in a factory. Find the combined average for all the workers in the
factory.
Combined Mean: If the A.M.s of two groups are $\bar{X}_1$ and $\bar{X}_2$, with $n_1$ and $n_2$ observations in
the groups, then the combined mean of the two groups taken together is given by
\[
\text{Combined mean} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}
\]
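The combined mean weights each group mean by its group size, i.e. it is the total of all observations divided by the total number of observations. A minimal sketch for Example (8), with illustrative variable names:

```python
# Example (8): group sizes and average wages (Rs.)
n_male, mean_male = 50, 17630
n_female, mean_female = 40, 14540

# Total wages of each group = group size * group mean
total_wages = n_male * mean_male + n_female * mean_female
combined_mean = total_wages / (n_male + n_female)

print(round(combined_mean, 2))
```

The result lies between the two group means and closer to the male average, because the male group is larger.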
Example (10) Miss Pooja has Rs.2,00,000 at her disposal, which she has invested in three
different investment opportunities with the following information. Calculate the average rate of return on
the total investment.
Investment option    Money invested    Rate of return on investment (%)
Equities             1,00,000          17
Corporate Bonds      60,000            9
Government Bond      40,000            6
While calculating the A.M., we have assumed that all values are equally important, which may not
be true in many practical situations.
In case when some values are more important than the other, we calculate weighted A.M.
Here the variable X assumes distinct values $x_1, x_2, x_3, \dots, x_k$ together with
their relative importance (weights) $w_1, w_2, w_3, \dots, w_k$.
Then the weighted Arithmetic Mean is
\[
\bar{X}_w = \frac{\text{Weighted sum of the values}}{\text{Total value of the weights}}
          = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \cdots + w_k x_k}{w_1 + w_2 + w_3 + \cdots + w_k}
          = \frac{\sum w x}{\sum w}
\]
Example (11) Calculate the arithmetic mean of the marks obtained by Mrs. Pragati in the examination.
Subjects Credits Marks obtained
Business Communication 2 70
Business Statistics 3 98
Financial accounting 3 86
Business Economics 3 69
Foundation Course 2 78
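The weighted arithmetic mean, sum of (weight x value) divided by the sum of the weights, can be sketched for Examples (10) and (11). The helper name `weighted_mean` is hypothetical; the money invested and the credits are taken as the weights.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Example (10): rates of return (%) weighted by the money invested (Rs.)
rates = [17, 9, 6]
invested = [100_000, 60_000, 40_000]
average_return = weighted_mean(rates, invested)

# Example (11): marks weighted by course credits
marks = [70, 98, 86, 69, 78]
course_credits = [2, 3, 3, 3, 2]
average_marks = weighted_mean(marks, course_credits)

print(round(average_return, 2), round(average_marks, 2))
```

For comparison, the simple (unweighted) average of the three rates of return would be about 10.67%, which understates the return because most of the money is in the highest-yielding option.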
Example (1): The following data represent the operating systems of the smartphones used by a class of
students. Prepare the frequency distribution of the data. (A = Android, W = Windows Phone, I =
iPhone, AM = Amazon’s Fire Phone)
AM, A, I, I, I, W, I, AM, W, A,
I, I, W, A, I, A, AM, W, W, I,
I, I, W, A, A, A, W, I, AM, AM,
A, A, I, A, I, A, A, W, I, I
Solution: Frequency distribution of OS of smartphone
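One way to tally such categorical data is sketched below, assuming Python's `collections.Counter`; the codes are copied from the listing above and the variable names are illustrative.

```python
from collections import Counter

# Example (1): operating system code recorded for each student
responses_text = """
AM, A, I, I, I, W, I, AM, W, A,
I, I, W, A, I, A, AM, W, W, I,
I, I, W, A, A, A, W, I, AM, AM,
A, A, I, A, I, A, A, W, I, I
"""
codes = [code.strip() for code in responses_text.split(",")]

# Frequency distribution: how often each code occurs
os_frequency = Counter(codes)

for code in ["A", "W", "I", "AM"]:
    print(code, os_frequency[code])
print("Total =", sum(os_frequency.values()))
```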
Example (2): Mr. XYZ is the quality control manager of ABC Electrical Limited. To check the quality of
the switches, he selects 30 switches at random from the lot and observes the following number of defects in
the 30 switches.
2 1 3 2 1 3 3 2 4 1
2 1 0 1 0 2 3 2 1 3
1 2 1 4 4 2 4 3 2 2
Solution: Frequency distribution of No. of Defect
Class(No of defects) Frequency
0
1
2
3
4
Example (3): In a study of job satisfaction, a series of tests was administered to 50 subjects. The
following data were obtained; a higher score represents greater satisfaction. Summarise the data
using a frequency distribution.
87 59 80 61 50 60 70 89 84 76
76 41 81 88 47 65 74 84 76 78
67 50 70 46 81 92 53 83 78 67
58 90 73 85 87 77 43 70 64 74
92 75 69 97 75 71 61 46 69 64
Example (4): “Computer Today” reported on home technology and its usage by persons aged
12 and older. The following data are the hours of personal computer usage during one week for a
sample of 50 persons:
4.1 1.5 10.4 5.9 3.4 5.7 1.6 6.1 3.0 3.7
3.1 4.8 2.0 14.8 5.4 4.2 3.9 4.1 11.1 3.5
4.1 4.1 8.8 5.6 4.3 3.3 7.1 10.3 6.2 7.6
10.8 2.8 9.5 12.9 12.1 0.7 4.0 9.2 4.4 5.7
7.2 6.1 5.7 5.9 4.7 3.9 3.7 3.1 6.1 3.1
Summarize the data by constructing a frequency distribution with a class width of 3 hours.
Solution: Frequency distribution of Computer usage in hours
Class interval (computer usage in hours) Frequency
0-3
3-6
6-9
9-12
12-15
Total =
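The tallying for this table can be sketched as follows. The sketch assumes that each class includes its lower limit and excludes its upper limit (so 3.0 hours falls in 3-6); variable names are illustrative.

```python
# Example (4): hours of personal computer usage for 50 persons
usage_hours = [
    4.1, 1.5, 10.4, 5.9, 3.4, 5.7, 1.6, 6.1, 3.0, 3.7,
    3.1, 4.8, 2.0, 14.8, 5.4, 4.2, 3.9, 4.1, 11.1, 3.5,
    4.1, 4.1, 8.8, 5.6, 4.3, 3.3, 7.1, 10.3, 6.2, 7.6,
    10.8, 2.8, 9.5, 12.9, 12.1, 0.7, 4.0, 9.2, 4.4, 5.7,
    7.2, 6.1, 5.7, 5.9, 4.7, 3.9, 3.7, 3.1, 6.1, 3.1,
]
class_width = 3
lower_limits = [0, 3, 6, 9, 12]

# Assumption: lower limit inclusive, upper limit exclusive
frequency = [0] * len(lower_limits)
for hours in usage_hours:
    frequency[int(hours // class_width)] += 1

for lower, count in zip(lower_limits, frequency):
    print(f"{lower}-{lower + class_width}: {count}")
print("Total =", sum(frequency))
```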
The following points should be remembered while making a diagram or graph.
A good diagram or graph:
• Provides a clear summary of the data
• Is a fair and honest representation
• Highlights underlying patterns
• Allows the extraction of a lot of information quickly
A bad diagram:
• Confuses the viewer
• Misleads (either accidentally or intentionally)
Diagrams: Frequency diagram
Example (5) The following frequency diagram represents the number of confirmed cases of COVID-19
in India and the world.
[Figure: frequency diagram of confirmed COVID-19 cases]
Example (7): The following diagram represents Record of Disinvestment (Rs. In Crores) for the
year 1991-02.
[Figure: Record of Disinvestment, comparing Target Set by Government with Actual Receipts; vertical axis from 0 to 14000]
Diagram: Subdivided bar diagram
Example (8): The following diagram represents the distribution of seniors, adults and children in hotel
accommodation for the Irish, British, Mainland European and Rest of World categories.
Pie-chart: With a small number of categories, we could use a pie-chart. The angle for each component can be
calculated using the formula
\[
\text{Angle in degrees} = \frac{\text{Component value} \times 360}{\text{Total value of all components}}
\]
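As an illustration of the angle formula, the sketch below reuses the smartphone operating system data from Example (1) of the frequency distribution section. The counts (12, 15, 8 and 5) are tallied from the 40 responses listed there; they are not data given for a pie-chart example in these notes.

```python
# Frequencies tallied from the 40 responses in the smartphone OS example
os_frequency = {
    "Android": 12,
    "iPhone": 15,
    "Windows Phone": 8,
    "Amazon Fire Phone": 5,
}

total = sum(os_frequency.values())

# Angle in degrees = component value * 360 / total value of all components
angles = {name: count * 360 / total for name, count in os_frequency.items()}

for name, angle in angles.items():
    print(f"{name}: {angle:.0f} degrees")
```

The angles add up to 360 degrees, which is a quick check that no component has been left out.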
Example (9): The following Pie diagram represents distribution of favorite types of movie.
Example (10): “Computer Today” reported on home technology and its usage by persons aged
12 and older. The following data are the hours of personal computer usage during one week for a
sample of 50 persons:
4.1 1.5 10.4 5.9 3.4 5.7 1.6 6.1 3.0 3.7
3.1 4.8 2.0 14.8 5.4 4.2 3.9 4.1 11.1 3.5
4.1 4.1 8.8 5.6 4.3 3.3 7.1 10.3 6.2 7.6
10.8 2.8 9.5 12.9 12.1 0.7 4.0 9.2 4.4 5.7
7.2 6.1 5.7 5.9 4.7 3.9 3.7 3.1 6.1 3.1
Prepare the histogram representing the data. Calculate the mode from histogram and verify your
answer by calculating it using the formula. (Answer=4.6)
Class interval Frequency
0-3 5
3-6 28
6-9 8
9-12 6
12-15 3
Total = 50
Presentation of Data: Ogives
Cumulative (less than type) frequency graph: It plots the number of observations less than
a given value. Plot the points by taking the upper limit of each class interval on the x-axis and the
corresponding cumulative frequency on the y-axis. Join these points with a smooth freehand curve.
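The points to be plotted for a less than type ogive can be sketched for the computer usage frequency table (classes 0-3 to 12-15). Variable names are illustrative.

```python
from itertools import accumulate

# Computer usage frequency table: upper class limits and class frequencies
upper_limits = [3, 6, 9, 12, 15]
frequency = [5, 28, 8, 6, 3]

# Less-than cumulative frequency: running total of the class frequencies
cumulative_frequency = list(accumulate(frequency))

# Points (upper limit, cumulative frequency) to plot and join by a smooth curve
ogive_points = list(zip(upper_limits, cumulative_frequency))
print(ogive_points)
```

The last cumulative frequency equals the total number of observations (50), which is a useful check on the arithmetic.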
Example (11): For the data given in Example (10), draw the less than type cumulative frequency
curve. Hence find the value of the median from the graph and verify your answer by calculating it
using the formula. (Given Answer = 5.22)
Example (12): During 3 hours at Heathrow airport, 55 aircraft arrived late. The number of minutes
they were late is shown in the frequency table below.
Minutes late No. of aircraft
0-10 27
10-20 10
20-30 7
30-40 5
40-50 4
50-60 2
Prepare the histogram representing the data. Calculate the mode from histogram and verify your
answer by calculating it using the formula. (Mode =6.14)
Draw the less than type cumulative curve. Hence find the value of median from the graph and
verify your answer by calculating it using the formula. (Median = 11)
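The median formula for grouped data is not reproduced in these notes, so the sketch below rests on an assumption: the median is found by linear interpolation inside the median class, with the median position taken as (n + 1)/2. Under that assumption the Heathrow data give the stated median of 11 (some texts use n/2 as the position instead, which would give 10.5). The function name `grouped_median` is hypothetical.

```python
from itertools import accumulate

def grouped_median(lower_limits, upper_limits, frequency):
    """Interpolate the median inside the median class.

    Assumption: the median position is (n + 1) / 2.
    """
    n = sum(frequency)
    position = (n + 1) / 2
    cumulative = list(accumulate(frequency))
    # Median class: first class whose cumulative frequency reaches the position
    m = next(i for i, cf in enumerate(cumulative) if cf >= position)
    cf_before = cumulative[m - 1] if m > 0 else 0  # c.f. of the pre-median class
    l1, l2, f = lower_limits[m], upper_limits[m], frequency[m]
    return l1 + (l2 - l1) * (position - cf_before) / f

# Example (12): minutes late for 55 aircraft
lower = [0, 10, 20, 30, 40, 50]
upper = [10, 20, 30, 40, 50, 60]
aircraft = [27, 10, 7, 5, 4, 2]

median_minutes = grouped_median(lower, upper, aircraft)
print(median_minutes)
```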
Quartiles: Quartiles are not measures of central tendency but partitioning values, that is, they
are specific points in a data set that separate a large ordered data set into four quarters.
First the data must be arranged in ascending order, and then the quartiles are given as follows.
First quartile (lower quartile), Q1: The first quartile, Q1, divides the ordered data set such that
25% of observations are at or below this value.
\[
Q_1 = \left(\frac{n+1}{4}\right)^{\text{th}} \text{ value}
\]
Second quartile, Q2: The second quartile, Q2, divides the ordered data set such that 50% of
observations are at or below this value.
\[
Q_2 = \text{Median} = \left\{2\left(\frac{n+1}{4}\right)\right\}^{\text{th}} \text{ value} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ value}
\]
Third quartile (upper quartile), Q3: The third quartile, Q3, divides the ordered data set such that
75% of observations are at or below this value.
\[
Q_3 = \left\{3\left(\frac{n+1}{4}\right)\right\}^{\text{th}} \text{ value}
\]
where $n$ is the number of observations in the data.
Example (1): The growing use of personal computers is suggested to be one reason people can
operate at-home businesses. The following is a sample of age data for individuals working at home.
22 58 24 50 29 52 57 31 30 41
44 40 46 29 31 37 32 44 49 29
Compute the first, second and third quartiles.
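The quartile positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4 are usually fractional. The sketch below assumes that a fractional position is resolved by linear interpolation between the two neighbouring ordered values; the notes do not state this rule, and the helper name `value_at_position` is hypothetical.

```python
def value_at_position(sorted_data, position):
    """Return the value at a 1-based, possibly fractional, position.

    Assumption: a fractional position is resolved by linear interpolation
    between the two neighbouring ordered observations.
    """
    whole = int(position)
    fraction = position - whole
    lower = sorted_data[whole - 1]
    if fraction == 0:
        return lower
    return lower + fraction * (sorted_data[whole] - lower)

# Example (1): ages of individuals working at home, arranged in ascending order
ages = sorted([22, 58, 24, 50, 29, 52, 57, 31, 30, 41,
               44, 40, 46, 29, 31, 37, 32, 44, 49, 29])
n = len(ages)

q1 = value_at_position(ages, (n + 1) / 4)       # 5.25th value
q2 = value_at_position(ages, (n + 1) / 2)       # 10.5th value (median)
q3 = value_at_position(ages, 3 * (n + 1) / 4)   # 15.75th value

print(q1, q2, q3)
```

For instance, the 5.25th value lies a quarter of the way from the 5th ordered value (29) to the 6th (30), giving 29.25.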
Example (2) The IQ scores for a sample of 30 students who are entering their first year of high
school are shown below:
95 95 97 98 101
102 103 104 105 106
106 107 108 108 110
111 115 115 117 119
119 121 121 126 126
128 133 134 136 142
Find the three quartiles. Without calculating, give the value of median.
Example (3) Mr. XYZ is the quality control manager of ABC Electrical Limited. To check the quality of
the switches, he selects 30 switches at random from the lot and observes the following number of defects in
the 30 switches. Find the three quartiles.
Class (No of defects) Frequency
0 2
1 8
2 10
3 6
4 4
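For a discrete frequency table, one possible approach is to extend the cumulative frequency rule used for the median: take each quartile as the first value whose less than type cumulative frequency is at or exceeds the quartile position. This extension is an assumption of the sketch, and the function name `quartile` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

# Example (3): number of defects and the number of switches showing it
defects = [0, 1, 2, 3, 4]
frequency = [2, 8, 10, 6, 4]

n = sum(frequency)
cumulative = list(accumulate(frequency))  # less-than type cumulative frequencies

def quartile(k):
    """k-th quartile: first value whose cumulative frequency reaches k(n + 1)/4."""
    position = k * (n + 1) / 4
    # bisect_left finds the first index with cumulative[index] >= position
    return defects[bisect_left(cumulative, position)]

q1, q2, q3 = quartile(1), quartile(2), quartile(3)
print(q1, q2, q3)
```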
Example (4): During 3 hours at Heathrow airport, 55 aircraft arrived late. The number of minutes
they were late is shown in the frequency table below. Calculate the three quartiles.
Minutes late No. of aircraft
0-10 27
10-20 10
20-30 7
30-40 5
40-50 4
50-60 2
Dispersion: Introduction
In addition to averages, some further information about the observations is required to know the
extent to which the values vary from one another and from the central value.
A measure of the spread or scatter of the data is called a measure of variation or dispersion.
A measure of dispersion gives us an idea about the reliability of the average. When the variability
is small, the average is more reliable and is a better estimate of the population average; when
the dispersion is large, the average is not a good representative of the data.
Measures of dispersion can be used to compare two or more distributions. The one with less
dispersion is more consistent or homogeneous and the one with more dispersion is less consistent.
Measure of dispersion    Corresponding relative measure
Quartile deviation       Coefficient of Q.D.
Mean deviation           Coefficient of M.D.
Standard deviation       Coefficient of variation
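Range: It is defined as the difference between the maximum and minimum observations in the
data.
Quartile Deviation: It is defined as half the difference between the third and first quartiles.
\[
Q.D. = \frac{(Q_3 - Q_2) + (Q_2 - Q_1)}{2} = \frac{Q_3 - Q_1}{2}
\]
And the corresponding relative measure is given by
\[
\text{Coeff. of } Q.D. = \frac{Q_3 - Q_1}{Q_3 + Q_1}
\]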
Mean Deviation: It is defined as the average of the absolute deviations of the values from the mean.
\[
M.D. = \frac{1}{n} \sum |X - \bar{X}|
\]
And the corresponding relative measure is given by
\[
\text{Coeff. of } M.D. = \frac{M.D.}{\bar{X}}
\]
Standard Deviation: It is defined as the square root of the average of the squared deviations of the values from the
mean.
\[
S.D. = S = \sqrt{\frac{1}{n} \sum (X - \bar{X})^2} = \sqrt{\frac{1}{n} \sum X^2 - \bar{X}^2}
\]
And the corresponding relative measure is known as the coefficient of variation and is given by
\[
\text{Coeff. of variation} = \frac{S.D.}{\bar{X}} \times 100
\]
Example (1) Eight participants in a bike race had the following finishing times in minutes.
28 22 26 33 21 23 37 24
Compute the range, Q.D, M.D and S.D. and their coefficient.
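The measures of dispersion can be sketched for the bike race data of Example (1). Two assumptions are made: fractional quartile positions are resolved by linear interpolation between neighbouring ordered values, and the standard deviation divides by n (Python's `statistics.pstdev`). The helper name `value_at_position` is hypothetical.

```python
from statistics import mean, pstdev

def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional, position.

    Assumption: fractional positions are resolved by linear interpolation
    between the two neighbouring ordered observations.
    """
    whole = int(position)
    fraction = position - whole
    lower = sorted_data[whole - 1]
    if fraction == 0:
        return lower
    return lower + fraction * (sorted_data[whole] - lower)

# Example (1): finishing times in minutes, arranged in ascending order
times = sorted([28, 22, 26, 33, 21, 23, 37, 24])
n = len(times)
x_bar = mean(times)

# Range: maximum minus minimum
data_range = times[-1] - times[0]

# Quartile deviation and its coefficient
q1 = value_at_position(times, (n + 1) / 4)
q3 = value_at_position(times, 3 * (n + 1) / 4)
quartile_deviation = (q3 - q1) / 2
coeff_qd = (q3 - q1) / (q3 + q1)

# Mean deviation about the mean and its coefficient
mean_deviation = sum(abs(x - x_bar) for x in times) / n
coeff_md = mean_deviation / x_bar

# Standard deviation (dividing by n) and coefficient of variation
std_deviation = pstdev(times)
coeff_variation = std_deviation / x_bar * 100

print(data_range, quartile_deviation, mean_deviation)
print(round(std_deviation, 2), round(coeff_variation, 2))
```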
Example (2) The Los Angeles Times regularly reports the air quality index for various area of the
southern California. A sample of air quality index values for Pomona provided the following data:
28 42 58 48 45 55 60 49 50
Compute the range, Q.D, M.D and S.D. and their coefficient.
For a frequency distribution:
\[
S.D. = S = \sqrt{\frac{1}{n} \sum f (X - \bar{X})^2} = \sqrt{\frac{1}{n} \sum f X^2 - \bar{X}^2}
\]
Example (3) The scores of 20 students in a color sensitivity test are given by the following frequency
distribution. Calculate the range, Q.D., M.D. and S.D. and their coefficients. (1 - least sensitive and
7 - most sensitive)
score 1 2 3 4 5 6 7
frequency 3 1 3 4 6 2 1
Example (4) The following is the frequency distribution of the age of Instagram users in a random survey.
Calculate the range, Q.D., M.D. and S.D. and their coefficients.
Age of Instagram user No. of user
12-18 9
18-24 34
24-30 35
30-36 16
36-42 8
42-48 4
48-54 2
Example (5) In a study of job satisfaction, a series of tests was administered to 50 subjects. The
following data were obtained; a higher score represents greater satisfaction. Calculate the range,
Q.D., M.D. and S.D. and their coefficients.
Symmetry: The shape of a distribution is said to be symmetric if the observations are balanced,
or evenly distributed, about the mean.
A positively skewed (or skewed to the right) distribution has a tail that extends to the right in the
direction of positive values.
Generally for a left skewed distribution, Mean < median < mode
Example (1) The following data show the mean and median of the 3-year return for two types of funds.
Comment on the skewness of the data.
          % return of Growth fund    % return of Value fund
Mean      22.44                      20.42
Median    22.32                      19.46
Kurtosis: It measures the extent to which values that are very different from the mean (Normal)
affect the shape of the distribution of the data set. Kurtosis affects the peakedness of the curve of
the distribution, i.e. how sharply the curve rises approaching the center of the distribution.
In affecting the shape of the central peak, the relative concentration of values near the mean also
affects the ends, or tails, of the distribution of data. Thus, kurtosis is the “peakedness” and
“tailedness” of the distribution of data.
Mesokurtic: The distribution of data which is neither very peaked nor very flat-topped (Normally
distributed) is also called mesokurtic.
Leptokurtic: Here, the distribution has longer and fatter tails than normal distribution. Moreover,
the peak is higher and also sharper when compared to normal distribution. It means the
distribution produces more extreme outliers than does the normal distribution
Platykurtic: Here, the distribution of the data has shorter and thinner tails than normal distribution.
Moreover, the peak is lower and also broader when compared to normal distribution. It means the
distribution produces fewer and less extreme outliers than does the normal