Introduction To Statistics
Syllabus
• Descriptive Statistics – Measures Of Central Tendency & Dispersions
• Correlation, Index Numbers
• Probability Theory: Concepts Of Probability, Distributions, Moments
• Central Limit Theorem
• Sampling Methods & Sampling Distribution
• Statistical Inferences, Hypothesis Testing
Introduction
Statistics plays a vital role in every domain. It helps in collecting data in any field and
in analysing those data using statistical techniques, and at the present time it has wide
importance and application. A brief note on its origin and development follows.
• It is one of the very old branches of science dealing with numbers. It is as old as
human society. Earlier it was regarded as “Science of Statecraft”. It gave more
importance to administrative activities, and its scope was then limited to the data
needed by the state. Its scope later widened to cover modern methods such as
i. Estimation Theory
ii. Sampling Distribution
iii. Analysis of Variance
P.C. Mahalanobis is regarded as the Father of Indian Statistics.
Meaning of statistics
The term statistics is used in two senses: singular and plural.
• In the plural sense, statistics refers to the data themselves, quantitative as well as
qualitative, generally collected with statistical analysis in mind.
• In the singular sense, statistics is the scientific method that helps in collecting,
presenting and analysing data; it is this sense that brings its major characteristics into
the limelight.
Definition of statistics
• A. L. Bowley - “Statistics is the science of measurement of social organism regarded
as a whole in all its manifestations”.
• According to Seligman, “Statistics is the science which deals with the methods of
collecting, classifying, presenting, comparing and interpreting numerical data
collected to throw some light on any sphere of enquiry”.
• Croxton and Cowden defined statistics as “the collection, presentation, analysis and
interpretation of numerical data”.
Classification of statistics
On the basis of statistical methods used, we have two types.
a) Descriptive statistics:
Descriptive statistics are brief descriptive coefficients that summarize a given data
set, which can be either a representation of the entire population or a sample of a
population. Descriptive statistics are broken down into measures of central
tendency and measures of variability (spread). Measures of central tendency
include the mean, median, and mode, while measures of variability include
standard deviation, variance, minimum and maximum values, kurtosis, and
skewness.
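As a quick illustration of these measures, here is a minimal sketch, assuming Python with NumPy and a recent SciPy installed; the data values are made up.
```python
# Minimal sketch of the descriptive measures above; NumPy and SciPy assumed.
import numpy as np
from scipy import stats

data = np.array([12, 15, 15, 18, 20, 22, 22, 22, 25, 30])

print("mean     :", np.mean(data))
print("median   :", np.median(data))
print("mode     :", stats.mode(data, keepdims=False).mode)   # needs SciPy >= 1.9
print("range    :", data.max() - data.min())
print("variance :", np.var(data, ddof=1))    # sample variance
print("std dev  :", np.std(data, ddof=1))    # sample standard deviation
print("skewness :", stats.skew(data))
print("kurtosis :", stats.kurtosis(data))    # excess kurtosis
```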
b) Inferential statistics
Inferential statistics takes data from a sample and makes inferences about the
larger population from which the sample was drawn. Because the goal of
inferential statistics is to draw conclusions from a sample and generalize them to
a population, we need to have confidence that our sample accurately reflects the
population.
Inferential statistics are again classified into
1. Parametric Method:
Parametric inferential tests are carried out on data that follow certain
parameters: the data will be normal; numbers can be added, subtracted,
multiplied and divided; variances are equal when comparing two or more
groups; and the sample should be large and randomly selected. There are
generally more statistical technique options for the analysis of parametric
than non-parametric data, and parametric statistics are considered to be
the more powerful. Common examples of parametric tests are: correlated
t-tests and the Pearson r correlation coefficient.
2. Non-Parametric Method:
Non-parametric tests relate to data that are flexible and do not follow a
normal distribution. They are also known as “distribution-free” and the
data are generally ranked or grouped. Non-parametric data are lacking
those same parameters and cannot be added, subtracted, multiplied, or
divided. These data include nominal measurements such as gender or
race; or ordinal levels of measurement such as IQ scales, or survey
response categories such as “good, better, best”, “agree, neutral, disagree”,
etc. Examples include ranking, the chi-square test, binomial test and
Spearman's rank correlation coefficient.
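To make the contrast concrete, a small hedged sketch follows, assuming SciPy is available; it runs a Pearson (parametric) and a Spearman (non-parametric, rank-based) correlation on the same made-up data.
```python
# Parametric vs non-parametric measure of association on the same data (SciPy assumed).
from scipy import stats

hours_studied = [2, 4, 5, 7, 8, 10, 12]
exam_score    = [50, 55, 60, 68, 72, 80, 95]

r, p_r = stats.pearsonr(hours_studied, exam_score)        # parametric
rho, p_rho = stats.spearmanr(hours_studied, exam_score)   # non-parametric (rank based)

print(f"Pearson r    = {r:.3f}  (p = {p_r:.4f})")
print(f"Spearman rho = {rho:.3f}  (p = {p_rho:.4f})")
```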
Data
There are different types of data in Statistics that are collected, analysed, interpreted
and presented. Data are the individual pieces of factual information recorded and used
for the purpose of analysis. The two further steps of data analysis are interpretation and
presentation, and statistics are the result of data analysis. Thus, data is a collection of
observations on one or more variables of interest.
Terminologies
• Element: The entities on which data are collected
• Variable: The characteristic of interest for the element
• Observation: The numerical value of a variable for an element
Types of data on the basis of the number of variables
1. Univariate data
This type of data consists of only one variable. The analysis of univariate data is thus the
simplest form of analysis since the information deals with only one quantity that
changes.
2. Bivariate data
This type of data involves two different variables. The analysis of this type of data deals
with causes and relationships and the analysis is done to find out the relationship
among the two variables.
3. Multivariate data
When the data involves three or more variables, it is categorized under multivariate data.
Observation Wage/hr Education/year Experience/year
1 3.2 11 2
2 3.24 12 22
3 4 11 7
Types of data on the basis of time
1. Time Series Data
Time series data consist of observations on one or more variables collected over
successive periods of time; here the order of the observations matters.
2. Cross-Sectional Data
Cross-sectional data analysis is when you analyze a data set at a fixed point in time.
Surveys and government records are some common sources of cross-sectional data. The
datasets record observations of multiple variables at a particular point in time. Data is
collected at the same or approximately the same point in time. Order doesn’t matter.
observation Wage/hr Education/year Experience/year
1 90 11 2
2 240 12 22
3 200 11 7
3. Pooled Data
Pooled data occur when we have a “time series of cross sections,” but the observations
in each cross section do not necessarily refer to the same unit. Data with both cross
sectional and time series features. Here we are pooling different cross section data sets.
1 1993 85500 42
2 1993 67300 36
3 1993 134000 38
4 1995 65000 41
5 1995 182400 16
6 1995 97500 15
4. Panel data
Panel data refers to samples of the same cross-sectional units observed at multiple
points in time. Here order matters.
3 2 1986 2 64300 75
4 2 1990 1 65100 75
Types of data on the basis of nature
1. Qualitative data
Qualitative data are measures of 'types' and may be represented by a name, symbol, or
a number code. Qualitative data are data about categorical variables (e.g. what type). It
can’t be expressed in numerical terms. Eg:- Sex, Religion, Attitude
2. Quantitative data
Quantitative data are measures of values or counts and are expressed as numbers.
Quantitative data are data about numeric variables (e.g. how many; how much; or how
often). Eg:- Height , Weight, etc
Variables
A variable is the characteristic under study that assumes different values for different
elements. Variables are of two types:
1. Discrete variables are countable in a finite amount of time. For example, you can
count the change in your pocket. You can count the money in your bank account.
You could also count the amount of money in everyone’s bank accounts. It might
take you a long time to count that last item, but the point is—it’s still countable.
2. Continuous Variables would (literally) take forever to count. In fact, you would
never finish counting them. For example, take age: you can’t count “age”, because it
can always be measured more finely. You could be 25 years, 10 months, 2 days, 5 hours,
4 seconds, 4 milliseconds, 8 nanoseconds, 99 picoseconds … and so on.
Scales of measurement
1. Nominal scale of measurement: The nominal scale of measurement defines the
identity property of data. This scale has certain characteristics, but doesn’t have any
form of numerical meaning. The data can be placed into categories but can’t be
multiplied, divided, added or subtracted from one another. It’s also not possible to
measure the difference between data points. Examples of nominal data include eye
colour and country of birth. Nominal data can be broken down again into three
categories:
• Nominal with order: Some nominal data can be sub-categorized in order, such as
“cold, warm, hot and very hot.”
• Nominal without order: Nominal data can also be sub-categorized as nominal
without order, such as male and female.
• Dichotomous: Dichotomous data is defined by having only two categories or
levels, such as ‘yes’ and ‘no’.
2. Ordinal scale of measurement: The ordinal scale defines data that is placed in a
specific order. While each value is ranked, there’s no information that specifies what
differentiates the categories from each other. These values can’t be added to or
subtracted from. An example of this kind of data would include satisfaction data points
in a survey, where ‘one = happy, two = neutral, and three = unhappy.’ Where someone
finished in a race also describes ordinal data. While first place, second place or third
place shows what order the runners finished in, it doesn’t specify how far the first-place
finisher was in front of the second-place finisher.
A statistical investigation involves the following stages:
1. Collection Of Data
2. Organization Of Data
3. Presentation Of Data
4. Analysis Of Data
5. Interpretation Of Data
Collection of data
In Statistics, data collection is a process of gathering information from all the relevant
sources to find a solution to the research problem. It helps to evaluate the outcome of
the problem. The data collection methods allow a person to conclude an answer to the
relevant question. Depending on the type of data, the data collection method is divided
into two categories, namely primary data collection and secondary data collection.
Observation Method
Observation method is used when the study relates to behavioural science. This
method is planned systematically and is subject to many controls and checks. The
observations may be structured or unstructured, participant or non-participant, and
controlled or uncontrolled.
Questionnaire Method
In this method, a set of questions is mailed to the respondents, who should read,
reply and subsequently return the questionnaire. The questions are printed in a
definite order on the form. A good questionnaire should be short and simple, with the
questions arranged in a logical order.
Schedules
This method is similar to the questionnaire method with a slight difference: the
enumerators are specially appointed for the purpose of filling in the schedules. The
enumerator explains the aims and objects of the investigation and may remove
misunderstandings, if any come up. Enumerators should be trained to perform their job
with hard work and patience.
Questionnaire vs Schedules
Sources of Secondary Data
• Government publications
• Public records
• Historical and statistical documents
• Business documents
• Technical and trade journals
• Diaries
• Letters
• Unpublished biographies, etc.
Organization of data
Data collected in its original form is raw data. The systematic classification of the
raw data is called organization of data. Its purpose is to present the data in a readily
comprehensible, condensed form which will highlight the important characteristics of the
data, facilitate comparisons and render it suitable for further processing (statistical
analysis) and interpretation.
Data can be presented in
• Table
• Diagram or Graph
Tabular Presentation
It is an orderly and logical arrangement of data into rows and columns. It presents the
voluminous and heterogeneous data in a condensed and homogeneous form.
Systematic arrangement of the raw data into different homogeneous classes is
necessary to sort out the relevant and significant features from the irrelevant and
insignificant ones. Thus classification of the data becomes a preliminary step to its tabulation.
Classification of Data
It is the process of arranging the data into groups or classes according to resemblances
and similarities. Data may be classified in the following ways:
a) Geographical Classification
Here the data is classified on the basis of geographical or locational differences, such as
states, districts or cities.
b) Chronological Classification
It is nothing but the time series data. Here the data is classified on the basis of
time
c) Qualitative Classification
Classified on the basis of descriptive characteristics like sex, literacy, religion, caste, etc.,
i.e. attributes which cannot be quantified
d) Quantitative Classification
On the basis of some characteristics which can be measured such as height, weight,
income, etc. It is mainly organized in two forms
Discrete Frequency Distribution
Used when the identity of the elements and their order do not matter. This is best suited
for discrete variables. Here the data are classified into different class intervals and the
frequency of each is recorded against it. The various groups into which the values of the
variable are classified are known as Classes or Class Intervals. The length of a class
interval is called the width or magnitude of the class. The two values specifying a class
are called the Class Limits: the upper class limit is the larger value and the lower class
limit is the smaller value.
C) Continuous Frequency Distribution
A continuous frequency distribution is a series in which the data are classified into
different class intervals without gaps and their respective frequencies are assigned as
per the class intervals and class width.
Class Interval   Frequency
20-25            6
25-30            10
30-35            14
35-40            9
1. Inclusive Type Class: Inclusive class intervals contain values up to upper class
limit i.e. upper class limit is included. The upper class limit of preceding class
interval and lower class limit of succeeding class interval are different.
Marks No. of students
15-19 11
20-24 9
25-29 12
30-34 26
35-39 32
40-44 35
2. Exclusive Type Class: Exclusive class intervals contain values lower than the
upper-class limit i.e. upper-class limit is excluded. The upper-class limit of
preceding class interval and lower-class limit of succeeding class interval are
same.
Class Interval   Frequency
20-25            6
25-30            10
30-35            14
35-40            9
Cumulative Frequencies
Cumulative frequency is used to determine the number of observations that lie above
(or below) a particular value in a data set. The cumulative frequency is calculated using a
frequency distribution table, which can be constructed from stem and leaf plots or
directly from the data. The cumulative frequency is calculated by adding each frequency
from a frequency distribution table to the sum of its predecessors. There are two types:
1. Less than cumulative frequency
It is obtained by finding the cumulative totals of frequencies starting from the lowest
value of the variable (class) to the highest value (class).
Class Interval   Frequency   Less than cumulative frequency
30-35            5           5
35-40            10          15
40-45            15          30
45-50            30          60
50-55            5           65
55-60            5           70
2. More than cumulative frequency
It is obtained by finding the cumulative totals of frequencies starting from the highest
value of the variable (class) to the lowest value (class).
Class Interval   Frequency   More than cumulative frequency
30-35            5           70
35-40            10          65
40-45            15          55
45-50            30          40
50-55            5           10
55-60            5           5
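The cumulative totals above can be reproduced with a short sketch in plain Python (no external libraries assumed):
```python
# Compute "less than" and "more than" cumulative frequencies for the classes above.
classes = ["30-35", "35-40", "40-45", "45-50", "50-55", "55-60"]
freq    = [5, 10, 15, 30, 5, 5]

less_than, running = [], 0
for f in freq:                 # accumulate from the lowest class upwards
    running += f
    less_than.append(running)

more_than, running = [], 0
for f in reversed(freq):       # accumulate from the highest class downwards
    running += f
    more_than.append(running)
more_than.reverse()

for c, f, lt, mt in zip(classes, freq, less_than, more_than):
    print(c, f, lt, mt)
```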
Presentation Of The Data
• Textual presentation
• Data tables
• Diagrammatic presentation
Tabular Presentation
Table Number: Each table should have a specific table number for ease of access and
locating. This number can be readily mentioned anywhere which serves as a reference and
leads us directly to the data mentioned in that particular table.
Title: A table must contain a title that clearly tells the readers about the data it contains,
time period of study, place of study and the nature of classification of data.
Headnotes: A headnote further aids in the purpose of a title and displays more
information about the table. Generally, headnotes present the units of data in brackets at
the end of a table title.
Stubs: These are the titles of the rows in a table. Thus, a stub displays information about the
data contained in a particular row.
Caption: A caption is the title of a column in the data table. In fact, it is the counterpart of a
stub and indicates the information contained in a column.
Body or field: The body of a table is the content of a table in its entirety. Each item in a
body is known as a ‘cell’.
Footnotes: Footnotes are rarely used. In effect, they supplement the title of a table if
required.
Source: When using data obtained from a secondary source, this source has to be
mentioned below the footnote.
Diagrammatic Presentation
When presented diagrammatically, data is easy to interpret with just a glance. In such a
case we need to learn how to represent data diagrammatically via bar diagrams, pie charts
etc.
Bar Diagrams
As the name suggests, when data is presented in form of bars or rectangles, it is termed to
be a bar diagram.
• Simple Bar Diagram: These are the most basic type of bar diagrams. A simple bar
diagram represents only a single set of numerical data. Generally, simple bar
diagrams are used to represent time series data for a single entity.
• Multiple Bar Diagram: Unlike single bar diagram, a multiple bar diagram can
represent two or more sets of numerical data on the same bar diagram. Generally,
these are constructed to facilitate comparison between two entities like average
height and average weight, birth rates and death rates etc.
• Sub-divided or Differential Bar Diagrams: Sub-divided bar diagrams are useful
when we need to represent the total values and the contribution of various sections
of the total simultaneously. The different sections are shaded with different colors in
the same bar.
Pie Diagrams
In addition to bar diagrams, pie diagrams are also widely used to pictorially represent data.
In this, a circle is divided into various segments (sectors), and the size of each sector is
decided on the basis of the percentage share of the corresponding category.
Histogram: Graph of a frequency distribution in which the classes are marked on the
horizontal axis and the frequencies on the vertical axis.
Frequency polygon: A frequency polygon is a line graph of class frequency plotted
against class midpoint. It can be obtained by joining the midpoints of the tops of the
rectangles in the histogram
Frequency curve: A frequency curve is a smooth curve for which the total area is taken to
be unity. It is a limiting form of a histogram or frequency polygon
Ogives: are graphs that are used to estimate how many numbers lie below or above a
particular variable or value in data. To construct an Ogive, firstly, the cumulative
frequency of the variables is calculated using a frequency table.
Analysis of data
1. Measure Of Central Tendency
2. Measure Of Dispersion
3. Measure Of Skewness
4. Measure Of Kurtosis
Measures of central tendency are classified into:
1. Mathematical Average
a. Arithmetic mean
b. Geometric mean
c. Harmonic mean
2. Positional averages
a. Median
b. Mode
3. Partition Values
a. Quartiles
b. Deciles
c. Percentiles
➢ Arithmetic Mean
Arithmetic mean represents a number that is obtained by dividing the sum of the
elements of a set by the number of values in the set. You can use the layman's term
"average", or be a little fancier and use the term "arithmetic mean"; take your pick, they
both mean the same. The arithmetic mean may be either simple or weighted. Its main
properties are:
• The sum of deviations of the items from their arithmetic mean is always zero, i.e.
∑(x – x̄) = 0.
• The sum of the squared deviations of the items from Arithmetic Mean (A.M) is
minimum, which is less than the sum of the squared deviations of the items from
any other values.
• If each item in the arithmetic series is substituted by the mean, then the sum of
these replacements will be equal to the sum of the specific items.
Merits
• Easy to calculate
• Based on all observation
• Suitable for further mathematical treatment
• Least affected by fluctuations of sampling (among all averages)
Demerits
➢ Geometric mean
The Geometric Mean (GM) is the average value or mean which signifies the central
tendency of the set of numbers by finding the product of their values.
Merits
• Rigidly defined
• Based on all the observation
• Suitable for further mathematical treatment
• Not affected much by fluctuations of sampling
Demerits
Uses
➢ Harmonic mean
The harmonic mean is a type of numerical average. It is calculated by dividing the
number of observations by the sum of the reciprocals of the values in the series. Thus, the
harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.
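Before turning to the merits and demerits, here is a minimal sketch comparing the three mathematical averages on the same made-up data, assuming Python 3.8+ (the statistics module's geometric_mean was added in 3.8):
```python
# The three mathematical averages on the same data (Python 3.8+ assumed).
import statistics

data = [4, 6, 8, 10, 12]

am = statistics.mean(data)            # sum(x) / n
gm = statistics.geometric_mean(data)  # n-th root of the product of the values
hm = statistics.harmonic_mean(data)   # n / sum(1/x)

print(f"AM = {am:.4f}, GM = {gm:.4f}, HM = {hm:.4f}")
# For positive data, AM >= GM >= HM always holds.
```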
Merits
• Rigidly defined
Demerits
Positional Averages
• Its value depends upon the position occupied by a value in the frequency
distribution.
1. Median
2. Mode
➢ Median
• The idea of the median first appeared in Edward Wright’s book “Certaine
Errors In Navigation” in 1599.
• Antoine Augustin Cournot in 1843 was the first to use the term median for the
value that divides a probability distribution into two equal halves.
• Gustav Theodor Fechner popularized the median into formal analysis of data.
• Francis Galton coined the English term median in 1881, having earlier used the
terms “middle-most value” (till 1869) and “medium” (in 1880).
Median Properties
• Rigidly defined
• Easy to understand and calculate
• Not affected by extreme values
• Very useful measure when data is skewed
• Best average when data have extreme values
• Can be calculated for open ended distribution.
• Can be located graphically (ogives).
• Best measure for studying qualitative features or attributes of an observation.
Demerits
• Exact median can’t be determined for an even no. of observation.
• Not suitable for further mathematical treatment
• Value of the median is affected by the number of observations rather than values
of the observation.
➢ Mode
The mode is one of the measures of central tendency that can be calculated for a
given set of data values (the others being the mean and the median). The mode or the
modal value is by definition the value in a series of observations that occurs with the
highest frequency.
Properties of Mode:
• The mode is not unduly affected by extreme value, that is, values that are
extremely high or extremely low. For example, if we are given the following set of
observations:
• 1, 1, 1, 1, 1, 2, 2, 100
• The mean of the above set of data values is 13.625 which is clearly not
representative of the above data values. However, the mode which is equal to 1 is
clearly representative of a typical value from the above data set. This is one
advantage of the mode compared to the mean.
• The mode is not calculated on all observations in a data set.
• The value of the mode can be computed graphically whereas the value of the
mean cannot be calculated graphically.
• The value of the mode can be calculated in open end distributions without
knowing the class limits.
• The mode can be conveniently found even if the frequency distribution has class
intervals of unequal magnitude provided that the modal class and the classes
succeeding and preceding it are of the same magnitude.
• Sometimes it may not be possible to calculate the mode. This happens if the data
has a bimodal distribution in which there are two possible values for the mode.
• Another disadvantage of the mode is that as compared with the mean, it is
affected more by fluctuations of sampling.
• For a moderately skewed distribution, we have the following empirical relationship
between the mean, median and the mode:
Mode = 3*Median - 2*Mean
Merits
Demerits
• Positively Skewed
• Negatively Skewed
Partition values
• The values which divide the data into a number of equal parts are called Partition
Values.
• Mainly we have three partition values
o Quartiles
o Deciles
o Percentiles
• It can be located graphically by ogives / cumulative frequency curve
Quartiles
A quartile is a statistical term that describes a division of observations into four
defined intervals based on the values of the data and how they compare to the entire
set of observations. The quartile measures the spread of values above and below the
median by dividing the distribution into four groups. A quartile divides data at three
points—a lower quartile (Q1), median (Q2), and upper quartile (Q3)—to form four groups of the
dataset. Quartiles are used to calculate the interquartile range, which is a measure of
variability around the median. Each quartile contains 25% of the total observations.
Generally, the data is arranged from smallest to largest; 25% of the observations lie below
Q1, 50% below Q2 (the median) and 75% below Q3.
Percentiles
A percentile is a term used in statistics to express how a score compares to other
scores in the same set. While there is technically no standard definition of percentile, it's
typically communicated as the percentage of values that fall below a particular value in a
set of data scores.
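A small sketch, assuming NumPy, showing how quartiles, a percentile and the interquartile range can be obtained; the data are made up:
```python
# Quartiles, percentiles and the interquartile range with NumPy.
import numpy as np

data = np.array([12, 15, 18, 20, 22, 25, 28, 30, 35, 40])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
p90 = np.percentile(data, 90)

print("Q1 =", q1, " Q2 (median) =", q2, " Q3 =", q3)
print("90th percentile =", p90)
print("Interquartile range =", q3 - q1)
```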
Measures of dispersion
Dispersion is the state of getting dispersed or spread. Statistical dispersion means
the extent to which numerical data are likely to vary about an average value. In other
words, dispersion helps to understand the distribution of the data. Measures of dispersion
may be absolute, expressed in the same units as the data (range, quartile deviation, mean
deviation, standard deviation), or relative, expressed as unit-free coefficients.
The relative measures of dispersion are used to compare the distribution of two or
more data sets. This measure compares values without units. Common relative
dispersion methods include:
1. Co-efficient of Range
2. Co-efficient of Variation
3. Co-efficient of Standard Deviation
4. Co-efficient of Quartile Deviation
5. Co-efficient of Mean Deviation
Range
A range is the most common and easily understandable measure of dispersion. It
is the difference between the two extreme observations of the data set. If Xmax and Xmin
are the two extreme observations, then
R = Xmax – Xmin
where Xmax = H is the highest and Xmin = L is the lowest observation.
Merits of Range
Demerits of Range
Coefficient of Range
It is defined as the relative measure of the distribution based on the range of any
given data set, which is the difference between the maximum and minimum value in the
given set. It is also known as range coefficient. In the case of grouped data, the range is
the difference between the upper boundary of the highest class and the lower boundary
of the lowest class. It is also calculated by using the difference between the mid points
of the highest class and the lowest class.
Coefficient of range = (Xmax – Xmin) / (Xmax + Xmin), i.e. (H – L) / (H + L)
Quartile Deviation
The quartiles divide a data set into quarters. The first quartile, (Q1) is the middle
number between the smallest number and the median of the data. The second quartile,
(Q2) is the median of the data set. The third quartile, (Q3) is the middle number between
the median and the largest number. The quartile deviation (semi-interquartile range) is
QD = (Q3 – Q1) / 2, and the corresponding relative measure, the coefficient of quartile
deviation, is (Q3 – Q1) / (Q3 + Q1).
If one set of data has a larger coefficient of quartile deviation than another set, then that
data set’s interquartile dispersion is greater.
Mean Deviation
Mean deviation is the arithmetic mean of the absolute deviations of the observations
from a measure of central tendency. If x1, x2, …, xn are the set of observations, then the
mean deviation of x about the average A (mean, median, or mode) is
MD(A) = (1/n) Σ |xi – A|
Standard Deviation
A standard deviation is the positive square root of the arithmetic mean of the
squares of the deviations of the given values from their arithmetic mean. It is denoted
by the Greek letter sigma, σ. It is also referred to as the root mean square deviation. The
standard deviation is given as
σ = √[ (1/n) Σ (xi – x̄)² ]
Coefficient of variation
The coefficient of variation (CV) is a relative measure of dispersion, defined as the ratio
of the standard deviation to the mean, usually expressed as a percentage:
CV = (σ / x̄) × 100
It is used to compare the variability (consistency) of two or more series even when they
are expressed in different units; the series with the larger CV is more variable.
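The absolute and relative measures listed above can be illustrated with one short sketch, assuming NumPy; the data are made up:
```python
# Common absolute and relative measures of dispersion for one data set (NumPy assumed).
import numpy as np

data = np.array([10, 12, 14, 18, 20, 22, 25, 30])

rng = data.max() - data.min()
q1, q3 = np.percentile(data, [25, 75])
qd = (q3 - q1) / 2                              # quartile deviation
md = np.mean(np.abs(data - np.mean(data)))      # mean deviation about the mean
sd = np.std(data)                               # population standard deviation

coeff_range = rng / (data.max() + data.min())
coeff_qd = (q3 - q1) / (q3 + q1)
cv = sd / np.mean(data) * 100                   # coefficient of variation (%)

print(f"Range={rng}, QD={qd}, MD={md:.3f}, SD={sd:.3f}")
print(f"Coeff. of range={coeff_range:.3f}, Coeff. of QD={coeff_qd:.3f}, CV={cv:.2f}%")
```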
Lorenz Curve
One of the five major and common macroeconomic goals of a government is the
equitable (fair) distribution of income. The Lorenz curve is a graphical device for
studying this: it plots the cumulative percentage of total income received against the
cumulative percentage of income recipients, and the farther the curve bows away from
the 45° line of perfect equality, the greater the inequality. The Gini Coefficient is the
ratio of the area between the line of equality and the Lorenz curve to the total area
under the line of equality.
The Gini Coefficient can vary from 0 (perfect equality) to 1 (perfect inequality).
A Gini Coefficient of zero means that everyone has the same income, while a Coefficient
of 1 represents a single individual receiving all the income.
Measure of Skewness
1. Positive Skewness
If the given distribution is shifted to the left and with its tail on the right side, it is
a positively skewed distribution. It is also called a right-skewed distribution.
2. Negative Skewness
If the given distribution is shifted to the right and with its tail on the left side, it is
a negatively skewed distribution. It is also called a left-skewed distribution.
Measures of skewness (for an asymmetrical distribution):
• Karl Pearson’s measure: Sk = Mean – Mode
• Bowley’s measure: Sk = Q3 + Q1 – 2*Median
Measure of Kurtosis
Kurtosis measures the degree of peakedness or flatness of a distribution relative to the
normal distribution.
Types of Kurtosis
The types of kurtosis are determined by the excess kurtosis of a particular distribution.
The excess kurtosis can take positive or negative values, as well as values close to zero.
1. Mesokurtic – excess kurtosis close to zero, i.e. a peak similar to the normal distribution
2. Leptokurtic – positive excess kurtosis, i.e. more peaked with heavier tails than the normal
3. Platykurtic – negative excess kurtosis, i.e. flatter with lighter tails than the normal
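A brief sketch, assuming SciPy, that computes skewness and excess kurtosis for a deliberately right-skewed sample and classifies its shape along these lines:
```python
# Skewness and excess kurtosis of a right-skewed sample (NumPy and SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)   # an obviously right-skewed sample

sk = stats.skew(sample)
ek = stats.kurtosis(sample)          # Fisher definition: excess kurtosis (normal -> 0)

shape = "leptokurtic" if ek > 0 else "platykurtic" if ek < 0 else "mesokurtic"
print(f"skewness = {sk:.3f}, excess kurtosis = {ek:.3f} ({shape})")
```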
➢ Correlation:
Correlation is used to test relationships between quantitative variables or categorical
variables. In other words, it’s a measure of how things are related. The study of how
variables are correlated is called correlation analysis. A Statistical technique that is used
to analyze the strength and direction of the relationship between two variables is
called correlation analysis.
Types of Correlation
o Positive Correlation – when the values of the two variables move in the same
direction so that an increase/decrease in the value of one variable is followed by an
increase/decrease in the value of the other variable.
o Negative Correlation – when the values of the two variables move in the opposite
direction so that an increase/decrease in the value of one variable is followed by
decrease/increase in the value of the other variable.
o Linear correlation - Correlation is said to be linear if the ratio of change between the
two variables is constant, i.e. the points on the scatter diagram tend to lie near a straight line.
o Non linear correlation - Correlation is said to be non linear if the ratio of change is
not constant. In other words, when all the points on the scatter diagram tend to lie
near a smooth curve, the correlation is said to be non linear (curvilinear)
o Simple correlation - When only two variables are studied it is a problem of simple
correlation
o Partial correlation - Two variables are chosen to study the correlation, while other
factors are assumed to be constant
o Multiple correlation - Correlation among three or more variables is studied
simultaneously.
Methods of Studying Correlation
1. Scatter diagram
The scatter diagram graphs pairs of numerical data, with one variable on each axis, to
look for a relationship between them. If the variables are correlated, the points will fall
along a line or curve. The better the correlation, the tighter the points will hug the line
Pearson’s correlation coefficient is the test statistics that measures the statistical
relationship, or association, between two continuous variables. It is known as the
best method of measuring the association between variables of interest because it is
based on the method of covariance. It gives information about the magnitude of the
association, or correlation, as well as the direction of the relationship.
Properties:
1. Limit: Coefficient values can range from +1 to -1, where +1 indicates a perfect
positive relationship, -1 indicates a perfect negative relationship, and a 0
indicates no relationship exists.
2. Pure number: It is independent of the unit of measurement. For example, if one
variable’s unit of measurement is in inches and the second variable is in quintals,
even then, Pearson’s correlation coefficient value does not change.
3. Symmetric: The correlation coefficient between two variables is symmetric.
This means that the correlation between X and Y is the same as the correlation
between Y and X, i.e. rXY = rYX.
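The unit-free and symmetric properties can be checked with a small sketch, assuming NumPy; the height and weight figures are made up:
```python
# Verify that Pearson's r is symmetric and unaffected by a change of units (NumPy assumed).
import numpy as np

height_inches = np.array([60, 62, 65, 68, 70, 72])
weight_kg     = np.array([52, 55, 60, 66, 70, 75])

def pearson_r(x, y):
    # covariance of x and y divided by the product of their standard deviations
    return np.cov(x, y, bias=True)[0, 1] / (np.std(x) * np.std(y))

r_xy = pearson_r(height_inches, weight_kg)
r_yx = pearson_r(weight_kg, height_inches)
r_cm = pearson_r(height_inches * 2.54, weight_kg)   # change of units: inches -> cm

print(f"r(X,Y) = {r_xy:.4f}, r(Y,X) = {r_yx:.4f}, r after unit change = {r_cm:.4f}")
```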
Degree of correlation:
➢ Regression
Regression analysis is a statistical technique for estimating the average relationship
between a dependent variable and one or more independent variables, so that the value
of the dependent variable can be predicted from given values of the independent
variable(s).
Types
o For instance, when we predict rent based on square feet alone that is
simple linear regression.
o When we predict rent based on square feet and age of the building that is
an example of multiple linear regression.
The regression equation of Y on X is generally written as Y = a + bX. It has two parts:
1. The constant (a): the intercept, i.e. the expected value of Y when X = 0
2. The regression coefficient (b): explained below
Coefficient of Regression
The Regression Coefficient is the constant ‘b’ in the regression equation that tells
about the change in the value of dependent variable corresponding to the unit change
in the independent variable.
If there are two regression equations, then there will be two regression
coefficients: byx (the regression coefficient of Y on X) and bxy (the regression
coefficient of X on Y).
Regression Coefficient of Y on X: The symbol byx is used and measures the change in Y
corresponding to a unit change in X. Symbolically, byx = r(σy / σx).
In case the deviations are taken from the actual means (x = X – X̄, y = Y – Ȳ), the
following formula is used:
byx = Σxy / Σx²
The byx can be calculated by using the following formula when the deviations are taken
from the assumed means (dx = X – A, dy = Y – B):
byx = [NΣdxdy – (Σdx)(Σdy)] / [NΣdx² – (Σdx)²]
The regression coefficient of X on Y, bxy, is obtained in the same way with the roles of
X and Y interchanged (bxy = r(σx / σy) = Σxy / Σy²).
Theorem of Regression Coefficients
The correlation coefficient is the geometric mean of the two regression coefficients,
i.e. r = ±√(byx × bxy). Both regression coefficients carry the same sign as r, and their
product can never exceed 1.
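A short sketch, assuming NumPy and made-up data, computing byx and bxy from deviations about the actual means and checking the theorem r = √(byx × bxy):
```python
# Regression coefficients from deviations about the actual means (NumPy assumed).
import numpy as np

X = np.array([2, 4, 6, 8, 10], dtype=float)
Y = np.array([5, 7, 8, 11, 14], dtype=float)

x = X - X.mean()                         # deviations from the actual means
y = Y - Y.mean()

byx = (x * y).sum() / (x ** 2).sum()     # regression coefficient of Y on X
bxy = (x * y).sum() / (y ** 2).sum()     # regression coefficient of X on Y
r = np.corrcoef(X, Y)[0, 1]

print(f"byx = {byx:.4f}, bxy = {bxy:.4f}")
print(f"sqrt(byx*bxy) = {np.sqrt(byx * bxy):.4f}, r = {r:.4f}")
```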
➢ Index numbers:
Meaning of Index Number
• “Index number is a single ratio (or a percentage) which measures the combined
change of several variables between two different times, places or situations”
Index number expresses the relative change in price, quantity, or value compared to
a base period. An index number is used to measure changes in prices paid for raw
materials; numbers of employees and customers, annual income and profits, etc
Terminologies
Base Year
The year selected as the basis for comparison, i.e. the year with which comparisons
are made. Its index number is usually taken as 100.
Current Year
The year for which comparisons are sought, i.e. the year under study.
Notations
• P01= Price index number for the current year with respect to the base year.
• P10= Price index number for the base year with respect to the current year.
• Q01= Quantity index number for the current year with respect to the base year.
• Q10= Quantity index number for the base year with respect to the current year.
• V01= Value index number for the current year with respect to the base year
Methods of constructing index numbers are broadly of two kinds:
1. Unweighted Indexes
2. Weighted Indexes
Simple Average of Price Relatives Method
In this method, we find out the price relative of individual items and average out the
individual values. Price relative refers to the percentage ratio of the value of a variable in
the current year to its value in the year chosen as the base.
Simple Aggregative Method
It calculates the percentage ratio between the aggregate of the prices of all
commodities in the current year and aggregate prices of all commodities in the base
year.
P01 = (∑P1 ÷ ∑P0) × 100
Here, ∑P1 = summation of the prices of all commodities in the current year and
∑P0 = summation of the prices of all commodities in the base year.
Weighted Aggregative Method
Here different goods are assigned weights according to the quantity bought. There are
three well-known sub-methods based on the different views of economists as
mentioned below:
1. Laspeyre’s Method
Laspeyre was of the view that base year quantities must be chosen as weights. Therefore
the formula is :
P= (∑P1Q0 ÷ ∑P0Q0)×100
Here, ∑P1Q0 = summation of prices of the current year multiplied by quantities of the base
year taken as weights, and ∑P0Q0 = summation of prices of the base year multiplied by
quantities of the base year taken as weights.
2. Paasche’s Method
Unlike the above mentioned, Paasche believed that the quantities of the current year
must be taken as weights. Hence the formula:
P= (∑P1Q1÷∑P0Q1) ×100
Here, ∑P1Q1 = summation of prices of the current year multiplied by quantities of the
current year taken as weights, and ∑P0Q1 = summation of prices of the base year multiplied
by quantities of the current year taken as weights.
3. Fisher’s Method
Fisher combined the best of both above-mentioned formulas which resulted in an ideal
method. This method uses both current and base year quantities as weights, as follows:
P = √[ (∑P1Q0 ÷ ∑P0Q0) × (∑P1Q1 ÷ ∑P0Q1) ] × 100
i.e. Fisher’s index is the geometric mean of Laspeyre’s and Paasche’s indexes.
NOTE: The index number of the base year is generally assumed to be 100 if not given
Fisher’s Method is an Ideal Measure
As noted, Fisher’s method uses views of both Laspeyres and Paasche. Hence it
takes into account the prices and quantities of both years. Moreover, it is based on the
concept of the geometric mean, which is considered as the best mean method.
However, the most important evidence for the above affirmation is that it satisfies
both time reversal and factor reversal tests. Time reversal test checks that when we
reverse the current year to base year and vice-versa, the product of indexes should be
equal to unity. This confirms the working of a formula in both directions. Also, factor
reversal test implies that interchanging the prices and quantities does not give varying
results. This proves the consistency of the formula.
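The three weighted index formulas and the time reversal property just described can be verified with a small sketch in plain Python; the prices and quantities are made up:
```python
# Laspeyre's, Paasche's and Fisher's price indexes, plus the time reversal check.
p0 = [10, 8, 5];  q0 = [30, 15, 20]      # base year prices and quantities
p1 = [12, 9, 6];  q1 = [25, 18, 22]      # current year prices and quantities

def agg(prices, weights):
    return sum(p * w for p, w in zip(prices, weights))

laspeyres = agg(p1, q0) / agg(p0, q0) * 100
paasche   = agg(p1, q1) / agg(p0, q1) * 100
fisher_01 = (laspeyres * paasche) ** 0.5

# Reverse time: treat the current year as the base and vice versa.
fisher_10 = ((agg(p0, q1) / agg(p1, q1)) * (agg(p0, q0) / agg(p1, q0))) ** 0.5 * 100

print(f"Laspeyres = {laspeyres:.2f}, Paasche = {paasche:.2f}, Fisher = {fisher_01:.2f}")
print("Time reversal: P01 x P10 =", round(fisher_01 / 100 * fisher_10 / 100, 6))
```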
There are certain tests which are put to verify the consistency, or adequacy of an
index number formula from different points of view. The most popular among these are
the following tests:
• Order reversal test.
• Time reversal test.
• Factor reversal test.
• Circular test.
• Unit test.
At the outset, it should be noted that it is neither possible nor necessary for an
index-number formula to satisfy all the tests mentioned above. But, an ideal formula
should be such that it satisfies the maximum possible tests which are relevant to the
matter under study. However, the various tests cited above are explained here as under:
1. Order reversal test
This test requires that a formula of Index number should be such that the value of
the index number remains the same, even if, the order of arrangement of the
items is reversed, or altered. As a matter of fact, this test is satisfied by all the
twelve methods of index number explained above.
2. Time reversal test
This test has been put forth by Prof. Irving Fisher, who proposes that a formula of
index number should be such that it turns the value of the index number to its
reciprocal when the time subscripts of the formula are reversed i.e. 0 is made
1, and 1 is made 0. According to this proposition, if the index number of the
current period on the basis of the base period, i.e. P01, is 200, the index number of the
base period on the basis of the current period, i.e. P10, would be 50. Thus, when the
value of P01 is 2 times the base year level, the value of P10 is half of it. As such, an
index number formula, in order to satisfy this test, must prove the following equation:
P01 x P10 = 1
3. Factor reversal test
This test has also been put forth by Prof. Irving Fisher, who proposes that a
formula of index number should be such that it permits the interchange of the
price and the quantity factors without giving inconsistent results, i.e. the two
results multiplied together should give the true value ratio, inasmuch as the product
of price and quantity is the value of a thing. Thus, for the Factor Reversal test, a
formula of index number should satisfy the following equation:
P01 x Q01 = V01 = ∑P1Q1 ÷ ∑P0Q0
Most of the formulae of index number discussed above fail to satisfy this acid test
of consistency except that of Prof. Irving Fisher. This is the reason for which Prof.
Fisher claims his formula to be an ideal one.
4. Circular test
This test has been put forth by Westergaard and recommended by C.M. Walsch as an
extension of the time reversal test put forth by Prof. Fisher. This test requires
that an index number formula should be such that it works in a circular fashion.
This means that if an index is computed for period 1 on the base period 0, another
index is computed for period 2 on the base period 1, and a third index is computed
for period 0 on the base period 2, then the product of all these indices should be
equal to 1. Thus, a formula to satisfy the test should comply with the following equation:
P01 x P12 x P20 = 1
An index formula which satisfies this test enjoys the advantage of reducing the
computation work every time a change in the base year is made. However, this test is
not satisfied by most of the important index formulae, viz. Fisher’s, Laspeyre’s,
Paasche’s, Marshall-Edgeworth’s, Drobish and Bowley’s, etc.
5. Unit test
This is a common test which requires that an index number formula should be
such that it does not affect the value of the index number, even if, the units of the
price quotations are altered viz. price per kg, converted into price per quintal or
vice versa. This test is satisfied by all the index formula except the simple
aggregative method under which the value of the index number changes
radically, if the units of price quotations of any of the items included in the index
number are changed.
Base shifting
Base shifting refers to changing the base year of a series of index numbers to some other
(usually more recent) year. The shifted index for any year = (old index number of that
year ÷ old index number of the new base year) × 100.
➢ Theory of probability
Probability is the measure of the likelihood that an event will occur in a
Random Experiment. Probability is quantified as a number between 0 and 1, where,
loosely speaking, 0 indicates impossibility and 1 indicates certainty. The higher the
probability of an event, the more likely it is that the event will occur.
Example: A simple example is the tossing of a fair (unbiased) coin. Since the
coin is fair, the two outcomes (“heads” and “tails”) are both equally probable; the
probability of “heads” equals the probability of “tails”; and since no other outcomes are
possible, the probability of either “heads” or “tails” is 1/2 (which could also be written as
0.5 or 50%).
Terminologies
• Experiment
• Experimenter
• Outcome
• Random experiment: an experiment whose outcome cannot be predicted with
certainty in advance
• Conditions of a random experiment:
1. All possible outcomes are known in advance
2. The outcome of any particular trial is not known in advance
3. Experiment is repeatable under identical conditions
• Sample space: the set of all possible outcomes of a random experiment
• Denoted by “s”
• Eg:-
1. Tossing a coin
2. s = {H, T}
3. Throwing a die
4. s = {1,2,3,4,5,6}
• Sample point: each individual outcome (element) of the sample space
• Event
• Any subset of a sample space
• Eg:-throwing a die
s = {1, 2, 3, 4, 5, 6}
• A = odd number
a = {1, 3, 5}
• B = even numbers
b = {2, 4, 6}
• Impossible event: an event containing no sample point (the empty set); its probability is 0
• Sure event: the event consisting of the entire sample space S; its probability is 1
• Simple event
• If there be only one element of the sample space in the set representing an
event, then this event is called a simple or elementary event.
• For example; if we throw a die, then the sample space, S = {1, 2, 3, 4, 5, 6}.
Now the event of 2 appearing on the die is simple and is given by E = {2}.
• Compound Event
• If an event has more than one sample point, it is termed as a compound
event. The compound events are a little more complex than simple events.
These events involve the probability of more than one event occurring
together. The total probability of all the outcomes of a compound event is
equal to 1
• Independent events
• In probability, we say two events are independent if knowing one event
occurred doesn't change the probability of the other event. So the result of
a coin flip and the day being Tuesday are independent events; knowing it
was a Tuesday didn't change the probability of getting "heads."
• Complementary events
• Two events are said to be complementary when one event occurs if and
only if the other does not. The probabilities of two complementary events
add up to 1.
• For example, rolling a 5 or greater and rolling a 4 or less on a die are
complementary events, because a roll is 5 or greater if and only if it is
not 4 or less. The probability of rolling a 5 or greater is 2/6 = 1/3, and the
probability of rolling a 4 or less is 4/6 = 2/3. Thus, the total of their
probabilities is 1/3 + 2/3 = 1.
• Union of events
The union of events A and B, denoted A∪B, is the collection of all outcomes that
are elements of A or of B or of both. It corresponds to combining descriptions of
the two events using the word “or.”
• Intersection of events
The intersection of events A and B, denoted A∩B, is the collection of all outcomes
that are elements of both of the sets A and B. It corresponds to combining
descriptions of the two events using the word “and.”
• Difference of events
• Denoted by A – B; it is the collection of outcomes that belong to A but not to B.
• Mutually exclusive and exhaustive events
• When a sample space S is divided into many mutually exclusive events such
that their union forms the entire sample space, these events are said to be
mutually exhaustive events.
• The probability that an exhaustive event will occur is always 1.
• The intersection of mutually exclusive exhaustive events is always empty.
Approaches to Probability
• There are 5 approaches to Probability.
1. Empirical Approaches
2. Classical Approaches
3. Axiomatic Approach
5. Subjective Approach
Empirical approach
In the empirical (relative frequency) approach, the probability of an event is taken
as the proportion of times the event occurs in a long series of repeated trials.
Classical approach
In the classical approach all outcomes are assumed to be equally likely, e.g. selecting
bingo balls, where each numbered ball has an equal chance of being chosen. The
probability of a simple event happening is the number of ways the event can
happen, divided by the number of possible outcomes.
Axiomatic approach
Axiomatic Probability is just another way of describing the probability of an
event. As, the word itself says, in this approach, some axioms are predefined before
assigning probabilities. This is done to quantize the event and hence to ease the
calculation of occurrence or non-occurrence of the event.
Axiom 1: The probability of any event X is a non-negative real number, i.e.
0 ≤ P(X)
Axiom 2: We know that the sample space S of the experiment is the set of all the
outcomes. This means that the probability of any one outcome happening is 100
percent i.e P(S) = 1. Intuitively this means that whenever this experiment is performed,
the probability of getting some outcome is 100 percent.
P(S) = 1
Axiom 3: For two events A and B of the experiment, if A and B are
mutually exclusive, then
P(A ∪ B) = P(A) + P(B)
For events that are not mutually exclusive, the general addition rule is
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
Where:
• P(A ∩ B) – the joint probability of events A and B; the probability that both
events A and B occur
Two events are independent if the probability of the outcome of one event does
not influence the probability of the outcome of another event. Due to this reason, the
conditional probability of two independent events A and B is:
P(A|B) = P(A)
P(B|A) = P(B)
In probability theory, mutually exclusive events are events that cannot occur
simultaneously. In other words, if one event has already occurred, another can event
cannot occur. Thus, the conditional probability of mutually exclusive events is always
zero.
P(A|B) = 0
P(B|A) = 0
• If A and B are two events with positive probabilities ( P(A) ≠ 0, P(B) ≠ 0),
then A and B are independent if and only if
P(A ∩ B) = P(A) × P(B)
2. Permutation
• Often, we want to count all of the possible ways that a single set of objects
can be arranged. For example, consider the letters X, Y, and Z. These letters
can be arranged a number of different ways (XYZ, XZY, YXZ, etc.) Each of
these arrangements is a permutation.
• Sometimes, we want to count all of the possible ways that a single set of
objects can be selected - without regard to the order in which they are
selected. Each such selection is called a combination.
• In general, n objects can be arranged in n(n - 1)(n - 2) ... (3)(2)(1) ways. This
product is represented by the symbol n!, which is called n factorial. (By
convention, 0! = 1.)
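A minimal sketch, assuming Python 3.8+ (for math.perm and math.comb), illustrating factorials, permutations and combinations with the letters X, Y and Z:
```python
# Factorials, permutations and combinations with the standard library (Python 3.8+).
import math
from itertools import permutations, combinations

letters = ["X", "Y", "Z"]

print("3! =", math.factorial(3))                              # 6 arrangements
print("permutations:", ["".join(p) for p in permutations(letters)])
print("combinations of 2:", ["".join(c) for c in combinations(letters, 2)])
print("nPr(5,2) =", math.perm(5, 2), " nCr(5,2) =", math.comb(5, 2))
```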
Theoretical Probability Distributions
• Binomial Distribution
• Poisson distribution
• Normal distribution
Binomial Distribution:
The prefix ‘Bi’ means two or twice. A binomial distribution can be understood as
the probability of a trail with two and only two outcomes. It is a type of distribution that
has two different outcomes namely, ‘successes and ‘failure’. Also, it is applicable to
discrete random variables only.
• Every trial is independent. None of your trials should affect the possibility of the
next trial.
• The probability always stays the same and equal. The probability of success may
be equal for more than one trial.
• Symbolically,
P(X = x) = nCx p^x q^(n-x),   x = 0, 1, 2, …, n
Where,
x= number of success
n-x=number of failure
p=probability of success
q=probability of failure
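A short sketch, assuming SciPy, evaluating the binomial formula for a made-up example (exactly 3 heads in 5 tosses of a fair coin) both directly and via scipy.stats:
```python
# Binomial probability P(X = x) = nCx * p**x * q**(n-x), checked against SciPy.
from math import comb
from scipy import stats

n, p, x = 5, 0.5, 3
q = 1 - p

manual = comb(n, x) * p**x * q**(n - x)
library = stats.binom.pmf(x, n, p)

print(f"P(X = 3) by formula = {manual:.4f}, via scipy = {library:.4f}")
print("mean = n*p =", n * p, ", variance = n*p*q =", n * p * q)
```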
Poisson Distribution:
The Poisson distribution gives the probability of a given number of occurrences of an
event in a fixed interval of time or space, when the occurrences happen at a constant
average rate and independently of one another. It can be derived as a limiting case of
the binomial distribution: divide the time into n small intervals, such that n → ∞, and let
p denote the probability of success in each interval; as the intervals become infinitely
small, p → 0, while n × p = λ remains a finite constant. The probability of x occurrences is
P(X = x) = e^(-λ) λ^x / x!,   x = 0, 1, 2, …
Its assumptions include:
• The probability of more than one success in a very small interval of time is negligible.
Normal Distribution :
The Normal Distribution defines a probability density function f(x) for the
continuous random variable X considered in the system. The random variables which
follow the normal distribution are ones whose values can assume any known value in a
given range.
The central limit theorem states that the distribution of the sum (or mean) of a
sufficiently large number of independent random variables tends to the normal
distribution. For instance, the binomial distribution tends to the normal distribution
with mean np and variance npq as n becomes large. Properties of the normal distribution:
• The mean, median and mode are the same and lie in the middle of the distribution
• Its standard deviation measures the distance on the distribution from the mean
to the inflection point (the place where the curve changes from an “upside-down-
bowl” shape to a “right-side-up-bowl” shape).
• Because of its unique bell shape, probabilities for the normal distribution follow
the Empirical Rule, which says the following:
• About 68 percent of its values lie within one standard deviation of the mean. To
find this range, take the value of the standard deviation, then find the mean plus
this amount, and the mean minus this amount.
• About 95 percent of its values lie within two standard deviations of the mean.
• Almost all of its values lie within three standard deviations of the mean.
The standard normal curve is a special case of the normal distribution, and thus as
well a probability distribution curve. Therefore basic properties of the normal
distribution hold true for the standard normal curve as well
• The total area under the standard normal curve is 1 (this property is shared by all
density curves).
• The standard normal curve is bell shaped and is centered at z=0. Almost all the
area under the standard normal curve lies between z=−3 and z=3.
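The Empirical Rule figures quoted above can be checked against the standard normal curve with a small sketch, assuming SciPy:
```python
# Areas under the standard normal curve within 1, 2 and 3 standard deviations (SciPy assumed).
from scipy import stats

z = stats.norm(loc=0, scale=1)

for k in (1, 2, 3):
    area = z.cdf(k) - z.cdf(-k)          # P(-k < Z < k)
    print(f"within {k} standard deviation(s): {area:.4f}")

print("total area under the curve:", z.cdf(float("inf")) - z.cdf(float("-inf")))
```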
➢ Sampling theory
• Population: Complete collection of data under consideration in a statistical study
• Sample: A part (subset) of the population selected for study
Sampling Methods
• Different types of probability sampling:
1. Simple random sampling
2. Systematic sampling
3. Stratified sampling
4. Cluster sampling
5. Multistage sampling
1. Simple random sampling
In a simple random sample, every member of the population has an equal chance of
being selected. To conduct this type of sampling, you can use tools like random number
generators or other techniques that are based entirely on chance.
Example
You want to select a simple random sample of 100 employees of a company. You assign
a number from 1 to 1000 to every employee in the company database, and use a random
number generator to select 100 numbers.
2. Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly easier
to conduct. Every member of the population is listed with a number, but instead of
randomly generating numbers, individuals are chosen at regular intervals.
Example
All employees of the company are listed in alphabetical order. From the first 10
numbers, you randomly select a starting point: number 6. From number 6
onwards, every 10th person on the list is selected (6, 16, 26, 36, and so on), and
you end up with a sample of 100 people.
If you use this technique, it is important to make sure that there is no hidden
pattern in the list that might skew the sample. For example, if the HR database
groups employees by team, and team members are listed in order of seniority,
there is a risk that your interval might skip over people in junior roles, resulting in
a sample that is skewed towards senior employees.
3. Stratified sampling
Stratified sampling involves dividing the population into subpopulations that may
differ in important ways. It allows you to draw more precise conclusions by ensuring that
every subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called
strata) based on the relevant characteristic (e.g. gender, age range, income bracket, job
role).
Based on the overall proportions of the population, you calculate how many
people should be sampled from each subgroup. Then you use random or systematic
sampling to select a sample from each subgroup.
Example
The company has 800 female employees and 200 male employees. You want to
ensure that the sample reflects the gender balance of the company, so you sort
the population into two strata based on gender. Then you use random sampling
on each group, selecting 80 women and 20 men, which gives you a
representative sample of 100 people.
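A hedged sketch of proportionate stratified sampling for the 800/200 example above, using only the Python standard library; the employee identifiers are made up:
```python
# Proportionate stratified sampling: 80 women and 20 men from a staff of 1000.
import random

random.seed(42)
population = [("F", i) for i in range(800)] + [("M", i) for i in range(200)]
sample_size = 100

strata = {}
for gender, emp_id in population:
    strata.setdefault(gender, []).append(emp_id)

sample = []
for gender, members in strata.items():
    # allocate sample slots in proportion to the stratum's share of the population
    n_stratum = round(sample_size * len(members) / len(population))
    sample += [(gender, i) for i in random.sample(members, n_stratum)]

print("women sampled:", sum(1 for g, _ in sample if g == "F"))
print("men sampled  :", sum(1 for g, _ in sample if g == "M"))
```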
4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each
subgroup should have similar characteristics to the whole sample. Instead of sampling
individuals from each subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled
cluster. If the clusters themselves are large, you can also sample individuals from within
each cluster using one of the techniques above. This is called multistage sampling.
This method is good for dealing with large and dispersed populations, but there
is more risk of error in the sample, as there could be substantial differences between
clusters. It’s difficult to guarantee that the sampled clusters are really representative of
the whole population.
Example
The company has offices in 10 cities across the country (all with roughly the same
number of employees in similar roles). You don’t have the capacity to travel to
every office to collect your data, so you use random sampling to select 3 offices –
these are your clusters.
5. Multistage sampling
In multistage sampling the sample is drawn in stages, for example first selecting
clusters and then sampling individuals within the selected clusters, as described above.
Non-Probability Sampling Methods
This type of sample is easier and cheaper to access, but it has a higher risk of
sampling bias. That means the inferences you can make about the population are
weaker than with probability samples, and your conclusions may be more limited. If you
use a non-probability sample, you should still aim to make it as representative of the
population as possible.
• Types
1. Convenience Sampling
2. Judgement Sampling
3. Quota Sampling
4. Purposive Sampling
5. Snowball Sampling
1. Convenience sampling
A convenience sample simply includes the individuals who happen to be most
accessible to the researcher.
Example
You are researching opinions about student support services in your university,
so after each of your classes, you ask your fellow students to complete a survey
on the topic. This is a convenient way to gather data, but as you only surveyed
students taking the same classes as you at the same level, the sample is not
representative of all the students at your university.
2. Purposive sampling
In purposive sampling, the researcher deliberately selects the units that are most
relevant to the purpose of the study.
Example
You want to know more about the opinions and experiences of disabled students
at your university, so you purposefully select a number of students with different
support needs in order to gather a varied range of data on their experiences with
student services.
3. Snowball sampling
Snowball sampling is used when the population is difficult to access: existing
respondents recruit further respondents from among their acquaintances, so the sample
grows like a rolling snowball.
Example
4. Judgmental Sampling
In judgmental sampling, units are chosen purely on the basis of the researcher’s
judgement about who can provide the required information.
5. Quota sampling
In quota sampling, the researcher fixes quotas for the subgroups of interest and then
fills each quota non-randomly.
For example, a cigarette company wants to find out what age group prefers what
brand of cigarettes in a particular city. The researcher applies quotas on the age groups of 21-
30, 31-40, 41-50, and 51+. From this information, the researcher gauges the smoking
trend among the population of the city.
Sampling Error vs Non-Sampling Error
Basis for comparison: Sample size
• Sampling error: the possibility of error is reduced with an increase in sample size.
• Non-sampling error: it has nothing to do with the sample size.
Estimation
Estimation, in statistics, any of numerous procedures used to calculate the value of
some property of a population from observations of a sample drawn from the
population. A point estimate, for example, is the single number most likely to express
the value of the property. An interval estimate defines a range within which the value of
the property can be expected (with a specified degree of confidence) to fall.
1. Estimation – the process of obtaining an approximate value of a population
parameter from sample data
2. Estimator – the sample statistic (rule or formula) used to estimate the parameter
3. Estimate – the particular numerical value of the estimator obtained from a given sample
Confidence level
The confidence level is the percentage of all possible samples whose confidence
intervals can be expected to include the true population parameter (e.g. 95%).
Suppose, for example, a survey estimates a figure of 38% with a margin of error of
+/- 3 at the 95% confidence level. The width of the confidence interval tells us how
certain (or uncertain) we are about the true figure in the population. This width is
stated as a plus or minus (in this case, +/- 3) and is called the confidence interval.
When the interval and confidence level are put together, you get a spread of
percentages. In this case, you would expect the results to be 35 (38 - 3) to 41 (38 + 3)
percent, 95% of the time.
Confidence Coefficient (1-α)
The confidence coefficient is the confidence level stated as a proportion, rather
than as a percentage. For example, if you had a confidence level of 99%, the confidence
coefficient would be .99.
In general, the higher the coefficient, the more certain you are that your results
are accurate. For example, a .99 coefficient is more accurate than a coefficient of .89. It’s
extremely rare to see a coefficient of 1 (meaning that you are positive without a doubt
that your results are completely, 100% accurate). A coefficient of zero means that you
have no faith that your results are accurate at all.
Confidence coefficient   Confidence level
0.90                     90 %
0.95                     95 %
0.99                     99 %
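A small sketch, assuming SciPy, of a 95% confidence interval for a population mean from a made-up sample, using the t distribution since the population standard deviation is unknown:
```python
# 95% confidence interval for a population mean (NumPy and SciPy assumed).
import numpy as np
from scipy import stats

sample = np.array([52, 48, 50, 55, 53, 49, 51, 54, 50, 52])
mean = sample.mean()
sem = stats.sem(sample)                    # standard error of the mean

low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"sample mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```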
Level of Significance
The level of significance, denoted α, is the probability of rejecting the null hypothesis
when it is in fact true (the risk of a Type I error) that we are prepared to accept,
commonly 0.05 or 0.01. It is related to the confidence coefficient by α = 1 – (confidence coefficient).
Hypothesis testing
The null hypothesis is always the accepted fact. The assumption of a statistical test is
called the null hypothesis, or hypothesis 0 (H0 for short). It is often called the default
assumption, or the assumption that nothing has changed.
• Hypothesis 0 (H0): Assumption of the test holds and is failed to be rejected at some
level of significance.
• Hypothesis 1 (H1): Assumption of the test does not hold and is rejected at some
level of significance.
Before we can reject or fail to reject the null hypothesis, we must interpret the result of
the test.
➢ Power of a test (1 – β)
Statistical power or the power of a hypothesis test is the probability that the test
correctly rejects the null hypothesis. That is, the probability of a true positive result. It is
only useful when the null hypothesis is rejected. The higher the statistical power for a
given experiment, the lower the probability of making a Type II (false negative) error.
That is, the higher the probability of detecting an effect when there is an effect. In fact,
the power is precisely one minus the probability of a Type II error: Power = 1 – β.
• Low Statistical Power: Large risk of committing Type II errors, e.g. a false negative.
• High Statistical Power: Small risk of committing Type II errors.
Tests
A test statistic is used in a hypothesis test when you are deciding to support or
reject the null hypothesis. The test statistic takes your data from an experiment or survey
and compares your results to the results you would expect from the null hypothesis.
When you run a hypothesis test, you’ll use a distribution like a t-distribution or
normal distribution. These have a known area, and enable you to calculate a
probability value (p-value) that will tell you if your results are due to chance, or if your
results are due to your theory being correct. The larger the test statistic, the smaller the
p-value and the more likely you are to reject the null hypothesis.
➢ Parametric Tests
The basic principle behind the parametric tests is that we have a fixed set of
parameters that are used to determine a probabilistic model that may be used in
Machine Learning as well.
Parametric tests are those tests for which we have prior knowledge of the
population distribution (i.e, normal), or if not then we can easily approximate it to a
normal distribution which is possible with the help of the Central Limit Theorem.
The parameters typically involved are:
• Mean
• Standard Deviation
Types
• T-test
• Z-test
• F-test
• ANOVA
Non-parametric Tests
In Non-Parametric tests, we don’t make any assumption about the parameters for
the given population or the population we are studying.
Hence, there is no fixed set of parameters available, and no distribution
(normal distribution, etc.) of any kind is assumed. This is also the reason that
non-parametric tests are also referred to as distribution-free tests.
Types
• Chi-square
• Mann-Whitney U-test
• Kruskal-Wallis H-test
➢ T-Test
1. It is a parametric test of hypothesis testing based on Student’s T distribution.
2. It essentially tests the significance of the difference of the mean values when the
sample size is small (i.e. less than 30) and when the population standard deviation is
not available.
One Sample T-test: To compare a sample mean with that of the population mean.
t = (x̄ – μ) / (s / √n),   with n – 1 degrees of freedom,
where x̄ = sample mean, μ = hypothesised population mean, s = sample standard
deviation and n = sample size.
Conclusion:
• If the value of the test statistic is greater than the table value -> Rejects the null
hypothesis.
• If the value of the test statistic is less than the table value -> Do not reject the
null hypothesis.
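A minimal sketch of a one-sample t-test, assuming SciPy; the sample values and the hypothesised mean of 50 are made up:
```python
# One-sample t-test: does the sample mean differ from an assumed population mean of 50?
import numpy as np
from scipy import stats

sample = np.array([52, 48, 55, 53, 57, 51, 49, 56, 54, 52])
mu0 = 50

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the sample mean differs from 50.")
else:
    print("Do not reject the null hypothesis.")
```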
➢ Z-Test
1. It is a parametric test of hypothesis testing.
2. It is used to determine whether the means are different when the population variance
is known and the sample size is large (i.e, greater than 30).
One Sample Z-test: To compare a sample mean with that of the population mean.
z = (x̄ – μ) / (σ / √n)
where x̄ = sample mean, μ = population mean, σ = population standard deviation and
n = sample size.
➢ F-Test
1. It is a parametric test of hypothesis testing based on Snedecor F-distribution.
2. It is a test for the null hypothesis that two normal populations have the same
variance.
5. By changing the variances in the ratio, the F-test has become a very flexible test. It
can then be used, for example, to test the equality of two population variances and,
through ANOVA, the equality of several population means.
➢ ANOVA
1. Also called Analysis of Variance, it is a parametric test of hypothesis testing.
3. It is used to test the significance of the differences in the mean values among
more than two sample groups.
4. It uses F-test to statistically test the equality of means and the relative variance
between them.
➢ Chi-Square Test
3. It helps in assessing the goodness of fit between a set of observed and those
expected theoretically.
6. If there is no difference between the expected and observed frequencies, then the
value of chi-square is equal to zero.
7. It is also known as the “Goodness of fit test” which determines whether a particular
distribution fits the observed data or not.
Conditions for applying the test include:
• None of the groups should contain very few items, say fewer than 10.
• The overall number of items should be reasonably large; normally, it should be at
least 50, however small the number of groups may be.
11. Chi-square as a parametric test is used as a test for population variance based on
sample variance.
12. If we take each one of a collection of sample variances, divide them by the known
population variance and multiply these quotients by (n-1), where n is the number
of items in the sample, we get the values of chi-square, i.e. χ² = (n – 1)s² / σ².
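A short sketch, assuming SciPy, of a chi-square goodness-of-fit test: whether 60 made-up die rolls are consistent with a fair die:
```python
# Chi-square goodness-of-fit test for a fair die (NumPy and SciPy assumed).
import numpy as np
from scipy import stats

observed = np.array([8, 9, 12, 11, 6, 14])
expected = np.full(6, observed.sum() / 6)    # fair die: 10 expected in each category

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")   # chi-square is 0 only if observed == expected
```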