
INTRODUCTION TO STATISTICS
Syllabus
• Descriptive Statistics – Measures Of Central Tendency & Dispersions
• Correlation, Index Numbers
• Probability Theory: Concepts Of Probability, Distributions, Moments
• Central Limit Theorem
• Sampling Methods & Sampling Distribution
• Statistical Inferences, Hypothesis Testing

Introduction
Statistics plays a vital role in virtually every domain. It provides methods for collecting
data in any field and for analysing that data using statistical techniques, and its
importance and range of application continue to grow. A few examples illustrate this.

The government uses statistics for planning in the economic sector. A businessman

looks to expand his business by taking into account data and customer feedback.
Similarly, politicians use statistics to showcase their accomplishments, and scholars use
statistics to support their research papers. The list of applications of statistics is
practically endless.

Origin and development


There are several views on the origin of the word ‘statistics’. One view is that it has a
Latin origin, coming from the word ‘status’. Another view points to an Italian origin in
‘statista’. Some scholars trace it to the German ‘statistik’, and still others to the French
‘statistique’.

• It is one of the oldest branches of science dealing with numbers and is as old as
human society. It was earlier regarded as the “Science of Statecraft”, giving more
importance to administrative activities, but its scope was limited to

• Age and sex-wise population of the country

• Property and wealth of the country

• In the 16th century it was more concentrated on Astronomy


• Vital statistics originated in the 17th century. Captain John
Graunt is known as the Father of Vital Statistics. Vital statistics refers to the
accumulated data gathered on live births, deaths, migration, fetal deaths, marriages
and divorces.
• Theory of probability is referred to as the backbone of Modern statistics
• Statistics reached its peak by the contributions of

a. Francis Galton (1822-1911) – Regression Analysis

b. Karl Pearson (1857-1936) – Correlation Analysis, Chi-square test

c. W.S. Gosset (1876-1937) – t-test

d. Sir Ronald A Fisher (1890-1962) – Father Of Statistics

i. Estimation Theory
ii. Sampling Distribution
iii. Analysis Of Variance
e. P.C. Mahalanobis – Father Of Indian Statistics

Meaning of statistics
The word ‘statistics’ is used in two senses: plural and singular.

• In the plural sense, statistics refers to the data themselves – quantitative as well as
  qualitative facts collected with statistical analysis in mind.
• In the singular sense, statistics is a scientific method for collecting, presenting and
  analysing data, which brings out the major characteristics of that data.

Statistics is thus concerned with the scientific method for collecting, organizing, summarizing,

presenting, analysing and interpreting data. The word statistics normally refers
either to numerical facts or to these methods.

Definition of statistics
• A. L. Bowley - “Statistics is the science of measurement of social organism regarded
as a whole in all its manifestations”.

• According to Seligman, “Statistics is the science which deals with the methods of
collecting, classifying, presenting, comparing and interpreting numerical data
collected to throw some light on any sphere of enquiry”.
• Croxton and Cowden defined statistics as “the collection, presentation, analysis and
interpretation of numerical data”.

Classification of statistics
On the basis of statistical methods used, we have two types.

a) Descriptive statistics:
Descriptive statistics are brief descriptive coefficients that summarize a given data
set, which can be either a representation of the entire population or a sample of a
population. Descriptive statistics are broken down into measures of central
tendency and measures of variability (spread). Measures of central tendency
include the mean, median, and mode, while measures of variability include
standard deviation, variance, minimum and maximum values, kurtosis, and
skewness.

b) Inferential statistics
Inferential statistics takes data from a sample and makes inferences about the
larger population from which the sample was drawn. Because the goal of
inferential statistics is to draw conclusions from a sample and generalize them to
a population, we need to have confidence that our sample accurately reflects the
population.
Inferential statistics are again classified into
1. Parametric Method:
Parametric inferential tests are carried out on data that follow certain
parameters: the data will be normal; numbers can be added, subtracted,
multiplied and divided; variances are equal when comparing two or more
groups; and the sample should be large and randomly selected. There are
generally more statistical technique options for the analysis of parametric
than non-parametric data, and parametric statistics are considered to be
the more powerful. Common examples of parametric tests are: correlated
t-tests and the Pearson r correlation coefficient.

2. Non-Parametric Method:
Non-parametric tests relate to data that are flexible and do not follow a
normal distribution. They are also known as “distribution-free” and the
data are generally ranked or grouped. Non-parametric data lack
those same parameters and cannot be added, subtracted, multiplied, or
divided. These data include nominal measurements such as gender or
race; or ordinal levels of measurement such as IQ scales, or survey
response categories such as “good, better, best”, “agree, neutral, disagree”,
etc. Examples include ranking, the chi-square test, binomial test and
Spearman's rank correlation coefficient.
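As a quick illustration of the parametric/non-parametric distinction above, the sketch below runs an
independent-samples t-test (parametric) and a Mann-Whitney U test (non-parametric, rank-based) on the
same two small groups. Python with SciPy is assumed here purely for illustration; the notes do not
prescribe any particular software, and the numbers are made up.

```python
# Minimal sketch: parametric vs non-parametric comparison of two groups (hypothetical data).
from scipy.stats import ttest_ind, mannwhitneyu

group_a = [23, 25, 27, 30, 31, 29, 26]
group_b = [20, 22, 24, 23, 25, 21, 22]

# Parametric: independent-samples t-test (assumes roughly normal data with comparable variances).
t_stat, t_p = ttest_ind(group_a, group_b)

# Non-parametric: Mann-Whitney U test (distribution-free, works on ranks).
u_stat, u_p = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(t_p, u_p)   # p-values from the two approaches
```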

Data
There are different types of data in Statistics that are collected, analysed, interpreted
and presented. The data are the individual pieces of factual information recorded, and it
is used for the purpose of the analysis process. The two processes of data analysis are
interpretation and presentation. Statistics are the result of data analysis. Thus, data are a
collection of observations on one or more variables of interest.

Terminologies
• Element: The entities on which data are collected
• Variable: The characteristic of interest for the element
• Observation: The numerical value of a variable for an element

Data can be classified on the basis of

a) Variables in the data


b) Nature of the data
c) Numerical measurement of the data

Data on the basis of variable


1. Univariate data

This type of data consists of only one variable. The analysis of univariate data is thus the
simplest form of analysis since the information deals with only one quantity that
changes.

2. Bivariate data

This type of data involves two different variables. The analysis of this type of data deals
with causes and relationships and the analysis is done to find out the relationship
among the two variables.

3. Multivariate data

When the data involve three or more variables, they are categorized as multivariate data.
For example:

Observation   Wage/hr   Education (years)   Experience (years)
1             3.2       11                  2
2             3.24      12                  22
3             4         11                  7

Data classified on the basis of nature


1. Time series data

Time series data is a collection of observations obtained through repeated


measurements over time. Time series data is data that is collected at different points in
time. This is opposed to cross-sectional data which observes individuals, companies, etc.
at a single point in time. Time series data is a collection of quantities that are assembled
over even intervals in time and ordered chronologically. The time interval at which data
is collected is generally referred to as the time series frequency.

Year   GDP   WPI   M1 Money Supply
2017   ***   ***   ***
2018   ***   ***   ***
2019   ***   ***   ***

2. Cross-Sectional Data

Cross-sectional data analysis is when you analyze a data set at a fixed point in time.
Surveys and government records are some common sources of cross-sectional data. The
datasets record observations of multiple variables at a particular point in time. Data is
collected at the same or approximately the same point in time. Order doesn’t matter.

Observation   Wage/hr   Education (years)   Experience (years)
1             90        11                  2
2             240       12                  22
3             200       11                  7

3. Pooled Cross Section Data

Pooled data occur when we have a “time series of cross sections,” but the observations
in each cross section do not necessarily refer to the same unit. Data with both cross
sectional and time series features. Here we are pooling different cross section data sets.

Observation number   Year   Housing Price   Property Tax
1                    1993   85500           42
2                    1993   67300           36
3                    1993   134000          38
4                    1995   65000           41
5                    1995   182400          16
6                    1995   97500           15

4. Panel data

Panel data refers to samples of the same cross-sectional units observed at multiple
points in time. Here order matters.

Observation number   City   Year   Murders   Population   Police
1                    1      1986   5         350000       440
2                    1      1990   8         359200       471
3                    2      1986   2         64300        75
4                    2      1990   1         65100        75
5                    3      1986   25        543000       520
6                    3      1990   32        546200       493

Data Classified On The Basis Of Numerical Measurement


1. Qualitative data

Qualitative data are measures of 'types' and may be represented by a name, symbol, or
a number code. Qualitative data are data about categorical variables (e.g. what type). It
can’t be expressed in numerical terms. Eg:- Sex, Religion, Attitude

2.Quantitative data

Quantitative data are measures of values or counts and are expressed as numbers.
Quantitative data are data about numeric variables (e.g. how many; how much; or how
often). Eg:- Height , Weight, etc

Variables
The characteristic under study that assumes different values for different elements

1. Quantitative Variables: Sometimes referred to as “numeric” variables, these are


variables that represent a measurable quantity. Examples include:

• Number of students in a class


• Number of square feet in a house
• Population size of a city
• Age of an individual
• Height of an individual

Quantitative variables are again classified into

1. Discrete variables are countable in a finite amount of time. For example, you can
count the change in your pocket. You can count the money in your bank account.
You could also count the amount of money in everyone’s bank accounts. It might
take you a long time to count that last item, but the point is—it’s still countable.
2. Continuous variables can take any value within a range and would (literally) take
forever to count. Take age, for example: you cannot count “age”, because a person’s
age could be 25 years, 10 months, 2 days, 5 hours, 4 seconds, 4 milliseconds,
8 nanoseconds, 99 picoseconds…and so on, with ever finer precision.

2. Qualitative Variables: Sometimes referred to as “categorical” variables, these are


variables that take on names or labels and can fit into categories. Examples include:

• Eye color (e.g. “blue”, “green”, “brown”)


• Gender (e.g. “male”, “female”)
• Breed of dog (e.g. “lab”, “bulldog”, “poodle”)
• Level of education (e.g. “high school”, “Associate’s degree”, “Bachelor’s degree”)
• Marital status (e.g. “married”, “single”, “divorced”)

Scales of measurement
1. Nominal scale of measurement: The nominal scale of measurement defines the
identity property of data. This scale has certain characteristics, but doesn’t have any
form of numerical meaning. The data can be placed into categories but can’t be
multiplied, divided, added or subtracted from one another. It’s also not possible to
measure the difference between data points. Examples of nominal data include eye
colour and country of birth. Nominal data can be broken down again into three
categories:

• Nominal with order: Some nominal data can be sub-categorized in order, such as
“cold, warm, hot and very hot.”
• Nominal without order: Nominal data can also be sub-categorized as nominal
without order, such as male and female.
• Dichotomous: Dichotomous data is defined by having only two categories or
levels, such as ‘yes’ and ‘no’.

2. Ordinal scale of measurement: The ordinal scale defines data that is placed in a
specific order. While each value is ranked, there’s no information that specifies what
differentiates the categories from each other. These values can’t be added to or
subtracted from. An example of this kind of data would include satisfaction data points
in a survey, where ‘one = happy, two = neutral, and three = unhappy.’ Where someone
finished in a race also describes ordinal data. While first place, second place or third
place shows what order the runners finished in, it doesn’t specify how far the first-place
finisher was in front of the second-place finisher.

3. Interval scale of measurement: The interval scale contains properties of nominal


and ordered data, but the difference between data points can be quantified. This type of
data shows both the order of the variables and the exact differences between the
variables. They can be added to or subtracted from each other, but not multiplied or
divided. For example, 40 degrees is not 20 degrees multiplied by two. This scale is also
characterized by the fact that zero is an existing value on the scale. Unlike a ratio scale,
zero on an interval scale does not indicate the absence of the quantity – for example, if
you measure degrees, zero is itself a temperature. Data points on the interval
scale have the same difference between them. The difference on the scale between 10
and 20 degrees is the same between 20 and 30 degrees. This scale is used to quantify
the difference between variables, whereas the other two scales are used to describe
qualitative values only. Other examples of interval scales include the year a car was
made or the months of the year.

4. Ratio scale of measurement: Ratio scales of measurement include properties from


all four scales of measurement. The data is nominal and defined by an identity, can be
classified in order, contains intervals and can be broken down into exact value. Weight,
height and distance are all examples of ratio variables. Data in the ratio scale can be
added, subtracted, divided and multiplied. Ratio scales also differ from interval scales in
that the scale has a ‘true zero’: a value of zero means a complete absence of the quantity
being measured. Height and weight are examples, as someone cannot be zero centimetres
tall or weigh zero kilos – or be negative centimetres or negative kilos. Examples of the use of
this scale are calculating shares or sales. Of all types of data on the scales of
measurement, data scientists can do the most with ratio data points.

Stages Of Statistical Investigation


• By adding “organization of data” to the Croxton and Cowden definition, we have
five stages of statistical investigation:

1. Collection Of Data

2. Organization Of Data

3. Presentation Of Data

4. Analysis Of Data

5. Interpretation Of Data
Collection of data
In Statistics, data collection is a process of gathering information from all the relevant
sources to find a solution to the research problem. It helps to evaluate the outcome of
the problem. The data collection methods allow a person to conclude an answer to the
relevant question. Depending on the type of data, the data collection method is divided
into two categories namely,

• Primary Data Collection methods


• Secondary Data Collection methods

Primary Data Collection Methods


Primary data or raw data is a type of information that is obtained directly from the first-
hand source through experiments, surveys or observations. The primary data collection
method is further classified into two types. They are

• Quantitative Data Collection Methods


• Qualitative Data Collection Methods

Quantitative Data Collection Methods


It is based on mathematical calculations using various formats like close-ended
questions, correlation and regression methods, mean, median or mode measures. This
method is cheaper than qualitative data collection methods and it can be applied in a
short duration of time.

Qualitative Data Collection Methods


It does not involve any mathematical calculations. This method is closely
associated with elements that are not quantifiable. This qualitative data collection
method includes interviews, questionnaires, observations, case studies, etc. There are
several methods to collect this type of data. They are

Observation Method
Observation method is used when the study relates to behavioral science. This
method is planned systematically. It is subject to many controls and checks. The
different types of observations are:

• Structured and unstructured observation


• Controlled and uncontrolled observation
• Participant, non-participant and disguised observation
Interview Method
The method of collecting data in terms of oral or verbal responses. It is achieved in two
ways, such as

• Personal Interview – In this method, a person known as an interviewer is required


to ask questions face to face to the other person. The personal interview can be
structured or unstructured, direct investigation, focused conversation, etc.
• Telephonic Interview – In this method, an interviewer obtains information by
contacting people on the telephone to ask the questions or views orally.

Questionnaire Method
In this method, the set of questions are mailed to the respondent. They should read,
reply and subsequently return the questionnaire. The questions are printed in the
definite order on the form. A good survey should have the following features:

• Short and simple


• Should follow a logical sequence
• Provide adequate space for answers
• Avoid technical terms
• Should have good physical appearance such as color, quality of the paper to
attract the attention of the respondent

Schedules
This method is similar to the questionnaire method with a slight difference:
enumerators are specially appointed for the purpose of filling in the schedules. The
enumerator explains the aims and objectives of the investigation and may clear up any
misunderstandings that arise. Enumerators should be trained to perform their job with
diligence and patience.

Questionnaire vs Schedules

Basis for Comparison      Questionnaire           Schedule
Filled by                 Respondent              Enumerator
Skill & efficiency of     Question setters        Enumerators
Response rate             Low                     High
Coverage                  Large                   Comparatively small
Cost                      Economical              Expensive
Useful for                Literate respondents    Both literate & illiterate

Secondary Data Collection Methods


Secondary data is data collected by someone other than the actual user. It means
that the information is already available, and someone analyses it. The secondary data
includes magazines, newspapers, books, journals, etc. It may be either published data or
unpublished data.

Published data are available in various resources including

• Government publications
• Public records
• Historical and statistical documents
• Business documents
• Technical and trade journals

Unpublished data includes

• Diaries
• Letters
• Unpublished biographies, etc.

Organization of data
Data collected in its original form is raw data. The systematic classification of the
raw data is called organization of data. Its purpose is to present the data in a readily
comprehensible, condensed form which highlights the important characteristics of the
data, facilitates comparisons and renders it suitable for further processing (statistical
analysis) and interpretation.
Data can be presented in

• Table
• Diagram or Graph

Tabular Presentation
It is an orderly and logical arrangement of data into rows and columns. It presents the
voluminous and heterogeneous data in a condensed and homogeneous form.
Systematic arrangement of the raw data into different homogeneous classes is
necessary to sort out the relevant and significant features from the irrelevant and
insignificant ones. Thus classification of the data becomes a preliminary step to its tabulation.

Classification of Data
It is the process of arranging the data into groups or classes according to resemblances
and similarities. There are various types of classification:

a) Geographical Classification

It is also known as Spatial classification. It classifies the data on the basis of


geographical or locational differences. Categories are usually listed alphabetically or by size.

b) Chronological Classification

It is nothing but the time series data. Here the data is classified on the basis of
time

c) Qualitative Classification

Classified on the basis of descriptive characteristics like sex, literacy, region, caste, etc
i.e. which cannot be quantified

d) Quantitative Classification

On the basis of some characteristics which can be measured such as height, weight,
income, etc. It is mainly organized in two forms

• Array : Arrangement of the data in ascending or descending order is called


Array
• Frequency Distribution: A frequency distribution is a representation, either in
a graphical or tabular format that displays the number of observations within
a given interval. The interval size depends on the data being analyzed and the
goals of the analyst. They can be of three types
• Discrete/Ungrouped Frequency Distribution
• Grouped Frequency Distribution
• Continuous Frequency Distribution

a) Discrete/Ungrouped Frequency Distribution

Frequency is marked against all possible values by summing up tally marks.

b) Grouped Frequency Distribution

Used when the identity and order of individual elements do not matter. This is best suited
for discrete variables. Here data are classified into different class intervals and the
frequency is recorded against each interval. The various groups into which the values of the variable are
classified are known as Classes or Class Intervals. The length of the class intervals is
called width or magnitude of the class. Two values specifying the class are called the
Class Limits. Upper class Limit is the one with the larger value. Lower class limit is the
smaller value.
C) Continuous Frequency Distribution

A continuous frequency distribution is a series in which the data are classified into
different class intervals without gaps and their respective frequencies are assigned as
per the class intervals and class width.

Age     No. of persons
20-25   6
25-30   10
30-35   14
35-40   9

Types of Class Interval


There are mainly two types of class interval

1. Inclusive Type Class: Inclusive class intervals contain values up to upper class
limit i.e. upper class limit is included. The upper class limit of preceding class
interval and lower class limit of succeeding class interval are different.
Marks   No. of students
15-19   11
20-24   9
25-29   12
30-34   26
35-39   32
40-44   35
2. Exclusive Type Class: Exclusive class intervals contain values lower than the
upper-class limit i.e. upper-class limit is excluded. The upper-class limit of
preceding class interval and lower-class limit of succeeding class interval are
same.

Age     No. of persons
20-25   6
25-30   10
30-35   14
35-40   9

Conversion of inclusive into exclusive


1. First find the difference between the upper limit of a class interval and the lower
limit of the next class interval.
2. Half the difference is added to the upper limit of each class interval and
remaining half is deducted from the lower limit of each class-interval.
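A small sketch of the two-step conversion described above, applied to the inclusive marks classes
(15-19, 20-24, ...) from the earlier table; plain Python, no libraries assumed.

```python
# Convert inclusive class intervals into exclusive ones.
inclusive = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]

gap = inclusive[1][0] - inclusive[0][1]   # step 1: difference between 20 and 19 = 1
half = gap / 2                            # step 2: half the difference = 0.5

# add half to each upper limit, deduct half from each lower limit
exclusive = [(lo - half, hi + half) for lo, hi in inclusive]
print(exclusive)   # [(14.5, 19.5), (19.5, 24.5), (24.5, 29.5), ...]
```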

Cumulative Frequencies
Cumulative frequency is used to determine the number of observations that lie above
(or below) a particular value in a data set. The cumulative frequency is calculated using a
frequency distribution table, which can be constructed from stem and leaf plots or
directly from the data. The cumulative frequency is calculated by adding each frequency
from a frequency distribution table to the sum of its predecessors. Two types

1. Less than cumulative frequency distribution

2. More than cumulative frequency distribution

1. Less than cumulative frequency distribution


Obtained by adding successively the frequencies of all the previous values, including the
frequency of variable against which the totals are written, provided the values are
arranged in ascending order of magnitude.
Marks   Frequency   Less than c.f.
30-35   5           5
35-40   10          15
40-45   15          30
45-50   30          60
50-55   5           65
55-60   5           70

2. More Than Cumulative Frequency Distribution

It is obtained by finding the cumulative totals of frequencies starting from the highest
value of the variable (class) to the lowest value (class).

Marks   Frequency   More than c.f.
30-35   5           70
35-40   10          65
40-45   15          55
45-50   30          40
50-55   5           10
55-60   5           5
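The sketch below reproduces both cumulative frequency columns of the two tables above by running
totals from the lowest class upwards (less than c.f.) and from the highest class downwards (more than
c.f.); plain Python only.

```python
# Less-than and more-than cumulative frequencies for the marks table above.
classes = ["30-35", "35-40", "40-45", "45-50", "50-55", "55-60"]
freq    = [5, 10, 15, 30, 5, 5]

less_than = []
total = 0
for f in freq:                 # accumulate from the lowest class upwards
    total += f
    less_than.append(total)

more_than = []
total = 0
for f in reversed(freq):       # accumulate from the highest class downwards
    total += f
    more_than.append(total)
more_than.reverse()

for c, f, lt, mt in zip(classes, freq, less_than, more_than):
    print(c, f, lt, mt)        # e.g. 30-35 5 5 70  ...  55-60 5 70 5
```
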
Presentation Of The Data

Presentation of data refers to an exhibition or putting up data in an attractive and


useful manner such that it can be easily interpreted. The three main forms of
presentation of data are:

• Textual presentation
• Data tables
• Diagrammatic presentation

Tabular Presentation

A table facilitates representation of even large amounts of data in an attractive, easy


to read and organized manner. The data is organized in rows and columns. This is one of
the most widely used forms of presentation of data since data tables are easy to construct
and read.

Components of Data Tables

Table Number: Each table should have a specific table number for ease of access and
locating. This number can be readily mentioned anywhere which serves as a reference and
leads us directly to the data mentioned in that particular table.

Title: A table must contain a title that clearly tells the readers about the data it contains,
time period of study, place of study and the nature of classification of data.

Headnotes: A headnote further aids in the purpose of a title and displays more
information about the table. Generally, headnotes present the units of data in brackets at
the end of a table title.

Stubs: These are titles of the rows in a table. Thus, a stub displays information about the
data contained in a particular row.

Caption: A caption is the title of a column in the data table. In fact, it is the counterpart of a
stub and indicates the information contained in a column.

Body or field: The body of a table is the content of a table in its entirety. Each item in a
body is known as a ‘cell’.

Footnotes: Footnotes are rarely used. In effect, they supplement the title of a table if
required.
Source: When using data obtained from a secondary source, this source has to be
mentioned below the footnote.

Diagrammatic Presentation of Data

When presented diagrammatically, data is easy to interpret with just a glance. In such a
case we need to learn how to represent data diagrammatically via bar diagrams, pie charts
etc.

Bar Diagrams

As the name suggests, when data is presented in form of bars or rectangles, it is termed to
be a bar diagram.

• Simple Bar Diagram: These are the most basic type of bar diagrams. A simple bar
diagram represents only a single set of numerical data. Generally, simple bar
diagrams are used to represent time series data for a single entity.
• Multiple Bar Diagram: Unlike single bar diagram, a multiple bar diagram can
represent two or more sets of numerical data on the same bar diagram. Generally,
these are constructed to facilitate comparison between two entities like average
height and average weight, birth rates and death rates etc.
• Sub-divided or Differential Bar Diagrams: Sub-divided bar diagrams are useful
when we need to represent the total values and the contribution of various sections
of the total simultaneously. The different sections are shaded with different colors in
the same bar.

Pie or Circular Diagrams

In addition to bar diagrams, pie diagrams are also widely used to pictorially represent data.
In this, a circle is divided into sectors whose sizes are decided on the basis of the
percentage share of each category.

Graphical Presentation Of The Data

Histogram: A graph of a frequency distribution in which classes are marked on the horizontal
axis and frequencies on the vertical axis.
Frequency polygon: A frequency polygon is a line graph of class frequency plotted
against class midpoint. It can be obtained by joining the midpoints of the tops of the
rectangles in the histogram

Frequency curve: A frequency curve is a smooth curve for which the total area is taken to
be unity. It is a limiting form of a histogram or frequency polygon
Ogives: are graphs that are used to estimate how many numbers lie below or above a
particular variable or value in data. To construct an Ogive, firstly, the cumulative
frequency of the variables is calculated using a frequency table.

Analysis of data

Statistical data analysis is a procedure of performing various statistical operations.


It is a kind of quantitative research, which seeks to quantify the data, and typically,
applies some form of statistical analysis. Quantitative data basically involves descriptive
data, such as survey data and observational data.

There are four broad groups of measures used in analysing data. They are:

1. Measure Of Central Tendencies

2. Measure Of Dispersion

3. Measure Of Skewness

4. Measure Of Kurtosis

Measure Of Central Tendencies

A measure of central tendency (also referred to as measures of centre or central


location) is a summary measure that attempts to describe a whole set of data with a
single value that represents the middle or centre of its distribution. It can be classified
into three major measures

1. Mathematical Average
a. Arithmetic mean

i. Simple Arithmetic Mean

ii. Weighted arithmetic mean

b. Geometric mean

c. Harmonic mean

2. Positional averages

a. Median

b. Mode

3. Partition Values

a. Quartiles

b. Deciles

c. Percentiles

➢ Arithmetic Mean
The arithmetic mean is the number obtained by dividing the sum of the
elements of a set by the number of values in the set. In everyday language it is simply
called the average – the two terms mean the same thing. The arithmetic mean may be either

• Simple Arithmetic Mean


• Weighted Arithmetic Mean

Properties of Arithmetic Mean

• The sum of deviations of the items from their arithmetic mean is always zero, i.e.
∑(x – x̄) = 0.
• The sum of the squared deviations of the items from Arithmetic Mean (A.M) is
minimum, which is less than the sum of the squared deviations of the items from
any other values.
• If each item in the arithmetic series is substituted by the mean, then the sum of
these replacements will be equal to the sum of the specific items.
Merits

• Easy to calculate
• Based on all observations
• Suitable for further mathematical treatment
• Least affected by sampling fluctuations (among all averages)

Demerits

• Affected by extreme values


• Can’t be used for open end classes
• Can’t be located graphically
• Can’t be used when data is qualitative
• Gives equal importance to all observations

➢ Weighted Arithmetic Mean

Weighted Mean is an average computed by giving different weights to some of the


individual values. If all the weights are equal, then the weighted mean is the same as the
arithmetic mean.

➢ Geometric mean

The Geometric Mean (GM) is the average which signifies the central tendency of a set of
n numbers by taking the nth root of the product of their values.
Merits

• Rigidly defined
• Based on all the observation
• Suitable for further mathematical treatment
• Not affected much by fluctuations of sampling
Demerits

• Not easy to calculate and understand


• Can’t be calculated if zero or negative value is present in the observation

Uses

• Used in the construction of index numbers (Fisher index)


• Average rate of growth in GDP, Population etc

➢ Harmonic mean
The harmonic mean is a type of numerical average. It is calculated by dividing the
number of observations by the sum of the reciprocals of the numbers in the series. Thus, the
harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.
Merits

• Rigidly defined

• Based on all the observation

• Suitable for further mathematical treatment

• Not affected very much by fluctuations in sampling

Demerits

• Not easy to calculate and understand

• Can’t be calculated if zero or negative value is present in the observation

• Does not represent true characteristic of the data

• More weight is given to the smaller values
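The following sketch computes the simple arithmetic mean, weighted arithmetic mean, geometric mean
and harmonic mean for small made-up data sets, using Python's standard statistics module (an
assumption; any calculator or spreadsheet would do). Note that AM ≥ GM ≥ HM for positive values.

```python
import statistics

data = [4, 8, 16]                                  # hypothetical observations
am = statistics.mean(data)                         # (4 + 8 + 16) / 3 ≈ 9.33
gm = statistics.geometric_mean(data)               # cube root of 4*8*16 = 8.0
hm = statistics.harmonic_mean(data)                # 3 / (1/4 + 1/8 + 1/16) ≈ 6.86

# Weighted arithmetic mean: different weights for different values.
values  = [70, 80, 90]
weights = [1, 2, 3]
wam = sum(v * w for v, w in zip(values, weights)) / sum(weights)   # ≈ 83.33

print(am, gm, hm, wam)
```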

Positional Averages

• Its value depends upon the position occupied by a value in the frequency
distribution.

• Two averages are there

1. Median

2. Mode

➢ Median
• The idea of the median first appeared in Edward Wright’s book “Certaine
Errors In Navigation” in 1599.
• Antoine Augustin Cournot in 1843 was the first to use the term median for the
value that divides a probability distribution into two equal halves.
• Gustav Theodor Fechner popularized the use of the median in the formal analysis of data.
• Francis Galton used the English term median in 1881, having earlier used the
terms “middle-most value” (1869) and “medium” (1880).

• The median is the middle number in a sorted, ascending or descending, list of


numbers and can be more descriptive of that data set than the average.
• Odd Number of Observations
If the total number of observation given is odd, then the formula to calculate
the median is:
Median = value of the ((n+1)/2)th term
where n is the number of observations

• Even Number of Observations


If the total number of observation is even, then the median formula is:
Median = [ (n/2)th term + ((n/2)+1)th term ] / 2
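A minimal sketch of the two median rules above (positions are counted from 1 in the sorted data);
plain Python only.

```python
def median(values):
    x = sorted(values)                  # the data must be ordered first
    n = len(x)
    if n % 2 == 1:                      # odd n: the ((n+1)/2)th term
        return x[(n + 1) // 2 - 1]
    mid = n // 2                        # even n: average of the (n/2)th
    return (x[mid - 1] + x[mid]) / 2    # and the ((n/2)+1)th terms

print(median([7, 3, 9]))       # 7   (odd number of observations)
print(median([7, 3, 9, 1]))    # 5.0 (average of 3 and 7)
```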

Median Properties

• Median is not dependent on all the data values in a dataset.


• The median value is fixed by its position and is not reflected by the individual
value.
• The sum of the absolute deviations of the values from the median is less than the sum
of absolute deviations from any other point.
• Every array has a single median.
• Median cannot be manipulated algebraically. It cannot be weighed and
combined.
• In a grouping procedure, the median is stable.
• Median is not applicable to qualitative data.
• The values must be grouped and ordered for computation.
• Median can be determined for ratio, interval and ordinal scale.
• Outliers and skewed data have less impact on the median.
• If the distribution is skewed, the median is a better measure when compared to
mean.
Merits

• Rigidly defined
• Easy to understand and calculate
• Not affected by extreme values
• Very useful measure when data is skewed
• Best average when data have extreme values
• Can be calculated for open ended distribution.
• Can be located graphically (ogives).
• Best measure for studying qualitative features or attributes of an observation.
Demerits
• Exact median can’t be determined for an even no. of observation.
• Not suitable for further mathematical treatment
• Value of the median is affected by the number of observations rather than values
of the observation.
➢ Mode
The mode is one of the measures of central tendency that can be calculated for a
given set of data values (the others being the mean and the median). The mode or the
modal value is by definition the value in a series of observations that occurs with the
highest frequency.

Properties of Mode:

• The mode is not unduly affected by extreme value, that is, values that are
extremely high or extremely low. For example, if we are given the following set of
observations:
• 1, 1, 1, 1, 1, 2, 2, 100
• The mean of the above set of data values is 13.625 which is clearly not
representative of the above data values. However, the mode which is equal to 1 is
clearly representative of a typical value from the above data set. This is one
advantage of the mode compared to the mean.
• The mode is not calculated on all observations in a data set.
• The value of the mode can be computed graphically whereas the value of the
mean cannot be calculated graphically.
• The value of the mode can be calculated in open end distributions without
knowing the class limits.
• The mode can be conveniently found even if the frequency distribution has class
intervals of unequal magnitude provided that the modal class and the classes
succeeding and preceding it are of the same magnitude.
• Sometimes it may not be possible to calculate the mode. This happens if the data
has a bimodal distribution in which there are two possible values for the mode.
• Another disadvantage of the mode is that as compared with the mean, it is
affected more by fluctuations of sampling.
• We have the following relationship between the mean, median and the mode:
Mode = 3*Median - 2*Mean
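A short sketch applying the mode (and the other two averages) to the example set given above, using
Python's statistics module as an assumed tool.

```python
import statistics

data = [1, 1, 1, 1, 1, 2, 2, 100]        # the example set from the notes
print(statistics.mean(data))              # 13.625 (pulled up by the extreme value 100)
print(statistics.median(data))            # 1.0
print(statistics.mode(data))              # 1 (the value occurring with the highest frequency)

# For moderately skewed data, the empirical relation Mode ≈ 3*Median − 2*Mean
# gives a rough estimate of the mode; it is only an approximation.
```
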
Merits

• Easy to understand and calculate


• Not affected by the extreme values/outliers
• Can be calculated for open-end classes
• Can be located graphically (histogram)
• Can be used to describe both quantitative and qualitative data.

Demerits

• Not rigidly defined


• Mode is determined by empirical relation when the distribution is either bimodal
or multimodal
• Mode = 3 Median – 2 Mean
• Not based on all the observations
• Not suitable for further mathematical treatment
• As compared with mean, mode is affected to a greater extent by the fluctuations
of sampling.

Empirical Relation between Mean, Median & Mode


• For a normal distribution

• Mean = Median = Mode

• For skewed distribution

• Positively Skewed

• Mean > Median > Mode

• Negatively Skewed

• Mean < Median < Mode

[Figures: a normal (symmetrical) distribution curve and asymmetrical (skewed) distribution curves]

Partition values

• The values which divide the data into a number of equal parts are called Partition
Values.
• Mainly we have three partition values
o Quartiles
o Deciles
o Percentiles
• It can be located graphically by ogives / cumulative frequency curve

Quartiles
A quartile is a statistical term that describes a division of observations into four
defined intervals based on the values of the data and how they compare to the entire
set of observations. The quartiles measure the spread of values above and below the
median by dividing the distribution into four groups.
points—a lower quartile, median, and upper quartile—to form four groups of the
dataset. Quartiles are used to calculate the interquartile range, which is a measure of
variability around the median. Each quartile contains 25% of the total observations.
Generally, the data is arranged from smallest to largest:

• First quartile: the lowest 25% of numbers
• Second quartile: from 25% up to 50% (up to the median)
• Third quartile: from 50% up to 75% (above the median)
• Fourth quartile: the highest 25% of numbers

Percentiles
A percentile is a term used in statistics to express how a score compares to other
scores in the same set. While there is technically no standard definition of percentile, it's
typically communicated as the percentage of values that fall below a particular value in a
set of data scores.
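The sketch below obtains the three quartiles, a percentile and the interquartile range with NumPy
(an assumed library). NumPy interpolates between observations by default, so results can differ
slightly from hand-computed textbook formulas.

```python
import numpy as np

data = [2, 4, 4, 5, 6, 7, 8, 10, 11, 12, 15, 20]   # hypothetical observations

q1, q2, q3 = np.percentile(data, [25, 50, 75])      # quartiles (q2 is the median)
p90 = np.percentile(data, 90)                       # 90th percentile
iqr = q3 - q1                                       # interquartile range

print(q1, q2, q3, p90, iqr)
```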

Measures of dispersion
Dispersion is the state of getting dispersed or spread. Statistical dispersion means
the extent to which numerical data are likely to vary about an average value. In other
words, dispersion helps to understand the distribution of the data. It is used:
• To test the reliability of an average


• To control the variability
• To compare two or more sets of data with respect to their variability.
• To obtain other statistical measures for further analysis of data.

Types of Measures of Dispersion


There are two main types of dispersion methods in statistics which are:

• Absolute Measure of Dispersion


• Relative Measure of Dispersion
Absolute Measure of Dispersion
An absolute measure of dispersion contains the same unit as the original data
set. The absolute dispersion method expresses the variation in terms of the average of the
deviations of observations, as in the standard or mean deviation. It includes the range, standard
deviation, quartile deviation, etc. It is helpful for comparing two data sets with the same
units of measurement, but it fails when comparing two data sets with different units of
measurement.

Relative Measure of Dispersion

The relative measures of dispersion are used to compare the distribution of two or
more data sets. This measure compares values without units. Common relative
dispersion methods include:

1. Co-efficient of Range
2. Co-efficient of Variation
3. Co-efficient of Standard Deviation
4. Co-efficient of Quartile Deviation
5. Co-efficient of Mean Deviation
Range
A range is the most common and easily understandable measure of dispersion. It
is the difference between the two extreme observations of the data set. If H and L
are the highest and lowest observations, then

R = H – L

where

H = highest value of an observation

L = lowest value of an observation

Merits of Range

• It is the simplest of the measures of dispersion


• Easy to calculate
• Easy to understand
• Independent of change of origin

Demerits of Range

• It is based only on the two extreme observations and hence is affected by sampling fluctuations


• A range is not a reliable measure of dispersion
• Dependent on change of scale

Coefficient of Range
It is defined as the relative measure of the distribution based on the range of any
given data set, which is the difference between the maximum and minimum value in the
given set. It is also known as range coefficient. In the case of grouped data, the range is
the difference between the upper boundary of the highest class and the lower boundary
of the lowest class. It is also calculated by using the difference between the mid points
of the highest class and the lowest class.

Coefficient of range = (H – L) / (H + L)

Quartile Deviation
The quartiles divide a data set into quarters. The first quartile, (Q1) is the middle
number between the smallest number and the median of the data. The second quartile,
(Q2) is the median of the data set. The third quartile, (Q3) is the middle number between
the median and the largest number.

Quartile deviation or semi-inter-quartile deviation is given by QD = (Q3 – Q1) / 2

Merits of Quartile Deviation

• All the drawbacks of Range are overcome by quartile deviation


• It uses half of the data
• Independent of change of origin
• The best measure of dispersion for open-end classification

Demerits of Quartile Deviation

• It ignores 50% of the data


• Dependent on change of scale
• Not a reliable measure of dispersion

Coefficient of Quartile Deviation


The coefficient of quartile deviation (sometimes called the quartile coefficient of
dispersion) allows you to compare dispersion for two or more sets of data. The formula
is:

Coefficient of quartile deviation = (Q3 – Q1) / (Q3 + Q1)

If one set of data has a larger coefficient of quartile deviation than another set, then that
data set’s interquartile dispersion is greater.

Mean Deviation
Mean deviation is the arithmetic mean of the absolute deviations of the observations
from a measure of central tendency. If x1, x2, …, xn are the observations, then the
mean deviation of x about the average A (mean, median, or mode) is

MD(A) = (1/n) Σ |xi – A|
Merits of Mean Deviation

• Based on all observations


• It provides a minimum value when the deviations are taken from the median
• Independent of change of origin

Demerits of Mean Deviation

• Not easily understandable


• Its calculation is not easy and time-consuming
• Dependent on the change of scale
• Ignoring the negative signs creates artificiality and makes it unsuitable for further
mathematical treatment
Coefficient of the Mean Deviation
A relative measure of dispersion based on the mean deviation is called the
coefficient of the mean deviation or the coefficient of dispersion. It is defined as the
ratio of the mean deviation to the average used in the calculation of the mean deviation.
Thus:

• Coefficient of MD about mean = Mean deviation from mean / Mean

• Coefficient of MD about median = Mean deviation from median / Median

• Coefficient of MD about mode = Mean deviation from mode / Mode

Standard Deviation
A standard deviation is the positive square root of the arithmetic mean of the
squares of the deviations of the given values from their arithmetic mean. It is denoted
by a Greek letter sigma, σ. It is also referred to as root mean square deviation. The
standard deviation is given as

σ = √( Σ (x – x̄)² / n )
Merits of Standard Deviation

• Squaring the deviations overcomes the drawback of ignoring signs in mean


deviations

• Suitable for further mathematical treatment

• Least affected by the fluctuation of the observations

• The standard deviation is zero if all the observations are constant

• Independent of change of origin


Demerits of Standard Deviation

o Not easy to calculate

o Difficult to understand for a layman

o Dependent on the change of scale

Coefficient of variation

The coefficient of variation (CV) is a measure of relative variability. It is the ratio of


the standard deviation to the mean (average). For example, the expression “The
standard deviation is 15% of the mean” is a CV.

Coefficient of Variation = (Standard Deviation / Mean) * 100.
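The sketch below computes the absolute measures discussed in this section (range, quartile deviation,
mean deviation about the mean, standard deviation) and the coefficient of variation for one hypothetical
data set, using the statistics module and NumPy as assumed tools.

```python
import statistics
import numpy as np

data = [12, 15, 17, 20, 22, 25, 30]                          # hypothetical observations

mean = statistics.mean(data)
rng  = max(data) - min(data)                                 # Range R = H - L
q1, q3 = np.percentile(data, [25, 75])
qd   = (q3 - q1) / 2                                         # quartile deviation
md   = sum(abs(x - mean) for x in data) / len(data)          # mean deviation about the mean
sd   = statistics.pstdev(data)                               # population standard deviation
cv   = sd / mean * 100                                       # coefficient of variation (%)

print(rng, qd, md, sd, cv)
```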

Lorenz Curve

The distribution of Income in an economy is represented by the Lorenz Curve and


the degree of income inequality is measured through the Gini Coefficient.

One of the five major and common macroeconomic goals of a government is the
equitable (fair) distribution of income.

The Lorenz Curve (the actual distribution of income curve), a graphical


distribution of wealth developed by Max O. Lorenz in 1905, shows the proportion of
income earned by any given percentage of the population. The line at the 45º angle
shows perfectly equal income distribution, while the other line shows the actual
distribution of income. The further away from the diagonal, the more unequal the size of
the distribution of income
Gini Coefficient
The Gini Coefficient, which is derived from the Lorenz Curve, can be used as an indicator
of economic development in a country.

The Gini Coefficient measures the degree of income inequality in a population.

The Gini Coefficient can vary from 0 (perfect equality) to 1 (perfect inequality).

A Gini Coefficient of zero means that everyone has the same income, while a Coefficient
of 1 represents a single individual receiving all the income.
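A minimal sketch of the Gini Coefficient for a small, made-up income distribution. It uses the standard
mean-absolute-difference formula, which is equivalent to the area-based Lorenz-curve definition;
NumPy is assumed.

```python
import numpy as np

income = np.array([10, 20, 30, 40, 100], dtype=float)       # hypothetical incomes

# Sum of |xi - xj| over all ordered pairs, divided by 2 * n^2 * mean income.
pairwise = np.abs(income[:, None] - income[None, :]).sum()
gini = pairwise / (2 * len(income) ** 2 * income.mean())

print(round(gini, 3))   # 0.4 here; 0 = perfect equality, values near 1 = extreme inequality
```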

Relation between various measures of dispersion

For a moderately symmetrical (near-normal) distribution the measures are related approximately by
QD ≈ (2/3) SD and MD ≈ (4/5) SD, i.e. 6 QD ≈ 5 MD ≈ 4 SD.

Measure of Skewness

Skewness is a measure of asymmetry or distortion of symmetric distribution. It


measures the deviation of the given distribution of a random variable from a symmetric
distribution, such as normal distribution. A normal distribution is without any skewness,
as it is symmetrical on both sides. Hence, a curve is regarded as skewed if it is shifted
towards the right or the left.
Types of Skewness

1. Positive Skewness

If the given distribution is shifted to the left and with its tail on the right side, it is
a positively skewed distribution.

2. Negative Skewness

If the given distribution is shifted to the right and with its tail on the left side, it is
a negatively skewed distribution. It is also called a left-skewed distribution.

[Figure: positively and negatively skewed (asymmetrical) distributions]

Absolute Measure of Skewness

• It involves unit of measurement


• It has two important measures

• Sk = Mean – Mode

• Sk = Q3+Q1 – 2Md

Relative Measure of Skewness

• Also called coefficient of skewness

• Main coefficient of skewness are

1. Karl Pearson Coefficient of Skewness

2. Bowley’s Coefficient of Skewness


3. Kelly’s Coefficient Of Skewness

Moments [of a statistical distribution]


The shape of any distribution can be described by its various ‘moments’. The first
four are:

o The first moment is the mean, which indicates the central tendency of a distribution.

o The second moment is the variance, which indicates the width or


deviation.

o The third moment is the skewness, which indicates any asymmetric


‘leaning’ to either left or right.
o The fourth moment is the Kurtosis, which indicates the degree of central
‘peakedness’ or, equivalently, the ‘fatness’ of the outer tails.

Moments about Mean

The rth moment about the mean is μr = Σ (x – x̄)^r / n, so that μ1 = 0 and μ2 is the variance.

Moments about Origin

The rth moment about the origin is μr′ = Σ x^r / n; the first moment about the origin is the mean.

Karl Pearson’s Beta and Gamma Coefficients Based on Moments

β1 = μ3² / μ2³ and γ1 = √β1 (measures of skewness); β2 = μ4 / μ2² and γ2 = β2 – 3 (measures of kurtosis).
➢ Kurtosis
Kurtosis is a statistical measure that defines how heavily the tails of a distribution
differ from the tails of a normal distribution. In other words, kurtosis identifies whether
the tails of a given distribution contain extreme values. Along with skewness, kurtosis is
an important descriptive statistic of data distribution. However, the two concepts must
not be confused with each other. Skewness essentially measures the symmetry of the
distribution, while kurtosis determines the heaviness of the distribution tails. Three types
of curve

1. Leptokurtic curve

2. Mesokurtic curve

3. Platykurtic curve

Types of Kurtosis

The types of kurtosis are determined by the excess kurtosis of a particular distribution.
The excess kurtosis can take positive or negative values, as well as values close to zero.

1. Mesokurtic

Data that follows a mesokurtic distribution shows an excess kurtosis of zero or


close to zero. This means that if the data follows a normal distribution, it follows a
mesokurtic distribution.

2. Leptokurtic

Leptokurtic indicates a positive excess kurtosis. The leptokurtic distribution shows


heavy tails on either side, indicating large outliers. In finance, a leptokurtic
distribution shows that the investment returns may be prone to extreme values
on either side. Therefore, an investment whose returns follow a leptokurtic
distribution is considered to be risky.

3. Platykurtic

A platykurtic distribution shows a negative excess kurtosis. The kurtosis reveals a


distribution with flat tails. The flat tails indicate the small outliers in a distribution.
In the finance context, the platykurtic distribution of the investment returns is
desirable for investors because there is a small probability that the investment
would experience extreme returns.
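The sketch below estimates skewness and excess kurtosis for a right-skewed random sample using SciPy
(an assumed library); scipy.stats.kurtosis reports excess kurtosis, so a value near zero indicates a
mesokurtic (normal-like) shape.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=10_000)   # a right-skewed (positively skewed) sample

print(skew(data))        # > 0: positively skewed (long right tail)
print(kurtosis(data))    # excess kurtosis: > 0 leptokurtic, ~0 mesokurtic, < 0 platykurtic
```
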
Karl Pearson Coefficient Of Kurtosis

Karl Pearson’s coefficient of kurtosis is β2 = μ4 / μ2²; the excess kurtosis is γ2 = β2 – 3,
which is zero for a normal (mesokurtic) distribution.
➢ Correlation:
Correlation is used to test relationships between quantitative variables or categorical
variables. In other words, it’s a measure of how things are related. The study of how
variables are correlated is called correlation analysis. A Statistical technique that is used
to analyze the strength and direction of the relationship between two variables is
called correlation analysis.

Types of Correlation

o Positive Correlation – when the values of the two variables move in the same
direction so that an increase/decrease in the value of one variable is followed by an
increase/decrease in the value of the other variable.

o Negative Correlation – when the values of the two variables move in the opposite
direction so that an increase/decrease in the value of one variable is followed by
decrease/increase in the value of the other variable.

o No Correlation – when there is no linear dependence or no relation between the two


variables.

o Linear correlation - Correlation is said to be linear if the ratio of change is constant.


When the amount of output in a factory is doubled by doubling the number of
workers, this is an example of linear correlation. In other words, when all the points
on the scatter diagram tend to lie near a line which looks like a straight line, the
correlation is said to be linear

o Non linear correlation - Correlation is said to be non linear if the ratio of change is
not constant. In other words, when all the points on the scatter diagram tend to lie
near a smooth curve, the correlation is said to be non linear (curvilinear)
o Simple correlation - When only two variables are studied it is a problem of simple
correlation
o Partial correlation - Two variables are chosen to study the correlation, while other
factors are assumed to be constant
o Multiple correlation - Correlation among three or more variables considered
simultaneously.

Method of Correlation Analysis

1. Scatter diagram
The scatter diagram graphs pairs of numerical data, with one variable on each axis, to
look for a relationship between them. If the variables are correlated, the points will fall
along a line or curve. The better the correlation, the tighter the points will hug the line

2. Karl Pearson’s Correlation Coefficient

The Karl Pearson product-moment correlation coefficient (or simply, the
Pearson correlation coefficient) is a measure of the strength of a linear association
between two variables and is denoted by r or rxy (x and y being the two variables
involved). This method of correlation attempts to draw a line of best fit through the data
of two variables, and the value of the Pearson correlation coefficient, r, indicates how far
away all these data points are from this line of best fit. It is regarded as the best method of
measuring the association between variables of interest because it is based on the method
of covariance. It gives information about the magnitude of the association, or correlation,
as well as the direction of the relationship.

Properties:

1. Limit: Coefficient values can range from +1 to -1, where +1 indicates a perfect
positive relationship, -1 indicates a perfect negative relationship, and a 0
indicates that no relationship exists.
2. Pure number: It is independent of the unit of measurement. For example, if one
variable’s unit of measurement is in inches and the second variable is in quintals,
even then, Pearson’s correlation coefficient value does not change.
3. Symmetric: Correlation of the coefficient between two variables is symmetric.
This means between X and Y or Y and X, the coefficient value of will remain the
same.

Degree of correlation:

1. Perfect: If the value is near ± 1, then it said to be a perfect correlation: as one


variable increases, the other variable tends to also increase (if positive) or
decrease (if negative).
2. High degree: If the coefficient value lies between ± 0.50 and ± 1, then it is said to
be a strong correlation.
3. Moderate degree: If the value lies between ± 0.30 and ± 0.49, then it is said to be
a medium correlation.
4. Low degree: If the value lies below ± 0.29, then it is said to be a small
correlation.
5. No correlation: When the value is zero.

3. Spearman’s Rank Correlation Coefficient

The Spearman’s rank coefficient of correlation is a nonparametric measure of


rank correlation (statistical dependence of ranking between two variables). Named after
Charles Spearman, it is often denoted by the Greek letter ‘ρ’ (rho) and is primarily used
for data analysis. It measures the strength and direction of the association between two
ranked variables. But before we talk about the Spearman correlation coefficient, it is
important to understand Pearson’s correlation first. A Pearson correlation is a statistical
measure of the strength of a linear relationship between paired data.

The Spearman coefficient, ρ, can take a value between +1 and -1, where:

o A ρ value of +1 means a perfect association of ranks

o A ρ value of 0 means no association of ranks

o A ρ value of -1 means a perfect negative association between ranks.
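The sketch below computes both coefficients for one pair of made-up variables with SciPy (an assumed
library): Pearson's r for the strength of the linear association and Spearman's ρ for the rank
(monotonic) association.

```python
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.3, 5.9, 8.2, 10.1, 11.8, 14.2, 16.0]   # roughly linear in x

r, p_r = pearsonr(x, y)         # linear association
rho, p_rho = spearmanr(x, y)    # rank-based association

print(round(r, 3), round(rho, 3))   # both close to +1 for this data
```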

Coefficient of Determination (R Squared)

The coefficient of determination, R2, is used to analyze how differences in


one variable can be explained by differences in a second variable. For example, the date on which a
person became pregnant is directly related to the date on which they give birth.
More specifically, R-squared gives you the percentage variation in y explained by
x-variables. The range is 0 to 1 (i.e. 0% to 100% of the variation in y can be explained by
the x-variables).
➢ Regression:
Regression is a statistical method used in finance, investing, and other disciplines
that attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).

Types

o Simple linear regression has only one x and one y variable.

o Multiple linear regression has one y and two or more x variables.

o For instance, when we predict rent based on square feet alone, that is
simple linear regression.

o When we predict rent based on square feet and the age of the building, that is
an example of multiple linear regression.

o Linear Regression Equations: A linear regression model follows a very


particular form. In statistics, a regression model is linear when all terms in
the model are one of the following:

1. The constant

2. A parameter multiplied by an independent variable (IV)

o Nonlinear regression is a form of regression analysis in which data is fit to


a model and then expressed as a mathematical function. Simple linear
regression relates two variables (X and Y) with a straight line (y = mx + b),
while nonlinear regression relates the two variables in a nonlinear (curved)
relationship.
➢ Line of regression
In statistics, you can calculate a regression line for two variables if their scatterplot
shows a linear pattern and the correlation between the variables is very strong (for
example, r = 0.98). A regression line is simply a single line that best fits the data (in
terms of having the smallest overall distance from the line to the points). Statisticians
call this technique for finding the best-fitting line a simple linear regression analysis
using the least squares method.
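
A minimal sketch of the least squares method for one predictor, using made-up square-feet and rent figures; the slope and intercept are the standard least-squares estimates:

# Fit y = a + b*x by least squares:
# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), a = y_bar - b * x_bar

sq_ft = [500, 750, 1000, 1250, 1500]        # hypothetical square feet
rent = [800, 1000, 1300, 1450, 1700]        # hypothetical monthly rent

n = len(sq_ft)
x_bar = sum(sq_ft) / n
y_bar = sum(rent) / n

b = sum((x - x_bar) * (y - y_bar) for x, y in zip(sq_ft, rent)) / \
    sum((x - x_bar) ** 2 for x in sq_ft)
a = y_bar - b * x_bar

print(f"rent = {a:.2f} + {b:.4f} * sq_ft")
print("predicted rent for 1100 sq ft:", round(a + b * 1100))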

Coefficient of Regression

The regression coefficient is the constant ‘b’ in the regression equation that tells
about the change in the value of the dependent variable corresponding to a unit change
in the independent variable.

If there are two regression equations, then there will be two regression
coefficients:

Regression Coefficient of X on Y: The regression coefficient of X on Y is
represented by the symbol bxy and measures the change in X for a unit change in Y.
Symbolically, it can be represented as:

bxy = r (σx ÷ σy)

The bxy can be obtained by using the following formula when the deviations are taken
from the actual means of X and Y (x = X − X̄, y = Y − Ȳ):

bxy = ∑xy ÷ ∑y²

When the deviations are obtained from the assumed means, the following formula is
used:

bxy = [N∑dxdy − (∑dx)(∑dy)] ÷ [N∑dy² − (∑dy)²]

Regression Coefficient of Y on X: The symbol byx is used for the coefficient that measures
the change in Y corresponding to a unit change in X. Symbolically, it can be represented as:

byx = r (σy ÷ σx)

In case the deviations are taken from the actual means, the following formula is used:

byx = ∑xy ÷ ∑x²

The byx can be calculated by using the following formula when the deviations are taken
from the assumed means:

byx = [N∑dxdy − (∑dx)(∑dy)] ÷ [N∑dx² − (∑dx)²]
Theorem of Regression Coefficients: the correlation coefficient is the geometric mean of
the two regression coefficients, i.e. r = ±√(bxy × byx), where r takes the sign common to
bxy and byx. Consequently, both regression coefficients always have the same sign and
their product can never exceed 1.
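
The following sketch (illustrative data only) computes byx and bxy from deviations taken from the actual means and then recovers r as the geometric mean of the two coefficients, as stated in the theorem above:

import math

# Hypothetical paired data
X = [10, 20, 30, 40, 50]
Y = [12, 19, 33, 38, 52]

X_bar = sum(X) / len(X)
Y_bar = sum(Y) / len(Y)
x = [xi - X_bar for xi in X]      # deviations from the mean of X
y = [yi - Y_bar for yi in Y]      # deviations from the mean of Y

sum_xy = sum(a * b for a, b in zip(x, y))
b_yx = sum_xy / sum(a ** 2 for a in x)   # regression coefficient of Y on X
b_xy = sum_xy / sum(b ** 2 for b in y)   # regression coefficient of X on Y

# Theorem: r = ±sqrt(bxy * byx), taking the common sign of the coefficients
r = math.copysign(math.sqrt(b_xy * b_yx), b_yx)
print(round(b_yx, 4), round(b_xy, 4), round(r, 4))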
➢ Index numbers:
Meaning of Index Number

• “An index number is a statistical measure, designed to measure changes in a


variable, or a group of related variables”.

• “Index number is a single ratio (or a percentage) which measures the combined
change of several variables between two different times, places or situations”

Index number expresses the relative change in price, quantity, or value compared to
a base period. An index number is used to measure changes in prices paid for raw
materials; numbers of employees and customers, annual income and profits, etc

Types of Index numbers

1. Price index numbers

• Measures general change in the prices

2. Quantity index numbers

• Measures changes in the volume of goods produced, consumed

• Helpful in studying the level of physical output in an economy

3. Value index numbers

• Measures the change in the total value of production

• Such as retail sales, profits, inventories

Terminologies

Base Year

1. The year selected for comparison or the year with which comparisons
are made

2. Denoted by suffix zero i.e, 0

Current Year

1. The year for which the comparisons are sought or required

2. Denoted by suffix one i.e, 1


• p0=Price of the commodity in the base year

• p1=Price of the commodity in the current year

• q0=Quantity of the commodity in the base year

• q1=Quantity of the commodity in the current year

• W = weight assigned to a commodity according to its relative importance in the


group.

• P01= Price index number for the current year with respect to the base year.

• P10= Price index number for the base year with respect to the current year.

• Q01= Quantity index number for the current year with respect to the base year.

• Q10= Quantity index number for the base year with respect to the current year.

• V01= Value index number for the current year with respect to the base year

Methods of constructing index numbers

Price indexes can be classified into two

1. Unweighted Indexes

• Single Price Index

• Aggregate Price Index

• Average Price Relative Index

2. Weighted Indexes

• Weighted Aggregate Price Index

• Weighted Average Price Relative Index

Simple price index

In this method, we find out the price relative of individual items and average out the
individual values. Price relative refers to the percentage ratio of the value of a variable in
the current year to its value in the year chosen as the base.
Simple Aggregative Method

It calculates the percentage ratio between the aggregate of the prices of all
commodities in the current year and the aggregate of the prices of all commodities in the
base year:

P01 = (∑P1 ÷ ∑P0) × 100

Here, ∑P1 = summation of the prices of all commodities in the current year and

∑P0 = summation of the prices of all commodities in the base year.

Weighted Average of Price Relatives Method

In this method, the price relative of each item, (p1 ÷ p0) × 100, is weighted by a value
weight W and the weighted relatives are averaged: P01 = ∑[W × (p1/p0 × 100)] ÷ ∑W.

Weighted Aggregate Method

Here different goods are assigned weight according to the quantity bought. There are
three well-known sub-methods based on the different views of economists as
mentioned below:
1. Laspeyres’ Method
Laspeyres was of the view that base year quantities must be chosen as weights. Therefore
the formula is:

P = (∑P1Q0 ÷ ∑P0Q0) × 100

Here, ∑P1Q0 = summation of prices of the current year multiplied by quantities of the base
year taken as weights, and ∑P0Q0 = summation of prices of the base year multiplied by
quantities of the base year taken as weights.

2. Paasche’s Method
Unlike the above mentioned, Paasche believed that the quantities of the current year
must be taken as weights. Hence the formula:

P= (∑P1Q1÷∑P0Q1) ×100

Here, ∑P1Q1 = summation of prices of the current year multiplied by quantities of the
current year taken as weights, and ∑P0Q1 = summation of prices of the base year multiplied
by quantities of the current year taken as weights.

3. Fisher’s Method
Fisher combined the best of both above-mentioned formulas which resulted in an ideal
method. This method uses both current and base year quantities as weights as follows:

P = √ [(∑P1Q0÷∑P0Q0) × (∑P1Q1÷∑P0Q1)] ×100

NOTE: Index number of base year is generally assumed to be 100 if not given
Fisher’s Method is an Ideal Measure
As noted, Fisher’s method uses the views of both Laspeyres and Paasche. Hence it
takes into account the prices and quantities of both years. Moreover, it is based on the
geometric mean, which is considered the best average for constructing index numbers.

However, the most important evidence for the above affirmation is that it satisfies
both the time reversal and factor reversal tests. The time reversal test checks that when we
reverse the current year to the base year and vice-versa, the product of the two indexes is
equal to unity. This confirms the working of a formula in both directions. The factor
reversal test implies that interchanging the prices and quantities does not give inconsistent
results. This proves the consistency of the formula.
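
A short Python sketch, with an invented three-commodity basket, showing how the Laspeyres, Paasche and Fisher price indexes are computed from the aggregates used in the formulas above:

import math

# Hypothetical basket: (base price, base qty, current price, current qty)
basket = [
    (10, 5, 12, 4),
    (8, 10, 9, 11),
    (15, 2, 18, 2),
]

sum_p1q0 = sum(p1 * q0 for p0, q0, p1, q1 in basket)
sum_p0q0 = sum(p0 * q0 for p0, q0, p1, q1 in basket)
sum_p1q1 = sum(p1 * q1 for p0, q0, p1, q1 in basket)
sum_p0q1 = sum(p0 * q1 for p0, q0, p1, q1 in basket)

laspeyres = sum_p1q0 / sum_p0q0 * 100                     # base year weights
paasche = sum_p1q1 / sum_p0q1 * 100                       # current year weights
fisher = math.sqrt((sum_p1q0 / sum_p0q0) * (sum_p1q1 / sum_p0q1)) * 100

print(round(laspeyres, 2), round(paasche, 2), round(fisher, 2))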

Dorbish-Bowley Price Index

The Dorbish-Bowley index takes the simple arithmetic mean of the Laspeyres and Paasche
indexes, thereby giving equal importance to base year and current year quantities:

P01 = ½ [(∑P1Q0 ÷ ∑P0Q0) + (∑P1Q1 ÷ ∑P0Q1)] × 100

Marshall-Edgeworth Price Index

The formula enunciated by Marshall and Edgeworth for constructing an index
number is known as the Marshall-Edgeworth method. In this method, they have
suggested taking the arithmetic average of the quantities of the base year and the
current year as the weights of the items. Accordingly, they have formulated the index
number as under:

P01 = [∑P1(Q0 + Q1) ÷ ∑P0(Q0 + Q1)] × 100
Walsh Price Index
The Walsh Price Index is defined as the weighted arithmetic average of the current-to-
base-period price relatives which uses the quantities of an intermediate basket as weights.
The quantities of the intermediate basket are geometric averages of the quantities of
the base and current periods. It is a symmetric index and a superlative index.

Kelley’s Price Index

Kelley suggested using fixed quantity weights q, which need not belong to either the base
year or the current year: P01 = (∑P1q ÷ ∑P0q) × 100. Because the weights remain fixed, this
method satisfies the circular test.


Tests of Consistency of Index Number

There are certain tests which are applied to verify the consistency, or adequacy, of an
index number formula from different points of view. The most popular among these are
the following tests:

• Order reversal test.

• Time reversal test.

• Factor reversal test.

• Circular test.

• Unit test.

At the outset, it should be noted that it is neither possible nor necessary for an
index-number formula to satisfy all the tests mentioned above. But, an ideal formula
should be such that it satisfies the maximum possible tests which are relevant to the
matter under study. However, the various tests cited above are explained here as under:

1. Order reversal test

This test requires that a formula of index number should be such that the value of
the index number remains the same even if the order of arrangement of the
items is reversed or altered. As a matter of fact, this test is satisfied by all the
methods of index numbers explained above.

2. Time reversal test

This test has been put forth by Prof. Irving Fisher, who proposes that a formula of
index number should be such that it turns the value of the index number into its
reciprocal when the time subscripts of the formula are reversed, i.e. 0 is made
1, and 1 is made 0. According to this proposition, if the index number of the
current period on the basis of the base period, i.e. P01, is 200, the index number of
the base period on the basis of the current period, i.e. P10, would be 50. Thus, when
the value of P01 is 2 (prices have doubled relative to the base year), the value of P10
is 1/2. As such, an index number formula, in order to satisfy this test, must prove
the following equation:

P01 x P10 = 1

3. Factor reversal test

This test has also been put forth by Prof. Irving Fisher, who proposes that a
formula of index number should be such that it permits the interchange of the
price and the quantity factors without giving inconsistent results, i.e. the two
results multiplied together should give the true value ratio, inasmuch as the product
of price and quantity is the value of a thing. Thus, for the factor reversal test, a
formula of index number should satisfy the following equation:

Price index × Quantity Index = Value Index

∴ P01 x Q01 = V01 = ∑P1Q1 ÷ ∑P0Q0

Most of the formulae of index number discussed above fail to satisfy this acid test
of consistency except that of Prof. Irving Fisher. This is the reason for which Prof.
Fisher claims his formula to be an ideal one.

4. Circular test

This test has been put forth by Westergaard and recommended by C.M. Walsh as an
extension of the time reversal test put forth by Prof. Fisher. This test requires
that an index number formula should be such that it works in a circular fashion.
This means that if an index is computed for period 1 on the base period 0, another
index is computed for period 2 on the base period 1, and a third index is computed
for period 0 on the base period 2, then the product of all these indices should be
equal to 1. Thus, a formula to satisfy the test should comply with the following
equation:

P01 × P12 × P20 = 1.

An index formula which satisfies this test enjoys the advantage of reducing the
computation work every time a change in the base year is made. This test is not
satisfied by most of the important index formulae, viz. Fisher’s, Laspeyres’,
Paasche’s, Marshall-Edgeworth’s, Dorbish and Bowley’s, etc.

However, the following three methods satisfy the test:


(i) Simple aggregative method

(ii) Weighted aggregative method

(iii) Kelley’s method.

5. Unit test

This is a common test which requires that an index number formula should be
such that the value of the index number is not affected even if the units of the
price quotations are altered, viz. price per kg converted into price per quintal or
vice versa. This test is satisfied by all the index formulae except the simple
aggregative method, under which the value of the index number changes
radically if the units of price quotations of any of the items included in the index
number are changed.

Base shifting

Base shifting refers to changing the reference base period of an index number series
from one period to another without returning to the original raw data. The shifted index
for any year is obtained as: (old index number of the year ÷ old index number of the new
base year) × 100.

➢ Theory of probability
Probability is the measure of the likelihood that an event will occur in a
Random Experiment. Probability is quantified as a number between 0 and 1, where,
loosely speaking, 0 indicates impossibility and 1 indicates certainty. The higher the
probability of an event, the more likely it is that the event will occur.

Example: A simple example is the tossing of a fair (unbiased) coin. Since the
coin is fair, the two outcomes (“heads” and “tails”) are both equally probable; the
probability of “heads” equals the probability of “tails”; and since no other outcomes are
possible, the probability of either “heads” or “tails” is 1/2 (which could also be written as
0.5 or 50%).

Terminologies

• Experiment

• Process of performing an activity

• Experimenter

• Person who performs the activity

• Outcome

• Possible result of an experiment

• Random experiment

• Conditions

1. More than one possible outcome

2. Outcomes are unpredictable

3. Experiment is repeatable

• Sample space

• Collection of all possible outcomes of a random experiment

• Denoted by “S”

• E.g.:

1. Tossing a coin

2. S = {H, T}

3. Throwing a die

4. S = {1, 2, 3, 4, 5, 6}

• Sample point

• Each element of the sample space

• Event
• Any subset of a sample space

• Denoted by capital letters of the alphabet – A, B, C, D, …

• E.g.: throwing a die

S = {1, 2, 3, 4, 5, 6}

• A = odd numbers

A = {1, 3, 5}

• B = even numbers

B = {2, 4, 6}

• Impossible event

• An impossible event is equal to the empty set ∅. This is a rule of


probability, which can formally be written as:
• P(∅) = 0.

• The impossible event, defined in this way as a set with no elements,


doesn’t correspond to any experimental result, but it is useful in some
calculations

• Sure event

• A sure event is the one that is defined to happen for sure.


• Example: Getting a head or tail while tossing a coin once is a sure event as
the outcome will definitely be either head or tail
• Thus, the probability of a sure event is 1.

• Simple or Elementary Event:

• If there be only one element of the sample space in the set representing an
event, then this event is called a simple or elementary event.

• For example; if we throw a die, then the sample space, S = {1, 2, 3, 4, 5, 6}.
Now the event of 2 appearing on the die is simple and is given by E = {2}.

• Compound Event
• If an event has more than one sample point, it is termed as a compound
event. The compound events are a little more complex than simple events.
These events involve the probability of more than one event occurring
together. The total probability of all the outcomes of a compound event is
equal to 1
• Independent events
• In probability, we say two events are independent if knowing one event
occurred doesn't change the probability of the other event. So the result of
a coin flip and the day being Tuesday are independent events; knowing it
was a Tuesday didn't change the probability of getting "heads."

• Complementary events

• Two events are said to be complementary when one event occurs if and
only if the other does not. The probabilities of two complementary events
add up to 1.
• For example, rolling a 5 or greater and rolling a 4 or less on a die are
complementary events, because a roll is 5 or greater if and only if it is
not 4 or less. The probability of rolling a 5 or greater is 2/6 = 1/3, and the
probability of rolling a 4 or less is 4/6 = 2/3. Thus, the total of their
probabilities is 1/3 + 2/3 = 1.

• Union of events

The union of events A and B, denoted A∪B, is the collection of all

outcomes that are elements of one or the other of the sets A and B, or
of both of them. It corresponds to combining descriptions of the two
events using the word “or.”

• Intersection of events

The intersection of events A and B, denoted A∩B, is the collection of all outcomes
that are elements of both of the sets A and B. It corresponds to combining
descriptions of the two events using the word “and.”

• Difference events

• Denoted by A – B

• Also called Event A but not B

• set of all those elements which are in A but not in B

• Mutually exclusive events


In statistics and probability theory, two events are mutually exclusive if they
cannot occur at the same time. The simplest example of mutually exclusive events
is a coin toss: the outcome can be either heads or tails, but both outcomes cannot
occur simultaneously.

• Mutually exhaustive events

• When a sample space S is divided into mutually exclusive events such
that their union forms the entire sample space, these events are said to be
mutually exhaustive events.
• The probability that at least one of the exhaustive events will occur is always 1.
• The intersection of mutually exclusive exhaustive events is always empty.

Approaches to Probability
• There are 5 approaches to Probability.

1. Empirical Approaches

2. Classical Approaches

3. Axiomatic Approach

4. Relative Frequency Approach

5. Subjective Approach

Empirical approach

Empirical probability is also known as an experimental probability which


refers to a probability that is based on historical data. In other words, we can simply say
that empirical probability illustrates the likelihood of an event occurring based on
historical data. In theoretical probability, we assume that the probability of occurrence
of any event is equally likely, and based on that we predict the probability of an event.
The empirical probability is obtained by dividing the number of times an event occurs
by the total number of trials.
Classical probability

Classical probability is a simple form of probability that has equal odds of


something happening. For example:

• Rolling a fair die. It’s equally likely you would get a 1, 2, 3, 4, 5, or 6.

• Selecting bingo balls. Each numbered ball has an equal chance of being
chosen.

• Guessing on a test. If you guess on a multiple-choice question with four

possible answers A, B, C and D, each choice has the same odds of being
picked (assuming you pick randomly and don’t follow a pattern).

The probability of a simple event happening is the number of times the event can
happen, divided by the number of possible events.

Axiomatic approach
Axiomatic Probability is just another way of describing the probability of an
event. As, the word itself says, in this approach, some axioms are predefined before
assigning probabilities. This is done to quantize the event and hence to ease the
calculation of occurrence or non-occurrence of the event.

Perform a random experiment whose sample space is S and P is the


probability of occurrence of any random event. This model assumes that P should be a
real-valued function with a range between 0 and 1. The domain of this function is
defined to be a power set of sample space. If all these conditions are satisfied then, the
function should satisfy the following axioms:
Axiom 1: For any given event X, the probability of that event must be greater
than or equal to 0. Thus,

0 ≤ P(X)

Axiom 2: We know that the sample space S of the experiment is the set of all the
outcomes. This means that the probability of any one outcome happening is 100
percent i.e P(S) = 1. Intuitively this means that whenever this experiment is performed,
the probability of getting some outcome is 100 percent.

P(S) = 1

Axiom 3: For the experiments where we have two outcomes A and B. If A and B are
mutually exclusive,

P(A ∪ B) = P(A) + P(B)

Addition theorem of probability


If A and B are any two events then the probability of happening of at least one of the
events is defined as P(AUB) = P(A) + P(B)- P(A∩B).

If the events are mutually exclusive then,

P(AUB) = P(A) + P(B)

Proof:

Since events are nothing but sets,

From set theory, we have

n(AUB) = n(A) + n(B)- n(A∩B).

Dividing the above equation by n(S), (where S is the sample space)

n(AUB)/ n(S) = n(A)/ n(S) + n(B)/ n(S)- n(A∩B)/ n(S)

Then by the definition of probability,

P(AUB) = P(A) + P(B)- P(A∩B).
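
The addition theorem can be checked by direct enumeration. The sketch below uses the die-throwing sample space with A = odd numbers and B = numbers greater than 3 (events chosen here only for illustration):

from fractions import Fraction

# Sample space for throwing one die
S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}        # odd number
B = {4, 5, 6}        # number greater than 3

def prob(event):
    return Fraction(len(event), len(S))

lhs = prob(A | B)                              # P(A U B)
rhs = prob(A) + prob(B) - prob(A & B)          # P(A) + P(B) - P(A ∩ B)
print(lhs, rhs, lhs == rhs)                    # 5/6 on both sides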


Conditional Probability theorem
Conditional probability is the probability of an event occurring given that
another event has already occurred. The concept is one of the quintessential concepts in
probability theory. Note that conditional probability does not state that there is always a
causal relationship between the two events, as well as it does not indicate that both
events occur simultaneously.

The concept of conditional probability is primarily related to the Bayes’
theorem, which is one of the most influential theorems in statistics.

P(A|B) = P(A ∩ B) ÷ P(B)

Where:

• P(A|B) – the conditional probability; the probability of event A occurring given


that event B has already occurred

• P(A ∩ B) – the joint probability of events A and B; the probability that both
events A and B occur

• P(B) – the probability of event B

• Conditional Probability for Independent Events

Two events are independent if the probability of the outcome of one event does
not influence the probability of the outcome of another event. Due to this reason, the
conditional probability of two independent events A and B is:

P(A|B) = P(A)

P(B|A) = P(B)

• Conditional Probability for Mutually Exclusive Events

In probability theory, mutually exclusive events are events that cannot occur
simultaneously. In other words, if one event has already occurred, the other event
cannot occur. Thus, the conditional probability of mutually exclusive events is always
zero.
P(A|B) = 0

P(B|A) = 0

Multiplication theorem of probability

The probability of the simultaneous occurrence of two events A and B is equal to the

product of the probability of one of them and the conditional probability of the other,
given that the first one has occurred. This is called the Multiplication Theorem of
probability.

• If A and B are two events associated with a random experiment, then

P(A ∩ B) = P(A)P(B/A) if P(A) ≠ 0

or, P(A ∩ B) = P(B)P(A/B) if P(B) ≠ 0

• If A and B are two events with positive probabilities ( P(A) ≠ 0, P(B) ≠ 0),
then A and B are independent if and only if

P(A ∩ B) = P(A) . P(B)
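
A small enumeration example (events chosen for illustration) showing the conditional probability formula and the multiplication theorem on the die-throwing sample space:

from fractions import Fraction

# Throwing one die: A = even number, B = number greater than 3
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {4, 5, 6}

P = lambda E: Fraction(len(E), len(S))

p_a_given_b = P(A & B) / P(B)             # P(A|B) = P(A ∩ B) / P(B)
print(p_a_given_b)                         # 2/3

# Multiplication theorem: P(A ∩ B) = P(B) * P(A|B)
print(P(A & B) == P(B) * p_a_given_b)      # True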

Counting rules in probability

1. Counting Rule for Multiple-Steps Experiments

If an experiment consists of a sequence of k steps in which there are n1 possible

results for the first step, n2 possible results for the second step, and so on, then
the total number of experimental outcomes is given by (n1)(n2) . . . (nk)

2. Permutation

• Often, we want to count all of the possible ways that a single set of objects
can be arranged. For example, consider the letters X, Y, and Z. These letters
can be arranged a number of different ways (XYZ, XZY, YXZ, etc.) Each of
these arrangements is a permutation.

• A permutation is an arrangement of all or part of a set of objects, with


regard to the order of the arrangement. This means that XYZ is considered a
different permutation than ZYX.

• The number of permutations of n objects taken r at a time is denoted by nPr.


3. Combination

• Sometimes, we want to count all of the possible ways that a single set of
objects can be selected - without regard to the order in which they are
selected.

• In general, n objects can be arranged in n(n - 1)(n - 2) ... (3)(2)(1) ways. This
product is represented by the symbol n!, which is called n factorial. (By
convention, 0! = 1.)

• A combination is a selection of all or part of a set of objects, without regard


to the order in which they were selected. This means that XYZ is considered
the same combination as ZYX.

• The number of combinations of n objects taken r at a time is denoted by


nCr.
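
Python's standard library exposes both counts directly; the sketch below evaluates nPr and nCr for n = 5, r = 3, together with the multi-step counting rule for a coin toss followed by a die throw (numbers chosen for illustration):

import math

# nPr = n! / (n - r)!   and   nCr = n! / (r! * (n - r)!)
n, r = 5, 3

print(math.perm(n, r))   # 60 ordered arrangements (permutations)
print(math.comb(n, r))   # 10 unordered selections (combinations)

# Counting rule for multi-step experiments: (n1)(n2)...(nk)
# e.g. tossing a coin (2 outcomes) and then throwing a die (6 outcomes)
print(2 * 6)             # 12 experimental outcomes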

Theoretical distribution in probability

• A random variable is assumed as a model for a theoretical distribution, and


the probabilities are given by a function of the random variable called the
probability function. For example, if we toss a fair coin, the probability of
getting a head is 1/2. If we toss it 50 times, the expected number of heads
is 25. We call this the theoretical or expected frequency of heads. But
actually, by tossing a coin, we may get 25, 30 or 35 heads, which we call
the observed frequency. Thus, the observed frequency and the expected
frequency may be equal or may differ from each other due to fluctuation
in the experiment.
Types of Theoretical Distribution

• Binomial Distribution

• Poisson distribution

• Normal distribution or Expected Frequency distribution

Binomial Distribution:

The prefix ‘Bi’ means two or twice. A binomial distribution can be understood as
the probability distribution of a trial with two and only two outcomes. It is a type of
distribution that has two different outcomes, namely ‘success’ and ‘failure’. Also, it is
applicable to discrete random variables only.

Thus, the binomial distribution summarizes the number of successes in a fixed
number of trials of a survey or experiment. Its assumptions are that each trial has only
two possible outcomes and that the probability of success is the same in every trial.

The three different criteria of binomial distributions are:

• The number of the trial or the experiment must be fixed.

• Every trial is independent. None of your trials should affect the possibility of the
next trial.

• The probability always stays the same and equal: the probability of success is the
same for every trial.

• Symbolically

P(x) = nCx p^x q^(n-x)

Where,

n= total number of trials

x= number of success

n-x=number of failure

p=probability of success

q=probability of failure
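
A minimal sketch of the binomial probability function P(x) = nCx p^x q^(n-x), applied to the illustrative case of 3 heads in 5 tosses of a fair coin:

import math

def binomial_pmf(x, n, p):
    # P(X = x) = nCx * p^x * q^(n - x), with q = 1 - p
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Probability of exactly 3 heads in 5 tosses of a fair coin
print(round(binomial_pmf(3, 5, 0.5), 4))            # 0.3125

# The probabilities over all possible x sum to 1
print(sum(binomial_pmf(x, 5, 0.5) for x in range(6)))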
Poisson Distribution :

The Poisson distribution is a theoretical discrete probability distribution that is


very useful in situations where events occur at random over a continuous interval of
time or space. The Poisson distribution is used to determine the probability of exactly x
successes taking place in unit time. Let us now discuss the Poisson model.

At first, we divide the time into n small intervals, such that n → ∞, and let p
denote the probability of success in each interval; since we have divided the time into
infinitely small intervals, p → 0. The result is that, in this condition, n × p = λ (a finite
constant).

Properties of Poisson Model :

• The event or success is something that can be counted in whole numbers.

• The probability of having success in a time interval is independent of any of its


previous occurrence.

• The average frequency of successes in a unit time interval is known.

• The probability of more than one success in unit time is very low.
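
As an illustration, the sketch below evaluates the standard Poisson probability function P(X = x) = e^(-λ) λ^x / x! (not written out above) for a made-up arrival rate of λ = 2 events per minute:

import math

def poisson_pmf(x, lam):
    # P(X = x) = e^(-lambda) * lambda^x / x!
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Hypothetical example: calls arrive at an average rate of lambda = 2 per
# minute; probability of exactly 3 calls in a given minute
print(round(poisson_pmf(3, 2), 4))   # about 0.1804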

Normal Distribution :
The Normal Distribution defines a probability density function f(x) for the
continuous random variable X considered in the system. The random variables which
follow the normal distribution are ones whose values can assume any known value in a
given range.

We can hence extend the range from – ∞ to + ∞. Continuous variables are such


random variables, and thus the normal distribution gives you the probability of your value
being in a particular range for a given trial. The normal distribution is very important in
statistical analysis due to the central limit theorem.

The theorem states that the distribution of the sample mean (or sum) tends to the
normal distribution when the number of observations is sufficiently large. For instance,
the binomial distribution tends to the normal distribution with mean np and variance npq.

Properties of Normal Distribution :

• Its shape is symmetric.

• The mean and median are the same and lie in the middle of the distribution

• Its standard deviation measures the distance on the distribution from the mean
to the inflection point (the place where the curve changes from an “upside-down-
bowl” shape to a “right-side-up-bowl” shape).

• Because of its unique bell shape, probabilities for the normal distribution follow
the Empirical Rule, which says the following:

• About 68 percent of its values lie within one standard deviation of the mean. To
find this range, take the value of the standard deviation, then find the mean plus
this amount, and the mean minus this amount.

• About 95 percent of its values lie within two standard deviations of the mean.

• Almost all of its values lie within three standard deviations of the mean.

Standard normal distribution


The standard normal distribution is a special case of the normal distribution. For
the standard normal distribution, the value of the mean is equal to zero (μ=0), and the
value of the standard deviation is equal to 1 (σ=1).

Basic Properties of the Standard Normal Curve

The standard normal curve is a special case of the normal distribution, and thus as
well a probability distribution curve. Therefore basic properties of the normal
distribution hold true for the standard normal curve as well

• The total area under the standard normal curve is 1 (this property is shared by all
density curves).

• The standard normal curve extends indefinitely in both directions, approaching,

never touching, the horizontal axis as it does so.

• The standard normal curve is bell shaped and is centered at z = 0. Almost all the
area under the standard normal curve lies between z = −3 and z = 3.
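
The empirical rule and the z-score transformation can be checked numerically with the standard library's NormalDist; the observation x = 72 with μ = 60 and σ = 8 is invented for illustration:

from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)       # the standard normal distribution

# Empirical rule: area within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    area = z.cdf(k) - z.cdf(-k)
    print(k, round(area, 4))        # ~0.6827, ~0.9545, ~0.9973

# Standardising an observation: z = (x - mu) / sigma
x, mu, sigma = 72, 60, 8
print((x - mu) / sigma)             # z-score of 1.5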

➢ Sampling theory
• Population: Complete collection of data under consideration in a statistical study

• Sample : The portion of the population selected for analysis

• Sampling : Process of selecting sample from population

• Parameter : Statistical constants of the population (constant)

• Statistic : Statistical constants of the sample (Variable)

Sampling Methods

• Classified into two

1. Probability Sampling Method

2. Non-probability Sampling Method

Probability Sampling Method


Probability sampling means that every member of the population has a chance of
being selected. It is mainly used in quantitative research. If you want to produce results
that are representative of the whole population, probability sampling techniques are the
most valid choice.

• Different types

1. Simple random sampling

2. Systematic random sampling

3. Stratified random sampling

4. Cluster sampling

5. Multi stage sampling

1. Simple random sampling


In a simple random sample, every member of the population has an equal chance
of being selected. Your sampling frame should include the whole population.

To conduct this type of sampling, you can use tools like random number
generators or other techniques that are based entirely on chance.

Example

You want to select a simple random sample of 100 employees of Company X.


You assign a number to every employee in the company database from 1 to
1000, and use a random number generator to select 100 numbers.
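
A minimal sketch of this selection step, assuming the hypothetical frame of 1000 employee IDs described above; the seed is fixed only so the illustration is repeatable:

import random

# Sampling frame: every employee ID from 1 to 1000 (hypothetical Company X)
frame = list(range(1, 1001))

random.seed(42)                      # fixed seed, purely for reproducibility
sample = random.sample(frame, 100)   # 100 IDs drawn without replacement;
                                     # each employee is equally likely to be chosen
print(sample[:10])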

2. Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly easier
to conduct. Every member of the population is listed with a number, but instead of
randomly generating numbers, individuals are chosen at regular intervals.

Example

All employees of the company are listed in alphabetical order. From the first 10
numbers, you randomly select a starting point: number 6. From number 6
onwards, every 10th person on the list is selected (6, 16, 26, 36, and so on), and
you end up with a sample of 100 people.
If you use this technique, it is important to make sure that there is no hidden
pattern in the list that might skew the sample. For example, if the HR database
groups employees by team, and team members are listed in order of seniority,
there is a risk that your interval might skip over people in junior roles, resulting in
a sample that is skewed towards senior employees.

3. Stratified sampling
Stratified sampling involves dividing the population into subpopulations that may
differ in important ways. It allows you to draw more precise conclusions by ensuring that
every subgroup is properly represented in the sample.

To use this sampling method, you divide the population into subgroups (called
strata) based on the relevant characteristic (e.g. gender, age range, income bracket, job
role).

Based on the overall proportions of the population, you calculate how many
people should be sampled from each subgroup. Then you use random or systematic
sampling to select a sample from each subgroup.

Example

The company has 800 female employees and 200 male employees. You want to
ensure that the sample reflects the gender balance of the company, so you sort
the population into two strata based on gender. Then you use random sampling
on each group, selecting 80 women and 20 men, which gives you a
representative sample of 100 people.

4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each
subgroup should have similar characteristics to the whole sample. Instead of sampling
individuals from each subgroup, you randomly select entire subgroups.

If it is practically possible, you might include every individual from each sampled
cluster. If the clusters themselves are large, you can also sample individuals from within
each cluster using one of the techniques above. This is called multistage sampling.

This method is good for dealing with large and dispersed populations, but there
is more risk of error in the sample, as there could be substantial differences between
clusters. It’s difficult to guarantee that the sampled clusters are really representative of
the whole population.
Example

The company has offices in 10 cities across the country (all with roughly the same
number of employees in similar roles). You don’t have the capacity to travel to
every office to collect your data, so you use random sampling to select 3 offices –
these are your clusters.

5. Multistage sampling

Multistage sampling is defined as a sampling method that divides the population


into groups (or clusters) for conducting research. It is a complex form of cluster
sampling, sometimes, also known as multistage cluster sampling. During this sampling
method, significant clusters of the selected people are split into sub-groups at various
stages to make it simpler for primary data collection.

Non-probability sampling methods

In a non-probability sample, individuals are selected based on non-random


criteria, and not every individual has a chance of being included.

This type of sample is easier and cheaper to access, but it has a higher risk of
sampling bias. That means the inferences you can make about the population are
weaker than with probability samples, and your conclusions may be more limited. If you
use a non-probability sample, you should still aim to make it as representative of the
population as possible.

Non-probability sampling techniques are often used in exploratory and qualitative


research. In these types of research, the aim is not to test a hypothesis about a broad
population, but to develop an initial understanding of a small or under-researched
population.

• Types

1. Convenience Sampling

2. Judgement Sampling

3. Quota Sampling

4. Purposive Sampling

5. Snowball Sampling
1. Convenience sampling

A convenience sample simply includes the individuals who happen to be most


accessible to the researcher. This is an easy and inexpensive way to gather initial data,
but there is no way to tell if the sample is representative of the population, so it can’t
produce generalizable results.

Example

You are researching opinions about student support services in your university,
so after each of your classes, you ask your fellow students to complete a survey
on the topic. This is a convenient way to gather data, but as you only surveyed
students taking the same classes as you at the same level, the sample is not
representative of all the students at your university.

2. Purposive sampling

This type of sampling, also known as judgement sampling, involves the


researcher using their expertise to select a sample that is most useful to the purposes of
the research. It is often used in qualitative research, where the researcher wants to gain
detailed knowledge about a specific phenomenon rather than make statistical
inferences, or where the population is very small and specific. An effective purposive
sample must have clear criteria and rationale for inclusion.

Example

You want to know more about the opinions and experiences of disabled students
at your university, so you purposefully select a number of students with different
support needs in order to gather a varied range of data on their experiences with
student services.

3. Snowball sampling

If the population is hard to access, snowball sampling can be used to recruit


participants via other participants. The number of people you have access to “snowballs”
as you get in contact with more people.

Example

You are researching experiences of homelessness in your city. Since there is no


list of all homeless people in the city, probability sampling isn’t possible. You
meet one person who agrees to participate in the research, and she puts you in
contact with other homeless people that she knows in the area.

4. Judgmental Sampling

Judgmental sampling, also called authoritative sampling, is a non-probability


sampling technique in which the sample members are chosen only on the basis of the
researcher’s knowledge and judgment. As the researcher’s knowledge is instrumental in
creating a sample in this sampling technique, there are chances that the results obtained
will be highly accurate with a minimum margin of error.

The process of selecting a sample using judgmental sampling involves the


researchers carefully picking and choosing each individual to be a part of the sample.
The researcher’s knowledge is primary in this sampling process as the members of the
sample are not randomly chosen.

5. Quota sampling

Quota sampling is defined as a non-probability sampling method in which


researchers create a sample involving individuals that represent a population.
Researchers choose these individuals according to specific traits or qualities. They
decide and create quotas so that the market research samples can be useful in
collecting data. These samples can be generalized to the entire population. The final
subset will be decided only according to the interviewer’s or researcher’s knowledge of
the population.

For example, a cigarette company wants to find out what age group prefers what
brand of cigarettes in a particular city. The researcher applies quotas on the age groups of
21-30, 31-40, 41-50, and 51+. From this information, the researcher gauges the smoking
trend among the population of the city.

Sampling and non-sampling errors

Basis for comparison – sampling error vs. non-sampling error:

• Meaning: A sampling error occurs because the selected sample does not perfectly
represent the population of interest. A non-sampling error is an error arising from
sources other than sampling while conducting survey activities.

• Cause: Sampling error – deviation between the sample mean and the population
mean. Non-sampling error – deficiency in the collection and analysis of data.

• Type: Sampling error is random. Non-sampling error may be random or non-random.

• Occurs: Sampling error occurs only when a sample is selected. Non-sampling error
occurs both in sample surveys and in a census.

• Sample size: The possibility of sampling error is reduced with an increase in sample
size. Non-sampling error has nothing to do with sample size.

Estimation
Estimation, in statistics, is any of numerous procedures used to calculate the value of
some property of a population from observations of a sample drawn from the
population. A point estimate, for example, is the single number most likely to express
the value of the property. An interval estimate defines a range within which the value of
the property can be expected (with a specified degree of confidence) to fall.

1. Estimation

• A procedure by which a numerical value is assigned to a population


parameter based on the value of the corresponding sample statistic.
2. Estimator

• The sample statistic used to estimate the population parameter

3. Estimate

• The value assigned to a population parameter based on the value of


statistic

Point Estimate vs. Interval Estimate

Statisticians use sample statistics to estimate population parameters. For


example, sample means are used to estimate population means; sample proportions, to
estimate population proportions.

An estimate of a population parameter may be expressed in two ways:

Point estimate. A point estimate of a population parameter is a single value of a


statistic. For example, the sample mean x̄ is a point estimate of the population mean μ.
Similarly, the sample proportion p is a point estimate of the population proportion P.

Interval estimate. An interval estimate is defined by two numbers, between which


a population parameter is said to lie. For example, a < x < b is an interval estimate of the
population mean μ. It indicates that the population mean is greater than a but less than
b.

Confidence level
The width of the confidence interval tells us how certain (or uncertain) we are
about the true figure in the population. This width is stated as a plus or minus (for
example, +/- 3) and is called the margin of error. When the interval and confidence level
are put together, you get a spread of percentages. For instance, with a survey estimate of
38 percent and a margin of error of 3, you would expect the true result to lie between
35 (38 - 3) and 41 (38 + 3) percent, 95% of the time.
Confidence Coefficient (1-α)
The confidence coefficient is the confidence level stated as a proportion, rather
than as a percentage. For example, if you had a confidence level of 99%, the confidence
coefficient would be .99.

In general, the higher the coefficient, the more certain you are that your results
are accurate. For example, a .99 coefficient is more accurate than a coefficient of .89. It’s
extremely rare to see a coefficient of 1 (meaning that you are positive without a doubt
that your results are completely, 100% accurate). A coefficient of zero means that you
have no faith that your results are accurate at all.

Confidence coefficient (1 – α) Confidence level (1 – α * 100%)

0.90 90 %

0.95 95 %

0.99 99 %
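
As an illustration, a large-sample confidence interval for a mean can be computed as x̄ ± z · s/√n; the sample summary figures below are invented, and the z value is read off the standard normal distribution for the chosen confidence level:

from statistics import NormalDist
import math

# Hypothetical large sample of 100 observations, summarised by its mean
# and standard deviation (values are illustrative only)
n = 100
x_bar = 52.0
s = 8.0

conf_level = 0.95
z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # ~1.96 for 95%

margin = z * s / math.sqrt(n)
print(f"{conf_level:.0%} interval: {x_bar - margin:.2f} to {x_bar + margin:.2f}")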

Level of Significance

The level of significance is defined as the fixed probability of wrongly rejecting


the null hypothesis when it is in fact true. The level of significance is the probability of a
type I error and is preset by the researcher keeping in view the consequences of such an
error. The level of significance is the measurement of statistical significance. It defines
whether the null hypothesis is assumed to be accepted or rejected. It is expected to
identify if the result is statistically significant enough for the null hypothesis to be
rejected.
The level of significance is denoted by the Greek symbol α (alpha). Therefore, the
level of significance is defined as follows:

Significance Level = p (type I error) = α

Hypothesis testing

A hypothesis is an educated guess about something in the world around you. It


should be testable, either by experiment or observation. Hypothesis testing in statistics
is a way for you to test the results of a survey or experiment to see if you have
meaningful results. You’re basically testing whether your results are valid by figuring out
the odds that your results have happened by chance. If your results may have happened
by chance, the experiment won’t be repeatable and so has little use.

i. Figure out your null hypothesis,

ii. State your null hypothesis,

iii. Choose what kind of test you need to perform,

iv. Either support or reject the null hypothesis.

The null hypothesis is always the accepted fact. The assumption of a statistical test is
called the null hypothesis, or hypothesis 0 (H0 for short). It is often called the default
assumption, or the assumption that nothing has changed.

A violation of the test’s assumption is often called the first hypothesis,


hypothesis 1 or H1 for short. H1 is really a short hand for “some other hypothesis,” as
all we know is that the evidence suggests that the H0 can be rejected.

• Hypothesis 0 (H0): Assumption of the test holds and is failed to be rejected at


some level of significance.

• Hypothesis 1 (H1): Assumption of the test does not hold and is rejected at some
level of significance.

Before we can reject or fail to reject the null hypothesis, we must interpret the result of
the test.

• A type I error (α error) (false-positive) occurs if an investigator rejects a null


hypothesis that is actually true in the population;
• A type II error (β error) (false-negative) occurs if the investigator fails to reject a
null hypothesis that is actually false in the population.

➢ Power of a test (1 – β)
Statistical power, or the power of a hypothesis test, is the probability that the test
correctly rejects the null hypothesis when it is false. That is, it is the probability of a true
positive result. The higher the statistical power for a given experiment, the lower the
probability of making a Type II (false negative) error, and the higher the probability of
detecting an effect when there is an effect. In fact, the power is precisely one minus the
probability of a Type II error: power = 1 – β.

• Low Statistical Power: Large risk of committing Type II errors, e.g. a false negative.

• High Statistical Power: Small risk of committing Type II errors.

Tests
A test statistic is used in a hypothesis test when you are deciding to support or
reject the null hypothesis. The test statistic takes your data from an experiment or survey
and compares your results to the results you would expect from the null hypothesis.

Test Statistics and P-Values

When you run a hypothesis test, you’ll use a distribution like a t-distribution or
normal distribution. These have a known area, and enable you to calculate a
probability value (p-value) that will tell you if your results are due to chance, or if your
results are due to your theory being correct. The larger the test statistic, the smaller the
p-value and the more likely you are to reject the null hypothesis.
➢ Parametric Tests
The basic principle behind the parametric tests is that we have a fixed set of
parameters that are used to determine a probabilistic model that may be used in
Machine Learning as well.

Parametric tests are those tests for which we have prior knowledge of the
population distribution (i.e, normal), or if not then we can easily approximate it to a
normal distribution which is possible with the help of the Central Limit Theorem.

Parameters for using the normal distribution are –

• Mean

• Standard Deviation

Types

• T-test

• Z-test

• F-test

• ANOVA

Non-parametric Tests
In non-parametric tests, we don’t make any assumption about the parameters of
the population we are studying.
Hence, there is no fixed set of parameters available, and no distribution (normal
distribution, etc.) of any kind is assumed. This is also the reason that non-parametric
tests are also referred to as distribution-free tests.

Types

• Chi-square

• Mann-Whitney U-test

• Kruskal-Wallis H-test

➢ T-Test
1. It is a parametric test of hypothesis testing based on Student’s T distribution.

2. It essentially tests the significance of the difference of the mean values when the
sample size is small (i.e., less than 30) and when the population standard deviation is not
available.

3. Assumptions of this test:

• Population distribution is normal, and

• Samples are random and independent

• The sample size is small.

• Population standard deviation is not known.

• Mann-Whitney ‘U’ test is a non-parametric counterpart of the T-test.

One Sample T-test: To compare a sample mean with the population mean.

t = (x̄ − μ) ÷ (s / √n)

where,

• x̄ is the sample mean

• s is the sample standard deviation

• n is the sample size

• μ is the population mean

Two-Sample T-test: To compare the means of two different samples.

t = (x̄1 − x̄2) ÷ √(s1²/n1 + s2²/n2)

• x̄1 is the sample mean of the first group

• x̄2 is the sample mean of the second group

• s1 is the standard deviation of sample 1

• s2 is the standard deviation of sample 2

• n1 and n2 are the sample sizes

Conclusion:

• If the value of the test statistic is greater than the table value -> Rejects the null
hypothesis.

• If the value of the test statistic is less than the table value -> Do not reject the
null hypothesis.
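
A worked sketch of the one-sample t statistic t = (x̄ − μ) ÷ (s/√n) on an invented small sample of packet weights; the comparison with the tabulated t value is left as described above:

import math
from statistics import mean, stdev

# Hypothetical small sample: weights (kg) of 8 packets; H0: population mean = 5.0
sample = [4.8, 5.1, 4.9, 5.0, 4.7, 5.2, 4.6, 4.9]
mu_0 = 5.0

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)                    # sample standard deviation (n - 1 divisor)

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
print(round(t_stat, 3), "with", n - 1, "degrees of freedom")
# Compare |t| with the tabulated t value for n - 1 d.f. at the chosen
# significance level; reject H0 if |t| exceeds the table value.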

➢ Z-Test
1. It is a parametric test of hypothesis testing.

2. It is used to determine whether the means are different when the population variance
is known and the sample size is large (i.e, greater than 30).

3. Assumptions of this test:

• Population distribution is normal

• Samples are random and independent.


• The sample size is large.

• Population standard deviation is known.

A Z-test can be:

One Sample Z-test: To compare a sample mean with the population mean.

z = (x̄ − μ) ÷ (σ / √n)

Two Sample Z-test: To compare the means of two different samples.

z = (x̄1 − x̄2) ÷ √(σ1²/n1 + σ2²/n2)

where,

• x̄1 is the sample mean of the 1st group

• x̄2 is the sample mean of the 2nd group

• σ1 is the population-1 standard deviation

• σ2 is the population-2 standard deviation

• n1 and n2 are the sample sizes

➢ F-Test
1. It is a parametric test of hypothesis testing based on Snedecor F-distribution.
2. It is a test for the null hypothesis that two normal populations have the same
variance.

3. An F-test is regarded as a comparison of equality of sample variances.

4. F-statistic is simply a ratio of two variances.

5. By changing the variance in the ratio, F-test has become a very flexible test. It
can then be used to:

• Test the overall significance for a regression model.

• To compare the fits of different models and

• To test the equality of means.

6. Symbolically, F = S1² ÷ S2², where the larger of the two sample variances is
taken in the numerator.

7. Assumptions of this test:

• Population distribution is normal, and

• Samples are drawn randomly and independently.

➢ ANOVA
1. Also called as Analysis of variance, it is a parametric test of hypothesis testing.

2. It is an extension of the T-Test and Z-test.

3. It is used to test the significance of the differences in the mean values among
more than two sample groups.

4. It uses F-test to statistically test the equality of means and the relative variance
between them.

5. Assumptions of this test:

• Population distribution is normal, and

• Samples are random and independent.

• Homogeneity of sample variance.

6. One-way ANOVA and Two-way ANOVA are its types.

7. F-statistic = variance between the sample means/variance within the sample


➢ Chi-Square Test
1. It is a non-parametric test of hypothesis testing.

2. As a non-parametric test, chi-square can be used:

• Test of goodness of fit.

• As a test of independence of two variables.

3. It helps in assessing the goodness of fit between a set of observed and those
expected theoretically.

4. It makes a comparison between the expected frequencies and the observed


frequencies.

5. The greater the difference, the greater is the value of chi-square.

6. If there is no difference between the expected and observed frequencies, then the
value of chi-square is equal to zero.

7. It is also known as the “Goodness of fit test” which determines whether a particular
distribution fits the observed data or not.

8. Chi-square is also used to test the independence of two variables.

9. Conditions for chi-square test:

• Randomly collect and record the Observations.

• In the sample, all the entities must be independent.

• None of the groups should contain very few items, say fewer than 10.

• The overall number of items should be reasonably large. Normally, it should be at
least 50, however small the number of groups may be.

10. The value of chi-square is computed as χ² = ∑[(O − E)² ÷ E], where O and E denote
the observed and expected frequencies respectively.

11. Chi-square as a parametric test is used as a test for population variance based on
sample variance.

12. If we take each one of a collection of sample variances, divide them by the known
population variance and multiply these quotients by (n-1), where n means the number
of items in the sample, we get the values of chi-square.
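
A short goodness-of-fit sketch with invented frequencies from 120 throws of a die, computing χ² = ∑(O − E)²/E and its degrees of freedom:

# Chi-square goodness of fit: sum over categories of (O - E)^2 / E.
# Hypothetical example: 120 throws of a die, testing whether it is fair.
observed = [22, 17, 20, 26, 22, 13]
expected = [120 / 6] * 6             # 20 expected in each category if fair

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2), "with", len(observed) - 1, "degrees of freedom")
# Compare with the tabulated chi-square value for 5 d.f. at the chosen
# significance level; a larger statistic means a worse fit between the
# observed and expected frequencies.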
