
MAHARSHI DAYANAND UNIVERSITY

Master of Business Administration


1st Semester

BUSINESS STATISTICS
AND ANALYTICS

Study Notes in Easy Language

CREATED BY: RATHOUR STUDY POINT



MDU MBA 1st SEMESTER


BUSINESS STATISTICS & ANALYTICS
INDEX
SECTION-I

Introduction :
 Definition
 Role and Application
 Measures of Central Tendencies and their
application
 Measure of Dispersion
 Range
 Quartile Deviation
 Standard Deviations
 Coefficient of Variation and mean deviation
 Skewness and Kurtosis
SECTION-II

Correlation:
 Meaning and Types of Correlation
 Positive Correlation
 Negative Correlation

 Linear and nonlinear Correlation


 Scatter Diagram
 Karl Pearson’s Coefficient of Correlation
 Properties of correlation coefficient
 Probable error of correlation coefficient
 Multiple and partial correlation coefficient
Regression:
 Meaning and types
 Simple regression
 Multiple regression
 Linear and non linear regression
 Regression lines
 Properties of regression

SECTION-III
Time series:
 Introduction
 Objective and identification of trends
 Variation in time
 Secular variation
 Cyclical variation
 Seasonal and irregular variation

 Methods of estimation of trends


 Moving average and least square method
Index number:
 Definition
 Uses
 Types
 Simple aggregate method and weight
aggregate method
 Laspeyres’, Paasche’s, Fisher’s indices and CPI
 Construction of index numbers and their
uses

SECTION-IV
Sampling:
 Meaning and basic sampling concept
 Sampling and non sampling errors
Hypothesis testing
 Formulation and procedure for testing a
hypothesis
 Large and small sample test
 Z, t, F tests and ANOVA (one way)
Non parametric test:
 Chi Square test
 Sign Test
 Kruskal Wallis Test

Concept of business Analytics :



 Meaning
 Types
 Application of business analytics

SECTION-I
INTRODUCTION
Definition:

What Is Business Statistics?


Business statistics refers to a method that involves the application of
statistics to get valuable insights from the data or information
available to a company. It helps businesses understand their present
situation and make future projections. Thus, this method plays a
crucial role in helping companies make crucial decisions.


All experienced managers make vital business growth decisions based


on this technique. This is because this method helps organizations
spot overall industry trends. Moreover, companies use it in human
resource and production planning besides finance. Based on the
statistical technique utilized, statistics in business can be of two types
— inferential and descriptive.

Business statistics refers to using different statistical techniques and


tools in a business setting. Companies use this method to make
forecasts, test correlations, and describe data. Moreover,
organizations can take informed business decisions with the help of
this concept.
The two types of business statistics are inferential and descriptive.
There are various advantages of business statistics. A noteworthy one
is that it can improve performance management.
A key difference between business statistics and analytics is that the
latter explores events and explanations while the former compares
them and assigns weights to the various explanations.
Business Statistics Explained
Business statistics meaning refers to the method of utilizing statistics
for analyzing an organization’s data. The primary goal of this
technique is data collection, enabling managers to assess past
performance, predict future business practices, and carry out an
organization’s operations profitably. In other words, it serves as the
basis for price determination, market trends, risk navigation, changing
consumer behavior, sales prediction, etc.

With the help of principles and statistical techniques, organizations


can utilize data to make decisions based on fundamental values
instead of intuitions. That said, one must remember that managers
require different skills, such as quality control, forecasting, personnel
management, product planning, and market research, to make
decisions. Also, one should remember that businesses generally

accumulate data from experiments, surveys, or any other information


system within an organization.

Individuals can utilize this method to determine the viability of an


organization’s business proposition. Also, it can help companies know
whether a specific marketing campaign was able to entice more
consumers or not. This way, they can plan campaigns better in the
future.

Types
Let us understand the different types of this method in detail.

#1 – Descriptive Statistics
This method involves summarizing substantial data into different bits
of information in a meaningful and useful manner. It uses different
statistical tools, such as tables, charts, and graphs, to describe a
specific phenomenon or make generalizations.

This method looks into what happened and clarifies the reason
behind it. Managers can use historical information to check the
mistakes and achievements in the past. The use of descriptive
statistics is common in operations, finance, and marketing.

Some sub-categories of this method are measures of central


tendency, frequency measurement, and measures of variation or
dispersion.

#2 – Inferential Statistics
Not every generalization made using descriptive statistics needs to be
true. Hence, individuals utilize this method to test whether the
generalizations are valid. It involves assessing the validity and
estimating facts and figures to make business decisions.

In inferential statistics, individuals utilize sample data to solve their


research-related problems. A few sub-categories of inferential
statistics are as follows:

Pearson Correlation Coefficient


Bivariate Regression
Confidence Interval
ANOVA (Analysis of Variance)
Multivariate Regression
Applications
Let us look at some applications of statistics in business.

Production: Businesses can use this method to determine how much


of what to produce and when.
Accounting: The use of statistical data is common in accounting,
particularly in the auditing function, where estimation and sampling
techniques are widely utilized.
Research And Development: Various large companies have their R&D
or research and development departments. The primary objective of
such departments is to determine how to enhance existing products’
quality and what new offerings can be added to the portfolio.
Carrying out a worthwhile R&D program is almost impossible without
statistical data.
Economics: Statistical methods and data play a crucial role in helping
one comprehend economic issues and economic policies formulation.
Individuals can quantify most economic indicators and phenomena
and deal with them using statistically sound logic.
Human Resource (HR) Management And Development: HR
departments are responsible for developing rating systems, assessing
performance, evolving training and compensatory rewards, etc. Such
functions involve storing, accumulating, analyzing, retrieving
substantial data, and designing forms. One can perform all such
functions effectively and efficiently by utilizing statistics.
Marketing: In this field, statistical analysis provides information that
influences decision-making. One must note that any attempt to carve

out a niche in a new market must depend on a skillful and thorough


analysis of the data related to the workforce, purchasing power,
transportation costs, consumer habits, and production.
Examples
Let us look at a few business statistics examples to understand the
concept better.

Example #1
Suppose a software company, ABC, looks at their customers’ mean
spending on the mobile-based application offered by them, the mode
of the products purchased, and the median spending for each
customer. Although, at first glance, this might appear to be
overlapping, the three figures individually show a different aspect of
the organization.

From customers’ mean spending, the organization can look to raise


their offering’s value to convince the customers to purchase more
and thus improve the revenue.

For mode, the organization can determine what function is


appreciated the most by users and promote that particular feature to
more customers. Alternatively, the company can introduce new
features to amplify the existing features already liked by their
customers.

The business can learn about customers’ spending habits by observing


the median. For instance, if the median is significantly below the
mean, it indicates that the majority of the customers are spending a
small amount while a small section of the company’s customer base is
spending most of the money. Thus, business statistics can help the
organization introduce new features that customers having a low
budget can enjoy and help improve revenue.

Example #2

Suppose a construction company, DBC, wants to determine whether


the execution of a new project is worth it and whether it can earn
decent profits from the investments. Considering that the prices of
different construction materials and the finished building vary over
time, managers of the organization can utilize business statistics to
estimate the profit it can earn from the project. Based on the
estimate, the businesses can decide whether executing the project
would be prudent.

Importance
One can understand the importance of this concept by going through
the following points.

Statistical analysis allows organizations to quantify an organization’s


performance and identify patterns. This enables a business’s
managers to take informed decisions based on facts rather than
intuitions.
Another key advantage of business statistics is that it helps companies
in performance management. For instance, it enables managers to
know whether the employees are meeting their productivity
requirements. This, in turn, allows the managers to take the required
actions to support the employees performing below the level of
expectation.
Businesses also use statistics to predict whether the market will react
positively or negatively to a new offering. This is vital before investing
in the development of a new product.
This method helps identify a relationship between multiple variables
and their impact on each other, for example, the impact of
advertising on sales.
Limitations
The limitations of this concept are as follows:

When utilizing statistics as a diagnostic tool for a business, managers


may suffer from outcome bias.

Another disadvantage is that individuals have a tendency to


inaccurately determine the impact of sample size if it is small.
In businesses, statistical tests are often carried out from a frequentist
approach. This might not represent the questions asked.
Business Statistics vs Statistics vs Business Analytics
People new to the business world are often confused regarding
statistics, business statistics, and analytics. Clearing the confusion is
crucial to make judicious decisions and propel business growth. In
that regard, understanding the key differences between these
concepts is essential. So, let us find out how they differ.

Business Statistics: This refers to the method of applying statistical techniques and tools to managerial and business problems. It helps analyze the data available to businesses for making prudent business decisions.

Statistics: It refers to the study of facts, numerical data, measurements, and figures. Statistics helps one conduct research, make informed decisions, and develop critical thinking.

Business Analytics: Business analytics explores explanations and events, while statistics in business compares them, assigns weight to a few of the explanations, and casts doubt on the other ones. This concept aims to assist businesses in making data-driven decisions to improve business outcomes.

Measure of Central Tendencies:

Central Tendencies in Statistics are the numerical values that are used to represent the mid-value or central value of a large collection of numerical data. These values are called central or average values in Statistics. A central or average value of any statistical data or series is the value of the variable that is representative of the entire data or its associated frequency distribution. Such a value is of great significance because it depicts the nature or characteristics of the entire data, which is otherwise very difficult to observe.

Measures of Central Tendency Meaning


The representative value of a data set, generally the central value or
the most occurring value that gives a general idea of the whole data
set is called the Measure of Central Tendency.

Measures of Central Tendency


Some of the most commonly used measures of central tendency are:

Mean
Median
Mode

Mean
Mean in general terms is used for the arithmetic mean of the data,
but other than the arithmetic mean there are geometric mean and
harmonic mean as well that are calculated using different formulas.
Here in this article, we will discuss the arithmetic mean.

Mean for Ungrouped Data


Arithmetic mean ($\bar{x}$) is defined as the sum of the individual observations ($x_i$) divided by the total number of observations $N$. In other words, the mean is given by the sum of all observations divided by the total number of observations.

$\bar{x} = \frac{\sum x_i}{N}$

OR

Mean = Sum of all Observations ÷ Total number of Observations

Example: If there are 5 observations, which are 27, 11, 17, 19, and 21, then the mean ($\bar{x}$) is given by

$\bar{x}$ = (27 + 11 + 17 + 19 + 21) ÷ 5
⇒ $\bar{x}$ = 95 ÷ 5
⇒ $\bar{x}$ = 19

Mean for Grouped Data


Mean ($\bar{x}$) is defined for grouped data as the sum of the products of the observations ($x_i$) and their corresponding frequencies ($f_i$) divided by the sum of all the frequencies ($f_i$).

$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$

Example: If the values ($x_i$) of the observations and their frequencies ($f_i$) are given as follows:

xi: 4, 6, 15, 10, 9
fi: 5, 10, 8, 7, 10

then the arithmetic mean ($\bar{x}$) of the above distribution is given by

$\bar{x}$ = (4×5 + 6×10 + 15×8 + 10×7 + 9×10) ÷ (5 + 10 + 8 + 7 + 10)
⇒ $\bar{x}$ = (20 + 60 + 120 + 70 + 90) ÷ 40
⇒ $\bar{x}$ = 360 ÷ 40
⇒ $\bar{x}$ = 9

Types of Mean
Mean can be classified into three different class groups which are

Arithmetic Mean
Geometric Mean
Harmonic Mean
Arithmetic Mean: The formula for Arithmetic Mean is given by

$\bar{x} = \frac{\sum x_i}{N}$

Where,

x1, x2, x3, . . ., xn are the observations, and


N is the number of observations.
Geometric Mean: The formula for Geometric Mean is given by

$\text{G.M.} = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdot \ldots \cdot x_n}$

Where,

x1, x2, x3, . . ., xn are the observations, and
n is the number of observations.
Harmonic Mean: The formula for Harmonic Mean is given by

$\text{H.M.} = \frac{n}{1/x_1 + 1/x_2 + \ldots + 1/x_n}$

OR

$\text{H.M.} = \frac{n}{\sum (1/x_i)}$

Where,

x1, x2, . . ., xn are the observations, and


n is the number of observations.
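As a quick illustration of the three formulas above, here is a short Python sketch (the sample data are assumed purely for demonstration) computing the arithmetic, geometric and harmonic means of the same set of observations.

```python
import math

data = [2, 4, 8, 16]                # assumed sample observations
n = len(data)

arithmetic_mean = sum(data) / n                       # sum of x_i divided by n
geometric_mean = math.prod(data) ** (1 / n)           # nth root of the product of x_i
harmonic_mean = n / sum(1 / x for x in data)          # n divided by sum of 1/x_i

print(arithmetic_mean)   # 7.5
print(geometric_mean)    # about 5.657
print(harmonic_mean)     # about 4.267
```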
Properties of Mean (Arithmetic)
There are various properties of Arithmetic Mean, some of which are
as follows:

The algebraic sum of deviations from the arithmetic mean is zero, i.e., $\sum (x_i - \bar{x}) = 0$.
If $\bar{x}$ is the arithmetic mean of the observations and a is added to each of the observations, then the new arithmetic mean is $\bar{x}' = \bar{x} + a$.
If a is subtracted from each of the observations, then the new arithmetic mean is $\bar{x}' = \bar{x} - a$.
If each of the observations is multiplied by a, then the new arithmetic mean is $\bar{x}' = \bar{x} \times a$.
If each of the observations is divided by a, then the new arithmetic mean is $\bar{x}' = \bar{x} \div a$.
Disadvantage of Mean as Measure of Central Tendency
Although the mean is the most general way to calculate the central tendency of a dataset, it may not always give a correct picture, especially when the data contain extreme values or large gaps between observations.

Median
The Median of any distribution is that value that divides the
distribution into two equal parts such that the number of
observations above it is equal to the number of observations below it.
Thus, the median is called the central value of any given data either
grouped or ungrouped.

Median of Ungrouped Data


To calculate the Median, the observations must be arranged in
ascending or descending order. If the total number of observations is
N then there are two cases

Case 1: N is Odd

Median = Value of observation at [(n + 1) ÷ 2]th Position


Case 2: N is Even

Median = Arithmetic mean of Values of observations at (n ÷ 2)th and


[(n ÷ 2) + 1]th Position


Example 1: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20,
32 then the Median is given by

Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 32,
36, 38

N = 10 which is even then

Median = Arithmetic mean of values at (10 ÷ 2)th and [(10 ÷ 2) + 1]th


position

⇒Median = (Value at 5th position + Value at 6th position) ÷ 2

⇒Median = (26 + 28) ÷ 2

⇒Median = 27

Example 2: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20
then the Median is given by

Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 36,
38

N = 9 which is odd then

Median = Value at [(9 + 1) ÷ 2]th position

⇒Median = Value at 5th position

⇒Median = 26
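The two cases above can be expressed in a few lines of Python. This is only a sketch (the function name and the way the data are passed are my own choices); it sorts the observations and applies the odd/even rule, reproducing the results of Examples 1 and 2.

```python
def median_ungrouped(values):
    """Median of ungrouped data: sort, then apply the odd/even position rule."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:                       # odd: value at the (n + 1)/2-th position
        return data[n // 2]
    # even: mean of the values at the (n/2)-th and (n/2 + 1)-th positions
    return (data[n // 2 - 1] + data[n // 2]) / 2

print(median_ungrouped([25, 36, 31, 23, 22, 26, 38, 28, 20, 32]))  # 27.0
print(median_ungrouped([25, 36, 31, 23, 22, 26, 38, 28, 20]))      # 26
```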

Median of Grouped Data


The Median of Grouped Data is given as follows:

$Median = l + \frac{N/2 - c_f}{f} \times h$

Where,

l is the lower limit of the median class,
N is the total number of observations,
cf is the cumulative frequency of the class preceding the median class,
f is the frequency of the median class, and
h is the class size.
Example: Calculate the median for the following data.

Class: 10 – 20, 20 – 30, 30 – 40, 40 – 50, 50 – 60
Frequency: 5, 10, 12, 8, 5

Solution:

Create the following table for the given data.

Class | Frequency | Cumulative Frequency
10 – 20 | 5 | 5
20 – 30 | 10 | 15
30 – 40 | 12 | 27
40 – 50 | 8 | 35
50 – 60 | 5 | 40

As N = 40 and N/2 = 20, the class 30 – 40 is the median class.

l = 30, cf = 15, f = 12, and h = 10

Putting the values in the formula $Median = l + \frac{N/2 - c_f}{f} \times h$,

Median = 30 + ((20 – 15)/12) × 10
⇒ Median = 30 + (5/12) × 10

⇒Median = 30 + 4.17

⇒Median = 34.17

So, the median value for this data set is 34.17
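The grouped-data formula can be checked with a short Python sketch (the function and variable names are assumptions of mine; the classes and frequencies are those of the example above):

```python
def median_grouped(class_bounds, freqs):
    """Median = l + ((N/2 - cf)/f) * h for grouped data.

    class_bounds: list of (lower, upper) class limits; freqs: matching frequencies.
    """
    n = sum(freqs)
    cumulative = 0
    for (low, high), f in zip(class_bounds, freqs):
        if cumulative + f >= n / 2:      # first class whose cumulative frequency reaches N/2
            return low + ((n / 2 - cumulative) / f) * (high - low)
        cumulative += f

classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [5, 10, 12, 8, 5]
print(round(median_grouped(classes, frequencies), 2))   # 34.17
```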

Mode
The Mode is the value of the observation which has the maximum frequency corresponding to it. In other words, it is the observation that occurs the maximum number of times in a dataset.

Mode of Ungrouped Data


Mode of Ungrouped Data can be simply calculated by observing the
observation with the highest frequency. Let’s see an example of the
calculation of the mode of ungrouped data.


Example: Find the mode of observations 5, 3, 4, 3, 7, 3, 5, 4, 3.


Solution:

Create a table with each observation and its frequency as follows:

xi: 3, 4, 5, 7
fi: 4, 2, 2, 1

Since 3 has occurred the maximum number of times, i.e. 4 times in the given data, the Mode of the given ungrouped data is 3.

Mode of Grouped Data


The formula to find the mode of the grouped data is:

$Mode = l + \left[\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right] \times h$

Where,

l is the lower class limit of the modal class,
h is the class size,
f1 is the frequency of the modal class,
f0 is the frequency of the class preceding the modal class, and
f2 is the frequency of the class succeeding the modal class.
Example: Find the mode of the dataset which is given as follows.

Class Interval 10-20 20-30 30-40 40-50 50-60


Frequency 5 8 12 16 10
Solution:

As the class interval with the highest frequency is 40-50, which has a
frequency of 16. Thus, 40-50 is the modal class.

Thus, l = 40 , h = 10 , f1 = 16 , f0 = 12 , f2 = 10

Plugging the values into the formula $Mode = l + \left[\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right] \times h$, we get

Mode = 40 + (16 – 12)/(2 × 16 – 12 – 10) × 10

⇒Mode = 40 + (4/10)×10

⇒Mode = 40 + 4

⇒Mode = 44

Therefore, the mode for this set of data is 44.
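A minimal Python sketch of the same calculation (the function name and structure are my own choices) is given below; it locates the modal class and applies the formula:

```python
def mode_grouped(class_bounds, freqs):
    """Mode = l + ((f1 - f0)/(2*f1 - f0 - f2)) * h for grouped data."""
    i = freqs.index(max(freqs))                       # index of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                 # frequency of the preceding class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0    # frequency of the succeeding class
    low, high = class_bounds[i]
    return low + ((f1 - f0) / (2 * f1 - f0 - f2)) * (high - low)

classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [5, 8, 12, 16, 10]
print(mode_grouped(classes, frequencies))    # 44.0
```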


Empirical Relation Between Measures of Central Tendency


The three central tendencies are related to each other by the
empirical formula which is given as follows:

2 × Mean + Mode = 3 × Median

This formula is used to calculate one of the central tendencies when


two other central tendencies are given.

Measure of Dispersion:

Measures of Dispersion are the numbers that are used to represent the scattering of the data. These numbers show the various aspects of how the data is spread across various parameters. The various measures of dispersion that are used to represent the data include,

Standard Deviation
Mean Deviation
Quartile Deviation
Variance
Range, etc
Dispersion in the general sense is the state of scattering. Suppose we
have to study the data for thousands of variables there we have to
find various parameters that represent the crux of the given data set.
These parameters are called the measure of dispersion.


In this article, we will learn about, the measure of dispersion, its


various examples, formulas, and others related to it.

What is the Measure of Dispersion in Statistics?


Measures of Dispersion measure the scattering of the data, i.e. how
the values are distributed in the data set. In statistics, we define the
measure of dispersion as various parameters that are used to define
the various attributes of the data.


These measures of dispersion capture variation between different


values of the data.

Measures of Dispersion Definition


A measure of dispersion is a non-negative real number that quantifies the spread of the data. It is zero when there is no dispersion in the data set, i.e. when all the values are identical, and it increases as the data become more spread out.

Example of Measures of Dispersion


We can understand the measure of dispersion by studying the following example. Suppose we have 10 students in a class and the marks scored by them in a Mathematics test are 12, 14, 18, 9, 11, 7, 9, 16, 19, and 20 out of 20. Then the average value scored by the students in the class is,

Mean (Average) = (12 + 14 + 18 + 9 + 11 + 7 + 9 + 16 + 19 + 20)/10 = 135/10 = 13.5

Then, the average value of the marks is 13.5

Mean Deviation = {|12-13.5| + |14-13.5| + |18-13.5| + |9-13.5| + |11-13.5| + |7-13.5| + |9-13.5| + |16-13.5| + |19-13.5| + |20-13.5|}/10 = 39/10 = 3.9

Types of Measures of Dispersion


Measures of dispersion can be classified into two categories shown
below:

Absolute Measures of Dispersion


Relative Measures of Dispersion
These measures of dispersion can be further divided into various
categories. The measures of dispersion have various parameters and
these parameters have the same unit.


Let’s learn about them in detail.

Absolute Measures of Dispersion


These measures of dispersion are measured and expressed in the
units of data themselves. For example – Meters, Dollars, Kg, etc.
Some absolute measures of dispersion are:

Range: Range is defined as the difference between the largest and the
smallest value in the distribution.

Mean Deviation: Mean deviation is the arithmetic mean of the


difference between the values and their mean.

Standard Deviation: Standard Deviation is the square root of the


arithmetic average of the square of the deviations measured from the
mean.

Variance: Variance is defined as the average of the square deviation


from the mean of the given data set.

Quartile Deviation: Quartile deviation is defined as half of the


difference between the third quartile and the first quartile in a given
data set.

Interquartile Range: The difference between the upper quartile (Q3) and the lower quartile (Q1) is called the Interquartile Range. The formula for the Interquartile Range is given as Q3 – Q1.

Relative Measures of Dispersion


When we have to compare two quantities that are measured in different units, we use relative measures of dispersion to get a better idea about the scatter of the data. Various relative measures of dispersion are,

Coefficient of Range: The coefficient of range is defined as the ratio of


the difference between the highest and lowest value in a data set to
the sum of the highest and lowest value.

Coefficient of Variation: The coefficient of Variation is defined as the


ratio of the standard deviation to the mean of the data set. We use
percentages to express the coefficient of variation.

Coefficient of Mean Deviation: The coefficient of the Mean Deviation


is defined as the ratio of the mean deviation to the value of the
central point of the data set.

Coefficient of Quartile Deviation: The coefficient of the Quartile


Deviation is defined as the ratio of the difference between the third
quartile and the first quartile to the sum of the third and first
quartiles.

Now let’s learn more about some of Absolute Measures of Dispersion


in detail.

Range of Data Set


The range is the difference between the largest and the smallest
values in the distribution. Thus, it can be written as

R=L–S

where

L is the largest value in the Distribution


S is the smallest value in the Distribution
A higher value of range implies higher variation. One drawback of this
measure is that it only takes into account the maximum and the
minimum value which might not always be the proper indicator of
how the values of the distribution are scattered.

Example: Find the range of the data set 10, 20, 15, 0, 100.

Solution:

Smallest Value in the data = 0


Largest Value in the data = 100
Thus, the range of the data set is,

R = 100 – 0

R = 100

Note: Range cannot be calculated for the open-ended frequency


distributions. Open-ended frequency distributions are those
distributions in which either the lower limit of the lowest class or the
higher limit of the highest class is not defined.

Range for Ungrouped Data


To find the range of an ungrouped data set, first identify the smallest and the largest values by inspection; the difference between them gives the range. This is explained by the following example:

Example: Find out the range for the following observations, 20, 24,
31, 17, 45, 39, 51, 61.

Solution:

Largest Value = 61
Smallest Value = 17
Thus, the range of the data set is

Range = 61 – 17 = 44

Range for Grouped Data


For a grouped data set, the range is found from the class limits, as the following example shows.

Example: Find out the range for the following frequency distribution
table for the marks scored by class 10 students.

Marks Intervals Number of Students


0-10 5
10-20 8
20-30 15
30-40 9

Solution:

For Largest Value: Taking the higher limit of Highest Class = 40


For Smallest Value: Taking the lower limit of Lowest Class = 0
Range = 40 – 0

Thus, the range of the given data set is,

Range = 40

Mean Deviation
Range as a measure of dispersion only depends on the highest and
the lowest values in the data. Mean deviation on the other hand
measures the deviation of the observations from the mean of the
distribution. Since the average is the central value of the data, some
deviations might be positive and some might be negative. If they are
added like that, their sum will not reveal much as they tend to cancel
each other’s effect. For example,

Consider the data given below, -5, 10, 25

Mean = (-5 + 10 + 25)/3 = 10

Now a deviation from the mean for different values is,

(-5 -10) = -15


(10 – 10) = 0
(25 – 10) = 15
Now adding the deviations, shows that there is zero deviation from
the mean which is incorrect. Thus, to counter this problem only the
absolute values of the difference are taken while calculating the mean
deviation.

So the formula for the mean deviation is,


Mean Deviation (MD) = $\frac{|x_1 - \mu| + |x_2 - \mu| + \ldots + |x_n - \mu|}{n}$

Mean Deviation for Ungrouped Data


For calculating the mean deviation for ungrouped data, the following
steps must be followed:

Step 1: Calculate the arithmetic mean for all the values of the dataset.

Step 2: Calculate the difference between each value of the dataset


and the mean. Only absolute values of the differences will be
considered. |d|

Step 3: Calculate the arithmetic mean of these deviations using the


formula,

$M.D. = \frac{\sum |d|}{n}$

This can be explained using the example.

Example: Calculate the mean deviation for the given ungrouped data,
2, 4, 6, 8, 10

Solution:

Mean(μ) = (2+4+6+8+10)/(5)

μ=6

$M.D. = \frac{\sum |d|}{n}$

⇒ M.D. = (|2 - 6| + |4 - 6| + |6 - 6| + |8 - 6| + |10 - 6|)/5
⇒ M.D. = (4 + 2 + 0 + 2 + 4)/5
⇒ M.D. = 12/5 = 2.4
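The three steps above can be written as a short Python sketch (the sample data are taken from the example; the function name is my own assumption):

```python
def mean_deviation(values):
    """Mean deviation about the mean: average of the absolute deviations |x - mean|."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

print(mean_deviation([2, 4, 6, 8, 10]))   # 2.4
```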

Measures of Dispersion Formula


Measures of Dispersion Formulas are the formulas that are used to
tell us about the various parameters of the data. Various formulas
related to the measures of dispersion are discussed in the table
below.

The table added here is for the Absolute Measures of Dispersion.

Absolute Measure of Dispersion | Related Formula
Range | R = H – S, where H is the largest value and S is the smallest value
Variance | Population variance: σ² = Σ(xi − μ)²/n; Sample variance: s² = Σ(xi − x̄)²/(n − 1), where μ (or x̄) is the mean and n is the number of observations
Standard Deviation | S.D. = √(variance)
Mean Deviation | M.D. = Σ|xi − a|/n, where a is the central value (mean, median or mode) and n is the number of observations
Quartile Deviation | Q.D. = (Q3 – Q1)/2, where Q3 is the third quartile and Q1 is the first quartile
The table added here is for the Relative Measures of Dispersion.

Relative Measure of Dispersion | Related Formula
Coefficient of Range | (H – S)/(H + S)
Coefficient of Variation | (S.D./Mean) × 100
Coefficient of Mean Deviation | Mean Deviation/a, where a is the central value about which the mean deviation is calculated
Coefficient of Quartile Deviation | (Q3 – Q1)/(Q3 + Q1)
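To tie the formulas in the two tables together, here is an illustrative Python sketch (the data set and variable names are assumptions; the quartile convention used by the standard library may differ slightly from some textbooks) that computes the main absolute and relative measures for one data set:

```python
import statistics

data = [12, 14, 18, 9, 11, 7, 9, 16, 19, 20]      # marks example used earlier in this section

mean = statistics.mean(data)                      # 13.5
data_range = max(data) - min(data)                # Range = H - S
pop_variance = statistics.pvariance(data)         # sum of squared deviations / n
pop_sd = statistics.pstdev(data)                  # square root of the population variance
mean_dev = sum(abs(x - mean) for x in data) / len(data)   # 3.9

# Quartiles (statistics.quantiles uses the "exclusive" method by default;
# textbook conventions for quartiles can differ slightly)
q1, q2, q3 = statistics.quantiles(data, n=4)
quartile_dev = (q3 - q1) / 2

coeff_range = data_range / (max(data) + min(data))   # (H - S)/(H + S)
coeff_variation = pop_sd / mean * 100                # (S.D./Mean) x 100
coeff_quartile_dev = (q3 - q1) / (q3 + q1)

print(data_range, pop_sd, mean_dev, quartile_dev)
print(coeff_range, coeff_variation, coeff_quartile_dev)
```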


Co-Efficient of Dispersion
Coefficients of dispersion are calculated when two series are
compared, which have great differences in their average. We also use
co-efficient of dispersion for comparing two series that have different
measurements. It is denoted using the letters C.D.

Measures of Dispersion and Central Tendency



Measures of Dispersion and Central Tendency are both numbers that are used to describe various parameters of the data. The differences between them are summarised below.

Central Tendency | Measure of Dispersion
Central tendency gives a single number that quantifies the central or representative value of the data set. | A measure of dispersion quantifies the variability or spread of the data around that central value.
Measures of central tendency include: Mean, Median, Mode. | Measures of dispersion include: Range, Variance, Standard Deviation, Mean Deviation, Quartile Deviation.

Coefficient of Variation and Mean Deviation:


Coefficient of Variation:

Another statistical term that is related to the distribution is the variance, which is the standard deviation squared (variance = SD²). Individual deviations from the mean may be either positive or negative; by squaring the deviations before averaging them, the problem of signs is eliminated. One common application of the variance is its use in the F-test to compare the variance of two methods and determine whether there is a statistically significant difference in the imprecision between the methods.

In many applications, however, the SD is often preferred because


it is expressed in the same concentration units as the data. Using
the SD, it is possible to predict the range of control values that
should be observed if the method remains stable. As discussed in
an earlier lesson, laboratorians often use the SD to impose
"gates" on the expected normal distribution of control values.

Normal or Gaussian distribution


Traditionally, after the discussion of the mean, standard
deviation, degrees of freedom, and variance, the next step was to
describe the normal distribution (a frequency polygon) in terms of
the standard deviation "gates." The figure here is a
representation of the frequency distribution of a large set of
laboratory values obtained by measuring a single control material.
This distribution shows the shape of a normal curve. Note that a
"gate" consisting of ±1SD accounts for 68% of the distribution or
68% of the area under the curve, ±2SD accounts for 95% and
±3SD accounts for >99%. At ±2SD, 95% of the distribution is inside
the "gates," 2.5% of the distribution is in the lower or left tail, and
the same amount (2.5%) is present in the upper tail. Some
authors call this polygon an error curve to illustrate that small
errors from the mean occur more frequently than large ones.
Other authors refer to this curve as a probability distribution.

Coefficient of variation
Another way to describe the variation of a test is to calculate the coefficient of variation, or CV. The CV expresses the variation as a percentage of the mean, and is calculated as follows:

CV% = (SD/x̄) × 100

In the laboratory, the CV is preferred when the SD increases in


proportion to concentration. For example, the data from a
replication experiment may show an SD of 4 units at a
concentration of 100 units and an SD of 8 units at a concentration
of 200 units. The CVs are 4.0% at both levels and the CV is more
useful than the SD for describing method performance at
concentrations in between. However, not all tests will
demonstrate imprecision that is constant in terms of CV. For
some tests, the SD may be constant over the analytical range.

The CV also provides a general "feeling" about the performance


of a method. CVs of 5% or less generally give us a feeling of good
method performance, whereas CVs of 10% and higher sound bad.
However, you should look carefully at the mean value before
judging a CV. At very low concentrations, the CV may be high and
at high concentrations the CV may be low. For example, a
bilirubin test with an SD of 0.1 mg/dL at a mean value of 0.5
mg/dL has a CV of 20%, whereas an SD of 1.0 mg/dL at a
concentration of 20 mg/dL corresponds to a CV of 5.0%.
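The bilirubin example above is easy to reproduce; here is a tiny Python sketch (the function name is my own, the numbers are taken from the text):

```python
def cv_percent(sd, mean):
    """Coefficient of variation: the SD expressed as a percentage of the mean."""
    return sd / mean * 100

print(cv_percent(0.1, 0.5))    # 20.0 -> high CV at a very low concentration
print(cv_percent(1.0, 20.0))   # 5.0  -> low CV at a high concentration
```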

Alternate formulae
The lessons on Basic QC Practices cover these same terms (see
QC - The data calculations), but use a different form of the
equation for calculating cumulative or lot-to-date means and SDs.
Guidelines in the literature recommend that cumulative means
and SDs be used in calculating control limits [2-4], therefore it is
important to be able to perform these calculations.

The cumulative mean can be expressed as $\bar{X} = \frac{(\sum x_i)_t}{n_t}$, which appears similar to the prior mean term except for the "t" subscripts, which refer to data from different time periods. The idea is to add the $\sum x_i$ and $n$ terms from groups of data in order to calculate the mean of the combined groups.
The cumulative or lot-to-date standard deviation can be expressed, in its Raw Score form, as

$s = \sqrt{\frac{(\sum x_i^2)_t - \left((\sum x_i)_t\right)^2 / n_t}{n_t - 1}}$

This equation looks quite different from the prior equation in this lesson, but in reality it is equivalent. The cumulative standard deviation formula is derived from an SD formula called the Raw Score Formula. Instead of first calculating the mean or $\bar{X}$, the Raw Score Formula calculates $\bar{X}$ inside the square root sign.

Oftentimes in reading about statistics, an unfamiliar formula may


be presented. You should realize that the mathematics in
statistics is often redundant. Each procedure builds upon the
previous procedure. Formulae that seem to be different are
derived from mathematical manipulations of standard
expressions with which you are often already acquainted.

Coefficient of Mean Deviation:

It is calculated to compare the data of two series. The coefficient of mean deviation is calculated by dividing the mean deviation by the average from which the deviations were taken: if the deviations are taken from the mean, we divide by the mean; if they are taken from the median, we divide by the median; and if they are taken from the mode, we divide by the mode.

B. For Discrete Series:

M.D. = Σf·dy / N, where N = Σf

and dy is the deviation of the variable from X̄, M or Z ignoring signs (taking positive signs only).

Steps to Calculate:

1. Take X, M or Z series as desired.


2. Take deviations ignoring signs.

3. Multiply dy by respective f; get ∑fdy

4. Use the following formula

M.D. = ∑fdy/N


(Note: If the value of X̄, M or Z is in decimal fractions, it is easier to use the Direct Method to get the result.)

When the Mean, Median or Mode is in fractions, the direct formula is applied.

C. For Continuous Series:

For a continuous series also, M.D. = Σf·dy / N.
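For the discrete-series formula M.D. = Σf·dy / N, a small Python sketch might look like the following (illustrative only; the data, the function name and the choice of the mean as the central value are assumptions):

```python
def mean_deviation_discrete(x, f, central_value):
    """M.D. = sum(f * |x - A|) / N, where A is the mean, median or mode and N = sum(f)."""
    n = sum(f)
    return sum(fi * abs(xi - central_value) for xi, fi in zip(x, f)) / n

x = [10, 20, 30, 40, 50]          # assumed variable values (or class mid-points)
f = [3, 5, 8, 3, 1]               # assumed frequencies
mean = sum(fi * xi for xi, fi in zip(x, f)) / sum(f)   # 27.0
print(mean_deviation_discrete(x, f, mean))             # 8.6
```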



Skewness and Kurtosis:

Skewness is an important statistical measure that helps to determine the asymmetrical behaviour of a frequency distribution, or more precisely, the lack of symmetry of the tails (both left and right) of the frequency curve. A distribution or dataset is symmetric if it looks the same to the left and right of the centre point.

Types of skewness: Skewness is classified as follows:

1. Symmetric (No Skewness): A perfectly symmetric distribution is one in which the frequency distribution is the same on both sides of the centre point of the frequency curve. In this case, Mean = Median = Mode, and there is no skewness.

2. Asymmetric Skewness: An asymmetrical or skewed distribution is one in which the spread of the frequencies is different on the two sides of the centre point, or the frequency curve is more stretched towards one side. In such a distribution the values of Mean, Median and Mode fall at different points.

Positive Skewness: In this, the concentration of frequencies is


more towards higher values of the variable i.e. the right tail is
longer than the left tail.
Negative Skewness: In this, the concentration of frequencies is
more towards the lower values of the variable i.e. the left tail is
longer than the right tail.
Kurtosis:
It is also a characteristic of the frequency distribution. It gives an
idea about the shape of a frequency distribution. Basically, the
measure of kurtosis is the extent to which a frequency
distribution is peaked in comparison with a normal curve. It is the
degree of peakedness of a distribution.

Types of kurtosis: Kurtosis is classified as follows:

Leptokurtic: A leptokurtic curve has a higher peak than the normal distribution. In this curve, there is too much concentration of items near the central value.
Mesokurtic: A mesokurtic curve has a peak similar to that of the normal curve. In this curve, items are distributed around the central value much as in a normal curve.
Platykurtic: A platykurtic curve has a lower (flatter) peak than the normal curve. In this curve, there is less concentration of items around the central value.
A small computational sketch for estimating skewness and kurtosis is given after this list.
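If a numerical check is helpful, skewness and kurtosis can be estimated with SciPy. This is only a sketch (the data are made up for illustration); scipy.stats.skew returns the sample skewness, and scipy.stats.kurtosis returns excess kurtosis by default (around 0 for a mesokurtic curve, positive for leptokurtic, negative for platykurtic).

```python
from scipy.stats import skew, kurtosis

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 12]   # assumed data with a longer right tail

print(skew(data))        # positive value -> positively skewed (right tail longer)
print(kurtosis(data))    # excess kurtosis: > 0 leptokurtic, < 0 platykurtic
```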
Difference Between Skewness and Kurtosis

Sr. No. | Skewness | Kurtosis
1. | It indicates the shape and size of variation on either side of the central value. | It indicates the frequencies of distribution at the central value.
2. | The measure of skewness tells us about the magnitude and direction of the asymmetry of a distribution. | It indicates the concentration of items at the central part of a distribution.
3. | It indicates how far the distribution differs from the normal distribution. | It studies the divergence of the given distribution from the normal distribution.
4. | The measure of skewness studies the extent to which deviations cluster above or below the average. | It indicates the concentration of items.
5. | In an asymmetrical distribution, the deviations below and above the average are not equal. | No such comparison of deviations arises in kurtosis.

SECTION-II

CORRELATION

Meaning Of Correlation:

What is correlation?
Correlation refers to the statistical relationship between two entities.
In other words, it's how two variables move in relation to one
another. Correlation can be used for various data sets, as well. In

some cases, you might have predicted how things will correlate, while
in others, the relationship will be a surprise to you. It's important to
understand that correlation does not mean the relationship is causal.

To understand how correlation works, it's important to understand


the following terms:

Positive correlation: A positive correlation has a coefficient greater than 0, up to a perfect value of +1. This means the two variables moved either up or down in the same direction together.
Negative correlation: A negative correlation has a coefficient less than 0, down to a perfect value of -1. This means the two variables moved in opposite directions.
Zero or no correlation: A correlation of zero means there is no relationship between the two variables. In other words, as one variable moves one way, the other moves in an unrelated direction.

Types of correlation coefficients


While correlation studies how two entities relate to one another, a
correlation coefficient measures the strength of the relationship
between the two variables. In statistics, there are three types of
correlation coefficients. They are as follows:

Pearson correlation: The Pearson correlation is the most commonly


used measurement for a linear relationship between two variables.
The stronger the correlation between these two datasets, the closer
it'll be to +1 or -1.
Spearman correlation: This type of correlation is used to determine
the monotonic relationship or association between two datasets.
Unlike the Pearson correlation coefficient, it's based on the ranked
values for each dataset and uses skewed or ordinal variables rather
than normally distributed ones.
Kendall correlation: This type of correlation measures the strength of
dependence between two datasets.

Knowing your variables is helpful in determining which correlation


coefficient type you will use. Using the right correlation equation will
help you to better understand the relationship between the datasets
you're analyzing.


How to calculate the correlation coefficient


You can use the following equation to calculate the correlation coefficient:

$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

When calculating a correlation, keep in mind the following representations:

$x_i$ = the value of x
$y_i$ = the value of y
$\bar{x}$ = the mean of the x-values
$\bar{y}$ = the mean of the y-values

Follow these steps to calculate the correlation coefficient:

1. Determine your data sets


In the beginning of your calculation, determine what your variables
will be. You can organize them in a chart if it helps you to better
visualize them. Separate them by x and y variables. For instance:

x: (1, 2, 3, 4) and y: (2, 3, 4, 5)

2. Calculate the mean of the x and y variables


To calculate the mean, also known as the average, add the values of
each variable together and divide by the number of values in that

dataset. Using the example, if you were to calculate the mean of x,


you'd add 1, 2, 3 and 4 together and divide by 4 because you have
four values for x. Do the same for the y variables. Using the example
above, you'd add together 2, 3, 4 and 5 and divide by 4 because you
have four values for y.

3. Subtract the mean


For the x-variable, subtract the mean from each value of x-variable
and call it "a." For the y-variable, subtract the mean from each value
of the y-variable and call it "b."

4. Multiply and find the sum


Multiply each a-value by the corresponding b-value. After you've
done this, find the sum, which will end up being the formula's
numerator.

5. Take the square root


At this point, you can square every a-value and determine the sum of
the result. After you've done this, calculate the square root of the
value you just determined. This will be the formula's denominator.

6. Divide
Divide the numerator (the value you determined in step 4) by the
denominator (the value you determined in step 5). This will result in
the correlation coefficient.
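The six steps above correspond closely to the following Python sketch (a minimal illustration; the function name is my own choice and the small x/y data set is the one used as the example in step 1):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient following the step-by-step method above."""
    n = len(x)
    x_mean = sum(x) / n                       # step 2: means of x and y
    y_mean = sum(y) / n
    a = [xi - x_mean for xi in x]             # step 3: deviations from the means
    b = [yi - y_mean for yi in y]
    numerator = sum(ai * bi for ai, bi in zip(a, b))                               # step 4
    denominator = math.sqrt(sum(ai ** 2 for ai in a) * sum(bi ** 2 for bi in b))   # step 5
    return numerator / denominator            # step 6

print(pearson_r([1, 2, 3, 4], [2, 3, 4, 5]))   # 1.0 (perfect positive correlation)
```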

If you prefer to calculate digitally, there are correlation calculators


online. This method is more efficient when you have larger datasets.

Examples of correlation
Use the following correlation examples to help you better analyze the
correlation results from your own datasets.

Positive correlations
Here are some examples of positive correlations:

1. The more time you spend on a project, the more effort you'll have
put in.

2. The more money you make, the more taxes you will owe.

3. The nicer you are to employees, the more they'll respect you.

4. The more education you receive, the smarter you'll be.

5. The more overtime you work, the more money you'll earn.

Negative correlations
Here are some examples of negative correlations:

1. The more payments you make on a loan, the less money you'll owe.

2. As the number of your employees decreases, the more job


positions you'll have open.

3. The more you work in the office, the less time you'll spend at
home.

4. The more employees you hire, the fewer funds you'll have.

5. The more time you spend on a project, the less time you'll have.

No correlation
Here are some examples of entities with zero correlation:

1. The nicer you treat your employees, the higher their pay will be.

2. The smarter you are, the later you'll arrive at work.



3. The wealthier you are, the happier you'll be.

4. The earlier you arrive at work, your need for more supplies
increases.

5. The more funds you invest in your business, the more employees
will leave work early.

Linear and Non Linear Correlation:

Correlation can be broadly classified into two types: linear and


nonlinear.

Linear Correlation:

Linear correlation is the most common type of correlation found in


statistical analysis. It refers to a linear relationship between two
variables, where the change in one variable is directly proportional to
the change in the other variable. In other words, if one variable
increases or decreases, the other variable also increases or decreases
in a proportional manner. A linear correlation is usually represented
by a straight line on a scatter plot.

For example, let's consider the relationship between the hours spent
studying and the grades obtained in a class. If the relationship is
linear, we can expect that the more time a student spends studying,
the better grades they will get. This can be represented by a straight
line on a scatter plot.

Linear correlation is usually measured using the Pearson correlation


coefficient. The Pearson correlation coefficient ranges from -1 to +1,
where -1 represents a perfect negative correlation, 0 represents no
correlation, and +1 represents a perfect positive correlation. If the

Pearson correlation coefficient is close to 1 or -1, it indicates a strong


linear correlation between the variables.

Nonlinear Correlation:

Nonlinear correlation refers to a relationship between two variables


that is not linear. In other words, the relationship between the
variables is not directly proportional. Nonlinear relationships can take
on many different forms, including quadratic, exponential,
logarithmic, and sinusoidal.

For example, consider the relationship between the amount of


fertilizer used and the crop yield. Initially, an increase in the amount
of fertilizer used may lead to a proportional increase in crop yield.
However, after a certain point, further increases in the amount of
fertilizer may lead to diminishing returns, and the crop yield may
plateau or even decrease. This relationship is nonlinear and cannot be
represented by a straight line on a scatter plot.

Nonlinear correlation can be measured using methods such as


Spearman's rank correlation coefficient, Kendall's tau, or polynomial
regression. These methods are more complex than linear correlation
analysis, as they involve fitting a curve to the data rather than a
straight line.

In conclusion, understanding the type of correlation between two


variables is essential in statistical analysis. Linear correlation refers to
a directly proportional relationship between two variables, while
nonlinear correlation refers to a relationship that is not directly
proportional. Different methods are used to measure linear and
nonlinear correlation, and choosing the appropriate method depends
on the nature of the data and the research question at hand.

Scatter Diagram:

A scatter plot is a type of graph. A scatter plot can be defined as a


graph containing bivariate data in the form of plotted points, which
allows viewers to see a correlation between plotted points.

A scatter plot in its simplest form is a way to display bivariate data.


Bivariate data is simply data that has been collected that reflects two
different variables. With a scatter plot, all someone has to do is plot
each point of bivariate data that has been collected. Do not connect
the points with a line. Instead, look at the graph to see if there is
some sort of relationship between the bivariate data. The relationship
might be strong or weak. Additionally, it may be positive, negative, or
have no relationship at all.

Scatter plots are sometimes called scatter diagrams. Some other graph types that may be more familiar are line graphs, bar graphs, box-and-whisker plots, or even picture graphs.

A Simple Scatter Plot Example


A simple scatter plot can be used to see the difference in outdoor
temperatures compared to ice cream sales. The two variables would
be outside temperature and ice cream sales. This data could be
collected and organized into a table.
Outside Temperature (°F) Ice Cream Sales (# of Ice Cream Cones)
50 3
65 18
70 54
85 75
100 98
Once the data is organized into a table, it can be turned into ordered
pairs. The x-value will always be the independent variable while the y-
value will always be the dependent variable.

(50, 3), (65, 18), (70, 54), (85, 75), (100, 98)

Now that points have been created, they can be plotted to see what
the scatter plot looks like. The independent variable will go along the
x-axis and the dependent variable will go along the y-axis.

Scatter plot: Outside Temperature versus Ice Cream Cone Sales (each plotted point is one temperature/sales pair).

From this example, it is easy to see that there is some correlation


between temperature and ice cream sales. As the temperature
increases, so do ice cream sales. This is an example of a positive
correlation. This scatter plot also shows a positive linear correlation
due to the trend of the plotted points generally forming a linear
relationship.
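For readers who want to reproduce a plot like this, here is a minimal matplotlib sketch (the data are the temperature/ice-cream pairs from the example; the axis labels and title are my own choices):

```python
import matplotlib.pyplot as plt

temperature = [50, 65, 70, 85, 100]      # independent variable (x-axis)
ice_cream_sales = [3, 18, 54, 75, 98]    # dependent variable (y-axis)

plt.scatter(temperature, ice_cream_sales)   # plot each (x, y) pair as a point
plt.xlabel("Outside Temperature (°F)")
plt.ylabel("Ice Cream Sales (# of cones)")
plt.title("Temperature vs Ice Cream Sales")
plt.show()
```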

What is the Purpose of a Scatter Plot?


The main purpose of a scatter plot is to display bivariate data. Being
able to visualize the relationship between bivariate data gives us a lot
of information. Reading data points can take a long time to process,
but by simply looking at a scatter plot it is easy to glean if there is any
correlation, if there is possible causation, if there is a positive or
negative trend, if the trend is linear or exponential, etc. It often gives

mathematicians a good starting point in their research between the


relationship of the two variables all at a quick glance.

Visual Map of Your Data


Let's meet Tom. Tom enjoys working in his vegetable garden. He loves
all his vegetables, but the tomatoes are his favorite. Tom just wasn't
happy with his tomato crop this year, so he decided to study
tomatoes in his garden. He wants to know if there's any connection
between the number of tomatoes on a plant and the hours of
exposure to sun.

He brings out his lab notebook and starts to record the data. Tom
counts the number of tomatoes on each plant. He also records the
number of hours of sun each tomato plant gets during the day. Tom
now takes the data back indoors and wonders how to make sense of
it. Is there a connection between the two things that he measured?

That's where the scatter diagrams come in. Just like it sounds, a
scatter diagram, or scatter plot, is a graph of your data. Scatter
diagrams are types of graphs that help you find out if two things are
connected. In math, we like to call those things variables. How do you
know if there's a connection or a relationship between two variables?
We measure the two variables and graph them on an (x, y) coordinate
system.

Scatter Plot showing No Correlation

Scatter plots may also end up showing relationships that are not
linear. Examples of these are relationships that may be exponential or
quadratic.

Scatter Plot showing an Exponential Correlation

Scatter Plot showing a Quadratic Correlation

Scatter Plot Examples


Real-Life Example 1
A student decided to survey their classmates when they got back to
school from summer break. They wanted to see how many books
their friends read and also their shoe sizes. After collecting their data,
they organized it into a table and displayed it in a scatter plot.
Books Read Shoe Size
5 7
12 7.5
18 8
55 6
20 7
12 9
4 10
6 7
78 7
25 6.5

Scatter Plot showing Books Read versus Shoe Size Data



In looking at this scatter plot it does not appear that there is any
potential positive slope or negative slope. Additionally, the data does
not seem to be showing signs of a linear pattern, exponential pattern,
or quadratic pattern. Therefore, this data has no correlation. This is to
be expected. Shoe size does not dictate how many books someone
reads, nor does the number of books someone reads dictate their
shoe size.

Real-Life Example 2
In an Algebra I class a student decided to survey their friends
regarding information that interested them about the pandemic. They
asked their classmates how many times they had to quarantine in the
year 2021 and how many tv shows they binge-watched. The student
then organized the data into a table and made a scatter plot with the
results.
Times Quarantined TV Shows Binged
1 5
1 6
2 10
3 15
4 22
2 12
1 4
2 11
3 18

Scatter Plot showing Number of Times Quarantined versus Number of TV Shows Binge-Watched

When this scatter plot is examined, there appears to be a weak linear trend: a line could be drawn through the graph, and it would have a positive slope. This is not a strong positive correlation, but there is some sort of weak correlation. Therefore, there is a weak positive relationship between the number of times quarantined and the number of TV shows watched. This is to be expected: the more times someone has been stuck at home unable to leave, the more TV they are going to watch.

Karl Pearson’s Coefficient of Correlation:

The first person to give a mathematical formula for the measurement


of the degree of relationship between two variables in 1890 was Karl
Pearson. Karl Pearson’s Coefficient of Correlation is also known as
Product Moment Correlation or Simple Correlation Coefficient. This
method of measuring the coefficient of correlation is the most
popular and is widely used. It is denoted by ‘r’, where r is a pure
number which means that r has no unit.

According to Karl Pearson, “Coefficient of Correlation is calculated by


dividing the sum of products of deviations from their respective
means by their number of pairs and their standard deviations.”

Karl~Pearson’s~Coefficient~of~Correlation(r)=\frac{Sum~of~Products
~of~Deviations~from~their~respective~means}{Number~of~Pairs\tim
es{Standard~Deviations~of~both~Series}}

Or

r=\frac{\sum{xy}}{N\times{\sigma_x}\times{\sigma_y}}

Where,

N = Number of Pair of Observations



x = Deviation of X series from Mean (X-\bar{X})

y = Deviation of Y series from Mean (Y-\bar{Y})

\sigma_x = Standard Deviation of X series


(\sqrt{\frac{\sum{x^2}}{N}})

\sigma_y = Standard Deviation of Y series


(\sqrt{\frac{\sum{y^2}}{N}})

r = Coefficient of Correlation


Methods of Calculating Karl Pearson’s Coefficient of Correlation
1. Actual Mean Method
2. Direct Method
3. Short-Cut Method/Assumed Mean Method (Indirect Method)
4. Step Deviation Method
1. Actual Mean Method
The steps involved in the calculation of coefficient of correlation by
using Actual Mean Method are:

The first step is to calculate the mean of the given two series (say X
and Y).

Now, take the deviation of X series from \bar{X} and denote the
deviations by x.
Square the deviations of x and obtain the total; i.e., \sum{x^2}
Take the deviation of Y series from \bar{Y} and denote the
deviations by y.
Square the deviations of y and obtain the total; i.e., \sum{y^2}
Multiply the respective deviations of Series X and Y and obtain the
total; i.e., \sum{xy} .
Now, use the following formula to determine the Coefficient of
Correlation:
r=\frac{\sum{xy}}{\sqrt{\sum{x^2}\times{\sum{y^2}}}}

Example:
Use Actual Mean Method and determine the coefficient of correlation
for the following data:

Data Table

Solution:
Coefficient of Correlation

\bar{X}=\frac{\sum{X}}{N}=\frac{168}{7}=24

\bar{Y}=\frac{\sum{Y}}{N}=\frac{105}{7}=15

r=\frac{\sum{xy}}{\sqrt{\sum{x^2}\times{\sum{y^2}}}}

∑xy = 336, ∑x2 = 448, ∑y2 = 252

r=\frac{336}{\sqrt{448\times252}}=\frac{336}{\sqrt{1,12,896}}=\frac{3
36}{336}=1

Coefficient of Correlation = 1

It means that there is a perfect positive correlation between the


values of Series X and Series Y.
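
The raw data table for this example is not reproduced above, but any pair of series consistent with the reported totals (N = 7, ∑X = 168, ∑Y = 105, ∑xy = 336, ∑x² = 448, ∑y² = 252) gives the same answer. The Python sketch below uses one such hypothetical pair of series (the specific values are an assumption, not the original table) to show the Actual Mean Method step by step.

    import numpy as np

    # Hypothetical series consistent with the totals quoted in the example
    X = np.array([12, 16, 20, 24, 28, 32, 36])   # mean = 24, sum = 168
    Y = np.array([6, 9, 12, 15, 18, 21, 24])     # mean = 15, sum = 105

    x = X - X.mean()   # deviations of X from its actual mean
    y = Y - Y.mean()   # deviations of Y from its actual mean

    r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
    print(round(r, 4))  # 1.0 -> perfect positive correlation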

2. Direct Method
The steps involved in the calculation of coefficient of correlation by
using Direct Method are:

The first step is to calculate the sum of Series X (∑X).


Now, calculate the sum of Series Y (∑Y).
Square the values of X Series and calculate their total; i.e., ∑X2.
Square the values of Y Series and calculate their total; i.e., ∑Y2.
Multiply the values of Series X and Y and calculate their total; i.e., ∑XY.
Now, use the following formula to determine Coefficient of
Correlation:
r=\frac{N\sum{XY}-\sum{X}.\sum{Y}}{\sqrt{N\sum{X^2}-
(\sum{X})^2}{\sqrt{N\sum{Y^2}-(\sum{Y})^2}}}

Example:
Use Direct Method and determine the coefficient of correlation for
the following data:

Data Table

Solution:
Coefficient of Correlation

r=\frac{N\sum{XY}-\sum{X}.\sum{Y}}{\sqrt{N\sum{X^2}-
(\sum{X})^2}{\sqrt{N\sum{Y^2}-(\sum{Y})^2}}}

=\frac{(7\times2,856)-(168\times105)}{\sqrt{(7\times4,480)-
(168)^2}\times{\sqrt{(7\times1,827)-(105)^2}}}

=\frac{19,992-17,640}{\sqrt{31,360-28,224}\times{\sqrt{12,789-
11,025}}}

=\frac{2,352}{\sqrt{3,136}\times{\sqrt{1,764}}}=\frac{2,352}{56\times
42}

=\frac{2,352}{2,352}=1

Coefficient of Correlation = 1

It means that there is a perfect positive correlation between the


values of Series X and Series Y.
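
The totals quoted in this example (∑X = 168, ∑Y = 105, ∑XY = 2856, ∑X² = 4480, ∑Y² = 1827, N = 7) are consistent with the same hypothetical series used in the previous sketch, so the Direct Method can be checked in Python using only the raw sums; again, the data values themselves are an assumption.

    import numpy as np

    X = np.array([12, 16, 20, 24, 28, 32, 36])   # sum = 168, sum of squares = 4480
    Y = np.array([6, 9, 12, 15, 18, 21, 24])     # sum = 105, sum of squares = 1827
    N = len(X)

    numerator = N * (X * Y).sum() - X.sum() * Y.sum()
    denominator = np.sqrt(N * (X ** 2).sum() - X.sum() ** 2) * \
                  np.sqrt(N * (Y ** 2).sum() - Y.sum() ** 2)
    print(round(numerator / denominator, 4))     # 1.0, matching the Actual Mean Method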

3. Short-Cut Method/Assumed Mean Method


Actual Mean can sometimes come in fractions which can make the
calculation of standard deviation complicated and difficult. In those
cases, it is suggested to use Short-Cut Method to simplify the
calculations. The steps involved in the calculation of coefficient of
correlation by using Assumed Mean Method are:

First of all, take the deviations of X Series from the assumed mean
and denote the values by dx. Calculate their total; i.e., ∑dx.
Now, square the deviations of X series and calculate their total; i.e.,
∑dx2.
Take the deviations of Y Series from the assumed mean and denote
the values by dy. Calculate their total; i.e., ∑dy.
Square the deviations of Y series and calculate their total; i.e., ∑dy2.
Multiply dx and dy and calculate their total; i.e., ∑dxdy.
Now, use the following formula to determine Coefficient of
Correlation:
r=\frac{N\sum{dxdy}-\sum{dx}.\sum{dy}}{\sqrt{N\sum{dx^2}-
(\sum{dx})^2}{\sqrt{N\sum{dy^2}-(\sum{dy})^2}}}

Where,

N = Number of pair of observations

∑dx = Sum of deviations of X values from assumed mean



∑dy = Sum of deviations of Y values from assumed mean

∑dx2 = Sum of squared deviations of X values from assumed mean

∑dy2 = Sum of squared deviations of Y values from assumed mean

∑dxdy = Sum of the products of deviations dx and dy

4. Step Deviation Method
The Step Deviation Method further simplifies the Assumed Mean Method: the deviations from the assumed means are divided by a common factor, and the resulting step deviations are denoted dx′ and dy′. The same formula is then applied to dx′ and dy′.

Example:
Use Step Deviation Method and determine the coefficient of correlation for the following data:

Data Table

Solution:
Coefficient of Correlation under Step Deviation Method

r=\frac{N\sum{dx^\prime dy^\prime}-\sum{dx^\prime}\sum{dy^\prime}}{\sqrt{N\sum{dx^{\prime2}}-(\sum{dx^\prime})^2}\times{\sqrt{N\sum{dy^{\prime2}}-(\sum{dy^\prime})^2}}}

=\frac{(7\times35)-(7\times7)}{\sqrt{(7\times35)-
(7)^2}\times{\sqrt{(7\times35)-(7)^2}}}

=\frac{245-49}{\sqrt{245-49}\times{\sqrt{245-49}}}

=\frac{196}{\sqrt{196}\times{\sqrt{196}}}=\frac{196}{14\times14}

=\frac{196}{196}=1

Coefficient of Correlation = 1

It means that there is a perfect positive correlation between the


values of Series X and Series Y.

Change of Scale and Origin


Coefficient of Correlation does not depend upon the change of scale
and origin.

Change of Origin: If a constant is added to or subtracted from the values, it will not have any effect on the value of the correlation coefficient.
Change of Scale: Similarly, if the values are multiplied or divided by a constant, it will not have any effect on the value of the correlation coefficient.
Example:
Find the coefficient of correlation from the following figures:

Data Table

Solution:
As the coefficient of correlation is not affected by the change in scale
and origin of the variables, we will multiply the X Series by 10 and
divide the Y series by 100.

Coefficient of Correlation

r=\frac{N\sum{dxdy}-\sum{dx}.\sum{dy}}{\sqrt{N\sum{dx^2}-
(\sum{dx})^2}{\sqrt{N\sum{dy^2}-(\sum{dy})^2}}}

=\frac{(8\times156)-[(-24)\times(-4)]}{\sqrt{(8\times1,584)-(-
24)^2}\times{\sqrt{(8\times44)-(-4)^2}}}

=\frac{1,248-96}{\sqrt{12,672-576}\times{\sqrt{352-16}}}

=\frac{1,152}{\sqrt{12,096}\times{\sqrt{336}}}=\frac{1,152}{110\times
18.3}

=\frac{1,152}{2,013}=0.57

Coefficient of Correlation = 0.57

It means that there is a moderate degree of positive correlation


between variables X and Y.
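
The invariance of r under a change of origin and scale can also be verified directly in Python; this is a hedged sketch with arbitrary, hypothetical data rather than the table used in the example above.

    import numpy as np

    X = np.array([3, 5, 8, 10, 14, 20, 25, 31])
    Y = np.array([10, 12, 11, 15, 14, 18, 17, 20])

    r_original = np.corrcoef(X, Y)[0, 1]
    r_new_origin = np.corrcoef(X - 7, Y + 100)[0, 1]   # constants added/subtracted
    r_new_scale = np.corrcoef(X * 10, Y / 100)[0, 1]   # constants multiplied/divided

    print(r_original, r_new_origin, r_new_scale)       # all three values are identical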

Properties of Correlation Coefficient:

The following are the main properties of correlation.

1. Coefficient of Correlation lies between -1 and +1:

The coefficient of correlation cannot take a value less than −1 or more than +1. Symbolically,

−1 ≤ r ≤ +1, i.e., |r| ≤ 1

2. Coefficients of Correlation are independent of Change of Origin:

This property reveals that if we subtract any constant from all the
values of X and Y, it will not affect the coefficient of correlation.

3. Coefficients of Correlation possess the property of symmetry:

The degree of relationship between two variables is symmetric, i.e., rxy = ryx.

4. Coefficient of Correlation is independent of Change of Scale:

This property reveals that if we divide or multiply all the values of X


and Y, it will not affect the coefficient of correlation.

5. Co-efficient of correlation measures only linear correlation


between X and Y.

6. If two variables X and Y are independent, coefficient of correlation


between them will be zero.

Plot the pairs (8, 70), (16, 58), (24, 50), (31, 32), (42, 26), (50, 12) on graph paper. The result is a scatter diagram, and this data shows a high degree of negative correlation.

Probable Error of Correlation Coefficient:

Definition: The Probable Error of Correlation Coefficient helps in


determining the accuracy and reliability of the value of the coefficient
that in so far depends on the random sampling.

In other words, the probable error (P.E.) is the value which is added or
subtracted from the coefficient of correlation (r) to get the upper
limit and the lower limit respectively, within which the value of the
correlation expectedly lies.

The probable error of correlation coefficient can be obtained by


applying the following formula:

P.E.(r) = 0.6745 × (1 − r²) / √N

where,
r = coefficient of correlation
N = number of observations

There is no correlation between the variables if the value of ‘r’ is less


than P.E. This shows that the coefficient of correlation is not at all
significant.
The correlation is said to be certain when the value of ‘r’ is six times
more than the probable error; this shows that the value of ‘r’ is
significant.

By adding and subtracting the value of P.E. from the value of ‘r’, we get the upper limit and the lower limit, respectively, within which the correlation coefficient in the population is expected to lie. Symbolically, ρ = r ± P.E., where ρ (rho) denotes the correlation in the population.

The probable Error can be used only when the following three
conditions are fulfilled:

The data must approximate to the bell-shaped curve, i.e. a normal


frequency curve.
The Probable error computed from the statistical measure must have
been taken from the sample.
The sample items must be selected in an unbiased manner and must
be independent of each other.
Thus, the probable error is calculated to check the reliability of the
value of coefficient calculated from the random sampling.
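
As a quick sketch (with hypothetical values of r and N), the probable error and the limits for the population correlation can be computed as follows in Python:

    import math

    def probable_error(r, n):
        # Probable error of Pearson's r for n pairs of observations
        return 0.6745 * (1 - r ** 2) / math.sqrt(n)

    r, n = 0.8, 25                       # hypothetical sample values
    pe = probable_error(r, n)
    print(round(pe, 4))                  # 0.0486
    print(r - pe, r + pe)                # limits within which rho is expected to lie
    print(r > 6 * pe)                    # True -> r is treated as significant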

Multiple and Partial Correlation Coefficient:


Understanding Partial Correlation
2.1 Definition
Partial correlation is a statistical technique used to measure the
relationship between two variables while controlling the effects of
one or more additional variables. In other words, partial correlation
measures the strength and direction of the relationship between two
variables, while holding constant the effects of one or more other
variables.

2.2 Formula and Calculation


The formula for partial correlation is:

rxy.z = (rxy – rxz * ryz) / (sqrt(1 – rxz^2) * sqrt(1 – ryz^2))

where rxy.z is the partial correlation coefficient between x and y,


controlling for the effects of z, rxy is the correlation coefficient

between x and y, rxz is the correlation coefficient between x and z,


and ryz is the correlation coefficient between y and z.
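
As a small sketch, the partial correlation formula can be evaluated in Python for a set of hypothetical pairwise correlations (the numbers below are assumptions chosen only for illustration):

    import math

    def partial_corr(r_xy, r_xz, r_yz):
        # First-order partial correlation of x and y, controlling for z
        return (r_xy - r_xz * r_yz) / (
            math.sqrt(1 - r_xz ** 2) * math.sqrt(1 - r_yz ** 2)
        )

    print(round(partial_corr(r_xy=0.70, r_xz=0.50, r_yz=0.60), 3))   # 0.577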

2.3 Interpretation
The partial correlation coefficient measures the strength and
direction of the relationship between two variables while controlling
for the effects of one or more other variables. A positive partial
correlation coefficient indicates a positive relationship between the
two variables, while a negative partial correlation coefficient indicates
a negative relationship between the two variables. A partial
correlation coefficient of 0 indicates no relationship between the two
variables.

3. Understanding Multiple Correlation


3.1 Definition
Multiple correlation is a statistical technique used to measure the
relationship between a dependent variable and two or more
independent variables. In other words, multiple correlation measures
the strength and direction of the relationship between a dependent
variable and two or more independent variables.

3.2 Formula and Calculation


For the common case of one dependent variable (y) and two independent variables (x1 and x2), the formula for the multiple correlation coefficient is:

Ry.12 = sqrt((ry1^2 + ry2^2 – 2 * ry1 * ry2 * r12) / (1 – r12^2))

where Ry.12 is the multiple correlation coefficient of y with x1 and x2, ry1 is the correlation coefficient between the dependent variable and the first independent variable, ry2 is the correlation coefficient between the dependent variable and the second independent variable, and r12 is the correlation coefficient between the two independent variables.
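
A minimal Python sketch of this two-predictor formula, again with hypothetical pairwise correlations chosen only for illustration:

    import math

    def multiple_corr(r_y1, r_y2, r_12):
        # Multiple correlation of y with x1 and x2 (two-predictor case)
        return math.sqrt(
            (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
        )

    print(round(multiple_corr(r_y1=0.60, r_y2=0.50, r_12=0.30), 3))   # about 0.687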

3.3 Interpretation
The multiple correlation coefficient measures the strength of the combined relationship between a dependent variable and two or more independent variables. Unlike the simple correlation coefficient, it cannot be negative: it always lies between 0 and 1. A multiple correlation coefficient of 1 indicates that the dependent variable is perfectly predicted by the independent variables, while a multiple correlation coefficient of 0 indicates no linear relationship between the dependent variable and the independent variables.

4. Differences between Partial and Multiple Correlation


The main difference between partial and multiple correlation is that
partial correlation measures the relationship between two variables
while controlling for the effects of one or more additional variables,
whereas multiple correlation measures the relationship between a
dependent variable and two or more independent variables.

In other words, partial correlation measures the unique relationship


between two variables after controlling for the effects of other
variables, while multiple correlation measures the overall relationship
between a dependent variable and a set of independent variables.

5. Significance of Partial and Multiple Correlation


5.1 Advantages
Partial and multiple correlation analysis have several advantages.
They allow us to examine the relationship between variables while
controlling for the effects of other variables, which can help to reduce
bias in our analysis. They also allow us to identify the unique
contribution of each variable to the overall relationship between
variables.

5.2 Limitations
Partial and multiple correlation analysis also have some limitations.
They assume that the relationship between variables is linear and that
there is no interaction between variables. They also assume that the
variables are normally distributed and that there are no outliers or
influential observations in the data.

6. Applications of Partial and Multiple Correlation


Partial and multiple correlation analysis have many applications in
research and data analysis. They can be used to examine the
relationship between variables in psychology, sociology, economics,
and other fields. They can also be used in machine learning and
predictive modeling to identify the most important variables for
predicting an outcome.

REGRESSION

Meaning and it’s types:

What Is a Regression?
Regression is a statistical method used in finance, investing, and other
disciplines that attempts to determine the strength and character of
the relationship between one dependent variable (usually denoted by
Y) and a series of other variables (known as independent variables).

Also called simple regression or ordinary least squares (OLS), linear


regression is the most common form of this technique. Linear
regression establishes the linear relationship between two variables
based on a line of best fit. Linear regression is thus graphically
depicted using a straight line with the slope defining how the change
in one variable impacts a change in the other. The y-intercept of a
linear regression relationship represents the value of one variable
when the value of the other is zero. Non-linear regression models also
exist, but are far more complex.

Regression analysis is a powerful tool for uncovering the associations


between variables observed in data, but cannot easily indicate
causation. It is used in several contexts in business, finance, and
economics. For instance, it is used to help investment managers value
assets and understand the relationships between factors such as
commodity prices and the stocks of businesses dealing in those
commodities.

Regression as a statistical technique should not be confused with the


concept of regression to the mean (mean reversion).

KEY TAKEAWAYS

A regression is a statistical technique that relates a dependent


variable to one or more independent (explanatory) variables.
A regression model is able to show whether changes observed in the
dependent variable are associated with changes in one or more of the
explanatory variables.
It does this by essentially fitting a best-fit line and seeing how the data
is dispersed around this line.
Regression helps economists and financial analysts in things ranging
from asset valuation to making predictions.
In order for regression results to be properly interpreted, several
assumptions about the data and the model itself must hold.

Understanding Regression
Regression captures the correlation between variables observed in a
data set and quantifies whether those correlations are statistically
significant or not.

The two basic types of regression are simple linear regression and
multiple linear regression, although there are non-linear regression
methods for more complicated data and analysis. Simple linear
regression uses one independent variable to explain or predict the
outcome of the dependent variable Y, while multiple linear regression
uses two or more independent variables to predict the outcome
(while holding all others constant).

Regression can help finance and investment professionals as well as


professionals in other businesses. Regression can also help predict
sales for a company based on weather, previous sales, GDP growth, or
other types of conditions. The capital asset pricing model (CAPM) is

an often-used regression model in finance for pricing assets and


discovering the costs of capital.

Regression and Econometrics


Econometrics is a set of statistical techniques used to analyze data in
finance and economics. An example of the application of
econometrics is to study the income effect using observable data. An
economist may, for example, hypothesize that as a person increases
their income their spending will also increase.

If the data show that such an association is present, a regression


analysis can then be conducted to understand the strength of the
relationship between income and consumption and whether or not
that relationship is statistically significant—that is, it appears to be
unlikely that it is due to chance alone.

Note that you can have several explanatory variables in your


analysis—for example, changes to GDP and inflation in addition to
unemployment in explaining stock market prices. When more than
one explanatory variable is used, it is referred to as multiple linear
regression. This is the most commonly used tool in econometrics.

Econometrics is sometimes criticized for relying too heavily on the


interpretation of regression output without linking it to economic
theory or looking for causal mechanisms. It is crucial that the findings
revealed in the data are able to be adequately explained by a theory,
even if that means developing your own theory of the underlying
processes.

Calculating Regression
Linear regression models often use a least-squares approach to
determine the line of best fit. The least-squares technique is

determined by minimizing the sum of squares created by a


mathematical function. A square is, in turn, determined by squaring
the distance between a data point and the regression line or mean
value of the data set.

Once this process has been completed (usually done today with
software), a regression model is constructed. The general form of
each type of regression model is:

Simple linear regression:

Y = a + bX + u

Multiple linear regression:

Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

where:
Y = The dependent variable you are trying to predict or explain
X = The explanatory (independent) variable(s) you are using to predict or associate with Y
a = The y-intercept
b = (beta coefficient) is the slope of the explanatory variable(s)
u = The regression residual or error term
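
As a hedged illustration (not part of the source text), the simple linear form Y = a + bX + u can be fitted by ordinary least squares in Python using a small set of hypothetical observations:

    import numpy as np

    # Hypothetical observations of X and Y
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

    # np.polyfit with degree 1 returns the least-squares slope b and intercept a
    b, a = np.polyfit(X, Y, 1)
    residuals = Y - (a + b * X)          # the error term u for each observation
    print(round(a, 3), round(b, 3))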

Example of How Regression Analysis Is Used in Finance


Regression is often used to determine how many specific factors such
as the price of a commodity, interest rates, particular industries, or
sectors influence the price movement of an asset. The
aforementioned CAPM is based on regression, and it is utilized to
project the expected returns for stocks and to generate costs of
capital. A stock’s returns are regressed against the returns of a
broader index, such as the S&P 500, to generate a beta for the
particular stock.

Beta is the stock’s risk in relation to the market or index and is


reflected as the slope in the CAPM model. The return for the stock in
question would be the dependent variable Y, while the independent
variable X would be the market risk premium.
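
As a rough sketch of that idea (with made-up return figures, not real market data), beta can be estimated as the slope of a regression of stock returns on market returns:

    import numpy as np

    # Hypothetical periodic returns for the market index and one stock
    market = np.array([0.010, -0.020, 0.030, 0.015, -0.010, 0.020])
    stock = np.array([0.015, -0.025, 0.040, 0.020, -0.012, 0.030])

    beta, alpha = np.polyfit(market, stock, 1)   # slope = beta, intercept = alpha
    print(round(beta, 2))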

Additional variables such as the market capitalization of a stock,


valuation ratios, and recent returns can be added to the CAPM model
to get better estimates for returns. These additional factors are
known as the Fama-French factors, named after the professors who
developed the multiple linear regression model to better explain asset
returns.

Why Is It Called Regression?


Although there is some debate about the origins of the name, the
statistical technique described above most likely was termed

“regression” by Sir Francis Galton in the 19th century to describe the


statistical feature of biological data (such as heights of people in a
population) to regress to some mean level. In other words, while
there are shorter and taller people, only outliers are very tall or short,
and most people cluster somewhere around (or “regress” to) the
average.

What Is the Purpose of Regression?


In statistical analysis, regression is used to identify the associations
between variables occurring in some data. It can show both the
magnitude of such an association and also determine its statistical
significance (i.e., whether or not the association is likely due to
chance). Regression is a powerful tool for statistical inference and has
also been used to try to predict future outcomes based on past
observations.

How Do You Interpret a Regression Model?


A regression model output may be in the form of Y = 1.0 + (3.2)X1 -
2.0(X2) + 0.21.

Here we have a multiple linear regression that relates some variable Y with two explanatory variables X1 and X2. We would interpret the model as saying that Y changes by 3.2 units for every one-unit change in X1 (if X1 goes up by 2, Y goes up by 6.4, and so on), holding all else constant (all else equal). That means that, controlling for X2, X1 has this observed relationship. Likewise, holding X1 constant, every one-unit increase in X2 is associated with a 2.0-unit decrease in Y. We can also note the y-intercept of 1.0, meaning that Y = 1 when X1 and X2 are both zero. The error term (residual) is 0.21.
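
Evaluating that example equation for a few inputs makes the interpretation concrete; the helper function below is purely illustrative.

    def predict(x1, x2):
        # The example model Y = 1.0 + 3.2*X1 - 2.0*X2, ignoring the residual
        return 1.0 + 3.2 * x1 - 2.0 * x2

    print(predict(0, 0))   # 1.0  -> the y-intercept
    print(predict(2, 0))   # 7.4  -> raising X1 by 2 raises Y by 6.4
    print(predict(0, 1))   # -1.0 -> a one-unit rise in X2 lowers Y by 2.0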

What Are the Assumptions That Must Hold for Regression Models?
In order to properly interpret the output of a regression model, the following main assumptions about the underlying data-generating process of what you are analyzing must hold:

The relationship between variables is linear


Homoskedasticity, or that the variance of the variables and error term
must remain constant
All explanatory variables are independent of one another

Types of Regression Models


There are numerous regression analysis approaches available for
making predictions. Additionally, the choice of technique is
determined by various parameters, including the number of
independent variables, the form of the regression line, and the type
of dependent variable.

Let us examine several of the most often utilized regression analysis


techniques:

1. Linear Regression
The most extensively used modelling technique is linear regression, which assumes a linear connection between a dependent variable (Y) and an independent variable (X). It employs a regression line, also known as a best-fit line. The linear connection is defined as Y = c + m*X + e, where ‘c’ denotes the intercept, ‘m’ denotes the slope of the line, and ‘e’ is the error term.

The linear regression model can be simple (with only one dependent and one independent variable) or multiple (with one dependent variable and more than one independent variable).

2. Logistic Regression
When the dependent variable is discrete, the logistic regression
technique is applicable. In other words, this technique is used to
compute the probability of mutually exclusive occurrences such as
pass/fail, true/false, 0/1, and so forth. Thus, the target variable can
take on only one of two values, and a sigmoid curve represents its
connection to the independent variable, and probability has a value
between 0 and 1.

3. Polynomial Regression
The technique of polynomial regression analysis is used to represent a
non-linear relationship between dependent and independent
variables. It is a variant of the multiple linear regression model, except
that the best fit line is curved rather than straight.

4. Ridge Regression
The ridge regression technique is applied when data exhibit multicollinearity, that is, when the independent variables are highly correlated. While least squares estimates are unbiased under multicollinearity, their variances are large enough to cause the observed value to diverge from the actual value. Ridge regression reduces standard errors by biasing the regression estimates.

The lambda (λ) variable in the ridge regression equation resolves the
multicollinearity problem.


5. Lasso Regression

As with ridge regression, the lasso (Least Absolute Shrinkage and Selection Operator) technique penalizes the absolute magnitude of the regression coefficients. In doing so, lasso regression performs variable selection, because some coefficient values are shrunk all the way to zero.

6. Quantile Regression
The quantile regression approach is a subset of the linear regression
technique. It is employed when the linear regression requirements
are not met or when the data contains outliers. In statistics and

econometrics, quantile regression is used.

7. Bayesian Linear Regression
Bayesian linear regression is a form of regression analysis technique
used in machine learning that uses Bayes’ theorem to calculate the
regression coefficients’ values. Rather than determining the least-
squares, this technique determines the features’ posterior
distribution. As a result, the approach outperforms ordinary linear
regression in terms of stability.


8. Principal Components Regression


Multicollinear regression data is often evaluated using the principal components regression approach. Like ridge regression, principal components regression reduces standard errors by biasing the regression estimates. Principal component analysis (PCA) is used first to transform the training data, and then the resulting transformed samples are used to train the regressors.

9. Partial Least Squares Regression


The partial least squares regression technique is a fast and efficient
covariance-based regression analysis technique. It is advantageous for
regression problems with many independent variables with a high
probability of multicollinearity between the variables. The method
decreases the number of variables to a manageable number of
predictors, then is utilized in a regression.

10. Elastic Net Regression



Elastic net regression combines ridge and lasso regression techniques


that are particularly useful when dealing with strongly correlated
data. It regularizes regression models by utilizing the penalties
associated with the ridge and lasso regression methods.

Multiple Regression:

Multiple Regression is a step beyond simple regression. The main


difference between simple and multiple regression is that multiple
regression includes two or more independent variables – sometimes
called predictor variables – in the model, rather than just one.

As such, the purpose of multiple regression is to determine the utility


of a set of predictor variables for predicting an outcome, which is
generally some important event or behaviour. This outcome can be
designated as the outcome variable, the dependent variable, or the
criterion variable. For example, you might hypothesise that the need
to belong will predict motivations for Facebook use and that self-
esteem and meaningful existence will uniquely predict motivations for
Facebook use.

Before beginning your analysis, you should consider the following


points:

Regression analyses reveal relationships among variables (relationship


between the criterion variable and the linear combination of a set of
predictor variables) but do not imply a causal relationship.
A regression solution – or set of predictor variables – is sensitive to
combinations of variables. Whether a predictor is important in a
solution depends on the other predictors in the set. If the predictor of
interest is the only one that assesses some important facet of the
outcome, it will appear important. If a predictor is only one of several
predictors that assess the same important facet of the outcome, it
will appear less important. For a good set of predictor variables – the
smallest set of uncorrelated variables is best.


In these Venn Diagrams, you can see why it is best for the predictors
to be strongly correlated with the dependent variable but
uncorrelated with the other Independent Variables. This reduces the
amount of shared variance between the independent variables. The
illustration in Slide 2 shows logical relationships between predictors,

for two different possible regression models in separate Venn


diagrams. On the left, you can see three partially correlated
independent variables on a single dependent variable. The three
partially correlated independent variables are physical health, mental
health, and spiritual health and the dependent variable is life
satisfaction. On the right, you have three highly correlated
independent variables (e.g., BMI, blood pressure, heart rate) on the
dependent variable of life satisfaction. The model on the left would
have some use in discovering the associations between those
variables, however, the model on the right would not be useful, as all
three of the independent variables are basically measuring the same
thing and are mostly accounting for the same variability in the
dependent variable.

There are two main types of regression with multiple independent


variables:

Standard or Single Step: Where all predictors enter the regression


together.
Sequential or Hierarchical: Where all predictors are entered in blocks.
Each block represents one step.
We will now be exploring the single step multiple regression:

All predictors enter the regression equation at once. Each predictor is


treated as if it had been analysed in the regression model after all
other predictors had been analysed. These predictors are evaluated by the shared variance (i.e., level of prediction) between the dependent variable and the individual predictor variable.

Multiple Regression Assumptions


There are a number of assumptions that should be assessed before
performing a multiple regression analysis:

The dependent variable (the variable of interest) needs to be measured on a continuous scale.

There are two or more independent variables. These can be


measured using either continuous or categorical means.
The three or more variables of interest should have a linear
relationship, which you can check by using a scatterplot.
The data should have homoscedasticity. In other words, the spread of the data points around the line of best fit should stay roughly the same as you move along the line. Homoscedasticity can be checked by plotting the standardised residuals against the unstandardized predicted values.
The data should not have two or more independent variables that are
highly correlated. This is called multicollinearity which can be checked
using Variance-inflation-factor or VIF values. High VIF indicates that
the associated independent variable is highly collinear with the other
variables in the model.
There should be no spurious outliers.
The residuals (errors) should be approximately normally distributed. This can be checked with a histogram (with a superimposed normal curve) and by plotting the standardised residuals using either a P-P plot or a normal Q-Q plot.
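
A hedged sketch of a single-step multiple regression in Python is shown below. It uses simulated (hypothetical) data and the statsmodels library; the variable names stress, age, and gender are assumptions for illustration, not the data set discussed in this section.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    stress = rng.normal(20, 5, n)                 # continuous predictor
    age = rng.normal(40, 10, n)                   # continuous predictor
    gender = rng.integers(0, 2, n)                # dummy-coded predictor (0/1)
    illness = 2 + 0.4 * stress + 1.5 * gender + rng.normal(0, 2, n)

    X = sm.add_constant(pd.DataFrame({"stress": stress, "age": age, "gender": gender}))
    model = sm.OLS(illness, X).fit()              # all predictors enter together
    print(model.summary())                        # F-test, R-squared, slopes, p-values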
Multiple Regression Interpretation
For our example research question, we will be looking at the combined effect of three predictor variables – perceived life stress, gender, and age – on the outcome variable of physical illness.



Slide 1 contains the standard regression analysis output.


On Slide 2 you can see in the red circle, the test statistics are
significant. The F-statistic examines the overall significance of the
model, and shows if your predictors as a group provide a better fit to
the data than no predictor variables, which they do in this example.

The R2 values are shown in the green circle. The R2 value shows the
total amount of variance accounted for in the criterion by the
predictors, and the adjusted R2 is the estimated value of R2 in the
population.

Table with data on physical illness

Moving on to the individual variable effects on Slide 3, you can see


the significance of the contribution of individual predictors in light
blue. The unstandardized slope or the B value is shown in red, which
represents the change caused by the variable (e.g., increasing 1 unit
of perceived stress will raise physical illness by .40). Finally, you can
see the standardised slope value in green, which are also known as
beta values. These values are standardised ranging from +/-0 to 1,
similar to an r value.

We should also briefly discuss dummy variables:

Table on data on physical illness


A dummy variable is a variable that is used to represent categorical
information relating to the participants in a study. This could include
gender, location, race, age groups, and you get the idea. Dummy
variables are most often represented as dichotomous variables (they
only have two values). When performing a regression, it is easier for
interpretation if the values for the dummy variable are set to 0 or 1, where 1 usually represents that the characteristic is present. For example, consider a question asking the participants “Do you have a driver’s license?” with a forced-choice response of yes or no.

In this example on Slide 3 and circled in red, the variable is gender


with male = 0, and female = 1. A positive Beta (B) means an
association with 1, whereas a negative beta means an association
with 0. In this case, being female was associated with greater levels of
physical illness.
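
A tiny illustration of dummy coding in Python (the survey column name is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"has_license": ["yes", "no", "yes", "yes"]})

    # Code the categorical answer as 0/1, with 1 meaning the characteristic is present
    df["license_dummy"] = (df["has_license"] == "yes").astype(int)
    print(df)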

Multiple Regression Write Up


Here is an example of how to write up the results of a standard
multiple regression analysis:

In order to test the research question, a multiple regression was


conducted, with age, gender (0 = male, 1 = female), and perceived life
stress as the predictors, with levels of physical illness as the
dependent variable. Overall, the results showed the utility of the
predictive model was significant, F(3,363) = 39.61, R2 = .25, p< .001.
All of the predictors explain a large amount of the variance between
the variables (25%). The results showed that perceived stress and
gender of participants were significant positive predictors of physical
illness (β=.47, t= 9.96, p< .001, and β=.15, t= 3.23, p= .001,
respectively). The results showed that age (β = -.02, t = -0.49, p = .63) was not a significant predictor of physical illness.


Regression line:

Key Takeaways
The regression line establishes a linear relationship between two
sets of variables. The change in one variable is dependent on the
changes to the other (independent variable).
The Least Squares Regression Line (LSRL) is plotted nearest to the
data points (x, y) on a regression graph.
Regression is widely used in financial models like CAPM and
investing measures like Beta to determine the feasibility of a
project. It is also used for creating projections of investments and
financial returns.

If Y is the dependent variable and X is the independent variable,


the Y on X regression line equation is represented as follows:
‘Y = a + bX + ɛ.’
Regression Line Explained
A regression line is a statistical tool that depicts the correlation
between two variables. Specifically, it is used when variation in
one (dependent variable) depends on the change in the value of
the other (independent variable).

There can be two cases of simple linear regression:

The equation is Y on X, where the value of Y changes with a


variation in the value of X.
The equation is X on Y, where the change in X variable depends
upon the Y variable’s deviation.

Regression is extensively applied to various real-world


scenarios—business, investment, finance, and marketing. For
example, in finance, regression is majorly employed in the Beta
and Capital Asset Pricing Model (CAPM)—for estimating returns and budgeting.


Using regression, the company can determine the appropriate asset price with
respect to the cost of capital. In the stock market, it is used for
determining the impact of stock price changes on the price of
underlying commodities.

In marketing, regression analysis can be used to determine how


price fluctuation results in the increase or decrease in goods
sales. It is very effective in creating sales projections for a future
period—by correlating market conditions, weather predictions,
economic conditions, and past sales.

Formula
The formula to determine the Least Squares Regression Line
(LSRL) of Y on X is as follows:

Y=a + bX + ɛ

Here,

Y is the dependent variable.


a is the Y-intercept.
b is the slope of the regression line.
X is the independent variable.
ɛ is the residual (error).
Also,

b = (N∑XY − (∑X)(∑Y)) / (N∑X² − (∑X)²) ;

And,

a = (∑Y – b ∑X) / N

Where N is the total number of observations.

Example
Let us look at a hypothetical example to understand real-world
applications of the theory.

The finance manager of ABC Motors wants to correlate variation


in sales and variation in the price of electric bikes. For this
purpose, he analyzes data pertaining to the last five years.

We assume there is no error. The price and sales volume for the
previous five years are as follows:

Year Price (in $) Sales Volume


2017 2100 15000
2018 2050 16500
2019 2000 21000
2020 2200 19000
2021 2050 20000
Based on the given data, determine the regression line of Y on X.

Solution:

Let us determine the regression line of Y on X:

Given:

Y = Sales Volume
X = Price (in $)
N=5

ɛ=0
Year    Price (in $) (X)    Sales Volume (Y)    X²          XY
2017    2100                15000               4410000     31500000
2018    2050                16500               4202500     33825000
2019    2000                21000               4000000     42000000
2020    2200                19000               4840000     41800000
2021    2050                20000               4202500     41000000
Total   10400               91500               21655000    190125000
Y = a + bX + ɛ

Let us first find out the value of b and a:

b = (N∑XY − (∑X)(∑Y)) / (N∑X² − (∑X)²)

b = ((5 × 190125000) − (10400 × 91500)) / ((5 × 21655000) − (10400)²)

b = (950625000 − 951600000) / (108275000 − 108160000)

b = −975000 / 115000 = −8.478
a = (∑Y – b ∑X) / N

a = (91500 − (−8.478 × 10400)) / 5


a = 35935
Y = 35935 + ( – 8.478 X) + 0
Y = 35935 – 8.478X
The data is represented as a regression line graph:


Visualization of collected data makes data interpretation easier.


The regression line is sometimes called the line of best fit.

It is important to note that real-world data cannot always be


expressed with a regression equation. If the majority of
observations follow a pattern, then the outliers can be
eliminated. But sometimes, there is no obvious pattern. If there
are random irregularities in collected data—the regression
method is not suitable.
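
The slope and intercept found by hand above can be checked with a short Python sketch using the same five observations; this verification is an addition, not part of the original example.

    import numpy as np

    price = np.array([2100, 2050, 2000, 2200, 2050])        # X
    sales = np.array([15000, 16500, 21000, 19000, 20000])   # Y

    b, a = np.polyfit(price, sales, 1)     # least-squares slope and intercept
    print(round(b, 3), round(a))           # approximately -8.478 and 35935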

Properties of Regression:

Some of the properties of regression coefficient:


It is generally denoted by ‘b’.
It is expressed in the form of an original unit of data.
If there are two variables, say x and y, two values of the regression coefficient are obtained: one when x is independent and y is dependent, and the other when y is independent and x is dependent. The regression coefficient of y on x is represented by byx and that of x on y by bxy.
Both of the regression coefficients must have the same sign. If byx is positive, bxy will also be positive, and vice versa.
If one regression coefficient is greater than unity, then the other must be less than unity.
The geometric mean between the two regression coefficients is equal
to the correlation coefficient
r = sqrt(byx × bxy)
Also, the arithmetic mean (AM) of the two regression coefficients is equal to or greater than the coefficient of correlation:
(byx + bxy)/2 ≥ r
The regression coefficients are independent of the change of the
origin. But, they are not independent of the change of the scale. It
means there will be no effect on the regression coefficients if any
constant is subtracted from the value of x and y. If x and y are
multiplied by any constant, then the regression coefficient will
change.
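
The first few properties can be verified numerically with a small hypothetical data set; the sketch below is illustrative only.

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    y = np.array([3.0, 7.0, 8.0, 12.0, 14.0])

    cov_xy = np.cov(x, y, bias=True)[0, 1]
    byx = cov_xy / x.var()                 # regression coefficient of y on x
    bxy = cov_xy / y.var()                 # regression coefficient of x on y
    r = np.corrcoef(x, y)[0, 1]

    print(np.isclose(r, np.sqrt(byx * bxy)))   # True: r is their geometric mean
    print((byx + bxy) / 2 >= r)                # True: their arithmetic mean >= r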

SECTION-III

TIME SERIES

Introduction:

What Is a Time Series?


A time series is a sequence of data points that occur in successive
order over some period of time. This can be contrasted with cross-
sectional data, which captures a point in time.

In investing, a time series tracks the movement of the chosen data


points, such as a security’s price, over a specified period of time with
data points recorded at regular intervals. There is no minimum or
maximum amount of time that must be included, allowing the data to
be gathered in a way that provides the information being sought by
the investor or analyst examining the activity.

KEY TAKEAWAYS
A time series is a data set that tracks a sample over time.
In particular, a time series allows one to see what factors influence
certain variables from period to period.
Time series analysis can be useful to see how a given asset, security,
or economic variable changes over time.
Forecasting methods using time series are used in both fundamental
and technical analysis.
Although cross-sectional data is seen as the opposite of time series,
the two are often used together in practice.
Understanding Time Series

A time series can be taken on any variable that changes over time. In
investing, it is common to use a time series to track the price of a
security over time. This can be tracked over the short term, such as
the price of a security on the hour over the course of a business day,
or the long term, such as the price of a security at close on the last
day of every month over the course of five years.

Time series analysis can be useful to see how a given asset, security,
or economic variable changes over time. It also can be used to
examine how the changes associated with the chosen data point
compare to shifts in other variables over the same time period.

Time series is also used in several nonfinancial contexts, such as


measuring the change in population over time. The figure below
depicts such a time series for the growth of the U.S. population over
the century from 1900 to 2000.

A time series graph of the population of the United States from 1900 to 2000.

Time Series Analysis


Suppose you wanted to analyze a time series of daily closing stock
prices for a given stock over a period of one year. You would obtain a
list of all the closing prices for the stock from each day for the past
year and list them in chronological order. This would be a one-year
daily closing price time series for the stock.

Delving a bit deeper, you might analyze time series data with
technical analysis tools to know whether the stock’s time series shows

any seasonality. This will help to determine if the stock goes through
peaks and troughs at regular times each year. Analysis in this area
would require taking the observed prices and correlating them to a
chosen season. This can include traditional calendar seasons, such as
summer and winter, or retail seasons, such as holiday seasons.

Alternatively, you can record a stock’s share price changes as it


relates to an economic variable, such as the unemployment rate. By
correlating the data points with information relating to the selected
economic variable, you can observe patterns in situations exhibiting
dependency between the data points and the chosen variable.

One potential issue with time series data is that since each variable is
dependent on its prior state or value, there can be a great deal of
autocorrelation, which can bias results.
Time Series Forecasting
Time series forecasting uses information regarding historical values
and associated patterns to predict future activity. Most often, this
relates to trend analysis, cyclical fluctuation analysis, and issues of
seasonality. As with all forecasting methods, success is not
guaranteed.

The Box-Jenkins Model, for instance, is a technique designed to


forecast data ranges based on inputs from a specified time series. It
forecasts data using three principles: autoregression, differencing,
and moving averages. These three principles are known as p, d, and q,
respectively. Each principle is used in the Box-Jenkins analysis, and
together they are collectively shown as an autoregressive integrated
moving average, or ARIMA (p, d, q). ARIMA can be used, for instance,
to forecast stock prices or earnings growth.
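
A minimal sketch of fitting an ARIMA(p, d, q) model with the statsmodels library is shown below; the price series is simulated, and the order (1, 1, 1) is an arbitrary choice for illustration rather than a recommended specification.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Hypothetical daily closing prices indexed by business day
    prices = pd.Series(
        100 + np.cumsum(np.random.default_rng(1).normal(0, 1, 250)),
        index=pd.date_range("2023-01-01", periods=250, freq="B"),
    )

    fit = ARIMA(prices, order=(1, 1, 1)).fit()   # p=1, d=1, q=1
    print(fit.forecast(steps=5))                 # forecast the next five closes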

Another method, known as rescaled range analysis, can be used to


detect and evaluate the amount of persistence, randomness, or mean
reversion in time series data. The rescaled range can be used to
extrapolate a future value or average for the data to see if a trend is
stable or likely to reverse.

Cross-Sectional vs. Time Series Analysis


Cross-sectional analysis is one of the two overarching comparison
methods for stock analysis. Cross-sectional analysis looks at data
collected at a single point in time, rather than over a period of time.
The analysis begins with the establishment of research goals and the
definition of the variables that an analyst wants to measure. The next
step is to identify the cross section, such as a group of peers or an
industry, and to set the specific point in time being assessed. The final
step is to conduct analysis, based on the cross section and the
variables, and come to a conclusion on the performance of a
company or organization. Essentially, cross-sectional analysis shows
an investor which company is best given the metrics that they care
about.

Time series analysis, known as trend analysis when it applies to


technical trading, focuses on a single security over time. In this case,
the price is being judged in the context of its past performance. Time
series analysis shows an investor whether the company is doing
better or worse than before by the measures that they care about.
Often these will be classics like earnings per share (EPS), debt to
equity, free cash flow (FCF), and so on. In practice, investors will
usually use a combination of time series analysis and cross-sectional
analysis before making a decision—for example, looking at the EPS
over time and then checking the industry benchmark EPS.

What Are Some Examples of Time Series?


A time series can be constructed from any data that is measured over
time at evenly spaced intervals. Historical stock prices, earnings, gross
domestic product (GDP), or other sequences of financial or economic
data can be analyzed as a time series.

How Do You Analyze Time Series Data?


Statistical techniques can be used to analyze time series data in two
key ways: to generate inferences on how one or more variables affect
some variable of interest over time, or to forecast future trends.
Unlike cross-sectional data, which is essentially one slice of a time
series, the arrow of time allows an analyst to make more plausible
causal claims.

What Is the Distinction Between Cross-Sectional and Time Series Data?
A cross section looks at a single point in time, which is useful for
comparing and analyzing the effect of different factors on one
another or describing a sample. Time series involves repeated
sampling of the same data over time. In practice, both forms of
analysis are commonly used, and when available, they are used
together.

How Are Time Series Used in Data Mining?


Data mining is a process that turns reams of raw data into useful
information. By utilizing software to look for patterns in large batches
of data, businesses can learn more about their customers to develop
more effective marketing strategies, increase sales, and decrease
costs. Time series, such as a historical record of corporate filings or
financial statements, are particularly useful here to identify trends
and patterns that may be forecasted into the future.

The Bottom Line


A time series is a sequence of numerical data points in successive
order. In investing, it records chosen data points (such as a security’s
price) at regular intervals and tracks their movement over a specified
period of time.

Time series analysis can be useful to see what factors influence


certain variables from period to period. It can also provide insights
into how an asset, security, or economic variable changes over time.

A variety of financial and economic data, such as historical stock


prices, earnings, and GDP, can be analyzed as a time series.

Objectives of Time Series:

The objectives of time series analysis may be classified as:

Description
Explanation
Prediction
Control

These objectives are described below.

Description
The first step in the analysis is to plot the data and obtain simple
descriptive measures of the main properties of the series, such as the
trend and any seasonal fluctuations. A price series, for example, may
show a regular seasonal pattern of price change even though the
pattern is not perfectly consistent from year to year. Plotting the
series also makes it easier to spot “wild” observations or outliers
(values that do not appear to be consistent with the rest of the data)
and to detect turning points, where an upward trend suddenly
changes to a downward trend. If there is a turning point, different
models may have to be fitted to the two parts of the series.

Explanation

When observations are taken on two or more variables, it may be
possible to use the variation in one time series to explain the
variation in another series. This may lead to a deeper understanding.
A multiple regression model may be helpful in this case.

Prediction
Given an observed time series, one may want to predict its future
values. This is an important task in sales forecasting and in the
analysis of economic and industrial time series. The terms prediction
and forecasting are often used interchangeably.

Control
When a time series is generated to measure the quality of a
manufacturing process, the aim may be to control the process.
Control procedures are of several different kinds. In quality control,
the observations are plotted on a control chart and the controller
takes action as a result of studying the charts. Alternatively, a
stochastic model may be fitted to the series, future values of the
series predicted, and the input process variables then adjusted so as
to keep the process on target.

Identification of trends:

Trend analysis is a technique used to examine and predict movements


of an item based on current and historical data. You can use trend
analysis to improve your business using trend data to inform your
decision-making.

As your business becomes more established, you will be able to


compare data and identify trends in:

financial performance
competitor movement and growth
manufacturing efficiency
new or emerging technologies

customer complaints
staff performance reviews and key performance indicators (KPIs).

Understanding the value of trend analysis


Trend analysis helps you compare your business against other
businesses to establish a benchmark of how your business should be
operating, both at the initial stage and as it develops.

Analysing market trends is key to adapting and changing your


business, keeping current and ahead of the industry, and for continual
growth.

Trend analysis consists of:

trend data, for assessing changes within your own business


performance over time
benchmark data, for comparing your business to a similar
organisation (learn about benchmarking your business for greater
performance)
market trends, for analysing the data from a whole industry or sector.
Gathering data
The most important rule for gathering data for trend analysis is that it
is up to date, reliable and consistent, because this is what you will
base your business decisions on, and you need to have an accurate
comparison of information over time.

The amount and quality of data will depend on the information


captured over the months and years the business has been operating.
But if the business has little or no data, you can use benchmarking
data and market trends to gather the information.

If the data is only partially captured or inaccurate, the analysis can


only be partially correct.

For example, ensuring you or your bookkeeper retain all data, that it
is kept up to date and entered accurately, will mean you can run
regular reports on past performance giving you insights into where
the business is going.

Tips for gathering data


Businesses commonly use financial record-keeping software that is
compatible with the Australian Taxation Office so that analysis can be
done in a more streamlined way.
Financial ratios and calculators let you use data from your financial
statements to learn about your business’s profitability.
Conducting thorough due diligence when buying an established
business or franchise will give you an advantage by having access to
historical data to rely on in your analysis.
Identifying relevant data
The following explains the type of trend data that may help your
analysis, why it is useful to collect and where the data can be sourced.
Data analysis
Data analysis can be completed using common business software that
includes visualisation of the data in charts and graphs and is often
easier to interpret than raw data, as it shows the trends more clearly.

Business intelligence (BI) software was once only affordable for large
businesses but is now available as software as a service (SAAS) at a
low monthly or yearly cost.

You can also access data and analytics on your website and social
media platforms.

The benefits of using BI software include:

integration with common free or lower cost business software and


apps the business may already use—for example, for
finance and banking
customer relationship databases

website and social media analytics


WHS
rostering and other human resources
equipment and maintenance
records and file management cloud systems
filtering aggregated data into date ranges and categories
exploring raw data within the visualisations
sharing with staff and stakeholders.
If you are not using BI software or the commonly used business
software with visualisations and reports, you can use spreadsheets to
manually analyse the data.

Analysis requires you and your advisers to interpret the data—the


software you use is only as good as your ability to interpret and act on
what you see.

Interpreting your business trend data


When interpreting your data, ask the following questions as part of
the analysis.

When would a trend become worrying and require your action? For
example, decreasing purchases in a retail location over the past 1 to 2
quarters may be explained by increasing domestic costs, but over the
past year the demographics in your location may have changed. You
may need to review your products and services.
What will be your critical decision points? Can you, for instance, apply
a threshold that is an acceptable variation for your business (e.g. 10%
over or under)?
What opportunity might improve your business over another? For
example, if your information technology (IT) system is experiencing
interruptions and it is a continuing trend, would outsourcing your
system be preferable to purchasing a new system? The cost of
outsourcing may be better than purchasing a new system.

What would constitute a crisis trend? In other words, what trend—if


it were to continue—might cause permanent damage to the
business?
What patterns are you seeing between the data sets? For example,
does the data from your project management system show causes
from your customer management system?
How does your business data compare to your industry benchmarks?
How could you improve each function of your business slightly to
improve your own benchmarks?
Limitations of trend analysis
There are some limitations to trend analysis, for example:

external financial crises and recessions, and the effects of a pandemic


factors that have changed results during the recorded period, such as
purchasing new equipment or outsourcing
adjustments for inflation.
Trend analysis is ‘working on the business’, rather than ‘in the
business’.

The Pareto Principle (80% of consequences result from 20% of causes)
also shows the importance of working on the business. The more time
you commit to trend analysis, the more valuable the improvements
you can make across your entire business.

Variation in time series:

The elements of which a time series is composed are called the
components of time series data. There are four basic components of
time series data, described below.

Different Sources of Variation are:

Seasonal effect (Seasonal Variation or Seasonal Fluctuations)


Many time series exhibit a seasonal variation with an annual period,
such as sales figures and temperature readings. This type of variation
is easy to understand and can be measured or removed from the data
to give deseasonalized data. Seasonal fluctuations describe any regular
variation with a period of less than one year; for example, the cost of
various types of fruits and vegetables, clothes, unemployment figures,
average daily rainfall, the increase in the sale of tea in winter, and the
increase in the sale of ice cream in summer all show seasonal
variation. Changes which repeat themselves within a fixed period are
also called seasonal variations, for example traffic on roads in
morning and evening hours, sales at festivals like Eid, and the increase
in the number of passengers at weekends. Seasonal variations are
caused by climate, social customs, religious activities, etc.
Other Cyclic Changes (Cyclical Variation or Cyclic Fluctuations)
A time series may also exhibit cyclical variation at a fixed period due
to some other physical cause, such as the daily variation in
temperature. Cyclical variation is a non-seasonal component that
varies in a recognizable cycle. Sometimes a series exhibits oscillation
which does not have a fixed period but is predictable to some extent;
for example, economic data are affected by business cycles with a
period varying between about 5 and 7 years. In weekly or monthly
data, the cyclical component may describe any regular variation
(fluctuation) in the series. Cyclical variation is periodic in nature and
repeats itself like a business cycle, which has four phases: (i) peak,
(ii) recession, (iii) trough/depression, and (iv) expansion.
Trend (Secular Trend or Long Term Variation)
It is a longer-term change. Here we take into account the number of
observations available and make a subjective assessment of what is
long term. For example, climate variables sometimes exhibit cyclic
variation over a very long period, such as 50 years; if one had only 20
years of data, this long-term oscillation would appear to be a trend,
but if several hundred years of data were available, the long-term
oscillation itself would be visible. Trend movements are systematic in
nature: they are broad and steady, showing a slow rise or fall in the
same direction. The trend may be linear or non-linear (curvilinear).
Some examples of secular trends are the increase in prices, the
increase in pollution, the increase in the demand for wheat, the
increase in the literacy rate, and the decrease in deaths due to
advances in science. Taking averages over a certain period is a simple
way of detecting a trend in seasonal data; a change in the averages
over time is evidence of a trend in the given series, though there are
more formal tests for detecting a trend in a time series.
Other Irregular Variation (Irregular Fluctuations)
When the trend and cyclical variations are removed from a set of time
series data, the residual that is left may or may not be random.
Various techniques for analyzing series of this type examine whether
the irregular variation can be explained in terms of probability models
such as moving average or autoregressive models, i.e. whether any
cyclical variation is still left in the residuals. Variations that occur due
to sudden causes are called residual variations (irregular, accidental or
erratic fluctuations) and are unpredictable; for example, a rise in the
price of steel due to a strike in the factory, an accident due to brake
failure, a flood, an earthquake, or a war.

Methods of estimation of Trends:

The moving average method uses the concept of ironing out the
fluctuations of the data by taking means. It measures the trend by
eliminating the changes or variations by means of a moving average.
The simplest mean used for the measurement of a trend is the
arithmetic mean (average).

Moving Average
The moving average of period (extent) m is a series of successive
averages of m terms at a time. The data sets used for calculating the
averages start with the first, second, third observation and so on,
with m data points taken at a time.

In other words, the first average is the mean of the first m terms. The
second average is the mean of the m terms starting from the second
data up to (m + 1)th term. Similarly, the third average is the mean of
the m terms from the third to (m + 2) th term and so on.

If the extent or the period, m is odd i.e., m is of the form (2k + 1), the
moving average is placed against the mid-value of the time interval it
covers, i.e., t = k + 1. On the other hand, if m is even i.e., m = 2k, it is
placed between the two middle values of the time interval it covers,
i.e., t = k and t = k + 1.

When the period of the moving average is even, then we need to


synchronize the moving average with the original time period. It is
done by centering the moving averages i.e., by taking the average of
the two successive moving averages.
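
A small sketch of these calculations is given below using pandas; the sales
figures are invented purely to show how a 3-period moving average is placed
against the middle term and how a 4-period moving average is centred by
averaging two successive moving averages.

```python
# Moving-average trend estimation sketch (illustrative data only).
import pandas as pd

sales = pd.Series([120, 135, 150, 128, 140, 160, 155, 170, 165, 180, 175, 190])

# Odd period (m = 3): the average is placed against the middle term.
ma3 = sales.rolling(window=3, center=True).mean()

# Even period (m = 4): take the 4-term moving averages, then centre them by
# averaging each pair of successive moving averages and aligning to time t.
ma4 = sales.rolling(window=4).mean()
centred_ma4 = ma4.rolling(window=2).mean().shift(-2)

print(pd.DataFrame({"sales": sales,
                    "3-period MA": ma3,
                    "centred 4-period MA": centred_ma4}))
```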

Drawbacks of Moving Average


The main problem is to determine the extent of the moving average
which completely eliminates the oscillatory fluctuations.
This method assumes that the trend is linear but it is not always the
case.
It does not provide the trend values for all the terms.
This method cannot be used for forecasting future trends, which is
one of the main objectives of time series analysis.

Least Square Method:

What Is the Least Squares Method?


The least squares method is a form of mathematical regression
analysis used to determine the line of best fit for a set of data,
providing a visual demonstration of the relationship between the data
points. Each point of data represents the relationship between a
known independent variable and an unknown dependent variable.

This method is commonly used by statisticians and traders who want


to identify trading opportunities and trends.

The least squares method is a statistical procedure to find the best fit
for a set of data points.
The method works by minimizing the sum of the offsets or residuals
of points from the plotted curve.
Least squares regression is used to predict the behavior of dependent
variables.
The least squares method provides the overall rationale for the
placement of the line of best fit among the data points being studied.
Traders and analysts can use the least squares method to identify
trading opportunities and economic or financial trends.
Understanding the Least Squares Method
The least squares method is a form of regression analysis that
provides the overall rationale for the placement of the line of best fit
among the data points being studied. It begins with a set of data
points using two variables, which are plotted on a graph along the x-
and y-axis. Traders and analysts can use this as a tool to pinpoint
bullish and bearish trends in the market along with potential trading
opportunities.

The most common application of this method, sometimes referred to
as linear or ordinary least squares, aims to create a straight line that
minimizes the sum of the squares of the errors generated by the
results of the associated equations, such as the squared residuals
resulting from differences between the observed values and the
values anticipated based on the model.

For instance, an analyst may use the least squares method to


generate a line of best fit that explains the potential relationship
between independent and dependent variables. The line of best fit
determined from the least squares method has an equation that
highlights the relationship between the data points.

If the data shows a linear relationship between two variables, it
results in a least-squares regression line, which minimizes the vertical
distance from the data points to the regression line. The term least
squares is used because the fitted line has the smallest possible sum
of squared errors. A non-linear least-squares problem, on the other
hand, has no closed-form solution and is generally solved by iteration.

Dependent variables are illustrated on the vertical y-axis, while


independent variables are illustrated on the horizontal x-axis in
regression analysis. These designations form the equation for the line
of best fit, which is determined from the least squares method.
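
The sketch below shows, for a small set of assumed illustrative data points,
how the slope and intercept of the line of best fit can be computed directly
from the least squares formulas and cross-checked with NumPy's polyfit
routine.

```python
# Least squares line of best fit (illustrative x and y values).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

# Slope b and intercept a that minimise the sum of squared vertical errors.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"line of best fit: y = {a:.2f} + {b:.2f}x")

# The same fit via NumPy's built-in degree-1 polynomial fit.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"np.polyfit:       y = {intercept:.2f} + {slope:.2f}x")
```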
Advantages and Disadvantages of the Least Squares Method
The best way to find the line of best fit is by using the least squares
method. But traders and analysts may come across some issues, as
this isn’t always a fool-proof way to do so. Some of the pros and cons
of using this method are listed below.

Advantages
One of the main benefits of using this method is that it is easy to
apply and understand. That’s because it only uses two variables (one
that is shown along the x-axis and the other on the y-axis) while
highlighting the best relationship between them.

Investors and analysts can use the least square method by analyzing
past performance and making predictions about future trends in the
economy and stock markets. As such, it can be used as a decision-
making tool.

Disadvantages
The primary disadvantage of the least square method lies in the data
used. It can only highlight the relationship between two variables. As

such, it doesn’t take any others into account. And if there are any
outliers, the results become skewed.

Another problem with this method is that the data must be evenly
distributed. If this isn’t the case, the results may not be reliable.
Pros
Easy to apply and understand

Highlights relationship between two variables

Can be used to make predictions about future performance

Cons
Only highlights relationship between two variables

Doesn’t account for outliers

May be skewed if data isn’t evenly distributed



INDEX NUMBER

Definition:

Index number meaning refers to a statistical instrument used to


evaluate changes in multiple variables or a single variable over time. It
helps in expressing economic time series data and in comparing and
contrasting information.
This tool has various types. A few popular ones are quantity, price,
value, and special purpose. Moreover, individuals use different
formulae to compute index numbers, for example, the simple
aggregative method and the simple average of price relatives method.
There are various uses of index numbers. For example, they help
governments to formulate new policies and adjust existing ones.
Index Number In Economics Explained
Index number meaning refers to a process of evaluating variations in
different variables and fields over time. Typically, it has a base value
of 100, against which changes in price, production level, and other
measures are expressed. In economics, it simplifies comparison, which
can be difficult when using raw data. In other words, it helps one
quantify the changes in a sector, industry, or variable. Moreover, one
can utilize this statistical measure to monitor changes in data sets
over time.

This tool is very useful for comparing currencies having multiple


nominal values. In addition, various countries use this technique to
change their policies.

It has the following features. Let us look at them in detail.

These are a special type of average utilized to measure a relative or


net change in a single or group of variables when measuring absolute
change is impossible.
It represents relative changes in factors that one might not be able
to measure directly.
The technique for calculating this average subset varies from one
variable to another.
One can use this method to compare the levels of any phenomenon
on a specific date with its levels on a previous date.
One can change index numbers to any unit of measurement. Experts
often apply indexing methods to employment, production,
unemployment, inflation, etc.

Types
Although there are different types of index numbers, most of their
primary objective is to simplify data to make comparison easier. One
often uses this method in public and private sectors to make well-
informed decisions regarding policies, prices, and investments. Let us
look at some of the popular types of this statistical tool:

Quantity : It measures the changes in the volume or quantity of the


products produced, sold, and consumed over certain durations.
Hence, they measure the relative changes in the volume of a certain
set of items between timeframes. One can use it to measure
construction, production, or employment. It has two subtypes —
weighted and unweighted.
Price : It computes the change in the price of a single variable or
group of variables between two time periods. It is made up of a series
of numbers organized to compare the values of two periods. One
often uses this method to compare the prices of products in a given
timeframe with those of the base period. An example of this tool is
the Consumer Price Index or CPI.
Special Purpose : Its main purpose is not the same as the other types.
Typically, the primary objective of this tool is to track the changes in a
unique variable group, particular sector, or industry. Index numbers
tracking the changes in the securities market are an example of this
type.
Value : It measures the change in the aggregate value of a single or
group of variables compared to its value in the base period. One can
use this statistical measure to track the changes in sales, inventory,
trade, etc.
Index Number Formula

There are multiple formulae for calculating index numbers. Two


popular techniques are as follows:

Simple Aggregative Method


The formula is as follows:

P01 = ΣP1 ÷ ΣP0 x 100

Where:

P01 is the index number.

ΣP1 is the sum of all prices in the year for which one has to compute
the index number.

ΣP0 is the sum of all prices in the base year.

Simple Average of Price Relatives Method


Let us look at the formula for computing via this method.

P01 = ΣR ÷ N

Where:

ΣR = The sum of the price relatives or R = P1 ÷ P0 x 100

N = Total number of items


Calculation Examples
Let us look at two index number examples using the above formulae
to understand the concept better.

Example #1: Using The Simple Aggregative Method


Product | Price (Rs.) in Base Year 1980 (P0) | Price (Rs.) in Current Year 1985 (P1)
B | 15 | 30
C | 20 | 25
D | 45 | 55
E | 30 | 40
Total | ΣP0 = 110 | ΣP1 = 150
P01 = 150/110 x100

Or, P01 = 136.36

Example #2: Using The Simple Average of Price Relatives Method

Product | Price (Rs.) in Base Year (P0) | Price (Rs.) in Current Year (P1) | Price Relative (R = P1 ÷ P0 x 100)
C | 20 | 30 | 30/20 x 100 = 150
D | 25 | 50 | 50/25 x 100 = 200
E | 50 | 75 | 75/50 x 100 = 150
F | 32 | 40 | 40/32 x 100 = 125
N = 4 | | | ΣR = 625
P01 = ΣR ÷ N

Or, P01 = 625/4



Or, P01 = 156.25
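
For readers who prefer to verify the arithmetic, the short Python sketch
below reproduces both examples; the prices are taken straight from the
tables above.

```python
# Reproducing Example #1 (simple aggregative) and Example #2 (price relatives).
p0_agg = [15, 20, 45, 30]   # base-year prices, Example #1
p1_agg = [30, 25, 55, 40]   # current-year prices, Example #1

# Simple Aggregative Method: P01 = (ΣP1 / ΣP0) x 100
p01_aggregative = sum(p1_agg) / sum(p0_agg) * 100
print(round(p01_aggregative, 2))   # 136.36

p0_rel = [20, 25, 50, 32]   # base-year prices, Example #2
p1_rel = [30, 50, 75, 40]   # current-year prices, Example #2

# Simple Average of Price Relatives: P01 = ΣR / N, where R = (P1 / P0) x 100
relatives = [p1 / p0 * 100 for p0, p1 in zip(p0_rel, p1_rel)]
p01_relatives = sum(relatives) / len(relatives)
print(round(p01_relatives, 2))     # 156.25
```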

Importance
Some uses of index numbers are as follows:

1. General Importance

Generally, this tool helps in many ways. Some of them are as follows:

It helps one compare separate data sets concerning different


durations or different places.
The tool simplifies complicated facts.
It helps one make future predictions.
Moreover, this tool is useful for individuals engaging in practical or
academic research.

2. Measurement of Changes In Price Level

This tool measures the difference in the price levels or the value of
money. Additionally, it warns about inflationary tendencies, enabling
a government to take effective anti-inflationary measures.

3. Knowledge of Production Trends

This statistical measure provides information concerning the
production trends of different sectors in an economy. This helps one
evaluate the conditions of different industries.

4. Formation And Modification of Economic Policies

They help governments formulate and assess policies. A government


can formulate new policies or adjust the existing ones based on the
changes occurring in the economic conditions.

Besides these, the tools have specific uses in economics. They are as
follows:

They help analyze markets for particular commodities.


In the stock market, they provide information regarding the price
trends of stocks.
Also, by using this tool, bank officials can get information regarding
the changes in deposits.

5. Highlights Variation in Cost Of Living

This tool highlights the changes in a nation’s cost of living. It indicates


if a country’s cost of living is increasing or decreasing. As a result, the
government can modify workers’ wages to minimize the impact of
inflation on wage earners.

Limitations
The statistical measure has the following limitations:

It is never 100% accurate owing to the practical difficulties associated


with the calculation process. Moreover, it quantifies average change,
thus indicating only broad trends.
One cannot use Index numbers prepared for a particular purpose for
another purpose.
The tool does not consider the items’ quality. A general increase in
the index may be possible because of a product’s quality
improvement, not because of a price increase.
Different nations use different base years for the computation of
index numbers. Moreover, they include different items of different
qualities. Hence, this technique is not reliable for making
international comparisons.

Weighted Aggregate Methods:

A statistical measure that helps in finding out the percentage change
in the values of different variables, such as the price of different
goods, production of different goods, etc., over time is known as the
Index Number. The percentage change is determined by taking a base
year as a reference. This base year is the year of comparison. When
an investigator studies different goods simultaneously, then the
percentage change is considered the average for all the goods. There
are two broad categories of Index Numbers: viz., Simple or
Unweighted and Weighted Index Numbers.

According to Spiegel, “An Index Number is a statistical measure


designed to show changes in a variable or group of related variables
with respect to time, geographic location or other characteristics.”

According to Croxton and Cowden, “Index Numbers are devices for


measuring difference in the magnitude of a group of related
variables.”

Weighted Index Numbers


In simple or unweighted index numbers, equal importance is given to
each item. Under the Weighted Index Number, however, rational
weights are given to each commodity or item explicitly. These weights
indicate the relative importance of the commodities or items included
in the determination of the index.

Here, rational weights mean the weights which are perfectly rational
for one investigation. However, this weight might be unsuitable for
other investigations. In fact, the purpose of the index number and the
nature of the data concerned with it helps in deciding the rational
weights.

There are two methods through which Weighted Index Numbers can
be constructed; viz., Weighted Aggregative Method and Weighted
Average of Price Relatives Method.

1. Weighted Aggregative Method:


This method involves assigning weights to different items and
obtaining weighted aggregate of the prices instead of finding a simple
aggregate of prices. Some important methods of constructing
Weighted Aggregative Index are as follows:

Laspeyre’s Method
Paasche’s Method
Fisher’s Ideal Method
Drobish and Bowley’s Method
Marshall Edgeworth Method
Walsch’s Method
Kelly’s Method
Note: According to CBSE Syllabus, we will be only studying Laspeyre’s,
Paasche’s, and Fisher’s Methods.

i) Laspeyre’s Method

The method of calculating Weighted Index Numbers under which the


base year quantities are used as weights of different items is known
as Laspeyre’s Method. The formula for Laspeyre’s Price Index is:

Laspeyre's Price Index (P01) = (Σp1q0 ÷ Σp0q0) x 100

Here,

P01 = Price Index of the current year

p0 = Price of goods at base year

q0 = Quantity of goods at base year

p1 = Price of goods at the current year

ii) Paasche's Method

The method of calculating Weighted Index Numbers under which the
current year's quantities are used as weights of different items is
known as Paasche's Method. The formula for Paasche's Price Index is:

Paasche's Price Index (P01) = (Σp1q1 ÷ Σp0q1) x 100

Here,

P01 = Price Index of the current year

p0 = Price of goods in the base year

q1 = Quantity of goods in the current year

p1 = Price of goods in the current year



iii) Fisher’s Method


The method of calculating Weighted Index Numbers under which the
techniques of Laspeyre and Paasche are combined is known as
Fisher's Method. In other words, both the base year's and the current
year's quantities are used as weights. The formula for Fisher's Price
Index is:

Fisher's Price Index (P01) = √[(Σp1q0 ÷ Σp0q0) x (Σp1q1 ÷ Σp0q1)] x 100

Here,

P01 = Price Index of the current year

p0 = Price of goods in the base year

q0 = Quantity of goods in the base year

p1 = Price of goods in the current year

q1 = Quantity of goods in the current year

Fisher’s Method is considered the Ideal Method for Constructing


Index Numbers.
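
To see how the three formulas differ only in the quantity weights they use,
here is a brief sketch; the prices and quantities are made-up figures, not
data from the text.

```python
# Laspeyre's, Paasche's and Fisher's price indices (hypothetical data).
import math

p0 = [10, 8, 5]     # base-year prices
q0 = [30, 15, 20]   # base-year quantities
p1 = [12, 10, 6]    # current-year prices
q1 = [25, 20, 30]   # current-year quantities

def total(p, q):
    """Return the aggregate value Σ(p * q) for paired price and quantity lists."""
    return sum(pi * qi for pi, qi in zip(p, q))

laspeyres = total(p1, q0) / total(p0, q0) * 100   # base-year quantities as weights
paasche = total(p1, q1) / total(p0, q1) * 100     # current-year quantities as weights
fisher = math.sqrt(laspeyres * paasche)           # geometric mean of the two

print(round(laspeyres, 2), round(paasche, 2), round(fisher, 2))
```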

2. Weighted Average of Price Relatives Method:


Under this method, the base year prices of the commodities are taken
as the basis to calculate the price relatives for the current year. The
calculated price relatives are then multiplied by their respective
weights of the items. After that, the products determined are added
up and divided by the sum of weights.

The steps required to construct index number through this method


are as follows:

Firstly, calculate the price relatives of the current year, R = (p1 ÷ p0)
x 100; i.e., divide the price of each commodity in the current year by
its price in the base year, and denote the value calculated as R.
Now, multiply the price of each commodity in the base year (p0) by its
respective quantity weight (q0), and denote these value weights by W.
After that, multiply the price relatives (R) by the value weights (W)
and obtain their total; i.e., ∑RW.
Determine the total of the value weights; i.e., ∑W.
Use the following formula to determine the Index Number:
P01 = ∑RW ÷ ∑W

Example:
Use Weighted Relatives Method and determine the index number
from the following data for the year 2021 with 2010 as the base year.

Information Table

Solution:
Weighted Index Number Table

Weighted Average of Price Relatives:

P01 = ∑RW ÷ ∑W = 1,02,182 ÷ 790 = 129.34

The Index Number of 129.34 shows that there is an increase of


29.34% in the prices in the year 2021 as compared to the year 2010.
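
Since the example's data table is not reproduced above, the sketch below
works through the same steps on small hypothetical figures; the prices and
quantities are assumptions used only to show the mechanics of the method.

```python
# Weighted Average of Price Relatives sketch (hypothetical prices and quantities).
p0 = [20, 40, 10]   # base-year prices
q0 = [6, 5, 10]     # base-year quantities, used to form the value weights
p1 = [25, 60, 15]   # current-year prices

R = [cur / base * 100 for base, cur in zip(p0, p1)]   # price relatives
W = [base * qty for base, qty in zip(p0, q0)]         # value weights W = p0 * q0

p01 = sum(r * w for r, w in zip(R, W)) / sum(W)       # P01 = ΣRW / ΣW
print(round(p01, 2))
```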

SECTION-IV

SAMPLING
Meaning and basic of Sampling:

What Is Sampling?
Sampling is a process in statistical analysis where researchers take a
predetermined number of observations from a larger population.
Sampling allows researchers to conduct studies about a large group
by using a small portion of the population. The method of sampling
depends on the type of analysis being performed, but it may include
simple random sampling or systematic sampling. Sampling is
commonly done in statistics, psychology, and the financial industry.

KEY TAKEAWAYS
Sampling allows researchers to use a small group from a larger
population to make observations and determinations.

Types of sampling include random sampling, block sampling,


judgment sampling, and systematic sampling.
Researchers should be aware of sampling errors, which may be the
result of random sampling or bias.
Companies use sampling as a marketing tool to identify the needs and
wants of their target market.
Certified public accountants use sampling during audits to determine
the accuracy and completeness of account balances.

How Sampling Works


It can be difficult for researchers to conduct accurate studies on large
populations. In some cases, it can be impossible to study every
individual in the group. That’s why they often choose a small portion
to represent the entire group. This is called a sample. Samples allow
researchers to use characteristics of the small group to make
estimates of the larger population.

The chosen sample should be a fair representation of the entire


population. When taking a sample from a larger population, it is
important to consider how the sample is chosen. To get a
representative sample, it must be drawn randomly and encompass
the whole population. For example, a lottery system could be used to
determine the average age of students in a university by sampling
10% of the student body.

Sampling is commonly used when studying large portions of the


population for economic purposes. For instance, the monthly
employment report involves the use of sampling; the U.S. Bureau of
Labor Statistics (BLS) reports:

The Current Employment Statistics by using 122,000 businesses and


government agencies
The Current Population Survey with a sample of 60,000 different
households across the country

Researchers should be aware of sampling errors. This occurs when


the sample that is selected doesn’t represent the entire population.
This means that the results taken from the sample deviate from the
larger population. Sampling error may occur randomly or because
there is some form of bias. For instance, some members of the
sample group may choose not to participate, or they differ in some
way from other participants.

Sampling isn’t an exact science, so the results should be taken as
generalizations. As such, avoid drawing definitive conclusions about
the broader population from the sample group alone.
Types of Audit Sampling
As noted above, there are several different types of sampling that
researchers can use. These include random, judgment, block, and
systemic sampling. These are discussed in more detail below.

Random Sampling
With random sampling, every item within a population has an equal
probability of being chosen. It is the furthest removed from any
potential bias because there is no human judgement involved in
selecting the sample.

For example, a random sample may include choosing the names of 25


employees out of a hat in a company of 250 employees. The
population is all 250 employees, and the sample is random because
each employee has an equal chance of being chosen.
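
A tiny sketch of this "names out of a hat" idea, assuming a hypothetical list
of 250 employee labels, is shown below.

```python
# Simple random sampling: 25 employees drawn from 250, each equally likely.
import random

population = [f"employee_{i}" for i in range(1, 251)]
sample = random.sample(population, k=25)   # sampling without replacement
print(sample[:5])
```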

Judgment Sampling
Auditor judgment may be used to select the sample from the full
population. An auditor may only be concerned about transactions of a
material nature. For example, assume the auditor sets the threshold
for materiality for accounts payable transactions at $10,000. If the
client provides a complete list of 15 transactions over $10,000, the
auditor may just choose to review all transactions due to the small
population size.

The auditor may alternatively identify all general ledger accounts with
a variance greater than 10% from the prior period. In this case, the
auditor is limiting the population from which the sample selection is
being derived. Unfortunately, human judgment used in sampling
always comes with the potential for bias, whether explicit or implicit.

Block Sampling
Block sampling takes a consecutive series of items within the
population to use as the sample. For example, a list of all sales
transactions in an accounting period could be sorted in various ways,
including by date or by dollar amount.

An auditor may request that the company’s accountant provide the


list in one format or the other in order to select a sample from a
specific segment of the list. This method requires very little
modification on the auditor’s part, but it is likely that a block of
transactions will not be representative of the full population.

Systematic Sampling
Systematic sampling begins at a random starting point within the
population and uses a fixed, periodic interval to select items for a
sample. The sampling interval is calculated as the population size
divided by the sample size. Despite the sample population being
selected in advance, systematic sampling is still considered random if
the periodic interval is determined beforehand and the starting point
is random.

Assume that an auditor reviews the internal controls related to a


company’s cash account and wants to test the company policy that
stipulates that checks exceeding $10,000 must be signed by two
people. The population consists of every company check exceeding
$10,000 during the fiscal year, which, in this example, was 300. The
auditor uses probability statistics and determines that the sample size
should be 20% of the population or 60 checks. The sampling interval
is 5 or 300 checks ÷ 60 sample checks.

Therefore, the auditor selects every fifth check for testing. Assuming
no errors are found in the sampling test work, the statistical analysis
gives the auditor a 95% confidence rate that the check procedure was
performed correctly. The auditor tests the sample of 60 checks and
finds no errors, so he concludes that the internal control over cash is
working properly.
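
The cheque-audit example can be sketched in a few lines of Python; the
cheque numbers below are placeholders, and only the random start and the
fixed interval of 5 matter.

```python
# Systematic sampling: 60 cheques from 300, sampling interval = 300 / 60 = 5.
import random

population = list(range(1, 301))           # cheque numbers 1..300 (placeholders)
sample_size = 60
interval = len(population) // sample_size  # 5

start = random.randrange(interval)         # random starting point within 0..4
sample = population[start::interval]       # every 5th cheque thereafter
print(len(sample), sample[:10])
```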

Example of Sampling
Market Sampling
Businesses aim to sell their products and/or services to target
markets. Before presenting products to the market, companies
generally identify the needs and wants of their target audience. To do
so, they may employ sampling of the target market population to gain
a better understanding of those needs to later create a product
and/or service that meets those needs. In this case, gathering the
opinions of the sample helps to identify the needs of the whole.

Audit Sampling
During a financial audit, a certified public accountant (CPA) may use
sampling to determine the accuracy and completeness of account
balances in their client’s financial statements. This is called audit
sampling.
Audit sampling is necessary when the population (the account
transaction information) is large.

What Is Sampling Error?


Sampling error is what happens when the sample collected for review
doesn’t represent the entire population being studied. This
jeopardizes the accuracy and validity of the study being conducted.
For instance, sampling error occurs if researchers include professors
in the sample when they’re trying to determine how students feel
about the university experience. Sampling error may be random or
the result of some type of bias.

What Is Cluster Sampling?


Cluster sampling is a form of probability sampling. When researchers
conduct cluster sampling, they divide the population into smaller
groups. They then select individuals randomly from these groups to
form their samples and conduct their studies. This kind of sampling is
used when both the overall population and sample size is too large to
handle.

What’s the Difference Between Probability and Non-Probability Sampling?
Probability sampling gives researchers the chance to come to stronger
conclusions about the entire population that is being studied. It
involves the use of random sampling, which means that all of the
participants in the group are equally likely to get a chance to be
chosen as a representative sample of the entire population. The result
is often unbiased.

Non-probability sampling, on the other hand, allows researchers to


easily collect information. This type of sampling is generally biased as
it is unknown which participants will be chosen as a sample.

The Bottom Line


Statisticians often resort to sampling in order to conduct research
when they’re dealing with large populations. Sampling is a technique
that involves taking a small number of participants from a much
bigger group. This is often found when data needs to be collected

about the population, including statistical analysis, population


surveys, and economic studies.

Sampling and Non-Sampling Error:

To understand what sampling errors are, you first need to know a


little bit about sampling and what it means in survey research. (If
you’re all clued up on sampling already, feel free to skip ahead to the
next section.)

When you’re running a survey, you’re usually interested in a much


bigger group of people than you can reach. The practical solution is to
take a representative sample – a group that stands in for the actual
population you want to study.

To make sure that your sample provides a fair representation, you


need to follow some survey sampling best practices. Perhaps the
most well-known of these is getting your sample size right. (If the
sample size is too big, you’re putting in lots of work for no meaningful
gain. If the sample size is too small, you can’t be sure your sample is
representative of the actual population.)

But there’s more to doing sampling well than just getting the right
sample size. For this reason, it is important to understand both
sampling errors and non-sampling errors so you can prevent them
from causing problems in your research.


Non-sampling errors vs. sampling errors: definitions


Somewhat confusingly, the term ‘sampling error’ doesn’t mean
mistakes researchers have made when selecting or working with a
sample. Problems like choosing the wrong people, letting bias enter

the picture, or failing to anticipate that participants will self-select or


fail to respond: these are non-sampling errors.

Sampling error definition


Sampling error, on the other hand, means the difference between the
mean values of the sample and the mean values of the entire
population, so it only happens when you’re working with
representative samples. It’s the inevitable gap between your sample
and the true population value.

As OECD explains, the whole population will never be perfectly


represented by a sample because the population is larger and more
complete. In this sense, sampling error occurs whenever you’re
sampling. It’s not a human error, and it can’t be completely avoided.

Interestingly, it’s not usually possible to quantify the degree of


sampling error in a study since – by definition – the relevant data for
the entire population is not measured.

However, you can reduce sampling errors by following good practices


– more on that below.

Is sampling error the same as standard error?


Standard error is a popular way of measuring sampling error. It
expresses the extent of the sampling error so that it can be
communicated and understood. Sampling error is the concept,
standard error is the way it’s measured.

What about standard deviation?


Standard error is a kind of standard deviation. It’s the amount that
the sample mean differs from the entire population mean. Or to think
of it another way, it’s the amount the sample mean would vary if you
repeated the sampling process multiple times.

And confidence intervals?



A confidence interval expresses how much your results are affected


by error – i.e. how confident you can be that they are right.
Confidence intervals express the upper and lower limits of your
margin of error. If the margin of error is narrow, the confidence will
be greater. The confidence interval of a result is often expressed in
percentages, e.g. 95% or 99%.
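
To connect these three ideas, the sketch below computes the mean, the
standard error, and a 95% confidence interval for a small, invented sample
using SciPy's t distribution.

```python
# Standard error and 95% confidence interval (illustrative sample values).
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.1])

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))    # standard error of the mean

t_crit = stats.t.ppf(0.975, df=len(sample) - 1)   # two-sided 95% critical value
margin = t_crit * se
print(f"mean = {mean:.2f}, SE = {se:.3f}, "
      f"95% CI = ({mean - margin:.2f}, {mean + margin:.2f})")
```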

Non-sampling error definition


Non-sampling errors can happen even when you’re not sampling; i.e.,
they need to be avoided whether you’re working with a
representative sample (such as with a national survey) or doing a
total enumeration of your entire population (such as when you’re
carrying out employee experience surveys with your workforce).

Non-sampling errors occur when there are problems with the


sampling method, or the way the survey is designed or carried out.

Examples of sampling and non-sampling errors


1. Population specification errors (non-sampling errors)
This error occurs when the researcher does not understand who they
should survey. For example, imagine a survey about breakfast cereal
consumption in families. Who to survey? It might be the entire family,
the person who most often does the grocery shopping, or the
children. The shopper might make the purchase decision, but the
children influence the cereal choice.

This kind of non-sampling error can be avoided by thoroughly


understanding your research question before you begin constructing
a questionnaire or selecting respondents.

2. Sample frame error (non-sampling error)


Sample frame error occurs when the wrong subpopulation is used to
select a sample so that it significantly fails to represent the entire
population. A classic frame error occurred in the 1936 U.S.
presidential election between Roosevelt – the Democratic candidate –

and Landon of the Republican party. The sample frame was from car
registrations and telephone directories. In 1936, many Americans did
not own cars or telephones, and those who did were largely
Republicans. The results wrongly predicted a Republican victory.

The error here lies in the way a sample has been selected. Bias has
been unconsciously introduced because the researchers didn’t
anticipate that only certain kinds of people would show up in their list
of respondents, and parts of the population of interest have been
excluded. A modern equivalent might be using mobile phone
numbers, and therefore inadvertently missing out on adults who
don’t own a mobile phone, such as older people or those with severe
learning disabilities.

Frame errors can also happen when respondents from outside the
population of interest are incorrectly included. For example, say a
researcher is doing a national study. Their list might be drawn from a
geographical map area that accidentally includes a small corner of a
foreign territory – and therefore includes respondents who are not
relevant to the scope of the study.

3. Selection error (non-sampling error)


Selection error occurs when respondents self-select their
participation in the study – only those that are interested respond. It
can also be introduced from the researcher’s side as a non-random
sampling error. For example, if a researcher puts out a call for
responses on social media, they’re going to get responses from
people they know, and of those people, only the more helpful or
affable individuals will reply. They’re not a random sample of the
whole population.

Selection error can be controlled by improving data collection


methods and going to extra lengths to get participation. A typical
survey process includes initiating pre-survey contact requesting
cooperation, actual surveying, and post-survey follow-up. If a

response is not received, a second survey request follows, and


perhaps interviews using alternate modes such as telephone or
person-to-person.

4. Non-response (non-sampling error)


Non-response errors occur when respondents are different from
those who do not respond. For example, say you’re doing market
research in advance of launching a new product. You might get a
disproportionate level of participation from your existing customers,
since they know who you are, and miss out on hearing from a broader
pool of people who don’t yet buy from you. Like selection error, this
leads to a non-random sample that misrepresents the whole
population.

Non-response error may occur because either the potential


respondent was not contacted or refused to respond. The extent of
this non-response error can be checked through follow-up surveys
using alternate modes.

5. Sampling errors
As described previously, sampling errors occur because of variation in
the number or representativeness of the sample that responds.
Sampling errors can be controlled and reduced by (1) careful sample
designs, (2) large enough samples (check out our online sample size
calculator), and (3) multiple contacts to ensure a representative
response.


Be sure to keep an eye out for these sampling and non-sampling


errors so you can avoid them in your research.

How can you use sampling and non-sampling errors to improve market research?

Understanding how sampling errors and the different kinds of non-


sampling errors work will better equip you to produce reliable results
for your business market research.

Data that are skewed by common non-sampling errors, or


unnecessary levels of sampling error, can introduce confusion in your
business, as different results from different studies might conflict with
each other.

Even worse, poor-quality data can lead to false predictions, as in the


Roosevelt election when sample frame error led to false confidence in
a Republican win. Translate that issue into a business scenario and
you could end up badly misjudging your market and making costly
mistakes.


Hypothesis Testing
Formulation and Procedure for testing a hypothesis:

What Is Hypothesis Testing?


Hypothesis testing, sometimes called significance testing, is an
act in statistics whereby an analyst tests an assumption
regarding a population parameter. The methodology
employed by the analyst depends on the nature of the data
used and the reason for the analysis.

Hypothesis testing is used to assess the plausibility of a


hypothesis by using sample data. Such data may come from a
larger population, or from a data-generating process. The
word “population” will be used for both of these cases in the
following descriptions.

KEY TAKEAWAYS
Hypothesis testing is used to assess the plausibility of a
hypothesis by using sample data.
The test provides evidence concerning the plausibility of the
hypothesis, given the data.
Statistical analysts test a hypothesis by measuring and
examining a random sample of the population being analyzed.
The four steps of hypothesis testing include stating the
hypotheses, formulating an analysis plan, analyzing the
sample data, and analyzing the result.

How Hypothesis Testing Works


In hypothesis testing, an analyst tests a statistical sample, with
the goal of providing evidence on the plausibility of the null
hypothesis.

Statistical analysts test a hypothesis by measuring and


examining a random sample of the population being analyzed.
All analysts use a random population sample to test two

different hypotheses: the null hypothesis and the alternative


hypothesis.

The null hypothesis is usually a hypothesis of equality


between population parameters; e.g., a null hypothesis may
state that the population mean return is equal to zero. The
alternative hypothesis is effectively the opposite of a null
hypothesis (e.g., the population mean return is not equal to
zero). Thus, they are mutually exclusive, and only one can be
true. However, one of the two hypotheses will always be true.

The null hypothesis is a statement about a population


parameter, such as the population mean, that is assumed to
be true.
4 Steps of Hypothesis Testing
All hypotheses are tested using a four-step process:

The first step is for the analyst to state the hypotheses.


The second step is to formulate an analysis plan, which
outlines how the data will be evaluated.
The third step is to carry out the plan and analyze the sample
data.
The final step is to analyze the results and either reject the
null hypothesis, or state that the null hypothesis is plausible,
given the data.

Real-World Example of Hypothesis Testing



If, for example, a person wants to test that a penny has


exactly a 50% chance of landing on heads, the null hypothesis
would be that 50% is correct, and the alternative hypothesis
would be that 50% is not correct.

Mathematically, the null hypothesis would be represented as
Ho: P = 0.5. The alternative hypothesis would be denoted as
“Ha” and written as Ha: P ≠ 0.5, meaning that the probability of
landing on heads does not equal 50%.

A random sample of 100 coin flips is taken, and the null


hypothesis is then tested. If it is found that the 100 coin flips
were distributed as 40 heads and 60 tails, the analyst would
assume that a penny does not have a 50% chance of landing
on heads and would reject the null hypothesis and accept the
alternative hypothesis.

If, on the other hand, there were 48 heads and 52 tails, then it
is plausible that the coin could be fair and still produce such a
result. In cases such as this where the null hypothesis is
“accepted,” the analyst states that the difference between
the expected results (50 heads and 50 tails) and the observed
results (48 heads and 52 tails) is “explainable by chance
alone.”
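
This coin example can be reproduced in a few lines of Python. The sketch below assumes the SciPy library is available and uses a normal-approximation z-test for a proportion; the counts of 40 and 48 heads out of 100 are the ones discussed above.

# Normal-approximation z-test for a proportion (SciPy assumed).
# H0: P = 0.5 (the penny is fair); Ha: P != 0.5.
from math import sqrt
from scipy.stats import norm

def proportion_z_test(successes, n, p0=0.5):
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * norm.sf(abs(z))      # two-tailed p-value
    return z, p_value

for heads in (40, 48):                 # the two outcomes discussed above
    z, p = proportion_z_test(heads, 100)
    print(f"{heads}/100 heads: z = {z:.2f}, p-value = {p:.4f}")
# 40 heads gives p ~ 0.046 (< 0.05, reject H0); 48 heads gives p ~ 0.69 (fail to reject).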

What are the Four Key Steps Involved in Hypothesis Testing?



Hypothesis testing begins with an analyst stating two


hypotheses, only one of which can be right. The analyst then
formulates an analysis plan, which outlines how the data will
be evaluated. Next, they move to the testing phase and
analyze the sample data. Finally, the analyst analyzes the
results and either rejects the null hypothesis or states that the
null hypothesis is plausible, given the data.

What are the Benefits of Hypothesis Testing?


Hypothesis testing helps assess the accuracy of new ideas or
theories by testing them against data. This allows researchers
to determine whether the evidence supports their hypothesis,
helping to avoid false claims and conclusions. Hypothesis
testing also provides a framework for decision-making based
on data rather than personal opinions or biases. By relying on
statistical analysis, hypothesis testing helps to reduce the
effects of chance and confounding variables, providing a
robust framework for making informed conclusions.

What are the Limitations of Hypothesis Testing?


Hypothesis testing relies exclusively on data and doesn’t
provide a comprehensive understanding of the subject being
studied. Additionally, the accuracy of the results depends on
the quality of the available data and the statistical methods
used. Inaccurate data or inappropriate hypothesis formulation
may lead to incorrect conclusions or failed tests. Hypothesis
testing can also lead to errors, such as analysts either
accepting or rejecting a null hypothesis when they shouldn’t
have. These errors may result in false conclusions or missed

opportunities to identify significant patterns or relationships


in the data.

Large and Small Sample Test:

Z,t and F test:

The coronavirus pandemic has made a statistician out of us


all. We are constantly checking the numbers, making our own
assumptions on how the pandemic will play out, and
generating hypotheses on when the “peak” will happen. And
it’s not just us performing hypothesis building – the media is
thriving on it.

A few days back, I was reading a news article that mentioned


this outbreak “could potentially be seasonal” and relent in
warmer conditions:

Hypothesis about coronavirus


So I started wondering – what else can we hypothesize about
the coronavirus? Are adults more likely to be affected by the
outbreak of coronavirus? How does Relative Humidity impact
the spread of the virus? What is the evidence to support
these claims? How can we test these hypotheses? As a statistics enthusiast, all these questions dug up my old knowledge of the fundamentals of Hypothesis Testing. In
this article, we will discuss the concept of Hypothesis Testing
and the difference between the z-test and t-test. We will then
conclude our Hypothesis Testing learning using a COVID-19
case study.

What is Hypothesis Testing?


Hypothesis testing helps in data analysis by providing a way to
make inferences about a population based on a sample of
data. It allows analysts to decide whether to accept or reject a
given assumption or hypothesis about the population based
on the evidence provided by the sample data. For example,
hypothesis testing can determine whether a sample mean
significantly differs from a hypothesized population mean or
whether a sample proportion differs substantially from a
hypothesized population proportion. This information helps
decide whether to accept or reject a given assumption or
hypothesis about the population. In statistical analysis,
hypothesis testing makes inferences about a population based
on a sample of data.

In machine learning, hypothesis testing evaluates a model’s


performance and determines its parameters’ significance. For
example, a t-test or z-test compares the means of two data
groups to determine if there is a significant difference
between them. This information can be used to improve the model or to select the best set of features. Additionally, hypothesis testing
can evaluate a model’s accuracy and decide how to proceed
with further development or deployment. We can even test
the statistical validity of machine learning algorithms like
linear regression and logistic regression on a given dataset
using the process of hypothesis testing.


Fundamentals of Hypothesis Testing



Let’s take an example to understand the concept of


Hypothesis Testing. A person is on trial for a criminal offense,
and the judge needs to provide a verdict on his case. Now,
there are four possible combinations in such a case:

First Case: The person is innocent, and the judge identifies the
person as innocent
Second Case: The person is innocent, and the judge identifies
the person as guilty
Third Case: The person is guilty, and the judge identifies the
person as innocent
Fourth Case: The person is guilty, and the judge identifies the
person as guilty
Outcome possibilities of Hypothesis Testing [z test and t test]
As you can clearly see, there can be two types of error in the
judgment – Type 1 error, when the verdict is against the
person while he was innocent, and Type 2 error, when the
verdict is in favor of the person while he was guilty.

According to the Presumption of Innocence, the person is


considered innocent until proven guilty. That means the judge
must find the evidence which convinces him “beyond a
reasonable doubt.” This phenomenon of “Beyond a
reasonable doubt” can be understood as Probability (Judge
Decided Guilty | Person is Innocent) should be small.

Basic Concepts of Hypothesis Testing


We consider the Null Hypothesis to be true until we find
strong evidence against it. Then we accept the Alternate
Hypothesis. We also determine the Significance Level (⍺),
which can be understood as the probability of (Judge Decided

Guilty | Person is Innocent) in the previous example. Thus, if ⍺


is smaller, it will require more evidence to reject the Null
Hypothesis. Don’t worry; we’ll cover all this using a case study
later.

Inferential statistics [z test and t test]


Steps to Perform Hypothesis Testing
There are four steps to performing Hypothesis Testing:

Set the Null and Alternate Hypotheses


Set the Significance Level, Criteria for a decision
Compute the test statistic
Make a decision
Steps of Hypothesis Testing [z test and t test]
It must be noted that z-Test & t-Tests are Parametric Tests,
which means that the Null Hypothesis is about a population
parameter, which is less than, greater than, or equal to some
value. Steps 1 to 3 are quite self-explanatory but on what
basis can we make a decision in step 4? What does this p-
value indicate?

We can understand this p-value as the measurement of the


Defense Attorney’s argument. If the p-value is less than ⍺ , we
reject the Null Hypothesis, and if the p-value is greater than ⍺,
we fail to reject the Null Hypothesis.

Critical Value, P-Value


Let’s understand the logic of Hypothesis Testing with the
graphical representation for Normal Distribution.

Graphical representation of normal distribution [z test and t test]
The above visualization helps to understand the z-value and
its relation to the critical value. Typically, we set the
Significance level at 10%, 5%, or 1%. If our test score lies in
the Acceptance Zone, we fail to reject the Null Hypothesis. If
our test score lies in the Critical Zone, we reject the Null
Hypothesis and accept the Alternate Hypothesis.

Critical Value is the cut off value between Acceptance Zone


and Rejection Zone. We compare our test score to the critical
value and if the test score is greater than the critical value,
that means our test score lies in the Rejection Zone and we
reject the Null Hypothesis. On the other hand, if the test score
is less than the Critical Value, that means the test score lies in
the Acceptance Zone and we fail to reject the null Hypothesis.

But why do we need a p-value when we can reject/accept


hypotheses based on test scores and critical values?

P-value has the benefit that we only need one value to make a
decision about the hypothesis. We don’t need to compute
two different values such as critical value and test scores.
Another benefit of using the p-value is that we can test at any
desired level of significance by comparing this directly with
the significance level.

p-value [z test and t test]


This way, we don’t need to compute test scores and critical
values for each significance level. We can get the p-value and
directly compare it with the significance level of our interest.
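
The two decision routes can be seen side by side in a small sketch (Python with SciPy assumed; the test score 1.83 is just an illustrative value): comparing the test score with the critical value and comparing the p-value with ⍺ give the same decision.

# Critical-value approach vs p-value approach for a right-tailed z-test (SciPy assumed).
from scipy.stats import norm

z_score = 1.83                         # illustrative test score
alpha = 0.05
critical_value = norm.ppf(1 - alpha)   # ~1.645 for a right-tailed test at 5%
p_value = norm.sf(z_score)             # area to the right of the test score, ~0.034
print("reject H0 via critical value:", z_score > critical_value)
print("reject H0 via p-value:", p_value < alpha)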

Directional Hypothesis
In the Directional Hypothesis, the null hypothesis is rejected if
the test score is too large (for right-tailed) or too small (for
left-tailed). Thus, the rejection region for such a test consists
of one part, which is on the right side for a right-tailed test; or
the rejection region is on the left side from the center in the
case of a left-tailed test.

Directional hypothesis graph [z test and t test]


Non-Directional Hypothesis
In a Non-Directional Hypothesis test, the Null Hypothesis is
rejected if the test score is either too small or too large. Thus,
the rejection region for such a test consists of two parts: one
on the left and one on the right. This is a case of a two-tailed
test.

Non-directional hypothesis graph [z test and t test]


What is the Z-Test Statistic?
z tests are a statistical way of testing a Null Hypothesis when
either:

We know the population variance, or


We do not know the population variance, but our sample size
is large n ≥ 30
If we have a sample size of less than 30 and do not know the
population variance, we must use a t-test. This is how we
judge when to use the z-test vs the t-test. Further, it is
assumed that the z-statistic follows a standard normal
distribution. In contrast, the t-statistic follows the t-distribution with degrees of freedom equal to n − 1, where n is the sample size.

It must be noted that the samples used for z-test or t-test


must be independent samples, and must also have a
distribution identical to the population distribution. This
makes sure that the sample is not “biased” to/against the Null
Hypothesis which we want to validate/invalidate.

Examples of Z Test
One-Sample Z-Test
We perform the One-Sample z-Test when we want to
compare a sample mean with the population mean.
Z-score formula: z = (x̄ − μ) / (σ/√n), where x̄ is the sample mean, μ the population mean, σ the population standard deviation, and n the sample size.
Here’s an Example to Understand a One Sample z-Test

Let’s say we need to determine if girls on average score higher


than 600 in the exam. We have the information that the
standard deviation for girls’ scores is 100. So, we collect the
data of 20 girls by using random samples and record their
marks. Finally, we also set our ⍺ value (significance level) to be
0.05.

In this example:

Mean Score for Girls is 641


The number of data points in the sample is 20
The population mean is 600
Standard Deviation for Population is 100
Z score, p-value, and critical value example [z test and t test]

Since the P-value is less than 0.05, we can reject the null
hypothesis and conclude based on our result that Girls on
average scored higher than 600.
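
A minimal sketch of this one-sample z-test, using the summary figures above (Python with SciPy assumed):

# One-sample z-test from the summary figures above (SciPy assumed).
# H0: mu = 600, Ha: mu > 600 (right-tailed), alpha = 0.05.
from math import sqrt
from scipy.stats import norm

x_bar, mu, sigma, n = 641, 600, 100, 20
z = (x_bar - mu) / (sigma / sqrt(n))   # ~1.83
p_value = norm.sf(z)                   # ~0.033, right-tailed
print(f"z = {z:.2f}, p-value = {p_value:.3f}")  # p < 0.05 -> reject H0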

Two-Sample Z-Test
We perform a Two Sample z-test when we want to compare
the mean of two samples.
Two-sample z-score formula: z = ((x̄1 − x̄2) − (μ1 − μ2)) / √(σ1²/n1 + σ2²/n2)
Here’s an Example to Understand a Two Sample Z-Test

Here, let’s say we want to know if Girls on an average score 10


marks more than the boys. We have the information that the
standard deviation for girls’ Score is 100 and for boys’ score is
90. Then we collect the data of 20 girls and 20 boys by using
random samples and record their marks. Finally, we also set
our ⍺ value (significance level) to be 0.05.

In this example:

Mean Score for Girls (Sample Mean) is 641


Mean Score for Boys (Sample Mean) is 613.3
Standard Deviation for the Population of Girls’ is 100
Standard deviation for the Population of Boys’ is 90
Sample Size is 20 for both Girls and Boys
Difference between Mean of Population is 10
Two-sample z-score calculations [z-test]
Thus, based on the p-value, we fail to reject the Null Hypothesis. We don’t have enough evidence to conclude that girls on average score 10 marks more than the boys. Pretty simple, right?
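
A minimal sketch of this two-sample z-test, again using the summary figures above (Python with SciPy assumed):

# Two-sample z-test from the summary figures above (SciPy assumed).
# H0: mu_girls - mu_boys = 10, Ha: mu_girls - mu_boys > 10, alpha = 0.05.
from math import sqrt
from scipy.stats import norm

x1, x2, d0 = 641, 613.3, 10          # sample means and hypothesised difference
s1, s2, n1, n2 = 100, 90, 20, 20     # population std devs and sample sizes
z = ((x1 - x2) - d0) / sqrt(s1**2 / n1 + s2**2 / n2)   # ~0.59
p_value = norm.sf(z)                                   # ~0.28, right-tailed
print(f"z = {z:.2f}, p-value = {p_value:.2f}")  # p > 0.05 -> fail to reject H0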

What is the T-Test?


T-tests are a statistical way of testing a hypothesis when:

We do not know the population variance


Our sample size is small, n < 30
Examples of T Test
One-Sample T-Test
We perform a One-Sample t-test when we want to compare a
sample mean with the population mean. The difference from
the z-Test is that we do not have the information on
Population Variance here. We use the sample standard
deviation instead of population standard deviation in this
case.

t-score formula: t = (x̄ − μ) / (s/√n), where s is the sample standard deviation.


Here’s an Example to Understand a One Sample T-Test

Let’s say we want to determine if on average girls score more


than 600 in the exam. We do not have the information related
to variance (or standard deviation) for girls’ scores. To perform a t-test, we randomly collect the data of 10 girls with
their marks and choose our ⍺ value (significance level) to be
0.05 for Hypothesis Testing.

In this example:

Mean Score for Girls is 606.8


The size of the sample is 10
The population mean is 600

Standard Deviation for the sample is 13.14


t score calculation [t test]
Our p-value is greater than 0.05 thus we fail to reject the null
hypothesis and don’t have enough evidence to support the
hypothesis that on average, girls score more than 600 in the
exam.
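
A minimal sketch of this one-sample t-test from the summary figures above (Python with SciPy assumed):

# One-sample t-test from the summary figures above (SciPy assumed).
# H0: mu = 600, Ha: mu > 600, alpha = 0.05, df = n - 1 = 9.
from math import sqrt
from scipy.stats import t

x_bar, mu, s, n = 606.8, 600, 13.14, 10
t_stat = (x_bar - mu) / (s / sqrt(n))   # ~1.64
p_value = t.sf(t_stat, df=n - 1)        # ~0.07, right-tailed
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")  # p > 0.05 -> fail to reject H0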

Two-Sample T-Test
We perform a Two-Sample t-test when we want to compare
the mean of two samples.

Two-sample t-test formula: t = ((x̄1 − x̄2) − (μ1 − μ2)) / √(s1²/n1 + s2²/n2)


Here’s an Example to Understand a Two-Sample T-Test

Here, let’s say we want to determine if on average, boys score


15 marks more than girls in the exam. We do not have the
information related to variance (or standard deviation) for
girls’ scores or boys’ scores. To perform a t-test, we randomly
collect the data of 10 girls and boys with their marks. We
choose our ⍺ value (significance level) to be 0.05 as the
criteria for Hypothesis Testing.

In this example:

Mean Score for Boys is 630.1


Mean Score for Girls is 606.8
Difference between Population Mean 15
Standard Deviation for Boys’ score is 13.42
Standard Deviation for Girls’ score is 13.14
Two-sample t-test calculation

Thus, the p-value is less than 0.05, so we can reject the null


hypothesis and conclude that on average boys score 15 marks
more than girls in the exam.
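
In practice a two-sample t-test is usually run directly on the raw marks rather than on summary figures. The sketch below (Python with SciPy assumed) uses two purely hypothetical arrays of 10 marks each, only to show the mechanics; scipy.stats.ttest_ind returns the t statistic and a two-tailed p-value.

# Two-sample t-test on raw scores (SciPy assumed). The two arrays below are
# hypothetical marks for 10 boys and 10 girls, used only to show the mechanics.
from scipy.stats import ttest_ind

boys  = [630, 642, 618, 625, 655, 628, 611, 640, 633, 619]
girls = [600, 612, 595, 618, 607, 590, 615, 604, 611, 616]
t_stat, p_value = ttest_ind(boys, girls, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, two-tailed p-value = {p_value:.4f}")
# Compare p_value with alpha (e.g. 0.05) to decide whether to reject H0.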

F-Test:

What is F Test in Statistics?


The F test in statistics is a test that is performed on an f
distribution. A two-tailed f test is used to check whether the
variances of the two given samples (or populations) are equal
or not. However, if an f test checks whether one population
variance is either greater than or lesser than the other, it
becomes a one-tailed hypothesis f test.

F Test Definition
F test can be defined as a test that uses the f test statistic to check whether the variances of two samples (or populations) are equal. To conduct an f test, the population should follow an f distribution and the samples must be independent. On conducting the hypothesis
test, if the results of the f test are statistically significant then
the null hypothesis can be rejected otherwise it cannot be
rejected.

F Test Formula
The f test is used to check the equality of variances using
hypothesis testing. The f test formula for different hypothesis
tests is given as follows:

Left Tailed Test:


Null Hypothesis: H0: σ1² = σ2²
Alternate Hypothesis: H1: σ1² < σ2²
Decision Criteria: If the f statistic < f critical value then reject the null hypothesis

Right Tailed test:

Null Hypothesis: H0: σ1² = σ2²
Alternate Hypothesis: H1: σ1² > σ2²
Decision Criteria: If the f test statistic > f test critical value then reject the null hypothesis

Two Tailed test:

Null Hypothesis: H0: σ1² = σ2²
Alternate Hypothesis: H1: σ1² ≠ σ2²
Decision Criteria: If the f test statistic > f test critical value then the null hypothesis is rejected

F Statistic
The f test statistic or simply the f statistic is a value that is
compared with the critical value to check if the null
hypothesis should be rejected or not. The f test statistic
formula is given below:

F statistic for large samples: F = σ1² / σ2², where σ1² is the variance of the first population and σ2² is the variance of the second population.

F statistic for small samples: F = s1² / s2², where s1² is the variance of the first sample and s2² is the variance of the second sample.

The selection criteria for σ1² and σ2² in an f statistic are given below:

For a right-tailed and a two-tailed f test, the variance with the greater value will be in the numerator. Thus, the sample corresponding to σ1² becomes the first sample. The smaller variance goes in the denominator and belongs to the second sample.
For a left-tailed test, the smallest variance becomes the numerator (sample 1) and the highest variance goes in the denominator (sample 2).
F Test Critical Value
A critical value is a point that a test statistic is compared to in
order to decide whether to reject or not to reject the null
hypothesis. Graphically, the critical value divides a distribution
into the acceptance and rejection regions. If the test statistic
falls in the rejection region then the null hypothesis can be
rejected otherwise it cannot be rejected. The steps to find the
f test critical value at a specific alpha level (or significance
level), α, are as follows:

Find the degrees of freedom of the first sample. This is done by subtracting 1 from the first sample size. Thus, x = n1 − 1.
Determine the degrees of freedom of the second sample by subtracting 1 from the second sample size. This gives y = n2 − 1.
If it is a right-tailed test then α is the significance level. For a left-tailed test, 1 − α is the alpha level. However, if it is a two-tailed test then the significance level is given by α / 2.
The F table is used to find the critical value at the required alpha level.
The intersection of the x column and the y row in the f table will give the f test critical value.
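
A small sketch of a right-tailed F test for the equality of two variances (Python with SciPy assumed; the sample variances and sizes are hypothetical). Here scipy.stats.f.ppf plays the role of the F table.

# Right-tailed F test for equality of two variances (SciPy assumed).
# The sample variances and sizes below are hypothetical.
from scipy.stats import f

s1_sq, n1 = 220.0, 21   # larger sample variance goes in the numerator
s2_sq, n2 = 92.0, 16
F = s1_sq / s2_sq                       # test statistic
df1, df2 = n1 - 1, n2 - 1               # x = n1 - 1, y = n2 - 1
critical = f.ppf(1 - 0.05, df1, df2)    # F-table value at alpha = 0.05
print(f"F = {F:.2f}, critical value = {critical:.2f}")
print("reject H0:", F > critical)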
ANOVA F Test
The one-way ANOVA is an example of an f test. ANOVA stands
for analysis of variance. It is used to check the variability of
group means and the associated variability in observations
within that group. The F test statistic is used to conduct the
ANOVA test. The hypothesis is given as follows:

H0: The means of all groups are equal.

H1: The means of all groups are not equal.

Test Statistic: F = explained variance / unexplained variance

Decision rule: If F > F critical value then reject the null hypothesis.

To determine the critical value of an ANOVA f test, the degrees of freedom are given by df1 = K − 1 and df2 = N − K, where N is the overall sample size and K is the number of groups.
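
A minimal one-way ANOVA sketch on three hypothetical groups (Python with SciPy assumed):

# One-way ANOVA F test across three hypothetical groups (SciPy assumed).
# H0: all group means are equal; H1: at least one mean differs.
from scipy.stats import f_oneway

group_a = [22, 25, 27, 23, 26]
group_b = [30, 28, 31, 29, 32]
group_c = [24, 26, 25, 27, 23]
F, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {F:.2f}, p-value = {p_value:.4f}")
# If p_value < 0.05 (equivalently F > the critical value with df1 = K-1, df2 = N-K), reject H0.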

Non Parametric Test:

Non-parametric tests are the mathematical methods used in


statistical hypothesis testing, which do not make assumptions
about the frequency distribution of variables that are to be
evaluated. The non-parametric experiment is used when

there are skewed data, and it comprises techniques that do


not depend on data pertaining to any particular distribution.

The word non-parametric does not mean that these models


do not have any parameters. The fact is, the characteristics
and number of parameters are pretty flexible and not
predefined. Therefore, these models are called distribution-
free models.

Non-Parametric T-Test
Whenever a few assumptions about the given population are uncertain, we use non-parametric tests, which are considered the counterparts of parametric tests. When data are not
distributed normally or when they are on an ordinal level of
measurement, we have to use non-parametric tests for
analysis. The basic rule is to use a parametric t-test for
normally distributed data and a non-parametric test for
skewed data.

Non-Parametric Paired T-Test


The paired sample t-test is used to compare two mean scores, and these scores come from the same group. The paired samples t-test is used when the variable has two levels, and those levels are repeated measures.


Non-parametric Test Methods



The four different techniques of non-parametric tests, namely the Mann Whitney U test, the sign test, the Wilcoxon signed-rank test, and the Kruskal Wallis test, are discussed here in detail.
We know that the non-parametric tests are completely based
on the ranks, which are assigned to the ordered data. The
four different types of non-parametric test are summarized
below with their uses, null hypothesis, test statistic, and the
decision rule.

Kruskal Wallis Test


Kruskal Wallis test is used to compare the continuous
outcome in greater than two independent samples.

Null hypothesis, H0: K Population medians are equal.

Test statistic:

If N is the total sample size, k is the number of comparison groups, Rj is the sum of the ranks in the jth group and nj is the sample size in the jth group, then the test statistic H is given by:

H = [12 / (N(N + 1))] × Σ (Rj² / nj) − 3(N + 1), where the sum runs over the k groups.

Decision Rule: Reject the null hypothesis H0 if H ≥ critical value
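
A quick sketch of the Kruskal Wallis test on three hypothetical independent samples (Python with SciPy assumed); scipy.stats.kruskal reports H and a p-value, so the decision can be made by comparing the p-value with ⍺.

# Kruskal-Wallis H test on three hypothetical independent samples (SciPy assumed).
from scipy.stats import kruskal

sample1 = [12, 15, 14, 10, 13]
sample2 = [18, 21, 17, 20, 19]
sample3 = [11, 14, 13, 12, 16]
H, p_value = kruskal(sample1, sample2, sample3)
print(f"H = {H:.2f}, p-value = {p_value:.4f}")  # reject H0 if p_value < alpha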

Sign Test

The sign test is used to compare the continuous outcome in


the paired samples or the two matched samples.

Null hypothesis, H0: Median difference should be zero

Test statistic: The test statistic of the sign test is the smaller of
the number of positive or negative signs.

Decision Rule: Reject the null hypothesis if the smaller of


number of the positive or the negative signs are less than or
equal to the critical value from the table.
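
A short sketch of the sign test on hypothetical paired data (Python with SciPy assumed). SciPy has no dedicated sign-test function, so the count of positive signs is fed to an exact binomial test with p = 0.5.

# Sign test on hypothetical paired data (SciPy assumed).
from scipy.stats import binomtest

before = [72, 68, 75, 80, 66, 71, 73, 69]
after  = [75, 70, 74, 84, 70, 71, 78, 72]
diffs = [a - b for a, b in zip(after, before) if a != b]   # drop zero differences
positives = sum(d > 0 for d in diffs)
result = binomtest(positives, n=len(diffs), p=0.5)         # H0: median difference = 0
print(f"positive signs = {positives}/{len(diffs)}, p-value = {result.pvalue:.4f}")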

Mann Whitney U Test


Mann Whitney U test is used to compare the continuous
outcomes in the two independent samples.

Null hypothesis, H0: The two populations should be equal.

Test statistic:

If R1 and R2 are the sums of the ranks in group 1 and group 2 respectively, then the test statistic “U” is the smaller of:

U1 = n1·n2 + n1(n1 + 1)/2 − R1 and U2 = n1·n2 + n2(n2 + 1)/2 − R2, where n1 and n2 are the two sample sizes.

Decision Rule: Reject the null hypothesis if the test statistic, U


is less than or equal to critical value from the table.
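
A short sketch of the Mann Whitney U test on two hypothetical independent samples (Python with SciPy assumed):

# Mann-Whitney U test on two hypothetical independent samples (SciPy assumed).
from scipy.stats import mannwhitneyu

group1 = [7, 9, 6, 8, 10, 7]
group2 = [12, 11, 14, 9, 13, 12]
U, p_value = mannwhitneyu(group1, group2, alternative="two-sided")
print(f"U = {U}, p-value = {p_value:.4f}")  # reject H0 if p_value < alpha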

Wilcoxon Signed-Rank Test


Wilcoxon signed-rank test is used to compare the continuous
outcome in the two matched samples or the paired samples.

Null hypothesis, H0: Median difference should be zero.

Test statistic: The test statistic W, is defined as the smaller of


W+ or W- .

Where W+ and W- are the sums of the positive and the


negative ranks of the difference scores.

Decision Rule: Reject the null hypothesis if the test statistic, W


is less than or equal to the critical value from the table.
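
A short sketch of the Wilcoxon signed-rank test on hypothetical paired scores (Python with SciPy assumed):

# Wilcoxon signed-rank test on hypothetical paired scores (SciPy assumed).
from scipy.stats import wilcoxon

before = [72, 68, 75, 80, 66, 71, 73, 69]
after  = [75, 70, 74, 84, 70, 73, 78, 72]
W, p_value = wilcoxon(after, before)   # H0: median difference = 0
print(f"W = {W}, p-value = {p_value:.4f}")  # reject H0 if p_value < alpha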

Advantages and Disadvantages of Non-Parametric Test


The advantages of the non-parametric test are:

Easily understandable
Short calculations
Assumption of distribution is not required
Applicable to all types of data
The disadvantages of the non-parametric test are:

Less efficient as compared to parametric test


The results may or may not provide an accurate answer
because they are distribution free

Applications of Non-Parametric Test



The conditions when non-parametric tests are used are listed


below:

When the assumptions of parametric tests are not satisfied.
When the hypothesis being tested does not involve any assumption about the distribution.
For quick data analysis.
When unscaled data is available.

Chi Square Test:

The world is constantly curious about the Chi-Square test’s
application in machine learning and how it makes a
difference. Feature selection is a critical topic in machine
learning, as you will have multiple features in line and must
choose the best ones to build the model. By examining the
relationship between the elements, the chi-square test aids in
the solution of feature selection problems. In this tutorial, you
will learn about the chi-square test and its application.

What Is a Chi-Square Test?


The Chi-Square test is a statistical procedure for determining
the difference between observed and expected data. This test
can also be used to determine whether there is a relationship between the categorical variables in our data. It helps to find out whether a

difference between two categorical variables is due to chance


or a relationship between them.

Chi-Square Test Definition


A chi-square test is a statistical test that is used to compare
observed and expected results. The goal of this test is to
identify whether a disparity between actual and predicted
data is due to chance or to a link between the variables under
consideration. As a result, the chi-square test is an ideal
choice for aiding in our understanding and interpretation of
the connection between our two categorical variables.

A chi-square test or comparable nonparametric test is


required to test a hypothesis regarding the distribution of a
categorical variable. Categorical variables, which indicate
categories such as animals or countries, can be nominal or
ordinal. They cannot have a normal distribution since they can
only have a few particular values.

For example, a meal delivery firm in India wants to investigate


the link between gender, geography, and people’s food
preferences.

It is used to calculate the difference between two categorical


variables, which are:

As a result of chance or
Because of the relationship
Formula For Chi-Square Test
χ²c = Σ (O − E)² / E

Where

c = Degrees of freedom

O = Observed Value

E = Expected Value

The degrees of freedom in a statistical calculation represent


the number of variables that can vary in a calculation. The
degrees of freedom can be calculated to ensure that chi-
square tests are statistically valid. These tests are frequently
used to compare observed data with data that would be
expected to be obtained if a particular hypothesis were true.

The Observed values are those you gather yourselves.

The expected values are the frequencies expected, based on


the null hypothesis.

Fundamentals of Hypothesis Testing


Hypothesis testing is a technique for interpreting and drawing
inferences about a population based on sample data. It aids in
determining which sample data best support mutually
exclusive population claims.

Null Hypothesis (H0) - The Null Hypothesis is the assumption


that the event will not occur. A null hypothesis has no bearing
on the study’s outcome unless it is rejected.

H0 is the symbol for it, and it is pronounced H-naught.

Alternate Hypothesis(H1 or Ha) - The Alternate Hypothesis is


the logical opposite of the null hypothesis. The acceptance of
the alternative hypothesis follows the rejection of the null
hypothesis. H1 is the symbol for it.

What Are Categorical Variables?
Categorical variables belong to a subset of variables that can
be divided into discrete categories. Names or labels are the
most common categories. These variables are also known as
qualitative variables because they depict the variable’s quality
or characteristics.

Categorical variables can be divided into two categories:

Nominal Variable: A nominal variable’s categories have no


natural ordering. Example: Gender, Blood groups
Ordinal Variable: A variable that allows the categories to be
sorted is ordinal variables. Customer satisfaction (Excellent,
Very Good, Good, Average, Bad, and so on) is an example.
Why Do You Use the Chi-Square Test?
Chi-square is a statistical test that examines the differences
between categorical variables from a random sample in order
to determine whether the expected and observed results are
well-fitting.

Here are some of the uses of the Chi-Squared test:



The Chi-squared test can be used to see if your data follows a


well-known theoretical probability distribution like the Normal
or Poisson distribution.
The Chi-squared test allows you to assess your trained
regression model’s goodness of fit on the training, validation,
and test data sets.
What Does A Chi-Square Statistic Test Tell You?
A Chi-Square test (symbolically represented as χ²) is
fundamentally a data analysis based on the observations of a
random set of variables. It computes how a model equates to
actual observed data. A Chi-Square statistic test is calculated
based on the data, which must be raw, random, drawn from
independent variables, drawn from a wide-ranging sample
and mutually exclusive. In simple terms, two sets of statistical
data are compared -for instance, the results of tossing a fair
coin. Karl Pearson introduced this test in 1900 for categorical
data analysis and distribution. This test is also known as
‘Pearson’s Chi-Squared Test’.

Chi-Squared Tests are most commonly used in hypothesis


testing. A hypothesis is an assumption that any given
condition might be true, which can be tested afterwards. The
Chi-Square test estimates the size of inconsistency between
the expected results and the actual results when the size of
the sample and the number of variables in the relationship is
mentioned.

These tests use degrees of freedom to determine if a


particular null hypothesis can be rejected based on the total
number of observations made in the experiments. The larger the sample size, the more reliable the result.

There are two main types of Chi-Square tests namely -

Independence
Goodness-of-Fit
Independence
The Chi-Square Test of Independence is an inferential statistical test which examines whether two sets of variables are likely to be related to each other or not. This test is used when we have counts of values for two nominal or categorical variables and is considered a non-parametric test. A relatively large sample size and independence of observations are the required criteria for conducting this test.

For Example-

In a movie theatre, suppose we made a list of movie genres.


Let us consider this as the first variable. The second variable is
whether or not the people who came to watch those genres
of movies have bought snacks at the theatre. Here the null hypothesis is that the genre of the film and whether people bought snacks or not are unrelated. If this is true, the movie
genres don’t impact snack sales.


Goodness-Of-Fit
In statistical hypothesis testing, the Chi-Square Goodness-of-
Fit test determines whether a variable is likely to come from a
given distribution or not. We must have a set of data values
and the idea of the distribution of this data. We can use this
test when we have value counts for categorical variables. This
test demonstrates a way of deciding if the data values have a
“ good enough” fit for our idea or if it is a representative
sample data of the entire population.

For Example-

Suppose we have bags of balls with five different colours in


each bag. The given condition is that the bag should contain
an equal number of balls of each colour. The idea we would
like to test here is that the proportions of the five colours of
balls in each bag must be exact.

Who Uses Chi-Square Analysis?


Chi-square is most commonly used by researchers who are
studying survey response data because it applies to
categorical variables. Demography, consumer and marketing
research, political science, and economics are all examples of
this type of research.

Example
Let’s say you want to know if gender has anything to do with
political party preference. You poll 440 voters in a simple

random sample to find out which political party they prefer.


The results of the survey are shown in the table below:

[Table: observed counts of political party preference by gender]

To see if gender is linked to political party preference,


perform a Chi-Square test of independence using the steps
below.

Step 1: Define the Hypothesis


H0: There is no link between gender and political party
preference.

H1: There is a link between gender and political party


preference.

Step 2: Calculate the Expected Values


Now you will calculate the expected frequency.

Expected value = (row total × column total) / overall total

For example, the expected value for Male Republicans is:

[Calculation of the expected value for Male Republicans]

Similarly, you can calculate the expected value for each of the
cells.

[Table: expected counts for each cell]

Step 3: Calculate (O − E)² / E for Each Cell in the Table

Now you will calculate the (O − E)² / E value for each cell in the table.

Where

O = Observed Value

E = Expected Value

[Table: (O − E)² / E values for each cell]

Step 4: Calculate the Test Statistic χ²

χ² is the sum of all the values in the last table

= 0.743 + 2.05 + 2.33 + 3.33 + 0.384 + 1

= 9.837

Before you can conclude, you must first determine the critical
statistic, which requires determining our degrees of freedom.
The degrees of freedom in this case are equal to the table’s
number of columns minus one multiplied by the table’s
number of rows minus one, or (r-1) (c-1). We have (3-1)(2-1) =
2.

Finally, you compare the obtained statistic to the critical statistic found in the chi-square table. For an alpha level of 0.05 and two degrees of freedom, the critical statistic is 5.991, which is less than the obtained statistic of 9.837. You can reject the null hypothesis because the obtained statistic is higher than the critical statistic.

This means you have sufficient evidence to say that there is an


association between gender and political party preference.
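
The same comparison can be reproduced with SciPy. The χ² value 9.837 and the 2 degrees of freedom are the ones computed above; scipy.stats.chi2.ppf supplies the table value, and scipy.stats.chi2_contingency could instead be run directly on the observed table.

# Chi-square test of independence: comparing the statistic computed above
# (9.837, df = 2) with the table value from SciPy.
from scipy.stats import chi2

chi_sq, df, alpha = 9.837, 2, 0.05
critical = chi2.ppf(1 - alpha, df)      # ~5.991
p_value = chi2.sf(chi_sq, df)           # ~0.007
print(f"critical value = {critical:.3f}, p-value = {p_value:.4f}")
print("reject H0:", chi_sq > critical)  # True -> gender and party preference are associated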


When to Use a Chi-Square Test?



A Chi-Square Test is used to examine whether the observed


results are in order with the expected values. When the data
to be analysed is from a random sample, and when the
variable in question is a categorical variable, then the Chi-Square test is the most appropriate one. A
categorical variable consists of selections such as breeds of
dogs, types of cars, genres of movies, educational attainment,
male v/s female etc. Survey responses and questionnaires are
the primary sources of these types of data. The Chi-square
test is most commonly used for analysing this kind of data.
This type of analysis is helpful for researchers who are
studying survey response data. The research can range from
customer and marketing research to political sciences and
economics.

Chi-square distributions (X2) are a type of continuous


probability distribution. They’re commonly utilized in
hypothesis testing, such as the chi-square goodness of fit and
independence tests. The parameter k, which represents the
degrees of freedom, determines the shape of a chi-square
distribution.

A chi-square distribution is followed by very few real-world


observations. The objective of chi-square distributions is to
test hypotheses, not to describe real-world distributions. In
contrast, most other commonly used distributions, such as
normal and Poisson distributions, may explain important
things like baby birth weights or illness cases per year.

Because of its close resemblance to the conventional normal


distribution, chi-square distributions are excellent for

hypothesis testing. Many essential statistical tests rely on the


conventional normal distribution.

In statistical analysis, the Chi-Square distribution is used in many hypothesis tests and is determined by the parameter k, the degrees of freedom. It belongs to the family of continuous probability distributions. The sum of the squares of k independent standard normal random variables follows the Chi-Squared distribution. Pearson’s Chi-Square Test formula is:

χ² = Σ (O − E)² / E

Where χ² is the Chi-Square test statistic

Σ is the summation of observations

O is the observed results

E is the expected results

The shape of the distribution graph changes with the increase


in the value of k, i.e. degree of freedoms.

When k is 1 or 2, the Chi-square distribution curve is shaped


like a backwards ‘J’. It means there is a high chance that X^2
becomes close to zero.
When k is greater than 2, the shape of the distribution curve looks like a hump and has a low probability that χ² is very near to 0 or very far from 0. The distribution extends much longer on the right-hand side and shorter on the left-hand side. The most probable value of χ² is (k − 2).

When k is greater than ninety, a normal distribution is seen,


approximating the Chi-square distribution.

CONCEPT OF BUSINESS ANALYTICS

Meaning of Business Analytics:

What is Business Analytics?
From manual effort to machines, there has been no looking
back for humans. In came the digital age and out went the last
iota of doubt anyone had regarding the future of mankind.
Business Analytics, Machine Learning, AI, Deep Learning,
Robotics, and Cloud have revolutionized the way we look,

absorb, and process information. While there are still ongoing


developments happening in several of these advanced fields,
business analytics has gained the status of being all-pervasive
across functions and domains. There is no aspect of our lives
untouched by Analytics. The mammoth wings of analytics are
determining how we buy our toothpaste to how we choose
dating partners to how we lead our lives. Read on to know
what is Business Analytics.

Moving on to a more technical definition of business analytics,


Gartner says, “Business analytics is comprised of solutions
used to build analysis models and simulations to create
scenarios, understand realities and predict future states.
Business analytics includes data mining, predictive analytics,
applied analytics and statistics, and is delivered as an
application suitable for a business user. These analytics
solutions often come with prebuilt industry content that is
targeted at an industry business process (for example, claims,
underwriting or a specific regulatory requirement).”

Business Analytics is interchangeably used with data analytics.


The only difference is that while data analytics is the birth
child of the data boom, business analytics represents a
coming of age that centers data insights at the heart of
business transactions. Nearly 90% of all small, mid-size, and
large organizations have set up analytical capabilities over the
last 5 years in an effort to stay relevant in the market and
draw value out of the insights that large volumes of data
recorded in the digital age can provide. Now that you know
the definition of business analytics, let us take a look at a

more comprehensive understanding of the business analytics


process.

Professionals, on the other hand, are also in a rush to bag


analytical roles for career success. So, what does it mean for
aspiring analytics professionals?

Nearly every domain has seen an up-rise in the number of


opportunities in the analytics segment but there is still a huge
supply-demand gap that exists when filling these positions.
This is because of the lack of relevant quality education in
graduation (that still continues to teach its archaic curriculum)
and also a lack of enthusiasm to upskill especially in the more
seasoned professionals with more than 5 years of experience.
Slowly, this trend is changing with freshers taking up business
analytics and data science courses before entering the
workforce and the seasoned professionals taking cognizance
of the fact that they may render themselves jobless without
upskilling to the skillset required in the digital economy. To
make a switch:

Find opportunities within your own firm to move – Every mid


to large organization is establishing its analytical capabilities
and there are ample opportunities out there for people to
switch. If you have experience with reporting or analysis or
statistics or advanced excel, chances are that your leaders will
be open to you moving on to a more complex role. In the
beginning, you may have to juggle between your regular work
and new analytical initiatives, but this is one of the easiest
ways to get started.

Take up an Analytics Course – Learning things scientifically in a


structured format helps you scale faster. Several options are
available when it comes to Analytics courses right from
MOOCs, weekend programs, hybrid courses (classroom +
online) to full-time programs. While traditional full-time
programs tend to promise the best results, hybrid courses and
MOOCs are more suited to the learning needs of working
professionals.
Get Hands-On Experience – A certificate or merely stating on
your CV that you know analytics tools and techniques will not
help you get through job interviews. What you need are ready
projects on your resume to make an impression. Participating
in online hackathons, free projects with public data, or solving
analytics challenges by Kaggle or Analytics Vidhya will go a
long way in giving you the confidence to make this switch.

Do you know what the main components of a Business


Analytics dashboard are? Let us take a look at them.

Data Aggregation: Before you start the process of analysis,


you are required to gather, organise and filter data through
transactional records
Data Mining: The process of data mining refers to sorting
through a large volume of datasets using statistics and machine learning. This helps in identifying trends and establishing
relationships

Association and Sequence Identification: We must then


identify actions that are performed in relation to other actions
or in a particular sequence
Text Mining: This allows us to explore a large volume of
unstructured text datasets. This is done for qualitative and
quantitative analysis of the data
Forecasting: Forecasting is done in order to analyze historical
data. This data can be from a specific time period. It allows us
to make informed estimates and determine future behaviour
Predictive Analytics: This allows us to use various statistical
tools and techniques to create a predictive model. This model
extracts information from different datasets and provides
information regarding patterns
Optimization: After all the trends have been identified, and
once all the predictions have been made, businesses must
engage in simulation techniques that allow us to test the best-
case scenarios
Data Visualization: provides visual representations in the form
of charts or graphs. This ensures quick data analysis
Difference Between Business Analytics and Business
Intelligence
Another million dollar question is differentiating business
analytics from business intelligence as BI is often used in place
of BA. In the industry, however, there is only a marginal
difference in the way these terms are defined.

While BI is more about using data collected over a period of


time from different sources to create dashboards, reports,
and documentation; BA focuses on the implementation of
data insights into actionable steps. Hence, BI has a longer loop
with BA functions integrated at mostly all steps. As

Dr.Bappaditya (Course Director – PGPBABI, Great Lakes)


points out, “I believe Business Analytics (BA) and Business
Intelligence (BI) are very overlapping so there isn’t a lot of
difference. Business intelligence is slightly more generic
where you are using data from various sources and then you
are analyzing it. That’s also a part of BA but in BI you finally
finish the loop by implementing the insights generated from
data. So in that sense, BI is slightly all-encompassing.

Essentially, whatever be the nomenclature – whether it is BI


or BA, we are actually trying to figure out how to capture
data, how to make it neat and how to make it talk to each
other. We look at unstructured, unusual data to generate
insights doing prediction analysis and finally implement it. So
whether it is BA or BI, both of them cover the same thing. I
don’t think that there is much of a difference whether you
take up a Business Intelligence course or a Business Analytics
course, because any program (BA or BI) the coursework listed
under either will pretty much be the same thing.”

Let us take a look at differentiating and common factors


between Business Intelligence and Analytics.

Business Intelligence vs Business Analytics:

BI is more about using data collected over a period of time from different sources to create dashboards, reports, and documentation; BA focuses on the implementation of data insights into actionable steps.
Business intelligence focuses on descriptive analytics; business analytics focuses on predictive analytics.
Business intelligence focuses on diagnostic analytics; business analytics focuses on prescriptive analytics.
Both BI and BA focus on presenting and organising data for visualization.
BI deals with what happened; BA deals with the why’s of what happened.
4 Types of Business Analytics
There are mainly four types of Business Analytics, each increasingly complex. They bring us closer to applying real-time and forward-looking insights. Each of these types is discussed below.

Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
1. Descriptive Analytics
It summarizes an organisation’s existing data to understand
what has happened in the past or is happening currently.
Descriptive Analytics is the simplest form of analytics as it
employs data aggregation and mining techniques. It makes
data more accessible to members of an organisation, such as investors, shareholders, marketing executives, and sales managers.

It can help identify strengths and weaknesses and also provides an insight into customer behaviour. This helps in forming strategies for targeted marketing.
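A minimal sketch of descriptive analytics in Python follows, assuming a small invented transactions table: the data is simply aggregated and summarised to describe what has already happened.

import pandas as pd

# Hypothetical transaction data, for illustration only.
df = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "East"],
    "revenue": [250, 300, 180, 220, 270],
    "units":   [25, 28, 20, 22, 26],
})

# Data aggregation: summarise what has happened, per region and overall.
per_region = df.groupby("region")[["revenue", "units"]].sum()
print(per_region)
print(df["revenue"].describe())   # overall picture of past revenue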
2. Diagnostic Analytics
This type of Analytics helps shift focus from past performance to current events and helps determine which factors are influencing trends. To uncover the root cause of events, techniques such as data discovery, data mining, and drill-down are employed. Diagnostic analytics makes use of probabilities and likelihoods to understand why events may occur. Techniques such as sensitivity analysis and training algorithms are employed for classification and regression.
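As a hedged illustration of the drill-down idea, the sketch below uses invented data: it first shows that total revenue fell in February, then drills down by region to locate where the drop originated.

import pandas as pd

# Invented revenue figures by month and region, used only to illustrate drill-down.
df = pd.DataFrame({
    "month":   ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
    "region":  ["North", "South", "East", "North", "South", "East"],
    "revenue": [300, 220, 270, 310, 140, 265],
})

# The 'what happened': February revenue is lower than January revenue.
print(df.groupby("month")["revenue"].sum())

# Drill-down: break the change out by region to locate the root cause.
by_region = df.pivot_table(index="region", columns="month", values="revenue")
by_region["change"] = by_region["Feb"] - by_region["Jan"]
print(by_region.sort_values("change"))   # the South region shows the largest decline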
3. Predictive Analytics
This type of Analytics is used to forecast the possibility of a future event with the help of statistical models and ML techniques. It builds on the results of descriptive analytics to devise models that extrapolate the likelihood of future outcomes. Machine Learning experts are employed to run predictive analysis, and they can achieve a higher level of accuracy than business intelligence alone.
One of the most common applications is sentiment analysis. Here, existing data collected from social media is used to provide a comprehensive picture of a user's opinion, and this data is analysed to predict their sentiment (positive, neutral, or negative).
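A minimal sketch of the sentiment-analysis idea is shown below, assuming scikit-learn is available. The tiny labelled dataset is invented purely for illustration; a real model would need far more data and careful evaluation.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented, tiny labelled sample of social-media style comments.
texts = ["love this product", "great service", "terrible experience",
         "worst purchase ever", "really happy with it", "not good at all"]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

# Bag-of-words features feeding a simple probabilistic classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Predict the sentiment of new, unseen comments.
print(model.predict(["happy with the service", "worst product"]))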
4. Prescriptive Analytics
Going a step beyond predictive analytics, it provides
recommendations for the next best action to be taken. It
suggests all favorable outcomes according to a specific course
of action and also recommends the specific actions needed to
deliver the most desired result. It mainly relies on two things: a strong feedback system and constant iterative analysis. It learns the relation between actions and their outcomes. One common use of this type of analytics is to create recommendation systems.
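A hedged, minimal sketch of the prescriptive step in Python: given predicted outcomes for a few candidate actions (the action names and profit figures here are invented), it simply recommends the action with the best expected result.

# Hypothetical predicted outcomes (expected profit) per candidate action.
predicted_profit = {
    "10% discount":         12_500,
    "free shipping":        14_200,
    "loyalty bonus points": 11_800,
}

# Prescriptive step: recommend the action expected to deliver the best result.
best_action = max(predicted_profit, key=predicted_profit.get)
print("Recommended action:", best_action)
print("Expected profit:", predicted_profit[best_action])

In practice the predicted outcomes would come from the predictive models described above, and the recommendation would be refined through the feedback loop as real results arrive.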
Busting Business Analytics Myths
You Need Programming to Learn Business Analytics – A
professional doesn’t require programming experience to learn
business analytics as most of the tools and techniques are
easy to use. Tableau, a data visualization tool, even has drag
and drop elements that make it really easy for anyone to get
started. That’s why Business Analytics has found such
ubiquitous application in all domains and professionals from
vastly different industries like BFSI, Marketing, Agriculture,
Healthcare, Genomics, etc. find it to be a great career option
and a natural progression in their careers. A good knowledge
of statistics will need to be developed though.
You Need Advanced Maths for Business Analytics – Business analytics is based on the use of common human intelligence that can be applied to solve almost any industry problem. Hence, you don't need Fourier series or advanced mathematical algorithms to build analytical models. Math learned till the 10+2 level is good enough and can serve as a starting base for professionals in all domains. Also, a lot of professionals unknowingly apply math in their day-to-day work with Excel and data interpretation, so math beyond the basic level is not a mandate for learning the principles of business analytics.
Learning BA Tools And Techniques is Enough for Becoming a
Good Business Analyst – While learning Python, R, or SAS can
get you through the door for an interview, it would be fairly
hard for anyone to excel in their role without two things – one
is domain knowledge and the second would be a business
sense. To understand a client’s (or your own internal)
requirement, to formulate a problem statement that needs to
be processed, a business analytics professional has a lot under
his/her purview. Simply put, learning tools and techniques is
only a small piece of the larger picture to get you started in
your role. Dr. Bappaditya (Director, PG Program Business
Analytics, Great Lakes) explains this with an example: “A
million records of a customer for credit cards are processed to
figure out bad customers from good ones. While a data
scientist will crunch that data to find insights, a Business
Analytics professional will put a decision rule to it. A business
analyst will look at all this data and come to the simple rule
that customer is good if his credit score is above this (let’s say
95%) or his income is above that and the number of
dependents is this. Otherwise, a customer is bad.
So, Business Analytics is much more applied and with a very specific objective in mind.” A minimal illustrative sketch of such a decision rule is given below.
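The decision rule described in the example can be expressed in a few lines of Python. The thresholds below are purely hypothetical placeholders, not figures taken from the example.

# Hypothetical decision rule for illustration; the thresholds are invented placeholders.
def classify_customer(credit_score, income, dependents):
    """Label a credit-card applicant as 'good' or 'bad' using a simple business rule."""
    if credit_score > 750 or (income > 500_000 and dependents <= 2):
        return "good"
    return "bad"

print(classify_customer(credit_score=780, income=400_000, dependents=3))  # good
print(classify_customer(credit_score=600, income=300_000, dependents=4))  # bad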
A Business Analytics Profile Is All About Crunching Numbers –
Number crunching or in technical terms – cleaning of data,
slicing and dicing of data, converting an unstructured data set
into a structured one is all part of the process. However, the
profile of a business analytics professional is not limited by
these functions. The essence of true business analytics lies in resolving business problems by combining domain knowledge, client interactions, business sense, and basic human intelligence, apart from the number crunching.
Business Analytics Tools
Business Analytics tools help analysts to perform the tasks at
hand and generate reports which may be easy for a layman to
understand. These tools can be obtained from open source
platforms, and enable business analysts to manage their
insights in a comprehensive manner. They tend to be flexible
and user-friendly. Knowledge of various business analytics tools and techniques like Python, R, SAS, and Tableau, along with statistical concepts and the building of analytical models, is required to be able to apply for business analytics roles.
A working knowledge of business analytics and business
intelligence tools is a key differentiator for professionals
competing for business analytics jobs. The relevance of data
analysis tools is determined by the project and client
requirements. Python, R, SAS, Excel, and Tableau have all got
their unique places when it comes to usage. According to the
Great Learning Skills Report 2018, SQL is the top requirement
to excel in the field of data science, followed closely by Python
and Java. Hadoop, R, and SAS have also climbed up the ladder
to be among the top 10 skills required as per data from
Indeed.com. Let’s take a closer look at some of these business
analytics tools:
Python
SAS
R
Tableau
Python – Python has a very regular syntax and stands out for its general-purpose characteristics. It has a relatively gradual, low learning curve because it focuses on simplicity and readability. Python is very flexible and can also be used in web scripting. It is mainly applied when the analysed data needs to be integrated with a web application or when the statistics are to be used in a production database. The IPython Notebook makes it easy to work with Python and data; one can share notebooks with other people without requiring them to install anything, which reduces code-organising overhead and allows one to focus on more useful work. Python offers several libraries for visualization, such as Bokeh, Pygal, and Seaborn, which may, in turn, be too many to pick from. And unlike R, its visualizations can be convoluted and less attractive to look at.
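As a small, hedged example of the visualization libraries mentioned above, the snippet below uses Seaborn on an invented dataset; it is only a sketch of how quickly a basic chart can be produced.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented monthly revenue figures, used only to demonstrate a quick chart.
df = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr", "May"],
    "revenue": [120, 135, 128, 150, 162],
})

sns.barplot(data=df, x="month", y="revenue")
plt.title("Monthly revenue (illustrative data)")
plt.show()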
SAS – SAS is widely used in most private organizations as it is a
commercial software, which also ensures that it has a whole lot of online resources available. Also, those who already know SQL might find it easier to adapt to SAS as it comes with the PROC SQL option. The tool has a user-friendly GUI and can churn through terabytes of data with ease. It comes with extensive documentation and a tutorial base which can help early learners get started seamlessly. SAS has two
disadvantages: Base SAS is struggling hard to catch up with
the advancements in graphics and visualization in data
analytics. Even the graphics packages available in SAS are
poorly documented which makes them difficult to use. Also,
SAS has just begun work on adding deep learning support
while its competition is far ahead in the race.
R – R is open-source software and is completely free to use, making it easier for individual professionals or students starting out to learn. While several forums and online communities post religiously about its usage, R can have a very steep learning curve, as you need to learn to code at the
root level. Graphical capabilities or data visualization is the
strongest forte of R with R having access to packages like
GGPlot, RGIS, Lattice, and GGVIS among others which provide
superior graphical competency. R is gaining momentum as it
has added a few deep learning capabilities. One can use the kerasR and keras packages in R, which are interfaces to the original Keras library built on Python.
Tableau – Tableau is the most popular and advanced data
visualization tool in the market. Story-telling and presenting
data insights in a comprehensive way has become one of the
trademarks of a competent business analyst. It offers a free
public version but a paid version is required for those who
would like to keep their reports and data confidential. Tableau
is a great platform to develop customized visualizations in no time, thanks to its drag and drop features. Tableau can be easily integrated with most analytical languages and data sources, and the visualizations created are platform and screen-size independent. The downside of Tableau is that it comes at a cost, especially for large enterprises, and there are no version-control options yet.
Applications of Business Analytics:
Business analytics can be used across all verticals of a business. It can be applied to finance, marketing, human resource management, CRM, and even manufacturing.
Finance: Business analytics is a boon for the financial sector.
Various divisions like investment banking, financial planning,
portfolio management, budgeting, and forecasting can benefit
greatly by implementing business analytics. The business
analysts use data mining tools and statistics on the available financial data: for a new product they observe the trends of similar products, or they observe the existing product to determine future actions.
Marketing: Business analytics plays a pivotal role in determining a successful product marketing strategy. Right from conducting market research for the product in demand to creating a future road map for the product lifecycle, business analysts are involved through all the stages of the product. They start by analyzing consumer behaviour, buying patterns, and trends to determine the target customer. They are
also involved in other areas like determining advertising
techniques, identifying loyal customers, and adding new
features to the product when required.
HR professionals: HR professionals can use business analytics to identify potential candidates for the jobs available. They can use the widely available data and analyze the same to identify the right candidate based on their qualification and experience. The analytics can also be used to determine other factors like expected salary and retention rate.
CRM: CRM teams can use business analytics to determine key performance indicators like the socio-economic status of customers, purchasing patterns, lifestyles, etc., to develop a bond with the customer.
Manufacturing: Different areas of manufacturing like supply
chain management, inventory management, and risk
management use business analytics to improve efficacy based
on the product data.
2. Examples of Business Analytics
Let us look at a couple of real-life examples of business
analytics.
OTT platforms: Many over-the-top (OTT) platforms like Netflix, Prime, and Hotstar use the movies and series you have watched to suggest the right movies and series for you.
Amazon: Amazon is one of the biggest e-tailers in the world, and one of the reasons is business analytics. For example, when you search for a product like a shirt, the website will suggest similar products and also accessories for the shirt.
The following are examples of applications of business analytics in various sectors:
Example of application of business analytics in finance: A stockbroker can advise about a particular stock by taking into account its historical chart, its current performance, and its future outlook. Business analysts can
understand the trend of the stock, predict its results, and
advise the investor accordingly.
Example of application of business analytics in marketing: While developing a new software product, business analysts can use data mining to collect data on similar products and analyze the same to identify the demand for the product, identify the target customer, and even help in advertising in the right media.
Example of application of business analytics in human resource management: An HR manager can use the previous employment data of an existing employee to determine the retention rate for the employee.
Example of application of business analytics in CRM: An online retail website can understand the purchase pattern of a recurring customer and determine targeted discounts and other advertising strategies for that customer.
Example of application of business analytics in manufacturing: The historical data of 10-year-old machinery that has been decommissioned will help determine the depreciation value and the expected useful life, in years, of similar new machinery.
Example of application of business analytics in credit card companies: Credit card companies identify potential customers by referring to important financial data of the customer, such as credit score and spending patterns, before offering credit cards.
3. Uses of Business Analytics
As data keeps piling up by the second, more and more organizations rely on business analytics to boost their business operations. If used in the right way, business analytics proves to be a powerful tool to build and grow a business. Typically, organizations use business analytics through different phases of a product or service lifecycle.
In the initial stage, the tools are used to conduct market research to collect quantitative and qualitative data on a product.
Many statistical tools are used at the next stage to analyze the
collected market metrics. The analyst synthesizes the research
data to define the new product and to determine its features.
He also uses advanced analytics and statistical tools to
determine hidden patterns and trends.
At the next stage, the analyst disseminates information to relevant stakeholders through interactive tools like dashboards and reports.
They provide all the required research and analysis to determine the future road map of the product and help in developing a winning product strategy.
To maintain a healthy business, the analysts provide data about changing trends and help the organization keep up with trends that can shift at the drop of a hat.