
Basic Econometric Tools and Techniques in Data Analytics

V Chandrasekar, Ramadas Sendhil, V Geethalakshmi, A Suresh, Nikita Gopal
Editors

ICAR-CENTRAL INSTITUTE OF FISHERIES TECHNOLOGY


Editors

V Chandrasekar
Senior Scientist, Agricultural Economics
ICAR-Central Institute of Fisheries
Technology
Cochin, Kerala, India

Ramadas Sendhil
Associate Professor
Department of Economics
Pondicherry University
Puducherry, India

V Geethalakshmi
Principal Scientist
ICAR-Central Institute of Fisheries
Technology
Cochin, Kerala, India

A Suresh
Principal Scientist, Agricultural Economics
ICAR-Central Institute of Fisheries
Technology
Cochin, Kerala, India

Nikita Gopal
Principal Scientist and Head of the Extension
Information & Statistics Division
ICAR-Central Institute of Fisheries
Technology
Cochin, Kerala, India

January 29, 2024


ISBN 978-81-965133-8-2

The edited volume has been published with financial support from the ICAR-Central
Institute of Fisheries Technology (CIFT), Cochin, Kerala, India. The use of general
descriptive names, registered names, trademarks, or service marks in this publication does
not imply, even in the absence of a specific statement, that these names are exempt from
applicable protective laws and regulations, nor are they free for general use. The publisher,
authors, and editors have made every effort to ensure that the advice and information
presented in this book are accurate and reliable as of the publication date. However, the publisher and editors give no warranties regarding the content and accept no responsibility for any potential errors or omissions. Additionally, the ICAR-CIFT remains neutral concerning
jurisdictional claims in published maps, illustrations, and institutional affiliations.
PREFACE
The landscape of data analytics is rapidly changing, and it is at the intersection of
econometrics and advanced statistical techniques that valuable information can be gained
from complex datasets. The need for robust econometric tools has been immense as
academicians, researchers, and data analysts learn how to deal with the ever-increasing
amount of data available to them. This edited volume, “Basic Econometric Tools and
Techniques in Data Analytics,” an outcome of the 5-day Scheduled Caste Sub-Plan
Program (SCSP) training of ICAR-Central Institute of Fisheries Technology in
collaboration with the Department of Economics, School of Management, Pondicherry
University (A Central University) aims at bridging the gap between theoretical econometric
concepts and their practical use in data analysis.

Economic theory, mathematics, and statistical methods are combined in econometrics thus
making it a powerful framework for modeling relationships within datasets. For both
beginners and experts alike, this book offers an extensive compendium of chapters
discussing fundamental econometric tools necessary to draw meaningful conclusions from
various datasets. Written by field experts, all the chapters in this volume have varied
perspectives on basic econometric tools to form one solid piece of academic writing. Topics
covered include mastering essential software for econometric analysis like R, linear
regression, and hypothesis testing, as well as advanced techniques like time-series
forecasting and panel regression models. The book also underscores the significance of
having strong background knowledge in these subjects, even as it centers on practical applications in econometrics. Each chapter is crafted to explain concepts lucidly and to provide examples and learning exercises that enable readers to consolidate what they have learned and put it into practice.

The target audience for this book includes students, researchers, and professionals from
diverse fields such as economics, commerce, finance, business, and social sciences, among
others, who wish to employ econometrics for evidence-based decision-making. Whether
used as a college text or a procedural manual for experts, this book seeks to arm users with
the requisite information and skill set for basic data analysis using econometric tools.

We would like to thank all the contributors for their insights that made this compilation a
priceless source of material. We trust that "Basic Econometric Tools and Techniques in
Data Analytics" will become an indispensable companion for those individuals who want to
understand the dynamism of data analysis through the eyes of econometrics.

Editors
CONTENTS

1. Role of Econometric Tools and Techniques in Data Analysis  1
   Jyothimol Joseph, Keshav Soni, Ramadas Sendhil and V Chandrasekar

2. An Introduction to R and R Studio  13
   J. Jayasankar, Fahima M.A and Megha K.J

3. Regression Analysis: Simple and Multiple Regression Using R  37
   V Chandrasekar, Ramadas Sendhil and Geethalakshmi V

4. Diagnostic Tests in Regression Analysis  50
   Amaresh Samantaraya

5. Data Mining and Computation Software for Social Sciences  61
   V. Geethalakshmi and V Chandrasekar

6. Introduction to Indices and Performance Evaluation  78
   J. Charles Jeeva and R. Narayana Kumar

7. Fundamentals of Panel Data Analysis  89
   Umanath Malaiarasan

8. Estimation of Total Factor Productivity by Using Malmquist Total Factor Productivity Approach: Case of Rice in India  125
   A. Suresh

9. Forecasting Methods – An Overview  139
   Ramadas Sendhil, V Chandrasekar, L Lian Muan Sang, Jyothimol Joseph and Akhilraj M

10. Emerging Trends and Technology for Data-Driven Market Research  161
    R. Narayan Kumar

11. Data Visualization for Data Science  178
    Chandrasekar V, Ramadas Sendhil and V. Geethalakshmi
EDITORS PROFILE

V. Chandrasekar, Senior Scientist in Agricultural Economics, started his career at the ICAR-Central Potato Research Institute, Shimla, and later moved to the ICAR-Central Institute of Fisheries Technology, Cochin, Kerala. With almost fourteen years of service, including 3 years at the Veraval Research Centre, Gujarat, he has conducted important studies on fishermen's livelihoods and technology impact.
Before joining the Agricultural Research Service, he
forecasted agricultural commodity markets for
Puducherry. Dr. Chandrasekar has worked on numerous
research projects and leads the Impact Assessment of
ICAR-CIFT Technologies. He has published 25 research
papers and obtained a copyright for independently
developing two mobile apps for ICAR-CIFT. Furthermore,
he worked as a manager at the Agriculture Technology
Information Centre for five years. He organized outreach
programs in Tamil Nadu, actively participating in national
and international training programs held at ICAR-CIFT.

Ramadas Sendhil presently serves as an Associate Professor in the Department of Economics, School of
Management, Pondicherry University, and is a former
employee of the Indian Council of Agricultural Research
(ICAR). He has been with the ICAR-Indian Institute of
Wheat and Barley Research (IIWBR), Karnal, between 2011
and 2022 (January), and also associated with the ICAR-
National Dairy Research Institute, Karnal, between 2014
and 2022 (January) in teaching & mentoring PG and Ph.D.
scholars. He is an agricultural economist with around 15
years of professional experience, including around 10.5
years in wheat and barley research on production,
marketing & policy prescription, and around eight years of
experience in teaching and mentorship. Sendhil was a University Gold Medalist during his Post-Graduation at Sri Venkateswara Agricultural College, Tirupati, and earned his Doctorate in Agricultural Economics from the ICAR-Indian
Agricultural Research Institute (IARI), New Delhi. He has
published 126 research papers in peer-reviewed national
and international journals of high repute, 10
edited/authored books, and presented his research
proposals/findings at various events held in the USA,
Canada, Italy, Australia, South Korea, Japan, and Ghana. He
is an Associate Editor of the Agricultural Economics
Research Review, an Editorial Board member of the Indian
Journal of Economics and Development, a peer-reviewer
of several SCOPUS-indexed journals, and a Guest Editor of
Frontiers in Sustainable Food Systems. He has been
honored with several recognitions, including the Elected
Fellow of the National Academy of Agricultural Sciences
(NAAS), Lal Bahadur Shastri Outstanding Young Scientist
(ICAR), Fellow of the Society for Advancement of Wheat
and Barley Research (SAWBAR), Dr.RT Joshi award from
the Agricultural Economics Research Association (AERA-
India), Young Agricultural Economist Award (AERA), Best
Worker (IIWBR), Prof. Mahatim Singh Memorial Award
(SAWBAR), Uma Lele Mentorship Award (AAEA, USA), NFP
grant, LI-LMI AAEA Award, IARI Fellowship and ICAR-JRF.
Hitherto, he has completed 9 research projects and is
executing 2 projects. In addition, he has rich administrative
and outreach experience wherein he has organized 11
capacity-building programs (training, seminar, conference,
workshop, etc.) and delivered around 40 invited talks. He
has taught and guided M.Sc. (4 completed) & Ph.D. (1
completed) Scholars at the ICAR-National Dairy Research
Institute, Karnal, for about 8 years and is presently guiding
4 Ph.D. Scholars at Pondicherry University. He’s interested
in teaching agricultural economics and executing
transdisciplinary and multi-institutional research to foster
innovations and policy formulation that lead to
agricultural transformation. His research interests include
Agricultural Economics, Time Series Analysis, Value Chain
Analysis, Food Policy, Market Outlook, and Climate
Change.

V Geethalakshmi, Principal Scientist, joined the ICAR-Central Sheep and Wool Research Institute, Avikanagar, Rajasthan, after completing her M.Sc. and Ph.D. in Agricultural Statistics
from PG School, Indian Agricultural Research Institute,
New Delhi, a prestigious research and academic
organization dedicated to Agricultural research in the
country. Since 2003, she has been working at ICAR Central
Institute of Fisheries Technology, Cochin, a premier
research organization in developing harvest and post-
harvest fisheries technologies. She has 45 research
publications to her credit and has guided seven students
during her career. She has worked on interdisciplinary
projects, covering areas like statistical modeling for
fisheries evaluation, value chain in fisheries, and
assessment of harvest and post-harvest losses. As Nodal
officer of the Swachhta Action Plan, she is working with
stakeholders in the fisheries sector and propagating the
conversion of ‘Waste to wealth’ by implementing units for
converting fish waste into manure and aquafeed.

A. Suresh is a Principal Scientist of Agricultural Economics at the Central Institute of Fisheries Technology (CIFT),
Kochi, Kerala, a research organization under the ambit of
the Indian Council of Agricultural Research (ICAR). He has
broad-ranging experience in issues of crop, livestock, and
fisheries sectors. Before joining CIFT, he served as a faculty
in the Division of Agricultural Economics at the Indian
Agricultural Research Institute, New Delhi; the National
Institute for Agricultural Economics and Policy Research
(NCAP), New Delhi; and Central Sheep and Wool Research
Institute, Rajasthan. He has taught natural resource
management and environmental economics, as well as
green economics and economic development courses at
IARI. His area of specialization is Agricultural Development
and Natural Resource Economics.

Nikita Gopal is a Principal Scientist and Head of the Extension Information & Statistics Division at the Indian
Council of Agricultural Research-Central Institute of
Fisheries Technology (ICAR-CIFT), Kochi, Kerala, India. In
her research career of 25 years, she has worked on projects
related to women in seafood processing, small-scale
aquaculture and fisheries, seaweed farming, dried fish
production, and small-scale fish vending.
Chapter 1
Role of Econometric Tools and Techniques
in Data Analysis
Jyothimol Joseph1, Keshav Soni1, Ramadas Sendhil1 and V Chandrasekar2
1 Department of Economics, Pondicherry University (A Central University),

Puducherry, India.
2 ICAR-Central Institute of Fisheries Technology, Cochin, India

Introduction

Economics, a fascinating discipline, draws on foundational aspects from various fields like
mathematics and statistics. The applicability of economic theories is judged and tested over
time by several visionary researchers to deduce the relationship between economic agents
and economic activities. Economics is subdivided into various disciplines, each focusing on
specific aspects of life. Microeconomics explores individual decision-making, macroeconomics
delves into aggregate decision-making, development economics examines the impact of
economic decisions on sustenance and well-being, and environmental economics scrutinizes
the relationship between mankind and nature, to cite a few branches. In a way, every sub-discipline deals with agents' economic decisions and the spillover effects of these decisions on their own lives and on those around them.

Why Econometrics?

A common misconception is that econometrics is merely the combined application of mathematics and statistics to economic theory. Economic theories are mere conjectures or hypotheses based on certain qualitative observations. For example, microeconomics deduces the inverse relationship between the demand for and price of a product, and macroeconomics, the inverse relationship between inflation and unemployment. All these
are simple theories or conjectures until proven, as theories themselves don’t provide the
magnitude of these relationships. To test these theories, researchers must measure the
economic variables quantitatively, necessitating precise tools to estimate their relationship.
While theoretical economics suggests the relationship between two or more variables,
econometrics gives empirical content to these economic theories. To assess the relationship
between two or more variables, we must be able to express them as mathematical equations. Mathematical economics facilitates this process, ensuring logical consistency in economic theories; modelling is thus done with the help of mathematical economics.
Mathematical models are deterministic since there is no scope for variability. The closest alternative to the mathematical method is statistics, but statistics is concerned only with collecting, processing, and presenting economic data. Since mathematical economics and statistics are just supplementary arms of economic theory, a distinct discipline is needed for quantitatively measuring economic phenomena and decisions. Economic decisions are
subject to variability due to individual differences and diverse situational contexts. These
variabilities are called errors in econometrics. So, econometrics deals with empirical evidence
of economic theories based on precise tools and sophisticated techniques that follow
mathematical and statistical principles of unbiasedness, efficiency, and consistency. “The
method of econometric research aims, essentially, at a conjunction of economic theory and
actual measurements, using the theory and technique of statistical inference as a bridge pier”
(Haavelmo, 1944).

Application of Econometric Tools

Econometric tools have wide-ranging applicability in real life across various fields and
industries. Nowadays, policies aren’t based on trial-and-error methods but on hardcore
econometric models to observe the expected impact beforehand. Based on these findings,
policies are customized and implemented to achieve efficiency and distributive equity.
Government and central banks use econometric models to evaluate the repercussions of fiscal
and monetary policies on key macroeconomic indicators such as inflation, per capita
disposable income, unemployment rate, GDP growth, and money supply. These econometric
tools are also used in scrutinizing and fending off the impacts of exogenous variables on
endogenous variables. For example, during the subprime crisis of 2007-08, central banks'
resilient policies for the money market helped stabilize the Indian economy. These
econometric tools also help in choosing the right policy. For example, econometric models are
instrumental in determining whether demand-side policies or supply-side policies are the
appropriate choice for market correction.
These econometric tools are extensively used in financial markets to analyze volatility. For
example, asset prices are dynamic and pose a severe challenge during crises. The choice of investment strategies, risk assessment of an asset, the asset's performance in the near term, and hedging strategies are heavily reliant on these econometric tools. Multinational corporations
use these econometric tools for demand estimation, observing consumer behavior, price
determination, and forecasting sales. Apart from these, healthcare economics uses
econometric tools to evaluate the effectiveness of healthcare interventions, and environmental economists use them for cost-benefit analysis and the optimization of natural resources, scrutinizing the relationship between pollution levels and their impact on mankind. So, every sub-discipline of economics is heavily dependent on these econometric tools to optimize benefits efficiently while keeping allocative and distributive equity intact.
Hence, the primary uses of econometric tools include the following:
• Econometric models help in formulating relationships between economic variables.
• Econometric techniques help in testing hypotheses about economic relationships.
• Econometric models are heavily used in forecasting and estimating the future trends
of economic variables
Example: the expected growth rate of public expenditure in healthcare by the central
government in the subsequent ten years based on preceding investment.
• Econometrics is frequently used to assess the effects of specific economic policies.
• It helps in estimating causality between two variables.
Example: the link between risk and return.

Steps in econometric analysis

Among the several schools of thought in econometric methodology, steps based on classical
methodology are as follows:

1. Stating the research question or hypotheses


Ex.: Does there exist an inverse relationship between unemployment and inflation?

2. Specification of the economic model


As per the Phillips curve theory, an inverse relationship exists between the inflation rate and the unemployment rate.

3. Specification of the econometric model


Econometric modeling is a way of reflecting the economic model to its empirical part.
Ψt = Ψt−1 − α(ut − ū) + εt
where Ψt is the inflation rate, Ψt−1 is the lagged inflation rate, ut is the unemployment rate, ū is the natural rate of unemployment, and εt is the error term.

4. Obtaining data
Econometric data are not derived from controlled experiments but are gathered
through observation of real-world events and behaviors.

Economic data sets come in different formats. A cross-sectional data set contains
various variables at a single point in time. In econometrics, cross-sectional variables
are typically represented by the subscript "i," where "i" takes values from 1 to N,
representing the number of cross-sections. This type of data is often used in applied
microeconomics, labor economics, public finance, business economics, demographic
economics, and health economics.
A time series dataset involves recording observations of one or more variables at
sequential time intervals, making it particularly useful in macroeconomic research.
Time series variables are typically represented with the subscript "t."
Panel data combines aspects of both cross-sectional and time series data, collecting
information from multiple variables over time. Panel data are represented using both
"i" and "t" subscripts, referring to cross-sectional and time series data, respectively.
For instance, the GNP of five countries over a 10-year period might be denoted as
Yit, where t = 1, 2, 3, ..., 10 and i = 1, 2, 3, 4, 5.

5. Estimation, validation, hypothesis testing and prediction


After collecting the appropriate data set, the researcher must estimate the
parameters of the econometric model. This estimation is then assessed both from
an economic perspective (do the results align with established economic theory?)
and from a statistical perspective (evaluating significance tests and goodness-of-fit
measures).

Source: Ripollés, Martínez-Zarzoso and Alguacil (2022)

Tools of Econometric Analysis

1. Classical Linear Regression Models

Regression analysis is a widely used tool in economics as it allows economists to model
and quantify the relationships between economic variables, providing a structured
framework for empirical analysis. The aim is to assess how the changes in the explanatory
variables correspond to the variations in the explained variable. This process includes the
estimation of parameters to quantify these relationships. Ultimately, regression analysis
provides a method for assessing and predicting the average value of the dependent
variable based on fixed or known values of the explanatory variables across repeated
samples.

The ordinary least squares (OLS) model is one of the most widely used and reliable regression analysis methods. These models are based on the Gauss-Markov assumptions and are widely applied in various fields for modeling, prediction, and hypothesis testing. The CLRM is mainly of two types:

a) Simple Linear Regression: This model characterizes the linear relationship between
a dependent and a single independent variable.
Yt = β0 + β1Xt + ut
where β0 is the intercept, β1 is the slope coefficient, and ut is the disturbance term.

b) Multiple Linear Regression: This model is simply the extension of simple linear
regression to include more than one independent variable and is represented as
Yt = β0 + β1X1t + β2X2t + β3X3t + ... + βnXnt + ut
where X1, X2, ..., Xn are the independent variables.

Assumptions of Classical Linear Regression Models

• Linearity: The explained variable can be computed as a linear function of the explanatory variables, augmented by an error term.
• Variation in Xt: at least one observation of the independent variable must differ from the others, so that the sample Var(X) is not 0.
• Xt is non-stochastic and remains constant across repeated samples.
• The expected mean value of the error term is zero.
• Homoscedasticity: constant variance of the error terms across all levels of the independent variables.
• Independence: observations are independent of each other.
• Normality of the disturbance term: the disturbance terms are assumed to be normally distributed.
• No perfect multicollinearity among the independent variables.

Source: Asteriou & Hall (2011)

The parameters (β0, β1, β2, ..., βn) are estimated using the method of ordinary least squares, which “minimizes the sum of squared differences between observed and predicted values”; under the assumptions above, the OLS estimators possess the properties of linearity, unbiasedness, consistency, and efficiency (they are BLUE, best linear unbiased estimators).
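To make this concrete, here is a minimal, hedged sketch of how such a model can be estimated in R with the built-in lm() function; the mtcars data set and the chosen variables are purely for illustration and are not part of the discussion above.

data(mtcars)                       # example data shipped with base R

# OLS regression of fuel efficiency (mpg) on weight (wt) and horsepower (hp)
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)    # coefficient estimates, standard errors, t-values, R-squared
coef(fit)       # the estimated parameters beta_0, beta_1, beta_2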

2. Estimation and Hypothesis Testing

Hypothesis testing involves assessing whether a given result supports a proposed hypothesis. This is done through methods like confidence intervals and significance tests.
The confidence interval approach estimates an unknown parameter's possible range,
typically using percentages such as 90% or 95%. If the value predicted by the null
hypothesis lies outside this range, the null hypothesis is rejected; if it falls within the
interval, it cannot be rejected.
In significance testing, a test statistic is calculated based on the assumption that the null
hypothesis is true. The test statistic follows a known probability distribution, such as the

normal, t, F, or chi-square distributions. This helps assess how likely it is to observe the
data if the null hypothesis holds. A p-value is then calculated to indicate the strength of
the evidence. A small p-value suggests that the null hypothesis should be rejected, while
a larger p-value indicates there isn’t enough evidence to reject it. Both methods help
determine the validity of the null hypothesis in statistical analysis.
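As a hedged illustration of the two approaches just described, the snippet below re-uses the regression sketch from the previous section; the 95% level and the hypothesised mean of 20 are arbitrary choices for demonstration.

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Confidence-interval approach: reject H0: beta_j = 0 if 0 lies outside the interval
confint(fit, level = 0.95)

# Significance-test approach: t-statistics and p-values for each coefficient
summary(fit)$coefficients

# A simple one-sample t-test of H0: mean mpg = 20, for comparison
t.test(mtcars$mpg, mu = 20)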

3. Dummy Variable Regression Models

This type of regression analysis model includes categorical variables (also known as
dummy variables or indicator variables). These categorical variables represent categories
or groups that cannot be quantitatively measured. Examples include gender, education,
race, religion, geographical region, etc. They take the value of 0 or 1, indicating the
absence or presence of a particular categorical attribute. A regression model with all its
regressors dummy is called an Analysis of Variance (ANOVA) model.
Yi = α + β1D1i + β2D2i + β3D3i + ui
If there are h categories, only h−1 dummy variables will be taken to avoid the problems
of dummy variable trap and perfect multicollinearity, which is problematic for regression
analysis. The coefficients of these dummy variables, known as differential intercept
coefficients, indicate the average change in the dependent variable when transitioning
from the benchmark category to the category associated with the dummy variable.
Regression models incorporating quantitative and qualitative variables are referred to as
analysis of covariance (ANCOVA) models.
Yi = β1 + β2D2i + β3D3i + β4Xi + ui
Applications

• Helpful for comparing two (or more) regressions


• This is a method of de-seasonalizing time series
• Widely used in piecewise linear regression
• Interpretation of dummies in semi-log models
• Used for capturing the joint effect of two or more variables: interactive dummies
• Dummy variables are also used for testing structural stability
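The sketch below is a hedged illustration of an ANCOVA-type model of the kind described above, mixing one quantitative regressor with a dummy created via factor(); the mtcars variables are used purely for demonstration.

data(mtcars)
# am is coded 0/1 for transmission type; factor() tells lm() to build the dummy
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))

fit_ancova <- lm(mpg ~ wt + am, data = mtcars)
summary(fit_ancova)   # the coefficient on 'ammanual' is the differential intercept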

4. Qualitative Response Models

Also known as binary or discrete choice models or limited dependent variable regression
models, are a class of statistical models used when the dependent variable is categorical.

Different techniques used include:

a) Linear Probability Model (LPM)


In this model, the dummy dependent variable is explained by only one regressor.
Di = β1 + β2X2i + ui
Where Di is the dichotomous dummy variable
The primary limitation of the linear probability model (LPM) is its assumption that
the probability of an event occurring changes linearly with the value of the
independent variable. The disturbance terms are also heteroskedastic, and R2 is not a meaningful measure of fit for such a model.
b) Logit model
This model addresses the issue of the (0,1) boundary condition problem. In this
model, the dependent variable is the log of the odds ratio, representing a linear
relationship with the regressors.
Li = β1 + β2X2i + β3X3i + ··· + βkXki + ui, where Li = ln(Pi / (1 − Pi))
The maximum likelihood method is used if the data is at the micro level. However, if
the data is grouped or replicated, the OLS method can estimate the parameters.
c) Probit model
The logit and probit models are closely related. However, the rationale for
using the probit model over the logit model is that many economic variables follow
a normal distribution. Consequently, assessing these variables through the
cumulative normal distribution is deemed more appropriate, justifying the
preference for the probit model.
d) Multinomial and ordered logit and probit models
If dummy variables represent multiple response categories that exhibit a clear rank
or order, such as "strongly agree," "agree," etc, then ordered logit and probit models
can be used. A multinomial model is used if the dummy variables represent multiple
response categories with no natural ordering of the considered alternatives.
Multinomial model usage has been increasing in applied econometrics these days to
address real-world problems.
e) Tobit model
An extension of the probit model that handles censored variables, i.e., cases where information on the dependent variable is available only for some observations. The model can be represented as
Yi = β1 + β2Xi + εi if the RHS > 0
   = 0 otherwise
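A brief, hedged sketch of how a logit (and, with a change of link, a probit) model can be estimated in R using the base glm() function follows; the binary outcome and regressor are illustrative only.

data(mtcars)
# Logit: probability that a car has a manual transmission (am = 1) given its weight
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
summary(logit_fit)          # coefficients are on the log-odds scale

# Probit variant: identical call with a different link function
probit_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "probit"))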

5. Panel Data Regression Models

According to Baltagi (1995), since panel data relate to cross-sections observed over time, heterogeneity across the observation units can be explicitly captured. With observations that span both time and
individuals in a cross-section, panel data provides “more informative data, more
variability, less collinearity among variables, more degrees of freedom and more
efficiency”. It is better suited for analyzing changes over time, identifying and measuring
the impact of policies and laws, and examining complex behavioural models. Panel data
regression model can be expressed as:

Yit = β1 + β2X2it + β3X3it + ... + βkXkit + uit


where i and t denote units and time, respectively.

A panel is considered balanced when every subject has the same number of time periods
observed, whereas it is unbalanced if the number of observations varies across subjects.
Common estimation techniques for panel data regression models include pooled ordinary
least squares (OLS), fixed effects models, and random effects models.
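A minimal sketch of these estimators in R, assuming the third-party plm package (not part of base R) and its bundled Grunfeld investment panel, might look as follows.

# install.packages("plm") if not already installed
library(plm)
data("Grunfeld", package = "plm")    # firm-level panel: inv, value, capital

# Fixed effects ("within") estimator; index gives the cross-section and time identifiers
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")

# Random effects estimator, followed by a Hausman test to choose between them
re <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "random")
phtest(fe, re)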

6. Dynamic Econometric Models

These models are used to analyze the variables whose values change continuously. These
econometric models are capable of dealing with the volatility of the series. In financial
economics, most of the series, such as stock prices and future options prices, are
examples of volatile time series, and they require special techniques, such as
autoregressive conditional heteroskedasticity (ARCH) modeling, to extract the
information from the series itself.

Traditional econometrics views the variance of the disturbance terms as constant over
time (CLRM assumption of homoscedasticity). However, financial time series exhibit high
volatility in particular periods. This volatile nature has huge implications for the overall
economy. So, the ARCH family of models is used to analyze these volatile time series.
Several different models of ARCH, such as GARCH (Generalized Autoregressive
Conditional Heteroskedasticity), GARCH-M (GARCH in mean), T-GARCH (Threshold
GARCH), E-GARCH (Exponential GARCH), and others, are used frequently in analysis.
Each of these techniques and tools has its own advantages and limitations.
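As a rough, hedged sketch of how one member of this family might be fitted in R, the snippet below uses the third-party rugarch package (an assumption; it is not part of base R) to estimate a GARCH(1,1) model on a simulated return series.

# install.packages("rugarch") if not already installed
library(rugarch)

set.seed(123)
returns <- rnorm(500, mean = 0, sd = 0.01)   # placeholder return series for illustration

spec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                   mean.model     = list(armaOrder = c(0, 0)))
fit  <- ugarchfit(spec = spec, data = returns)
fit   # estimated ARCH/GARCH parameters and diagnostics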

7. Time Series Forecasting

The fundamental objective of a researcher is to predict the future based on the history or the nature of the data. A time series consists of a sequence of observations
on the values a variable assumes at different time intervals. These observations can be
collected at various frequencies, such as daily, weekly, monthly, quarterly, annually,
quinquennially, or decennially. For example, stock prices and temperature are collected
daily, money supply data is collected weekly, consumer price index is collected monthly,
GDP data is collected quarterly and annually, and census is collected decennially.
Nowadays, with technological advancements, data is being collected continuously in real-
time. Although time series modeling is used heavily, it presents several problems for
econometricians, such as non-stationarity. If the time series is non-stationary, its mean
and variance vary systematically. Analysis of a non-stationary series can lead to difficulty
in forecasting, misleading statistical inference, and model instability. Therefore, ensuring
that a time series is stationary is a crucial step in time series analysis, and this can be
verified using various tests.
In time series modeling, the first step is to extract as much information as possible from
the variable itself. When analyzing a single time series, this is known as univariate time
series analysis. In such analyses, it is assumed that the variable’s behavior is influenced
by its previous values. Therefore, what happens in the next time period t+1 is largely
dependent on the events in the current period t. This concept is represented by the
Autoregressive (AR) model. If the value of the variable at time t is influenced by its two
previous values, the model is called an autoregressive model of order two, or AR(2). More
generally, an autoregressive model of order p, denoted as AR(p), includes p lagged
dependent variables. In addition to univariate models, econometric analysis can also
involve multivariate time series models. One common framework for analyzing such data
is the vector autoregression (VAR) model, which handles multiple variables by modeling
each variable as a linear function of its own lagged values and the lagged values of other
variables in the system. VAR models are useful for capturing dynamic interactions
between variables and are often applied in fields like macroeconomics and finance. A
further extension of VAR models is the Vector Error Correction Model (VECM), which
accounts for cointegration, indicating long-term relationships among variables.
Additionally, various non-linear models are employed to analyze more complex time
series patterns.
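As a hedged, minimal sketch of the univariate approach described above, the snippet below fits an AR(2) model in base R to the log-differenced AirPassengers series (which ships with R; the differencing is used here only to obtain a roughly stationary series) and produces short-term forecasts.

y <- diff(log(AirPassengers))          # built-in monthly series, made roughly stationary

ar2 <- arima(y, order = c(2, 0, 0))    # AR(2): the value at t depends on its two previous values
ar2

predict(ar2, n.ahead = 12)$pred        # point forecasts for the next 12 periods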

Conclusion

In summary, the significance of econometric tools and techniques in data analysis within
economics is irrefutable. As a bridge between economic theories and empirical evidence,
econometrics provides a quantitative foundation for testing and validating hypotheses across
various economic disciplines. From shaping government policies and analyzing financial
markets to evaluating healthcare interventions and optimizing environmental resource
allocation, econometric models play a crucial role in understanding and predicting economic
phenomena. The versatility of these econometric tools allows for a thorough and nuanced
exploration of economic relationships, ensuring that researchers can adapt their models to
the complexities of the data at hand. As the field of econometrics continues to evolve,
integrating cutting-edge statistical and mathematical techniques with established economic
theories remains an essential component of empirical research. The advancements in
computational power and the increasing availability of data further enhance the capacity of
econometricians to provide accurate, data-driven insights, guiding decision-makers and
researchers in navigating the complexities of the dynamic economic landscape and
transforming theoretical insights into actionable strategies.

Bibliography
Amemiya, T. (1981). Qualitative response model: A survey. Journal of Economic Literature,
19, 481–536.
Asteriou, D., & Hall, S. G. (2011). Applied Econometrics. UK: Palgrave Macmillan.
Baltagi, B. H. (1995). Econometric analysis of panel data. John Wiley and Sons.
Berndt, E. R. (1991). The practice of econometrics: Classic and contemporary. Addison-
Wesley.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and applications. Cambridge University Press.
Cramer, J. S. (2001). An introduction to the logit model for economists (2nd ed., p. 33).
Timberlake Consultants Ltd.
Cromwell, J. B., Labys, W. C., & Terraza, M. (1994). Univariate tests for time series models.
Sage Publications.
Cuthbertson, K., Hall, S. G., & Taylor, M. P. (1992). Applied econometric techniques (p. 100).
University of Michigan Press.
Goldberger, A. S. (1991). A course in econometrics. Harvard University Press.
Greene, W. H. (1993). Econometric analysis (2nd ed., pp. 535–538). Macmillan.
Gujarati, D. N. (2012). Basic Econometrics. McGraw Hill Education Private Limited.
Haavelmo, T. (1944). The probability approach in econometrics. Supplement to Econometrica, 12, iii.
Hood, W. C., & Koopmans, T. C. (1953). Studies in econometric method (p. 133). John Wiley
& Sons.
Johnston, J. (1984). Econometric methods (3rd ed.). McGraw-Hill.
Kmenta, J. (1986). Elements of econometrics (2nd ed., pp. 723–731). Macmillan.
Murray, M. P. (2006). Econometrics: A modern introduction. Pearson/Addison Wesley.
Patterson, K. (2000). An introduction to applied econometrics: A time series approach. St.
Martin’s Press.
Ripollés, J., Martínez-Zarzoso, I., & Alguacil, M. (2022). Dealing with Econometrics: Real World
Cases with Cross-Sectional Data. UK: Cambridge Scholars Publishing.
Wooldridge, J. M. (1999). Econometric analysis of cross section and panel data. MIT Press.

Chapter 2
An Introduction to R and R Studio
J. Jayasankar, Fahima M.A and Megha K.J
ICAR-Central Marine Fisheries Research Institute, Kochi, Kerala

Introduction

Statistical methods have inspired many computational tools in the past decades, so much so that tool-inspired methodological options have emerged in turn. Many generic software programs perform basic statistical analyses and tests, making the inference process well-founded and relatively easy. R, an evolved offshoot of the S software, is the latest off the block, with explosive growth and adoption. The following section gives a practical overview of what it takes to get R running and of the basic maneuvers.

The R Environment

R is an integrated ecosystem of software facilities for data manipulation, calculation, and graphical display. Among other things, it has

• An effective data handling and storage facility.


• A suite of operators for calculations on arrays, in particular matrices.
• A large, coherent, integrated collection of intermediate tools for data analysis.
• Graphical options for data analysis and display either directly on the computer or on
hardcopy.
• A well-founded and fast-evolving, simple, and effective programming language.
R is very much a carrier for newly developing methods of interactive data analysis. It has developed rapidly and has been extended by a large collection of packages.

R can be regarded as an implementation of the S language developed at Bell Laboratories by Rick Becker, John Chambers, and Allan Wilks, and also forms the basis of the S-Plus system. There are about 25 packages supplied with R (called “standard” and “recommended” packages), and many more are available through the CRAN (Comprehensive R Archive Network) family of Internet sites (via https://CRAN.R-project.org) and elsewhere. Most
classical statistics and much of the latest methodology are available with R, but users may
need to be prepared to do some work to find it.

R Studio

The integrated development environment (IDE) that envelops R is called RStudio.

The RStudio interface has four main panels

• Console where you can type commands and see output. The console is all you would
see if you run R in the command line without RStudio. The prompt, by default ‘>’,
indicates that R is waiting for your commands.

• Script editor where you can type out commands and save them to a file. You can also
submit the commands to run in the console.

• Environment/History: Environment shows all active objects, and history keeps track
of all commands run in the console.

• Files/Plots/Packages/Help

Installing Procedure

Steps for installing R

1: Go to the website https://posit.co/download/rstudio-desktop/

2: Install R corresponding to your version of Windows.

To install R on Windows, click the “Download R for Windows” link. After downloading, run the .exe file and follow the installation instructions. After installing, the user can open R by clicking the R icon.

Steps for installing RStudio

To install RStudio on Windows, click “Download RStudio for Windows” and choose the appropriate version. Run the .exe file after downloading and follow the installation instructions. Users can now work in RStudio for analysis.

After finishing the installation procedure, the user can open RStudio by clicking the RStudio
icon, as shown in the figure above.

R commands, case sensitivity, etc.

➢ Technically R is an expression language with a very simple syntax.

➢ Normally all alphanumeric symbols are allowed (and in some countries, this includes
accented letters) plus ‘.’ and ‘_’, with the restriction that a name must start with ‘.’ or
a letter, and if it starts with ‘.’ the second character must not be a digit. Names are
effectively unlimited in length.

➢ Commands are separated either by a semi-colon (‘;’) or by a newline. Elementary commands can be grouped together into one compound expression by braces (‘{’ and ‘}’).

➢ Comments can be put almost anywhere: starting with a hash mark (‘#’), everything to the end of the line is a comment. If a command is not complete at the end of a
line, R will give a different prompt, by default + on second and subsequent lines and
continue to read input until the command is syntactically complete. This prompt may
be changed by the user. We will generally omit the continuation prompt and indicate
continuation by simple indenting.

➢ To delete objects in memory, we use the function,

• rm: rm(x) deletes the object x,

• rm(x,y) deletes both the objects x and y,

• rm(list=ls()) deletes all the objects in memory.

➢ Objects in R obtain values by assignment. This is achieved by the gets arrow, <-.

Getting help with functions and features

To get more information on any specific named function,

For example solve, the command is


help(solve)
An alternative is ?solve
The help.search command (alternatively ??) allows searching for help in various ways.
For example,
??solve
?help.search for details and more examples.
The examples on a help topic can normally be run by example(topic).
Windows versions of R have other optional help systems:
Use ?help for further details.
Further documentation and search resources are available on the CRAN website: https://cran.r-project.org/web/

Setting a working directory

The working directory can be set in either of the following ways:

setwd("path of the working directory") or
Session > Set Working Directory > Choose Directory

To get the current working directory, use getwd().

Basic Arithmetic

a) Vectors
Vectors are variables with one or more values of the same type. A variable with a single value
is known as a scalar. In R, a scalar is a vector of length 1. There are at least three ways to
create vectors in R: (a) sequence, (b) concatenation function, and (c) repetition function.
Eg:
▪ vector1 <- c(1,5,9)
vector2 <- c(20,21,22,23,24,25)
▪ vector<-seq(1,10,by=1)
vector
[1] 1 2 3 4 5 6 7 8 9 10
▪ A<-rep(5,3)
A
[1] 5 5 5

➢ logical vectors
As well as numerical vectors, R allows the manipulation of logical quantities. The elements of
a logical vector can have the values TRUE, FALSE, and NA. The first two are often abbreviated
as T and F, respectively. Note, however, that T and F are just variables that are set to TRUE
and FALSE by default but are not reserved words and, hence, can be overwritten by the user.
Hence, you should always use TRUE and FALSE.

We can use logical operators to obtain the logical vector.


x<-c(5,9,2)
x<=2
[1] FALSE FALSE TRUE

➢ Character vector

We can equally create a character vector in which each entry is a string of text. Strings in R
are contained within double quotes.

Eg: x<-c("Hello", "Hai")
x
[1] "Hello" "Hai"

➢ Missing values

In some cases, the components of a vector may not be completely known. When an element
or value is “not available” or a “missing value” in the statistical sense, a place within a vector
may be reserved for it by assigning it the special value NA. The function is.na(x) gives a logical
vector of the same size as x with value TRUE if and only if the corresponding element in x is
NA.

Eg: z <- c(1:3,NA)


is.na(z)
[1] FALSE FALSE FALSE TRUE

Note that there is a second kind of “missing” value produced by numerical computation, the
so-called Not a Number, NaN, values.

Examples are 0/0 or Inf – Inf.

➢ Class of an object

All objects in R have a class reported by the function class.

For example "numeric","logical","character","list","matrix","array", "factor" and "data.frame"
are possible values.
Y<-c(2,4,6,8,10,12)
X<-c(1,2,3,4,5,6)
b<-data.frame(X,Y)
b
  X  Y
1 1  2
2 2  4
3 3  6
4 4  8
5 5 10
6 6 12
class(b)
[1] "data.frame"

b) Arrays

An array can be considered as a multiply subscripted collection of data entries. R allows simple
facilities for creating and handling arrays, particularly the special case of matrices. A
dimension vector is a vector of non-negative integers. The dimensions are indexed from one
up to the values given in the dimension vector. A vector can be used by R as an array only if
it has a dimension vector as its dim attribute. Suppose, for example, z is a vector of 15
elements. The assignment dim(z) <-c(3,5) gives it the dim attribute that allows it to be
treated as a 3 by 5 array.

➢ Array indexing: Individual elements of an array may be referenced by giving the array's
name followed by the subscripts in square brackets, separated by commas. More generally,
subsections of an array may be specified by giving a sequence of index vectors in place of
subscripts; however, if any index position is given an empty index vector, then the full
range of that subscript is taken. To access elements in a 2D array, you need two indices
– one for the row and one for the column. The first index refers to the row number, and
the second refers to the column number.

Eg: z<-array(c(1,2,3,4,5,6),c(3,2))

[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
z[2,2]
[1] 5
➢ The array() function

As well as giving a vector structure a dim attribute, arrays can be constructed from vectors
by the array function, which has the form

z <- array(data_vector, dim_vector).

For example, if we create an array of dimension (2, 3), it creates a matrix with 2 rows and 3 columns. An array is created using the array() function: it takes a vector as input and uses the values in the dim parameter to create the array, as illustrated below.
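A small sketch of the call just described (the values are arbitrary):

z <- array(1:6, dim = c(2, 3))   # 2 rows, 3 columns, filled column by column
z
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
dim(z)
# [1] 2 3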

c) Matrices

Matrices are mostly used in statistics and so play an important role in R. To create a matrix,
use the function matrix(), specifying elements by column first.

Eg: matrix(1:12,nrow=3,ncol=4)
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
matrix(c(1,2,3,4,5,6),nrow=2)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
matrix(c(1,2,3,4,5,6),byrow=TRUE,ncol=3)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
matrix(c(1,2,3,4,5,6),ncol=3)
[,1] [,2] [,3]

[1,] 1 3 5
[2,] 2 4 6
➢ Special functions for constructing certain matrices:
diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
produces a 3 × 3 identity matrix.
diag(1:3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 2 0
[3,] 0 0 3
➢ Matrix multiplication is performed using the operator %*%, which is distinct from scalar
multiplication *.

a<-matrix(c(1:9),3,3)
x<-c(1,2,3)
a
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
a%*%x
[,1]
[1,] 30
[2,] 36
[3,] 42
➢ Standard functions exist for common mathematical operations on matrices.

1. Transpose of a matrix

t(a)
[,1] [,2] [,3]
[1,] 1 2 3

[2,] 4 5 6
[3,] 7 8 9
2. Determinant of a matrix

a<-matrix(c(1:8,10),3,3)
a
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 10
det(a)
[1] -3
3. Dimension of a matrix

dim(a)

[1] 3 3

4. Inverse of a matrix

a<-matrix(c(1:8,10),3,3)
solve(a)
[,1] [,2] [,3]
[1,] -0.6666667 -0.6666667 1
[2,] -1.3333333 3.6666667 -2
[3,] 1.0000000 -2.0000000 1
➢ Subsetting

Matrices can be subsetted in much the same way as vectors.


a<-matrix(c(1:8,10),3,3)
a[2,1]
[1] 2
a[,2]
[1] 4 5 6
➢ Combining Matrices

You can stitch matrices together using the rbind() and cbind() functions.
cbind(a,t(a))
[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 1 4 7 1 2 3
[2,] 2 5 8 4 5 6
[3,] 3 6 10 7 8 10

Note: Uni-dimensional arrays are called vectors in R. Two-dimensional arrays are called
matrices.

d) Eigenvalues and Eigenvectors

The function eigen(Sm) calculates the eigenvalues and eigenvectors of a symmetric matrix Sm (it also handles general square matrices, as in the example below). The result of this function is a list of two components named values and vectors. The
assignment ev <- eigen(Sm) will assign this list to ev. Then ev$val is the vector of eigenvalues
of Sm and ev$vec is the matrix of corresponding eigenvectors.

Eg:

eigen(a)
eigen() decomposition
$values
[1] 16.7074933 -0.9057402 0.1982469
$vectors
[,1] [,2] [,3]
[1,] -0.4524587 -0.9369032 0.1832951
[2,] -0.5545326 -0.1249770 -0.8624301
[3,] -0.6984087 0.3264860 0.4718233
Lists

Lists are one of the main objects for holding data in R. They are a bit like vectors except that each entry can be any other R object.
x<-list(1:3,TRUE,"hello")
x[[3]]
[1] "hello"
Here x has three elements: a numeric vector, a logical value, and a string. We can select an entry of x with double square brackets. The function names() can be used to obtain a character vector of all the names of the objects in a list.

e) Data frames

A data frame in R is a tabular data structure whose columns can store values of different data types. Use the class(name of your data frame) or is(name of your data frame, "data.frame") command to check whether an object is a data frame or not.

• The command data.frame() creates a data frame, each argument representing a column.
books<-data.frame(author=c("Raju","Radha"),year=c(1980,1979))
books
author year
1 Raju 1980
2 Radha 1979
We can select rows and columns in the same way as in the matrices.
books[2,]
author year
2 Radha 1979
• as.list(data.frame) –will convert a data frame object into a list object.

• dim(books) – will return the dimension of the data frame.

dim(books)

[1] 2 2

• names(books)- will return the column names of a data frame, row.names(books) will

return the row names.


names(books)
[1] "author" "year"
row.names(books)
[1] "1" "2"
• The head() function in R is used to display the first n rows present in the input data frame. By default, it returns the first 6 rows.
head(books)
author year
1 Raju 1980
2 Radha 1979
• The summary function returns the minimum, maximum, mean, median, and 1st and 3rd
quartiles for a numerical vector.

summary(books)
author year
Length:2 Min. :1979
Class :character 1st Qu.:1979
Mode :character Median :1980
Mean :1980
3rd Qu.:1980
Max. :1980
• The unique() function in R is used to remove duplicate values or rows from a vector, data frame, or matrix.
A <- c(1, 2, 3, 3, 2, 5, 6, 7, 6, 5)
unique(A)
[1] 1 2 3 5 6 7
• factor(): In R, factors are used to work with categorical variables, i.e., variables that have a fixed and known set of possible values.
x <-c("female", "male", "male", "female")
factor(x)
[1] female male male female
Levels: female male
• The table() function in R is used to create a categorical representation of data, showing each value of a variable and its frequency in the form of a table.
vec = c(10, 14, 13, 10, 12, 13, 12, 10, 14, 12)
table(vec)
vec
10 12 13 14
3 3 2 2
• The str() function displays the internal structure of an object such as an array, list, matrix,
factor, or data frame.
vec = list(10, 14, 13, 10, 12, 13, 12, 10, 14, "as")
str(vec)
List of 10
$ : num 10
$ : num 14
$ : num 13

$ : num 10
$ : num 12
$ : num 13
$ : num 12
$ : num 10
$ : num 14
$ : chr "as"
• The View() function in R can be used to invoke a spreadsheet-style data viewer within
RStudio.
• The paste() function in R is used to concatenate two or more string values, separating them with a delimiter.
string1 <- "R"
string2 <- "RStudio"
answer <- paste(string1, string2, sep=" and ")
print(answer)
[1] "R and RStudio"
• The print() function prints the specified message to the screen, or other standard
output device.
print("hello")
[1] "hello"

f) Reading and writing data

• It is often necessary to load data externally.


read.table() and read.csv() are two popular functions used for reading tabular data into
R.
df <- read.table (file='C:\\Users\\bob\\Desktop\\data.txt',header =TRUE)
data<-read.csv(file.choose()) or
data<-read.csv( “path of the file”)
In R, CSV (Comma-Separated Values) files play a crucial role in data manipulation and
analysis. CSV files are plain text files that store tabular data, where each row contains
values separated by commas. R provides several built-in functions to read data from CSV files, with read.csv() being the most commonly used.

• In R, we can write data frames easily to a file, using the write.table() and write.csv()
command.

write.table(books, file="cars1.txt",row.names=F)

write.csv(books,file="car1.csv",row.names=F)

R, by default, creates a column of row indices. To create a file without the row indices, we pass the argument row.names = F, as in the commands above.

Functions

A function in a programming language is much like its mathematical equivalent. It has some
input called arguments and an output called return value.

Writing function

square<-function(x){
x^2
}
square(4)
[1] 16
Note: objects which are created inside a function do not exist outside it.
• for() loops

The most common way to execute a block of code multiple times is with a for () loop.
for (x in 1:5) {
+ print(x)
+}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Other commonly used loops are the while loop and nested loops.

• if() statement

The if statement is one of the decision-making statements in the R programming language and one of the easiest to use. It is used to decide whether a certain statement or block of statements will be executed, i.e., if a certain condition is true, then the block of statements is executed; otherwise, it is not.

a <- 5
if(a > 0)
+{
+ print("Positive Number")
+}
[1] "Positive Number"

R Packages

R comes with many built-in data sets, particularly in the MASS package. A package is a collection of functions, data sets, and other objects. To install and load a package,

install.packages(“package name”)

library(“package”)

Eg: Lubridate is an R package that makes it easier to work with dates and times.

install.packages(“lubridate”)

library(lubridate)
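A brief, hedged illustration of lubridate in use (assuming the package has been installed as above; the date is arbitrary):

d <- ymd("2024-01-29")     # parse a "year-month-day" string into a Date
month(d)                   # [1] 1
wday(d, label = TRUE)      # day of the week as a labelled factor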

To get the list of available data sets in base R we can use data() but to get the list of data
sets available in a package we first need to load that package then data() command shows
the available data sets in that package.

a) Tidyverse package

The Tidyverse suite of integrated packages is designed to work together to make common
data science operations more user-friendly. The packages have functions for data wrangling,
tidying, reading/writing, parsing, and visualizing, among others.

Tidyverse Packages in R has the following:


• Data Visualization and Exploration
ggplot2
• Data Wrangling and Transformation
dplyr
tidyr
stringr

• Data Import and Management
tibble
readr

i) Data Visualization and Exploration in Tidyverse in R

1. ggplot2
ggplot2 is an R data visualization library based on The Grammar of Graphics. ggplot2 can
create data visualizations such as bar charts, pie charts, histograms, scatterplots, error charts,
etc., using high-level API. It also allows you to add different data visualization components or
layers in a single visualization. Once ggplot2 has been told which variables to map to which
aesthetics in the plot, it does the rest of the work so that the user can focus on interpreting
the visualizations and take less time to create them. However, this high-level approach also means that some very low-level customizations are less direct in ggplot2 than in base graphics. If you want to install ggplot2, the best method is to install the tidyverse using:
install.packages("tidyverse")

Or you can just install ggplot2 on its own using:

install.packages("ggplot2")
and then load it with library(ggplot2).

library(ggplot2)
gfg <- data.frame(x = c('A', 'B', 'C', 'D', 'E', 'F'),
                  y = c(4, 6, 2, 9, 7, 3))
ggplot(gfg, aes(x, y, fill = x)) +
  geom_bar(stat = "identity")

If stat = "identity", the bar chart displays the values in the data frame as they are.

ii) Data handling and Transformation in Tidyverse in R

1. dplyr

dplyr is a very popular data manipulation library in R. It has five important functions that combine naturally with the group_by() function, which lets them operate on groups. These functions are: mutate(), which adds new variables that are functions of existing variables; select(), which selects variables based on their names; filter(), which picks observations based on their values; summarise(), which reduces multiple values to a summary; and arrange(), which changes the ordering of the rows. If you want to install dplyr, the best method is to
install the tidyverse using:

install.packages("tidyverse")
Or you can just install dplyr using:
install.packages("dplyr")
Eg:
library(dplyr)
data(starwars)
print(starwars %>% filter(species == "Droid"))
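The grouped-summary pattern built from the verbs listed above can be sketched as follows, again using the starwars data set bundled with dplyr:

library(dplyr)
data(starwars)

# Average height by species, ignoring missing values, sorted in descending order
starwars %>%
  group_by(species) %>%
  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
  arrange(desc(mean_height))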

2. tidyr

tidyr is a data cleaning library in R which helps to create tidy data. Tidy data means that every data cell holds a single value, each column is a variable, and each row is an observation. Tidy data is a staple of the tidyverse and ensures that more time is spent on data analysis and obtaining value from the data rather than on continuously cleaning it and modifying tools to handle untidy data. The functions in tidyr broadly fall into five categories: pivoting, which converts data between long and wide forms; nesting, which changes grouped data so that a group becomes a single row with a nested data frame; splitting character columns and then combining them; rectangling, which converts nested lists into tidy tibbles; and converting implicit missing values into explicit values. If you want to install tidyr, the best method is to install the tidyverse using:

install.packages("tidyverse")
Or you can just install tidyr using:
install.packages("tidyr")

3. stringr

stringr is a library that has many functions used for data cleaning and data preparation tasks.
It is also designed for working with strings and has many functions that make this an easy
process.

All of the functions in stringr start with str and they take a string vector as their first
argument. Some of these functions include str_detect(), str_extract(), str_match(),
str_count(), str_replace(), str_subset(), etc. If you want to install stringr, the best method
is to install the tidyverse using:
install.packages("tidyverse")
Or you can just install stringr from CRAN using:
install.packages("stringr")
Eg:
library(stringr)
str_length("hello")
[1] 5
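A few more of these functions in action (a minimal sketch; the example strings are made up):

fruits <- c("apple", "banana", "pear")
str_detect(fruits, "an")          # FALSE TRUE FALSE
str_replace(fruits, "a", "o")     # replace the first "a" in each string
str_count(fruits, "a")            # 1 3 1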

iii) Data Import and Management in Tidyverse in R

1. readr

This readr library provides a simple and speedy method to read rectangular data in file formats
such as tsv, csv, delim and fwf. readr can parse many different types of data, working out the
type of each column by examining the file; this is done automatically by readr in most cases.
readr can read different kinds of file formats using different functions, namely read_csv() for
comma-separated files, read_tsv() for tab-separated files, read_table() for tabular files,
read_fwf() for fixed-width files, read_delim() for delimited files, and read_log() for web log
files. If you want to install readr, you can install the whole tidyverse, or install and load readr on its own using:

install.packages("readr")
library(readr)
myData = read_tsv("sample.txt", col_names = FALSE)
print(myData)

2. tibble

A tibble is a form of data.frame that keeps the useful parts of it and discards the parts that
are not so important. Tibbles don't change variables' names or types as data.frames can, nor do
they do partial matching, and they bring problems to the forefront much sooner, such as when a
variable does not exist. So code written with tibbles is much cleaner and more reliable.

Tibbles are also easier to use with larger datasets that contain more complex objects, in part
because of an enhanced print() method.

You can create new tibbles from column vectors using the tibble() function, and you can also
create a tibble row-by-row using the tribble() function. If you want to install tibble, you can
install the whole tidyverse, or install tibble on its own using:

install.packages("tibble")
library(tibble)
tib <- tibble(a = c(1,2,3), b = c(4,5,6), c = c(7,8,9))
tib
# A tibble: 3 x 3
a b c
<dbl> <dbl> <dbl>
1 1 4 7
2 2 5 8
3 3 6 9

R plotting

a. plot()

The plot() function is used to draw points (markers) in a diagram.


plot(Parameter 1, Parameter 2)
The function takes parameters for specifying points in the diagram. Parameter 1 specifies
points on the x-axis. Parameter 2 specifies points on the y-axis.
Eg:
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)
plot(x, y)

Draw a line

The plot() function also takes a type parameter with the value l to draw a line to connect all
the points in the diagram:
plot(x,y,type="l")

Plot label

The plot() function also accepts other parameters, such as main, xlab and ylab, if you want to
customize the graph with a main title and different labels for the x- and y-axes.
Eg: plot(x, y, type = "l", main = "My graph",
         xlab = "x axis", ylab = "y axis")
Note: for including colour,
use col = "color" to add a colour to the line and
lwd = width to adjust the line width.

b. Box plot

The boxplot() function takes in any number of numeric vectors, drawing a boxplot for each
vector. Use main to give the title, and xlab and ylab to provide labels for the axes.

Eg:
boxplot(iris[,1], xlab = "Sepal.Length", ylab = "Length (in centimeters)",
        main = "Summary Characteristics of Sepal.Length (Iris Data)")

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris
virginica and Iris versicolor). Four features were measured from each sample: the length
and the width of the sepals and petals, in centimeters. If we want to add color to boxplot
use argument ‘col’.

Eg:
boxplot(iris[,1], xlab = "Sepal.Length", ylab = "Length (in centimeters)",
        main = "Summary Characteristics of Sepal.Length (Iris Data)", col = "orange")

c. Histogram

• R uses the hist() function to create histograms.
• The hist() function takes a vector of values and plots the histogram.
• A histogram has an x-axis covering a range of continuous values, while the y-axis shows how
frequently the data fall in each x-axis interval, using bars of varying heights.
Syntax:

hist(v, main, xlab, xlim, ylim, breaks, col, border)

where v – vector of numeric values
main – title of the chart
col – sets the bar colour
border – sets the border colour of the bars
xlab – description of the x-axis
xlim – specifies the range of values on the x-axis
ylim – specifies the range of values on the y-axis
breaks – controls the number of bars (bins) used.
Eg:
Create data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)
Create the histogram.
hist(v, xlab = "No.of Articles ",col = "green", border = "black")
Note: Range of X and Y values
To restrict the range of values shown on the axes:
1. Use the xlim and ylim parameters for the x-axis and y-axis.
2. Pass them along with the other parameters required to make the histogram chart.
hist(v, xlab = "No.of Articles", col = "green",
border = "black", xlim = c(0, 50),
ylim = c(0, 5), breaks = 5)

d. Bar plots

Bar plots can be created in R using the barplot() function. We can supply a vector or matrix
to this function. If we supply a vector, the plot will have bars with their heights equal to the
elements in the vector.
Eg:
max.temp <- c(22, 27, 26, 24, 23, 26, 28)
barplot(max.temp,
        main = "Maximum Temperatures in a Week",
        xlab = "Day",
        ylab = "Degree Celsius",
        names.arg = c("Sun", "Mon", "Tue", "Wed",
                      "Thu", "Fri", "Sat"), col = "lightgreen")
Note

A histogram represents the frequency distribution of continuous variables. Conversely, a bar
graph is a diagrammatic comparison of discrete variables. A histogram presents numerical data,
whereas a bar graph shows categorical data. The histogram is drawn in such a way that there is
no gap between the bars.

e. Scatterplots

A "scatter plot" is a type of plot used to display the relationship between two numerical
variables, and plots one dot for each observation. It needs two vectors of same length, one
for the x-axis (horizontal) and one for the y-axis (vertical):

Eg:
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y, main="Observation of Cars",
xlab="Car age", ylab="Car speed",col="black",
pch=21,bg="lightgreen")

Pros and Cons of R

Advantages of R
• Open source
• Data wrangling
• Array of packages
• Quality of plotting and graphing
• Platform independent
• Machine learning operations
• Continuously growing

Disadvantages of R
• Weak origin
• Data handling
• Basic security
• Complicated Language

Bibliography
Venables, W. N., & Smith, D. M. the R Development Core Team (2007). An introduction to
R.<http.cran.r-project.org/doc/manuals/R-intro.pdf>Accessed,18(07).
https://www.stats.ox.ac.uk/~evans/Rprog/LectureNotes.pdf

https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf

https://web.itu.edu.tr/~tokerem/The_Book_of_R.pdf

Chapter 3
Regression Analysis: Simple and Multiple
Regression Using R
V Chandrasekar1, Ramadas Sendhil2 & V Geethalakshmi1
1 ICAR-Central Institute of Fisheries Technology, Cochin, India
2 Department of Economics, Pondicherry University (A Central University), Puducherry, India.

Introduction to Simple Linear Regression

In the world of statistics and data analysis, simple linear regression serves as one of the
fundamental tools for exploring the relationship between two variables. It allows us to
understand how changes in one variable are associated with changes in another. In this
chapter, we'll delve into the concept of simple linear regression, its mechanics, and how to
interpret the results. Simple linear regression involves two main variables: the independent
variable (X) and the dependent variable (Y). The relationship between these variables is
assumed to be linear, meaning that changes in X result in proportional changes in Y. The
equation that defines simple linear regression is: Y = β0 + β1X + ϵ

Where;
Y = dependent variable.
X = independent variable.
β0 = intercept (the value of Y when X is zero).
β1 = slope (the change in Y for a one-unit change in X).
ϵ = error term
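As a quick illustration of this equation in R, the following minimal sketch simulates data from the model and estimates β0 and β1 with lm() (all numbers here are made up purely for illustration):

set.seed(1)
X <- runif(100, 10, 50)                  # hypothetical independent variable
Y <- 5 + 2 * X + rnorm(100, sd = 4)      # Y generated as beta0 + beta1*X + error
fit <- lm(Y ~ X)                         # ordinary least squares estimation
summary(fit)                             # estimated intercept, slope, R-squared, p-values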

Understanding Simple Linear Regression using income and expenditure data


In the realm of personal finance, simple linear regression serves as a vital tool for
understanding and predicting various phenomena, such as income and expenditure patterns.
In this chapter, we'll explore a hypothetical example of how simple linear regression can be
applied in personal finance, from data collection to result interpretation. Financial research
often involves investigating how different factors affect individuals' financial behavior. Simple
linear regression allows researchers to explore the relationship between a single independent
variable (such as income) and a dependent variable (like expenditure).

Example: Income and Expenditure
Let's consider a scenario where a researcher wants to study the effect of income on expenditure habits. The researcher collects data from several
individuals with varying income levels and records their corresponding expenditure amounts. Here's a summary of the data:
[Table: Income and Expenditure of the surveyed individuals (paired observations). The full
dataset is provided as income_data.csv and is read into R in the steps below.]
To run the example data in R, you can follow these steps:
1. Install R and RStudio
If you haven't done so yet, you should install R and RStudio. R is a programming language
used for statistical computing, while RStudio is an integrated development environment (IDE)
designed to simplify working with R. You can download R from the Comprehensive R Archive
Network (CRAN) at https://cran.r-project.org/ and RStudio from the official RStudio website
(https://www.rstudio.com/products/rstudio/download/).

2. Open R Studio
After installing RStudio, open the application.

3. Prepare Your Data


Create a CSV file containing the example data. You can use any text editor or spreadsheet
software to create this file. The file should have two columns named Income and Expenditure,
with one row per individual. Save this file as "income_data.csv" in a location you can easily access.

4. Read the Data into R


In RStudio, you'll need to read the CSV file into R. Here's the code to do that:

# Read the CSV file into R
income_data <- read.csv("path/to/your/income_data.csv")

# Print the data to verify it was read correctly
print(income_data)

Make sure to replace "path/to/your/income_data.csv" with the actual path to your CSV file.

Before conducting linear regression analysis, it is essential to verify that the data satisfies the
four key assumptions for linear regression. These assumptions include:
• Linearity: The relationship between the independent and dependent variables must
be linear.
• Independence: The observations should be independent of one another.
• Homoscedasticity: The residuals should have constant variance across all levels of
the independent variable(s).
• Normality of Residuals: The residuals (errors) should follow a normal distribution.

Let's use R to check these assumptions using the example income dataset.
Read the CSV file into R
income_data <- read.csv("path/to/your/income_data.csv")
Load necessary libraries
library(ggplot2)
library(car)

Simple Regression
summary(income_data)

In a simple regression analysis of income data, the summary function provides a numerical
overview, including minimum, median, mean, and maximum values for income and expenditure
variables.

Income Expenditure
Min. :15086 Min. : 3596
1st Qu.:30086 1st Qu.:24361
Median :44260 Median :36887
Mean :44692 Mean :36185
3rd Qu.:59942 3rd Qu.:47904
Max. :74838 Max. :72188

Multiple Regression
summary(cancer_data)

For the multiple regression illustration we use a second hypothetical dataset, cancer_data. Given
that the variables are numeric, running the code generates a numerical summary for both the
independent variables (junk food consumption and drinking) and the dependent variable (cancer
disease occurrence).
Cancer occurrence Junk food Drinking
Min. : 0.79 Min. : 1.62 Min. : 0.76
1st Qu.: 9.33 1st Qu.: 29.51 1st Qu.:12.02
Median :15.14 Median : 51.88 Median :22.91
Mean :14.70 Mean : 54.63 Mean :22.32
3rd Qu.:19.95 3rd Qu.: 83.18 3rd Qu.:32.36
Max. :29.51 Max. :107.15 Max. :43.65

Verify that your data meets the assumptions.
In R, we can verify that our data meet the four key assumptions for linear regression. Since
we are working with simple regression involving just one independent and one dependent
variable, there is no need to check for hidden relationships between variables. However, if
there is autocorrelation within the variables (e.g., multiple observations from the same
subject), simple linear regression might not be appropriate. In these situations, a more
structured approach like a linear mixed-effects model should be considered. Additionally, use
the hist() function to assess whether the dependent variable follows a normal distribution.
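For the income example, a minimal sketch (assuming the file has already been read in as income_data):

hist(income_data$Expenditure,
     main = "Distribution of Expenditure", xlab = "Expenditure")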

Given that the observations display a bell-shaped distribution with a concentration in the
middle and fewer data points at the extremes, we can proceed with linear regression.
Linearity: To assess linearity, we visually inspect the relationship between the independent
and dependent variables using a scatter plot to determine if a straight line could adequately
represent the data points.
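For the income data, this can be done as follows (a minimal base R sketch using the same income_data object):

plot(income_data$Income, income_data$Expenditure,
     xlab = "Income", ylab = "Expenditure",
     main = "Expenditure vs. Income")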

Since the relationship appears approximately linear, we can move forward with the linear
model. Homoscedasticity, or consistent variance, ensures the prediction error remains stable
across the model's prediction range. We'll assess this assumption after fitting the linear model.

The independence of observations means there's no autocorrelation.

Use the cor() function to check if your independent variables are highly correlated.
cor(cancer_data$Junkfood, cancer_data$Drinking)
For example, cor(cancer_data$Junkfood, cancer_data$Drinking) gives an output of 0.015,
indicating a very small correlation (only 1.5%). Hence, both parameters can be included in our
model. Use the hist() function to determine whether your dependent variable follows a
normal distribution:
hist(cancer_data$Cancer)

Since the observations show a bell-shaped pattern, we can proceed with the linear regression.
Fit a linear regression model
model <- lm(Cancer ~ Exercise, data = cancer_data)

Check Assumptions
1. Linearity: Check by plotting a scatterplot of the predictor against the response variable
ggplot(cancer_data, aes(x = Exercise, y = Cancer)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Exercise (No of days)", y = "Cancer (severity)",
       title = "Scatterplot of Cancer vs. Exercise")

To interpret the results of the provided data, we conducted a linear regression analysis
between the variables "Exercise" and "Cancer" to explore their potential relationship. The
analysis included examining the scatterplot between exercise and cancer to visually identify
any patterns, calculating the correlation coefficient to measure the strength and direction of
their relationship, and fitting a linear regression model to the data. The coefficients from the
regression equation were interpreted to understand how exercise levels might relate to cancer
severity. Additionally, the model's goodness-of-fit was assessed using metrics like R-squared
and p-values. These steps provided insights into the potential impact of exercise on cancer

severity and helped identify limitations or areas for further investigation in the dataset, such
as the influence of other factors like drinking habits on cancer severity.

2. Independence: Not directly testable from data, but typically assumed based on study
design. If data comes from a randomized experiment or a properly designed observational
study, independence can be assumed.

3. Homoscedasticity: Check by plotting residuals against fitted values


> plot(Cancer ~ Exercise, data = cancer_data)
> par(mfrow=c(2,2))
> plot(model)
> par(mfrow=c(1,1))

The residuals-versus-fitted plot (the first panel produced by plot(model)) shows the residuals against the fitted values. We are looking for a random
scatter of points around the horizontal line at zero, indicating homoscedasticity.

The linear regression analysis conducted on the dataset revealed a significant relationship
between Cancer and Exercise, as demonstrated by the scatterplot, where the red regression
line demonstrates a positive correlation between the two variables. The diagnostic plots
created for the regression model, such as the residuals vs. fitted values plot, quantile-quantile
(Q-Q) plot, scale-location plot, and residuals vs. leverage plot, offer insights into the model's
assumptions and possible issues. These plots help evaluate the regression model's adequacy,
including its linearity, homoscedasticity, normality of residuals, and influential observations.
Overall, the analysis suggests that Exercise may be a significant predictor of Cancer, but
further investigation and model refinement may be necessary to fully understand the
relationship and ensure the model's validity and reliability.

4. Normality of Residuals: Check by plotting a histogram and a QQ plot of residuals


hist(residuals(model), breaks = 15, main = "Histogram of Residuals", xlab = "Residuals")
qqPlot(residuals(model), main = "Normal Q-Q Plot of Residuals")

Alternatively, you can use the Shapiro-Wilk test for normality. To perform the Shapiro-Wilk test
in R for the above data, you would use the shapiro.test() function:

> shapiro.test(cancer_data$Cancer)
Shapiro-Wilk normality test
data: cancer_data$Cancer
W = 0.98021, p-value = 2.709e-06

> shapiro.test(cancer_data$Exercise)
Shapiro-Wilk normality test
data: cancer_data$Exercise
W = 0.95028, p-value = 6.836e-12

> shapiro.test(cancer_data$Drinking)
Shapiro-Wilk normality test
data: cancer_data$Drinking
W = 0.9615, p-value = 3.951e-10

This code will conduct Shapiro-Wilk tests for normality on the variables "Cancer," "Exercise,"
and "Drinking" in the dataset named `cancer_data`.

The Shapiro-Wilk normality tests were conducted for three variables: "Cancer," "Exercise,"
and "Drinking." For the "Cancer" variable, the test yielded a Shapiro-Wilk statistic (W) of
0.98021 and a very low p-value of 2.709e-06, indicating a rejection of the null hypothesis of
normality. Similarly, for the "Exercise" variable, the test resulted in a W statistic of 0.95028

and an extremely low p-value of 6.836e-12, also leading to the rejection of the null
hypothesis. Finally, for the "Drinking" variable, the W statistic was 0.9615 with a p-value of
3.951e-10, again indicating non-normality. In summary, all three variables significantly
deviate from a normal distribution based on the Shapiro-Wilk tests.

shapiro.test(residuals(model))

The histogram and QQ plot of residuals visually indicate whether the residuals are
approximately normally distributed. The Shapiro-Wilk test provides a formal test of normality.
If the p-value from the test is greater than 0.05, we fail to reject the null hypothesis of
normality. Inspecting these plots and conducting tests will help you determine whether your
data meet the assumptions for linear regression. If the assumptions are violated, you may
need to apply transformations to the variables or consider alternative modelling techniques.

5. Fit a Linear Regression Model


Now, let's fit a linear regression model to the data. Here's the code to do that:
Fit a linear regression model
> model <- lm(Cancer ~ Exercise, data = cancer_data)

Print the summary of the model


> summary(model)

lm(formula = Cancer ~ Exercise, data = cancer_data)


Residuals:
Min 1Q Median 3Q Max
-5.8204 -1.7559 0.0226 1.7181 5.3899

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.556276 0.212113 120.48 <2e-16 ***
Exercise -0.198772 0.003376 -58.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.338 on 496 degrees of freedom
Multiple R-squared: 0.8748, Adjusted R-squared: 0.8746
F-statistic: 3467 on 1 and 496 DF, p-value: < 2.2e-16

Interpreting the Results

The intercept (25.556276) represents the estimated Cancer level when the Exercise variable
is zero. The coefficient for Exercise (-0.198772) suggests that for each unit increase in
Exercise, Cancer decreases by 0.198772 units on average. The p-value for Exercise is
extremely low (<2e-16), indicating strong evidence that Exercise is associated with Cancer.

Prediction

For a given level of Exercise, you can predict the corresponding Cancer level using the
equation: Cancer = 25.556276 - 0.198772 * Exercise.
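In R, the same prediction can be obtained from the fitted model object with predict() (a minimal sketch; the Exercise value of 30 is arbitrary):

new_obs <- data.frame(Exercise = 30)
predict(model, newdata = new_obs)     # 25.556276 - 0.198772 * 30, i.e. about 19.6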

Interpretation of Slope

The slope coefficient (-0.198772) represents the change in the response variable (Cancer)
per unit change in the predictor variable (Exercise). In this case, it suggests that, on average,
an increase of one unit in Exercise is associated with a decrease of 0.198772 units in Cancer.

Coefficient of Determination (R2)

The coefficient of determination (R-squared) quantifies the proportion of variation in the


dependent variable (Cancer) that is accounted for by the predictor variable (Exercise). With
an R² of 0.8748, this means that approximately 87.48% of the variability in Cancer is
explained by Exercise, suggesting that Exercise has a strong influence in predicting Cancer
levels.

Overall, the results indicate a significant negative association between Exercise and Cancer,
with Exercise explaining a large proportion of the variability in Cancer levels.

Bibliography
Goldberger, A. S. (1991). A course in econometrics. Harvard University Press.
Greene, W. H. (1993). Econometric analysis (2nd ed., pp. 535–538). Macmillan.
Gujarati, D. N. (2012). Basic Econometrics. McGraw Hill Education Private Limited.

Chapter 4
Diagnostic Tests in Regression Analysis
Amaresh Samantaraya1
1 Department of Economics, Pondicherry University (A Central University), Puducherry, India.

Introduction

Econometric analysis is widely used today in empirical research both in academics and in
aiding policy/decision making by public authorities and private business. Literally,
econometrics means economic measurement. It is rare to find a research paper published in
professional journals in economics or reports pertaining to economic policy published by
government and professional organizations without application of econometric tools. But
application of econometric analysis is not confined to economics alone. Researchers and
analysts often use econometric analysis for the empirical investigation of issues pertaining
to a variety of disciplines, including commerce and management, sociology, psychology,
medical studies, etc. The importance of econometrics for undertaking empirical analysis in a
variety of fields cannot be overstated.

Econometric analysis involves regression of dependent variable on a set of explanatory


variables. Here, the dependent variable is the variable whose behaviour the
researcher/analyst wishes to analyze. For example, if we wish to understand the behaviour of
‘interest rate’, then it is our dependent variable. The explanatory variables are a set of
variables which are used to explain the movement of the dependent variable. Usually, the
explanatory variables are chosen using an economic theory or postulates/theories derived
from the relevant field of study. For example, if we use Keynesian liquidity preference theory *
to explain interest rate behaviour, then money stock and GDP can be used as two explanatory
variables. Using regression analysis, one can assess how changes in money stock and GDP
impact movements in interest rate.

In the following, our endeavour is to underscore the relevance of diagnostic checks in


regression analysis or econometrics.

*
Exogenous money supply and Keynesian money demand
Major Components of Regression Analysis

Regression analysis applied in econometrics composed of mainly four steps, such as (a) Model
Specification, (b) Model Estimation, (c) Diagnostic Checks, and (d) Hypothesis Testing and
Inferences. Each of such steps is briefly explained, below.

(a) Model Specification


This is the first key step in econometric analysis. It involves the selection of variables to be
included in the model based on economic theory or postulates of the relevant area, explaining
how the behaviour of the dependent variable is explained using information on explanatory
variables. Say, using Keynesian liquidity preference theory, we can specify an econometric
model as in Equation (1), below:
Yt = β0 + β1 X1t + β2 X2t + ut --- (1)
where, Yt: Dependent variable, such as interest rate under liquidity preference theory
X1t and X2t: Explanatory variables such as ‘money stock’ and ‘GDP’ under liquidity preference
theory
ut: Stochastic error term

(b) Model Estimation


After specification of the model, the next step is estimation. Data on the dependent and
explanatory variables can be obtained from primary or secondary sources, depending on the
relevance. If one wishes to explain the behaviour of interest rate using liquidity preference
theory, then one can obtain data on interest rate, money stock and GDP from the official website
of the RBI or any other reliable source. The important issue is that we do not have readymade
values of the intercept and slope coefficients in Equation (1). The researcher or analyst needs
to estimate the values of the βis from the available data on Yt, X1t and X2t. Two broad methods
are used for obtaining such estimates: one is the least-squares based methods, and the second is
the Maximum-Likelihood (ML) based method. Ordinary Least Square (OLS) and its modified
variants are part of least square based methods. Details of such methods are discussed in
popular books in econometrics (Gujarati 1995; Pindyck and Rubinfeld 2000; Ramanathan
2002).

(c) Diagnostic Checks


A major issue in econometric or regression analysis is - the estimates of βis as obtained using
OLS or any other method of estimation are based on data pertaining to the dependent and
explanatory variables for the sample. The objective of the empirical analysis is often to make

a statement or draw inferences about the impact of explanatory variables on the dependent
variable, in general. If we could have obtained data for the entire population (all possible data
points over time and across the countries), then the counterparts of βis in Equation (1) can
be termed as population parameters.
In hypothesis testing, which is part of last step as discussed below, the researcher makes
inference about the population parameters based on the sample estimates obtained from
Step 2 above. Then, what is the role of diagnostic checks? Let us explain as below.
The Gauss-Markov theorem suggests that if the assumptions of the Classical Linear Regression
Model (CLRM) are satisfied, the OLS sample estimates become Best Linear
Unbiased Estimators (BLUE) of the population parameters. Such assumptions are listed below:
(i) Mean of stochastic error term in Equation (1) is zero i.e., E(ut) = 0.
(ii) No auto-correlation in the stochastic error term i.e., E(utus) = 0 for all t≠s.
(iii) Stochastic error term is homoscedastic i.e., E(ut2) = σ2.
(iv) There is no correlation between the explanatory variables and stochastic error
term.
(v) There is no perfect multicollinearity or high multicollinearity amongst the
explanatory variables.
(vi) The explanatory variables are non-stochastic.
(vii) The econometric model is correctly specified. It requires that no relevant
explanatory variable is excluded from the model; no irrelevant explanatory
variable is included in the model; and mathematical functional form of the model
is correct.
(viii) The stochastic error term is randomly distributed.
If all the above assumptions, except the last one, are satisfied, application of OLS to estimate
the econometric model will produce the best linear unbiased estimators of the population
parameters. Hence, we need not look for any other alternative method of estimation.
Otherwise, we need to revise the method of estimation. If the last assumption is satisfied,
then it helps in hypothesis testing. Against this backdrop, diagnostic checks are undertaken
in econometric analysis to establish the relevance of OLS as the estimation procedure, and
accordingly to draw inferences from the estimated results. In addition to the above, in
conventional analysis, a battery of indicators such as the coefficient of determination, the Akaike
Information Criterion (AIC), the Schwarz Bayesian Criterion (SBC), the t-test and the F-test are
employed as part of diagnostic checks in econometric analysis. Necessary details are provided
in the following section.

(d) Hypothesis Testing and Inferences
Hypothesis testing is used to draw inferences about population parameters based on the
sample estimate. For example, let us explain about examining the validity of liquidity
preference theory. One can collect data on interest rate, money stock, GDP for a particular
country say India for a given period, say 1970-71 to 2019-20, and estimate values of the
intercept and partial slope coefficients applying OLS to, say, Equation 1. One can also employ
panel data estimation techniques using data for several countries, say India, Brazil, the US, etc.,
for a given period of time. In any case, each of the above data sets represents a sample of a
country or a set of countries for a given period. The estimated sample estimates are certainly
relevant for the sample. But our ultimate objective is to make a general statement about, say,
the impact of a change in money stock or GDP on interest rate, and thus on
the relevance of liquidity preference theory.
Economics and many streams in social sciences are non-experimental in nature.
Economists/statisticians cannot produce data on interest rates or money stock in the
laboratory. They can only rely on available data for different countries for specific time
periods. In other words, economists can never obtain data pertaining to all countries, and for all
times. Given this constraint, hypothesis testing is used to make a statement about the impact of
money stock or GDP on interest rates (for the population), based on the sample estimates
obtained from the estimated results for the sample.
For example, if sample estimate of say, β1 is obtained as 0.06 in Equn. 1 by applying OLS, can
the researcher reject any statistically significant impact of money stock on interest rate
(assuming, Yt stands for interest rate, and Xit represent money stock). Or, based on the same,
can the researcher infer statistically significant impact of money stock on interest rate? The
related exercise is quite rigorous.
It may be noted that the focus of the present chapter is to discuss diagnostic checks in
econometrics. Hence, the other steps in econometric analysis are covered above only very
briefly. The readers may refer to standard textbooks on econometrics (given in references)
for further details.

Key Items under Diagnostic Checks

As indicated in the preceding section, a key part of diagnostic checks in econometrics/


regression analysis is to verify validity of key assumptions of CLRM. Consequences of
violation of such assumptions, methods of detection and remedial measures are briefly

highlighted in the following. The readers may refer to standard textbooks on econometrics
for technical details. Moreover, we confine to methods of detection which are widely used in
practical econometric analysis, and the related list is not exhaustive.

(a) Tests for Autocorrelation


If there is autocorrelation in the stochastic error term ut in Equation 1, the OLS estimates of
the partial slope coefficients will continue to be unbiased, but not efficient. There are
alternative estimators, such as Generalized Least Squares (GLS), which can produce estimates
with lower variance than OLS. In that case, the t-value† based on OLS will not be precise, and
can lead to erroneous inferences. In hypothesis testing, inferences on statistical significance
of individual βis are based on t-test, which compare estimated t-values with tabulated t-values
relevant for t-distribution. In short, in the presence of autocorrelation in ut, inferences drawn
based on estimated sample estimates can be erroneous.
This underscores the need for checking for presence of autocorrelation in any econometric
analysis. It is more important for regression analysis using time series data. While econometric
textbooks provide a list of several tests for autocorrelation, in practice two such tests are
widely used, as indicated below.

Durbin-Watson (DW) d-statistic


This is derived from the estimated values of ut. The formula for the d-statistic is given below:

d = Σ (ût − ût−1)² / Σ ût²

where ût denotes the estimated residual, the numerator sum runs over t = 2, …, n and the
denominator sum over t = 1, …, n.

By construction, when there is no autocorrelation, the value of d = 2. If d = 0, it implies
existence of perfect positive autocorrelation, and if d = 4, it implies presence of perfect
negative autocorrelation. However, unlike the t or F values popularly used in regression analysis
for hypothesis testing, the d-statistic does not follow a standard distribution. However, we can
use the lower (dL) and upper (dU) critical values provided by Durbin and Watson to check for the
presence or absence of autocorrelation, using the following decision criteria:

• 0 ≤ d < dL: evidence of positive autocorrelation
• dL ≤ d ≤ dU: inconclusive
• dU < d < 4 − dU: no evidence of first-order autocorrelation
• 4 − dU ≤ d ≤ 4 − dL: inconclusive
• 4 − dL < d ≤ 4: evidence of negative autocorrelation

† Estimated t = (Sample estimate of βi − Population parameter βi)/(Standard Error of Sample
Estimate of βi). The standard error of the estimated βi depends on the standard error of ut.

It may be noted that the d-statistic can be used to check for first-order autocorrelation only. It
cannot be used to verify higher-order autocorrelation. Secondly, if the estimated d-statistic
falls in one of the inconclusive regions indicated above, then we cannot make a decision
about the presence/absence of the autocorrelation problem. There are several other shortcomings in
the d-statistic test for autocorrelation. The LM test given below is used to overcome such
deficiencies. Nevertheless, DW test is popularly used as it is readily calculated from the
estimated stochastic error terms, and reported by default in most of the econometric
software packages.
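In R, the d-statistic and an associated p-value can be obtained, for example, from the lmtest package (a minimal sketch, assuming a fitted lm object called model):

library(lmtest)
dwtest(model)     # Durbin-Watson test for first-order autocorrelation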

LM test for Autocorrelation


The Breusch-Godfrey test is based on the Lagrange Multiplier (LM) principle. An auxiliary
regression is run in which the estimated stochastic error term from the original regression
model is used as the dependent variable, and its lagged values (together with the original
explanatory variables) are used as explanatory variables. The estimated coefficient of
determination from this auxiliary regression is compared against chi-square critical values to
test for the presence of autocorrelation. This test is capable of checking for autocorrelation
of higher order, and it does not suffer from the indecision of the DW test for some values of
the d-statistic.
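The Breusch-Godfrey LM test is also available in the lmtest package (a minimal sketch, again assuming a fitted lm object called model):

library(lmtest)
bgtest(model, order = 2)     # LM test for autocorrelation up to order 2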

In the presence of autocorrelation, there are broadly two types of remedial measures available
for the researchers. Firstly, one can use GLS methods such as Cochrane-Orcutt and Hildreth-
Lu procedures to correct for the problem of autocorrelation. Secondly, robust standard errors
correcting for autocorrelation problem can be used for hypothesis testing instead of standard
errors obtained from OLS. Popular econometric software packages have incorporated both
the methods. Providing technical details of the remedial measures is beyond the scope of
this chapter.

(b) Tests for Heteroscedasticity
Similar to the above, if the stochastic error term in Equation (1) suffers from the problem of
heteroscedasticity, the estimated t-value derived from OLS can lead to erroneous inferences.
In that case, OLS estimates continue to be unbiased, but not efficient. An alternative
procedure, Weighted Least Squares (WLS), is BLUE in the presence of heteroscedasticity. The Park
test and the White test are popularly used for checking for the presence of the problem of
heteroscedasticity.

Park Test
It is a two-step test procedure. In the first step, OLS is applied to the original regression
model as in Equation (1), and estimated values of stochastic error terms are obtained. In the
second step, square of the estimated stochastic error term is regressed on the explanatory
variable (s) of the original regression model, which is expected to cause heteroscedasticity of
ut. Say, if in Equation 1, X1t is expected to cause heteroscedasticity, in the second step,
the estimated series of ut² is regressed on X1t. If the coefficient of X1t in the second regression is
statistically significant (using say t-test), then it indicates that the original equation suffers
from the problem of heteroscedasticity. On the contrary, if it is statistically not significant,
one can safely conclude that the original regression model does not suffer from the problem
of heteroscedasticity.
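A minimal sketch of the two-step idea in R, assuming a fitted lm object called model and a suspected explanatory variable X1 in a data frame dat (both hypothetical names); the Park test proper uses the logarithms of the squared residuals and of X1:

u2 <- residuals(model)^2                        # step 1: squared residuals from the original model
park_aux <- lm(log(u2) ~ log(X1), data = dat)   # step 2: auxiliary (log-log) regression
summary(park_aux)                               # a significant slope signals heteroscedasticity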

White Test
White test is a two-step LM procedure widely used to detect presence of heteroscedasticity.
As in the Park Test, in White test also OLS is applied to the original regression model, and
estimated values of stochastic error term obtained. In the second step, square of the
estimated stochastic error term series is regressed on the explanatory variables of the original
models along with their square terms and cross-products. From this auxiliary regression,
the estimated value of the coefficient of determination is used to develop an F or LM test, which
helps in checking presence or absence of the problem of heteroscedasticity.
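In R, heteroscedasticity tests of this family are provided by the lmtest package (a minimal sketch; model, X1, X2 and dat are hypothetical names):

library(lmtest)
bptest(model)                                      # Breusch-Pagan test on the fitted model
bptest(model, ~ X1 + X2 + I(X1^2) + I(X2^2) + X1:X2,
       data = dat)                                 # White-type auxiliary regression with squares and cross-products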

If presence of heteroscedasticity is detected, then there are two alternative remedial


measures which are popularly used. Firstly, using estimated variance of the stochastic error
term from the original equation, both the dependent and explanatory variables are
transformed/ adjusted. Applying OLS on the transformed variables removes
the heteroscedasticity problem. This procedure is referred to as the Weighted Least Squares (WLS)

procedure. Secondly, similar to the case of autocorrelation problem, White’s robust standard
errors or Heteroscedasticity Autocorrelation (HAC) adjusted standard errors are used to
undertake hypothesis testing for the estimated OLS coefficients of the original regression
model.

(c) Tests for High Multi-collinearity


Notwithstanding high correlation amongst the explanatory variables in a multiple regression,
OLS estimates continue to be BLUE. However, the standard errors (variances) of the
estimated intercept/partial slope coefficients will be very high. In that case, the estimated t-
values will be low, and hence there is a high probability that the null hypothesis will not be rejected. It
may lead to inferences not supporting the impact of the explanatory variables on the
dependent variable. However, if there is perfect multi-collinearity, relevant partial slope
coefficients cannot be estimated, at all.
To assess the problem of high multicollinearity, mainly 3 alternative measures are used.
Firstly, from the OLS results of the original regression model, if the value of the coefficient
of determination is found to be high, while estimated t-values for several partial slope
coefficients are low, this indicates possibility of high multicollinearity. Secondly, the coefficient
of correlation amongst the explanatory variables can be estimated directly, and high values
for such coefficients will suggest high multicollinearity. Thirdly, one can estimate Variance
Inflation Factor (VIF), given by the formula below:

VIF = 1 / (1 − r23²)

where r23 represents the coefficient of correlation between the two explanatory variables. The
VIF represents the magnification of the variance of the estimated coefficients in the original
regression model due to strong correlation between the explanatory variables. It is observed
that if the correlation coefficient of two explanatory variables is 0.8, then the VIF is 2.78. With
a rise of the correlation coefficient to 0.95, the VIF rises to 10.26. Thus, a correlation
coefficient of less than 0.8 does not cause a grave problem for our regression analysis.
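In R, variance inflation factors for a fitted multiple regression can be obtained from the car package (a minimal sketch, assuming a fitted lm object called model with two or more explanatory variables):

library(car)
vif(model)     # one VIF per explanatory variable; large values signal high multicollinearity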
As part of remedial measures, the explanatory variables are transformed, or sometimes one of the
explanatory variables which is found to be strongly correlated with another is dropped to get rid
of the high multicollinearity problem. But many econometricians believe the remedy causes a bigger
problem than multicollinearity itself, and sometimes the economic interpretation of the
estimated coefficients of the transformed variables may not be relevant to the research. So,
many prefer not to do anything. It is because, despite high multicollinearity, OLS continues
to be BLUE.

(d) Model specification
The researcher should ensure that mathematical functional form and inclusion of the
explanatory variables in the econometric model are as per the theoretical prescriptions. For
example, if one needs to estimate the Phillips curve (suggesting a negative association between
inflation and unemployment), the mathematical form of the regression model should be in
the form of an inverse function. This is in line with the functional form of Phillips curve –
rectangular hyperbola. Moreover, the scholar also needs to be mindful about what to include
or exclude as part of right-hand side variables. As detailed in standard econometric textbooks,
exclusion of relevant explanatory variables makes OLS estimates biased, while inclusion of
irrelevant variables make OLS inefficient. Before applying standard procedures to detect
model specification errors, the scholar should ensure that the functional form and variables
on the right-hand side of the regression model are in strict conformity to the economic theory
or postulations of the relevant area of research. To check for model misspecification errors,
broadly two types of criteria are used. Firstly, Durbin-Watson d-statistic provides a good rule
of thumb test: if the d-statistic is estimated to be close to 2, it implicitly suggests the absence of
model specification error. On the other hand, if the estimated value of the d-statistic is low
(lower than the lower critical value dL), concerns about model misspecification cannot be avoided.
Ramsey's Regression Specification Error Test (RESET) is widely used to assure the
researcher about the correctness of the model specification. It uses an F-test to check whether the model
can be improved by including any missing variable. Technical details of the same are provided
in standard econometric textbooks.
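Ramsey's RESET test is implemented in the lmtest package (a minimal sketch, assuming a fitted lm object called model):

library(lmtest)
resettest(model, power = 2:3)     # F-test using squares and cubes of the fitted values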

(e) Normality test for stochastic error term


Although the normality assumption for the stochastic error term is not part of the CLRM, it is
necessary for undertaking hypothesis testing in econometric analysis using OLS. Plotting the
estimated OLS residuals can be an informal way of checking the normality assumption. If the plot
resembles a symmetric bell-shaped distribution, it implies that the stochastic error term is
normally distributed. To formally check for normality, one can use the Jarque-Bera (JB) Test for
Normality, given by

JB = n [ S²/6 + (K − 3)²/24 ]

where, n: number of observations


S: skewness coefficient of the estimated errors
K: kurtosis coefficient of the estimated errors

If JB value is estimated to be ‘zero’, it implies that S=0, and K=3, which suggests the
distribution is normal. Under the null hypothesis of normal distribution of ut, JB
asymptotically follows a chi-square distribution with the degrees of freedom of 2. Therefore,
comparing the estimated JB value with the corresponding critical chi-square value, one can
infer about the normality assumption of ut.
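In R, the JB test can be applied to the estimated residuals through, for example, the tseries package (a minimal sketch, assuming a fitted lm object called model):

library(tseries)
jarque.bera.test(residuals(model))     # chi-square test with 2 degrees of freedom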

(f) Coefficient of Determination and Model Selection Criteria


We have referred to coefficient of determination in the above on several occasions. The
coefficient of determination is denoted as R², and is given by the formula below (for a multiple
regression model):

R² = 1 − (RSS/TSS)

where, RSS: Residual Sum of Squares
TSS: Total Sum of Squares
This is used as a proxy for the goodness of fit of the model. If the estimated value of R² (adjusted
R² as reported by many econometric software packages for multiple regression models) is closer to 1,
it suggests that the model is a good fit. In that case, it can be inferred that the explanatory
variables included in the model explain variation of the dependent variable reasonably well.
On the contrary, if the estimated value of R2 is low, then it is inferred that the model is poor
in explaining variation of the dependent variable.
It may be noted that the estimated value of R2 in a regression model depends on the nature
of data. With time series data, particularly if the data are on level like GDP, per capita income,
money stock, exports, etc., R2 tends to be very high, often exceeding 0.9. On the contrary,
with cross-section or panel data, the estimated values of R² are often found to be low. In such
models, even a value of 0.2 cannot be considered as an indicator of poor goodness of fit.
In addition to R2, researchers popularly use Akaike Information Criterion (AIC) and Schwarz
Information Criterion (SIC) for model selection, particularly choosing lag-length of
explanatory variables. The formulae for AIC and SIC (in log form) are given below:

ln AIC = (2k/n) + ln(RSS/n)
ln SIC = (k/n) ln(n) + ln(RSS/n)

where, k: number of regressors in the regression model
ln: natural logarithmic function
Other notations are already defined above. Lower values of AIC and SIC suggest a better
model.
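In R, these criteria are reported directly for fitted models (a minimal sketch comparing two hypothetical specifications, model1 and model2; note that R's AIC() and BIC() are computed from the log-likelihood, which ranks models in the same way as the RSS-based formulae above):

AIC(model1, model2)     # Akaike Information Criterion for each model
BIC(model1, model2)     # Schwarz/Bayesian criterion; lower values indicate the preferred model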
Conclusions
As discussed above, diagnostic checks in econometric analysis are of paramount
importance. Such checks verify whether the required conditions are met to facilitate drawing valid
inferences about the population parameters from the sample estimates. Researchers and
analysts need a sound, in-depth understanding of econometric methods so as to be clear
about the importance of each item of the diagnostic checks. Otherwise, there is a scope
for undertaking erroneous regression analysis, and consequently drawing faulty inferences
from the estimated results.

In this chapter, the discussion on the diagnostic checks in regression analysis/ econometrics
was presented with minimal use of technical/mathematical details. The focus was to highlight
the relevance of the diagnostic checks, and provide a lucid description of the same. For
technical details, the readers may refer to standard textbooks in econometrics, as given in the
references.

Bibliography
Gujarati, Damodar N. (1995): Basic Econometrics, McGraw Hill, 3rd Edition.
Pindyck, Robert S. and Daniel L. Rubinfeld (2000): Econometric Models and Economic
Forecasts, McGraw- Hill, 4th Edition.
Ramanathan, Ramu (2002): Introductory Econometrics with Applications, Harcourt College
Publishers, 5th edition.

Chapter 5
Data Mining and Computation Software for Social Sciences*
V. Geethalakshmi and V Chandrasekar
ICAR-Central Institute of Fisheries Technology, Cochin, India.

Introduction

Statistics is the branch of science that deals with data generation, management, analysis and
information retrieval. Statistical methods dominate scientific research as they include
planning, designing, collecting data, analyzing, drawing meaningful interpretation and
reporting of research findings. Statistics has a key role to play in fisheries research carried
out in the various disciplines viz., Aquaculture, Fisheries Resource Management, Fish Genetics,
Fish Biotechnology, Aquatic Health, Nutrition, Environment, Fish Physiology and Post-
Harvest Technology for enhancing production and ensuring sustainability. For formulating
advisories and policies for stakeholders at all levels, the data generated from the various sub-
sectors in fisheries and aquaculture has to be studied.

With the advent of computational software, dealing with complicated datasets is relatively
easier. Advanced computational techniques aid in data analysis which is crucial to evolve
statistical inference from research data. Data management is also possible with advanced
statistical software.

A well-structured statistical system will form the basis for decision-making at various levels
of a sector, especially during planning and implementation. Statistics can play a more
dominant role:
• as a tool for policy-making and implementation
• assessing the impact of technology
• in sustaining nutritional safety
• in socio-economic upliftment of people below the poverty line
• to identify emerging opportunities through effective coordination
• speedy dissemination of information by networking and appropriate
human resource development

* This chapter has been republished by the author, as cited below:


Geethalakshmi V. & Chandrasekar V. (2022) “Data Mining and Computation Software for Improving Fisheries” in
the Research Recent Technological Developments in Fisheries: Pre and Post-Harvest Operations. ICAR-CIFT, Kochi,
P 209-219
Data Mining

When a large amount of data is available in many forms, data mining can be used to derive
meaningful conclusions without loss of information. Data mining helps in extracting knowledge
from huge datasets. The technique aids in the computational process of discovering patterns
in large data sets involving methods at the intersection of artificial intelligence, machine
learning, statistics, and database systems. The extracted information from a data set using
data mining will be transformed into an understandable structure for further use. The key
properties of data mining are

• Automatic discovery of patterns


• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases
For managing a business, a lot of inputs are required, which can be in quantitative and
qualitative form as well as in other formats. Data mining can be effectively used for analysing
patterns, and based on this, opportunities can be identified for which the necessary action can be
streamlined. Data mining handles large databases and aids in finding predictive information
in them. Questions that traditionally required extensive hands-on analysis can now be
answered directly from the data — quickly. Suppose you introduce a new product in the
market and wish to know its potential market. Using the data mining technique, one can use
data on past promotional mailings to identify the targets most likely to maximize return on
investment in future mailings. Even bankruptcy and other forms of default in financial
ventures can be predicted and the segments of a population likely to exhibit such responses
can be assessed through such predictions.
Extensively used across industries and disciplines, data mining can be said to be at the heart
of analytics efforts to harness the massive data generated in many formats.
Telecom, Media and Technology: Consumer data throws light on the competitiveness of a
market that is already crowded with similar products of the kind a firm is trying to sell.
Telecom, media and technology companies are unique in generating real-time data every
second, and analytic models built on such data can predict consumer behaviour. Such
analytics point out ways to undertake highly targeted and relevant campaigns for promoting
a product.
Education: Data on student progress can be mined to predict performance and to design
interventions that help streamline the sector. Data mining helps educators access student
data, predict achievement levels and pinpoint students or groups of students in need of extra
attention.
Finance & Banking: The banking system maintains billions of transactions for its customer
base, and automated algorithms coupled with data mining help companies get a better view
of market risks, detect fraud faster, manage regulatory compliance obligations, and get
optimal returns on their marketing investments.
Insurance: Insurance companies have to handle risk, fraud and customer default while
retaining their customer base. In the competitive insurance market, products have to be
priced to attract customers, and new business has to be found to expand the customer
base.
Manufacturing: The production line has to be aligned to the supply structure, and the other
departments like quality assurance, packing, branding and maintenance have to be taken care
of for seamless operations. The demand forecast forms the basis of supply chain and timely
delivery has to be ensured. Data mining can be used to predict wear and tear of production
assets and anticipate maintenance, which can maximize uptime and keep the production line
on schedule.
Retailing: Large customer databases hold hidden customer insight that can help you improve
relationships, optimize marketing campaigns and forecast sales. Through more accurate data
models, retail companies can offer more targeted campaigns – and find the offer that makes
the biggest impact on the customer. Data mining tools sweep through databases and identify
previously hidden patterns in one step. An example of pattern discovery is the analysis of
retail sales data to identify seemingly unrelated products that are often purchased together.
Other pattern discovery problems include detecting fraudulent credit card transactions and
identifying anomalous data that could represent data entry keying errors.

Data can be of the following types: record data (transactional); temporal data (time series,
sequences such as biological sequence data); spatial and spatio-temporal data; graph data;
unstructured data (tweets, status updates, reviews, news articles); and semi-structured data
(publication data, XML). Data mining can be employed for:
Anomaly Detection (Outlier/change/deviation detection): The identification of unusual data
records that might be interesting, or of data errors that require further investigation.
Association Rule Learning (Dependency modelling): Searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products are frequently

bought together and use this information for marketing purposes. This is sometimes referred
to as market basket analysis.
Clustering is the task of discovering groups and structures in the data that are in some way
or another "similar", without using known structures in the data.
Classification is the task of generalizing known structure to apply to new data. For example,
an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression attempts to find a function which models the data with the least error.
Summarization provides a more compact representation of the data set, including
visualization and report generation.

The Data Mining Process

In order to explore the unknown underlying dependency in the data, an initial hypothesis is
assumed; several hypotheses may be formulated for a single problem at this stage. Data
generation is the second step. Data may come from a designed experiment, in which the
expert controls the data-generation process, or from an observational setting, in which the
expert cannot influence data generation. An observational setting, namely random data
generation, is assumed in most data-mining applications. How the data are collected affects
their theoretical distribution, and it is important to make sure that the data used for
estimating a model and the data used later for testing and applying the model come from the
same, unknown, sampling distribution. In the observational setting, data are usually
"collected" from existing databases, data warehouses, and data marts.

Data pre-processing is an important step before the analysis. Firstly, outliers have to be
identified and removed or treated. Commonly, outliers result from measurement errors and
coding and recording errors, and sometimes they are natural, abnormal values. Such
non-representative samples can seriously affect the model produced later. Pre-processing
therefore involves either removing outliers from the data or developing robust models that
are insensitive to outliers. Data pre-processing also includes several steps such as variable
scaling and different types of encoding. For estimating the model, selection and
implementation of the appropriate data-mining technique is an important step.
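As a simple illustration of such pre-processing, the R sketch below flags outliers with the
1.5 x IQR rule and then standardises the remaining values; the data vector is hypothetical and
this rule is only one of many possible treatments.

    x <- c(2.1, 2.4, 2.2, 2.6, 9.8, 2.3)                   # hypothetical measurements
    q <- quantile(x, c(0.25, 0.75))
    iqr <- q[2] - q[1]
    outlier <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
    x_clean  <- x[!outlier]                                # drop flagged outliers
    x_scaled <- scale(x_clean)                             # centre and scale (z-scores)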

Data-mining models should help in decision making. Hence, such models need to be
interpretable in order to be useful because humans are not likely to base their decisions on
complex "black-box" models. Note that the goals of accuracy of the model and accuracy of
its interpretation are somewhat contradictory. Usually, simple models are more interpretable,
but they are also less accurate. Modern data-mining methods are expected to yield highly
accurate results using high dimensional models.

Data Mining Techniques

Important data mining techniques are:

• Classification analysis: used to retrieve important and relevant information about data
and metadata
• Association rule learning
• Anomaly or outlier detection
• Clustering analysis
• Regression analysis

Association analysis is the finding of association rules showing attribute-value conditions
that occur frequently together in a given set of data. Association analysis is widely used for
market basket or transaction data analysis. Association rule mining is a significant and
exceptionally dynamic area of data mining research. One method of association-based
classification, called associative classification, consists of two steps. In the first step,
association rules are generated using a modified version of the standard association rule
mining algorithm known as Apriori. The second step constructs a classifier based on the
association rules discovered.
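As an illustration, the following is a minimal sketch of Apriori-based association rule mining
in R (the software discussed later in this chapter), assuming the 'arules' package is installed;
it uses the package's bundled Groceries transactions, and the support and confidence
thresholds shown are arbitrary choices.

    library(arules)
    data("Groceries")                                  # example market-basket transactions
    rules <- apriori(Groceries,
                     parameter = list(supp = 0.01, conf = 0.3))
    inspect(head(sort(rules, by = "lift"), 5))         # top five rules by lift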

Classification is the process of finding a set of models (or functions) that describe and
distinguish data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown. Data mining offers different types of
classifiers (an illustrative decision-tree example in R follows this list):
• Decision Tree is a flow-chart-like tree structure, where each node represents a test on
an attribute value, each branch denotes an outcome of a test, and tree leaves represent
classes or class distributions.
• SVM (Support Vector Machine) is a supervised learning strategy used for classification
and also for regression. When the output of the support vector machine is a continuous
value, the method performs regression; when it predicts a category label for the input
object, it performs classification.
• Generalized Linear Model (GLM) is a statistical technique, for linear modeling. GLM
provides extensive coefficient statistics and model statistics, as well as row diagnostics.
It also supports confidence bounds.
• Bayesian classifiers are statistical classifiers based on Bayes' theorem. They can predict
class membership probabilities, for instance, the probability that a given sample belongs
to a particular class.
• Classification by Backpropagation
• K-NN Classifier: The k-nearest neighbour (K-NN) classifier is an example-based
classifier, meaning that the training instances themselves are used for comparison rather
than an explicit class representation such as the class profiles used by other classifiers.
• Rule-Based Classification represents knowledge in the form of if-then rules. A rule is
assessed according to its accuracy and coverage. If more than one rule is triggered,
conflict resolution is needed in rule-based classification.
• Frequent-Pattern Based Classification (or FP discovery, FP mining, or Frequent itemset
mining) is part of data mining. It describes the task of finding the most frequent and
relevant patterns in large datasets.
• Rough set theory can be used for classification to discover structural relationships
within imprecise or noisy data. It applies to discrete-valued features; continuous-valued
attributes must therefore be discretized prior to use. Rough set theory is based on the
establishment of equivalence classes within the given training data.

• Fuzzy Logic: Rule-based systems for classification have the disadvantage that they
involve sharp cut-offs for continuous attributes. Fuzzy Logic is valuable for data mining
frameworks performing grouping /classification. It provides the benefit of working at
a high level of abstraction.
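The following is the minimal decision-tree sketch in R referred to above, using the 'rpart'
package and the built-in iris data; the settings are illustrative rather than prescriptive.

    library(rpart)
    fit <- rpart(Species ~ ., data = iris, method = "class")   # grow a classification tree
    printcp(fit)                                               # complexity table for pruning
    pred <- predict(fit, iris, type = "class")                 # predicted class labels
    table(Observed = iris$Species, Predicted = pred)           # confusion matrix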

Clustering: Unlike classification and prediction, which analyze class-labelled data objects or
attributes, clustering analyzes data objects without consulting an identified class label. In
general, the class labels do not exist in the training data simply because they are not known
to begin with, and clustering can be used to generate such labels. Objects are clustered
based on the principle of maximizing intra-class similarity and minimizing inter-class
similarity. That is, clusters of objects are created so that objects inside a cluster have high
similarity to each other but are dissimilar to objects in other clusters. Each cluster that is
generated can be seen as a class of objects, from which rules can be inferred. Clustering can
also facilitate the formation of classes, that is, the organization of observations into a
hierarchy of classes that group similar events together.
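A minimal clustering sketch in R: k-means applied to the numeric columns of the built-in
iris data. The choice of three clusters is an assumption made for illustration.

    set.seed(1)
    km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)   # k-means with three clusters
    table(Cluster = km$cluster, Species = iris$Species)   # compare clusters with known labels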
Regression can be defined as a statistical modelling method in which previously obtained
data are used to predict a continuous quantity for new observations. This classifier is also
known as the Continuous Value Classifier. There are two types of regression models: linear
regression and multiple linear regression models.
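A minimal regression sketch in R fitting simple and multiple linear models to the built-in
mtcars data; the particular variables chosen are illustrative only.

    m1 <- lm(mpg ~ wt, data = mtcars)               # simple linear regression
    m2 <- lm(mpg ~ wt + hp + disp, data = mtcars)   # multiple linear regression
    summary(m2)                                     # coefficients, R-squared, diagnostics
    predict(m2, newdata = mtcars[1:3, ])            # predictions for new observations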

Data Generation in Fisheries

Data generation in fisheries will vary depending on the nature of research undertaken. For
example, when species behaviour, growth, abundance, etc. are studied, detailed data on spatial
distribution and catch are required. If the focus is to predict the profit of the coming years, an
economist should study the effect of population size on producers' costs. Macro-level
data on infrastructure, employment, earnings, investment, etc. are considered to formulate
management measures. Enormous data from marine fishing are generated by commercial
fishing vessels and research vessels, which can be mined to analyse trends, resource
abundance, etc.

In fishery technology, large volumes of data are generated in a wide range of applied scientific
areas such as fishing technology, fish processing, quality control, fishery economics, marketing and
management. Apart from statistical data collected in technological research, data are also
collected on production, exports, socio-economics, etc. for administrative and management
decision making.

Major areas of data generation are as follows:
❖ fishing vessel and gear designs
❖ fishing methods
❖ craft and gear materials
❖ craft and gear preservation methods
❖ fishing efficiency studies
❖ fishing accessories
❖ emerging areas include use of GIS and remote sensing

Data on various aspects of fishing gets collected for administrative purposes and policy
making. For administrative purposes, voluminous data gets generated through fisheries
departments of states. Each district has officials entrusted with the work of collection of
data which are coordinated at the state level. State level figures are compiled at the National
level by Department of Animal Husbandry and Dairying, Ministry of Agriculture, New Delhi.

[Figure: flow of fisheries data compilation, from district-level estimates compiled by officials to state-level figures to national-level estimates]

Information is also compiled on macroeconomic variables like GSDP from fishing by the
respective Directorates of Economics & Statistics.
Infrastructure
Indian fisheries is supported by a vast fleet of 2,03,202 fishing crafts categorized into
mechanized, motorized and non-motorized vessels. These fishing crafts are registered
at various ports across India, and licenses for fishing operations have to be obtained from the
respective states. The fish processing sector, largely managed by the private sector, has an
installed processing capacity of about 11000 tonnes per day. Data are also collected from
time to time by various agencies on infrastructure facilities and inventories, such as the number of
mechanized, motorized and non-motorized fishing crafts, fish landing centers, fisheries
harbours, types of gears and accessories, fish markets, ice plants and cold storages, as well as
socio-economic data such as the population of fishermen, welfare schemes, cooperative societies,
financial assistance, subsidies, training programmes, etc.

Fish Landings and Fishing Effort


Indian fisheries has seen tremendous development over the past six decades owing to
technological changes in fishing such as the mechanization of propulsion, gear and handling,
the introduction of synthetic gear materials, the development of acoustic fish finding devices,
satellite-based fish detection techniques, and advances in electronic navigation and communication
equipment. The increase in fish production can be described as exponential, from a mere 75000 MT
in 1950-51 to 11.42 million MT in the current year. Both marine fisheries and aquaculture
have contributed to the present production level, with the share from culture fisheries now more
than that of capture fisheries. It is an important task to collect macro-level data from the states
and the country on fish production and details of the species caught in the sea.
Data on fish catch and effort (a measure of the fishing activity of vessels at sea) from all the
coastal states, Union Territories and islands are collected by the ICAR-Central Marine Fisheries
Research Institute (CMFRI) and maintained as a database. Based on the standard sampling methodology
developed by CMFRI, daily data on commercial landings from selected centres/zones all over
the coast are collected, compiled and published. Detailed time series data on species-wise,
region-wise and gear-wise fish landings have been generated and compiled for the use
of researchers and policy makers. The beach price of fish (species-wise) is also collected
periodically.

Data on fish farms, production and area under aquaculture is maintained by the respective
State Fisheries departments and compiled at the National level. Apart from capture fisheries
(marine) and culture fisheries (aquaculture) the fish production from inland water bodies like
lake, ponds, reservoirs, etc. is collected and compiled at State level. For developing the sector,
various programmes and projects have to be formulated and implemented. To achieve the
objectives of such developmental programmes, the current status of production of fish from
various regions has to be made known. The need for fish production data maintained by these
agencies from marine sources, aquaculture and inland water bodies arises while formulating
various research studies and development projects at district, state and National level.

Data Generation along the Fish Value Chain
Fresh fish after harvest is iced and distributed through various channels into the domestic
markets and overseas markets. Around 80% of the fish is marketed fresh, 12% of fish gets
processed for the export sector, 5% is sent for drying/curing and the rest is utilized for other
purposes.

Marine Products Export Development Authority (MPEDA) maintains the database on the
export of fish and fishery products from India to various countries. The weekly prices realized
by Indian seafood products in the various overseas markets are also collected and compiled
by the agency. Marine Products Export Development Authority (MPEDA), established in
1972 under the Ministry of Commerce, is responsible for collecting data regarding production
and exports, apart from formulating and implementing export promotion strategies. Before
MPEDA was established, the Export Promotion Council of India was undertaking this task.
Fish processing factories established all over the country generate data on daily production,
procurement of raw material and movement of price structure, etc., which is generally kept
confidential. Data on quality aspects are maintained by the Export Inspection Council of India
through the Export Inspection Agency (EIA) in each region under the Ministry of Commerce
and Industry. The EIA is the agency approving the suitability of products for export, and the
data it maintains include:
⚫ bacteriological organisms present in the products
⚫ rejections in terms of quantity
⚫ reasons for rejection, etc.

Fish Quality Control


Other types of data generated by CIFT in fishing and fish processing technology are quality
control data on fish and fishery products, ice, water, etc. Quality control is an offshoot of
processing technology, of which statistical quality control forms an integral part. Due to the
stringent quality control measures imposed by importing countries, especially EU and
USFDA standards, samples of fish and related products, such as raw materials, ice and water
samples, and swabs from fish processing factories, are tested at the quality control labs.
Another area where statistics get generated is in product development: consumer
acceptability and preference studies mainly for value-added products. Statistical sensory
evaluation methods are used to analyze this data.

At the Central Institute of Fisheries Technology (CIFT), we periodically collect data on the
following aspects, which are used for policy decisions.
◼ Techno-economic data on various technologies developed
◼ Data on the Economics of operation of mechanized, motorized, and traditional crafts
◼ Data for the estimation of fuel utilization by the fishing industry
◼ Year-wise data on Installed capacity utilization in the Indian seafood processing
industry
◼ Demand – supply and forecast studies on the fishing webs
◼ Harvest and post-harvest losses in fisheries
◼ Transportation of fresh fish and utilization of trash fish
◼ Impact of major trade policies like the impact of anti-dumping, trend analysis of
price movement of marine products in the export markets
◼ Study on the impact of technology and study on socio-economic aspects

Computational Software for Fisheries Research

R is an open-source software that provides a programming environment for statistical data


analysis. R can be effectively used for data storage, data analysis, and various graphing
functions. R works on the principle of ‘functions’ and objects. There are about 25 packages
supplied with R (called “standard” and “recommended” packages), and many more are
available through the CRAN family of Internet sites (via https://CRAN.R-project.org) and
elsewhere. It is widely used for analyzing fisheries data.
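To give a flavour of a typical R session, the following minimal sketch reads a data file,
summarises it, and draws a simple graph; the file name 'catch_data.csv' and its columns
('landings', 'season') are hypothetical.

    catch <- read.csv("catch_data.csv")              # import the data
    summary(catch)                                   # descriptive statistics
    hist(catch$landings, main = "Fish landings")     # simple histogram
    t.test(landings ~ season, data = catch)          # compare two hypothetical seasons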

Data Mining Software

Compared to other data mining software, SAS Enterprise Miner is a very comprehensive tool
that can handle a wide variety of data mining tasks. Further, it is very user-friendly and easy
to learn, even for users who are not familiar with SAS programming. Finally, it has a wide
range of built-in features and functionality, which makes it a very powerful tool.

Data Mining using SAS


SAS Enterprise Miner is a software tool from SAS used for data mining and predictive
modelling. It provides a graphical user interface for easy access to data mining and machine
learning algorithms. It can be used to build predictive models from data sets of any size.

Features

• SAS Data mining tools help you to analyze big data


• It is an ideal tool for Data mining, text mining & optimization.
• SAS offers a distributed memory processing architecture that is highly scalable
The process flow of SAS Enterprise Miner is as follows:
1. Data is imported into the project.
2. A model is created using the data.
3. The model is validated and deployed.

Data Preparation

Data Input

You can load a dataset into SAS Enterprise Miner by using the Data Import node. This node
lets you specify the dataset's location and other necessary information, such as variable types
and roles. Nodes are the building blocks of a SAS Enterprise Miner process flow. There are
various node types, each of which performs a different task. For example, there are nodes for
data import, data cleansing, modeling, and results visualization.
The main components of SAS Enterprise Miner are the data source, the data target, the
model, and the results. The data source is the location from which the data is being imported.
The data-target is the location to which the data is being exported. The model is the
statistical or machine learning model that is being used to analyze the data. The results are
the model output, which can be used to make predictions or decisions.
Decision trees are a type of predictive modeling used to classify data. In SAS Enterprise Miner,
decision trees are generated using the Tree Model node. This node takes a dataset as input
and generates a decision tree based on the variables in the dataset. The tree can then be
used to predict the class of new data.
Data Partition
You can split datasets in SAS Enterprise Miner by using the Partition node. This node will
take a dataset as input and will output two or more partitions based on the settings that you
specify. You can specify the percentage of records that should go into each partition, or you
can specify a particular variable on which to split the dataset. Partitioning provides mutually
exclusive data sets. Two or more mutually exclusive data sets share no observations with each
other. Partitioning the input data reduces the computation time of preliminary modeling runs.

The Data Partition node enables you to partition data sets into training, test, and validation
data sets. The training data set is used for preliminary model fitting. The validation data set
is used to monitor and tune the model weights during estimation and for model assessment.
The test data set is an additional hold-out data set that you can use for model assessment.
This node uses simple random sampling, stratified random sampling, or user-defined
partitions to create partitioned data sets.
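Although the Data Partition node is a point-and-click tool, the underlying idea can be
sketched in R as follows, using simple random sampling on the built-in iris data; the
70/15/15 split is an assumption.

    set.seed(123)
    n   <- nrow(iris)
    idx <- sample(c("train", "valid", "test"), n,
                  replace = TRUE, prob = c(0.70, 0.15, 0.15))
    train <- iris[idx == "train", ]   # preliminary model fitting
    valid <- iris[idx == "valid", ]   # tuning and model assessment
    test  <- iris[idx == "test", ]    # final hold-out assessment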
Filtering Data
The Filter node tool is located on the Sample tab of the Enterprise Miner tools bar. Use the
Filter node to create and apply filters to your training data set. You can also use the Filter
node to create and apply filters to the validation and test data sets. You can use filters to
exclude certain observations, such as extreme outliers and errant data you do not want to
include in your mining analysis. Filtering extreme values from the training data produces
better models because the parameter estimates are more stable.

Explore Node of SAS Enterprise Miner

Association node enables you to identify association relationships within the data. For
example, if a customer buys a loaf of bread, how likely is the customer to also buy a gallon of
milk? The node also enables you to perform sequence discovery if a sequence variable is
present in the data set. The Cluster node enables you to segment your data by grouping
statistically similar observations. Similar observations tend to be in the same cluster, and
observations that are different tend to be in different clusters. The cluster identifier for each
observation can be passed to other tools for use as an input, ID, or target variable. It can also
be used as a group variable that enables the automatic construction of separate models for
each group.

DMDB node creates a data mining database that provides summary statistics and factor-
level information for class and interval variables in the imported data set. The DMDB is a
metadata catalog that stores valuable counts and statistics for model building.
Graph Explore node is an advanced visualization tool that allows you to graphically explore
large volumes of data to uncover patterns and trends and reveal extreme values in the
database. For example, you can analyze univariate distributions, investigate multivariate
distributions, and create scatter and box plots and constellation and 3-D charts. Graph
Explore plots are fully interactive and are dynamically linked to highlight data selections in
multiple views.

Link Analysis node transforms unstructured transactional or relational data into a model that
can be graphed. Such models can be used for fraud detection, uncovering criminal network
conspiracies, analysing telephone traffic patterns, website structure and usage, database
visualization, and social network analysis. The node can also be used to recommend new
products to existing customers.
Market Basket node performs association rule mining of transaction data in conjunction with
item taxonomy. This node is useful in retail marketing scenarios that involve tens of
thousands of distinct items, where the items are grouped into subcategories, categories,
departments, and so on. This is called item taxonomy. The Market Basket node uses the
taxonomy data and generates rules at multiple levels in the taxonomy.
MultiPlot node is a visualization tool that allows you to graphically explore larger volumes of
data. The MultiPlot node automatically creates bar charts and scatter plots for the input and
target variables without making several menu or window item selections. The code created
by this node can be used to create graphs in a batch environment.
Path Analysis node enables you to analyze Web log data to determine the paths that visitors
take as they navigate through a website. You can also use the node to perform sequence analysis.
SOM/Kohonen node enables you to perform unsupervised learning by using Kohonen vector
quantization (VQ), Kohonen self-organizing maps (SOMs), or batch SOMs with Nadaraya-
Watson or local-linear smoothing. Kohonen VQ is a clustering method, whereas SOMs are
primarily dimension-reduction methods.
StatExplore node is a multipurpose node that you use to examine variable distributions and
statistics in your data sets. Use the StatExplore node to compute standard univariate
statistics, standard bivariate statistics by class target and class segment, and correlation
statistics for interval variables by interval input and target. You can also use the StatExplore
node to reject variables based on target correlation.
Variable Clustering node is a useful tool for selecting variables or cluster components for
analysis. Variable clustering removes collinearity, decreases variable redundancy, and helps
reveal the underlying structure of the input variables in a data set. Large numbers of variables
can complicate the task of determining the relationships that might exist between the
independent variables and the target variable in a model. Models that are built with too many
redundant variables can destabilize parameter estimates, confound variable interpretation,
and increase the computing time that is required to run the model. Variable clustering can
reduce the number of variables that are required to build reliable predictive or segmentation
models.

Variable Selection node enables you to evaluate the importance of input variables in
predicting or classifying the target variable. The node uses either an R-square or a Chi-square
selection (tree-based) criterion. The R-square criterion removes variables that have large
percentages of missing values and removes class variables that are based on the number of
unique values. The variables unrelated to the target are set to a status of rejected. Although
rejected variables are passed to subsequent tools in the process flow diagram, these variables
are not used as model inputs by modeling nodes such as the Neural Network and Decision
Tree tools.

Modelling Data using SAS Enterprise Miner

AutoNeural node can be used to automatically configure a neural network. The AutoNeural
node implements a search algorithm to incrementally select activation functions for various
multilayer networks.

Decision Tree node enables you to fit decision tree models into your data. The
implementation includes features in various popular decision tree algorithms (for example,
CHAID, CART, and C4.5). The node supports both automatic and interactive training. When
you run the Decision Tree node in automatic mode, it automatically ranks the input variables
based on the strength of their contribution to the tree. This ranking can be used to select
variables for subsequent modeling. You can override any automatic step with the option to
define a splitting rule and to prune explicit nodes or subtrees. Interactive training lets you explore
and evaluate data splits as you develop them.
Dmine regression node enables you to compute a forward stepwise least squares regression
model. In each step, the independent variable that contributes maximally to the model R-
square value is selected. The tool can also automatically bin continuous terms.
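A rough R analogue of forward stepwise selection, using step() on the built-in mtcars data;
the candidate variables in the scope are illustrative.

    null_model <- lm(mpg ~ 1, data = mtcars)            # intercept-only starting model
    step(null_model,
         scope = ~ wt + hp + disp + qsec,               # candidate inputs
         direction = "forward")                         # add the best variable at each step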
DMNeural node is another modeling node that you can use to fit an additive nonlinear model.
The additive nonlinear model uses bucketed principal components as inputs to predict a
binary or an interval target variable with the automatic selection of an activation function.
Ensemble node enables the creation of models by combining the posterior probabilities (for
class targets) or the predicted values (for interval targets) from multiple predecessor models.
Gradient Boosting node uses tree boosting to create a series of decision trees that together
form a single predictive model. Each tree in the series is fitted to the residual of the prediction of
the earlier trees. The residual is defined in terms of the derivative of a loss function. For squared-error
loss with an interval target, the residual is simply the target value minus the predicted value.
Boosting is defined for binary, nominal, and interval targets.
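A rough R sketch of the same idea using the 'gbm' package (assumed to be installed); the
tuning values shown, namely the number of trees, tree depth and shrinkage, are assumptions.

    library(gbm)
    set.seed(1)
    fit <- gbm(mpg ~ ., data = mtcars, distribution = "gaussian",
               n.trees = 500, interaction.depth = 2, shrinkage = 0.05)
    summary(fit)                                   # relative influence of each input
    predict(fit, mtcars[1:3, ], n.trees = 500)     # predictions from the boosted model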

LARS node enables you to use Least Angle Regression algorithms to perform variable
selection and model fitting tasks. The LARs node can produce models that range from simple
intercept models to complex multivariate models that have many inputs. When using the
LARs node to perform model fitting, the node uses criteria from either the least angle
regression or the LASSO regression to choose the optimal model.
MBR (Memory-Based Reasoning) node enables you to identify similar cases and to apply
information that is obtained from these cases to a new record. The MBR node uses k-nearest
neighbor algorithms to categorize or predict observations.
Model Import node enables you to import models into the SAS Enterprise Miner environment
that SAS Enterprise Miner did not create. Models created using SAS PROC LOGISTIC (for
example) can now be run, assessed, and modified in SAS Enterprise Miner.
Neural Network node enables you to construct, train, and validate multilayer feedforward
neural networks. Users can select from several predefined architectures or manually select
input, hidden, and target layer functions and options.
Partial Least Squares node is a tool for modeling continuous and binary targets based on
SAS/STAT PROC PLS. The Partial Least Squares node produces DATA step score code and
standard predictive model assessment results.
Regression node enables you to fit linear and logistic regression models to your data. You can use
continuous, ordinal, and binary target variables. You can use both continuous and discrete
variables as inputs. The node supports the stepwise, forward, and backward selection
methods. A point-and-click interaction builder enables you to create higher-order modeling terms.
Rule Induction node enables you to improve the classification of rare events in your modeling
data. The Rule Induction node creates a Rule Induction model that uses split techniques to
remove the largest pure split node from the data. Rule Induction also creates binary models
for each level of a target variable and ranks the levels from the rarest event to the most
common. After all levels of the target variable are modeled, the score code is combined into
a SAS DATA step.
Two Stage node enables you to compute a two-stage model to predict a class and an interval
target variable simultaneously. The interval target variable is usually a value that is associated
with a level of the class target.
Survival data mining
Survival data mining is the application of survival analysis to customer data mining problems.
The application to the business problem changes the nature of the statistical techniques. The
issue in survival data mining is not whether an event will occur in a certain time interval but

when the next event will occur. The SAS Enterprise Miner Survival node is located on the
Applications tab of the SAS Enterprise Miner toolbar. The Survival node performs survival
analysis on mining customer databases when there are time-dependent outcomes. The time-
dependent outcomes are modeled using multinomial logistic regression. The discrete event
time and competing risks control the occurrence of the time-dependent outcomes. The
Survival node includes functional modules that prepare data for mining, expand data to one
record per time unit, and perform sampling to reduce the size of the expanded data without
information loss. The Survival node also performs survival model training, validation, scoring,
and reporting.
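For readers working in R, the general flavour of survival analysis can be sketched with the
'survival' package and its bundled 'lung' data, as below; note that this shows the standard
Kaplan-Meier and Cox approaches rather than the specific discrete-time multinomial logistic
formulation used by the SAS Survival node.

    library(survival)
    fit_km  <- survfit(Surv(time, status) ~ sex, data = lung)      # Kaplan-Meier curves
    summary(fit_km, times = c(180, 360))                           # survival at 180 and 360 days
    fit_cox <- coxph(Surv(time, status) ~ age + sex, data = lung)  # Cox proportional hazards
    summary(fit_cox)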

Chapter 6
Introduction to Indices and Performance
Evaluation
J. Charles Jeeva and R. Narayana Kumar
Madras Regional Station
ICAR- Central Marine Fisheries Research Institute
Chennai-600 020

Indices in Social Science Research

Indices are sums of a series of individual yes/no questions that are then combined into a single
numeric score. They are usually a measure of the quantity of some social phenomenon and
are constructed at a ratio level of measurement. The word is derived from Latin, in which
index means "one who points out," an "indication," or a "forefinger." In Latin, the plural form
of the word is indices. In statistics and research design, an index is a composite statistic: a
measure of changes in a representative group of individual data points or, in other words, a
compound measure that aggregates multiple indicators.
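As a minimal illustration, the R sketch below builds a simple additive index from three
hypothetical yes/no (1/0) survey items.

    items <- data.frame(q1 = c(1, 0, 1), q2 = c(1, 1, 0), q3 = c(0, 1, 1))
    index_score <- rowSums(items)                    # raw index: count of 'yes' answers
    index_pct   <- 100 * index_score / ncol(items)   # index expressed as a percentage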

Features of index numbers are as follows:
• Average: they represent the changes that take place in terms of averages.
• Quantitative: they offer an accurate measurement of quantitative change.
• Measures of relative changes: they measure relative changes over time.

Index Development in Social Science

A set of indices is produced that measure dimensions of social development, understood as


informal institutions such as the strength of civil society, intergroup relations, or gender
equality, which contribute to better growth and governance.

Difference between Scale and Index in Research

Scales are always used to give scores at the individual level. However, indices could be used
to give scores at both individual and aggregate levels. They differ in how the items are
aggregated.

A scale is an index that, in some sense, only measures one thing. For example, a final exam in
a given course could be thought of as a scale: it measures competence in a single subject. In
contrast, a person's GPA can be considered an index: it is a combination of several separate,
independent competencies.

To summarize, an index is a measure that contains several indicators and is used to
summarize some more general concepts.

Indicators for Impact Monitoring and Assessment

Indicators are quantitative or qualitative variables that can be measured or described and
when observed periodically, demonstrate trends; they help to communicate complex
phenomena. They represent the abstraction of a phenomenon or a variable. In other words,
an indicator is just an indicator. It is not the same as the phenomenon of interest, but only
an indicator of that phenomenon (Patton, 1997).

Classification of Indicators

Scientific indicators tend to be quantitatively measurable; they are global within a given
discipline and are meant to be comparable across space and time.

Grassroots (indigenous/local) indicators are signals used by local people (individuals, groups,
communities) based on their own observations, perceptions, and local knowledge, applied
within specific cultural, ecological, and spiritual contexts; they tend to be more descriptive.

Another classification of indicators says that they can be broadly classified into two
categories, namely, final and intermediate.

Final indicator: When an indicator measures the effect of an intervention on individuals’ well-
being, we call it a "final" indicator.

For example, literacy may be considered one of the dimensions of 'well-being', so an indicator
measuring it, say, the proportion of people of a certain age who can read a simple text and
write their name, would be a final indicator. Sometimes final indicators are divided into
"outcome" and "impact" indicators.

Impact indicators measure key dimensions of 'well-being' such as freedom from hunger,
literacy, good health, empowerment, and security.

Outcome indicators capture access to, use of, and satisfaction with public services, such as
the use of health clinics and satisfaction with the services received, access to credit,
representation in political institutions, and so on. These are not dimensions of well-being in
themselves but are closely related. They may be contextual. Thus, both the impact and
outcome indicators should constitute the final indicators of impact assessment and
monitoring impact.

Intermediate indicator: when an indicator measures a factor that determines an outcome or
contributes to the process of achieving an outcome, we call it an “input” or “output” indicator,
depending on the stage of the process—in other words, an "intermediate" indicator.

For example, many things may be needed to raise literacy levels: more schools and teachers,
better textbooks, etc. A measure of public expenditures on classrooms and teachers would be
‘input’ indicators, while measures of classrooms built and teachers trained would be ‘output’
indicators. What is important is that inputs and outputs are not goals in themselves; rather,
they help to achieve the chosen goals.

Features of Good Indicators

A good indicator:

• Is a direct and unambiguous measure of progress or change: more (or less) is
unmistakably better.
• Is relevant: it measures factors that reflect the objectives.
• Varies across areas and groups over time and is sensitive to policy changes, programs,
and institutions.
• Is not easily blown off course by unrelated developments and cannot be easily
manipulated to show achievement where none exists.
• Can be tracked (better if already available), is available frequently, and is not too costly
to track.

Identification and Selection of Indicators for Impact Monitoring and Assessment

Once a set of goals/objectives of the project have been agreed upon through a participatory
analysis process, the next step is to identify indicators—also in a participatory way—to
measure progress toward those goals as a result of an intervention or a development project.
The impact monitoring and assessment depend critically on the choice of appropriate
indicators. Preferably, they should be derived from the identification and descriptions of
relevant variables being given by the clients, with appropriate indicators of them being based
on discussion of all the stakeholders.

Basis for Indicators of Impact Assessment

Indicators should comprise comprehensive information about the program outcomes:

• Indicators of the program impact based on the program objectives are needed to
guide policies and decisions at all levels of society- village, town, city, district, state,
region, nation, continent, and world.

• These indicators must represent all important concerns of all the stakeholders in
the program: An ad-hoc collection of indicators that just seem relevant is not
adequate. A more systematic approach must look at the interaction of the program
components with the environment.

• The number of indicators should be as small as possible but not smaller than
necessary. The indicator set must be comprehensive and compact, covering all
relevant aspects.

• The process of finding an indicator set must be participatory to ensure that the set
encompasses the visions and values of the community or region for which it is
developed.

• Indicators must be clearly defined, reproducible, unambiguous, understandable, and


practical. They must reflect the interests and views of different stakeholders.

• From a look at these indicators, it must be possible to deduce the viability and
sustainability of change due to a project program and current developments and to
compare with alternative change/development paths.

• A framework, a process, and criteria for finding an adequate set of indicators to


assess all aspects of the program's impact are needed.

Appropriate Tools

Participatory Rural Appraisal (PRA) tools are often only seen as appropriate for gathering
information at the beginning of an intervention, as part of a process of appraisal and planning.
Development workers may talk about having ‘done’ a PRA, sometimes seeing it as a step
towards getting funding. However, PRA tools have a much wider range of potential uses and
can often be readily adapted and used for participatory monitoring and participatory
evaluation.

A few examples described are as follows:

Transect walk is a means of involving the community in monitoring and evaluating changes
that have occurred over the program intervention period. This method entails direct
observation while incorporating the views of community members.

Spider web diagram is used as a means for participants to monitor and evaluate key areas of
a program. The spider web is a simple diagrammatic tool for discussion use; it does not entail
any direct field observations.

Participatory mapping is perhaps the easiest and most popular participatory tool used here
to evaluate project interventions.

Photographic comparisons are another easy visual tool, here used to stimulate community
discussions in evaluating program interventions.

Timeline is a tool used to elaborate historical change.

Well-being ranking differentiates the benefits that different community members have
gained from the development interventions.

The H-form is a simple monitoring and evaluation tool. This method is particularly designed
for monitoring and evaluation of programs. It was developed in Somalia to assist local people
in monitoring and evaluating local environmental management. The method can be used for
developing indicators, evaluating activities, and facilitating and recording interviews with
individuals regarding tank silt applications.

PME as an integral part of all community-based interventions

However interesting a participatory evaluation at the end of a program might be, without it
having been based on a sound system of participatory monitoring throughout the project
intervention, the evaluation in itself is limited. Thus, the first conclusion to draw is that
monitoring and evaluation should be made a systematic feature of all interventions, seeking
community participation from the outset in defining what should be monitored (indicators),
how often and by whom the monitoring should be conducted, how this information will be
used, etc.

Reducing the number of indicators to a manageable set

A detailed analysis usually produces many components of plausible impact, long viability
impact chains, and potential indicators. Furthermore, there will generally be several, perhaps
many, appropriate indicators for answering each assessment question or particular aspects
of it. It is, therefore, essential to condense the impact analysis system and the indicator set
as much as permissible without losing essential information. There are several possibilities to
do this. They are:

• Aggregation. Use the highest level of aggregation possible. For example, when
applied in the final impact assessment, they are likely to be disaggregated into
smaller components, according to the requirement of the impact assessment
scheme.

• Condensation. Locate an appropriate indicator representing the ultimate effect of


a particular program activity(s), without bothering with indicators for intermediate
effects.

• Weakest-link approach. Identify the weakest links in the program and define
appropriate indicators. Do not bother with other components that may be vital but
not related to direct program effects.

• Basket average. If several indicators represent somewhat different aspects of the
same question, define an index that provides an average reading of the situation.

• Representative indicator. Indicator/variable that provides reliable information


characteristic of a whole complex situation.

• Subjective viability assessment. If very little quantitative information for a vital


component is available, use a summary subjective viability assessment indicator.

• Basket minimum. If a particular orientor satisfaction is assessed by each of several
indicators, adopt the one with the currently worst performance as the representative
indicator.

Performance Evaluation

All organizations that have learned the art of “winning from within” by focusing inward on
their employees rely on a systematic performance evaluation process to measure and
evaluate employee performance regularly. Ideally, employees are graded annually on their
work anniversaries, based on which they are either promoted or given a suitable distribution
of salary raises. Performance evaluation also directly provides periodic feedback to employees,
such that they are more self-aware in terms of their employee performance
evaluation metrics.

Performance Evaluation is a formal and productive procedure to measure an employee’s work
and results based on their job responsibilities. It is used to gauge the value an employee adds
in terms of increased business revenue compared to industry standards and overall employee
return on investment (ROI).

Purpose of Performance Evaluation

Performance evaluation measures an individual’s or organization’s job performance to


determine how well they fulfill their responsibilities. We will learn about some important
additional purposes of performance evaluation in this section:

➢ Periodic performance evaluation is an employee’s report card that acknowledges the


work he/she has done in a specific time and the scope for improvement.

➢ An employer can provide consistent feedback on an employee’s strengths and strive


for improvement in the areas the employees need to work on.

➢ It is an integrated platform for the employee and employer to attain common ground
on what both think is befitting a quality performance. This helps improve
communication, which usually leads to better and more accurate team metrics and,
thus, improved performance results.

➢ This entire performance evaluation process aims to improve how a team or an


organization function to achieve higher levels of customer satisfaction.

➢ A manager should evaluate his/her team members regularly and not just once a
year. This way, the team can avert new and unexpected problems with constant work
to improve competence and efficiency.

➢ An organization’s management can conduct frequent employee training and skill


development sessions based on the development areas recognized after a
performance evaluation session.

➢ The management can effectively manage the team and conduct productive resource
allocation after evaluating the goals and preset standards of performance.

➢ Regular performance evaluation can help determine the scope of growth in an


employee’s career and the level of motivation with which he/she contributes to an
organization’s success.

➢ Performance evaluation lets an employee understand where he/she stands as


compared to others in the organization.
Benefits of Performance Evaluation

Now that we know why the staff performance measurement process is necessary, let us look
at the top 5 key benefits the employee performance evaluation offers.

Improved communication

In staff performance evaluation processes, managers continuously give team members


feedback. This feedback is based on their assignments, their understanding of them,
completion, and delivery. Using this feedback, employees can improve their work and plug any
gap areas identified by their managers. It also brings to light many issues that the employees
may have and need to be addressed. It helps in open and honest communication between the
manager and the team.

Build a career path

Managers help their employees with assignments and how they can effectively do them. A
performance evaluation meeting is a perfect time to examine an employee’s career path. It
lets the employee know what their future goals are and what they need to do to get there. It
helps them create small and achievable goals, assign deadlines, and work toward completion.
It also lets them know where they stand in the hierarchy and where they will be in the future.

Check levels of engagement

Engaged employees perform better than their counterparts. They are better team players,
are more productive, and help their peers out actively. A staff performance evaluation is a
perfect time to check employee engagement. It will help you understand how engaged the
employee is and let you know what steps you would need to take to ensure high engagement.

Get feedback for yourself

A performance evaluation meeting is not only to give feedback; it is a good opportunity to


get feedback on your performance from the team members. Understand what your gap areas
are and what more you can do to improve the performance of your team members and be a
good mentor to them.

Resources planning

Staff appraisals help in understanding how an employee is performing and what their future
assignments or goals can be. It not only helps in effective goals management but also in
resource planning. You can effectively reallocate your resources or hire new members to add
to your team.
Performance Evaluation Methods

There are five widely used performance evaluation methods. Using only one of these
methods may give an organization one-sided information, while using multiple methods
provides insights from various perspectives, which is instrumental in forming an unbiased
and performance-centric decision.

1. Self-evaluation: Employees are expected to rate themselves using multiple-choice or open-


ended questions by considering some evaluation criteria. After self-evaluation, the
management can fairly assess an employee by considering their thoughts about their
performance. It is an amazing method to get started with employee reviews. An organization’s
management can compare every employee’s self-evaluation with the rating their manager
provides, which makes the performance evaluation process exhaustive and effective. The gap
between self-evaluated ratings and the supervisor’s ratings can be discussed to maintain a
certain level of transparency.

2. 360-degree employee performance evaluation: In this performance evaluation method, an


employee is rated in terms of the advancements made by them within the team and with
external teams. Inputs from supervisors of different departments and evaluations done by
direct supervisors and immediate peers are considered. Thus, in 360-degree feedback, each
employee is rated for the job done according to their job description and the work done by
them in association with other teams.

3. Graphics rating scale: This is one of the performance evaluation methods most widely used by
supervisors. Numeric or text values corresponding to levels from poor to excellent can be used
in this scale, and parallel evaluation of multiple team members can be conducted using this
graphical scale. Employee skills, expertise, conduct, and other qualities can be evaluated
compared to others in a team. It is important to make each employee understand the value
of each entity of the scale in terms of success and failure. This scale should ideally be the
same for each employee.

4. Developmental checklists: Every organization has a roadmap for each employee for their
development and exhibited behavior. Maintaining a checklist for development is one of the
most straightforward performance evaluation methods. This checklist has several
dichotomous questions, the answers of which need to be positive. If not, then the employee
requires some developmental training in the areas where they need improvement.

5. Demanding events checklist: There are events in each employee’s career with an
organization where they must exhibit immense skill and expertise. An intelligent manager
always lists demanding events where employees show good or bad qualities.

Conclusion

This article has discussed what indicators are, how they can be classified based on the
purpose for which they are used to measure and monitor different impact components, and
the basis for their identification and selection and how to apply them in impact monitoring
and assessment (IMA).

Further, to summarize, a well-conducted performance evaluation is an important component


of the employee’s professional growth and development and the organization’s overall
success. A comprehensive and fair evaluation process allows employees to receive useful
feedback, identify areas for improvement, and set goals for future growth. Organizations can
create a positive and productive evaluation experience for all parties involved by adhering to
best practices such as setting clear employee expectations, providing regular feedback, and
recognizing and rewarding good performance.

The ultimate goal of a performance evaluation is to drive performance improvements, build
positive working relationships, and support the employee’s and the organization’s growth and
success. Performance appraisals can be stressful for employees who are concerned about the
outcome of the meeting and the impact it might have on their position within the
organization. That is why it is so important to create and implement a performance appraisal
system that will benefit both the employee and the organization, while being sensitive to the
needs of employees.

Bibliography

Bossel, H. 1999. Indicators for Sustainable Development: Theory, Method, Applications. A
Report to the Balaton Group. International Institute for Sustainable Development
(IISD), Winnipeg, Manitoba, Canada.

Estrella, M., Blauert, J. et al. (Eds.). Learning from Change: Issues and Experiences in
Participatory Monitoring and Evaluation. Intermediate Technology Publications,
IDRC, Canada.

Feder, G. and Slade, R.H. 1986. "Methodological Issues in the Evaluation and Extension
Research." In Jones, G.E. (Ed.), Investing in Rural Extension: Strategies and Goals.
Elsevier Applied Science Publishers, London.

Herweg, K. and Steiner, K. 2002. Impact Monitoring and Evaluation: Instruments for Use in
Rural Development Projects With a Focus on Sustainable Land Management. Vol. 1:
Procedure. Rural Development Department, World Bank, Washington D.C.

Jaiswal, N.K. and Das, P.K. 1981. Transfer of Technology in Rice Farming. Rural Development
Digest, 4(4): 320-353.

NGO Programme Karnataka-Tamil Nadu. 2005. Participatory Monitoring and Evaluation:
Field Experiences. Intercooperation Delegation, Hyderabad.

Patton, M. 1997. Utilization Focused Evaluation: The New Century Text. Sage Publications,
International Educational and Professional Publishers, New Delhi, Ch. 7-8.

Reddy, L. Narayana. 2006. Participatory Technology and Development. LEISA India,
September 2006, p. 27.

Sustad, J. and Cohen, M. 1998. Toward Guidelines for Lower-Cost Impact Assessment
Methodologies for Microenterprise Programs. A Discussion Paper, AIMS,
Management Systems International, 600 Water Street SW, Washington, DC.

Web Reference: https://kissflow.com/hr/performance-management/employee-performance-appraisal-method/

Chapter 7

Fundamentals of Panel Data Analysis
Umanath Malaiarasan
Madras Institute of Development Studies, Chennai-020

Introduction

The dynamics of nature, society, and human activities are continuously evolving, driven by numerous factors ranging from environmental shifts to socio-economic transformations.
Understanding these complex and interrelated phenomena is essential for addressing
contemporary challenges and shaping future trajectories. The world we inhabit is in a state
of continuous change, with natural systems undergoing profound changes in response to
climate variability, habitat destruction, and resource exploitation. The behavior of nature is
undergoing unprecedented shifts that have far-reaching implications for ecosystems,
biodiversity and human well-being. Concurrently, societal structures and norms are in a state
of instability, shaped by demographic shifts, technological advancements, and cultural
dynamics. Globalization has ushered in a new era of interconnectedness, transforming
patterns of trade, migration, and communication on a scale never before witnessed.
Meanwhile, human activities from production to consumption are reshaping the physical
structure and social landscapes of the earth, putting pressure on natural resources and altering the fabric of communities worldwide.

In this dynamic landscape, traditional research methods, statistical and econometrics analysis
often fall short of capturing the complexity and temporal dynamics of these phenomena.
Cross-sectional studies provide only a static snapshot of reality, overlooking the temporal
dimension crucial for understanding change over time. Similarly, time series analysis may
overlook individual-level variation and the heterogeneity of responses across different
contexts. As we confront the complex challenges of the 21st century—from climate change
and biodiversity loss to inequality and social unrest—there should be alternative approaches
and robust analytical tools capable of capturing the dynamic interactions shaping our world.
Researchers and policymakers need a data analysis that offers a unique vantage point by
combining the strengths of both approaches to examine individuals over time and uncover
patterns of change within and across various levels of analysis.

The importance of panel data analysis in capturing the dynamic nature of nature, society, and
human activities lies in its ability to disentangle the complex inter-relationships among these
phenomena. By longitudinally tracking individuals, households, communities, or regions, panel
data analysis allows researchers to explore how changes in one domain influence and are
influenced by changes in others. For instance, it can shed light on how environmental policies
impact socioeconomic outcomes or how shifts in cultural norms affect individual behaviors
and societal structures. Moreover, panel data analysis facilitates the identification of causal
pathways and feedback loops, providing insights essential for informed decision-making and
policy formulation. In the pages that follow, we will explore the methodological foundations
with suitable empirical applications in panel data analysis.

Panel Data Analysis

Panel data techniques have gained popularity in recent years due to their ability to address
the challenges and limitations of conventional Ordinary Least Squares (OLS) estimations.
OLS estimations often yield uncertain outcomes, and the history of regression analysis is
marked by numerous violations of its assumptions (Bickel, 2007; Gil-Garcia, 2008; Gefen,
Straub & Boudreau, 2000; Hair et al., 1998). These violations can lead to biased and
inefficient estimates, compromising the validity and reliability of the results. To overcome
these challenges, researchers have developed a considerable array of tests and procedures to
identify and rectify OLS violations. However, these adjustments can be complex and time-
consuming, requiring researchers to make assumptions about the nature and extent of the
violations. This introduces additional uncertainty into the analysis and may limit the
generalizability of the findings. In contrast, panel data techniques offer a promising
alternative. By utilizing data collected over time from the same individuals, organizations, or
units, panel data methods allow researchers to control for unobserved heterogeneity and
time-invariant factors that may complicate the analysis. This longitudinal approach provides
a more comprehensive understanding of the relationships between variables and allows for
the examination of dynamic processes and causal effects.

Panel data analysis occupies a pivotal position at the intersection of time series and cross-
sectional econometrics. Conventionally, time series parameter identification relied on
concepts such as stationarity, pre-determinedness, and uncorrelated shocks, while cross-
sectional parameter identification leaned on exogenous instrumental variables and random
sampling. Panel datasets, by encompassing both dimensions, have expanded the realm of
possible identification arrangements, prompting economists to reevaluate the nature and

sources of parameter identification. One line of inquiry stemmed from utilizing panel data to
control unobserved time-invariant heterogeneity in cross-sectional models. Another strand
aimed to dissect variance components and estimate transition probabilities among states.
Studies in these domains loosely corresponded to early investigations into fixed and random
effects approaches. The former typically sought to measure regressor effects while holding
unobserved heterogeneity constant, while the latter focused on parameters characterizing
error component distributions. A third vein explored autoregressive models with individual
effects and broader models with lagged dependent variables. A significant portion of research
in the first two traditions concentrated on models with strictly exogenous variables. This
differs from time series econometrics, where distinguishing between predetermined and
strictly exogenous variables is fundamental in model specification. However, there are
instances where theoretical or empirical concerns warrant attention to models exhibiting
genuine lack of strict exogeneity after accounting for individual heterogeneity. Various terms
are employed to denote panel data, encompassing pooled data, pooled time series and cross-
sectional data, micropanel data, longitudinal data, and event history analysis, among others
(Baltagi, 2008; Greene, 2012; Gujarati, 2003; Wooldridge, 2002).

Structure of Panel Data

Cross-sectional data

Cross-sectional data refers to observations collected from different individuals, units, or


entities at a single point in time. It provides a snapshot of a population at a specific moment.
These data are often used in social science research, economics, public health, and various
other fields to study relationships between variables, assess differences across groups, and
make comparisons. In cross-sectional studies, researchers collect data from a sample or entire
population at a particular time point or within a relatively short timeframe. This approach
allows for the examination of a wide range of characteristics, behaviors, and outcomes within
the population at that specific moment. Common methods for collecting cross-sectional data
include surveys, censuses, and observational studies. While cross-sectional data offer valuable
insights into a population's characteristics and relationships between variables at a single
point in time, it cannot establish causality or determine the direction of relationships between
variables. Additionally, they may not capture changes over time or account for individual-level
differences that could influence study outcomes. Table 1 shows an example of cross-sectional data: it presents the data for per acre yield, fertilizers, and seed across different farms in a single period of time (2020-21).

Table 1. Per acre yield, quantity use of seed and fertilizers in paddy production for various
farms in 2020-21
Year Farms Yield in quintal Seed in kg Fertilizers in kg
2020-21 Farm 1 60.32 62.42 267.14
2020-21 Farm 2 35.52 52.15 26.56
2020-21 Farm 3 27.68 37.25 117.19
2020-21 Farm 4 39.24 19.82 223.16
2020-21 Farm 5 54.05 15.1 223.22
2020-21 Farm 6 31.88 15.66 62.42
2020-21 Farm 7 54.88 62.6 320.58
2020-21 Farm 8 39.04 95.47 160.34
2020-21 Farm 9 41.18 50.78 161.05
2020-21 Farm 10 19.66 51.88 286.13
2020-21 Farm 11 44.48 56.52 128.75
2020-21 Farm 12 71.93 0.00 181.68
2020-21 Farm 13 45.91 70.54 213.37
2020-21 Farm 14 39.16 20.93 180.88
2020-21 Farm 15 42.73 57.85 165.87

The corresponding equation for the above cross-sectional data can be expressed as:

(1) Cross-section: $Y_i = \alpha + \beta_1 Fert_i + \beta_2 Seed_i + u_i$,   $i = 1, 2, \ldots, n$

where $Y_i$ is the value of the dependent variable, the yield of paddy, for the i-th farm; $Fert_i$ and $Seed_i$ represent the amounts of fertilizers and seed used on the i-th farm; and $\alpha$, the $\beta$s, and $u_i$ represent the intercept, slope coefficients, and error term of the equation, respectively. This equation represents a linear regression model for cross-sectional data, where the goal is to estimate the coefficients $\alpha$ and the $\beta$s that best describe the relationship between the independent variables and the dependent variable for all individual farms in the sample.
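To make the estimation concrete, the following is a minimal illustrative sketch (not part of the chapter's own workflow) of fitting equation (1) in Python, assuming the pandas and statsmodels packages are available; the DataFrame name `cs` and the shortened column names are assumptions, while the values are the first five farms of Table 1.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical cross-sectional data frame: first five farms of Table 1 (2020-21)
cs = pd.DataFrame({
    "Yield": [60.32, 35.52, 27.68, 39.24, 54.05],
    "Seed":  [62.42, 52.15, 37.25, 19.82, 15.10],
    "Fert":  [267.14, 26.56, 117.19, 223.16, 223.22],
})

# OLS of yield on fertilizers and seed across farms at a single point in time
cross_section = smf.ols("Yield ~ Fert + Seed", data=cs).fit()
print(cross_section.params)   # alpha (Intercept), beta_1 (Fert), beta_2 (Seed)
```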
Time series data
Time series data refers to observations collected at regular intervals over a continuous period
for a single entity. In other words, it represents a sequence of data points indexed by time.
Time series data are commonly used in various fields, such as economics, finance,
meteorology, and engineering, to study the behavior of a phenomenon or variable over time.

Observations in a time series are arranged in chronological order, with each observation
corresponding to a specific point in time. These data are typically collected at regular intervals
such as hourly, daily, monthly, or yearly. This regularity facilitates the analysis of periodic
fluctuations and trends over different time scales. Time series data can involve a single
variable (univariate time series) or multiple variables (multivariate time series). Univariate
time series focuses on the behavior of a single variable over time, while multivariate time
series considers the interactions between multiple variables. Time series data often exhibit
stochastic or random behavior, meaning that they are subject to inherent variability and
uncertainty. This stochastic component can arise from various sources, including random
fluctuations, external shocks, and measurement errors. Time series data often exhibit
autocorrelation, indicating that observations are correlated with themselves over time.
Stationarity is a fundamental concept in time series analysis, referring to the stability of
statistical properties over time. A stationary time series has a constant mean, variance, and
autocovariance structure over time, making it easier to model and analyze. Time series data
are analyzed using various statistical techniques, including time series models, spectral
analysis, cointegration, and forecasting methods. These methods allow researchers to identify
patterns, estimate parameters, make predictions, and infer causal relationships from time
series data. Table 2 shows the example for time-series data, i.e., it presents the data for per
acre yield, fertilizers, and seed over a period of time for a single farm.

Table 2. Per acre yield, quantity use of seed and fertilizers in paddy production for the single
farm over a period of time
Year Farms Yield in quintal Fertilizers in kg Seed in kg
2004-05 Farm1 22.19 9.32 65.74
2005-06 Farm1 25.17 9.78 65.05
2006-07 Farm1 16.71 10.17 67.16
2007-08 Farm1 25.38 10.75 64.14
2008-09 Farm1 26.75 8.95 63.22
2009-10 Farm1 25.83 12.61 64.62
2010-11 Farm1 29.58 15.67 62.2
2011-12 Farm1 26.51 16.6 60.98
2012-13 Farm1 31.41 16.04 58.43
2013-14 Farm1 31.84 16.66 58.02
2014-15 Farm1 32.45 23.96 61.8

2015-16 Farm1 32.82 21.76 60.62
2016-17 Farm1 32.69 24.7 59.38
2017-18 Farm1 33.44 24.01 56.55
2018-19 Farm1 34.52 24 56.74
2019-20 Farm1 34.79 24.68 56.02
2020-21 Farm1 35.52 26.56 52.15

The general form of the equation for the above time series data can be expressed as follows:

(2) Time-series: $Y_t = \alpha + \beta_1 Fert_t + \beta_2 Seed_t + u_t$,   $t = 1, 2, \ldots, T$

where $Y_t$ is the value of the dependent variable, the yield of paddy, in the t-th period (year) for a single farm; $Fert_t$ and $Seed_t$ represent the amounts of fertilizers and seed used in the t-th year; and $\alpha$, the $\beta$s, and $u_t$ represent the intercept, slope coefficients, and error term of the equation, respectively. This equation represents a linear regression model for time series data, where the goal is to estimate the coefficients $\alpha$ and the $\beta$s that best describe the relationship between the independent variables and the dependent variable over all the years in the sample.

Panel data
The organization of data in a panel format involves recording individual observations for each
variable across different time points. The temporal units can vary, spanning years, months,
weeks, days, and even shorter intervals such as hours, minutes, and seconds. The choice of
time units depends on the anticipated behavior of the variable over time. Researchers may
explore various time expressions, including lagged, linear, squared, and quadratic
representations. Each case within the panel signifies an individual observation of a specific
variable from panels such as individuals, groups, firms, organizations, cities, states, countries,
etc., and an identifier for each case is essential. In principle, it is feasible to estimate time
series for each case or cross-sectional regressions for each time unit using the corresponding
equations (1) and (2). These expressions depict simple pooled Ordinary Least Squares (OLS) models; when this specification is applied to panel data (equation (3)), it is referred to as pooled OLS regression.

The pooled panel data approach aggregates observations for each case over time without
distinguishing between cases, thereby neglecting the effects across individuals and time.
Consequently, this estimation may distort the true relationships among variables studied
across cases and over time. Table 3 shows an example of panel data: it presents the data for per acre yield, fertilizers, and seed across four individual farms over a three-year period (2018-19 to 2020-21).

Table 3. Per acre yield, quantity use of seed and fertilizers in paddy production across
different farms over different time periods

Year Farms Yield in quintal Seed in kg Fertilizers in kg

2020-21 Farm1 60.32 62.42 267.14


2019-20 Farm1 63.71 59.65 277.24
2018-19 Farm1 64.35 44.91 256.41
2019-20 Farm2 34.79 56.02 24.68
2020-21 Farm2 35.52 52.15 26.56
2018-19 Farm2 34.52 56.74 24
2018-19 Farm3 29.68 43.88 131.71
2020-21 Farm3 27.68 37.25 117.19
2019-20 Farm3 30.05 42.1 128.64
2018-19 Farm4 42.58 21.18 195.87
2019-20 Farm4 45.97 21.35 196
2020-21 Farm4 39.24 19.82 223.16

The general form of the equation for the above panel data can be expressed as follows:

(3) Panel data: $Y_{it} = \alpha + \beta_1 Fert_{it} + \beta_2 Seed_{it} + u_{it}$

where $Y_{it}$ represents the value of the dependent variable for the i-th farm in the t-th time period; $Fert_{it}$ and $Seed_{it}$ represent the amounts of fertilizers and seed used on the i-th farm in the t-th time period; and $\alpha$, the $\beta$s, and $u_{it}$ represent the intercept, slope coefficients, and error term of the equation, respectively.

Types of Panel Data

In general, the observations within a sample remain consistent across all time periods.
However, there are instances, especially in random surveys, where the observations in one
period's sample differ from those in another. This distinction leads to what is known as a
balanced panel dataset for the former (Table 3) and an unbalanced panel dataset for the
latter (Table 4). Generally, an unbalanced panel dataset arises due to missing observations
for certain variables over specific time periods during the data collection process. Apart from
these, there are other forms of panel data, namely short, long, and dynamic panel data sets. Short panels have a limited number of time periods relative to the number of cross-sectional units observed; for example, in Table 5 the number of farms (4) is greater than the number of time periods (2 years). In contrast, long panels span a large number of time periods, allowing for extensive longitudinal analysis; for example, in Table 6 the number of time periods (6 years) is greater than the number of panel units (2 states). Dynamic panels incorporate lagged values of variables to capture temporal dependencies and serial correlation, enabling researchers to analyze dynamic processes over time; for example, in Table 7 the one-year lagged value of yield ($Y_{it-1}$) is taken as one of the independent variables in the data set.

Table 4. Unbalanced panel data


Year Farms Yield in quintal Seed in kg Fertilizers in kg
2020-21 Farm 1 60.32 62.42 267.14
2019-20 Farm 1 63.71 59.65 277.24
2018-19 . . . .
2019-20 Farm 2 34.79 56.02 124.68
2020-21 Farm 2 35.52 52.15 126.56
2018-19 Farm 2 34.52 56.74 124.00
2020-21 Farm 3 29.68 43.88 131.71
2019-20 . . . .
2018-19 Farm 3 30.05 42.10 128.64
2018-19 Farm 4 42.58 21.18 195.87
2019-20 Farm 4 45.97 21.35 196.00
2020-21 Farm 4 39.24 19.82 223.16

Table 5. Short panel data (Micro panel)
Year Farms Yield in quintal Seed in kg Fertilizers in kg
2020-21 Farm 1 60.32 62.42 267.14
2019-20 Farm 1 63.71 59.65 277.24
2019-20 Farm 2 34.79 56.02 24.68
2020-21 Farm 2 35.52 52.15 26.56
2018-19 Farm 3 29.68 43.88 131.71
2020-21 Farm 3 27.68 37.25 117.19
2018-19 Farm 4 42.58 21.18 195.87
2019-20 Farm 4 45.97 21.35 196

Table 6. Long panel data (Macro panel)


Year State Yield in quintal Seed in kg Fertilizers in kg
2020-21 Andhra Pradesh 60.32 62.42 267.14
2019-20 Andhra Pradesh 63.71 59.65 277.24
2018-19 Andhra Pradesh 64.35 44.91 256.41
2017-18 Andhra Pradesh 34.79 56.02 24.68
2016-17 Andhra Pradesh 35.52 52.15 26.56
2015-16 Andhra Pradesh 34.52 56.74 24
2018-19 Bihar 29.68 43.88 131.71
2020-21 Bihar 27.68 37.25 117.19
2019-20 Bihar 30.05 42.1 128.64
2017-18 Bihar 42.58 21.18 195.87
2016-17 Bihar 45.97 21.35 196
2015-16 Bihar 39.24 19.82 223.16

Table 7. Dynamic panel data


Year State Yield (Yit) in quintal Seed (kg) Fertilizers (kg) Lagged Yield (Yit-1)
2020-21 Andhra Pradesh 60.32 62.42 267.14 -
2019-20 Andhra Pradesh 63.71 59.65 277.24 60.32
2018-19 Andhra Pradesh 64.35 44.91 256.41 63.71
2017-18 Andhra Pradesh 34.79 56.02 24.68 64.35
2016-17 Andhra Pradesh 35.52 52.15 26.56 34.79
2015-16 Andhra Pradesh 34.52 56.74 24 35.52
2018-19 Bihar 29.68 43.88 131.71 -
2020-21 Bihar 27.68 37.25 117.19 29.68
2019-20 Bihar 30.05 42.1 128.64 27.68
2017-18 Bihar 42.58 21.18 195.87 30.05
2016-17 Bihar 45.97 21.35 196 42.58
2015-16 Bihar 39.24 19.82 223.16 45.97
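The following is a small pandas sketch (an illustration, not taken from the chapter) of two of the data-handling ideas above: checking whether a panel is balanced and constructing the one-period lagged yield used in a dynamic panel. The DataFrame name `panel` and the shortened column names are assumptions; the yield values are taken from Table 7.

```python
import pandas as pd

# Hypothetical long-format panel: Andhra Pradesh and Bihar yields from Table 7
panel = pd.DataFrame({
    "State": ["Andhra Pradesh"] * 3 + ["Bihar"] * 3,
    "Year":  ["2018-19", "2019-20", "2020-21"] * 2,
    "Yield": [64.35, 63.71, 60.32, 29.68, 30.05, 27.68],
}).sort_values(["State", "Year"])

# Balanced panel: every unit is observed for the same number of periods
is_balanced = panel.groupby("State")["Year"].nunique().nunique() == 1

# Dynamic panel: one-period lag of yield within each unit; the first observation
# of each unit has no lag and becomes NaN
panel["Yield_lag1"] = panel.groupby("State")["Yield"].shift(1)
print(is_balanced)
print(panel)
```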

Why Panel Data?

Panel data offer several advantages over other types of datasets. Here are some reasons why
panel data are valuable. First, since panel data track entities (individuals, firms, states,
countries, etc.) over time, there is inherent heterogeneity among these units. Each unit may
have unique characteristics, behaviors, or responses to changes over time. Panel data allow researchers to account for the heterogeneity that exists across different individuals or panels.
Second, panel data provide more informative data compared to cross-sectional or time series
data alone. By observing units over time, researchers can capture both within-unit and
between-unit variations. This leads to less collinearity among variables, as the inclusion of
time-series and cross-sectional variation helps separate the effects of different variables.
Third, panel data are particularly well-suited for studying the dynamics of change because
they capture how individuals, firms etc., evolve over time. For example, panel data can
effectively analyze phenomena such as spells of unemployment, job turnover and labor
mobility, providing insights into how these dynamics unfold over time and how various factors
influence them. Fourth, panel data can better detect and measure effects that cannot be
observed using pure cross-sectional or pure time series data. For instance, the effects of
policies like minimum wage laws on employment and earnings can be accurately studied by
incorporating successive waves of minimum wage increases over time, which is possible with
panel data. Fifth, panel data enable the study of more complex behavioral models that involve
interactions between individual units and changes over time. Compared to simpler cross-
sectional or time series data, phenomena such as economies of scale and technological change
can be better understood and modeled using panel data. Sixth, by providing data for several
thousand units over time, panel data can minimize biases from aggregating individuals or
firms into broad aggregates. This large sample size allows for more robust statistical analyses
and reduces the risk of biased estimates.

Estimation of Panel Data Model

a) Pooled OLS regression

In the Pooled OLS model, the relationship between the dependent variable yield (Y) and the
independent variables (Fertilizers, Seed) can be represented as follows:

(4) $Y_{it} = \alpha + \beta_1 Fert_{it} + \beta_2 Seed_{it} + u_{it}$,   $i = 1, 2, \ldots, n$ and $t = 1, 2, \ldots, T$

where $Y_{it}$ is the value of the dependent variable, the yield of paddy, for the i-th farm at time t; $Fert_{it}$ and $Seed_{it}$ represent the amounts of fertilizers and seed used on the i-th farm at time t; and $\alpha$, the $\beta$s, and $u_{it}$ represent the intercept, slope coefficients, and error term of the equation, respectively. This equation represents a linear regression model for combined cross-sectional and time-series data, where the goal is to estimate the coefficients $\alpha$ and the $\beta$s that best describe the relationship between the independent variables and the dependent variable for all individual farms in the sample. The estimated results of the pooled OLS regression on the data given in Table 8 are presented in Table 9.
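As an illustration of how such a pooled OLS might be estimated in practice, the sketch below uses Python with pandas and statsmodels (tools not referenced in the chapter). Only the last three years of Table 8 are keyed in, so the estimates are not expected to reproduce Table 9, which uses all 51 observations; the DataFrame name `df` and the shortened column names are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format panel: last three years of Table 8 for Farms 1-3
df = pd.DataFrame({
    "Farm":  ["Farm1"] * 3 + ["Farm2"] * 3 + ["Farm3"] * 3,
    "Year":  [2019, 2020, 2021] * 3,
    "Yield": [34.52, 34.79, 35.52, 29.68, 30.05, 27.68, 42.58, 45.97, 39.24],
    "Fert":  [24.00, 24.68, 26.56, 131.71, 128.64, 117.19, 195.87, 196.00, 223.16],
    "Seed":  [56.74, 56.02, 52.15, 43.88, 42.10, 37.25, 21.18, 21.35, 19.82],
})

# Pooled OLS stacks all farm-year observations and ignores the panel structure
pooled = smf.ols("Yield ~ Fert + Seed", data=df).fit()
print(pooled.summary())
```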

Table 8. Pooled panel data set

Year Farms Yield in quintal Fertilizers in kg Seed in kg

2004-05 Farm1 22.19 9.32 65.74


2005-06 Farm1 25.17 9.78 65.05
2006-07 Farm1 16.71 10.17 67.16
2007-08 Farm1 25.38 10.75 64.14
2008-09 Farm1 26.75 8.95 63.22
2009-10 Farm1 25.83 12.61 64.62
2010-11 Farm1 29.58 15.67 62.2
2011-12 Farm1 26.51 16.6 60.98
2012-13 Farm1 31.41 16.04 58.43
2013-14 Farm1 31.84 16.66 58.02
2014-15 Farm1 32.45 23.96 61.8
2015-16 Farm1 32.82 21.76 60.62
2016-17 Farm1 32.69 24.7 59.38
2017-18 Farm1 33.44 24.01 56.55
2018-19 Farm1 34.52 24 56.74
2019-20 Farm1 34.79 24.68 56.02
2020-21 Farm1 35.52 26.56 52.15
2004-05 Farm2 22.82 86.3 60.44
2005-06 Farm2 25.78 84.84 53.47
2006-07 Farm2 25.08 81.15 52.68
2007-08 Farm2 29 87.92 51.95
2008-09 Farm2 26.65 80.59 52.34


2009-10 Farm2 18.97 65.32 52.85


2010-11 Farm2 19.29 76.68 52.14
2011-12 Farm2 27.58 97.39 47.42
2012-13 Farm2 24.26 97.36 47.47
2013-14 Farm2 25.2 98.92 46.31
2014-15 Farm2 30.69 103.99 43.85
2015-16 Farm2 27.49 99.79 45.26
2016-17 Farm2 30.81 104.08 44.56
2017-18 Farm2 31.06 106.61 46.85
2018-19 Farm2 29.68 131.71 43.88
2019-20 Farm2 30.05 128.64 42.1
2020-21 Farm2 27.68 117.19 37.25
2004-05 Farm3 34.78 147.71 13.3475
2005-06 Farm3 33.2 143.48 14.94
2006-07 Farm3 32.77 117.01 24.06
2007-08 Farm3 35 140.96 12.81
2008-09 Farm3 38.15 189.39 1.58
2009-10 Farm3 37 201.84 10.2
2010-11 Farm3 42.69 209.06 13
2011-12 Farm3 26.98 207.31 7.93
2012-13 Farm3 27.89 160.15 10.3
2013-14 Farm3 32.73 153.69 10.55
2014-15 Farm3 42.33 175.61 15
2015-16 Farm3 43.17 175.46 14
2016-17 Farm3 44.02 168.82 18
2017-18 Farm3 40.17 186.25 20
2018-19 Farm3 42.58 195.87 21.18
2019-20 Farm3 45.97 196 21.35
2020-21 Farm3 39.24 223.16 19.82

Table 9. Pooled OLS regression results
Regression Statistics
Multiple R 0.64
R Square 0.40
Adjusted R Square 0.38
Standard Error 5.310
Observations 51.00
ANOVA
Df SS MS F Significance F
Regression 2.000 920.031 460.016 16.314 0.000
Residual 48.000 1353.460 28.197
Total 50.000 2273.492
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 40.550 6.679 6.072 0.000 27.122 53.978
Fertilizers -0.003 0.029 -0.110 0.913 -0.061 0.055
Seed -0.221 0.097 -2.287 0.027 -0.415 -0.027

In the above estimated model, it is assumed that the intercept and slope coefficients are uniform across cases and over time, but the panel dataset may not support this assumption. In fact, the estimated model does not distinguish between the various farms, and it does not tell us whether the response of paddy yield to the input variables over time is the same for all the farms. Here, we are assuming the regression coefficients are the same for all the farms, with no distinction between farms. By lumping the effects of different farms at different times into one set of coefficients, we camouflage the heterogeneity (individuality or uniqueness) that may exist among the farms (Gujarati, 2008). The individuality of each farm (unobserved) is subsumed in the disturbance term $u_{it}$. This can cause the error term to correlate with some of the regressors included in the model. As a result, the outcomes could yield biased estimates of the variances of the estimated coefficients, rendering statistical tests and confidence intervals inaccurate (Baltagi, 2008; Gujarati, 2003; Pindyck and Rubinfeld, 1998; Wooldridge, 2002).

Suppose we wish to account for an unobserved heterogeneous variable, such as the managerial skill of farmers, for which no data are observed. Adding it to the panel equation (4), we can write:

(5) $Y_{it} = \alpha + \beta_1 Fert_{it} + \beta_2 Seed_{it} + \beta_3 M_i + u_{it}$

where the additional variable M = management skills of farmers. Of the variables included in
the equation, only the variable M is time-invariant (or time constant) because it varies among
farmers but is constant over time for a given farmer. Although it is time-invariant, the variable
M is not directly observable, and therefore, we cannot measure its contribution to the
production function. We can do this indirectly if we write the equation as:
(6) $Y_{it} = \alpha + \beta_1 Fert_{it} + \beta_2 Seed_{it} + e_i + u_{it}$

where $e_i$ is called the unobserved or heterogeneity effect, reflecting the impact of M on yield. In reality, there may be more such unobserved effects, such as the location of the farm, the nature of ownership, the gender of the farmer, etc. Although such variables may differ among the farmers, they will probably remain the same for any given farmer over the sample period. Since $e_i$ is not directly observable, we can consider it as an unobserved random variable, include it in the error term $u_{it}$, and thereby consider the composite error term $w_{it} = e_i + u_{it}$, so that the equation can be written as:

(7) $Y_{it} = \alpha + \beta_1 Fert_{it} + \beta_2 Seed_{it} + w_{it}$

But if the $e_i$ term included in the error term $w_{it}$ is correlated with any of the regressors in the previous equation (i.e., $Cov(X_{it}, w_{it}) \neq 0$), we have a violation of one of the key assumptions of the OLS regression model, namely that the error term is not correlated with the regressors (i.e., $Cov(X_{it}, w_{it}) = 0$). As we know, in this situation the OLS estimates are not only biased, they are also inconsistent. As there is a real possibility that the unobservable $e_i$ is correlated with one or more of the regressors, autocorrelation may also arise (i.e., $Cov(w_{it}, w_{is}) = \sigma_e^2$ for $t \neq s$), where t and s are different time periods. This covariance is non-zero, and therefore the (unobserved) heterogeneity induces autocorrelation, to which we will have to pay attention.

Problems in pooled regression of panel data


1. Individual Heterogeneity: Pooled regression assumes that all individuals in the panel
have the same intercept and slope coefficients. If there is significant individual
heterogeneity or unobserved individual-specific effects, this can lead to biased and
inefficient estimates. This can violate the assumption of no relationship between
error-term and exogenous variables (i.e., Cov (Xit , uit ) = 0). (A variable is said to be
strictly exogenous if it does not depend on current, past, and future values of the
error term uit).

2. Endogeneity: Endogeneity arises when the independent variables are correlated with
the error term. Pooled regression can suffer from endogeneity issues, especially if
there are time-varying factors that are omitted from the model.
3. Serial Correlation: Pooled regression assumes that observations are independent,
but in panel data, observations for the same individual over time may be correlated.
Therefore, ignoring serial correlation can lead to inefficient standard errors and
biased hypothesis testing.
4. Time Trends: Pooled regression does not account for time-specific trends or
changes. If there are time-varying factors that affect the dependent variable,
neglecting them can result in biased parameter estimates.
5. Dynamic Panel Bias: If lagged dependent variables are included as regressors in a
pooled regression with panel data, dynamic panel bias may occur. This bias arises
due to correlation (autocorrelation) between the lagged dependent variable and
unobserved individual-specific effects.
6. Inefficiency: Pooled regression may be less efficient compared to models that
account for individual-specific effects, such as fixed effects or random effects
models. Inefficiency can result in imprecise parameter estimates.

b) The Fixed Effect models

The issue of heterogeneity in pooled panel regression is addressed by controlling the


heterogeneity through different mechanisms such as fixed effect models (Least Squares
Dummy Variable (LSDV) model, within-group model, first difference model) and the random
effects model (REM). The fixed effects (FE) model is a statistical technique used in panel
data analysis to control for unobserved individual-specific heterogeneity. These fixed effects
can either control or remove the unobserved individual heterogeneity that does not vary over
time in any given data set.

Least-Squares Dummy Variable (LSDV) fixed effect model


The LSDV fixed effects model is a method used in panel data analysis to control for
unobserved individual-specific heterogeneity. The LSDV model allows for heterogeneity
among individuals by allowing each individual (or farm) to have its own intercept value by
including dummy variables for each individual in the regression equation to control the effect
of unobserved heterogeneity. Let's consider the general form of the panel data equation
along with the heterogeneity error term:

(8) $Y_{it} = \alpha + \beta_1 Fert_{it} + \beta_2 Seed_{it} + e_i + u_{it}$

where $e_i$ is an unobserved heterogeneity (farm-dependent) error term. It is fixed over time and varies across farms. The term “fixed effects” is used because, although the intercept may differ across farms, each farm's intercept does not vary over time (i.e., it is time-invariant). So the model can be expressed as follows:

(9) $Y_{it} = \alpha_{1i} + \beta_1 Fert_{it} + \beta_2 Seed_{it} + u_{it}$

Writing the intercept as $\alpha_{1i}$, the subscript i indicates that the intercepts may differ across farms while each farm's intercept remains constant over time. The fixed effect model above also assumes that the slope coefficients of the regressors do not vary across individuals or over time. Now, we can allow the (fixed effect) intercept to vary among the farms as:

(10) $Y_{it} = \alpha_0 + \alpha_1 D1_i + \alpha_2 D2_i + \beta_1 Fert_{it} + \beta_2 Seed_{it} + u_{it}$

where $D1_i = 1$ for farm 2 and 0 otherwise; $D2_i = 1$ for farm 3 and 0 otherwise; and so on. Since we have 3 farms, we have introduced only 2 dummy variables to avoid falling into the dummy-variable trap (i.e., the situation of perfect collinearity). Here, we treat farm 1 as the base or reference category, and its effect is captured in the model's intercept. An example data structure and the estimated results are presented in Tables 10 and 11, respectively.

Table 10. Data for LSDV fixed effect model


Year Farms Yield in quintal Seed in kg Fertilizers in kg D1 for farm 2 D2 for farm 3
2004-05 Farm1 22.19 65.74 9.32 0 0
2005-06 Farm1 25.17 65.05 9.78 0 0
2006-07 Farm1 16.71 67.16 10.17 0 0
2007-08 Farm1 25.38 64.14 10.75 0 0
2008-09 Farm1 26.75 63.22 8.95 0 0
2009-10 Farm1 25.83 64.62 12.61 0 0
2010-11 Farm1 29.58 62.2 15.67 0 0
2011-12 Farm1 26.51 60.98 16.6 0 0
2012-13 Farm1 31.41 58.43 16.04 0 0
2013-14 Farm1 31.84 58.02 16.66 0 0
2014-15 Farm1 32.45 61.8 23.96 0 0
2015-16 Farm1 32.82 60.62 21.76 0 0
2016-17 Farm1 32.69 59.38 24.7 0 0
2017-18 Farm1 33.44 56.55 24.01 0 0
2018-19 Farm1 34.52 56.74 24 0 0
2019-20 Farm1 34.79 56.02 24.68 0 0

2020-21 Farm1 35.52 52.15 26.56 0 0
2004-05 Farm2 22.82 60.44 86.3 1 0
2005-06 Farm2 25.78 53.47 84.84 1 0
2006-07 Farm2 25.08 52.68 81.15 1 0
2007-08 Farm2 29 51.95 87.92 1 0
2008-09 Farm2 26.65 52.34 80.59 1 0
2009-10 Farm2 18.97 52.85 65.32 1 0
2010-11 Farm2 19.29 52.14 76.68 1 0
2011-12 Farm2 27.58 47.42 97.39 1 0
2012-13 Farm2 24.26 47.47 97.36 1 0
2013-14 Farm2 25.2 46.31 98.92 1 0
2014-15 Farm2 30.69 43.85 103.99 1 0
2015-16 Farm2 27.49 45.26 99.79 1 0
2016-17 Farm2 30.81 44.56 104.08 1 0
2017-18 Farm2 31.06 46.85 106.61 1 0
2018-19 Farm2 29.68 43.88 131.71 1 0
2019-20 Farm2 30.05 42.1 128.64 1 0
2020-21 Farm2 27.68 37.25 117.19 1 0
2004-05 Farm3 34.78 13.3475 147.71 0 1
2005-06 Farm3 33.2 14.94 143.48 0 1
2006-07 Farm3 32.77 24.06 117.01 0 1
2007-08 Farm3 35 12.81 140.96 0 1
2008-09 Farm3 38.15 1.58 189.39 0 1
2009-10 Farm3 37 10.2 201.84 0 1
2010-11 Farm3 42.69 13 209.06 0 1
2011-12 Farm3 26.98 7.93 207.31 0 1
2012-13 Farm3 27.89 10.3 160.15 0 1
2013-14 Farm3 32.73 10.55 153.69 0 1
2014-15 Farm3 42.33 15 175.61 0 1
2015-16 Farm3 43.17 14 175.46 0 1
2016-17 Farm3 44.02 18 168.82 0 1
2017-18 Farm3 40.17 20 186.25 0 1
2018-19 Farm3 42.58 21.18 195.87 0 1
2019-20 Farm3 45.97 21.35 196 0 1
2020-21 Farm3 39.24 19.82 223.16 0 1

Table 11. Estimated results of LSDV fixed effect model
Regression Statistics
Multiple R 0.78
R Square 0.61
Adjusted R Square 0.57
Standard Error 4.41
Observations 51.00
ANOVA
Df SS MS F Significance F
Regression 4.000 1380.541 345.135 17.780 0.000
Residual 46.000 892.950 19.412
Total 50.000 2273.492
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept
(base farm1) 30.592 8.440 3.625 0.001 13.603 47.580
Seed -0.054 0.134 -0.403 0.689 -0.323 0.215
Fertilizers 0.112 0.034 3.246 0.002 0.043 0.181
D1 for farm2 -12.256 3.010 -4.072 0.000 -18.315 -6.198
D2 for farm 3 -11.945 6.650 -1.796 0.079 -25.329 1.440

As a result, the intercept $\alpha_0$ is the intercept value of farm 1, for which $D1 = D2 = 0$, i.e., $E(Y_{1t}) = \alpha_0 + \beta_1 Fert_{1t} + \beta_2 Seed_{1t}$. The other $\alpha$ coefficients represent how much the intercept values of the other farms differ from the intercept value of the first farm. For example, $\alpha_1$ tells by how much the intercept value of the second farm differs from $\alpha_0$; the sum $(\alpha_0 + \alpha_1)$ gives the actual intercept for farm 2. We can write $E(Y_{2t}) = (\alpha_0 + \alpha_1) + \beta_1 Fert_{2t} + \beta_2 Seed_{2t}$ for farm 2 and $E(Y_{3t}) = (\alpha_0 + \alpha_2) + \beta_1 Fert_{3t} + \beta_2 Seed_{3t}$ for farm 3.

Summary of intercept values for the above problem:

Farm 1 = $\alpha_0$;
Farm 2 = $\alpha_0 + \alpha_1$;
Farm 3 = $\alpha_0 + \alpha_2$

Thus, given $\alpha_0$, the other dummy variable coefficients $\alpha_1$ and $\alpha_2$ tell us by how much the intercept values of farms 2 and 3 differ from that of farm 1. The coefficients from the fixed effect model produce estimators known as fixed effect estimators. This is called a one-way fixed effect model, as intercepts vary only across farms (to account for heterogeneity) but not across time. We can also allow for a time effect if we believe that the yield function changes over time because of technological change, changes in government regulation and/or tax policies, and other such effects. Such a time effect can be easily accounted for by introducing time dummies, one for each year from 2004-05 to 2020-21. We can also consider the two-way fixed effects model if we allow for both time and farm effects.
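A minimal sketch of how the LSDV model in equation (10) could be estimated with the formula interface of statsmodels is given below, reusing the hypothetical DataFrame `df` from the pooled OLS sketch above; `C(Farm)` and the resulting dummy names are conventions of that interface, not of the chapter.

```python
import statsmodels.formula.api as smf

# LSDV: C(Farm) generates farm dummies and drops one category automatically,
# so Farm1 is the base farm whose effect is absorbed in the intercept
# (this avoids the dummy-variable trap)
lsdv = smf.ols("Yield ~ Fert + Seed + C(Farm)", data=df).fit()
print(lsdv.params)   # intercept of the base farm plus the differential intercepts
```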

Disadvantages of the Fixed Effect LSDV Model


While the Least Squares Dummy Variable (LSDV) fixed effects model is a commonly used
method for controlling for unobserved individual-specific heterogeneity in panel data analysis,
it also comes with several disadvantages:
1) High dimensionality: The LSDV approach involves adding a separate dummy variable
for each individual in the panel dataset. This creates a high-dimensional regression
equation, especially when dealing with large datasets with many individuals. As the
number of cross-sectional individuals increases, the number of dummy variables also
increases, which results in computational challenges and estimation complexities,
particularly when the dataset contains fewer observations. For effective estimation,
the number of parameters to be estimated should be less than the sample size. In
the LSDV framework, there is a risk of having more parameters than the total
sample size, including both cross-sectional and time-series data. This violates one
of the assumptions of OLS regression and makes the LSDV model inestimable for
unbiased results.
2) Loss of efficiency: In the LSDV model, individual-specific fixed effects are estimated
separately for each individual, resulting in a loss of efficiency compared to other
fixed effects estimators, such as the Within-Group Estimator. This loss of efficiency
can be particularly pronounced when the number of individuals is large relative to
the number of periods.
3) Inability to estimate Time-Invariant variables: The LSDV model effectively removes
the variation of time-invariant variables within individuals since the fixed effects
absorb all the constant variation over time. As a result, the LSDV model cannot
estimate the effects of time-invariant variables on the dependent variable. For

instance, suppose we want to estimate a wage function for a group of workers using
panel data. Besides wage, a wage function may include age, experience, and
education as explanatory variables. We can also add gender category, color, and
ethnicity as additional variables in the model, and these variables will not change
over time for an individual subject; the LSDV approach may not be able to identify
the impact of such time-invariant variables on wages.
4) Omitted variable bias: If time-varying omitted variables correlate with both the
independent and dependent variables, the LSDV model may suffer from omitted
variable bias. While including individual-specific fixed effects helps control for time-
invariant omitted variables, it does not address the bias introduced by time-varying
omitted variables.
5) Heterogeneity in slopes ignored: The LSDV model assumes that the coefficients of
the independent variables are constant across individuals. However, in many cases,
there may be heterogeneity in the slopes of the relationships between the
independent and dependent variables across individuals. The LSDV model does not
allow for such heterogeneity in slopes.
6) Biased estimates for Time-Invariant variables with perfect collinearity: In the
presence of perfect collinearity between time-invariant independent variables and
individual fixed effects, the LSDV model produces biased estimates for the
coefficients of those variables. This issue arises because the individual fixed effects
absorb all the variation in the time-invariant variables, making it impossible to
identify their effects separately.

Pooled OLS vs LSDV Fixed Effect Model

When deciding between a Pooled OLS regression and a Fixed Effects model for panel data
analysis, the Restricted or Partial F-test and the Wald test of differential intercept can be
useful tools to assess which model is better suited for the data.

a) Restricted or Partial F-test


The Restricted or Partial F-test compares the fit of two nested regression models, one of which is a restricted version of the other. In the context of panel data analysis, this test can be applied to compare the Pooled OLS model (which assumes that there are no individual-specific effects) with the Fixed Effects model (which includes individual-specific fixed effects), $Y_{it} = \alpha_0 + \alpha_1 D1_i + \alpha_2 D2_i + \beta_1 Fert_{it} + \beta_2 Seed_{it} + u_{it}$.

Null hypothesis ($H_0$): all the differential intercepts are equal to zero, i.e., $H_0$: $\alpha_1 = \alpha_2 = 0$.

In the context of choosing between a Pooled OLS and a Fixed Effects model for panel data analysis, the F-test formula for comparing the fit of the two regression models can be expressed as follows. Suppose we have two models:

Restricted Model (Pooled OLS): $Y_{it} = \alpha + \beta_1 Fert_{it} + \beta_2 Seed_{it} + u_{it}$

Full Model (Fixed Effects): $Y_{it} = \alpha_0 + \alpha_1 D1_i + \alpha_2 D2_i + \beta_1 Fert_{it} + \beta_2 Seed_{it} + u_{it}$

To conduct the F-test, we estimate both the restricted (Pooled OLS) and full (Fixed Effects) regression models using Ordinary Least Squares (OLS) regression. The F-statistic can be calculated as:

(11) $F = \dfrac{(SSE_R - SSE_C)/k_C^*}{SSE_C/(n - k_C)}$

where
$SSE_R$ = error sum of squares of the restricted model (pooled OLS)
$SSE_C$ = error sum of squares of the complete model (FE-LSDV)
$k_C^*$ = number of additional coefficients in the complete model
$k_C$ = number of coefficients in the complete model
$n$ = sample size
$R_C^2$ = $R^2$ from the complete model
$R_R^2$ = $R^2$ from the restricted model

If the estimated F-statistic exceeds the F-table value at a chosen significance level (1%, 5%, or 10%), we reject the null hypothesis that all the differential intercepts are equal to zero (i.e., that there is no individual heterogeneity effect) and accept that an individual effect is present. The inclusion of the differential intercepts significantly improves the model, and therefore the FEM model is preferred. This means that accounting for heterogeneity in the model is important.
From the estimated regression results presented in Tables 9 and 11 for the pooled OLS and fixed effect models, respectively, we can calculate the F-value and compare it with the table value for the decision:

$F = \dfrac{(1353.46 - 892.9504)/2}{892.9504/(51 - 5)} = 11.861$

Since the estimated F-statistic (11.861) exceeds the F-table value (3.18) at the 5% level of significance, we reject the null hypothesis that all the differential intercepts are equal to zero (i.e., that there is no individual heterogeneity) and accept that an individual effect is present. The inclusion of the differential intercepts significantly improves the model, and therefore the FEM model is preferred. This means that accounting for heterogeneity in the model is important in this case.
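The same calculation can be scripted directly from the reported sums of squares; the sketch below is an illustration using scipy (an assumption, not part of the chapter), with the SSE values taken from Tables 9 and 11.

```python
from scipy import stats

# Error sums of squares from Tables 9 (pooled OLS) and 11 (LSDV)
sse_r, sse_c = 1353.460, 892.950
n, k_c, k_star = 51, 5, 2   # observations, coefficients in full model, extra intercepts

f_stat = ((sse_r - sse_c) / k_star) / (sse_c / (n - k_c))
f_crit = stats.f.ppf(0.95, dfn=k_star, dfd=n - k_c)        # 5% critical value (about 3.2)
p_value = stats.f.sf(f_stat, dfn=k_star, dfd=n - k_c)
print(round(f_stat, 3), round(f_crit, 3), round(p_value, 4))  # F is about 11.86, so H0 is rejected
```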
b) Wald Test of Differential Intercept
The Wald test is used to test whether certain variables' coefficients significantly differ across
different groups or categories. In the context of panel data analysis, it can be used to test
whether the intercepts (or individual-specific effects) are significantly different across
individuals. Specifically, the test assesses whether the individual-specific intercepts in the
Fixed Effects model are jointly equal to zero. If the null hypothesis is rejected, it indicates that
there are significant differences in intercepts across individuals, supporting the use of the
Fixed Effects model. In the context of model selection, if the p-value associated with the Wald
test is below the chosen significance level, it suggests that the Fixed Effects model is
preferred over the Pooled OLS model, as it captures individual-specific effects that are not
accounted for in the Pooled OLS model.
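As a hedged illustration, a Wald-type test of the differential intercepts can be requested from the hypothetical `lsdv` fit of the earlier LSDV sketch; the constraint strings follow the dummy names generated by the statsmodels formula interface and are illustrative only.

```python
# Joint restriction that both farm dummies are zero, i.e. H0: alpha_1 = alpha_2 = 0
wald = lsdv.wald_test("C(Farm)[T.Farm2] = 0, C(Farm)[T.Farm3] = 0")
print(wald)   # a small p-value favours the fixed effects specification over pooled OLS
```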

The Fixed-Effect within-Group (WG) Estimator

The Fixed-Effect Within-Group (WG) Estimator is also a Fixed Effect method used to control
for unobserved individual-specific heterogeneity. This approach removes individual-specific
effects by de-meaning the data within each group (farm). This process requires calculating the mean values of the dependent and explanatory variables for each farm and then subtracting these means from each individual value of the variables. These adjusted values are
commonly referred to as "de-meaned" or mean-corrected values. This procedure is repeated
for each farm, resulting in a set of de-meaned values for each variable. Subsequently, all the
de-meaned values across all farms are pooled together and an OLS regression is performed
on the combined dataset, consisting of the pooled mean-corrected values from farms.
We express each variable as a deviation from its time mean to remove this heterogeneity, i.e., by differencing the values of the variables around their sample means we effectively eliminate the heterogeneity in the data set. Let us take the time average of equation (6),

(6) $Y_{it} = \alpha + \beta_1 Fert_{it} + \beta_2 Seed_{it} + e_i + u_{it}$,

which gives

(12) $\bar{Y}_i = \alpha + \beta_1 \overline{Fert}_i + \beta_2 \overline{Seed}_i + \bar{e}_i + \bar{u}_i$,

and subtract it from equation (6):

(13) $Y_{it} - \bar{Y}_i = \beta_1 (Fert_{it} - \overline{Fert}_i) + \beta_2 (Seed_{it} - \overline{Seed}_i) + (e_i - \bar{e}_i) + (u_{it} - \bar{u}_i)$

Here, the observations are said to be “mean-corrected”, “time-demeaned”, or “de-meaned”. As $e_i$ is not time-dependent (it is constant over time), $\bar{e}_i = e_i$ and hence $e_i - \bar{e}_i = 0$. Thus, by subtracting the time mean from each variable, we effectively remove the unobserved farm-dependent (heterogeneity) error term $e_i$. Also, as the intercept $\alpha$ is a constant, its mean is equal to its value, so subtracting it from itself results in a zero value. The final form of the within-group model is as follows:

(14) $Y_{it} - \bar{Y}_i = \beta_1 (Fert_{it} - \overline{Fert}_i) + \beta_2 (Seed_{it} - \overline{Seed}_i) + \mu_{it}$, which we can write as

(15) $D\_Y_{it} = \beta_1 D\_Fert_{it} + \beta_2 D\_Seed_{it} + \mu_{it}$

Once the data are demeaned, the model is estimated using OLS regression under the exogeneity assumption that the regressors are uncorrelated with the idiosyncratic error, $Cov(X_{it}, u_{it}) = 0$; correlation between the regressors and the farm effect $e_i$ is permitted, since $e_i$ has been removed by the transformation. The model typically includes the demeaned independent variables and a constant term. Only this assumption is necessary for consistent estimators. In the Fixed-Effect Within-Group (WG) Estimator, the coefficient estimates represent the within-group effects of the independent variables on the dependent variable. These coefficients capture the relationship between the variables after controlling for time-invariant individual-specific effects. Statistical inference, such as hypothesis testing and confidence interval estimation, can be performed based on standard OLS procedures applied to the demeaned data.
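A minimal sketch of this within-group transformation is given below, again reusing the hypothetical DataFrame `df` from the pooled OLS sketch; the `D_`-prefixed column names mirror equation (15) and are illustrative only.

```python
import statsmodels.formula.api as smf

# Express each variable as a deviation from its farm mean (de-meaning)
demeaned = df.copy()
for col in ["Yield", "Fert", "Seed"]:
    demeaned["D_" + col] = df[col] - df.groupby("Farm")[col].transform("mean")

# OLS on the de-meaned data without an intercept (the constant is differenced out);
# note that the standard errors are not adjusted for the absorbed farm means
within = smf.ols("D_Yield ~ D_Fert + D_Seed - 1", data=demeaned).fit()
print(within.params)   # within-group estimates of beta_1 and beta_2
```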

The Fixed-Effect Within-Group (WG) Estimator has several advantages. It effectively
controls for unobserved individual-specific heterogeneity. It allows for estimating the effects
of time-varying independent variables on the dependent variable while controlling for
individual-specific effects. It is computationally efficient and relatively straightforward to
implement. However, it is important to note that the Fixed-Effect Within-Group (WG)
Estimator also has limitations. It assumes that individual-specific effects are time-invariant.
It does not allow for the estimation of individual-specific effects, which may be of interest in
some cases. It may suffer from bias if the time-varying independent variables are correlated
with the individual-specific effects. The calculation of de-meaned values for the variables and
data set for the Fixed Effect Within Group model and its estimated OLS regression results
are presented in Tables 12 and 13, respectively.

Table 12. Data set for the Fixed-Effect Within-Group (WG) model

Year Farms Yield in quintal Fertilizers in kg Seed in kg D_Y = Y - Ybar D_Fert = Fert - Fertbar D_Seed = Seed - Seedbar

2004-05 Farm1 22.19 9.32 65.74 4.765294 -51.4341 65.74


2005-06 Farm1 25.17 9.78 65.05 7.745294 -50.9741 65.05
2006-07 Farm1 16.71 10.17 67.16 -0.71471 -50.5841 67.16
2007-08 Farm1 25.38 10.75 64.14 7.955294 -50.0041 64.14
2008-09 Farm1 26.75 8.95 63.22 9.325294 -51.8041 63.22
2009-10 Farm1 25.83 12.61 64.62 8.405294 -48.1441 64.62
2010-11 Farm1 29.58 15.67 62.2 12.15529 -45.0841 62.2
2011-12 Farm1 26.51 16.6 60.98 9.085294 -44.1541 60.98
2012-13 Farm1 31.41 16.04 58.43 13.98529 -44.7141 58.43
2013-14 Farm1 31.84 16.66 58.02 14.41529 -44.0941 58.02
2014-15 Farm1 32.45 23.96 61.8 15.02529 -36.7941 61.8
2015-16 Farm1 32.82 21.76 60.62 15.39529 -38.9941 60.62
2016-17 Farm1 32.69 24.7 59.38 15.26529 -36.0541 59.38
2017-18 Farm1 33.44 24.01 56.55 16.01529 -36.7441 56.55
2018-19 Farm1 34.52 24 56.74 17.09529 -36.7541 56.74
2019-20 Farm1 34.79 24.68 56.02 17.36529 -36.0741 56.02
2020-21 Farm1 35.52 26.56 52.15 18.09529 -34.1941 52.15
2004-05 Farm2 22.82 86.3 60.44 -74.1494 38.01647 60.44
2005-06 Farm2 25.78 84.84 53.47 -71.1894 36.55647 53.47


2006-07 Farm2 25.08 81.15 52.68 -71.8894 32.86647 52.68


2007-08 Farm2 29 87.92 51.95 -67.9694 39.63647 51.95
2008-09 Farm2 26.65 80.59 52.34 -70.3194 32.30647 52.34
2009-10 Farm2 18.97 65.32 52.85 -77.9994 17.03647 52.85
2010-11 Farm2 19.29 76.68 52.14 -77.6794 28.39647 52.14
2011-12 Farm2 27.58 97.39 47.42 -69.3894 49.10647 47.42
2012-13 Farm2 24.26 97.36 47.47 -72.7094 49.07647 47.47
2013-14 Farm2 25.2 98.92 46.31 -71.7694 50.63647 46.31
2014-15 Farm2 30.69 103.99 43.85 -66.2794 55.70647 43.85
2015-16 Farm2 27.49 99.79 45.26 -69.4794 51.50647 45.26
2016-17 Farm2 30.81 104.08 44.56 -66.1594 55.79647 44.56
2017-18 Farm2 31.06 106.61 46.85 -65.9094 58.32647 46.85
2018-19 Farm2 29.68 131.71 43.88 -67.2894 83.42647 43.88
2019-20 Farm2 30.05 128.64 42.1 -66.9194 80.35647 42.1
2020-21 Farm2 27.68 117.19 37.25 -69.2894 68.90647 37.25
2004-05 Farm3 34.78 147.71 13.3475 -141.206 133.1178 13.3475
2005-06 Farm3 33.2 143.48 14.94 -142.786 128.8878 14.94
2006-07 Farm3 32.77 117.01 24.06 -143.216 102.4178 24.06
2007-08 Farm3 35 140.96 12.81 -140.986 126.3678 12.81
2008-09 Farm3 38.15 189.39 1.58 -137.836 174.7978 1.58
2009-10 Farm3 37 201.84 10.2 -138.986 187.2478 10.2
2010-11 Farm3 42.69 209.06 13 -133.296 194.4678 13
2011-12 Farm3 26.98 207.31 7.93 -149.006 192.7178 7.93
2012-13 Farm3 27.89 160.15 10.3 -148.096 145.5578 10.3
2013-14 Farm3 32.73 153.69 10.55 -143.256 139.0978 10.55
2014-15 Farm3 42.33 175.61 15 -133.656 161.0178 15
2015-16 Farm3 43.17 175.46 14 -132.816 160.8678 14
2016-17 Farm3 44.02 168.82 18 -131.966 154.2278 18
2017-18 Farm3 40.17 186.25 20 -135.816 171.6578 20
2018-19 Farm3 42.58 195.87 21.18 -133.406 181.2778 21.18
2019-20 Farm3 45.97 196 21.35 -130.016 181.4078 21.35
2020-21 Farm3 39.24 223.16 19.82 -136.746 208.5678 19.82

Table 13. Results of estimated Fixed-Effect Within-Group (WG) model
Regression Statistics
Multiple R 0.957
R Square 0.916
Adjusted R Square 0.913
Standard Error 18.388
Observations 51.000
ANOVA
Df SS MS F Significance F
Regression 2.000 177422.950 88711.475 262.372 0.000
Residual 48.000 16229.425 338.113
Total 50.000 193652.375
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept -41.100 21.396 -1.921 0.061 -84.119 1.919
D_Seed -0.630 0.093 -6.797 0.000 -0.816 -0.444
D_Fertilizers 0.254 0.396 0.642 0.524 -0.542 1.049

First Difference Method

The first difference method in panel data analysis involves taking the first difference of each
variable within the panel model. The first difference method subtracts the value of each
variable in the current period from its value in the previous period for each individual in the
panel. This helps in eliminating individual-specific effects because they are differenced out. It
is particularly useful when dealing with unobserved individual-specific heterogeneity that is
constant over time. Mathematically, we can express it as follows:

(16) $Y_{it} - Y_{it-1} = \beta_1 (Fert_{it} - Fert_{it-1}) + \beta_2 (Seed_{it} - Seed_{it-1}) + (u_{it} - u_{it-1})$

It can be rewritten as:

(17) $\Delta Y_{it} = \beta_1 \Delta Fert_{it} + \beta_2 \Delta Seed_{it} + \Delta u_{it}$

where $\Delta$ indicates the difference between the current and previous period values of the variables. This method helps control for individual-specific effects or time-invariant characteristics that are constant over time. By taking the first difference, these time-invariant factors are eliminated, leaving behind only the changes in the variables over time. The data set for estimating the first difference model and its estimated OLS regression results are presented in Tables 14 and 15, respectively.
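A minimal sketch of the first-difference estimator of equation (17) is shown below, reusing the hypothetical DataFrame `df` from the pooled OLS sketch; the `d`-prefixed column names are illustrative.

```python
import statsmodels.formula.api as smf

# Difference each variable within each farm; the first year of every farm is lost,
# which removes the time-invariant farm effect
fd = df.sort_values(["Farm", "Year"]).copy()
for col in ["Yield", "Fert", "Seed"]:
    fd["d" + col] = fd.groupby("Farm")[col].diff()
fd = fd.dropna()

first_diff = smf.ols("dYield ~ dFert + dSeed", data=fd).fit()
print(first_diff.params)
```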

Table 14. Data set for first difference panel data model

Year Farms Yield (Y) in quintal Fertilizers in kg Seed in kg ∆Y ∆Fertilizers ∆Seed

2004-05 Farm1 22.19 9.32 65.74 -- -- --

2005-06 Farm1 25.17 9.78 65.05 2.98 0.46 -0.69

2006-07 Farm1 16.71 10.17 67.16 -8.46 0.39 2.11

2007-08 Farm1 25.38 10.75 64.14 8.67 0.58 -3.02

2008-09 Farm1 26.75 8.95 63.22 1.37 -1.8 -0.92

2009-10 Farm1 25.83 12.61 64.62 -0.92 3.66 1.4

2010-11 Farm1 29.58 15.67 62.2 3.75 3.06 -2.42

2011-12 Farm1 26.51 16.6 60.98 -3.07 0.93 -1.22

2012-13 Farm1 31.41 16.04 58.43 4.9 -0.56 -2.55

2013-14 Farm1 31.84 16.66 58.02 0.43 0.62 -0.41

2014-15 Farm1 32.45 23.96 61.8 0.61 7.3 3.78

2015-16 Farm1 32.82 21.76 60.62 0.37 -2.2 -1.18

2016-17 Farm1 32.69 24.7 59.38 -0.13 2.94 -1.24

2017-18 Farm1 33.44 24.01 56.55 0.75 -0.69 -2.83

2018-19 Farm1 34.52 24 56.74 1.08 -0.01 0.19

2019-20 Farm1 34.79 24.68 56.02 0.27 0.68 -0.72

2020-21 Farm1 35.52 26.56 52.15 0.73 1.88 -3.87

2004-05 Farm2 22.82 86.3 60.44 -- -- --

2005-06 Farm2 25.78 84.84 53.47 2.96 -1.46 -6.97

2006-07 Farm2 25.08 81.15 52.68 -0.7 -3.69 -0.79

2007-08 Farm2 29 87.92 51.95 3.92 6.77 -0.73

2008-09 Farm2 26.65 80.59 52.34 -2.35 -7.33 0.39

2009-10 Farm2 18.97 65.32 52.85 -7.68 -15.27 0.51

2010-11 Farm2 19.29 76.68 52.14 0.32 11.36 -0.71

2011-12 Farm2 27.58 97.39 47.42 8.29 20.71 -4.72


2012-13 Farm2 24.26 97.36 47.47 -3.32 -0.03 0.05

2013-14 Farm2 25.2 98.92 46.31 0.94 1.56 -1.16

2014-15 Farm2 30.69 103.99 43.85 5.49 5.07 -2.46

2015-16 Farm2 27.49 99.79 45.26 -3.2 -4.2 1.41

2016-17 Farm2 30.81 104.08 44.56 3.32 4.29 -0.7

2017-18 Farm2 31.06 106.61 46.85 0.25 2.53 2.29

2018-19 Farm2 29.68 131.71 43.88 -1.38 25.1 -2.97

2019-20 Farm2 30.05 128.64 42.1 0.37 -3.07 -1.78

2020-21 Farm2 27.68 117.19 37.25 -2.37 -11.45 -4.85

2004-05 Farm3 34.78 147.71 13.3475 -- -- --

2005-06 Farm3 33.2 143.48 14.94 -1.58 -4.23 1.5925

2006-07 Farm3 32.77 117.01 24.06 -0.43 -26.47 9.12

2007-08 Farm3 35 140.96 12.81 2.23 23.95 -11.25

2008-09 Farm3 38.15 189.39 1.58 3.15 48.43 -11.23

2009-10 Farm3 37 201.84 10.2 -1.15 12.45 8.62

2010-11 Farm3 42.69 209.06 13 5.69 7.22 2.8

2011-12 Farm3 26.98 207.31 7.93 -15.71 -1.75 -5.07

2012-13 Farm3 27.89 160.15 10.3 0.91 -47.16 2.37

2013-14 Farm3 32.73 153.69 10.55 4.84 -6.46 0.25

2014-15 Farm3 42.33 175.61 15 9.6 21.92 4.45

2015-16 Farm3 43.17 175.46 14 0.84 -0.15 -1

2016-17 Farm3 44.02 168.82 18 0.85 -6.64 4

2017-18 Farm3 40.17 186.25 20 -3.85 17.43 2

2018-19 Farm3 42.58 195.87 21.18 2.41 9.62 1.18

2019-20 Farm3 45.97 196 21.35 3.39 0.13 0.17

2020-21 Farm3 39.24 223.16 19.82 -6.73 27.16 -1.53

Table 15. Results of the estimated first difference panel model
Regression Statistics
Multiple R 0.176
R Square 0.031
Adjusted R Square -0.012
Standard Error 4.452
Observations 48.000
ANOVA
Df SS MS F Significance F
Regression 2.000 28.580 14.290 0.721 0.492
Residual 45.000 891.935 19.821
Total 47.000 920.515
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 0.330 0.657 0.502 0.618 -0.993 1.653
∆ Fertilizers 0.056 0.050 1.102 0.276 -0.046 0.157
∆ Seed 0.001 0.187 0.006 0.995 -0.376 0.379

The Random Effects Model (REM)

The Random Effects Model is a statistical technique used in panel data analysis to account
for both within-group and between-group variations. The REM extends the basic pooled OLS
model by allowing for entity-specific effects that are not directly observed but are assumed
to follow a specific distribution. Consider the fixed effect model:

(9) $Y_{it} = \alpha_{1i} + \beta_1 Fert_{it} + \beta_2 Seed_{it} + u_{it}$

(10) $Y_{it} = \alpha_0 + \alpha_1 D1_i + \alpha_2 D2_i + \beta_1 Fert_{it} + \beta_2 Seed_{it} + u_{it}$

If we incorporate the farm heterogeneity ($e_i$) within the error term rather than specifying it through dummy variables, and allow for a common intercept, the model can be treated as a REM. Instead of treating $\alpha_{1i}$ as fixed, we assume it to be a random variable with mean $\alpha_1$ and a random farm-specific error term $e_i$ with mean zero and variance $\sigma_e^2$, expressed as follows:

(18) $\alpha_{1i} = \alpha_1 + e_i$

By replacing $\alpha_{1i}$ with $\alpha_1 + e_i$ in equation (9) above, we have the error components or random effects model

(19) $Y_{it} = \alpha_1 + e_i + \beta_1 Fert_{it} + \beta_2 Seed_{it} + u_{it}$

By rearranging it, we have the final form of the random effects model

(20) $Y_{it} = \alpha_1 + \beta_1 Fert_{it} + \beta_2 Seed_{it} + w_{it}$, where $w_{it} = e_i + u_{it}$

The error term 𝑤𝑖𝑡 has two components:


1. farm-specific error term ei
2. idiosyncratic iid error term 𝑢𝑖𝑡 (combined time-series & cross-section error term).

Unlike the fixed effect model, where each farm has its own (fixed) intercept, α1 in the random effects model is a common intercept, that is, the average of the intercepts of all farms. The farm-specific error component ei measures the random deviation of each farm's intercept from the common intercept α1.

Assumptions of Random Effects Model


1. The farm-specific error component ei is assumed to be uncorrelated with the independent variables and to have a mean of zero, i.e., Cov(ei, Xit) = 0. Under this assumption, pooled OLS still yields consistent parameter estimates, although they are not efficient (see below).

2. The independent variables are assumed to be exogenous, meaning they are not correlated
with the error term. However, it is expected that 𝑤it and 𝑤is (t≠s) are correlated; that is, the
error terms of a given cross-sectional unit at two different points in time are correlated.

(21) corr(wit, wis) = σe² / (σe² + σu²)

If we do not take this correlation structure into account and use OLS, the resulting estimators
will be inefficient. The appropriate method is the Generalized Least Squares (GLS) method.
REM determines the degree to which serial correlation is a problem and then uses some
weighted estimation approach (e.g., GLS) to fix it.

Assumptions on the composite error term in the error components model:
ei ~ N(0, σe²)
uit ~ N(0, σu²)
E(ei uit) = 0; E(ei ej) = 0 (i ≠ j)
E(uit uis) = E(uit ujt) = E(uit ujs) = 0 (i ≠ j; t ≠ s)

The individual error components are not correlated with each other and are not autocorrelated across cross-sectional or time-series units. Since E(wit) = 0, Var(wit) = σe² + σu². If σe² = 0, there is no difference between the pooled regression and the error components model, and we can simply use pooled regression.

How does REM GLS estimation work?

In the error components (or RE) model Yit = α1 + β1 Fertit + β2 Seedit + ei + uit, the two error components are combined into the composite error term wit = ei + uit. The within-group (WG) fixed effects estimator is based on the mean-corrected (demeaned) equation

Yit - Ȳi = β1 (Fertit - F̄erti) + β2 (Seedit - S̄eedi) + (wit - w̄i),

where Ȳi, F̄erti, S̄eedi and w̄i denote the farm-level averages over time. The RE GLS transformed equation is then constructed by pre-multiplying all the means by the GLS parameter λ:

(22) Yit - λȲi = α1(1 - λ) + β1 (Fertit - λF̄erti) + β2 (Seedit - λS̄eedi) + (wit - λw̄i)

REM is a quasi-demeaned model because the means (Ȳi, F̄erti and S̄eedi) are weighted by the GLS parameter λ, with 0 ≤ λ ≤ 1.

If λ = 0, the REM estimator reduces to pooled OLS, Yit = α1 + β1 Fertit + β2 Seedit + wit, and if λ = 1, the REM estimator becomes the fixed effects (within) estimator, Yit - Ȳi = β1 (Fertit - F̄erti) + β2 (Seedit - S̄eedi) + (wit - w̄i).

Thus REM is equal to FEM if the model is fully demeaned, i.e., λ = 1, whereas for 0 < λ < 1 the REM estimator is equal to neither pooled OLS nor FEM.

The REM GLS parameter λ is defined as:

(23) λ = 1 - [σu² / (σu² + T σe²)]^(1/2)

where σu² is the variance of the idiosyncratic error term uit, σe² is the variance of the farm-specific error term ei, and T is the number of time periods.
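To make equations (22) and (23) concrete, the short Python sketch below (illustrative only, not part of the chapter's Stata workflow; the variance components, panel length and data are hypothetical) computes the GLS parameter λ and applies the quasi-demeaning transformation to a toy yield series for one farm.

import numpy as np

# Hypothetical variance components; in practice these are estimated from the data
sigma2_e = 9.0    # variance of the farm-specific error component e_i
sigma2_u = 16.0   # variance of the idiosyncratic error u_it
T = 17            # number of time periods per farm

# Equation (23): lambda = 0 gives pooled OLS, lambda = 1 gives fixed effects
lam = 1 - np.sqrt(sigma2_u / (sigma2_u + T * sigma2_e))
print(f"lambda = {lam:.3f}")

# Toy yield series for one farm (hypothetical values)
rng = np.random.default_rng(0)
y_it = 30 + rng.normal(0, 4, size=T)

# Quasi-demeaning from equation (22): subtract lambda times the farm mean
y_quasi = y_it - lam * y_it.mean()

# Full demeaning (lambda = 1) reproduces the within (fixed effects) transform
y_within = y_it - y_it.mean()

The same quasi-demeaning would be applied to the regressors (Fert and Seed) before running OLS on the transformed data, which is what the GLS (random effects) estimator does internally.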

Fixed Effect model vs Random Effect model

The Hausman test

The Hausman Test is a statistical test used to determine whether the random effects (RE)
assumptions are valid and whether the random effects model is preferable to the fixed effects
(FE) model in panel data analysis. It tests the consistency of the estimators under the null
hypothesis that both the FE and RE estimators are consistent, but the random effects model
is more efficient. If the null hypothesis is rejected, it suggests that the random effects
assumptions may be violated, and the fixed effects model is preferred.

Statement of hypothesis
Null hypothesis H0 : REM is the appropriate estimator
or
H0 : Cov(ei, Xit) = 0
or
H0 : FEM and REM estimators do not differ substantially
Alternate hypothesis Ha: FEM is the appropriate estimator
or
Ha : FEM and REM estimators differ substantially

If H0 is rejected, we conclude that the REM is inappropriate because the random effect is probably correlated with Xit, i.e., Cov(ei, Xit) ≠ 0. In other words, if the calculated test
statistic is greater than the critical value, reject the null hypothesis, indicating that the
random effects model is inconsistent and the fixed effects model is preferred. If the calculated
test statistic is less than the critical value, it fails to reject the null hypothesis, suggesting
that the random effects model is consistent and more efficient than the fixed effects model.

The Hausman test in Stata software


In Stata, you can perform the Hausman test using the Hausman command after estimating
both the fixed effects and random effects models. Here's a step-by-step guide on how to
conduct the Hausman test in Stata:

Start by loading your panel data into Stata.

In the data set, the time and panel variables are in string format, so we have to convert these variables into non-string (numeric) format. For this, use the following commands:
encode year, gene(year1)
encode state, gene(state1)
Before estimating the fixed and random effects models, we have to let Stata know that our data set is panel data. For that, the following command is used:

xtset state1 year1

where xtset is the command that declares the data set as a panel; state1 (the farm identifier) and year1 are the cross-sectional and time variables, respectively.

Estimate Fixed Effects Model: Use the xtreg command with the fe option to estimate the
fixed effects model.

xtreg yield_qtl fert_kg seed_kg , fe


Then store the output of the estimated model by using the following command

estimates store fixed

Estimate Random Effects Model: Use the same xtreg command with the re option to estimate
the random effects model.
xtreg yield_qtl fert_kg seed_kg , re
estimates store random

Perform Hausman Test: After estimating both models, use the hausman command to compare the stored estimates.
hausman fixed random
(Typing hausman fixed alone, as in the session output below, also works: when the second name is omitted, Stata uses the most recently estimated model, here the random effects model, as the efficient estimator.)

Interpret Results
Stata will give the output of the Hausman test statistic and its associated p-value. If the p-
value is less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis
of no difference between the fixed and random effects estimators. In this case, the fixed
effects model may be preferred. If the p-value exceeds your chosen significance level, you fail
to reject the null hypothesis, indicating that the random effects model may be more
appropriate.

Stata Results
Fixed effect model
. xtreg yield_qtl fert_kg seed_kg , fe

Fixed-effects (within) regression Number of obs = 51


Group variable: farm1 Number of groups = 4

R-sq: Obs per group:


within = 0.2268 min = 1
between = 0.6558 avg = 12.8
overall = 0.3512 max = 17

F(2,45) = 6.60
corr(u_i, Xb) = -0.7911 Prob > F = 0.0031

yield_qtl Coef. Std. Err. t P>|t| [95% Conf. Interval]

fert_kg .1117405 .0346893 3.22 0.002 .0418727 .1816084


seed_kg -.0432867 .1352425 -0.32 0.750 -.315679 .2291056
_cons 22.11239 7.552635 2.93 0.005 6.900608 37.32418

sigma_u 6.1146911
sigma_e 4.4300808
rho .65578192 (fraction of variance due to u_i)

F test that all u_i=0: F(3, 45) = 7.99 Prob > F = 0.0002

.
. estimates store fixed

Random effect model

. xtreg yield_qtl fert_kg seed_kg , re

Random-effects GLS regression Number of obs = 51


Group variable: farm1 Number of groups = 4

R-sq: Obs per group:


within = 0.0391 min = 1
between = 0.8273 avg = 12.8
overall = 0.4047 max = 17

Wald chi2(2) = 32.63


corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

yield_qtl Coef. Std. Err. z P>|z| [95% Conf. Interval]

fert_kg -.0031706 .0287621 -0.11 0.912 -.0595433 .0532022


seed_kg -.2207929 .0965519 -2.29 0.022 -.4100311 -.0315546
_cons 40.55007 6.67863 6.07 0.000 27.46019 53.63994

sigma_u 0
sigma_e 4.4300808
rho 0 (fraction of variance due to u_i)

.
. estimates store random

Hausman test

. hausman fixed

Coefficients
(b) (B) (b-B) sqrt(diag(V_b-V_B))
fixed random Difference S.E.

fert_kg .1117405 -.0031706 .1149111 .019393


seed_kg -.0432867 -.2207929 .1775062 .0947009

b = consistent under Ho and Ha; obtained from xtreg


B = inconsistent under Ha, efficient under Ho; obtained from xtreg

Test: Ho: difference in coefficients not systematic

chi2(2) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 58.19
Prob>chi2 = 0.0000
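For readers who want to trace the arithmetic behind the reported statistic, the short Python sketch below implements the quadratic form chi2 = (b - B)'(V_b - V_B)^(-1)(b - B) used by the Hausman test. The coefficient vectors are copied from the output above, but the full covariance matrices V_b and V_B are not printed by Stata (only the square roots of the diagonal of their difference), so the matrices used here are purely hypothetical and the result will not reproduce the 58.19 reported in the session.

import numpy as np

def hausman_stat(b, B, Vb, VB):
    # Hausman statistic: (b - B)' (Vb - VB)^(-1) (b - B), with df = len(b);
    # the p-value is obtained from a chi-square distribution with df degrees of freedom.
    d = b - B
    return float(d @ np.linalg.inv(Vb - VB) @ d), len(b)

# Coefficients copied from the Stata output above (order: fert_kg, seed_kg)
b = np.array([0.1117405, -0.0432867])    # fixed effects: consistent under H0 and Ha
B = np.array([-0.0031706, -0.2207929])   # random effects: efficient under H0

# Hypothetical covariance matrices (Stata stores the real ones internally)
Vb = np.array([[0.00120, 0.00010],
               [0.00010, 0.01830]])
VB = np.array([[0.00083, 0.00005],
               [0.00005, 0.00933]])

stat, df = hausman_stat(b, B, Vb, VB)
print(f"chi2({df}) = {stat:.2f}")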


Chapter 8
Estimation of Total Factor Productivity by using Malmquist Total Factor Productivity Approach: Case of Rice in India
A. Suresh
ICAR-Central Institute of Fisheries Technology, Cochin, India.

Introduction

The Green Revolution (GR) has significantly contributed to achieving the self-sufficiency of
foodgrain production in India, primarily through increased production of rice and wheat. This
remarkable achievement has been achieved through the faster spread of modern varieties
(MVs) and input intensification. The yield increase in the case of rice during the initial phase
of MV introduction was not as miraculous as has happened in the case of wheat. This was
because the diffusion of MVs in the case of rice was not as fast as it was elsewhere. This can
be better gauged by the fact that by around the mid-1980s, some Asian countries like
Indonesia and The Philippines had reached the ceiling for MV adoption of 70-90 percent,
while in the case of India, it was around 30 percent during the same time (Otsuka, 2000).
However, the diffusion of MVs has continuously improved over the years.

The MVs introduced during the Green Revolution period have quickly exhausted the yield
potential, not only in India but across the globe (Hayami and Kikuchi, 1999). Also, some
symptoms of the unsustainability of modern cultivation practices emerged over the course
of time. Some visible symptoms of this unsustainability were nutrient imbalances, depletion
of soil micro-nutrients, over-exploitation of the groundwater, land degradation, more
frequent emergence of pests and diseases, and diminishing returns to inputs (Chand et al.,
2011). This has created apprehension about the ability of the approach to ensure future food security. In this context, an important debate emerged in policy circles: whether the slowdown of agricultural performance is due to technology fatigue or policy fatigue (Planning Commission, 2007; Narayanamoorthy, 2007). One major bottom line of the debate was that, given the strong impact of agricultural income on rural poverty, ensuring TFP growth is critical to reducing rural poverty. In this context, the present chapter examines TFP growth in rice cultivation in India, decomposing it into technical change and efficiency change. In light of the results, the study also discusses whether the slowdown in yield growth is due to technology fatigue or sluggishness in input intensification.

TFP Studies in India and in Other Developing Countries

The TFP has attracted the attention of many scholars in India and other developing countries.
One common generalization that can be gauged from these studies is that the TFP has been
deteriorating even during the heyday of the green revolution in developing countries. For
example, Kawagoe et al. (1985) estimated the cross-country production functions for 22 less
developed countries and 21 developed countries using data for two decades between 1960
and 1980. They reported technological deterioration in developing countries and progress in
developed countries. Using cross-country analysis, some other studies also reported negative
productivity growth for developing country agriculture since the 1960s and 1970s
(Chaudhary, 2012). Nkamleu et al. (2003), analyzing data set for 10 Sub-Saharan African
countries for the period of 1972-1999, reported a deterioration of TFP growth. This
deterioration was identified to be more on account of regress in technical change. As far as
Chinese Agriculture is concerned, Li et al. (2011) noted significant productivity growth since
the 1980s, although the growth rates varied considerably among the subsectors. The productivity growth emanated from either technological progress or efficiency gains, but not from both simultaneously. In an early study on the TFP in India, Kumar and Mruthyunjaya
(1992) reported growth in TFP of wheat in India during 1970-89 to be to the tune of 1.9
per cent in Punjab, 2.7 per cent in Haryana and Rajasthan, 2.6 per cent in Uttar Pradesh and
0.4 per cent in Madhya Pradesh. Kalirajan and Shand (1997) noticed a declining trend of TFP
growth in agriculture by the end of the 1980s. Joshi et al. (2003) and Kumar and Mittal
(2006) reported positive TFP growth for both rice and wheat during the period of 1980-
2000, but the TFP growth posted a reduction during the second decade compared to the
first decade. In a study of various crops and states for the period of 1975-2005, Chand et al
(2011) have observed that the TFP growth has shown considerable variation across crops
and regions. During the entire period under analysis, rice has posted a TFP growth of 0.67
per cent, while that of wheat has been at the rate of 1.92 per cent.

Malmquist Productivity Index‡: The improvement in crop productivity can largely be attributed to growth in input use or growth in the TFP. While partial factor productivity measures the productivity of any single specified input, the TFP provides a measure of productivity considering all the inputs that went into the production process.


‡ Some part of this chapter is published by the author in the research paper cited as follows:
Suresh A (2013). Technical change and efficiency in rice production in India: A Malmquist Total Factor Productivity
approach, Agricultural Economics Research Review, 26 (Conference Issue): 109-18.
The TFP index can be constructed by dividing the index of total output by an index of total
inputs. In that sense, a growth in the TFP can be attributed to that part of the growth that
is not accounted for by the growth in input use. The most popular approach to estimating TFP in the past has been the Tornquist-Theil index method. This index estimates TFP growth based on price information and uses cost/revenue shares as weights to aggregate inputs/outputs (Bhushan, 2005). However, this method has one inherent weakness: it assumes the observed outputs to be frontier outputs. One important consequence of this assumption is that the decomposition of TFP growth into its constituent components, viz., movement towards the production frontier and shift of the production frontier, cannot be carried out. The Tornquist-Theil index attributes the TFP growth entirely to technical change. The Malmquist productivity index (MPI) overcomes some of these problems.

The MPI was introduced by Caves et al (1982) based on distance functions. The output
oriented Malmquist TFP index measures the maximum level of outputs that can be produced
using a given level of input vector and a given production technology relative to the observed
level of outputs (Coelli et al, 2005). It measures the radial distance of the observed output
vectors in period t and t+1 relative to a reference technology. The Malmquist productivity
index for the period t is represented by,
(1) M_t = D_0^t(x^{t+1}, y^{t+1}) / D_0^t(x^t, y^t)

which is defined as the ratio of two output distance functions with respect to the reference technology of period t. It is also possible to construct another productivity index by using period t+1's technology as the reference technology, which can be depicted as:

(2) M_{t+1} = D_0^{t+1}(x^{t+1}, y^{t+1}) / D_0^{t+1}(x^t, y^t)

Thus, there exists an arbitrariness in the choice of the benchmark technology depending on the time period t or t+1. Fare et al. (1994) have attempted to remove this arbitrariness by specifying the MPI as the geometric mean of the two period indices, defined as:

(3) M_0(x^{t+1}, y^{t+1}, x^t, y^t) = [(D_0^t(x^{t+1}, y^{t+1}) / D_0^t(x^t, y^t)) × (D_0^{t+1}(x^{t+1}, y^{t+1}) / D_0^{t+1}(x^t, y^t))]^{1/2}

where the notations x and y represent the vectors of inputs and outputs, D_0 represents the output distance functions, and M_0 represents the Malmquist index. Fare et al., using simple arithmetic manipulations, have shown the MPI to be the product of two distinct components, viz. efficiency change and technical change, as indicated below:

(4) M_0(x^{t+1}, y^{t+1}, x^t, y^t) = [D_0^{t+1}(x^{t+1}, y^{t+1}) / D_0^t(x^t, y^t)] × [(D_0^t(x^{t+1}, y^{t+1}) / D_0^{t+1}(x^{t+1}, y^{t+1})) × (D_0^t(x^t, y^t) / D_0^{t+1}(x^t, y^t))]^{1/2}

where,

(5) Efficiency change = D_0^{t+1}(x^{t+1}, y^{t+1}) / D_0^t(x^t, y^t)

(6) Technical change = [(D_0^t(x^{t+1}, y^{t+1}) / D_0^{t+1}(x^{t+1}, y^{t+1})) × (D_0^t(x^t, y^t) / D_0^{t+1}(x^t, y^t))]^{1/2}

The efficiency change can be further decomposed into pure efficiency change and scale
efficiency change. A detailed account of the MPI can be had from Fare et al. (1994), Coelli et
al. (2005), Bhushan (2005), and Chaudhary (2012). The introduction of linear programming-based Data Envelopment Analysis (DEA) popularised the Malmquist index of productivity measurement. DEA involves the construction of a piece-wise linear frontier based on the distribution of the input and output data of various entities/decision-making units (DMUs) using a linear programming framework. This frontier forms a piecewise surface over the data such that the observed data lie on or below the constructed production frontier (Coelli et al., 2005). The efficiency measure for each DMU is calculated relative to this production frontier. Fare et al. (1994) identify four important advantages of the Malmquist Productivity Index compared to other approaches: (1) the approach requires data only on quantities and not prices, and price information is generally not available for every input and output in many countries; (2) the linear programming-based approach does not assume an underlying production function, and therefore no assumptions are required about the stochastic properties of an error term; (3) no prior assumption is needed regarding the optimising behaviour of the DMUs; and (4) since the approach allows for both movement towards the frontier and shifts of the frontier, it is possible to decompose the TFP into its components, viz. technical change and efficiency change.
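To make equations (3) to (6) concrete, the short Python sketch below computes the Malmquist index and its two components from a set of hypothetical output distance function values (illustrative only; the chapter's actual estimates are obtained from DEA-based distance functions).

import math

# Hypothetical output distance functions for one state between periods t and t+1.
# D[(a, b)] = distance of the period-b observation measured against period-a technology.
D = {
    ("t",  "t"):  0.90,   # D_0^t(x^t, y^t)
    ("t",  "t1"): 1.02,   # D_0^t(x^{t+1}, y^{t+1})
    ("t1", "t"):  0.85,   # D_0^{t+1}(x^t, y^t)
    ("t1", "t1"): 0.95,   # D_0^{t+1}(x^{t+1}, y^{t+1})
}

# Equation (5): efficiency change (catching up with the frontier)
eff_change = D[("t1", "t1")] / D[("t", "t")]

# Equation (6): technical change (shift of the frontier), geometric mean form
tech_change = math.sqrt(
    (D[("t", "t1")] / D[("t1", "t1")]) * (D[("t", "t")] / D[("t1", "t")])
)

# Equation (4): Malmquist TFP index = efficiency change x technical change
malmquist = eff_change * tech_change

print(f"Efficiency change: {eff_change:.3f}")
print(f"Technical change : {tech_change:.3f}")
print(f"Malmquist TFP    : {malmquist:.3f}")  # values above 1 imply productivity growth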

Data
The basic input data for the estimation was collected from the reports of “Comprehensive
Scheme for Cost of Cultivation of Principal Crops” carried out by the Directorate of Economics
and Statistics, Ministry of Agriculture, New Delhi. The data for the missing years were
approximated by interpolations based on the trend growth. The output variable was yield per
hectare (kg/ha) reported by the Ministry of Agriculture. Six input variables were used in the analysis: chemical nutrients (NPK), manure (q/ha), animal labour (pair hours/ha), human labour (man-hours/ha), and the real costs of machine labour and irrigation.§

§
The real cost was derived by deflating with price index for diesel and respectively.
The analysis was carried out for the overall period of 1980-81 to 2009-10. The overall period
under analysis has been divided into two sub-periods of equal length of 15 years, 1980-81
to 1994-95 (period I) and 1995-96 to 2009-10 (period II). These periods broadly correspond
to the period before the macroeconomic reforms and the post-reform period, respectively.
To avoid extreme variations, triennial ending averages were used. The analysis was done using
the software DEAP 2.1 (Coelli, 1996).

Trend in the Yield of Paddy in India


The mean yield of paddy has significantly improved over the years, from about 1.2 tonnes/ha
in 1980-81 to 2.2 t/ha in 2009-10, at the rate of 1.9 per cent per year at the national level
(Table 1). There was a high degree of variation across states and over two periods. While at
the national level, the yield increased at the rate of 3.1 per cent per year, the second period
posted a growth rate of only 1.1 per cent during the first period. The states also shared the
same trend except for a few states like Punjab, broadly reflecting yield plateauing during
period II.

Trend in Total Factor Productivity


Following the methodology outlined earlier, we have estimated the trend in the Malmquist
productivity index since 1980-81. Figure 1 illustrates the movement of the TFP, technical
change, and efficiency change since 1980-81. The figure reveals that the movement of the
TFP change is aligned more with the movement of the technical progress than with a change
in the technical efficiency.

Table 1: Trend in yield of rice, across states, 1980-81 to 2009-10


States | Yield (TE average, kg/ha): 1980-81, 1994-95, 2009-10 | Growth rates (% per year): 1980-81 to 1994-95, 1995-96 to 2009-10, 1980-81 to 2009-10
Andhra Pradesh 1872 2562 3217 2.11 1.87 1.78
Bihar 921 1234 1319 2.86 -0.97 1.56
Karnataka 2008 2371 2539 1.14 1.09 1.37
Madhya Pradesh 586 845 912 1.87 -0.11 1.24
Odisha 918 1364 2167 3.51 3.84 2.17
Punjab 2760 3428 4017 1.33 1.62 1.05
Tamil Nadu 1958 3145 2857 4.34 -0.78 1.23
Uttar Pradesh 869 1836 2106 5.75 0.13 2.65
West Bengal 1347 2069 2551 4.34 1.73 2.69
Overall 1245 1847 2168 3.08 1.33 1.87

[Figure: annual indices of efficiency and technical change (per cent) plotted from 1983 to 2009.]
Figure 1: Malmquist, TFP, and efficiency indices of paddy cultivation, 1980-81 to 2009- 2010

The result suggests that the mean TFP change for rice has been 0.2 per cent per year during
the overall period under consideration (Table 2). The decomposition of the TFP change
indicated that the change in the TFP was associated with technological progress of 0.3 per cent and a deterioration of technical efficiency of 0.1 per cent. This underscores that technical efficiency could not catch up with technical progress and is pulling down the TFP growth. In the case of wheat in India from 1982-83 to 1999-2000, Bhushan (2005) indicated that the major source of productivity growth was technical change rather than efficiency change. Efficiency change was also not a major source of growth for rice in some other major rice-producing countries like the Philippines (Umetsu et al., 2003).

Table 2 also depicts the growth in TFP and its constituent components across states for the
overall period under analysis. The TFP change varied considerably across states, with four
states (Andhra Pradesh, Punjab, Tamil Nadu and Uttar Pradesh) out of the total nine states
under consideration posting positive trends and the remaining five states posting negative
trends. The highest change in the TFP among states has been noted in case of Andhra
Pradesh (5.1 per cent), followed by Punjab (4.6 per cent). On the other hand, the negative
TFP growth ranged between -4.6 per cent in cases of Madhya Pradesh to -1.3 per cent in
case of Karnataka. The table reveals that the TFP change is associated more with technical
change than with efficiency change at state level also. A positive growth in both efficiency
and technical change could be noted only in case of Andhra Pradesh and Uttar Pradesh.

For Punjab, positive technical change was associated with no change in efficiency, while for Tamil Nadu, a technical change of 2.8 per cent was coupled with an efficiency change of -0.9 per cent. It is noteworthy that Karnataka and Madhya Pradesh posted a decline in technical
change, efficiency change and TFP during the overall period. The change in efficiency has
been decomposed into its components, viz. pure efficiency change and scale efficiency change
as well. Pure efficiency has remained unchanged at the national level and in most of the
states, except in Andhra Pradesh and Tamil Nadu. An increase in pure efficiency has been
observed in the case of Andhra Pradesh and Uttar Pradesh. The results suggest that the
agricultural development strategy has to pay increased attention to the factors that could
influence efficiency as well as the factors that result in technical progress.

Table 2: Trend in the total factor productivity and its components, 1980-81 to 2009-10
State | Efficiency change | Technical change | Pure efficiency change | Scale efficiency change | TFP change
Andhra Pradesh 100.7 104.4 100.5 100.2 105.1
Bihar 100 97.7 100 100 97.7
Karnataka 99.9 98.8 100 99.9 98.7
Madhya Pradesh 98.7 96.7 100 98.7 95.4
Odisha 100 96.3 100 100 96.3
Punjab 100 104.6 100 100 104.6
Tamil Nadu 99.1 102.8 99.3 99.8 101.8
Uttar Pradesh 100.5 103.2 100 100.5 103.7
West Bengal 100 98.6 100 100 98.6
Mean 99.9 100.3 100 99.9 100.2

Trend in TFP during the sub-periods

The sub-period analysis throws up some interesting results (Table 3). It turned out that at
national level, the mean TFP growth increased from -1.3 per cent in the period I (first period)
to 1.8 per cent during period II (second period). This TFP change was associated with an
improvement in the technical change (from -1.6 per cent to 2.1 per cent) and a decline in
efficiency (from 0.3 per cent to -0.2 per cent). It is observed that some of the early green revolution states like Punjab, Tamil Nadu and Uttar Pradesh, which posted high rates of TFP growth during the first period, have exhibited a deterioration during the second period, while states like Karnataka, Madhya Pradesh, Odisha and West Bengal, where the TFP trend was deteriorating during the first period, have shown a revival.

The results also suggest that the improvement between the two periods for the latter group of states was by wide margins, the highest absolute increase being in the case of Odisha (by 12.2 percentage points). The decline in the TFP of Punjab, Tamil Nadu and Uttar Pradesh was mainly due to a deterioration in the rate of technical progress rather than a decline in efficiency growth. The revival of TFP growth in the case of Karnataka, Madhya Pradesh, Odisha and West Bengal is due to a high level of technological progress. A picture of contrasting performance has been noted in the case of Andhra Pradesh and Bihar. In Andhra Pradesh, an already increasing TFP growth increased further during the second period (from 4.0 per cent to 7.5 per cent), while in Bihar the already deteriorating TFP growth of the first period deteriorated further (from -0.7 per cent to -4.4 per cent). This contrast owes itself to the contrasting performance of technical progress in the two states. In the case of Andhra Pradesh, the increase in technical progress from 2.5 per cent to 6.6 per cent could surpass the deterioration in efficiency growth, effecting a positive TFP growth. On the other hand, the deterioration of technical change from -0.7 per cent to -4.4 per cent, while efficiency remained unchanged, pulled down the TFP growth in the case of Bihar. The increase in TFP growth with practically unaltered efficiency levels points to an upward shift of the production frontier. In that sense, it can be presumed that the states performing poorly during the first period have been catching up with the already progressive states. On the other hand, the results suggest that the rate of shift of the production frontier is declining in the already well-performing states, except Andhra Pradesh.

Table 3: The trend in technical change, efficiency change and total factor productivity
State | Efficiency change (Period I, Period II) | Technical change (Period I, Period II) | TFP change (Period I, Period II)
Andhra Pradesh 101.5 100.8 102.5 106.6 104.0 104.4
Bihar 100.0 100.0 99.3 95.6 99.3 97.7
Karnataka 100.0 100.3 95.3 102.1 95.3 98.8
Madhya Pradesh 99.7 98.8 91.4 101.8 91.2 96.7
Odisha 100.0 100.0 90.0 102.2 90.0 96.3
Punjab 100.0 100.0 105.6 104.0 105.6 104.6
Tamil Nadu 100.0 98.0 103.6 102.3 103.6 102.8
Uttar Pradesh 101.1 100.0 103.4 103.2 104.6 103.2
West Bengal 100.0 100.0 96.0 101.1 96.0 98.6
Mean 100.3 99.8 98.4 102.1 98.7 100.3

Technology Fatigue or Sluggishness in Input Intensification?

The above results help to shed light on the debate on whether the declining productivity is due to technology fatigue or policy fatigue. The foregoing analysis has clearly shown that TFP growth in rice has acquired greater geographical spread during recent periods. In this
context, it would be worthwhile to analyze the trend in use of inputs in rice cultivation. Table
4 provides the trend growth of application of four major inputs, viz. irrigation, fertilizer,
manures and human labour. It clearly indicates that the rate of use of inputs has declined in
most of the states, with a few exceptions.

The decline has been sharp in the case of labour, fertilizer and manure. All the states with the
exception of Punjab posted a decline in the rate of application of fertilizers. In case of labour,
all the states except Odisha and West Bengal have posted negative growths. This trend has
been broadly reflected in the cost of cultivation as well (Appendix). At national level, the cost
of cultivation increased at the rate of 9.2 per cent per year during the overall period under
analysis. On a disaggregated analysis the second period exhibited a growth rate of 7.3 per
cent per year, compared to 10.9 during the first period. This decline in expenditure growth
(despite a higher level of input price during the second period) might be out of reduced rates
of input application.

Table 4: Growth in use of irrigation (real price), fertilizer nutrients (kg/ha) and human labour (labour hours) in paddy cultivation across states, between the two periods (% per year)
States Irrigation Fertilizer Labour

Period I Period II Period I Period II Period I Period II


Andhra Pradesh 4.81 -13.70 2.73 1.88 -0.29 -2.29
Bihar -7.22 23.04 7.63 1.30 -0.10 -0.74
Karnataka 9.75 -4.78 6.40 2.12 0.05 -0.28
Madhya Pradesh 19.02 1.12 8.74 -0.92 0.81 -1.99
Odisha 7.03 3.28 13.61 2.30 0.48 0.36
Punjab 1.39 -5.24 1.04 1.11 -3.25 -1.23
Tamil Nadu 5.79 -4.03 -1.36 1.85 -5.10 -2.79
Uttar Pradesh 11.30 2.74 7.99 2.76 -0.73 0.20
West Bengal 14.44 -5.09 10.23 4.11 1.20 0.42

The above trend is vividly reflected in the change in the cost structure and factor shares
(Table 5). For analytical purposes, the entire expenditure on rice cultivation has been grouped into four input groups, viz. current inputs, capital inputs, labour and land. Current inputs are seed, fertilizer, manure, insecticides and interest on variable cost; capital inputs are draft animals, irrigation, machinery, depreciation and interest on fixed capital; the labour input is human labour. The land group involves the value of land resources (both owned and hired) as well as other charges on land. The table provides three specific pieces of information: the share of inputs in the total cost of cultivation (cost share), the trend growth of the (nominal) expenditure on these input groups, and their share in the total value of output (factor share). The expenditure on current inputs has grown at the rate of 8.0 per cent per year, capital inputs at 8.8 per cent, labour at 10.5 per cent and land at 8.7 per cent for the overall period under analysis. Period II depicted a reduction in expenditure growth for all the input groups, most noticeably in the case of current inputs (from 9.7 per cent to 5.4 per cent). The growth in expenditure on capital inputs, which more or less reflects long-term farm investment, has reduced from 10.3 per cent to 7.2 per cent. This is a cause for concern, as the reduction in capital investment has long-term implications for farm income growth.

Table 5: Trend in the cost share, factor share and growth rate of various input groups in
paddy cultivation, national level
Input groups | Cost share (%): 1980-81, 1994-95, 2009-10 | Trend growth rate (% per year): Period I, Period II, Overall | Factor share (%): 1980-81, 1994-95, 2009-10
Current 18.9 17.0 13.0 9.7 5.4 8.0 17.2 14.4 12.4
Capital 24.4 20.8 17.9 10.3 7.2 8.8 22.3 17.6 17.1
Labour 28.9 32.3 42.3 12.1 8.9 10.5 26.4 27.5 40.3
Land 27.8 29.9 26.8 11.1 6.2 8.7 25.4 25.4 25.6
Basic Data Source: Cost of cultivation reports of CACP

Corresponding to the relative growth of expenditure, the structure of costs has also changed sharply over time. While the shares of current and capital inputs in the cost of cultivation have registered a decline over the years, that of labour has increased by about 13 percentage points between 1980-81 and 2009-10 (Table 5). The spurt in labour expenditure has to be explained in the light of the high rate of increase in agricultural wages in recent times rather than a physical increase in labour absorption in rice cultivation. The results broadly suggest that

it is the sluggishness in input intensification that is resulting in the yield decline rather than a reduction in TFP or technical change. This indicates that farm policies should favour sustainable intensification of inputs so as to increase the yield. The trend in the cost share has been broadly reflected in the factor share as well. While the share of current and capital inputs declined over the years, the shares of labour and land increased. A close observation also reveals that technical change in rice cultivation has not resulted in a significant percolation of benefits to the entrepreneur/farmer in the form of an increased share in the value of output during the second period under analysis.

Conclusion and Policy Implications

The study has estimated the TFP growth for rice in India and in major states and has
decomposed the TFP growth into its constituent components viz technical change and
efficiency change. In the light of the above results the study has discussed whether the recent
slowdown in yield growth is due to technology fatigue or sluggishness in the input
intensification.

The study identifies that during the overall period under analysis, the TFP growth has been
at a moderate rate of 0.2 per cent per year, with large inter-state variations. The positive
change in the TFP has been associated with a mean technical change of 0.3 per cent and a
deterioration of mean efficiency by -0.1 per cent. The technical change turned out to be the
main driver of the TFP change. Among states Andhra Pradesh, Punjab, Tamil Nadu and Uttar
Pradesh exhibited positive TFP change during the entire period under analysis. The sub-
period analysis indicates that the second period witnessed a revival of the mean TFP to the level of 1.8 per cent per year, compared to a negative TFP change of -1.3 per cent during the previous period. This revival was effected mainly by positive technical change during the second period. However, a matter of concern is the decline in technical efficiency. It is also observed that TFP growth has become more widespread with the passage of time. The states that were less progressive with respect to TFP growth during the first period, viz. Karnataka, Madhya Pradesh, Odisha and West Bengal, have caught up with the initially progressive states during the second period, mainly propelled by a high rate of technical progress. It is also noted that the TFP growth of the progressive states, except Andhra Pradesh, deteriorated during the second period, mainly due to regress in technical change. One state that needs special mention is Bihar, where both technical change and efficiency change deteriorated over the years.

The study throws up some important policy observations. It establishes that in case of rice,
there is no conclusive evidence for a technology regress; rather there is evidence of
technological progress over years. However, the rate of growth of input application has been
declining over years. Therefore, rather than technological fatigue, it might be the sluggish
input intensification that is contributing to the decline in yield growth of rice in recent periods.
Therefore, farm policies need to be aligned towards sustainable resource intensification, notably of capital inputs, as these have long-term implications for farm income growth. Along with technical progress, policies should also aim to improve the technical efficiency of cultivation. In the light of the existing evidence on the positive role of research investment in technical progress and of extension expenditure in efficiency change, agrarian policies need to favour an increased flow of resources towards the research and extension system so as to effect TFP growth through both technical and efficiency changes.

Bibliography
Bhushan, S. (2005) Total factor productivity growth of wheat in India: A Malmquist
Approach. Indian Journal of Agricultural Economics, 60(1):32-48.
Caves, D.W., Christensen, L.R. and Diewert, W.E. (1982) The economic theory of index
numbers and the measurement of input, output and productivity, Econometrica:
1393-1414.
Chand, R., Kumar, P and Kumar, S. (2011) Total factor productivity and contribution of
research investment to agricultural growth in India, Policy Paper 25, New Delhi,
National Centre for Agricultural Economics and Policy Research.
Chaudhary, S. (2012) Trend in total factor productivity in Indian agriculture: State level
evidence using non-parametric sequential Malmquist Index, Working Paper No 215,
New Delhi, Centre for Development Economics, Delhi School of Economics.
Coelli, T.J. (1996) A guide to DEAP Version 2.1: A Data Envelopment Analysis (Computer)
Program, Centre for Efficiency and Productivity Analysis, University of New England,
Australia.
Coelli, T.J., Rao, D.S.P., O’Donnell, C.J. and Battese, G.E. (2005) An introduction to efficiency
and productivity analysis, Springer.
Fare, R., Grosskopf, S., Norris, M. and Zhang, Z. (1994) Productivity growth, technical progress, and efficiency change in industrialised countries. The American Economic Review: 66-83.
Hayami, Y. and Kikuchi, M. (1999) The three decades of green revolution in a Philippine village. Japanese Journal of Rural Economics, 1: 10-24.
Kalirajan, K.P. and Shand, R.T. (1997) Sources of output growth in Indian Agriculture, Indian
Journal of Agricultural Economics, 52(4), 693-706.
Kawagoe T., Hayami Y., Ruttan V. (1985), The inter-country agricultural production function
and productivity differences among countries. Journal of Development Economics,
Vol. 19, p113-32.
Kumar P. and Mittal Surabhi (2006) Agricultural Productivity Trends in India: Sustainability
Issues. Agricultural Economics Research Review. 19 (Conference No.) pp 71-88.
Kumar, P. and Mruthyunjaya (1992) Measurement and analysis of total factor productivity
growth in wheat. Indian Journal of Agricultural Economics, 47 (7): 451-458.
Kumar, P., Joshi, P.K., Johansen, C and Asokan, M. (1998) Sustainability of rice-wheat based
cropping system in India. Economic and Political Weekly, 33: A152-A158.

Li, G., You, L. and Feng, Z. (2011) The sources of total factor productivity growth in Chinese
agriculture: Technological progress or efficiency gain. Journal of Chinese Economic
and Business Studies, 9(2): 181-203.
Narayanamoorthy, A. (2007). Deceleration in agricultural growth: Technology or policy
fatigue. Economic and Political Weekly, 42(25):2375-79.
Nkamleu, G.B., Gokowski, J and Kazianga, H. (2003) Explaining the failure of agricultural
production in sub-saharan Africa. Proceedings of the 25th International Conference
of Agricultural Economists, Durban, South Africa, 16-22 August 2003.
Otsuka, Keijiro (2000) Role of agricultural research in poverty reduction: lesson from the
Asian Experience. Food Policy, 25: 445-462.
Planning Commission (2007) Report of the steering committee on Agriculture for Eleventh
Five Year Plan (2007-2012), New Delhi, Government of India.
Umetsu, C., Lekprichakul, T and Charavorty, U (2003). Efficiency and technical change in the
Philippine rice sector: A Malmquist total factor productivity analysis. American
Journal of Agricultural Economics. 85(4): 943-963.

Appendix 1: Growth in cost of cultivation and cost of production (nominal prices)


% per year

States Period I Period II Overall


Andhra Pradesh 11.5 5.8 9.1
Bihar 9.7 5.3 7.9
Karnataka 10.6 5.6 9.9
Madhya Pradesh 11.4 4.6 9.3
Punjab 8.7 7.2 8.1
Tamil Nadu 10.5 4.5 6.6
Uttar Pradesh 10.9 7.2 9.1
West Bengal 11.0 10.1 10.6
Odisha 11.4 6.9 10.0
National 10.9 7.3 9.2
Basic Data Source: Cost of cultivation reports of CACP

Chapter 9
Forecasting Methods – An Overview
Ramadas Sendhil1, V Chandrasekar2, L Lian Muan Sang1, Jyothimol Joseph1 and Akhilraj M1
1 Department of Economics, Pondicherry University (A Central University), Puducherry, India.
2 ICAR-Central Institute of Fisheries Technology, Cochin, India.

Introduction

Forecasting is an important tool in econometrics, enabling the prediction of future economic


trends, behavior, and outcomes. This technique involves estimating the likelihood of future
events or trends using historical data, enabling advanced decision-making by policymakers,
researchers, businesses, and individuals alike. It entails analyzing the trends and patterns of
changes in the variable over time to estimate its magnitude at a future point. By identifying
regular trends and patterns in data, forecasters can make informed predictions about the
likely future values that serve as essential inputs for planning, risk management, and policy
formulation.

Forecasting in econometrics encompasses a spectrum of approaches, ranging from traditional


time series analysis to sophisticated causal modeling and machine learning techniques. Data
smoothing can be used to make short-term predictions, especially when dealing with irregular
data. Forecasting is essentially about making informed estimates about what is likely to
happen based on past events. In this chapter, we delve into the different forecasting methods
in econometrics. Through a comprehensive examination of these methods, we gain a deeper
understanding of how different econometric forecasting models contribute to
comprehending complex economic dynamics.

Note: Adapted from Armstrong (2004)

Figure 1: Stages in Forecasting


This chapter is primarily adapted from ‘Forecasting of Paddy Prices: A Comparison of Forecasting Techniques’ (2007) authored
by Nasurudeen P, Thimmappa K, Anil Kuruvila, Sendhil R and V Chandrasekar from the Market Forecasting Centre, Department of
Agricultural Economics, PJN College of Agriculture and Research Institute, Karaikal. Available at:
https://www.researchgate.net/publication/329446012
Types of Forecasting Methods
Forecasting is a process that can be done based on subjective factors using personal
judgment, intuition, and commercial knowledge, and also through an objective approach using
statistical analysis of past data. Sometimes, a blend of both is also used. Broadly, the various
forecasting methods can be grouped into qualitative and quantitative approaches.
A. Qualitative Forecasting Methods: Subjective judgments or opinions are used in
qualitative methods of forecasting. These methods do not include mathematical
computations. This technique is employed when the past data for the variable being
forecast is unavailable, when there is limited time to gather data or utilize
quantitative techniques, or when the situation is evolving so rapidly that a statistical
forecast would offer fewer insights.

Qualitative Forecasting Methods

• Executive Opinion: A gathering of high-ranking executives convenes to jointly create an estimate for future events or trends.

• Market Research: This systematic approach utilizes questionnaires, surveys, sampling, and information analysis to understand consumer preferences and evaluate prices accordingly.

• Delphi Method: A forecast is the result of an anonymous consensus among experts. A coordinator sends data and questions to the experts, who then share and discuss their comments until a consensus is reached. This process is time-consuming.

B. Quantitative Forecasting Methods: Quantitative forecasting methods rely on


mathematical models based on objective analysis. These approaches involve
studying past relationships between the variable to be forecasted and the
influencing factors in developing a forecast model. They also enable the assessment
of forecast accuracy, learning from past mistakes, and establishing confidence
intervals for forecasts. Quantitative forecasting methods are categorized as time
series and causal models. While numerous methods for forecasting a variable exist,
this chapter intentionally concentrates on forecasting techniques for time series
variables.
Quantitative Forecasting Methods

• Time Series Models: These models consider past data patterns to anticipate future outcomes by identifying the inherent patterns within the data.

• Causal Models: These models operate under the assumption that other variables influence the forecasted variable, and the predictions are derived from these interrelationships.

Time Series
Time series data is a collection of ordered observations on a quantitative attribute of a
variable gathered at various time intervals. Typically, these observations occur sequentially
and are evenly distributed over time. In mathematical terms, a time series is characterized by X1, X2, …, Xn, representing a variable X (such as gross domestic product, sales, commodity price, height, weight, etc.) at specific time points t1, t2, …, tn. Therefore, X is a function of time, denoted as X = F(t).

Objectives of Time Series Data Analysis


1. Description: The initial step involves describing the data using graphical methods
such as time plot, correlogram, and summary statistics such as autocorrelation
function (ACF) and partial autocorrelation function (PACF). The time plot is the
graphical representation of the time series.
2. Modeling: Subsequently, an efficient statistical model is used to depict the stationary data-generating process. Time series modeling generally tries to extract information from the variable itself. The fundamental assumption is that the current value of the variable in time period t is affected by its preceding value in time period t-1.
3. Forecasting: The forecasting stage entails estimating future values of the series. The words "forecasting" and "prediction" are often used interchangeably, but conventionally "forecasting" is used with time series data and "prediction" with multiple-series or cross-sectional data. Notably, "steady-state forecasting" anticipates the future to resemble the past, while "what-if forecasting" employs a multivariate model to explore the impact of other variables. The analysis of a single time series is called univariate time series analysis; when multiple variables are used in an equation, multivariate time series modeling techniques are adopted.

4. Control: Effective forecasts enable proactive control of a process or variable, aligning
with the concept of what-if-forecasting.

Time Series Forecasting: Time series forecasting, a subset of quantitative forecasting models,
involves analyzing data for trend, seasonality, and cycle patterns in a single variable.
Understanding these patterns is crucial before conducting the analysis. The initial step in
forecasting a time series variable is to generate sequence plots of the data to visually evaluate
the characteristics of the time series. Such a plot of the series against time (a time plot) aids in identifying behavioral components within the time series and guides the selection of the most suitable forecasting model. Various conventional methods for forecasting time series are the naïve method, mean model, moving averages method, linear regression with time, exponential smoothing models, auto-regressive moving average (ARMA), and auto-regressive integrated moving average (ARIMA). In this age of
artificial intelligence, more powerful, robust, and precise models of forecasting, such as
artificial neural networks (ANN), are developed and used by econometricians. Once a model
is selected based on the data pattern, the next step is its specification. This process entails
the identification of variables to be incorporated, the selection of the relationship equation's
form, and the estimation of the equation's parameters. The model's effectiveness is validated
by comparing its forecasts with historical data for the targeted forecasting process. Typical
error metrics like Mean Absolute Percentage Error (MAPE), Relative Absolute Error (RAE),
and Mean Square Error (MSE) are frequently employed for model validation. The objective
here is to distinguish the trend from the disturbance and to observe the series and its lagged values to determine the long-term changes and the factors behind seasonal fluctuations. Several computer packages, including R, SPSS, FORECASTX, STATA, SHAZAM, SAS, and EVIEWS,
can perform time series forecasting. These tools allow analysts to effectively conduct and
validate time series forecasting analyses.
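As an illustration of the validation step, the short Python sketch below computes two of the error metrics mentioned (MAPE and MSE) for a hypothetical hold-out sample; RAE and other metrics follow the same pattern.

import numpy as np

def mape(actual, forecast):
    # Mean Absolute Percentage Error (%), assuming no zero actual values
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * float(np.mean(np.abs((actual - forecast) / actual)))

def mse(actual, forecast):
    # Mean Square Error
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean((actual - forecast) ** 2))

# Hypothetical hold-out comparison of actual values and model forecasts
a = [100, 102, 101, 105]
f = [99, 103, 102, 104]
print(f"MAPE = {mape(a, f):.2f}%, MSE = {mse(a, f):.3f}")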

Patterns in a Time Series Data


The initial and crucial step in creating a forecast involves analyzing the historical relationship
by generating a time series plot. The time series plots presented below demonstrate the
typical patterns observed in time series data.
(i) Trend (T)
(ii) Seasonal effect (S)
(iii) Cyclical effect (C)
(iv) Irregular effect (I)
Source: https://www.analyticsvidhya.com/blog/2023/02/various-techniques-to-detect-and-isolate-time-series-components-using-python/ and https://www.xenonstack.com/blog/time-series-deep-learning

• Trend refers to the gradual and long-term increase or decrease of the variable over
time.
• Seasonal effects capture the recurring influences that impact the variable on an
annual basis.
• Cyclical effects measure the broad, irregular waves that affect the variable,
potentially stemming from general business cycles, demographic shifts, and other
factors.
• The irregular effect encompasses the variations that cannot be ascribed to trend,
seasonality, or cyclical patterns, essentially representing the residual fluctuations.

Table 1. An overview of popular forecasting methods

• Naïve: Utilizes the most recent actual value as the forecast.

• Simple Arithmetic Mean: Utilizes the mean of all historical data as the forecast.

• Simple Moving Average: Utilizes the mean of a specified number of the latest observations, with equal weight assigned to each observation.

• Weighted Moving Average: Utilizes the average of a defined number of recent observations, with different weights assigned to each observation.

• Exponential Smoothing: A weighted average technique with exponentially declining weights as the data age.

• Trend Adjusted Exponential Smoothing: An exponential smoothing model incorporating adjustments for strong inherent trend patterns in the data.

• Adaptive Response Rate Exponential Smoothing (ARRES): Similar to the basic exponential smoothing model, but with adaptive smoothing parameter (alpha) adjustments based on varying errors over time.

• Curve Fitting (Regression): Statistical techniques aimed at explaining variation; the fitted relationship may be linear or non-linear.

• Auto-Regressive Integrated Moving Average (ARIMA): The auto-regressive (AR) model is combined with the moving average (MA) and extended to non-stationary data. Also known as the Box-Jenkins method. The model can also be adapted to stationary data, i.e., the ARMA model.

• Markov Chain Probabilistic Forecast: Forecasts future events under the assumption that the likelihood of transitioning from one state to another depends only on the current state, independent of the sequence of events that preceded it.

• Artificial Neural Network (ANN): A powerful tool for forecasting and modeling when the underlying relationship in the data is not known.

• Artificial Intelligence Model: Use of AI to enhance prediction accuracy and address the complexities of time series data.

Note: Adapted from Nasurudeen et al. (2007).

Popular Forecasting Techniques – A Glance


1. Forecasting by Naïve Method
Assume a time series X1, X2, …, Xt (daily data). The forecast for the subsequent day (i.e., Xt+1) is simply the current day's actual value (Xt); each day's forecast is rolled forward one period in this way.

2. Forecasting by Simple Arithmetic Average Method
In this method, the forecast for the subsequent day (i.e., Xt+1) is calculated as the mean of all
the past values or historical data. In this context, daily forecasting is initiated from day 2 (as there is no pre-existing data available to form a forecast for the first day, other means of prediction have to be relied upon), and the forecast for each subsequent day uses the mean of all values observed up to that day.

3. Forecasting by Simple Moving Average Method


The forecast for the next day (i.e., Xt+1) is determined by taking the average of a defined
number of recent observations, where each observation is assigned an equal weight. In this
scenario, a two-day simple moving average is utilized. If there is no initial data for the forecast
series (i.e., Xt), a guess value can be used for the 1st day forecast. Subsequently, the "naïve
method" shall be employed to forecast the 2nd day. Following this, adequate data will be
available for subsequent forecasts using the two-day simple moving average. In the case of a
three-day simple moving average, 1st day forecast will be a guess, 2nd and 3rd day forecasts
will use the "naïve method" and the subsequent forecast will be made using the three-day
simple moving average.

4. Forecasting by Weighted Moving Average Method


In this method, the forecast for the next day (i.e., Xt+1) is determined using a three-day
weighted moving average. Initially, for the forecast series, a guess was made for the 1 st day
due to the absence of data at the beginning. Subsequently, the "naïve method" was used to
forecast for the 2nd day and 3rd day. Once sufficient data become available, the three-day
weighted moving average forecasts will be utilized for the subsequent days, with weights
assigned as follows: 0.5 for the most recent day, 0.3 for the day before that, and 0.2 for the
day prior. The weights can be chosen according to the nature of the time series, but their sum should equal one. This method is an improvement over the earlier methods, with its own advantages but also some limitations. One benefit of this method
is the flexibility to adjust the weights assigned to previous observations. However,
determining the optimal weights can be costly. This method is most suitable when historical
data exhibit consistent period-to-period changes of similar magnitude. On the other hand,
this method has limitations. It does not account for seasonality and trend, and determining
the optimal number of periods and weights can be challenging and resource-intensive.
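The following minimal Python sketch (hypothetical daily price series, not from the source paper) illustrates the four simple methods discussed above: the naïve forecast, the simple arithmetic mean, and the three-day simple and weighted moving averages.

import numpy as np

x = np.array([100, 102, 101, 105, 107, 106, 110], dtype=float)  # hypothetical daily prices

# 1. Naive: tomorrow's forecast equals today's actual value
naive_next = x[-1]

# 2. Simple arithmetic mean: forecast equals the mean of all past values
mean_next = x.mean()

# 3. Simple moving average (k most recent observations, equal weights)
k = 3
sma_next = x[-k:].mean()

# 4. Weighted moving average (weights must sum to one; most recent day weighted highest)
w = np.array([0.2, 0.3, 0.5])            # oldest -> newest of the last three days
wma_next = float(np.dot(w, x[-3:]))

print(naive_next, mean_next, sma_next, wma_next)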

5. Forecasting by Exponential Smoothing Method
It is a weighted average technique wherein the weights decline exponentially as data ages. In
this method, the forecast for the next day (i.e., Ft+1) is determined using the following formula:
Ft+1 = α At + (1-α) Ft (Eq. 1)
where At is the actual time series, Ft is the forecast series, and ‘α’ represents a smoothing
coefficient ranging between ‘0 and 1’. Although the exponential smoothing method relies on
just two observations for making future predictions (the latest actual observation and the
most recent forecast), it effectively integrates a portion of all historical data. In this approach,
past values are assigned varying weights, with older data receiving less weight. This concept
can be illustrated by extending the formula mentioned above. The method used to generate
the forecast for the last day (Ft) is as follows.
Ft = α At-1 + (1-α) Ft-1 (Eq. 2)
Substituting eq. 2 into eq. 1:
Ft+1 = α At + (1-α) [α At-1 + (1-α) Ft-1]
Modifying the above eq.,
Ft+1 = α At + α (1-α) At-1 + (1-α)^2 Ft-1 (Eq. 3)
Applying eq. 2 recursively one period earlier gives:
Ft-1 = α At-2 + (1-α) Ft-2 (Eq. 4)
Substituting eq. 4 into eq. 3:
Ft+1 = α At + α (1-α) At-1 + (1-α)^2 [α At-2 + (1-α) Ft-2] (Eq. 5)
Modifying the above eq.,
Ft+1 = α At + α (1-α) At-1 + α (1-α)^2 At-2 + (1-α)^3 Ft-2 (Eq. 6)
Continuing this substitution indefinitely yields:
Ft+1 = α At + α (1-α) At-1 + α (1-α)^2 At-2 + α (1-α)^3 At-3 + α (1-α)^4 At-4 + … (Eq. 7)

As the decimal weights are raised to increasing powers, their values diminish. In the absence
of initial data, a guess value can be used for the day one forecast. Subsequently, the
exponential smoothing model shall be employed to forecast each subsequent day, starting
from day two. However, there are some principles to determine the value of ‘α’ which are
given below:
● To handle data that is random and shows erratic behavior without a clear pattern, a
larger value of ‘α’ should be employed.
● Conversely, for random walk time series data characterized by random and smooth
fluctuations without repetitive patterns, a smaller value of ‘α’ is recommended.

● When higher degree smoothing is required, a long-run moving average should be
utilized, corresponding to a smaller ‘α’ value.
● Conversely, when a lesser degree of smoothing is required, a short-run moving
average should be employed, corresponding to a higher ‘α’ value.
● Experimenting with different values of ‘α’ to fit the model and selecting the optimal
‘α’ based on minimal error is advisable.
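A minimal single exponential smoothing sketch is given below; it assumes a short hypothetical price series, uses the first observation as the day-1 guess, and follows the last principle above by scanning a grid of 'α' values and retaining the one with the smallest mean squared error.

```python
import numpy as np

def ses_forecast(actuals, alpha, first_guess=None):
    """Single exponential smoothing: F[t+1] = alpha*A[t] + (1 - alpha)*F[t]."""
    f = np.empty(len(actuals))
    f[0] = actuals[0] if first_guess is None else first_guess  # day-1 guess
    for t in range(1, len(actuals)):
        f[t] = alpha * actuals[t - 1] + (1 - alpha) * f[t - 1]
    return f

prices = np.array([100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 104.0])

# Scan a grid of alphas and keep the one giving the smallest mean squared error
best = min(((a, np.mean((prices - ses_forecast(prices, a)) ** 2))
            for a in np.arange(0.1, 1.0, 0.1)), key=lambda x: x[1])
print("alpha = %.1f, MSE = %.2f" % best)
```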

6. Adaptive Response Rate Single Exponential Smoothing (ARRSES)


The adaptive response rate in single exponential smoothing offers an advantage over the
traditional method by removing the necessity of defining a specific value for 'α.' In this
approach, the forecast for the upcoming period (Ft+1) is calculated based on:
Ft+1 = αt At + (1-αt) Ft
αt+1 = | Et / Mt |
Mt = β |et| + (1- β) Mt-1
Et = β et + (1- β) Et-1
et = At - Ft
where,
At is the actual price and Ft is the forecasted price at tth period
et is the error term at tth period
Mt is the absolute smoothed error
Et is the smoothed error
The advantage of this method is that it is capable of representing nearly all data patterns.
The value of αt automatically adjusts whenever there is a change in the data pattern.
However, it may not perform well for highly random data with low autocorrelation.
Additionally, recomputing the necessary statistics when new observations become available
can be relatively cumbersome. Furthermore, the forecasts from this technique lag turning
points by one period, meaning it does not anticipate turning points in the forecasted time
series.
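The recursions above can be coded directly; the sketch below is a minimal illustration on an invented price series, with the starting values of the forecast, 'α', and the smoothed error terms chosen arbitrarily (a common source of variation across implementations).

```python
import numpy as np

def arrses(actuals, beta=0.2, alpha0=0.2):
    """Adaptive response rate SES: alpha adapts as the data pattern changes."""
    n = len(actuals)
    F = np.empty(n)
    F[0] = actuals[0]                          # day-1 forecast: a guess
    alpha, E, M = alpha0, 0.0, 0.0             # arbitrary starting values
    for t in range(n - 1):
        e = actuals[t] - F[t]                  # error at period t
        E = beta * e + (1 - beta) * E          # smoothed error
        M = beta * abs(e) + (1 - beta) * M     # absolute smoothed error
        F[t + 1] = alpha * actuals[t] + (1 - alpha) * F[t]
        alpha = abs(E / M) if M != 0 else alpha0   # alpha for the next period
    return F

prices = np.array([100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 104.0])
print(np.round(arrses(prices), 2))
```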

7. Brown’s One Parameter Linear Exponential Smoothing


Let S’t and S”t represent the single exponential and double exponential smoothed values,
respectively.
Then,
S’t = α At + (1-α) S’t-1

S”t = α S’t + (1-α) S”t-1
Then,
at = S’t + (S’t - S”t) = 2 S’t - S”t
bt = [α / (1 - α)] (S’t - S”t)
Ft+m = at + bt m
where, m is the number of periods ahead to be forecast.

8. Brown’s Quadratic Exponential Smoothing


Let S’t, S”t, and S”’t denote the single, double, and triple smoothing, and Ft represents the
forecasted prices. In the above case:
S’t = α At + (1-α) S’t-1
S”t = α S’t + (1-α) S”t-1
S’”t = α S”t + (1-α) S”’t-1
at = 3 S’t - 3 S”t + S’”t
ct = [α^2 / (1 - α)^2] × (S’t - 2 S”t + S’”t)
bt = [α / (2 (1 - α)^2)] × [(6 - 5α) S’t - (10 - 8α) S”t + (4 - 3α) S”’t]
Ft+m = at + bt m + ½ ct m^2
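Both of Brown's schemes reduce to a few lines of code once the smoothed series are updated recursively. The sketch below is a minimal illustration on an invented, trending price series with α = 0.4; initialising all smoothers at the first observation is an assumption, not part of the formulas above.

```python
import numpy as np

def brown_linear(actuals, alpha, m=1):
    """Brown's one-parameter linear (double) exponential smoothing forecast."""
    s1 = s2 = actuals[0]                       # initialise both smoothers at A[0]
    for a in actuals[1:]:
        s1 = alpha * a + (1 - alpha) * s1
        s2 = alpha * s1 + (1 - alpha) * s2
    at = 2 * s1 - s2
    bt = alpha / (1 - alpha) * (s1 - s2)
    return at + bt * m

def brown_quadratic(actuals, alpha, m=1):
    """Brown's quadratic (triple) exponential smoothing forecast."""
    s1 = s2 = s3 = actuals[0]
    for a in actuals[1:]:
        s1 = alpha * a + (1 - alpha) * s1
        s2 = alpha * s1 + (1 - alpha) * s2
        s3 = alpha * s2 + (1 - alpha) * s3
    at = 3 * s1 - 3 * s2 + s3
    bt = alpha / (2 * (1 - alpha) ** 2) * ((6 - 5 * alpha) * s1
                                           - (10 - 8 * alpha) * s2
                                           + (4 - 3 * alpha) * s3)
    ct = alpha ** 2 / (1 - alpha) ** 2 * (s1 - 2 * s2 + s3)
    return at + bt * m + 0.5 * ct * m ** 2

prices = np.array([100.0, 102.0, 104.0, 107.0, 111.0, 116.0])
print(round(brown_linear(prices, alpha=0.4, m=1), 2))
print(round(brown_quadratic(prices, alpha=0.4, m=1), 2))
```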

9. Forecasting Methods of Curve Fitting (Regression)


Curve fitting methods aim to explain the variation in the time series using statistical
techniques. These methods do not account for seasonal or cyclical effects and assign equal
weight to the data. The available methods include the following (a brief fitting sketch for two of these forms appears after the list):
(a) Linear Regression: This commonly used method fits a linear trend through historical data
points by minimizing the squared differences between the points and the trend line. This is
done using statistical formulas to determine the slope (b) and the intercept (a) of the trend
line. The resulting equation, Yt = a + b(t), can then be applied, where 't' represents time
(horizontal axis) and 'Y' represents the observed values of the time series (vertical axis).
(b) Exponential Function: This approach utilizes a curve that either increases or decreases,
proving beneficial in situations where there has been growth or decline in previous periods. It
is expressed as follows.:
Yt = a e^(bt) or ln (Yt) = ln (a) + bt
(c) Power Function: Similar to its predecessor, a power function gives a forecast curve that
increases or decreases at a different rate and is expressed as:
Yt = a t^b or ln (Yt) = ln (a) + b [ln (t)]

(d) Logarithmic Function: This method uses an alternate logarithmic model, and is expressed
as:
Yt = a + b [ln (t)]
(e) Gompertz Function: This method attempts to fit a 'Gompertz' or 'S' curve and is
expressed as:
Yt = e^[a + (b/t)] or ln (Yt) = a + (b/t)
(f) Logistic Function: This method attempts to fit a 'Logistic' curve, expressed as:
Yt = 1 / [(1/u) + a b^t] or ln [(1/Yt) – (1/u)] = ln (a) + t ln (b)
where 'u' is the upper boundary value.
(g) Parabola or Quadratic Function: This technique aims to fit a 'Parabolic' curve to forecast
a damped data series, expressed as:
Yt = a + bt + ct^2
(h) Compound Function: This approach generates a forecasting curve that experiences
compound growth or decline, expressed as:
Yt = a b^t or ln (Yt) = ln (a) + t ln (b)
(i) Growth Function: This approach generates a forecasting curve based on an estimated
growth rate, expressed as:
Yt = e^(a + bt) or ln (Yt) = a + bt
(j) Cubic Function: This approach seeks to fit a 'Cubic' curve, expressed as:
Yt = a + bt + ct^2 + dt^3
(k) Inverse Function: This method attempts to fit an 'Inverse' curve, expressed as:
Yt = a + (b/t)
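To illustrate two of these forms, the sketch below fits the linear form (a) and the power form (c) to an invented series of annual values by ordinary least squares, using the logarithmic transformation shown above for the power function, and produces a one-step-ahead forecast from each.

```python
import numpy as np

# Hypothetical annual observations (values invented for illustration only)
y = np.array([12.0, 14.5, 16.2, 19.0, 21.8, 24.1, 27.5])
t = np.arange(1, len(y) + 1)

# (a) Linear trend: Y = a + b*t, fitted by ordinary least squares
b_lin, a_lin = np.polyfit(t, y, 1)

# (c) Power function: Y = a*t^b, linearised as ln(Y) = ln(a) + b*ln(t)
b_pow, ln_a = np.polyfit(np.log(t), np.log(y), 1)
a_pow = np.exp(ln_a)

t_next = len(y) + 1
print("Linear forecast:", round(a_lin + b_lin * t_next, 2))
print("Power forecast :", round(a_pow * t_next ** b_pow, 2))
```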

10. ARIMA Model


Auto-Regressive (AR) models can be combined effectively with Moving Average (MA)
models to create a time series model known as Auto-Regressive Moving Average (ARMA)
models. AR models use the lagged values of the variable itself, whereas MA models use the
lagged error terms to represent the dependent variable. An AR process of order 1 can be
represented as an MA process of infinite order, and vice versa. However, these models are relevant
only in the context of stationary data. To adapt this category of models for non-stationary
series, the option of differencing the data series is introduced, leading to Auto-Regressive
Integrated Moving Average (ARIMA) models.

The widespread adoption of ARIMA models is credited to Box and Jenkins (1970), who
introduced a diverse range of ARIMA models, with the general non-seasonal model denoted
as ARIMA (p,d,q).
“AR (p) denotes the order of the auto-regressive part

I (d) shows the degree of first differencing involved

MA (q) denotes the order of the moving average part”

The Box-Jenkins methodology comprises four key steps.


A. Identification: The initial stage is ‘Identification’, which entails finding the suitable
values of p, d, and q using correlograms and partial correlograms.
B. Estimation: The next stage is ‘Estimation’, where the auto-regressive and moving
average parameters in the model are estimated, typically using the ordinary least
squares (OLS) method, once the appropriate p and q values have been identified.
C. Diagnostic Checking: ‘Diagnostic Checking’ evaluates the adequacy of the selected
ARIMA model in fitting the data. This process includes testing whether the residuals
estimated from the model display characteristics of white noise. If the residuals fail
to meet this criterion, it may be necessary to restart the modeling process.
D. Forecasting: Finally, the fourth step is ‘Forecasting’, where ARIMA modeling is
known for its success, particularly for short-term forecasts. The predictions derived
from ARIMA modeling are frequently more dependable than those obtained from
conventional econometric modeling.

A. Identification
The potential existence of a wide range of ARIMA models can sometimes pose challenges in
determining the most suitable model and the following steps will address this challenge:
• Plot the data and detect any anomalies. Begin by plotting the data to identify
anomalies and assess if a transformation is required to stabilize the variability in the
time series. If necessary, apply a transformation to ensure stationarity in the series.
• After transforming the data (if needed), evaluate whether the data exhibit
stationarity by examining the time series plot, Autocorrelation Function (ACF), and
Partial Autocorrelation Function (PACF). A time series is likely stationary if the plot
shows data scattered around a constant mean, indicating the mean-reverting
property. Additionally, stationarity is suggested if the ACF and PACF values drop to
near zero. Conversely, non-stationarity is implied if the time series plot is not
horizontal or if the ACF and PACF do not decline toward zero.

• If the data remains non-stationary, consider applying techniques such as
differencing or detrending to achieve stationarity. For seasonal data, apply seasonal
differencing to the already differenced data. Typically, no more than two differencing
operations are needed to achieve a stationary time series.
• Once stationarity is achieved, examine autocorrelations to identify any remaining
patterns. Consider the following possibilities:
a. Seasonality may be indicated by large autocorrelations and/or partial
autocorrelations at the seasonal lags significantly different from zero.
b. Patterns in autocorrelations and partial autocorrelations may indicate the
potential for AR or MA models. If the ACF shows no significant
autocorrelations after lag q, this could indicate the suitability of an MA (q)
model. Similarly, if no significant partial autocorrelations remain after lag
p, an AR (p) model might be appropriate.
c. Without a clear indication of an MA or AR model, a mixed ARMA or ARIMA
model may be required.
Applying the Box-Jenkins methodology requires experience and sound judgment, with guiding
principles in mind.
• Establishing Stationarity: A preliminary analysis of the raw data helps
determine whether the time series is stationary in both its mean and
variance. Non-stationarity can often be addressed using differencing
(seasonal or non-seasonal) and transformations such as logarithmic or
power transformations.
• Considering Non-Seasonal Aspects: After achieving stationarity, examine
the ACF and PACF plots to evaluate the possibility of an MA or AR model
for non-seasonal data.
• Considering Seasonal Aspects: For seasonal aspects, the ACF and PACF
plots at seasonal lags help identify potential seasonal AR or MA models.
However, identifying seasonal components can be more complex and less
obvious compared to non-seasonal patterns.
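A minimal identification sketch is shown below. It assumes the statsmodels package is available, generates an illustrative random-walk series in place of real data, applies the Augmented Dickey-Fuller test to choose the degree of differencing d, and prints the ACF and PACF values that would guide the choice of p and q.

```python
# A minimal identification sketch, assuming the statsmodels package is
# installed; the series below is simulated purely for illustration.
import numpy as np
from statsmodels.tsa.stattools import adfuller, acf, pacf

rng = np.random.default_rng(0)
series = 100 + np.cumsum(rng.normal(size=200))   # illustrative random walk

# Augmented Dickey-Fuller test: a small p-value suggests stationarity
p_value = adfuller(series)[1]
d = 0
while p_value > 0.05 and d < 2:                  # difference at most twice
    series = np.diff(series)
    p_value = adfuller(series)[1]
    d += 1
print("Suggested degree of differencing d =", d)

# ACF and PACF of the (differenced) series help suggest q and p
print("ACF :", np.round(acf(series, nlags=10), 2))
print("PACF:", np.round(pacf(series, nlags=10), 2))
```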

B. Estimation
Once a tentative model identification has been made, the AR and MA parameters, both
seasonal and non-seasonal, must be determined most effectively. For instance, consider a
class of model identified as ARIMA (0,1,1), which is a family of models dependent on one MA
coefficient θ1:
(1-B)Yt = (1- θ1 B) et
The objective is to obtain the best estimate of θ1 so that the time series being modeled is fitted
effectively. While the least squares method can be utilized for ARIMA models, similar to
regression, models involving an MA component (i.e., where q > 0) do not have a simple
formula for estimating the coefficients; instead, an iterative method must be employed. The
general ARIMA model's statistical assumptions enable the computation of useful summary
statistics once the optimum coefficient values have been estimated. Each coefficient can be
associated with a standard error, enabling the conduct of a significance test based on the
parameter estimate and its standard error. An ARIMA (3,1,0) model will be of the form:
Y't = Ø1 Y't-1 + Ø2 Y't-2 + Ø3 Y't-3 + et, where Y't = Yt - Yt-1 is the first-differenced series

C. Diagnostic Checking
The diagnostic examination of the selected model is essential to ensure its adequacy. This
involves studying the residuals to identify any unaccounted patterns. Although calculating
the errors in an ARIMA model is more complex than in an ordinary least squares (OLS) model,
these errors are automatically generated as part of the ARIMA model estimation process. For
the model to be considered reliable for forecasting, the residuals left after fitting the model
should resemble white noise. A white noise model is characterized by residuals with no
significant autocorrelations and partial autocorrelations. One way to assess the correctness
of the model fit is by examining the residuals. Usually, the count of residuals will be n - d -
sD, where n denotes the number of observations, d, and D are the degrees of non-seasonal
and seasonal differencing, respectively, and s represents the number of observations per
season. Standardizing the residuals in plots is a common practice to ensure the variance
equals one, which helps in identifying potential outliers more easily. Any residuals smaller
than -3 or larger than 3 are regarded as outliers and may require further scrutiny. The residual
series is white noise if no outliers exist and the ACF or PACF is within the limits. Once this
step is confirmed, the next stage is actual forecasting.

D. Forecasting with ARIMA Models


An ARIMA (0,1,1)(0,1,1)12 model (with a seasonal period of 12) is described as
(1 - B)(1 - B^12) Yt = (1 - θ1 B)(1 - Θ1 B^12) et
where (1 - B) is the non-seasonal difference, (1 - B^12) the seasonal difference, (1 - θ1 B) the
non-seasonal MA(1) term, and (1 - Θ1 B^12) the seasonal MA(1) term.

To employ an identified model for forecasting, it is crucial to extend the equation and present
it in a conventional regression equation format. In the specified model, the equation is
expressed as follows:
Yt = Yt-1 + Yt-12 - Yt-13 + et - θ1 et-1 - Θ1 et-12 + θ1 Θ1 et-13
To forecast one period ahead, i.e., Yt+1, the subscripts are incremented by one throughout the
equation:
Yt+1 = Yt + Yt-11 - Yt-12 + et+1 - θ1 et - Θ1 et-11 + θ1 Θ1 et-12
While the term et+1 will not be known, the fitted model allows for replacing et, et-11, and et-12
with their empirically determined values, which are the residuals for times t, t-11, and t-12,
respectively. As the forecasting extends further into the future, there will be no empirical
values for the error terms, and their expected values will be zero. Initially, the Y values in the
equation will be known past values (Yt, Yt-11, and Yt-12). However, as the forecasting
progresses, these Y values will transition to forecasted values rather than known past values
(Makridakis et al., 1998).
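The estimation, diagnostic checking, and forecasting steps can be run end-to-end with standard software. The sketch below assumes the statsmodels package and uses an illustrative simulated series with an arbitrarily chosen ARIMA(1,1,1) order; in practice the order would come from the identification stage described earlier.

```python
# A minimal Box-Jenkins sketch, assuming the statsmodels package is installed;
# the series and the ARIMA(1,1,1) order are chosen purely for illustration.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(1)
series = 100 + np.cumsum(rng.normal(size=200))   # illustrative data

# Estimation: fit the tentatively identified model
results = ARIMA(series, order=(1, 1, 1)).fit()
print(results.summary())

# Diagnostic checking: residuals should behave like white noise
print("Residual ACF:", np.round(acf(results.resid, nlags=10), 2))

# Forecasting: predictions for the next 12 periods
print(np.round(results.forecast(steps=12), 2))
```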

11. Markov Chain Probabilistic Forecast


A Markov Chain is a mathematical model representing a system whose state transitions from
one state to another according to certain probabilistic rules. There is a condition of
interdependence where each state in a specific stage relies directly on one of the states from
preceding stages. The model assumes that knowledge of the present condition renders the
past state uninformative for predicting the future (Jain & Agarwal, 1992; Ramasubramanian,
2003). The states could have different conditions, situations, or levels. Markov Chains are
commonly used in various fields, including probability theory, statistics, economics, and
computer science. In the context of forecasting, Markov Chains can be used to make
probabilistic predictions of the states of a system. To better understand the model, let's
consider a scenario where a consumer has only three choices: bread, pizza, and cake, referred
to as states in the model. The restaurant serves one state daily, and the choice depends on
the state served the previous day. Assuming bread was served on day 0, it won't be served
on the subsequent day (day 1). Therefore, the consumer can anticipate the current state
based on past choices and predict future states based on the present state. For instance, on
day 1, pizza or cake will likely be served, but the choice for day 2 depends solely on the state
of day 1. Once the state of day 1 is determined, what will be served on day 2 becomes
independent of day 0 and is influenced only by day 1, a dependence to which the Markov model assigns transition probabilities.

Probabilistic forecasting in the Markov chain model starts with defining the state. The
probabilities of transition from one state to another have to be determined. These
probabilities are often represented in a transition matrix. Each entry in the matrix represents
the probability of transitioning from one state to another. Once the system's initial state is
specified, transition probabilities can be used to simulate sequences of states over time. This
can be done iteratively, where the current state determines the next state based on the
transition probabilities. A distribution of possible future states can be built by simulating
multiple state sequences. This distribution provides a probabilistic forecast of the system's
future behavior (Paul, 2012).
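The sketch below illustrates the idea with the bread/pizza/cake example, using an invented transition matrix. It simulates a sequence of daily states and also computes an n-step-ahead probabilistic forecast by powering the transition matrix.

```python
import numpy as np

# States and an invented transition matrix (each row sums to one):
# the probability of tomorrow's item given today's item
states = ["bread", "pizza", "cake"]
P = np.array([[0.0, 0.6, 0.4],    # from bread
              [0.3, 0.2, 0.5],    # from pizza
              [0.5, 0.3, 0.2]])   # from cake

rng = np.random.default_rng(0)

def simulate(start, n_days):
    """Simulate a sequence of states using the transition probabilities."""
    seq, current = [start], states.index(start)
    for _ in range(n_days):
        current = rng.choice(len(states), p=P[current])
        seq.append(states[current])
    return seq

print(simulate("bread", 7))

# n-step-ahead probabilistic forecast: initial distribution times P^n
initial = np.array([1.0, 0.0, 0.0])          # bread served on day 0
print(np.round(initial @ np.linalg.matrix_power(P, 3), 3))
```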

12. Artificial Neural Network (ANN)


Neural Networks (NNs) have precise mapping abilities, allowing them to link input patterns
to corresponding output patterns. These networks learn through examples, undergoing
training with known instances of a problem to later showcase inference capabilities on
unfamiliar cases, which enables them to identify untrained objects. NNs can generalize,
predict new outcomes from historical trends, and exhibit robustness and fault tolerance,
enabling the recall of complete patterns from incomplete, partial, or noisy data. NNs
efficiently process information, operating in parallel, at high speed, and in a distributed
manner.

[Figure: artificial neural network. Source: Jha (2007)]

In time series analysis, ANN techniques, including the ANN (p,d,q) model, are employed for
forecasting, with the aim of improving predictive precision and addressing the intricacies of
time-dependent data. A key benefit of ANN models compared to other non-linear models is
their capability as universal approximators, allowing them to approximate a wide range of
functions with high precision. Evaluating ANNs against alternative forecasting methods, such
as linear regression and exponential smoothing, has provided valuable insights into the
comparative efficacy of diverse techniques within specific domains, such as stock market and
sales prediction. Compared with traditional statistical methods such as ARIMA, ANNs have
been shown to improve forecasting accuracy in many applications. As a result, ANN models
are increasingly utilized in time series forecasting to enhance predictive capabilities and
improve the accuracy of forecasts. Their flexibility and effectiveness in capturing complex
temporal patterns make them valuable in time series analysis. The combination of ANNs with
genetic algorithms and deep belief networks has also been explored to optimize forecasting
models, addressing challenges such as model overfitting and lack of interpretability.
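As a minimal illustration of lag-based neural network forecasting (not the specific architectures discussed above), the sketch below assumes scikit-learn is installed, builds a supervised data set from four lagged values of an illustrative simulated series, trains a small multilayer perceptron, and reports its error on a held-out segment.

```python
# A minimal lag-based neural network forecast, assuming scikit-learn is
# installed; the series and network settings are purely illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
series = 100 + np.cumsum(rng.normal(size=300))   # illustrative data
p = 4                                            # number of lags used as inputs

# Build a supervised data set: predict X[t] from X[t-1], ..., X[t-p]
X = np.array([series[t - p:t] for t in range(p, len(series))])
y = series[p:]

model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
model.fit(X[:-20], y[:-20])                      # hold out the last 20 points

pred = model.predict(X[-20:])
rmse = float(np.sqrt(np.mean((y[-20:] - pred) ** 2)))
print("Hold-out RMSE:", round(rmse, 3))
```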

13. Forecasting through Artificial Intelligence (AI)


The application of AI in time series forecasting has been widely applied across various
domains, particularly through Machine Learning (ML) methods and Neural Networks (NNs).
It has been utilized to enhance the prediction accuracy and address the data complexities. AI
methods, including ANN, have demonstrated potential in enhancing forecast accuracy and
dealing with issues associated with seasonality in time series data. Furthermore, optimizing
input attributes using AI methods has shown promising results in improving forecasting
accuracy. Additionally, the comparison of different AI models, such as Radial Basis Function
Neural Networks and Feedforward Neural Networks, has shown the superiority of certain AI
models in daily river flow forecasting, highlighting the relevance of AI in addressing
hydrological forecasting needs. In the agricultural sector, AI techniques, including Artificial
Neural Networks, have been employed for forecasting agricultural price fluctuations,
emphasizing the relevance of AI in addressing agricultural forecasting needs. Additionally, AI
has been applied to diverse domains, including wind power forecasting, traffic flow prediction,
and electricity load forecasting, demonstrating its versatility in addressing
forecasting needs across different sectors.

Measuring Accuracy of Forecast
Forecast Errors: Forecast error refers to the disparity between forecasted and actual values
(test data). The precision of the aforementioned forecasting models can be enhanced by
minimizing specific criteria, such as:

Mean Error (ME) = (1/n) Σ (At - Ft) = (1/n) Σ et

Mean Absolute Error (MAE) = (1/n) Σ |et|

Sum of Squared Error (SSE) = Σ (At - Ft)^2 = Σ et^2

Root Mean Squared Error (RMSE) = √[ Σ (At - Ft)^2 / (n - 1) ]

Mean Squared Error (MSE) = (1/n) Σ (At - Ft)^2 = (1/n) Σ et^2

Percent Error (PEt) = [(At - Ft) / At] × 100

Mean Percent Error (MPE) = (1/n) Σ PEt

Mean Absolute Percent Error (MAPE) = (1/n) Σ |PEt|

Theil's U-statistic (out-of-sample forecast) = √{ Σ [(Ft+1 - At+1) / At]^2 / Σ [(At+1 - At) / At]^2 },
with both sums running from t = 1 to n - 1.

For Theil’s statistics, if U equals 1, it indicates that the naïve method is as effective as the
forecasting technique being evaluated. If U is less than 1, the forecasting technique is
considered to perform better than the naïve method, with smaller U values suggesting greater
superiority. On the other hand, if U is greater than 1, the formal forecasting method does not
provide any benefit, as the naïve method would yield better results. While there are numerous
criteria for assessing forecast accuracy, a few are elaborated in the subsequent section.

1. Forecast Error: The forecast error serves as a metric for evaluating the accuracy of a
forecast at a specific point in time. It is computed as the difference between actual and
forecast values. It is represented as:
et = At - Ft
However, analyzing forecast errors for individual periods may not provide comprehensive
insights. Therefore, it is essential to examine the accumulation of errors over time. Merely
observing the cumulative et values may not provide meaningful insights, as positive and
negative errors offset each other. Relying solely on these values could lead to a false sense
of confidence. For instance, when the original data are compared with the associated pair of
forecasts generated by two different methods, examining the accumulated forecast errors over
time makes it evident which method has produced superior forecasts.
2. Mean Absolute Deviation (MAD): To address the issue of positive errors offsetting
negative errors, a straightforward approach involves considering the absolute value of the
error, disregarding its sign. This yields the absolute deviation, which represents the size of
the deviation irrespective of its direction. Subsequently, the mean absolute deviation (MAD)
is computed by determining the average value of these accumulated absolute deviations.
3. Mean Absolute Percent Error (MAPE): The mean absolute percentage error (MAPE) is
computed by averaging the percentage difference between the fitted (forecast) data and the
original data. If the best-fit method yields a high MAPE (e.g., 40 per cent or more), it indicates
that the forecast may not be particularly reliable for various reasons.
MAPE = [ | et / At | x 100] / n
where 'A' represents the original series, 'e' represents the original series minus the forecast,
and 'n' denotes the number of observations.
4. Root Mean Squared Error (RMSE): The root mean square error (RMSE) is calculated by
taking the square root of the average of the squared errors. It provides a measure of how
much the forecast deviates from the actual data.
RMSE = √( Σ et^2 / n )
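The criteria above are straightforward to compute; the sketch below evaluates several of them, including Theil's U as defined earlier, on a pair of invented actual and forecast series.

```python
import numpy as np

def accuracy_measures(actual, forecast):
    """A minimal sketch of common forecast-error criteria."""
    e = actual - forecast
    pe = e / actual * 100
    # Theil's U: forecast changes compared with naive (no-change) changes
    num = ((forecast[1:] - actual[1:]) / actual[:-1]) ** 2
    den = ((actual[1:] - actual[:-1]) / actual[:-1]) ** 2
    return {
        "ME": e.mean(),
        "MAD": np.abs(e).mean(),
        "MSE": (e ** 2).mean(),
        "RMSE": np.sqrt((e ** 2).mean()),
        "MPE": pe.mean(),
        "MAPE": np.abs(pe).mean(),
        "Theil U": np.sqrt(num.sum() / den.sum()),
    }

actual = np.array([100.0, 102.0, 101.0, 105.0, 107.0, 106.0])
forecast = np.array([99.0, 101.0, 103.0, 104.0, 106.5, 107.0])
for name, value in accuracy_measures(actual, forecast).items():
    print(f"{name:8s}: {value:8.3f}")
```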

Monitoring Forecast Accuracy over Time


Tracking Signal: The tracking signal is valuable for continuously monitoring the quality of the
earlier discussed forecasting methods. It involves calculating a daily tracking signal value and
assessing whether it falls within an acceptable range. If the tracking signal moves beyond the
acceptable range, it indicates that the forecasting method is no longer producing accurate

predictions. Tracking signals are crucial for detecting any bias in the forecasting process. Bias
occurs when the forecast consistently overestimates or underestimates the actual data
values. The tracking signal is computed as follows:

Tracking Signal (T.S.) = Algebraic Sum of Forecast Errors (ASFE) / MAD
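A minimal tracking-signal computation is shown below on the same kind of invented data; the control limits used in practice (often a few multiples of MAD on either side of zero) are a matter of convention and are only hinted at in the comment.

```python
import numpy as np

def tracking_signal(actual, forecast):
    """Tracking signal = algebraic sum of forecast errors / MAD."""
    e = actual - forecast
    mad = np.abs(e).mean()
    return e.sum() / mad

actual = np.array([100.0, 102.0, 101.0, 105.0, 107.0, 106.0])
forecast = np.array([99.0, 101.0, 103.0, 104.0, 106.5, 107.0])
print(round(tracking_signal(actual, forecast), 2))
# Values that stay persistently outside a preset band around zero suggest bias.
```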

Conclusions
While numerous forecasting methods and approaches are available, it is evident that there is
no universal single technique suitable for all situations. The selection of a forecasting method
depends on numerous factors, including the pattern of data, desired accuracy, time
constraints, complexity of the situation, the projection period, available resources, and the
forecaster's experience. These factors are interconnected, as a shorter forecasting time may
compromise accuracy, while a longer time frame may enhance accuracy and increase costs.
The key to a precise forecast is finding the right balance among these factors. Generally, the
best forecasts are derived from straightforward and uncomplicated methods. Research
indicates that combining individual forecasts can improve accuracy, while adding quantitative
forecasts to qualitative forecasts may reduce accuracy. However, the optimal combinations
of forecasts and the conditions for their effectiveness have not been fully elucidated.
Combining forecasting techniques typically yields higher-quality forecasts than relying on a
single method, as it allows for compensating for the weaknesses of any particular technique.
By choosing complementary approaches, the shortcomings of one method can be offset by
the strengths of another. Even when quantitative methods are used, they can be combined
with or supplemented by qualitative judgments, and forecasts can be reviewed or adjusted
based on qualitative assessments. It is essential to recognize that the forecasts made by data
analysts are intended for 'decision support,' rather than direct ‘decision-making.’

Bibliography
Armstrong, J.S. 2004. “Principles of Forecasting”, Kluwer Academic Publishers.
Box, G.E. and G. Jenkins. 1970. “Time Series Analysis, Forecasting and Control”, San Francisco:
Holden-Day.
Chatfield, C. 2000. “Time-Series Forecasting”, Chapman & Hall/Crc.
Eğrioğlu, E., Yolcu, U., Aladağ, Ç. H., & Baş, E. (2014). Recurrent multiplicative neuron model
artificial neural network for non-linear time series forecasting. Neural Processing
Letters, 41(2), 249-258.
Gautam, N., Ghanta, S. N., Mueller, J. L., Mansour, M., Chen, Z., Puente, C., … & Al’Aref, S. J.
(2022). Artificial intelligence, wearables and remote monitoring for heart failure:
current and future applications. Diagnostics, 12(12), 2964.
Gaynor, E.P and R.C.Kirkpatrick. 1994. “Introduction to Time Series Modeling and Forecasting
in Business and Economics”, McGraw-Hill, Inc.
Gujarati, N.D. and Sangeetha. 2007. “Basic Econometrics”, Tata McGraw-Hill Publishing
Company Limited, New Delhi.
Hamzaçebi, Ç. (2008). Improving artificial neural networks’ performance in seasonal time
series forecasting. Information Sciences, 178(23), 4550-4559.
Hanke, E.J., Dean W. Wichern and Arthur G. Reitsch. 2005. “Business Forecasting”, Pearson
Education.
Jain, R., & Agarwal, R. (1992). Probability Model for Crop Yield Forecast. Biometrical Journal,
34(4), 501-11.
Jha, G. K. (2007). Artificial neural network and its applications in agriculture. New Delhi: IARI.
Khashei, M. and Bijari, M. (2010). An artificial neural network (p,d,q) model for time series
forecasting. Expert Systems with Applications, 37(1), 479-489.
Makridakis, S., S.C.Wheelwright and R.J.Hyndman. 1998. “Forecasting - Methods and
Applications”, New York: John Wiley and Sons, Inc.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). Statistical and machine learning
forecasting methods: concerns and ways forward. PLOS ONE, 13(3), e0194889.
Nasurudeen P, Thimmappa K, Anil Kuruvila, Sendhil R and V Chandrasekar, ‘Forecasting of
Paddy Prices: A Comparison of Forecasting Techniques’, Market Forecasting Centre,
Department of Agricultural Economics, PJN College of Agriculture and Research
Institute, Karaikal, 2007.
Paul, R. K. (2012). Forecasting Using Markov Chain. New Delhi: Indian Agricultural Statistics
Research Institute.

Ramasubramanian, V. (2003). Forecasting Techniques in Agriculture. Agricultural and Food
Sciences, 1-15.
Yaseen, Z. M., El-Shafie, A., Afan, H. A., Hameed, M. M., Mohtar, W. H. M. W., & Hussain, A.
(2015). Rbfnn versus ffnn for daily river flow forecasting at Johor River, Malaysia.
Neural Computing and Applications, 27(6), 1533-1542.

Chapter-10
Emerging Trends and Technology for Data
Driven Market Research
R. Narayana Kumar
Principal Scientist and SIC, Madras Regional Station of ICAR-CMFRI, Chennai

Introduction
Market research has undergone a profound transformation over the decades, evolving from
rudimentary tabular analyses to the application of sophisticated econometric and statistical
models. This evolution reflects the growing complexity of markets and the need for more
nuanced insights into economic behaviors and trends. This chapter explores emerging trends
and technologies in data-driven market research, focusing on fisheries. This study area
provides a unique lens to examine advancements in market research methodologies and their
applications. Historically, market research relied heavily on basic tabular data and
straightforward statistical techniques to analyze market trends and consumer behavior. While
these methods offered valuable insights, they were often limited in capturing the intricacies
of market dynamics and price fluctuations. As markets became more complex and data more
abundant, the need for advanced analytical techniques became increasingly apparent. Today,
sophisticated econometric models and analytical tools enable researchers to conduct more
in-depth analyses and generate more accurate forecasts.
In the context of fisheries, understanding price behavior and market efficiency is crucial. Price
behavior encompasses how fish prices fluctuate in response to various factors such as supply,
demand, and external market conditions. Analyzing price behavior helps stakeholders in the
fisheries sector make informed decisions about pricing strategies, market-entry, and
inventory management. On the other hand, market efficiency refers to how well markets
adjust to changes and allocate resources effectively. It involves assessing the per cent share
of producers in the consumer rupee and analyzing market price indices across wholesale and
retail markets. The development of Fish Market Information Systems (FMIS) and Fish Price
Information Systems (FPIS) represents a significant advancement in the field. FMIS provides
real-time data on market arrivals, sales, and prices, facilitating better decision-making for
fishermen, wholesalers, and retailers. Similarly, FPIS offers detailed information on fish prices,
enabling stakeholders to track price trends and make informed trading decisions. These
systems enhance transparency and efficiency in the fish market, ultimately benefiting all
participants.

Emerging technologies have also played a crucial role in advancing market research
methodologies. For instance, ARIMA models and time series analysis offer powerful tools for
forecasting and understanding long-term trends in market data. ARIMA models, with their
capacity to account for autocorrelation and seasonal variations, provide valuable insights into
future price movements and market dynamics. Time series analysis allows researchers to
dissect data into its parts—trend, seasonal, cyclical, and irregular elements—enabling more
precise predictions and a better understanding of market behavior. Conjoint analysis is
another advanced technique that has gained prominence in market research. By examining
consumer preferences and choices, conjoint analysis helps researchers understand how
different product or service attributes influence consumer decisions. This method is
particularly useful for identifying factors driving demand and tailoring products or services to
meet consumer needs.
Integrating these advanced analytical methods with technologies like Fish Trade Platforms
(FTP) and E-Auction Platforms is transforming the fisheries sector. FTPs facilitate online
trading and auctioning of fish, enhancing market access and efficiency. By providing a real-
time transaction and price discovery platform, these systems support more effective
distribution and consumption of fish products.
Market research has evolved dramatically, driven by advancements in econometric models
and analytical technologies. Applying these methods in fisheries offers valuable insights into
price behavior, market efficiency, and consumer preferences. By leveraging technologies such
as FMIS, FPIS, and advanced analytical tools, stakeholders in the fisheries sector can gain a
deeper understanding of market dynamics and make more informed decisions. As market
research advances, integrating these emerging trends and technologies will play a crucial role
in shaping the future of economic analysis and decision-making.

Market Research Components


Market Research: Price Behaviour and Marketing Efficiency
Price Behaviour
Understanding price behavior is crucial in market research as it provides insights into how
prices fluctuate across different supply chain stages, from the landing center to the retail
market.
Landing Centre (or Harvest Centre)
Fish are first brought ashore at the landing center, also known as the harvest center. Prices
at this stage are typically influenced by the volume of the catch, species availability, and the

immediate demand from wholesalers. Market research at the landing center involves
monitoring these factors to predict price trends and manage supply efficiently. Effective price
behavior analysis at this level helps in setting baseline prices and ensures that fishermen get
a fair initial return for their catch.
Wholesale Market (Points of First Sales)
The wholesale market is the next critical point in the supply chain where the first sales occur.
Here, wholesalers buy fish in bulk, distributing them to retailers or other intermediaries. At
this stage, prices are influenced by transportation costs, storage conditions, and bulk
purchase agreements. Wholesale price behavior is essential for understanding how bulk
purchases and logistical considerations impact the overall pricing structure. Market research
in this area focuses on optimizing supply chain efficiencies and identifying opportunities for
reducing costs to maximize profitability.
Retail Market (Points of Last Sales)
The retail market represents the final stage where fish reach the consumers. A range of
factors including consumer demand, competition, marketing strategies, and value-added
services determines prices here. Retail price behavior analysis helps in understanding
consumer preferences and buying patterns, which are crucial for setting competitive prices
and enhancing customer satisfaction. By studying retail market dynamics, businesses can
tailor their offerings to meet consumer needs more effectively and improve market share.
Marketing Efficiency
Marketing efficiency refers to how well the market functions in distributing products from
producers to consumers. It involves analyzing the percentage share of producers in the
consumer rupee and market price indices at both wholesale and retail levels.
Share of Producer in the Consumer Rupee
This metric indicates the proportion of the final retail price that goes back to the producers.
A higher share suggests a more efficient market where producers receive a fair return for
their products. Market research aims to maximize this share by identifying and eliminating
inefficiencies in the supply chain.
Market Price Indices
Market price indices provide a benchmark for tracking price movements over time at both
wholesale and retail levels. These indices help in comparing current prices with historical data
to identify trends, forecast future prices, and make informed decisions.
Wholesale Market Price Index
This index measures the price changes at the wholesale level, reflecting the cost dynamics
involved in bulk purchasing and distribution.

Retail Market Price Index
This index tracks the price variations at the retail level, providing insights into consumer price
sensitivity and purchasing behavior. By analyzing these indices, market researchers can
develop strategies to stabilize prices, ensure fair returns for producers, and maintain
competitive pricing for consumers. This comprehensive understanding of price behavior and
marketing efficiency is essential for creating sustainable and profitable market systems in the
fisheries sector.

Development of Fish Market Information Systems (FMIS) and Related Technologies


Development of a Fish Market Information System (FMIS)
A Fish Market Information System (FMIS) is designed to gather, process, and disseminate
comprehensive market data crucial for all fisheries sector stakeholders. The FMIS serves as a
centralized database that provides real-time information on various aspects of the fish
market, such as market arrivals, sales volumes, and current prices. By integrating data from
multiple sources, FMIS enhances transparency and enables informed decision-making. The
system can help fishermen plan their harvests more efficiently, wholesalers manage their
inventories better, and retailers adjust their sales strategies in response to market trends.
FMIS aims to create a more streamlined and responsive market environment that benefits all
participants.
Developing a Fish Price Information System (FPIS)
The Fish Price Information System (FPIS) is a specialized component of the FMIS that
focuses specifically on collecting and disseminating price data. FPIS provides stakeholders
with accurate and up-to-date information on fish prices at different supply chain stages, from
landing centers to retail markets. This system helps in price discovery, allowing stakeholders
to compare prices across various markets and make informed decisions about buying and
selling. FPIS can reduce market inefficiencies, prevent price manipulation, and ensure fair
pricing practices by providing timely price information. Additionally, it can help policymakers
monitor price trends and intervene when necessary to stabilize the market.
Developing and Facilitating a Fish Trade Platform (FTP) and E-Auction Platform
The Fish Trade Platform (FTP) and E-Auction Platform are innovative solutions designed to
modernize the fish trade by leveraging digital technologies. The FTP is an online marketplace
that connects fishermen, wholesalers, retailers, and consumers, facilitating direct transactions
and reducing the reliance on intermediaries. The E-Auction Platform, a key feature of the
FTP, allows for real-time bidding on fish catches, ensuring that prices reflect current market

conditions and demand. These platforms enhance the efficiency of fish distribution and
consumption by providing a transparent and competitive trading environment. They also
offer added utilities such as secure payment processing, logistics coordination, and
traceability features, which help maintain the quality and safety of fish products.
Development of a Fish Marketing Grid
Developing a fish marketing grid involves creating a comprehensive network that maps out
the fish flow from landing centers to final consumers. This grid includes detailed information
on market arrivals, sales, and prices on specific dates, providing a holistic view of the market
dynamics. The fish marketing grid helps stakeholders understand the supply chain's
bottlenecks and optimize their operations accordingly. It also aids in forecasting demand and
supply trends, allowing for better resource allocation and planning.
Market Arrivals
Market arrivals refer to the quantity of fish that is brought to market at any given time.
Tracking market arrivals is essential for understanding supply patterns and anticipating
potential surpluses or shortages. An efficient FMIS can provide real-time data on market
arrivals, enabling stakeholders to make strategic decisions about harvesting, purchasing, and
pricing.
Market Sales
Market sales data provides insights into the volume of fish sold at various points in the supply
chain, from wholesale to retail markets. This information is crucial for assessing market
demand and performance. By analyzing market sales data, stakeholders can identify trends,
adjust their marketing strategies, and improve their sales outcomes. An FMIS that includes
detailed sales data helps in creating a more responsive and adaptive market.
Price on the Date
Having accurate price information on specific dates is vital for making informed trading
decisions. The FPIS component of the FMIS ensures that stakeholders have access to up-to-
date price data, which reflects the current market conditions. This data helps set competitive
prices, negotiate deals, and plan future transactions.
Online Marketing
Online marketing is an integral part of modernizing the fisheries sector. By leveraging digital
platforms, stakeholders can reach a wider audience, engage with customers more effectively,
and enhance their sales channels. Online marketing strategies include

Analytical Methods in Market Research
Market research employs various analytical methods to understand market dynamics,
forecast trends, and make informed decisions. These methods range from statistical models
to valuation techniques, providing unique insights into consumer behavior, market trends,
and economic value.
ARIMA Models
ARIMA (Auto-Regressive Integrated Moving Average) models are used for forecasting time
series data. They are particularly useful in market research for predicting future values based
on past trends. ARIMA models combine autoregression (AR), differencing (I), and moving
average (MA) to provide a comprehensive analysis of time series data. By identifying patterns
and making accurate forecasts, ARIMA models help businesses plan inventory, set prices, and
develop marketing strategies. For example, ARIMA models can predict future fish prices based
on historical data in the fisheries market, helping stakeholders make informed decisions about
production and sales.
Time Series Analysis
Time series analysis involves examining data points collected or recorded at specific time
intervals. This method helps identify trends, seasonal patterns, and cyclical movements in the
data. In market research, time series analysis is crucial for understanding how market variables
change over time. Researchers can gain insights into the underlying factors driving market
behavior by decomposing time series data into its constituent components (trend,
seasonality, and irregular variations). For instance, analyzing fish market sales data over
several years can reveal seasonal peaks and troughs, guiding marketing and production
planning.
Decomposition Analysis
Decomposition analysis breaks down time series data into trend, seasonal, cyclical, and
irregular components. This method helps isolate and understand the effects of different
factors on the overall data pattern. In market research, decomposition analysis is valuable for
identifying long-term trends and seasonal variations. For example, in the fish market, this
analysis can separate the impact of annual fish migrations (seasonal) from overall market
growth (trend), enabling more accurate forecasting and better strategic planning.
Conjoint Analysis
Conjoint analysis is a survey-based statistical technique used to determine how people value
different product or service attributes. In market research, it helps identify the most
important features influencing consumer choices. By presenting respondents with different

product configurations and asking them to rank or choose between them, researchers can
determine the relative importance of each attribute. In the fisheries market, conjoint analysis
can reveal preferences for fish species, freshness, price, and packaging, helping businesses
tailor their offerings to meet consumer demands.
Consumer Choices/Preferences
Understanding consumer choices and preferences is fundamental to market research. This
involves studying how consumers make purchasing decisions, what factors influence their
choices, and how their preferences change over time. Analyzing consumer behavior helps
businesses develop products that meet market demand, create effective marketing
campaigns, and improve customer satisfaction. For example, by studying consumer
preferences in the fish market, businesses can identify popular fish species, preferred
packaging methods, and optimal price points, leading to more targeted and successful
marketing efforts.
Choice of Markets
The choice of markets refers to selecting target markets based on consumer demographics,
purchasing power, and market potential. Market research helps businesses identify the most
lucrative product or service markets. By analyzing market conditions, competition, and
consumer behavior, researchers can recommend which markets to enter or expand into. In
fisheries, choosing the right market involves understanding regional preferences,
consumption patterns, and market accessibility, ensuring that products reach the most
profitable and receptive audiences.
Choice of Marketing Channels
The choice of marketing channels involves selecting the most effective ways to reach and
engage with target customers. Market research identifies the channels that best match the
preferences and behaviors of the target audience. This can include traditional channels like
retail stores and wholesale markets, as well as digital channels like e-commerce platforms and
social media. In the fisheries market, choosing the right marketing channels ensures that fish
products are marketed effectively, reaching consumers through the most convenient and
accessible means.
Markov Chain Analysis
Markov Chain Analysis is a statistical method used to model random processes where future
states depend only on the current state. In market research, it captures shifting patterns in
consumption, sales, exports, and related parameters. This method is useful for predicting
customer behavior, such as brand switching and purchase frequency. In the fisheries market,

Markov Chain Analysis can track how consumer preferences shift between different fish
species over time, helping businesses anticipate changes in demand and adjust their
strategies accordingly.
Contingent Valuation Methods (WTP and WTA)
Contingent Valuation Methods (CVM) are survey-based techniques used to estimate the
economic value of non-market goods and services by asking people their willingness to pay
(WTP) for a benefit or willingness to accept (WTA) compensation for a loss. In market
research, CVM helps assess the value consumers place on environmental benefits, public
goods, or market changes. For instance, in the fisheries market, CVM can estimate consumers'
WTP for sustainably sourced fish or their WTA for the inconvenience of reduced fishing
during conservation periods.
Hedonic Pricing
Hedonic pricing is an econometric method used to estimate the value of a good or service by
breaking down its price into constituent attributes. This method is commonly used in real
estate to value properties based on location, size, and amenities. In market research, hedonic
pricing helps determine how different product attributes contribute to overall price. In the
fisheries market, this could involve analyzing how factors like fish species, freshness, size, and
region of origin impact market prices, providing insights into what consumers value most and
how to price products competitively.
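A toy hedonic-pricing sketch is given below. The fish attributes, sample size, and coefficients are invented, and only numpy is assumed; the point is simply that regressing log price on attributes decomposes price into the implicit value of each attribute.

```python
# A toy hedonic-pricing sketch with invented fish attributes; only numpy is
# assumed. Log price is regressed on the attributes to recover implicit values.
import numpy as np

rng = np.random.default_rng(1)
n = 60
size_kg = rng.uniform(0.5, 3.0, n)                 # hypothetical attribute
freshness = rng.integers(1, 6, n).astype(float)    # 1 (poor) to 5 (very fresh)
price = np.exp(4.0 + 0.5 * size_kg + 0.1 * freshness + rng.normal(0, 0.1, n))

X = np.column_stack([np.ones(n), size_kg, freshness])
coef, *_ = np.linalg.lstsq(X, np.log(price), rcond=None)
print("Effects on ln(price) [intercept, size, freshness]:", np.round(coef, 3))
# A coefficient of 0.5 on size implies roughly a 65% higher price per extra kg.
```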

These analytical methods in market research provide the tools and techniques necessary to
understand complex market dynamics, forecast trends, and make data-driven decisions. From
statistical models like ARIMA and Markov Chain Analysis to valuation methods like
contingent valuation and hedonic pricing, each method offers unique insights that help
businesses optimize their strategies and achieve better market outcomes. By leveraging these
analytical techniques, stakeholders in the fisheries market can enhance their understanding
of consumer behavior, improve market efficiency, and ensure sustainable and profitable
operations.

Market Price Method


Estimating the Economic Value of Ecosystem Products
The market price method is valuable for estimating the economic value of ecosystem
products traded in commercial markets. In the context of the fisheries sector, this method
involves estimating the consumer and producer surplus using market price and quantity data.
By capturing the economic benefits derived from fish and other marine products, this method

provides insights into the value of these resources and the impacts of market changes or
interventions. The market price method is particularly useful for valuing changes in the
quantity or quality of a good or service. For instance, it can assess the economic impact of
environmental policies, such as seasonal closures of fishing areas, on both consumers and
producers. By understanding these impacts, policymakers can make informed decisions that
balance ecological sustainability with economic viability.
Steps in Market Price Method
The market price method involves a series of steps to estimate the economic value of
ecosystem products and assess the impacts of market interventions:
1. Estimation of Market Demand Function and Consumer Surplus Before Closure: The first
step involves calculating the market demand function and consumer surplus before any
intervention or market change. This requires analyzing historical market data to determine
how much consumers are willing to pay for a given quantity of fish.
2. Estimation of Demand Function and Consumer Surplus After Closure: Next, the demand
function and consumer surplus are recalculated after the intervention or market change. For
example, if a fishing area is closed for environmental restoration, the new demand function
and consumer surplus reflect the post-intervention market conditions.
3. Estimate the Loss in Economic Surplus to Consumer: The difference in consumer surplus
before and after the intervention is then determined. This loss represents the economic
impact on consumers due to reduced availability or increased fish costs.
4. Producers’ Surplus Before and After Closure: The producers’ surplus is calculated before
and after the intervention. This involves assessing changes in production costs, market prices,
and the quantity of fish sold.
5. Economic Loss Due to Closure: The consumer and producer surplus losses are summed to
estimate the total economic loss. This comprehensive measure provides a holistic view of the
economic impact of the market intervention, capturing both consumer and producer
perspectives.

A Hypothetical Situation
To illustrate the market price method, consider a hypothetical situation where a commercial
fishing area is closed seasonally to clean up pollution. The closure aims to improve
environmental conditions and, consequently, the quality and quantity of fish available in the
future. Here’s how the market price method would be applied in this context:

1. Estimation of Market Demand Function and Consumer Surplus Before Closure (A): Analyze
historical market data to determine the demand function and consumer surplus before the
closure. This reflects the market conditions when fishing activities are ongoing.
2. Estimation of Demand Function and Consumer Surplus After Closure (B): Recalculate the
demand function and consumer surplus after the closure, considering the expected
improvements in fish quality and availability.
3. Estimate the Loss in Economic Surplus to Consumer (D): Calculate the difference in
consumer surplus before and after the closure. This loss (D) represents the economic impact
on consumers due to the temporary reduction in fish supply.
4. Producers’ Surplus Before Closure (E): Assess the producers’ surplus before the closure by
analyzing production costs, market prices, and quantities sold under normal conditions.
5. Producers’ Surplus After Closure (F): Recalculate the producers’ surplus after the closure,
considering changes in production costs and market prices due to the temporary halt in
fishing activities.
6. Loss in Producers’ Surplus (G): Determine the difference in producers’ surplus before and
after the closure. This loss (G) captures the economic impact on producers due to the
intervention.
7. Economic Loss Due to Closure (H): Sum the losses in consumer and producer surplus (D
+ G) to estimate the total economic loss due to the closure. This comprehensive measure
helps policymakers evaluate the trade-offs involved in environmental interventions.
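A toy numeric sketch of these steps is given below, assuming linear inverse demand and supply curves with invented parameters; the letters D, G, and H in the comments refer to the steps listed above.

```python
# A toy numeric sketch of steps A-H, assuming linear inverse demand
# (P = a - b*Q) and supply (P = c + d*Q) curves with invented parameters.
import numpy as np

def surplus(a, b, c, d):
    """Consumer and producer surplus at the market equilibrium."""
    q_eq = (a - c) / (b + d)          # equilibrium quantity
    p_eq = a - b * q_eq               # equilibrium price
    cs = 0.5 * (a - p_eq) * q_eq      # triangle under demand, above price
    ps = 0.5 * (p_eq - c) * q_eq      # triangle above supply, below price
    return cs, ps

cs_before, ps_before = surplus(a=200, b=2.0, c=20, d=1.0)   # before closure
cs_after, ps_after = surplus(a=200, b=2.0, c=50, d=1.5)     # costlier supply after

loss_consumer = cs_before - cs_after        # step D
loss_producer = ps_before - ps_after        # step G
print("Loss in consumer surplus:", round(loss_consumer, 1))
print("Loss in producer surplus:", round(loss_producer, 1))
print("Total economic loss (H) :", round(loss_consumer + loss_producer, 1))
```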
Interpretation
The final value obtained from the market price method helps compare the benefits of actions
that would allow the area to be re-opened against the costs of such actions. For instance, if
the economic loss due to the closure is significant, it may justify investments in pollution
control and environmental restoration to reopen the area and resume fishing activities. A
practical analogy can be drawn from maintaining a swimming pool in an apartment complex.
Suppose the cost of maintaining the pool is approximately ₹2 lakh per annum, but only a few
residents use it. The question arises of whether to continue maintaining the pool or collect
separate maintenance fees from those who use it. The market price method can help estimate
the economic value of the pool to the residents, considering their willingness to pay for its
use. If the economic benefits (willingness to pay) exceed the maintenance costs, continuing
the pool's operation would be justified. Otherwise, alternative arrangements may be
considered.

Finally, the market price method is a robust tool for estimating the economic value of
ecosystem products and assessing the impact of market interventions. By analyzing market
demand, consumer surplus, and producer surplus, this method provides a comprehensive view
of the economic implications of changes in the market, guiding informed decision-making for
sustainable and efficient resource management in the fisheries sector.

Travel Cost Method


The travel cost method (TCM) is widely used to estimate the economic value associated with
ecosystems or recreational sites. This method is particularly useful for assessing the economic
benefits derived from various changes or interventions, such as adjustments in visiting fees
(access charges), the closure of existing sites, the addition of new sites, or changes in the
environmental quality of a site. The underlying principle is that the travel expenses incurred
by visitors serve as a proxy for their willingness to pay (WTP) to access the site.
Estimating Economic Value
The travel cost method provides insights into the economic value of recreational sites by
analyzing the travel costs visitors incur. These costs include transportation, time spent
traveling, and other related expenditures. By examining these factors, the TCM can estimate
visitors' total economic benefits from the site.
Steps in Travel Cost Method
1. Define a Set of Zones Surrounding the Site: Identify and define several zones around the
recreational site, each representing different distances from the site. These zones help
categorize visitors based on how far they travel to reach the site.
2. Number of Visitors and Visits from Each Zone: Collect data on the number of visitors and
the frequency of their visits from each zone. This information is crucial for understanding the
demand for the site.
3. Estimate Visitation Rates per 1000 Population in Each Zone: Calculate the visitation rate
per 1000 people for each zone. This rate helps standardize the data and provides a basis for
comparing visitation patterns across different zones.
4. Calculate the Round-Trip Travel Distance and Travel Time for Each Zone: Determine the
total distance and time required for a round trip to the site from each zone. These calculations
are essential for estimating travel costs.
5. Variables Influencing the Per Capita Travel Costs: Identify and analyze variables that affect
per capita travel costs, such as transportation expenses, fuel prices, and the value of time
spent traveling.

6. Estimate Demand Function for Visits to the Site: Develop a demand function that relates
the number of visits to the site with the travel costs and other influencing factors. This
function helps predict how changes in travel costs or site characteristics will impact visitation
rates.
7. Estimate the Economic Benefit to the Site (Consumer Surplus): Calculate the consumer
surplus, i.e., the area under the demand curve above the travel cost. The consumer surplus
represents the total economic benefit visitors derive from the site beyond what they pay (a
simplified numerical sketch follows the Interpretation section below).
Interpretation
The economic benefit estimated using the travel cost method is a benchmark for assessing
the site's value. If the costs of maintaining the site are lower than the estimated economic
benefits, it is worthwhile to continue investing in it. Conversely, if maintenance costs exceed
the benefits, it may be necessary to reconsider the site's management or explore additional
factors that could influence its value. For instance, if a recreational site incurs significant
maintenance costs but attracts many visitors willing to pay substantial travel expenses to
access it, the travel cost method would justify the site's continued operation. However, if the
site is underutilized and the economic benefits are minimal, alternative management
strategies or site improvements may be needed to enhance its value and appeal.
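The steps above can be illustrated with a deliberately simplified zonal sketch in Python. The zones, travel costs, visitation rates, populations, and the linear form of the demand function are all illustrative assumptions; an actual study would estimate the demand function econometrically and include additional cost and socio-economic variables.

```python
import numpy as np

# Hypothetical zonal data: round-trip travel cost (Rs.), visits per 1,000
# population, and population for four distance zones around the site.
travel_cost = np.array([100.0, 250.0, 400.0, 600.0])
visit_rate = np.array([40.0, 28.0, 18.0, 6.0])
population = np.array([50_000, 120_000, 200_000, 350_000])

# Step 6: fit a simple linear demand function, visits per 1,000 = a + b * cost.
slope, intercept = np.polyfit(travel_cost, visit_rate, 1)

# Travel cost at which predicted visitation falls to zero (the "choke" cost).
choke_cost = -intercept / slope

# Step 7: consumer surplus as the area under the linear demand curve above
# each zone's travel cost, aggregated over all zones.
visits = visit_rate * population / 1000
cs_by_zone = 0.5 * visits * (choke_cost - travel_cost)

print(f"Demand: visits per 1,000 = {intercept:.1f} + ({slope:.4f}) * travel cost")
print(f"Total estimated consumer surplus: Rs. {cs_by_zone.sum():,.0f}")
```

The total consumer surplus obtained in this way is the figure that is compared with the site's maintenance cost in the interpretation above.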

Contingent Valuation Method (CVM)


Assigning Monetary Value to Non-Use Values
The contingent valuation method (CVM) assigns monetary value to the non-use values of
the environment by directly asking people about their willingness to pay (WTP) for
environmental services or benefits. It also assesses people's willingness to accept (WTA)
compensation for giving up environmental benefits or enduring some difficulties.
Steps in CVM
1. Define the Valuation Problem: Identify the services valued and the relevant population.
2. Decide the Mode of Survey: Choose between personal interviews or mailed surveys,
considering sample size, resources, and the issue's importance.
3. Finalize Survey Design: Refer to similar studies to determine the range of values and
conduct focused group discussions.
4. Implement the Survey: Select an appropriate sample and maximize responses through
repeated visits or convenient contact times.
5. Compile, Analyze, and Report Results: Use suitable statistical techniques, eliminate outliers,
and address non-response bias.
Interpretation
Calculate the average WTP per individual or household and extrapolate to the population to
estimate the total benefit. This method is particularly useful for evaluating public goods and
services.
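The aggregation described here can be sketched in a few lines of Python. The stated WTP values, the household count, and the 95th-percentile rule for trimming outliers are hypothetical choices used only to illustrate the calculation.

```python
import numpy as np

# Hypothetical stated WTP responses (Rs. per household per year).
wtp = np.array([0, 50, 100, 100, 150, 200, 200, 250, 300, 5000])

# Remove extreme outliers before averaging (here, responses above the
# 95th percentile); the trimming rule is an illustrative choice.
trimmed = wtp[wtp <= np.percentile(wtp, 95)]
mean_wtp = trimmed.mean()

# Extrapolate the mean WTP to the relevant population of households.
households = 12_000  # hypothetical size of the relevant population
total_benefit = mean_wtp * households

print(f"Mean WTP: Rs. {mean_wtp:.0f} per household per year")
print(f"Estimated aggregate benefit: Rs. {total_benefit:,.0f} per year")
```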

Contingent Choice Method


Similarities and Differences with CVM
The contingent choice method is similar to CVM but provides respondents with choices
between different commodities or environmental services at varying prices or costs. This
method focuses on ranking techniques and choices to understand preferences better.

Case Studies
Willingness to Pay (WTP) for Clam Fisheries Management Programme (CFMP)
A survey was conducted to evaluate the economic value and effectiveness of the Clam
Fisheries Management Programme (CFMP) and estimate stakeholders' willingness to pay
(WTP). This survey aimed to understand how much clam fishers and associated stakeholders
are willing to invest in the CFMP and to assess the program's impact on their livelihoods and
the sustainability of the fishery.

Problems Faced Before CFMP


Before the introduction of the CFMP, stakeholders encountered several significant
challenges:
• Unregulated Harvesting: Approximately 77.5% of respondents reported issues with
unregulated harvesting practices. This lack of regulation led to overfishing and
depletion of clam resources, adversely affecting the sustainability of the fishery.
• Limited Market for Produce: About 50% of the respondents faced difficulties due
to a limited market for their clam produce. This restricted market access constrained
their ability to sell their products at fair prices, impacting their income.
• Reduced Catch per Trip: Around 35% of the respondents experienced a decline in
catch per trip. This reduction in yield was a direct consequence of overfishing and
poor management practices, which diminished the overall profitability of their
fishing operations.
• Poor Quality of Clams: 23% of the respondents reported poor quality of clams,
which further exacerbated market issues and reduced their competitiveness in the
market.
Benefits Obtained by Adopting CFMP
The introduction of the CFMP brought several benefits to the stakeholders, which
significantly improved their economic and operational conditions:
• Sustained Catch: About 75% of respondents reported a sustained catch following
the implementation of CFMP. This improvement in catch stability ensured a more
reliable income source and contributed to the long-term sustainability of the fishery.
• Higher Share of Consumer Rupee: CFMP enabled stakeholders to capture a higher
share of the consumer rupee, with 50% reporting increased earnings. This shift
allowed fishers to benefit more directly from their catch.

• Consistent Market: Approximately 32.5% of respondents experienced access to a
more consistent market for their produce. This stability facilitated better planning
and financial security.
• Increased Net Operating Income per Trip: Around 25% of respondents saw an
increase in net operating income per trip, reflecting improved efficiency and
profitability due to better management practices.
• Enhanced Domestic Savings: The program led to a 22.5% increase in domestic
savings among stakeholders, allowing them to meet planned needs and invest in
other areas of their lives.
• Premium Prices for Produce: About 20% of respondents received premium prices
for their produce, enhancing their revenue and market competitiveness.
• Sustainable Income: Approximately 18% of stakeholders reported achieving a
sustainable income post-CFMP, underscoring the program's effectiveness in
stabilizing their financial situation.

Willingness to Accept (WTA) Compensation during the Seasonal Fishing Ban (SFB)


The table below summarises the stated WTA compensation by study location, fleet or stakeholder
category, and ban duration (30 to 120 days).
Location       Category                              30 days    45 days    60 days    90 days   120 days
Rameswaram     Trawl owners                            34167      49167      73333     105000     143333
               Trawl labour                            11817      20950      33433      52867      79067
Chennai        Mechanised trawl owner                     NA     591667         NA         NA         NA
               Mechanised gillnet owner                   NA      88333         NA         NA         NA
               Mechanised trawl crew                    1170       7920       5760       4200       5000
               Mechanised gillnet crew                    NA      11500         NA         NA         NA
Kanyakumari    Mechanised owner                        11187      15417         NA         NA         NA
               Mechanised labour                        5333      11694         NA         NA         NA
               Motorised                                8174      13880         NA         NA         NA
               Motorised labour                         4444      11111         NA         NA         NA
               Non-mechanised owner                     7409      13330         NA         NA         NA
               Non-mechanised labour                    1875       9250         NA         NA         NA
Kakinada       Non-mechanised                           2970       8659       6333       4848         NA
               Mechanised                                 NA      10050         NA         NA         NA
Nizampatnam    Mechanised trawl net                       NA      38200         NA         NA         NA
               Motorised gillnet                          NA      11708         NA         NA         NA
               Motorised mini trawl net                   NA      10400         NA         NA         NA
               Labourers                                7000       9265         NA         NA         NA
Mangalore      Motorised gillnet boat owner               NA       5500      15000       4663         NA
               Non-motorised boat owner                   NA         NA       1333       1430       3000
Rameswaram     Traditional (in Rs.)                       333        606        976      1,278      1,725
               Motorised (in Rs.)                         652      1,265      1,838      2,394      3,352
Chennai        Motorised boat owner (in Rs.)               NW         NW         NW         NW         NW
               Motorised boat crew (in Rs.)                NW         NW         NW         NW         NW
Kakinada       Motorised (in Rs.)                         175        281        484      5,678      6,177
Nizampatnam    Non-motorised (in Rs.)                      NW         NW         NW         NW         NW
Mangalore      Mechanised purse-seiners (in Rs.)           NW     1,500*         NW      9,000         NW
               Trawl boat owner (in Rs.)                   NW    23,500*         NW     10,000         NW
               Trawl boat labour (in Rs.)                  NW      2,692      4,500      1,000         NW

Conclusion
Recent advancements in econometric models and analytical tools have significantly
transformed the landscape of economic research. Historically, researchers relied on basic
tabular analysis to understand various phenomena. However, the evolution of sophisticated
analytical methods, such as ARIMA models and time series analysis, has greatly enhanced our
ability to interpret complex data. These advancements have broadened the scope of economic
inquiry and increased the data requirements needed for applying these methods effectively.
Modern applications like R, SPSS, and SAS offer powerful statistical analysis and
econometrics capabilities. While these tools provide advanced functionalities and insights,
users must develop specific skills for effective utilization. Some software applications are
designed to be user-friendly, whereas others necessitate more intensive practice and
expertise. This shift highlights the need for a deeper understanding of analytical techniques
and the ability to navigate complex software environments.
Furthermore, developing comprehensive questionnaires and data collection methods has kept
pace with these analytical advancements. Accurate data collection is essential for leveraging
these advanced tools effectively. Synchronizing data analysis technologies with data
collection processes ensures that research findings are robust and reliable.

Ultimately, interpreting results from these advanced analyses demands meticulous attention
and expertise. Integrating sophisticated analytical tools with careful data collection and
interpretation practices will be crucial for producing meaningful and actionable economic
insights as the field progresses. This holistic approach will enable researchers to tackle
complex economic questions with greater precision and relevance.

Acknowledgment

The author acknowledges the invaluable contribution of Dr. V. Chandrasekar, Senior Scientist
at ICAR-CIFT, Kochi-29, for his involvement in editing this book chapter titled "Emerging
Trends and Technology for Data-Driven Market Research."

Chapter-11
Data Visualization for Data Science
Chandrasekar V 1, Ramadas Sendhil 2 and V. Geethalakshmi 3
1&3 ICAR-Central Institute of Fisheries Technology, Cochin, India
2 Department of Economics, Pondicherry University (A Central University), Puducherry, India.

Introduction
Data visualization is a valuable tool that converts raw data into graphical formats, making
complex information easier to comprehend and access. Well-designed visualizations uncover
trends, patterns, and insights that might not be obvious from raw data. In today’s data-driven
environment, proficiency in data visualization is crucial for those involved in data analysis, as
it facilitates decision-making by presenting information clearly and engagingly. This chapter
covers the basics of data visualization, explores different techniques and tools, and offers
practical examples, including how to create dynamic visualizations using dashboards on real-
time web pages through APIs.

Fundamentals of Data Visualization


Importance of Data Visualization
Data visualization helps to simplify complex datasets, making them easier to interpret and
communicate. For instance, consider a dataset containing annual sales figures for a company
over a decade. A simple line chart can quickly illustrate sales trends, helping stakeholders
identify periods of growth or decline.
Key Principles of Effective Data Visualization
To create effective visualizations, adhere to the following principles:
• Clarity: Ensure the message is clear. For example, a bar chart comparing monthly sales
across different regions should use distinct colors for each region to avoid confusion.
• Accuracy: Represent data truthfully. A pie chart showing market share should have
segments proportional to the market share to avoid misleading impressions.
• Simplicity: Focus on essential information. Avoid cluttering a dashboard with
excessive charts; include only those that provide valuable insights.
• Consistency: Maintain uniformity in design. Use the same color scheme for similar
data across different visualizations to enhance understanding.

Types of Data Visualizations
Charts
Bar Charts: Ideal for comparing discrete categories. For example, a bar chart showing the
number of products sold by different departments allows for easy comparison of sales
performance across departments.
Line Charts: Ideal for illustrating trends over time. For example, a line chart showing monthly
website traffic over the course of a year can highlight patterns such as seasonal increases or
decreases.
Pie Charts: Effective for displaying proportions. For instance, a pie chart depicting the market
share of different smartphone brands visualizes each brand's contribution to the overall
market.
Scatter Plots: Excellent for illustrating relationships between two variables. For example, a
scatter plot comparing advertising spend to sales revenue can show whether a correlation
exists between the two variables.
Graphs
Histograms: Display the distribution of data. For example, a histogram showing the
distribution of customer ages in a retail store can reveal the most frequent age groups among
customers.
Box Plots: Visualize the distribution based on quartiles. For instance, a box plot showing
students' test scores can highlight median performance, variability, and outliers.
Advanced Visualizations
Heat Maps: Represent data values using color gradients. For example, a heat map displaying
website user activity by time and day of the week can highlight peak usage periods.
Bubble Charts: Extend scatter plots by adding a third dimension. For instance, a bubble chart
showing countries’ GDP versus life expectancy, with bubble size representing the population,
can reveal complex relationships between these variables.
Treemaps: Display hierarchical data as nested rectangles. For example, a treemap visualizing
a company's budget allocation across departments can show how resources are distributed.
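As a quick, hands-on illustration of several of the chart types described above, the following Python sketch uses matplotlib with small made-up datasets; every figure and label is hypothetical and chosen only to show the syntax.

```python
import matplotlib.pyplot as plt

# Small made-up datasets, purely for illustration.
regions = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]
months = list(range(1, 13))
traffic = [3.2, 3.0, 3.5, 3.8, 4.1, 4.6, 5.2, 5.0, 4.4, 4.0, 3.7, 4.8]
ad_spend = [10, 20, 30, 40, 50]
revenue = [55, 70, 78, 95, 101]
ages = [22, 25, 27, 28, 30, 31, 33, 35, 35, 38, 41, 45, 52, 60]

fig, axes = plt.subplots(2, 2, figsize=(10, 7))

axes[0, 0].bar(regions, sales)                # bar chart: compare categories
axes[0, 0].set_title("Sales by region")

axes[0, 1].plot(months, traffic, marker="o")  # line chart: trend over time
axes[0, 1].set_title("Monthly website traffic")

axes[1, 0].scatter(ad_spend, revenue)         # scatter plot: two-variable relationship
axes[1, 0].set_title("Ad spend vs. revenue")

axes[1, 1].hist(ages, bins=6)                 # histogram: distribution of values
axes[1, 1].set_title("Customer age distribution")

plt.tight_layout()
plt.show()
```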

Tools and Software for Data Visualization


Microsoft Excel
Excel offers a range of chart types and customization options. For example, you can use Excel
to create a line chart showing sales data over time, applying different colors and labels to
enhance readability.

Tableau
Tableau excels in creating interactive and dynamic visualizations. It enables users to link to a
variety of data sources, ranging from spreadsheets to databases, and convert this data into
meaningful visual representations. For example, you can create a dashboard displaying
real-time sales data with interactive filters, allowing users to drill down into specific regions or
periods.
Example: Imagine a retail company tracking its sales performance across different regions.
Using Tableau, you can create a dashboard that includes:
• A map highlighting sales volume by region.
• A line chart showing monthly sales trends.
• Bar charts comparing product performance across regions.
Interactive filters can be added to allow users to focus on specific time frames or product
categories, enabling a deeper understanding of sales dynamics.
Power BI
Power BI integrates with various data sources and provides robust visualization capabilities.
It is particularly strong in creating comprehensive reports that combine different types of
visualizations, such as bar charts, pie charts, and maps, all connected to a central dataset.
This integration allows for a seamless data flow and consistent updates.
Example: Consider an organization that needs to report on marketing campaign effectiveness.
With Power BI, you can create a report that includes:
• A pie chart showing the distribution of marketing spend across different channels.
• Bar charts comparing the number of leads generated by each campaign.
• A map indicating the geographic distribution of customer acquisition.
These visualizations can be dynamically updated as new data comes in, providing real-time
insights into campaign performance.
R and Python
R and Python are powerful tools for advanced visualization. These programming languages
offer extensive libraries and packages that enable complex data manipulation and
visualization.
R: Using R's `ggplot2` package, you can create a complex scatter plot with multiple layers,
including trend lines and error bands, to provide a detailed data analysis.
Example: A data scientist studying the relationship between advertising spend and sales
revenue might use `ggplot2` to create a scatter plot. The plot could include:

• Points representing individual data points.
• A smooth trend line to indicate the general relationship.
• Error bands to show the variability around the trend line.
Python: Python's `matplotlib` and `seaborn` libraries are equally powerful for creating
intricate visualizations.
Example: In Python, a data analyst might use `seaborn` to create a heatmap showing
correlations between different financial metrics. This heatmap could:
• Use color gradients to indicate the strength of correlations.
• Include annotations for exact correlation values.
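A minimal version of such a heatmap is sketched below using seaborn with randomly generated, purely illustrative "financial metrics"; the column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Randomly generated, purely illustrative "financial metrics".
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "revenue": rng.normal(100, 10, 60),
    "ad_spend": rng.normal(20, 4, 60),
    "costs": rng.normal(60, 8, 60),
    "profit": rng.normal(35, 6, 60),
})

# Correlation heatmap with a colour gradient and annotated values.
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between financial metrics")
plt.tight_layout()
plt.show()
```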

Creating Effective Data Visualizations


Understanding Data
Before creating visualizations, understand the data thoroughly. For example, if you have sales
data from multiple regions, categorize it by region and time period to determine the most
relevant aspects to visualize.
Choosing the Right Visualization
Select the visualization type based on your data and objectives. For example, use a bar chart
to compare sales across different regions but choose a line chart to show sales trends over
time.
Designing for Clarity
Ensure that your visualization is easy to interpret. For example, clear labels, appropriate scales,
and contrasting colors are used to distinguish between different data series in a line chart.
Testing and Iteration
Gather feedback from users to refine your visualizations. For instance, if users find a pie chart
confusing, consider switching to a bar chart or a stacked bar chart to convey the information
more clearly.

Best Practices in Data Visualization


Avoiding Common Pitfalls
Misleading Scales: Ensure scales are accurate to prevent misinterpretation. For example, a
bar chart with a truncated y-axis might exaggerate differences between data points.
Overcomplicating Visualizations: Keep visualizations simple. For instance, avoid adding too
many data series to a single chart, which can overwhelm viewers.
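To make the truncated-axis pitfall concrete, the short matplotlib sketch below plots the same two made-up values twice: once with the y-axis starting at zero and once with a truncated y-axis that makes a difference of roughly 4% look dramatic.

```python
import matplotlib.pyplot as plt

products = ["Product A", "Product B"]
sales = [980, 1020]  # values only about 4% apart

fig, (ax_full, ax_cut) = plt.subplots(1, 2, figsize=(8, 3.5))

ax_full.bar(products, sales)
ax_full.set_ylim(0, 1100)       # axis starts at zero: the difference looks modest
ax_full.set_title("Y-axis from zero")

ax_cut.bar(products, sales)
ax_cut.set_ylim(950, 1030)      # truncated axis: the same data looks dramatic
ax_cut.set_title("Truncated y-axis")

plt.tight_layout()
plt.show()
```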

Leveraging Interactivity
Interactive visualizations engage users and allow them to explore data. For example, a
dashboard with interactive filters lets users view data for specific periods or regions.
Ensuring Accessibility
Design visualizations for accessibility. For instance, use color-blind-friendly palettes and
provide alternative text descriptions to make visualizations accessible to all users.
Creating Real-Time Dashboards Using APIs
Incorporating real-time data into visualizations can significantly enhance their value by
providing up-to-date insights. This can be achieved by creating dashboards that integrate
with APIs to pull live data.
Example: Real-Time Sales Dashboard
1. Data Source: Connect to a real-time data source, such as a sales database or an online
sales platform's API, to fetch current sales data.
2. Dashboard Design: Use a tool like Tableau, Power BI, or a custom web application with
libraries like D3.js to design a dashboard that includes:
• Real-time sales figures.
• Interactive charts and graphs (e.g., bar charts, line charts).
• Filters to drill down into specific products, regions, or periods.
3. API Integration: Integrate the dashboard with the API to fetch live data regularly. For
example, using a web application framework like Flask (Python) or Express (Node.js), you can
set up endpoints that query the sales database/API and return the latest data to the
dashboard.
4. Visualization: Update the visualizations dynamically as new data is fetched. Use JavaScript
libraries like D3.js or Chart.js to render the visualizations on a web page and ensure they
update in real time without reloading the page.
5. Example:
```python
# Example using Flask and D3.js
from flask import Flask, jsonify, render_template
import requests

app = Flask(__name__)

@app.route('/api/sales')
def get_sales_data():
    # Fetch the latest sales figures from the (hypothetical) sales platform API
    response = requests.get('https://api.salesplatform.com/latest-sales')
    return jsonify(response.json())

@app.route('/')
def index():
    # Serve the dashboard page
    return render_template('dashboard.html')

if __name__ == '__main__':
    app.run(debug=True)
```
```html
<!-- dashboard.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Real-Time Sales Dashboard</title>
    <script src="https://d3js.org/d3.v5.min.js"></script>
</head>
<body>
    <div id="sales-chart"></div>
    <script>
        async function fetchSalesData() {
            const response = await fetch('/api/sales');
            const data = await response.json();
            updateChart(data);
        }

        function updateChart(data) {
            // Use D3.js to create or update the chart with the new data.
            // Minimal illustration: render one line per record, assuming each
            // record exposes (hypothetical) "product" and "sales" fields.
            const rows = d3.select('#sales-chart').selectAll('p').data(data);
            rows.enter().append('p')
                .merge(rows)
                .text(d => `${d.product}: ${d.sales}`);
            rows.exit().remove();
        }

        setInterval(fetchSalesData, 60000); // Update every minute
        fetchSalesData();                   // Initial load
    </script>
</body>
</html>
```

This example demonstrates how to build a real-time sales dashboard with a web framework and
a JavaScript charting library. By integrating with an API, the dashboard updates continuously,
providing the latest insights into sales performance.

Conclusion
Data visualization is a crucial tool for interpreting and communicating data. By applying
principles of clarity, accuracy, simplicity, and consistency and utilizing appropriate tools and
techniques, individuals can create compelling visualizations that provide valuable insights.
Mastery of data visualization enhances data analysis and supports effective decision-making
in various fields. With tools like Tableau, Power BI, R, and Python, and by integrating real-time
data through APIs, data can be transformed into meaningful visual stories that drive better
understanding and action.


Extension Information & Statistics Division
ICAR - Central Institute of Fisheries Technology
(Indian Council of Agricultural Research, New Delhi)
Ministry of Agriculture and Farmers Welfare, Government of India
