Basic Econometric Tools and Techniques in Data Analytics
Editors
V Chandrasekar, Ramadas Sendhil, V Geethalakshmi, A Suresh, Nikita Gopal
V Chandrasekar
Senior Scientist, Agricultural Economics
ICAR-Central Institute of Fisheries Technology
Cochin, Kerala, India

Ramadas Sendhil
Associate Professor
Department of Economics
Pondicherry University
Puducherry, India

V Geethalakshmi
Principal Scientist
ICAR-Central Institute of Fisheries Technology
Cochin, Kerala, India

A Suresh
Principal Scientist, Agricultural Economics
ICAR-Central Institute of Fisheries Technology
Cochin, Kerala, India

Nikita Gopal
Principal Scientist and Head, Extension, Information & Statistics Division
ICAR-Central Institute of Fisheries Technology
Cochin, Kerala, India
The edited volume has been published with financial support from the ICAR-Central
Institute of Fisheries Technology (CIFT), Cochin, Kerala, India. The use of general
descriptive names, registered names, trademarks, or service marks in this publication does
not imply, even in the absence of a specific statement, that these names are exempt from
applicable protective laws and regulations, nor are they free for general use. The publisher,
authors, and editors have made every effort to ensure that the advice and information
presented in this book are accurate and reliable as of the publication date. However, the publisher and editors give no warranty regarding the content and accept no responsibility for any errors or omissions. Additionally, ICAR-CIFT remains neutral concerning jurisdictional claims in published maps, illustrations, and institutional affiliations.
PREFACE
The landscape of data analytics is rapidly changing, and it is at the intersection of
econometrics and advanced statistical techniques that valuable information can be gained
from complex datasets. The need for robust econometric tools has been immense as
academicians, researchers, and data analysts learn how to deal with the ever-increasing
amount of data available to them. This edited volume, “Basic Econometric Tools and Techniques in Data Analytics,” an outcome of the five-day Scheduled Caste Sub-Plan (SCSP) training programme of the ICAR-Central Institute of Fisheries Technology in collaboration with the Department of Economics, School of Management, Pondicherry University (A Central University), aims to bridge the gap between theoretical econometric concepts and their practical use in data analysis.
Economic theory, mathematics, and statistical methods are combined in econometrics thus
making it a powerful framework for modeling relationships within datasets. For both
beginners and experts alike, this book offers an extensive compendium of chapters
discussing fundamental econometric tools necessary to draw meaningful conclusions from
various datasets. Written by subject experts, the chapters bring varied perspectives on basic econometric tools together into a coherent volume. Topics covered range from essential software for econometric analysis, such as R, and fundamentals such as linear regression and hypothesis testing, to advanced techniques such as time-series forecasting and panel regression models. The book underscores the importance of a strong conceptual foundation even as it centers on practical applications: each chapter is written to explain concepts lucidly and provides examples and learning exercises that help readers consolidate what they have learned and put it into practice.
The target audience for this book includes students, researchers, and professionals from diverse fields such as economics, commerce, finance, business, and the social sciences who wish to employ econometrics for evidence-based decision-making. Whether used as a college text or a procedural manual for practitioners, this book seeks to equip users with the requisite knowledge and skills for basic data analysis using econometric tools.
We would like to thank all the contributors for their insights that made this compilation a
priceless source of material. We trust that "Basic Econometric Tools and Techniques in
Data Analytics" will become an indispensable companion for those individuals who want to
understand the dynamism of data analysis through the eyes of econometrics.
Editors
CONTENTS
Puducherry, India.
2 ICAR-Central Institute of Fisheries Technology, Cochin, India
Introduction
Economics, a fascinating discipline, draws on foundational aspects from various fields like
mathematics and statistics. The applicability of economic theories is judged and tested over
time by several visionary researchers to deduce the relationship between economic agents
and economic activities. Economics is subdivided into various disciplines, each focusing on
specific aspects of life. Microeconomics explores individual decision-making, macroeconomics
delves into aggregate decision-making, development economics examines the impact of
economic decisions on sustenance and well-being, and environmental economics scrutinizes
the relationship between mankind and nature, to cite a few branches. In a way, every sub-discipline deals with agents' economic decisions and the spillover effects of those decisions on their own lives and on the lives of those around them.
Why Econometrics?
Mathematical models are deterministic, leaving no scope for variability. The closest alternative to the mathematical method is statistics, but statistics is concerned mainly with collecting, processing, and presenting economic data. Since mathematical economics and statistics are only supplementary arms of economic theory, a separate discipline is needed for quantitatively measuring economic phenomena and decisions. Economic decisions are subject to variability due to individual differences and diverse situational contexts; in econometrics, these variabilities are called errors. So, econometrics deals with empirical evidence for economic theories, based on precise tools and sophisticated techniques that follow the mathematical and statistical principles of unbiasedness, efficiency, and consistency. “The method of econometric research aims, essentially, at a conjunction of economic theory and actual measurements, using the theory and technique of statistical inference as a bridge pier” (Haavelmo, 1944).
Econometric tools have wide-ranging applicability in real life across various fields and
industries. Nowadays, policies are not based on trial-and-error methods but on rigorous econometric models that allow the expected impact to be assessed beforehand. Based on these findings, policies are customized and implemented to achieve efficiency and distributive equity. Governments and central banks use econometric models to evaluate the repercussions of fiscal and monetary policies on key macroeconomic indicators such as inflation, per capita disposable income, the unemployment rate, GDP growth, and the money supply. These tools are also used to scrutinize and cushion the impact of exogenous variables on endogenous variables. For example, during the subprime crisis of 2007-08, the central bank's resilient money-market policies helped stabilize the Indian economy. These
econometric tools also help in choosing the right policy. For example, econometric models are
instrumental in determining whether demand-side policies or supply-side policies are the
appropriate choice for market correction.
These econometric tools are extensively used in financial markets to analyze volatility. For example, asset prices are dynamic and pose a severe challenge during crises. The choice of investment strategies, risk assessment of an asset, evaluation of its near-term performance, and hedging strategies rely heavily on these econometric tools. Multinational corporations use them for demand estimation, studying consumer behavior, price determination, and sales forecasting. Apart from these, healthcare economics uses econometric tools to evaluate the effectiveness of healthcare interventions, and environmental economists use them for cost-benefit analysis and the optimization of natural resources, for instance to scrutinize the relationship between pollution levels and their impact on mankind. So, every sub-discipline of economics depends heavily on these econometric tools to optimize benefits efficiently while keeping allocative and distributive equity intact.
Hence, the primary uses of econometric tools include the following:
• Econometric models help in formulating relationships between economic variables.
• Econometric techniques help in testing hypotheses about economic relationships.
• Econometric models are heavily used in forecasting and estimating the future trends of economic variables.
Example: the expected growth rate of public expenditure on healthcare by the central government over the subsequent ten years, based on preceding investment.
• Econometrics is frequently used to assess the effects of specific economic policies.
• It helps in estimating causality between two variables.
Example: the link between risk and return.
Among the several schools of thought in econometric methodology, steps based on classical
methodology are as follows:
4. Obtaining data
Econometric data are not derived from controlled experiments but are gathered
through observation of real-world events and behaviors.
Economic data sets come in different formats. A cross-sectional data set contains
various variables at a single point in time. In econometrics, cross-sectional variables
are typically represented by the subscript "i," where "i" takes values from 1 to N,
representing the number of cross-sections. This type of data is often used in applied
microeconomics, labor economics, public finance, business economics, demographic
economics, and health economics.
A time series dataset involves recording observations of one or more variables at
sequential time intervals, making it particularly useful in macroeconomic research.
Time series variables are typically represented with the subscript "t."
Panel data combines aspects of both cross-sectional and time series data, collecting
information from multiple variables over time. Panel data are represented using both
"i" and "t" subscripts, referring to cross-sectional and time series data, respectively.
For instance, the GNP of five countries over a 10-year period might be denoted as Yit, where t = 1, 2, 3, ..., 10 and i = 1, 2, 3, 4, 5.
Tools of Econometric Analysis
The ordinary least squares (OLS) model is one of the most widely used and reliable regression analysis methods. These models rest on the Gauss-Markov assumptions and are applied in many fields for modeling, prediction, and hypothesis testing. The classical linear regression model (CLRM) mainly takes two forms:
a) Simple Linear Regression: This model characterizes the linear relationship between a dependent variable and a single independent variable.
Yt = β0 + β1Xt + ut
where β0 is the intercept, β1 is the slope, and ut is the disturbance term.
b) Multiple Linear Regression: This model extends simple linear regression to include more than one independent variable and is represented as
Yt = β0 + β1X1t + β2X2t + β3X3t + … + βnXnt + ut
where X1, X2, ..., Xn are the independent variables.
• Homoscedasticity: Constant variance of error terms across all levels of the
independent variables.
• Independence: Observations are independent of each other.
• Normality of disturbance term: The disturbance terms are assumed to be
normally distributed.
• No Perfect Multicollinearity among independent variables.
The parameters (β0, β1, β2, ..., βn) are estimated using the method of ordinary least squares, which “minimizes the sum of squared differences between observed and predicted values”; under the CLRM assumptions, the resulting estimators possess the properties of linearity, unbiasedness, consistency, and efficiency (BLUE).
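As a minimal sketch (simulated data and hypothetical variable names, not taken from the text), such a model can be estimated in R with the built-in lm() function:
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 2 + 0.5 * dat$x1 - 1.2 * dat$x2 + rnorm(100)   # simulated dependent variable
fit <- lm(y ~ x1 + x2, data = dat)                       # OLS estimation
summary(fit)   # coefficients, standard errors, t-statistics and p-values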
In hypothesis testing, the test statistic is compared against standard sampling distributions such as the normal, t, F, or chi-square distributions. This helps assess how likely it is to observe the data if the null hypothesis holds. A p-value is then calculated to indicate the strength of the evidence: a small p-value suggests that the null hypothesis should be rejected, while a larger p-value indicates there is not enough evidence to reject it. Both approaches help determine the validity of the null hypothesis in statistical analysis.
This type of regression analysis model includes categorical variables (also known as
dummy variables or indicator variables). These categorical variables represent categories
or groups that cannot be quantitatively measured. Examples include gender, education,
race, religion, geographical region, etc. They take the value of 0 or 1, indicating the
absence or presence of a particular categorical attribute. A regression model in which all the regressors are dummy variables is called an analysis of variance (ANOVA) model.
Yi = α + β1D1i + β2D2i + β3D3i + ui
If there are h categories, only h−1 dummy variables are included, to avoid the dummy variable trap (perfect multicollinearity), which is problematic for regression analysis. The coefficients of these dummy variables, known as differential intercept
coefficients, indicate the average change in the dependent variable when transitioning
from the benchmark category to the category associated with the dummy variable.
Regression models incorporating quantitative and qualitative variables are referred to as
analysis of covariance (ANCOVA) models.
Yi = β1 + β2D2i + β3D3i + β4Xi + ui
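As a minimal sketch (hypothetical data, not from the text) of such a model in R: wrapping a categorical variable in factor() makes lm() create the h−1 dummies automatically, which avoids the dummy variable trap.
dat <- data.frame(
  wage   = c(20, 25, 22, 30, 28, 35),
  region = factor(c("North", "South", "South", "North", "East", "East")),  # categorical regressor
  educ   = c(10, 12, 11, 14, 13, 16)                                       # quantitative regressor
)
ancova_fit <- lm(wage ~ region + educ, data = dat)
summary(ancova_fit)   # region coefficients are differential intercepts relative to the base category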
Applications
Qualitative response models, also known as binary or discrete choice models or limited dependent variable regression models, are a class of statistical models used when the dependent variable is categorical.
Different techniques used include:
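For instance, the logit model, one technique in this class, can be fitted in R with the built-in glm() function; a minimal sketch on simulated data with hypothetical variable names:
set.seed(2)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.5 * dat$x1 - 0.8 * dat$x2))   # binary outcome
logit_fit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))
summary(logit_fit)   # coefficients are on the log-odds scale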
5. Panel Data Regression Models
According to Baltagi, since panel data follow cross-sectional units over time, heterogeneity across observation units can be explicitly captured. With observations that span both time and individuals in a cross-section, panel data provide “more informative data, more variability, less collinearity among variables, more degrees of freedom and more efficiency”. They are better suited for analyzing changes over time, identifying and measuring the impact of policies and laws, and examining complex behavioural models. A general panel data regression model can be expressed as
Yit = α + βXit + uit,  i = 1, 2, ..., N; t = 1, 2, ..., T,
where i denotes the cross-sectional unit and t the time period.
A panel is considered balanced when every subject has the same number of time periods
observed, whereas it is unbalanced if the number of observations varies across subjects.
Common estimation techniques for panel data regression models include pooled ordinary
least squares (OLS), fixed effects models, and random effects models.
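As a minimal sketch (assuming the contributed plm package is installed), the three estimators can be compared in R on the Grunfeld data shipped with plm:
library(plm)
data("Grunfeld", package = "plm")
pooled <- plm(inv ~ value + capital, data = Grunfeld, index = c("firm", "year"), model = "pooling")
fixed  <- plm(inv ~ value + capital, data = Grunfeld, index = c("firm", "year"), model = "within")
random <- plm(inv ~ value + capital, data = Grunfeld, index = c("firm", "year"), model = "random")
phtest(fixed, random)   # Hausman test, commonly used to choose between fixed and random effects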
These models are used to analyze variables whose values change continuously and are capable of dealing with the volatility of a series. In financial economics, many series, such as stock prices and futures and options prices, are volatile time series and require special techniques, such as autoregressive conditional heteroskedasticity (ARCH) modeling, to extract information from the series itself.
Traditional econometrics treats the variance of the disturbance terms as constant over time (the CLRM assumption of homoscedasticity). However, financial time series exhibit high volatility in particular periods, and this volatile nature has significant implications for the overall economy. So, the ARCH family of models is used to analyze these volatile time series.
Several different models of ARCH, such as GARCH (Generalized Autoregressive
Conditional Heteroskedasticity), GARCH-M (GARCH in mean), T-GARCH (Threshold
GARCH), E-GARCH (Exponential GARCH), and others, are used frequently in analysis.
Each technique and tool has its own advantages and limitations.
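As a minimal sketch (assuming the contributed rugarch package is installed), a GARCH(1,1) model can be fitted to a return series; a simulated series stands in here for observed asset returns:
library(rugarch)
set.seed(3)
returns <- rnorm(500, mean = 0, sd = 0.01)   # placeholder for a series of daily returns
spec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                   mean.model = list(armaOrder = c(0, 0)))
garch_fit <- ugarchfit(spec = spec, data = returns)
garch_fit   # estimated ARCH/GARCH parameters and diagnostics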
7. Time Series Forecasting
Conclusion
In summary, the significance of econometric tools and techniques in data analysis within
economics is irrefutable. As a bridge between economic theories and empirical evidence,
econometrics provides a quantitative foundation for testing and validating hypotheses across
various economic disciplines. From shaping government policies and analyzing financial
markets to evaluating healthcare interventions and optimizing environmental resource
allocation, econometric models play a crucial role in understanding and predicting economic
phenomena. The versatility of these econometric tools allows for a thorough and nuanced
exploration of economic relationships, ensuring that researchers can adapt their models to
the complexities of the data at hand. As the field of econometrics continues to evolve,
integrating cutting-edge statistical and mathematical techniques with established economic
theories remains an essential component of empirical research. The advancements in
computational power and the increasing availability of data further enhance the capacity of
econometricians to provide accurate, data-driven insights, guiding decision-makers and
researchers in navigating the complexities of the dynamic economic landscape and
transforming theoretical insights into actionable strategies.
Bibliography
Amemiya, T. (1981). Qualitative response model: A survey. Journal of Economic Literature,
19, 481–536.
Asteriou, D., & Hall, S. G. (2011). Applied Econometrics. UK: Palgrave Macmillan.
Baltagi, B. H. (1995). Econometric analysis of panel data. John Wiley and Sons.
Berndt, E. R. (1991). The practice of econometrics: Classic and contemporary. Addison-
Wesley.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and applications.
Cambridge University Press.
Cramer, J. S. (2001). An introduction to the logit model for economists (2nd ed., p. 33).
Timberlake Consultants Ltd.
Cromwell, J. B., Labys, W. C., & Terraza, M. (1994). Univariate tests for time series models.
Sage Publications.
Cuthbertson, K., Hall, S. G., & Taylor, M. P. (1992). Applied econometric techniques (p. 100).
University of Michigan Press.
Goldberger, A. S. (1991). A course in econometrics. Harvard University Press.
Greene, W. H. (1993). Econometric analysis (2nd ed., pp. 535–538). Macmillan.
Gujarati, D. N. (2012). Basic Econometrics. McGraw Hill Education Private Limited.
Haavelmo, T. (1944). The probability approach in econometrics. Supplement to Econometrica,
12, iii.
Hood, W. C., & Koopmans, T. C. (1953). Studies in econometric method (p. 133). John Wiley
& Sons.
Johnston, J. (1984). Econometric methods (3rd ed.). McGraw-Hill.
Kmenta, J. (1986). Elements of econometrics (2nd ed., pp. 723–731). Macmillan.
Murray, M. P. (2006). Econometrics: A modern introduction. Pearson/Addison Wesley.
Patterson, K. (2000). An introduction to applied econometrics: A time series approach. St.
Martin’s Press.
Ripollés, J., Martínez-Zarzoso, I., & Alguacil, M. (2022). Dealing with Econometrics: Real World
Cases with Cross-Sectional Data. UK: Cambridge Scholars Publishing.
Wooldridge, J. M. (1999). Econometric analysis of cross section and panel data. MIT Press.
Chapter 2
An Introduction to R and R Studio
J.Jayasankar, Fahima M.A and Megha K.J
ICAR-Central Marine Fisheries Research Institute, Kochi, Kerala
Introduction
Statistical methods have inspired many computational tools over the past decades, so much so that tool-inspired methodological options have also been recorded. Many generic software programs perform basic statistical analyses and tests, making the inference process well-founded and relatively easy. R, an evolved offshoot of the S language, is the latest off the block, with explosive growth and adoption. The following sections give a practical overview of what it takes to get R running and of the basic manoeuvres.
The R Environment
R Studio
• Console where you can type commands and see output. The console is all you would
see if you run R in the command line without RStudio. The prompt, by default ‘>’,
indicates that R is waiting for your commands.
• Script editor where you can type out commands and save them to a file. You can also
submit the commands to run in the console.
• Environment/History: Environment shows all active objects, and history keeps track
of all commands run in the console.
• Files/Plots/Packages/Help
Installing Procedure
Steps for installing RStudio
To install RStudio on Windows, click “Download RStudio for Windows” and choose the appropriate version. Run the .exe file after downloading and follow the installation instructions. Users can then work in RStudio for analysis.
After finishing the installation procedure, the user can open RStudio by clicking the RStudio
icon, as shown in the figure above.
R commands, case sensitivity, etc.
➢ Normally all alphanumeric symbols are allowed (and in some countries, this includes
accented letters) plus ‘.’ and ‘_’, with the restriction that a name must start with ‘.’ or
a letter, and if it starts with ‘.’ the second character must not be a digit. Names are
effectively unlimited in length.
➢ Comments can be put almost anywhere: starting with a hash mark (‘#’), everything to the end of the line is a comment. If a command is not complete at the end of a line, R will give a different prompt, by default + on the second and subsequent lines, and continue to read input until the command is syntactically complete. This prompt may be changed by the user. We will generally omit the continuation prompt and indicate continuation by simple indenting.
➢ Objects in R obtain values by assignment. This is achieved by the gets arrow, <-.
Getting help with functions and features
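For example, the documentation for a function such as mean() can be consulted in any of the following ways:
help(mean)      # open the help page for mean()
?mean           # shorthand for help(mean)
example(mean)   # run the examples from the help page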
Basic Arithmetic
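R can be used directly as a calculator at the console; a small illustration:
x <- 2 + 3 * 4   # usual operator precedence; the result is assigned to x
x
[1] 14
sqrt(16) + x^2
[1] 200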
a) Vectors
Vectors are variables with one or more values of the same type. A variable with a single value
is known as a scalar. In R, a scalar is a vector of length 1. There are at least three ways to
create vectors in R: (a) sequence, (b) concatenation function, and (c) repetition function.
Eg:
▪ vector1 <- c(1,5,9)
vector2 <- c(20,21,22,23,24,25)
▪ vector <- seq(1, 10, by = 1)
vector
[1]  1  2  3  4  5  6  7  8  9 10
▪ A <- rep(5, 3)
A
[1] 5 5 5
➢ logical vectors
As well as numerical vectors, R allows the manipulation of logical quantities. The elements of
a logical vector can have the values TRUE, FALSE, and NA. The first two are often abbreviated
as T and F, respectively. Note, however, that T and F are just variables that are set to TRUE
and FALSE by default but are not reserved words and, hence, can be overwritten by the user.
Hence, you should always use TRUE and FALSE.
➢ Character vector
We can equally create a character vector in which each entry is a string of text. Strings in R
are contained within double quotes.
➢ Missing values
In some cases, the components of a vector may not be completely known. When an element
or value is “not available” or a “missing value” in the statistical sense, a place within a vector
may be reserved for it by assigning it the special value NA. The function is.na(x) gives a logical
vector of the same size as x with value TRUE if and only if the corresponding element in x is
NA.
Note that there is a second kind of “missing” value produced by numerical computation, the
so-called Not a Number, NaN, values.
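A small illustration (is.na() flags both NA and NaN, while is.nan() flags only NaN):
x <- c(1, NA, 3, 0/0)   # 0/0 produces NaN
is.na(x)
[1] FALSE  TRUE FALSE  TRUE
is.nan(x)
[1] FALSE FALSE FALSE  TRUE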
➢ Class of an object
For example "numeric","logical","character","list","matrix","array", "factor" and "data.frame"
are possible values.
Y<-c(2,4,6,8,10,12)
X<-c(1,2,3,4,5,6)
b<-data.frame(X,Y)
b
  X  Y
1 1  2
2 2  4
3 3  6
4 4  8
5 5 10
6 6 12
class(b)
[1] "data.frame"
b) Arrays
An array can be considered as a multiply subscripted collection of data entries. R allows simple
facilities for creating and handling arrays, particularly the special case of matrices. A
dimension vector is a vector of non-negative integers. The dimensions are indexed from one
up to the values given in the dimension vector. A vector can be used by R as an array only if
it has a dimension vector as its dim attribute. Suppose, for example, z is a vector of 15
elements. The assignment dim(z) <-c(3,5) gives it the dim attribute that allows it to be
treated as a 3 by 5 array.
➢ Array indexing: Individual elements of an array may be referenced by giving the array's
name followed by the subscripts in square brackets, separated by commas. More generally,
subsections of an array may be specified by giving a sequence of index vectors in place of
subscripts; however, if any index position is given an empty index vector, then the full
range of that subscript is taken. To access elements in a 2D array, you need two indices
– one for the row and one for the column. The first index refers to the row number, and
the second refers to the column number.
Eg: z <- array(c(1,2,3,4,5,6), c(3,2))
z
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
z[2,2]
[1] 5
➢ The array() function
As well as giving a vector structure a dim attribute, arrays can be constructed from vectors by the array() function, which has the form
Z <- array(data_vector, dim_vector)
For example, if we create an array of dimension c(2, 3), it creates a matrix with 2 rows and 3 columns. The array() function takes a vector as input and uses the values in the dim argument to create the array.
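A small illustration:
z <- array(c(1,2,3,4,5,6), dim = c(2, 3))
z
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6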
c) Matrices
Matrices are mostly used in statistics and so play an important role in R. To create a matrix,
use the function matrix(), specifying elements by column first.
Eg: matrix(1:12,nrow=3,ncol=4)
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
matrix(c(1,2,3,4,5,6),nrow=2)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
matrix(c(1,2,3,4,5,6),byrow=TRUE,ncol=3)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
matrix(c(1,2,3,4,5,6),ncol=3)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
➢ Special functions for constructing certain matrices:
diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
produces a 3 × 3 identity matrix.
diag(1:3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 2 0
[3,] 0 0 3
➢ Matrix multiplication is performed using the operator %*%, which is distinct from scalar
multiplication *.
a<-matrix(c(1:9),3,3)
x<-c(1,2,3)
a
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
a%*%x
[,1]
[1,] 30
[2,] 36
[3,] 42
➢ Standard functions exist for common mathematical operations on matrices.
1. Transpose of a matrix
t(a)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
2. Determinant of a matrix
a<-matrix(c(1:8,10),3,3)
a
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 10
det(a)
[1] -3
3. Dimension of a matrix
dim(a)
[1] 3 3
4. Inverse of a matrix
a<-matrix(c(1:8,10),3,3)
solve(a)
[,1] [,2] [,3]
[1,] -0.6666667 -0.6666667 1
[2,] -1.3333333 3.6666667 -2
[3,] 1.0000000 -2.0000000 1
➢ Combining matrices
You can stitch matrices together using the rbind() and cbind() functions.
cbind(a,t(a))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 4 7 1 2 3
[2,] 2 5 8 4 5 6
[3,] 3 6 10 7 8 10
Note: Uni-dimensional arrays are called vectors in R. Two-dimensional arrays are called
matrices.
The function eigen(Sm) calculates the eigenvalues and eigenvectors of a symmetric matrix
Sm. The result of this function is a list of two components named values and vectors. The
assignment ev <- eigen(Sm) will assign this list to ev. Then ev$val is the vector of eigenvalues
of Sm and ev$vec is the matrix of corresponding eigenvectors.
Eg:
eigen(a)
eigen() decomposition
$values
[1] 16.7074933 -0.9057402 0.1982469
$vectors
[,1] [,2] [,3]
[1,] -0.4524587 -0.9369032 0.1832951
[2,] -0.5545326 -0.1249770 -0.8624301
[3,] -0.6984087 0.3264860 0.4718233
d) Lists
Lists are the main objects for holding heterogeneous data in R. They are a bit like vectors, except that each entry can be any other R object.
x<-list(1:3,TRUE,"hello")
x[[3]]
[1] "hello"
Here x has three elements: a numeric vector, a logical value, and a string. We can select an entry of x with double square brackets. The function names() can be used to obtain a character vector of the names of the objects in a list.
e) Data frames
A data frame in R is a tabular data structure whose columns can store values of different data types. Use class(your_data_frame) or is(your_data_frame, "data.frame") to check whether an object is a data frame.
• The command data.frame() creates a data frame, each argument representing a column.
books<-data.frame(author=c("Raju","Radha"),year=c(1980,1979))
books
author year
1 Raju 1980
2 Radha 1979
We can select rows and columns in the same way as in the matrices.
books[2,]
  author year
2  Radha 1979
• as.list() will convert a data frame object into a list object.
• dim() returns the dimensions (rows and columns) of a data frame.
dim(books)
[1] 2 2
• names(books) will return the column names of a data frame, and row.names(books) will return the row names.
summary(books)
author year
Length:2 Min. :1979
Class :character 1st Qu.:1979
Mode :character Median :1980
Mean :1980
3rd Qu.:1980
Max. :1980
• The unique() function in R is used to eliminate or delete the duplicate values or the rows
present in the vector, data frame, or matrix as well.
A <- c(1, 2, 3, 3, 2, 5, 6, 7, 6, 5)
unique(A)
[1] 1 2 3 5 6 7
• factor(): In R, factors are used to work with categorical variables, i.e., variables that have a fixed and known set of possible values.
x <-c("female", "male", "male", "female")
factor(x)
[1] female male male female
Levels: female male
• The table() function in R is used to create a categorical representation of data, tabulating each distinct value of a variable together with its frequency.
vec = c(10, 14, 13, 10, 12, 13, 12, 10, 14, 12)
table(vec)
vec
10 12 13 14
3 3 2 2
• The str() function displays the internal structure of an object such as an array, list, matrix,
factor, or data frame.
vec = list(10, 14, 13, 10, 12, 13, 12, 10, 14, "as")
str(vec)
List of 10
$ : num 10
$ : num 14
$ : num 13
$ : num 10
$ : num 12
$ : num 13
$ : num 12
$ : num 10
$ : num 14
$ : chr "as"
• The View() function in R can be used to invoke a spreadsheet-style data viewer within
RStudio.
• The paste() function in R is used to concatenate string values, separated by a specified delimiter.
string1 <- "R"
string2 <- "RStudio"
answer <- paste(string1, string2, sep=" and ")
print(answer)
[1] "R and RStudio"
• The print() function prints the specified message to the screen, or other standard
output device.
print("hello")
[1] "hello"
• In R, we can write data frames easily to a file using the write.table() and write.csv() commands.
write.table(books, file = "books1.txt", row.names = FALSE)
write.csv(books, file = "books1.csv", row.names = FALSE)
R, by default, writes a column of row names; setting row.names = FALSE, as above, creates the file without them.
Functions
A function in a programming language is much like its mathematical equivalent. It has some
input called arguments and an output called return value.
Writing function
square<-function(x){
x^2
}
square(4)
[1] 16
Note: objects which are created inside a function do not exist outside it.
• for() loops
The most common way to execute a block of code multiple times is with a for () loop.
for (x in 1:5) {
+ print(x)
+}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Other commonly used loops are the while loop and nested loops.
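For example, a small while loop:
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}
[1] 1
[1] 2
[1] 3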
• if() statement
An if() statement determines whether a statement or block of statements will be executed or not, i.e., if a certain condition is true, then the block of statements is executed; otherwise, it is not.
a <- 5
if(a > 0)
+{
+ print("Positive Number")
+}
[1] "Positive Number"
R Packages
R comes with many built-in data sets, particularly in the MASS package. A package is a collection of functions, data sets, and other objects. To install and load a package:
install.packages("package name")
library("package")
Eg: lubridate is an R package that makes it easier to work with dates and times.
install.packages("lubridate")
library(lubridate)
To get the list of data sets available in base R we can use data(); to see the data sets available in a package, we first load that package, after which data() also lists the data sets it provides.
a) Tidyverse package
The Tidyverse suite of integrated packages is designed to work together to make common
data science operations more user-friendly. The packages have functions for data wrangling,
tidying, reading/writing, parsing, and visualizing, among others.
• Data import and management: tibble, readr
1. ggplot2
ggplot2 is an R data visualization library based on the Grammar of Graphics. ggplot2 can create data visualizations such as bar charts, pie charts, histograms, scatterplots, error charts, etc., using a high-level API. It also allows you to combine different data visualization components, or layers, in a single visualization. Once ggplot2 has been told which variables to map to which aesthetics in the plot, it does the rest of the work, so the user can focus on interpreting the visualizations and spend less time creating them. On the other hand, this also means that highly customized graphics are less straightforward than with lower-level plotting. If you want to install ggplot2, the best method is to install the tidyverse using:
install.packages("tidyverse")
library("ggplot2")
If stat = "identity", the bar chart will display the values in the data frame as they are, as in the sketch below.
2. dplyr
dplyr is a very popular data manipulation library in R. It has five important functions that combine naturally with the group_by() function, which allows them to be performed by group: mutate() adds new variables that are functions of existing variables, select() picks variables based on their names, filter() picks rows based on their values, summarise() reduces multiple values down to a single summary, and arrange() changes the ordering of the rows. If you want to install dplyr, the best method is to install the tidyverse using:
install.packages("tidyverse")
Or you can just install dplyr using:
install.packages("dplyr")
Eg:
library(dplyr)
data(starwars)
print(starwars %>% filter(species == "Droid"))
3. tidyr
tidyr is a data cleaning library in R that helps create tidy data. Tidy data means that every cell holds a single value, each column is a variable, and each row is an observation. Tidy data is a staple of the tidyverse, and it ensures that more time is spent on analysing data and obtaining value from it rather than continuously cleaning data and modifying tools to handle untidy data. The functions in tidyr fall broadly into five categories: pivoting, which changes the data between long and wide forms; nesting, which changes grouped data so that a group becomes a single row containing a nested data frame; splitting character columns and combining them again; rectangling, which converts deeply nested lists into tidy tibbles; and converting implicit missing values into explicit values. If you want to install tidyr, the best method is to install the tidyverse using:
install.packages("tidyverse")
Or you can just install tidyr using:
install.packages("tidyr")
4. stringr
stringr is a library that has many functions used for data cleaning and data preparation tasks.
It is also designed for working with strings and has many functions that make this an easy
process.
All of the functions in stringr start with str_ and take a vector of strings as their first
argument. Some of these functions include str_detect(), str_extract(), str_match(),
str_count(), str_replace(), str_subset(), etc. If you want to install stringr, the best method
is to install the tidyverse using:
install.packages("tidyverse")
Or you can just install stringr from CRAN using:
install.packages("stringr")
Eg:
library(stringr)
str_length("hello")
[1] 5
5. readr
The readr library provides a simple and speedy way to read rectangular data in file formats such as tsv, csv, delim, and fwf. readr can work out the type of each column by examining the file, and in most cases it does this automatically. readr reads different kinds of file formats using different functions, namely read_csv() for comma-separated files, read_tsv() for tab-separated files, read_table() for tabular files, read_fwf() for fixed-width files, read_delim() for delimited files, and read_log() for web log files. If you want to install readr, the best method is to install the tidyverse, or you can just install readr using:
install.packages("readr")
library(readr)
myData = read_tsv("sample.txt", col_names = FALSE)
print(myData)
6. tibble
You can create new tibbles from column vectors using the tibble() function, and you can also create a tibble row-by-row using the tribble() function. If you want to install tibble, the best method is to install the tidyverse, or you can just install tibble using:
install.packages("tibble")
library(tibble)
tib <- tibble(a = c(1,2,3), b = c(4,5,6), c = c(7,8,9))
tib
# A tibble: 3 x 3
a b c
<dbl> <dbl> <dbl>
1 1 4 7
2 2 5 8
3 3 6 9
R plotting
a. plot()
Draw a line
The plot() function also takes a type parameter with the value l to draw a line to connect all
the points in the diagram:
plot(x,y,type="l")
Plot label
b. Box plot
Eg:
boxplot(iris[,1],xlab="Sepal.Length",ylab="Length(in centemeters)", main="Summary
Characteristics of Sepal.Length(Iris Data) ")
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris
virginica and Iris versicolor). Four features were measured from each sample: the length
and the width of the sepals and petals, in centimeters. If we want to add color to boxplot
use argument ‘col’.
Eg:
boxplot(iris[,1],xlab="Sepal.Length",ylab="Length(in
centemeters)", main="Summary Characteristics of
Sepal.Length(Iris Data) ",col= “orange”)
c. Histogram
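Histograms are drawn with the hist() function; for example, for the Sepal.Length column of the iris data used above:
hist(iris[, 1], xlab = "Sepal.Length",
     main = "Histogram of Sepal.Length (Iris Data)", col = "lightblue")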
d. Bar plots
Bar plots can be created in R using the barplot() function. We can supply a vector or matrix
to this function. If we supply a vector, the plot will have bars with their heights equal to the
elements in the vector.
Eg:
max.temp <- c(22, 27, 26, 24, 23, 26, 28)
barplot(max.temp,
        main = "Maximum Temperatures in a Week",
        xlab = "Day",
        ylab = "Degree Celsius",
        names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
        col = "lightgreen")
e. Scatterplots
A "scatter plot" is a type of plot used to display the relationship between two numerical
variables, and plots one dot for each observation. It needs two vectors of same length, one
for the x-axis (horizontal) and one for the y-axis (vertical):
Eg:
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y, main="Observation of Cars",
xlab="Car age", ylab="Car speed",col="black",
pch=21,bg="lightgreen")
Pros and Cons of R
Advantages of R
• Open source
• Data wrangling
• Array of packages
• Quality of plotting and graphing
• Platform independent
• Machine learning operations
• Continuously growing
Disadvantages of R
• Weak origin
• Data handling
• Basic security
• Complicated Language
Bibliography
Venables, W. N., Smith, D. M., & the R Development Core Team (2007). An introduction to R. https://cran.r-project.org/doc/manuals/R-intro.pdf
https://www.stats.ox.ac.uk/~evans/Rprog/LectureNotes.pdf
https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
https://web.itu.edu.tr/~tokerem/The_Book_of_R.pdf
Chapter 3
Regression Analysis: Simple and Multiple
Regression Using R
V Chandrasekar1, Ramadas Sendhil2 & Geethalakshmi V1
1 ICAR-Central Institute of Fisheries Technology, Cochin, India
2 Department of Economics, Pondicherry University (A Central University), Puducherry, India.
In the world of statistics and data analysis, simple linear regression serves as one of the
fundamental tools for exploring the relationship between two variables. It allows us to
understand how changes in one variable are associated with changes in another. In this
chapter, we'll delve into the concept of simple linear regression, its mechanics, and how to
interpret the results. Simple linear regression involves two main variables: the independent
variable (X) and the dependent variable (Y). The relationship between these variables is
assumed to be linear, meaning that changes in X result in proportional changes in Y. The
equation that defines simple linear regression is: Y = β0 + β1X + ϵ
Where;
Y = dependent variable.
X = independent variable.
β0 = intercept (the value of Y when X is zero).
β1 = slope (the change in Y for a one-unit change in X).
ϵ = error term
Example: Income and Expenditure
Let's consider a scenario where a researcher wants to study the effect of income on expenditure habits. The researcher collects data from several
individuals with varying income levels and records their corresponding expenditure amounts. Here's a summary of the data:
[Data table: paired Income and Expenditure observations for the surveyed individuals.]
To run the example data in R, you can follow these steps:
1. Install R and RStudio
If you haven't done so yet, you should install R and RStudio. R is a programming language
used for statistical computing, while RStudio is an integrated development environment (IDE)
designed to simplify working with R. You can download R from the Comprehensive R Archive
Network (CRAN) at https://cran.r-project.org/ and RStudio from the official RStudio website
(https://www.rstudio.com/products/rstudio/download/).
2. Open R Studio
After installing RStudio, open the application.
Before conducting linear regression analysis, it is essential to verify that the data satisfies the
four key assumptions for linear regression. These assumptions include:
• Linearity: The relationship between the independent and dependent variables must
be linear.
• Independence: The observations should be independent of one another.
• Homoscedasticity: The residuals should have constant variance across all levels of
the independent variable(s).
• Normality of Residuals: The residuals (errors) should follow a normal distribution.
Let's use R to check these assumptions using the example income dataset.
Read the CSV file into R
income_data <- read.csv("path/to/your/income_data.csv")
Load necessary libraries
library(ggplot2)
library(car)
Simple Regression
summary(income_data)
In a simple regression analysis of income data, the summary function provides a numerical
overview, including minimum, median, mean, and maximum values for income and expenditure
variables.
Income Expenditure
Min. :15086 Min. : 3596
1st Qu.:30086 1st Qu.:24361
Median :44260 Median :36887
Mean :44692 Mean :36185
3rd Qu.:59942 3rd Qu.:47904
Max. :74838 Max. :72188
Multiple Regression
summary(cancer_data)
Given that the variables are numeric, running the code generates a numerical summary for both the independent variables (junk food consumption and drinking) and the dependent variable (cancer disease occurrence).
Cancer occurrence Junk food Drinking
Min. : 0.79 Min. : 1.62 Min. : 0.76
1st Qu.: 9.33 1st Qu.: 29.51 1st Qu.:12.02
Median :15.14 Median : 51.88 Median :22.91
Mean :14.70 Mean : 54.63 Mean :22.32
3rd Qu.:19.95 3rd Qu.: 83.18 3rd Qu.:32.36
Max. :29.51 Max. :107.15 Max. :43.65
Verify that your data meets the assumptions.
In R, we can verify that our data meet the four key assumptions for linear regression. Since
we are working with simple regression involving just one independent and one dependent
variable, there is no need to check for hidden relationships between variables. However, if
there is autocorrelation within the variables (e.g., multiple observations from the same
subject), simple linear regression might not be appropriate. In these situations, a more
structured approach like a linear mixed-effects model should be considered. Additionally, use the hist() function to assess whether the dependent variable follows a normal distribution.
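For the income data this could be, for example (assuming Expenditure is the dependent variable):
hist(income_data$Expenditure)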
Given that the observations display a bell-shaped distribution with a concentration in the
middle and fewer data points at the extremes, we can proceed with linear regression.
Linearity: To assess linearity, we visually inspect the relationship between the independent
and dependent variables using a scatter plot to determine if a straight line could adequately
represent the data points.
Since the relationship appears approximately linear, we can move forward with the linear model. Homoscedasticity, or constant variance, means the prediction error remains stable across the model's prediction range; we will assess this assumption after fitting the linear model.
Use the cor() function to check if your independent variables are highly correlated.
cor(cancer_data$Junkfood, cancer_data$Drinking)
For example, cor(cancer_data$Junkfood, cancer_data$Drinking) gives an output of 0.015, indicating a negligible correlation (only 1.5%). Hence, both parameters can be included in our model. Use the hist() function to determine whether your dependent variable follows a normal distribution:
hist(cancer_data$Cancer)
Since the observations show a bell-shaped pattern, we can proceed with the linear regression.
Fit a linear regression model
model <- lm(Cancer ~ Exercise, data = cancer_data)
Check Assumptions
1. Linearity: Check by plotting a scatterplot of the predictor against the response variable
ggplot(cancer_data, aes(x = Exercise, y = Cancer)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Exercise (No of days)", y = "Cancer (severity)",
       title = "Scatterplot of Cancer vs. Exercise")
To interpret the results of the provided data, we conducted a linear regression analysis
between the variables "Exercise" and "Cancer" to explore their potential relationship. The
analysis included examining the scatterplot between exercise and cancer to visually identify
any patterns, calculating the correlation coefficient to measure the strength and direction of
their relationship, and fitting a linear regression model to the data. The coefficients from the
regression equation were interpreted to understand how exercise levels might relate to cancer
severity. Additionally, the model's goodness-of-fit was assessed using metrics like R-squared
and p-values. These steps provided insights into the potential impact of exercise on cancer
severity and helped identify limitations or areas for further investigation in the dataset, such
as the influence of other factors like drinking habits on cancer severity.
2. Independence: Not directly testable from data, but typically assumed based on study
design. If data comes from a randomized experiment or a properly designed observational
study, independence can be assumed.
The `residualPlot` will plot the residuals against the fitted values. We are looking for a random
scatter of points around the horizontal line at zero, indicating homoscedasticity.
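With the car package loaded earlier, this is simply:
residualPlot(model)   # residuals vs. fitted values; a random scatter around zero suggests homoscedasticity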
The linear regression analysis conducted on the dataset revealed a significant relationship between Cancer and Exercise, as demonstrated by the scatterplot, where the fitted regression line slopes downward, indicating a negative association between the two variables. The diagnostic plots
created for the regression model, such as the residuals vs. fitted values plot, quantile-quantile
(Q-Q) plot, scale-location plot, and residuals vs. leverage plot, offer insights into the model's
assumptions and possible issues. These plots help evaluate the regression model's adequacy,
including its linearity, homoscedasticity, normality of residuals, and influential observations.
Overall, the analysis suggests that Exercise may be a significant predictor of Cancer, but
further investigation and model refinement may be necessary to fully understand the
relationship and ensure the model's validity and reliability.
> shapiro.test(cancer_data$Cancer)
Shapiro-Wilk normality test
data: cancer_data$Cancer
W = 0.98021, p-value = 2.709e-06
> shapiro.test(cancer_data$Exercise)
Shapiro-Wilk normality test
data: cancer_data$Exercise
W = 0.95028, p-value = 6.836e-12
> shapiro.test(cancer_data$Drinking)
Shapiro-Wilk normality test
data: cancer_data$Drinking
W = 0.9615, p-value = 3.951e-10
This code will conduct Shapiro-Wilk tests for normality on the variables "Cancer," "Exercise,"
and "Drinking" in the dataset named `cancer_data`.
The Shapiro-Wilk normality tests were conducted for three variables: "Cancer," "Exercise,"
and "Drinking." For the "Cancer" variable, the test yielded a Shapiro-Wilk statistic (W) of
0.98021 and a very low p-value of 2.709e-06, indicating a rejection of the null hypothesis of
normality. Similarly, for the "Exercise" variable, the test resulted in a W statistic of 0.95028
and an extremely low p-value of 6.836e-12, also leading to the rejection of the null
hypothesis. Finally, for the "Drinking" variable, the W statistic was 0.9615 with a p-value of
3.951e-10, again indicating non-normality. In summary, all three variables significantly
deviate from a normal distribution based on the Shapiro-Wilk tests.
shapiro.test(residuals(model))
The histogram and QQ plot of residuals visually indicate whether the residuals are
approximately normally distributed. The Shapiro-Wilk test provides a formal test of normality.
If the p-value from the test is greater than 0.05, we fail to reject the null hypothesis of
normality. Inspecting these plots and conducting tests will help you determine whether your
data meet the assumptions for linear regression. If the assumptions are violated, you may
need to apply transformations to the variables or consider alternative modelling techniques.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.556276 0.212113 120.48 <2e-16 ***
Exercise -0.198772 0.003376 -58.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.338 on 496 degrees of freedom
Multiple R-squared: 0.8748, Adjusted R-squared: 0.8746
F-statistic: 3467 on 1 and 496 DF, p-value: < 2.2e-16
The intercept (25.556276) represents the estimated Cancer level when the Exercise variable
is zero. The coefficient for Exercise (-0.198772) suggests that for each unit increase in
Exercise, Cancer decreases by 0.198772 units on average. The p-value for Exercise is
extremely low (<2e-16), indicating strong evidence that Exercise is associated with Cancer.
Prediction
For a given level of Exercise, you can predict the corresponding Cancer level using the
equation: Cancer = 25.556276 - 0.198772 * Exercise.
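For example, the fitted model object can be used directly with predict() (here for a hypothetical Exercise value of 10):
predict(model, newdata = data.frame(Exercise = 10))   # equivalent to 25.556276 - 0.198772 * 10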
Interpretation of Slope
The slope coefficient (-0.198772) represents the change in the response variable (Cancer)
per unit change in the predictor variable (Exercise). In this case, it suggests that, on average,
an increase of one unit in Exercise is associated with a decrease of 0.198772 units in Cancer.
Overall, the results indicate a significant negative association between Exercise and Cancer,
with Exercise explaining a large proportion of the variability in Cancer levels.
Bibliography
Goldberger, A. S. (1991). A course in econometrics. Harvard University Press.
Greene, W. H. (1993). Econometric analysis (2nd ed., pp. 535–538). Macmillan.
Gujarati, D. N. (2012). Basic Econometrics. McGraw Hill Education Private Limited.
Chapter 4
Diagnostic Tests in Regression Analysis
Amaresh Samantaraya1
1Department of Economics, Pondicherry University (A Central University),
Puducherry, India.
Introduction
Econometric analysis is widely used today in empirical research, both in academics and in aiding policy and decision making by public authorities and private business. Literally, econometrics means economic measurement. It is rare to find a research paper published in a professional journal in economics, or a report pertaining to economic policy published by government and professional organizations, without the application of econometric tools. But the application of econometric analysis is not confined to economics alone. Researchers and analysts often use econometric analysis for the empirical investigation of issues in a variety of disciplines, including commerce and management, sociology, psychology, and medical studies. The importance of econometrics for undertaking empirical analysis in a variety of fields cannot be overstated.
* Exogenous money supply and Keynesian money demand
Major Components of Regression Analysis
Regression analysis in econometrics is composed of four main steps: (a) Model Specification, (b) Model Estimation, (c) Diagnostic Checks, and (d) Hypothesis Testing and Inferences. Each of these steps is briefly explained below.
a statement or draw inferences about the impact of the explanatory variables on the dependent variable, in general. If we could have obtained data for the entire population (all possible data points over time and across countries), then the counterparts of the βi's in Equation (1) could be termed population parameters.
In hypothesis testing, which is part of the last step as discussed below, the researcher makes inferences about the population parameters based on the sample estimates obtained from Step 2 above. Then, what is the role of diagnostic checks? This is explained below.
The Gauss-Markov theorem states that if the assumptions of the Classical Linear Regression Model (CLRM) are satisfied, the OLS sample estimates become Best Linear Unbiased Estimators (BLUE) of the population parameters. These assumptions are listed below:
(i) The mean of the stochastic error term in Equation (1) is zero, i.e., E(u_t) = 0.
(ii) There is no autocorrelation in the stochastic error term, i.e., E(u_t u_s) = 0 for all t ≠ s.
(iii) The stochastic error term is homoscedastic, i.e., E(u_t²) = σ².
(iv) There is no correlation between the explanatory variables and the stochastic error term.
(v) There is no perfect multicollinearity or high multicollinearity amongst the explanatory variables.
(vi) The explanatory variables are non-stochastic.
(vii) The econometric model is correctly specified. This requires that no relevant explanatory variable is excluded from the model, no irrelevant explanatory variable is included in the model, and the mathematical functional form of the model is correct.
(viii) The stochastic error term is normally distributed.
If all of the above assumptions except the last one are satisfied, application of OLS to estimate the econometric model will produce the best linear unbiased estimators of the population parameters; hence, we need not look for any alternative method of estimation. Otherwise, we need to revise the method of estimation. If the last assumption is also satisfied, it facilitates hypothesis testing. Against this backdrop, diagnostic checks are undertaken in econometric analysis to establish the relevance of OLS as the estimation procedure and, accordingly, to draw inferences from the estimated results. In addition, in conventional analysis a battery of indicators such as the coefficient of determination, the Akaike Information Criterion (AIC), the Schwarz Bayesian Criterion (SBC), and the t-test and F-test are employed as part of diagnostic checks in econometric analysis. Necessary details are provided in Section III.
(d) Hypothesis Testing and Inferences
Hypothesis testing is used to draw inferences about population parameters based on the sample estimates. For example, consider examining the validity of the liquidity preference theory. One can collect data on the interest rate, money stock, and GDP for a particular country, say India, for a given period, say 1970-71 to 2019-20, and estimate the values of the intercept and partial slope coefficients by applying OLS to, say, Equation (1). One can also employ panel data estimation techniques using data for several countries, say India, Brazil, the US, etc., for a given period of time. In any case, each of the above data sets represents a sample for a country or a set of countries over a given period. The sample estimates are certainly relevant for the sample. But our ultimate objective is to make a general statement about, say, the impact of a change in the money stock or GDP on the interest rate, and thus on the relevance of the liquidity preference theory.
Economics and many streams of the social sciences are non-experimental in nature. Economists and statisticians cannot produce data on interest rates or money stock in the laboratory. They can only rely on available data for different countries for specific time periods. In other words, economists can never obtain data pertaining to all countries and all times. Given this constraint, hypothesis testing is used to make a statement about the impact of the money stock or GDP on interest rates (for the population), based on the sample estimates obtained from the estimated results for the sample.
For example, if the sample estimate of, say, β1 is obtained as 0.06 by applying OLS to Equation (1), can the researcher reject any statistically significant impact of the money stock on the interest rate (assuming Y_t stands for the interest rate and X_it represents the money stock)? Or, based on the same, can the researcher infer a statistically significant impact of the money stock on the interest rate? The related exercise is quite rigorous.
It may be noted that the focus of the present chapter is to discuss diagnostic checks in econometrics. Hence, the other steps in econometric analysis are covered above only very briefly. The readers may refer to standard textbooks on econometrics (given in the references) for further details.
highlighted in the following. The readers may refer to standard textbooks on econometrics for technical details. Moreover, we confine ourselves to methods of detection that are widely used in practical econometric analysis; the related list is not exhaustive.
† Estimated t = (sample estimate of βi − population parameter βi) / (standard error of the sample estimate of βi). The standard error of the estimated βi depends on the standard error of u_t.
negative autocorrelation. However, unlike the t or F values popularly used in regression analysis for hypothesis testing, the d-statistic does not follow a standard distribution. We can, however, use the lower and upper critical values tabulated by Durbin and Watson to check for the presence or absence of autocorrelation, using the decision criteria given below.
It may be noted that the d-statistic can be used to check for first-order autocorrelation only; it cannot be used to verify higher-order autocorrelation. Secondly, if the estimated d-statistic falls in certain (inconclusive) regions, as indicated in the graph above, we cannot make a decision about the presence or absence of the autocorrelation problem. There are several other shortcomings of the d-statistic test for autocorrelation, and the LM test discussed below is used to overcome such deficiencies. Nevertheless, the DW test is popularly used because it is readily calculated from the estimated stochastic error terms and is reported by default in most econometric software packages.
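As a hedged illustration, the d-statistic (and an associated p-value) can be obtained in R with the dwtest() function of the lmtest package; the fitted lm object is assumed to be called model:
library(lmtest)     # install.packages("lmtest") if necessary
dwtest(model)       # Durbin-Watson test for first-order autocorrelation of the residuals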
In the presence of autocorrelation, two broad types of remedial measures are available to the researcher. Firstly, one can use GLS methods such as the Cochrane-Orcutt and Hildreth-Lu procedures to correct for the problem of autocorrelation. Secondly, standard errors that are robust to autocorrelation can be used for hypothesis testing instead of the standard errors obtained from OLS. Popular econometric software packages have incorporated both methods. Providing technical details of these remedial measures is beyond the scope of this chapter.
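For the second route, a minimal sketch in R using the sandwich and lmtest packages (an illustrative choice of packages, not one prescribed in this chapter) is:
library(lmtest)
library(sandwich)
coeftest(model, vcov = NeweyWest(model))   # t-tests with Newey-West (HAC) standard errors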
(b) Tests for Heteroscedasticity
Similar to the above, if the stochastic error term in Equation (1) suffers from the problem of heteroscedasticity, the estimated t-values derived from OLS can lead to erroneous inferences. In that case, the OLS estimates continue to be unbiased, but they are not efficient. An alternative procedure, Weighted Least Squares (WLS), is BLUE in the presence of heteroscedasticity. The Park test and the White test are popularly used to check for the presence of heteroscedasticity.
Park Test
It is a two-step test procedure. In the first step, OLS is applied to the original regression model as in Equation (1), and the estimated values of the stochastic error terms are obtained. In the second step, the square of the estimated stochastic error term is regressed on the explanatory variable(s) of the original regression model that is expected to cause the heteroscedasticity of u_t. Say, if in Equation (1) X_1t is expected to cause heteroscedasticity, then in the second step the squared series of the estimated u_t is regressed on X_1t. If the coefficient of X_1t in this second regression is statistically significant (using, say, a t-test), it indicates that the original equation suffers from the problem of heteroscedasticity. On the contrary, if it is statistically insignificant, one can safely conclude that the original regression model does not suffer from the problem of heteroscedasticity.
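A minimal R sketch of the two-step procedure described above, where model is the original OLS fit, dat is the data frame used to estimate it, and X1 is the variable suspected of driving the heteroscedasticity (all three names are placeholders):
u2 <- residuals(model)^2          # step 1: squared residuals from the original model
park_fit <- lm(u2 ~ dat$X1)       # step 2: regress them on the suspect variable
summary(park_fit)                 # a significant slope coefficient points to heteroscedasticity
# The classical Park test uses logs instead: lm(log(u2) ~ log(dat$X1))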
White Test
The White test is a two-step LM procedure widely used to detect the presence of heteroscedasticity. As in the Park test, OLS is first applied to the original regression model and the estimated values of the stochastic error term are obtained. In the second step, the square of the estimated stochastic error term series is regressed on the explanatory variables of the original model along with their squared terms and cross-products. From this auxiliary regression, the estimated coefficient of determination is used to construct an F or LM test, which helps in checking for the presence or absence of heteroscedasticity.
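One hedged way of carrying out this auxiliary regression in R is through bptest() in the lmtest package, supplying the regressors with their squares and cross-product; the variable and object names below are illustrative:
library(lmtest)
bptest(model, ~ X1 + X2 + I(X1^2) + I(X2^2) + X1:X2, data = dat)   # White-type test for a model with regressors X1 and X2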
procedure. Secondly, similar to the case of the autocorrelation problem, White's robust standard errors or heteroscedasticity- and autocorrelation-consistent (HAC) standard errors can be used to undertake hypothesis testing for the estimated OLS coefficients of the original regression model.
where r²₂₃ represents the squared correlation coefficient between the two explanatory variables, so that VIF = 1/(1 − r²₂₃). The VIF represents the magnification of the variance of the estimated coefficients in the original regression model due to strong correlation between the explanatory variables. If the correlation coefficient between two explanatory variables is 0.8, the VIF is 2.78; with a rise in the correlation coefficient to 0.95, the VIF rises to 10.26. Thus, a correlation coefficient of less than 0.8 does not cause a grave problem for the regression analysis.
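In R, variance inflation factors can be computed directly, for example with vif() from the car package (a sketch assuming a multiple-regression object model); the manual check reproduces the figure quoted above:
library(car)
vif(model)            # one VIF per explanatory variable
r <- 0.8              # correlation between two explanatory variables
1 / (1 - r^2)         # VIF = 1/(1 - r^2) = 2.78, as noted in the text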
As part of remedial measures, the explanatory variables are transformed, or sometimes one of the explanatory variables that is found to be strongly correlated with another is dropped, to get rid of the high-multicollinearity problem. But many econometricians believe that the remedy causes a bigger problem than multicollinearity itself, and sometimes the economic interpretation of the estimated coefficients of the transformed variables may not be relevant to the research. So, many prefer not to do anything, because, despite high multicollinearity, OLS continues to be BLUE.
(d) Model specification
The researcher should ensure that the mathematical functional form and the explanatory variables included in the econometric model are as per the theoretical prescriptions. For example, if one needs to estimate the Phillips curve (suggesting a negative association between inflation and unemployment), the regression model should take the form of an inverse function, in line with the functional form of the Phillips curve – a rectangular hyperbola. Moreover, the scholar also needs to be mindful about what to include or exclude as right-hand-side variables. As detailed in standard econometric textbooks, exclusion of relevant explanatory variables makes the OLS estimates biased, while inclusion of irrelevant variables makes OLS inefficient. Before applying standard procedures to detect model specification errors, the scholar should ensure that the functional form and the variables on the right-hand side of the regression model are in strict conformity with economic theory or the postulations of the relevant area of research. To check for model misspecification errors, broadly two types of criteria are used. Firstly, the Durbin-Watson d-statistic provides a good rule-of-thumb test. If the d-statistic is estimated to be close to 2, it implicitly suggests the absence of model specification error. On the other hand, if the estimated value of the d-statistic is low (lower than the lower critical value dL), concerns about model misspecification cannot be avoided. Ramsey's Regression Specification Error Test (RESET) is widely used to assure the researcher about the correctness of the model specification. It uses an F-test to check whether the model can be improved by including any missing variable. Technical details are provided in standard econometric textbooks.
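Ramsey's RESET is available in R through resettest() in the lmtest package; a minimal sketch, assuming a fitted object model, is:
library(lmtest)
resettest(model, power = 2:3, type = "fitted")   # F-test on powers of the fitted values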
If the JB value is estimated to be zero, it implies that S = 0 and K = 3, which suggests that the distribution is normal. Under the null hypothesis of a normal distribution of u_t, JB asymptotically follows a chi-square distribution with 2 degrees of freedom. Therefore, by comparing the estimated JB value with the corresponding critical chi-square value, one can draw inferences about the normality assumption of u_t.
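As a hedged sketch, the JB statistic can be obtained in R with jarque.bera.test() from the tseries package, or approximated by hand from the sample skewness S and kurtosis K via the standard formula JB = n[S²/6 + (K − 3)²/24]:
library(tseries)
res <- residuals(model)          # 'model' is the fitted regression, as before
jarque.bera.test(res)            # JB statistic with its chi-square(2) p-value
n <- length(res)
S <- mean((res - mean(res))^3) / sd(res)^3   # sample skewness (approximate)
K <- mean((res - mean(res))^4) / sd(res)^4   # sample kurtosis (approximate)
n * (S^2 / 6 + (K - 3)^2 / 24)               # hand-computed JB value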
In this chapter, the discussion of diagnostic checks in regression analysis/econometrics has been presented with minimal use of technical and mathematical details. The focus was to highlight the relevance of the diagnostic checks and to provide a lucid description of them. For technical details, the readers may refer to standard textbooks in econometrics, as given in the references.
Bibliography
Gujarati, D. N. (1995). Basic econometrics (3rd ed.). McGraw Hill.
Pindyck, R. S., & Rubinfeld, D. L. (2000). Econometric models and economic forecasts (4th ed.). McGraw-Hill.
Ramanathan, R. (2002). Introductory econometrics with applications (5th ed.). Harcourt College Publishers.
Chapter 5
Data Mining and Computation Software for Social Sciences
V. Geethalakshmi and V. Chandrasekar
ICAR-Central Institute of Fisheries Technology, Cochin, India.
Introduction
Statistics is the branch of science that deals with data generation, management, analysis and
information retrieval. Statistical methods dominate scientific research as they include
planning, designing, collecting data, analyzing, drawing meaningful interpretation and
reporting of research findings. Statistics has a key role to play in fisheries research carried
out in the various disciplines viz., Aquaculture, Fisheries Resource Management, Fish Genetics,
Fish Biotechnology, Aquatic Health, Nutrition, Environment, Fish Physiology and Post-
Harvest Technology for enhancing production and ensuring sustainability. For formulating
advisories and policies for stakeholders at all levels, the data generated from the various sub-
sectors in fisheries and aquaculture has to be studied.
With the advent of computational software, dealing with complicated datasets is relatively
easier. Advanced computational techniques aid in data analysis which is crucial to evolve
statistical inference from research data. Data management is also possible with advanced
statistical software.
A well-structured statistical system will form the basis for decision-making at various levels of a sector, especially during planning and implementation. Statistics can play a more dominant role:
• as a tool for policy-making and implementation
• assessing the impact of technology
• in sustaining nutritional safety
• in socio-economic upliftment of people below the poverty line
• to identify emerging opportunities through effective coordination
• speedy dissemination of information by networking and appropriate
human resource development
When a large amount of data is available in many forms, data mining can be used to derive meaningful conclusions without loss of information. Data mining helps in extracting knowledge from huge datasets. The technique aids in the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The information extracted from a data set using data mining is transformed into an understandable structure for further use. The key properties of data mining are
streamlining the sector. Data mining helps educators access student data, predict
achievement levels and pinpoint students or groups of students in need of extra attention.
Finance & Banking: The banking system maintains billions of transactions for its customer base, and automated algorithms coupled with data mining will help companies get a better view of market risks, detect fraud faster, manage regulatory compliance obligations, and get optimal returns on their marketing investments.
Insurance: The insurance companies may have to handle risks, fraud, the defaulting of
customers and also retain their customer base. In the competitive insurance market, the
products have to be priced to attract customers and find new businesses to expand their
customer base.
Manufacturing: The production line has to be aligned to the supply structure, and the other
departments like quality assurance, packing, branding and maintenance have to be taken care
of for seamless operations. The demand forecast forms the basis of supply chain and timely
delivery has to be ensured. Data mining can be used to predict wear and tear of production
assets and anticipate maintenance, which can maximize uptime and keep the production line
on schedule.
Retailing: Large customer databases hold hidden customer insight that can help you improve
relationships, optimize marketing campaigns and forecast sales. Through more accurate data
models, retail companies can offer more targeted campaigns – and find the offer that makes
the biggest impact on the customer. Data mining tools sweep through databases and identify
previously hidden patterns in one step. An example of pattern discovery is the analysis of
retail sales data to identify seemingly unrelated products that are often purchased together.
Other pattern discovery problems include detecting fraudulent credit card transactions and
identifying anomalous data that could represent data entry keying errors.
Data can be of the following types: record data (transactional); temporal data (time series, sequences such as biological sequence data); spatial and spatio-temporal data; graph data; unstructured data (tweets, status updates, reviews, news articles); and semi-structured data (publication data, XML). Data mining can be employed for:
Anomaly Detection (outlier/change/deviation detection): The identification of unusual data records that might be interesting, or of data errors that require further investigation.
Association Rule Learning (Dependency modelling): Searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products are frequently
bought together and use this information for marketing purposes. This is sometimes referred
to as market basket analysis.
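For readers working in R (the hands-on software used earlier in this volume), this kind of market basket analysis could be sketched with the arules package; the Groceries data set shipped with that package is used purely for illustration:
library(arules)
data("Groceries")                                     # example transaction data from the arules package
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
inspect(sort(rules, by = "lift")[1:5])                # the five rules with the highest lift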
Clustering is the task of discovering groups and structures in the data that are in some way
or another "similar", without using known structures in the data.
Classification is the task of generalizing known structure to apply to new data. For example,
an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression attempts to find a function that models the data with the least error.
Summarization provides a more compact representation of the data set, including visualization and report generation.
In order to explore the unknown underlying dependency in the data, an initial hypothesis is assumed; several hypotheses may be formulated for a single problem at this stage. Data generation is the second step, which can occur through a designed experiment. The second possibility arises when the expert cannot influence the data-generation process: this is known as the observational approach. An observational setting, namely random data generation, is assumed in most data-mining applications. How the data are collected affects their theoretical distribution. It is important to ensure that the data used for estimating a model and the data used later for testing and applying the model come from the same, unknown, sampling distribution. In the observational setting, data are usually "collected" from existing databases, data warehouses, and data marts.
Data pre-processing is an important step before the analysis. Firstly, outliers have to be identified and removed or treated. Commonly, outliers result from measurement errors or coding and recording errors; sometimes they are natural, abnormal values. Such non-representative samples can seriously affect the model produced later. Pre-processing involves either the removal of outliers from the data or the development of robust models that are insensitive to outliers. Data pre-processing also includes steps such as variable scaling and different types of encoding. For estimating the model, the selection and implementation of the appropriate data-mining technique is an important step.
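A small, hedged R sketch of two common pre-processing steps, outlier screening with the 1.5 × IQR rule and variable scaling, on a toy variable:
x <- c(rnorm(98), 9, -8)                                 # toy variable with two extreme values
q <- quantile(x, c(0.25, 0.75))
iqr <- diff(q)
outlier <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr   # flag candidate outliers
x_clean <- x[!outlier]                                   # remove (or otherwise treat) the flagged values
x_scaled <- scale(x_clean)                               # centre to mean 0 and standard deviation 1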
Data-mining models should help in decision making. Hence, such models need to be
interpretable in order to be useful because humans are not likely to base their decisions on
complex "black-box" models. Note that the goals of accuracy of the model and accuracy of
its interpretation are somewhat contradictory. Usually, simple models are more interpretable,
but they are also less accurate. Modern data-mining methods are expected to yield highly
accurate results using high dimensional models.
Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. Data mining has different types of classifiers:
• Decision Tree is a flow-chart-like tree structure, where each node represents a test on
an attribute value, each branch denotes an outcome of a test, and tree leaves represent
classes or class distributions.
• SVM (Support Vector Machine) is a supervised learning strategy used for classification and also for regression. When the output of the support vector machine is a continuous value, the method performs regression; when it predicts a category label for the input object, it performs classification.
• Generalized Linear Model (GLM) is a statistical technique, for linear modeling. GLM
provides extensive coefficient statistics and model statistics, as well as row diagnostics.
It also supports confidence bounds.
• Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, for instance, the probability that a given sample belongs to a particular class. Bayesian classification is based on Bayes' theorem.
• Classification by Backpropagation
• K-NN Classifier: The k-nearest neighbour (K-NN) classifier is considered an example-based classifier, which means that the training documents are used for comparison rather than an explicit class representation, such as the class profiles used by other classifiers (see the short sketch after this list).
• Rule-Based Classification represents knowledge in the form of if-then rules. A rule is assessed according to its accuracy and coverage. If more than one rule is triggered, conflict resolution is needed in rule-based classification.
• Frequent-Pattern Based Classification (or FP discovery, FP mining, or Frequent itemset
mining) is part of data mining. It describes the task of finding the most frequent and
relevant patterns in large datasets.
• Rough set theory can be used for classification to discover structural relationships within imprecise or noisy data. It applies to discrete-valued features, so continuous-valued attributes must be discretized prior to their use. Rough set theory is based on the establishment of equivalence classes within the given training data.
• Fuzzy Logic: Rule-based systems for classification have the disadvantage that they involve sharp cut-offs for continuous attributes. Fuzzy logic is valuable for data mining frameworks performing grouping/classification, and it provides the benefit of working at a high level of abstraction.
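As the small sketch referred to in the K-NN item above, classification of the built-in iris data with the k-nearest neighbour classifier from the class package in R (a purely illustrative example):
library(class)
set.seed(1)
idx   <- sample(nrow(iris), 100)                     # 100 rows for training, the rest for testing
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
pred  <- knn(train, test, cl = iris$Species[idx], k = 5)
table(Predicted = pred, Actual = iris$Species[-idx]) # confusion matrix on the held-out flowers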
Clustering: Unlike classification and prediction, which analyze class-labelled data objects or attributes, clustering analyzes data objects without consulting a known class label. In general, the class labels do not exist in the training data simply because they are not known to begin with; clustering can be used to generate such labels. The objects are clustered based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. That is, clusters of objects are created so that objects within a cluster have high similarity with each other but are dissimilar to objects in other clusters. Each cluster that is generated can be seen as a class of objects, from which rules can be inferred. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that groups similar events together.
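A minimal R sketch of clustering without class labels, using k-means on the numeric columns of the iris data (the known species are used only afterwards, to inspect the clusters):
set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)   # three clusters from the measurements alone
table(Cluster = km$cluster, Species = iris$Species)          # compare clusters with the known species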
Regression can be defined as a statistical modelling method in which previously obtained data are used to predict a continuous quantity for new observations. This classifier is also known as the continuous value classifier. There are two types of regression models: simple linear regression and multiple linear regression.
Data generation in fisheries will vary depending on the nature of the research undertaken. For example, when species behaviour, growth, abundance, etc. are studied, detailed data on spatial distribution and catch are required. If the focus is to predict the profit of the coming years, an economist should study the effect of population size on producers' costs. Macro-level data on infrastructure, employment, earnings, investment, etc. are considered when formulating management measures. Enormous amounts of data from marine fishing are generated by commercial fishing vessels and research vessels, which can be mined to analyse trends, resource abundance, etc.
In fishery technology, large volumes of data are generated in a wide range of applied scientific areas, including fishing technology, fish processing, quality control, fishery economics, marketing and management. Apart from the statistical data collected in technological research, data are also collected on production, exports, socio-economics, etc. for administrative and management decision making.
Major areas of data generation are as follows:
❖ fishing vessel and gear designs
❖ fishing methods
❖ craft and gear materials
❖ craft and gear preservation methods
❖ fishing efficiency studies
❖ fishing accessories
❖ emerging areas include use of GIS and remote sensing
Data on various aspects of fishing are collected for administrative purposes and policy making. For administrative purposes, voluminous data are generated through the fisheries departments of the states. Each district has officials entrusted with the work of data collection, which is coordinated at the state level. State-level figures are compiled at the national level by the Department of Animal Husbandry and Dairying, Ministry of Agriculture, New Delhi.
Information is also compiled on macroeconomic variables like GSDP from fishing by the
respective Directorates of Economics & Statistics.
Infrastructure
Indian fisheries is supported by a vast fishing fleet of 2,03,202 fishing crafts categorized into mechanized, motorized and non-motorized. The registration of these fishing crafts is done at various ports across India, and the license for fishing operations has to be obtained from the respective states. The fish processing sector, largely managed by the private sector, has an installed processing capacity of 11,000 tonnes per day. Data are also collected from time to time by various agencies on infrastructure facilities and inventories, such as the number of mechanized, motorized and non-motorized fishing crafts, fish landing centers, fisheries harbours, types of gears and accessories, fish markets, and ice plants and cold storages, as well as socio-economic data such as the population of fishermen, welfare schemes, cooperative societies, financial assistance, subsidies, training programs, etc.
Data on fish farms, production and area under aquaculture are maintained by the respective State Fisheries departments and compiled at the national level. Apart from capture fisheries (marine) and culture fisheries (aquaculture), fish production from inland water bodies like lakes, ponds, reservoirs, etc. is collected and compiled at the state level. For developing the sector, various programmes and projects have to be formulated and implemented. To achieve the objectives of such developmental programmes, the current status of fish production from various regions has to be known. The need for the fish production data maintained by these agencies, covering marine sources, aquaculture and inland water bodies, arises while formulating research studies and development projects at the district, state and national levels.
Data Generation along the Fish Value Chain
Fresh fish after harvest is iced and distributed through various channels into the domestic
markets and overseas markets. Around 80% of the fish is marketed fresh, 12% of fish gets
processed for the export sector, 5% is sent for drying/curing and the rest is utilized for other
purposes.
Marine Products Export Development Authority (MPEDA) maintains the database on the
export of fish and fishery products from India to various countries. The weekly prices realized
by Indian seafood products in the various overseas markets are also collected and compiled
by the agency. Marine Products Export Development Authority (MPEDA), established in
1972 under the Ministry of Commerce, is responsible for collecting data regarding production
and exports, apart from formulating and implementing export promotion strategies. Before
MPEDA was established, the Export Promotion Council of India was undertaking this task.
Fish processing factories established all over the country generate data on daily production, procurement of raw material, movement of the price structure, etc., which is generally kept confidential. Data on quality aspects are maintained by the Export Inspection Council of India through the Export Inspection Agency (EIA) in each region, under the Ministry of Commerce and Industry. The EIA is the agency that approves the suitability of products for export. The data maintained include:
⚫ bacteriological organisms present in the products
⚫ rejections in terms of quantity
⚫ reason for rejection etc.
At the Central Institute of Fisheries Technology (CIFT), we periodically collect data on the
following aspects, which are used for policy decisions.
◼ Techno-economic data on various technologies developed
◼ Data on the Economics of operation of mechanized, motorized, and traditional crafts
◼ Data for the estimation of fuel utilization by the fishing industry
◼ Year-wise data on Installed capacity utilization in the Indian seafood processing
industry
◼ Demand – supply and forecast studies on the fishing webs
◼ Harvest and post-harvest losses in fisheries
◼ Transportation of fresh fish and utilization of trash fish
◼ Impact of major trade policies like the impact of anti-dumping, trend analysis of
price movement of marine products in the export markets
◼ Study on the impact of technology and study on socio-economic aspects
Compared to other data mining software, SAS Enterprise Miner is a very comprehensive tool
that can handle a wide variety of data mining tasks. Further, it is very user-friendly and easy
to learn, even for users who are not familiar with SAS programming. Finally, it has a wide
range of built-in features and functionality, which makes it a very powerful tool.
Features
Data Preparation
Data Input
You can load a dataset into SAS Enterprise Miner by using the Data Import node. This node
lets you specify the dataset's location and other necessary information, such as variable types
and roles. Nodes are the building blocks of a SAS Enterprise Miner process flow. There are
various node types, each of which performs a different task. For example, there are nodes for
data import, data cleansing, modeling, and results visualization.
The main components of SAS Enterprise Miner are the data source, the data target, the
model, and the results. The data source is the location from which the data is being imported.
The data-target is the location to which the data is being exported. The model is the
statistical or machine learning model that is being used to analyze the data. The results are
the model output, which can be used to make predictions or decisions.
Decision trees are a type of predictive modeling used to classify data. In SAS Enterprise Miner,
decision trees are generated using the Tree Model node. This node takes a dataset as input
and generates a decision tree based on the variables in the dataset. The tree can then be
used to predict the class of new data.
Data Partition
You can split datasets in SAS Enterprise Miner by using the Partition node. This node will
take a dataset as input and will output two or more partitions based on the settings that you
specify. You can specify the percentage of records that should go into each partition, or you
can specify a particular variable on which to split the dataset. Partitioning provides mutually
exclusive data sets. Two or more mutually exclusive data sets share no observations with each
other. Partitioning the input data reduces the computation time of preliminary modeling runs.
The Data Partition node enables you to partition data sets into training, test, and validation
data sets. The training data set is used for preliminary model fitting. The validation data set
is used to monitor and tune the model weights during estimation and for model assessment.
The test data set is an additional hold-out data set that you can use for model assessment.
This node uses simple random sampling, stratified random sampling, or user-defined
partitions to create partitioned data sets.
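SAS Enterprise Miner performs this step through its graphical Data Partition node; for readers who prefer R (the hands-on software used earlier in this volume), an equivalent simple random partition could be sketched as follows, with the data frame name and proportions purely illustrative:
set.seed(123)
n   <- nrow(my_data)                                  # 'my_data' is a placeholder data frame
grp <- sample(c("train", "valid", "test"), n, replace = TRUE, prob = c(0.6, 0.2, 0.2))
train_set <- my_data[grp == "train", ]   # preliminary model fitting
valid_set <- my_data[grp == "valid", ]   # tuning and model assessment
test_set  <- my_data[grp == "test",  ]   # final hold-out assessment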
Filtering Data
The Filter node tool is located on the Sample tab of the Enterprise Miner tools bar. Use the
Filter node to create and apply filters to your training data set. You can also use the Filter
node to create and apply filters to the validation and test data sets. You can use filters to
exclude certain observations, such as extreme outliers and errant data you do not want to
include in your mining analysis. Filtering extreme values from the training data produces
better models because the parameter estimates are more stable.
Association node enables you to identify association relationships within the data. For
example, if a customer buys a loaf of bread, how likely is the customer to also buy a gallon of
milk? The node also enables you to perform sequence discovery if a sequence variable is
present in the data set. The Cluster node enables you to segment your data by grouping
statistically similar observations. Similar observations tend to be in the same cluster, and
observations that are different tend to be in different clusters. The cluster identifier for each
observation can be passed to other tools for use as an input, ID, or target variable. It can also
be used as a group variable that enables the automatic construction of separate models for
each group.
DMDB node creates a data mining database that provides summary statistics and factor-
level information for class and interval variables in the imported data set. The DMDB is a
metadata catalog that stores valuable counts and statistics for model building.
Graph Explore node is an advanced visualization tool that allows you to graphically explore
large volumes of data to uncover patterns and trends and reveal extreme values in the
database. For example, you can analyze univariate distributions, investigate multivariate
distributions, and create scatter and box plots and constellation and 3-D charts. Graph
Explore plots are fully interactive and are dynamically linked to highlight data selections in
multiple views.
Link Analysis node transforms unstructured transactional or relational data into a model that can be graphed. Such models can be used for fraud detection, uncovering criminal network conspiracies, analysing telephone traffic patterns, understanding website structure and usage, database visualization, and social network analysis. The node can also be used to recommend new products to existing customers.
Market Basket node performs association rule mining of transaction data in conjunction with
item taxonomy. This node is useful in retail marketing scenarios that involve tens of
thousands of distinct items, where the items are grouped into subcategories, categories,
departments, and so on. This is called item taxonomy. The Market Basket node uses the
taxonomy data and generates rules at multiple levels in the taxonomy.
MultiPlot node is a visualization tool that allows you to graphically explore larger volumes of data. The MultiPlot node automatically creates bar charts and scatter plots for the input and target variables without requiring several menu or window selections. The code created by this node can be used to create graphs in a batch environment.
Path Analysis node enables you to analyze Web log data to determine the paths that visitors take as they navigate through a website. You can also use the node to perform sequence analysis.
SOM/Kohonen node enables you to perform unsupervised learning by using Kohonen vector
quantization (VQ), Kohonen self-organizing maps (SOMs), or batch SOMs with Nadaraya-
Watson or local-linear smoothing. Kohonen VQ is a clustering method, whereas SOMs are
primarily dimension-reduction methods.
StatExplore node is a multipurpose node that you use to examine variable distributions and
statistics in your data sets. Use the StatExplore node to compute standard univariate
statistics, standard bivariate statistics by class target and class segment, and correlation
statistics for interval variables by interval input and target. You can also use the StatExplore
node to reject variables based on target correlation.
Variable Clustering node is a useful tool for selecting variables or cluster components for
analysis. Variable clustering removes collinearity, decreases variable redundancy, and helps
reveal the underlying structure of the input variables in a data set. Large numbers of variables
can complicate the task of determining the relationships that might exist between the
independent variables and the target variable in a model. Models that are built with too many
redundant variables can destabilize parameter estimates, confound variable interpretation,
and increase the computing time that is required to run the model. Variable clustering can
reduce the number of variables that are required to build reliable predictive or segmentation
models.
Variable Selection node enables you to evaluate the importance of input variables in predicting or classifying the target variable. The node uses either an R² or a Chi-square selection (tree-based) criterion. The R² criterion removes variables that have large percentages of missing values and removes class variables that are based on the number of unique values. The variables unrelated to the target are set to a status of rejected. Although rejected variables are passed to subsequent tools in the process flow diagram, these variables are not used as model inputs by modeling nodes such as the Neural Network and Decision Tree tools.
AutoNeural node can be used to automatically configure a neural network. The AutoNeural
node implements a search algorithm to incrementally select activation functions for various
multilayer networks.
Decision Tree node enables you to fit decision tree models to your data. The implementation includes features found in various popular decision tree algorithms (for example, CHAID, CART, and C4.5). The node supports both automatic and interactive training. When you run the Decision Tree node in automatic mode, it automatically ranks the input variables based on the strength of their contribution to the tree. This ranking can be used to select variables for subsequent modeling. You can override any automatic step with the option to define a splitting rule and to prune explicit nodes or subtrees. Interactive training lets you explore and evaluate data splits as you develop them.
Dmine regression node enables you to compute a forward stepwise least squares regression
model. In each step, the independent variable that contributes maximally to the model R-
square value is selected. The tool can also automatically bin continuous terms.
DMNeural node is another modeling node that you can use to fit an additive nonlinear model.
The additive nonlinear model uses bucketed principal components as inputs to predict a
binary or an interval target variable with the automatic selection of an activation function.
Ensemble node enables the creation of models by combining the posterior probabilities (for
class targets) or the predicted values (for interval targets) from multiple predecessor models.
Gradient Boosting node uses tree boosting to create a series of decision trees that together form a single predictive model. Each tree in the series is fit to the residual of the prediction from the earlier trees. The residual is defined in terms of the derivative of a loss function. For squared error loss with an interval target, the residual is simply the target value minus the predicted value. Boosting is defined for binary, nominal, and interval targets.
LARS node enables you to use Least Angle Regression algorithms to perform variable selection and model fitting tasks. The LARS node can produce models that range from simple intercept models to complex multivariate models that have many inputs. When using the LARS node to perform model fitting, the node uses criteria from either least angle regression or LASSO regression to choose the optimal model.
MBR (Memory-Based Reasoning) node enables you to identify similar cases and to apply
information that is obtained from these cases to a new record. The MBR node uses k-nearest
neighbor algorithms to categorize or predict observations.
Model Import node enables you to import models into the SAS Enterprise Miner environment
that SAS Enterprise Miner did not create. Models created using SAS PROC LOGISTIC (for
example) can now be run, assessed, and modified in SAS Enterprise Miner.
Neural Network node enables you to construct, train, and validate multilayer feedforward
neural networks. Users can select from several predefined architectures or manually select
input, hidden, and target layer functions and options.
Partial Least Squares node is a tool for modeling continuous and binary targets based on
SAS/STAT PROC PLS. The Partial Least Squares node produces DATA step score code and
standard predictive model assessment results.
Regression node enables you to fit linear and logistic regression models to your data. You can use continuous, ordinal, and binary target variables, and both continuous and discrete variables as inputs. The node supports the stepwise, forward, and backward selection methods. A point-and-click interaction builder enables you to create higher-order modeling terms.
Rule Induction node enables you to improve the classification of rare events in your modeling
data. The Rule Induction node creates a Rule Induction model that uses split techniques to
remove the largest pure split node from the data. Rule Induction also creates binary models
for each level of a target variable and ranks the levels from the rarest event to the most
common. After all levels of the target variable are modeled, the score code is combined into
a SAS DATA step.
Two Stage node enables you to compute a two-stage model to predict a class and an interval
target variable simultaneously. The interval target variable is usually a value that is associated
with a level of the class target.
Survival data mining
Survival data mining is the application of survival analysis to customer data mining problems.
The application to the business problem changes the nature of the statistical techniques. The
issue in survival data mining is not whether an event will occur in a certain time interval but
when the next event will occur. The SAS Enterprise Miner Survival node is located on the
Applications tab of the SAS Enterprise Miner toolbar. The Survival node performs survival
analysis on mining customer databases when there are time-dependent outcomes. The time-
dependent outcomes are modeled using multinomial logistic regression. The discrete event
time and competing risks control the occurrence of the time-dependent outcomes. The
Survival node includes functional modules that prepare data for mining, expand data to one
record per time unit, and perform sampling to reduce the size of the expanded data without
information loss. The Survival node also performs survival model training, validation, scoring,
and reporting.
Chapter 6
Introduction to Indices and Performance
Evaluation
J. Charles Jeeva and R. Narayana Kumar
Madras Regional Station
ICAR-Central Marine Fisheries Research Institute
Chennai-600 020
Indices are a sum of a series of individual yes/no questions that are then combined in a single
numeric score. They are usually a measure of the quantity of some social phenomenon and
are constructed at a ratio level of measurement. The word is derived from Latin, in which
index means "one who points out," an "indication," or a "forefinger." In Latin, the plural form
of the word is indices. In statistics and research design, an index is a composite statistic – a measure of changes in a representative group of individual data points; in other words, it is a compound measure that aggregates multiple indicators.
Features of index numbers are as follows:
• Average: they represent the changes that take place in terms of averages.
• Quantitative: they offer an accurate measurement of quantitative change.
• Measures of relative changes: they measure relative changes over time.
Scales are always used to give scores at the individual level, whereas indices can be used to give scores at both the individual and aggregate levels. They differ in how the items are aggregated.
A scale is an index that, in some sense, only measures one thing. For example, a final exam in
a given course could be thought of as a scale: it measures competence in a single subject. In
contrast, a person's GPA can be considered an index: it is a combination of several separate,
independent competencies.
To summarize, an index is a measure that contains several indicators and is used to
summarize some more general concepts.
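As a tiny, hedged illustration, such an additive index can be computed in R by summing each respondent's yes/no answers (the item names and values are invented):
answers <- data.frame(item1 = c(1, 0, 1),   # 1 = yes, 0 = no
                      item2 = c(1, 1, 0),
                      item3 = c(0, 1, 1))
index_score <- rowSums(answers)             # composite index: number of 'yes' answers per respondent
index_score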
Indicators are quantitative or qualitative variables that can be measured or described and
when observed periodically, demonstrate trends; they help to communicate complex
phenomena. They represent the abstraction of a phenomenon or a variable. In other words,
an indicator is just an indicator. It is not the same as the phenomenon of interest, but only
an indicator of that phenomenon (Patton, 1997).
Classification of Indicators
Scientific indicators tend to be quantitatively measurable; they are global within a given
discipline and are meant to be comparable across space and time.
Grassroots (indigenous/local) indicators are signals used by local people (individuals, groups,
communities) based on their own observations, perceptions, and local knowledge, applied
within specific cultural, ecological, and spiritual contexts; they tend to be more descriptive.
Another classification of indicators says that they can be broadly classified into two
categories, namely, final and intermediate.
Final indicator: When an indicator measures the effect of an intervention on individuals’ well-
being, we call it a "final" indicator.
For example, literacy may be considered one of the dimensions of 'well-being', so an indicator measuring it—say, the proportion of people of a certain age who can read a simple text and write their name—would be a final indicator. Sometimes final indicators are divided into "outcome" and "impact" indicators.
Impact indicators measure key dimensions of 'well-being' such as freedom from hunger, literacy, good health, empowerment, and security.
Outcome indicators capture access to, use of, and satisfaction with public services, such as the use of health clinics and satisfaction with the services received, access to credit, representation in political institutions, and so on. These are not dimensions of well-being in themselves but are closely related, and they may be contextual. Thus, both impact and outcome indicators should constitute the final indicators for impact assessment and monitoring.
Intermediate indicator: when an indicator measures a factor that determines an outcome or
contributes to the process of achieving an outcome, we call it an “input” or “output” indicator,
depending on the stage of the process—in other words, an "intermediate" indicator.
For example, many things may be needed to raise literacy levels: more schools and teachers,
better textbooks, etc. A measure of public expenditures on classrooms and teachers would be
‘input’ indicators, while measures of classrooms built and teachers trained would be ‘output’
indicators. What is important is that inputs and outputs are not goals in themselves; rather,
they help to achieve the chosen goals.
A good indicator:
Once a set of goals/objectives of the project has been agreed upon through a participatory analysis process, the next step is to identify indicators—also in a participatory way—to measure progress toward those goals as a result of an intervention or a development project. Impact monitoring and assessment depend critically on the choice of appropriate indicators. Preferably, they should be derived from the identification and description of the relevant variables given by the clients, with the corresponding indicators being based on discussion among all the stakeholders.
Basis for Indicators of Impact Assessment
• Indicators of the program impact based on the program objectives are needed to guide policies and decisions at all levels of society – village, town, city, district, state, region, nation, continent, and world.
• These indicators must represent all important concerns of all the stakeholders in
the program: An ad-hoc collection of indicators that just seem relevant is not
adequate. A more systematic approach must look at the interaction of the program
components with the environment.
• The number of indicators should be as small as possible but not smaller than
necessary. The indicator set must be comprehensive and compact, covering all
relevant aspects.
• The process of finding an indicator set must be participatory to ensure that the set
encompasses the visions and values of the community or region for which it is
developed.
• From a look at these indicators, it must be possible to deduce the viability and
sustainability of change due to a project program and current developments and to
compare with alternative change/development paths.
Appropriate Tools
Participatory Rural Appraisal (PRA) tools are often only seen as appropriate for gathering
information at the beginning of an intervention, as part of a process of appraisal and planning.
Development workers may talk about having ‘done’ a PRA, sometimes seeing it as a step
towards getting funding. However, PRA tools have a much wider range of potential uses and
can often be readily adapted and used for participatory monitoring and participatory
evaluation.
A few examples described are as follows:
Transect walk is a means of involving the community in monitoring and evaluating changes
that have occurred over the program intervention period. This method entails direct
observation while incorporating the views of community members.
Spider web diagram is used as a means for participants to monitor and evaluate key areas of
a program. The spider web is a simple diagrammatic tool for discussion use; it does not entail
any direct field observations.
Participatory mapping is perhaps the easiest and most popular participatory tool used here
to evaluate project interventions.
Photographic comparisons are another easy visual tool, here used to stimulate community
discussions in evaluating program interventions.
Well-being ranking differentiates the benefits that different community members have
gained from the development interventions.
The H-form is a simple monitoring and evaluation tool. This method is particularly designed
for monitoring and evaluation of programs. It was developed in Somalia to assist local people
in monitoring and evaluating local environmental management. The method can be used for
developing indicators, evaluating activities, and facilitating and recording interviews with
individuals regarding tank silt applications.
However interesting a participatory evaluation at the end of a program might be, without it
having been based on a sound system of participatory monitoring throughout the project
intervention, the evaluation in itself is limited. Thus, the first conclusion to draw is that
monitoring and evaluation should be made a systematic feature of all interventions, seeking
community participation from the outset in defining what should be monitored (indicators),
how often and by whom the monitoring should be conducted, how this information will be
used, etc.
A detailed analysis usually produces many components of plausible impact, long viability
impact chains, and potential indicators. Furthermore, there will generally be several, perhaps
many, appropriate indicators for answering each assessment question or particular aspects
of it. It is, therefore, essential to condense the impact analysis system and the indicator set
as much as permissible without losing essential information. There are several possibilities to
do this. They are:
• Aggregation. Use the highest level of aggregation possible. For example, when
applied in the final impact assessment, they are likely to be disaggregated into
smaller components, according to the requirement of the impact assessment
scheme.
• Weakest-link approach. Identify the weakest links in the program and define
appropriate indicators. Do not bother with other components that may be vital but
not related to direct program effects.
Performance Evaluation
All organizations that have learned the art of “winning from within” by focusing inward on
their employees rely on a systematic performance evaluation process to measure and
evaluate employee performance regularly. Ideally, employees are graded annually on their
work anniversaries, based on which they are either promoted or given a suitable distribution
of salary raises. Performance evaluation also directly provides periodic feedback to employees,
such that they are more self-aware in terms of their employee performance
evaluation metrics.
Performance Evaluation is a formal and productive procedure to measure an employee’s work
and results based on their job responsibilities. It is used to gauge the value an employee adds
in terms of increased business revenue compared to industry standards and overall employee
return on investment (ROI).
➢ It is an integrated platform for the employee and employer to attain common ground
on what both think is befitting a quality performance. This helps improve
communication, which usually leads to better and more accurate team metrics and,
thus, improved performance results.
➢ A manager should evaluate his/her team members regularly and not just once a
year. This way, the team can avert new and unexpected problems with constant work
to improve competence and efficiency.
➢ The management can effectively manage the team and conduct productive resource
allocation after evaluating the goals and preset standards of performance.
Now that we know why the staff performance measurement process is necessary, let us look
at the top 5 key benefits the employee performance evaluation offers.
Improved communication
Managers guide their employees on their assignments and on how to carry them out effectively. A
performance evaluation meeting is a perfect time to examine an employee’s career path. It
lets the employee know what their future goals are and what they need to do to get there. It
helps them create small and achievable goals, assign deadlines, and work toward completion.
It also lets them know where they stand in the hierarchy and where they will be in the future.
Engaged employees perform better than their counterparts. They are better team players,
are more productive, and help their peers out actively. A staff performance evaluation is a
perfect time to check employee engagement. It will help you understand how engaged the
employee is and let you know what steps you would need to take to ensure high engagement.
Resources planning
Staff appraisals help in understanding how an employee is performing and what their future
assignments or goals can be. It not only helps in effective goals management but also in
resource planning. You can effectively reallocate your resources or hire new members to add
to your team.
Performance Evaluation Methods
There are five performance evaluation methods that are most widely used. Relying on only one of these methods gives an organization one-sided information, whereas using multiple methods provides insights from various perspectives, which is instrumental in forming an unbiased and performance-centric decision.
3. Graphic rating scale: This is one of the performance evaluation methods most widely used by supervisors. Numeric or text values corresponding to ratings from poor to excellent are used on this scale, and multiple team members can be evaluated in parallel with it. Employee skills, expertise, conduct, and other qualities can be evaluated in comparison with others in the team. It is important to make each employee understand the value of each point on the scale in terms of success and failure, and the scale should ideally be the same for each employee.
4. Developmental checklists: Every organization has a roadmap for each employee for their
development and exhibited behavior. Maintaining a checklist for development is one of the
most straightforward performance evaluation methods. This checklist has several
dichotomous questions, the answers of which need to be positive. If not, then the employee
requires some developmental training in the areas where they need improvement.
5. Demanding events checklist: There are events in each employee’s career with an
organization where they must exhibit immense skill and expertise. An intelligent manager
always lists demanding events where employees show good or bad qualities.
Conclusion
This article has discussed what indicators are, how they can be classified according to the purpose for which they are used to measure and monitor different impact components, and the basis for their identification and selection as well as their application in impact monitoring and assessment (IMA).
Chapter 7
Fundamentals of Panel Data Analysis
Umanath Malaiarasan
Madras Institute of Development Studies, Chennai-020
Introduction
The dynamics of nature, society, and human activities are continuously evolving, driven by numerous factors ranging from environmental shifts to socio-economic transformations.
Understanding these complex and interrelated phenomena is essential for addressing
contemporary challenges and shaping future trajectories. The world we inhabit is in a state
of continuous change, with natural systems undergoing profound changes in response to
climate variability, habitat destruction, and resource exploitation. The behavior of nature is
undergoing unprecedented shifts that have far-reaching implications for ecosystems,
biodiversity and human well-being. Concurrently, societal structures and norms are in a state
of instability, shaped by demographic shifts, technological advancements, and cultural
dynamics. Globalization has ushered in a new era of interconnectedness, transforming
patterns of trade, migration, and communication on a scale never before witnessed.
Meanwhile, human activities from production to consumption are reshaping the physical
structure and social landscapes of the earth, putting forth pressures on natural resources,
and altering the fabric of communities worldwide.
In this dynamic landscape, traditional research methods, statistical and econometrics analysis
often fall short of capturing the complexity and temporal dynamics of these phenomena.
Cross-sectional studies provide only a static snapshot of reality, overlooking the temporal
dimension crucial for understanding change over time. Similarly, time series analysis may
overlook individual-level variation and the heterogeneity of responses across different
contexts. As we confront the complex challenges of the 21st century—from climate change
and biodiversity loss to inequality and social unrest—alternative approaches and robust analytical tools are needed that can capture the dynamic interactions shaping our world. Researchers and policymakers need an analytical approach that offers a unique vantage point by combining the strengths of the cross-sectional and time series perspectives, examining individuals over time and uncovering patterns of change within and across various levels of analysis.
The importance of panel data analysis in capturing the dynamic nature of nature, society, and
human activities lies in its ability to disentangle the complex inter-relationships among these
phenomena. By longitudinally tracking individuals, households, communities, or regions, panel
data analysis allows researchers to explore how changes in one domain influence and are
influenced by changes in others. For instance, it can shed light on how environmental policies
impact socioeconomic outcomes or how shifts in cultural norms affect individual behaviors
and societal structures. Moreover, panel data analysis facilitates the identification of causal
pathways and feedback loops, providing insights essential for informed decision-making and
policy formulation. In the pages that follow, we will explore the methodological foundations
with suitable empirical applications in panel data analysis.
Panel data techniques have gained popularity in recent years due to their ability to address
the challenges and limitations of conventional Ordinary Least Squares (OLS) estimations.
OLS estimations often yield uncertain outcomes, and the history of regression analysis is
marked by numerous violations of its assumptions (Bickel, 2007; Gil-Garcia, 2008; Gefen,
Straub & Boudreau, 2000; Hair et al., 1998). These violations can lead to biased and
inefficient estimates, compromising the validity and reliability of the results. To overcome
these challenges, researchers have developed a considerable array of tests and procedures to
identify and rectify OLS violations. However, these adjustments can be complex and time-
consuming, requiring researchers to make assumptions about the nature and extent of the
violations. This introduces additional uncertainty into the analysis and may limit the
generalizability of the findings. In contrast, panel data techniques offer a promising
alternative. By utilizing data collected over time from the same individuals, organizations, or
units, panel data methods allow researchers to control for unobserved heterogeneity and
time-invariant factors that may complicate the analysis. This longitudinal approach provides
a more comprehensive understanding of the relationships between variables and allows for
the examination of dynamic processes and causal effects.
Panel data analysis occupies a pivotal position at the intersection of time series and cross-
sectional econometrics. Conventionally, time series parameter identification relied on
concepts such as stationarity, pre-determinedness, and uncorrelated shocks, while cross-
sectional parameter identification leaned on exogenous instrumental variables and random
sampling. Panel datasets, by encompassing both dimensions, have expanded the realm of
possible identification arrangements, prompting economists to reevaluate the nature and
sources of parameter identification. One line of inquiry stemmed from utilizing panel data to
control unobserved time-invariant heterogeneity in cross-sectional models. Another strand
aimed to dissect variance components and estimate transition probabilities among states.
Studies in these domains loosely corresponded to early investigations into fixed and random
effects approaches. The former typically sought to measure regressor effects while holding
unobserved heterogeneity constant, while the latter focused on parameters characterizing
error component distributions. A third vein explored autoregressive models with individual
effects and broader models with lagged dependent variables. A significant portion of research
in the first two traditions concentrated on models with strictly exogenous variables. This
differs from time series econometrics, where distinguishing between predetermined and
strictly exogenous variables is fundamental in model specification. However, there are
instances where theoretical or empirical concerns warrant attention to models exhibiting
genuine lack of strict exogeneity after accounting for individual heterogeneity. Various terms
are employed to denote panel data, encompassing pooled data, pooled time series and cross-
sectional data, micropanel data, longitudinal data, and event history analysis, among others
(Baltagi, 2008; Greene, 2012; Gujarati, 2003; Wooldridge, 2002).
Cross-sectional data
Table 1. Per acre yield, quantity use of seed and fertilizers in paddy production for various
farms in 2020-21
Year Farms Yield in quintal Seed in kg Fertilizers in kg
2020-21 Farm 1 60.32 62.42 267.14
2020-21 Farm 2 35.52 52.15 26.56
2020-21 Farm 3 27.68 37.25 117.19
2020-21 Farm 4 39.24 19.82 223.16
2020-21 Farm 5 54.05 15.1 223.22
2020-21 Farm 6 31.88 15.66 62.42
2020-21 Farm 7 54.88 62.6 320.58
2020-21 Farm 8 39.04 95.47 160.34
2020-21 Farm 9 41.18 50.78 161.05
2020-21 Farm 10 19.66 51.88 286.13
2020-21 Farm 11 44.48 56.52 128.75
2020-21 Farm 12 71.93 0.00 181.68
2020-21 Farm 13 45.91 70.54 213.37
2020-21 Farm 14 39.16 20.93 180.88
2020-21 Farm 15 42.73 57.85 165.87
The corresponding equation for the above cross-sectional data can be expressed as:
(1) Yi = ∝ + β1 Ferti + β2 Seedi + ui
where Yi is the value of the dependent variable, yield of paddy, for the ith farm; Fert and Seed
represent the amount of fertilizers and seed used in the ith farm; and ∝, 𝛽𝑠 and 𝑢 represent
the intercept, slope coefficients, and error terms of the equations, respectively. This equation
represents a linear regression model for cross-sectional data, where the goal is to estimate
the coefficients ∝ and βs that best describe the relationship between the independent
variables and the dependent variable for all individual farms in the sample.
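As a brief illustration, a regression of this cross-sectional form could be run in Stata on the observations for a single year; this is only a sketch, assuming the variable names yield_qtl, fert_kg and seed_kg and a string year variable (with values such as "2020-21"), as used in the Stata illustration later in this chapter.
* Cross-sectional OLS using only the 2020-21 observations
regress yield_qtl fert_kg seed_kg if year == "2020-21"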
Time series data
Time series data refers to observations collected at regular intervals over a continuous period
for a single entity. In other words, it represents a sequence of data points indexed by time.
Time series data are commonly used in various fields, such as economics, finance,
meteorology, and engineering, to study the behavior of a phenomenon or variable over time.
Observations in a time series are arranged in chronological order, with each observation
corresponding to a specific point in time. These data are typically collected at regular intervals
such as hourly, daily, monthly, or yearly. This regularity facilitates the analysis of periodic
fluctuations and trends over different time scales. Time series data can involve a single
variable (univariate time series) or multiple variables (multivariate time series). Univariate
time series focuses on the behavior of a single variable over time, while multivariate time
series considers the interactions between multiple variables. Time series data often exhibit
stochastic or random behavior, meaning that they are subject to inherent variability and
uncertainty. This stochastic component can arise from various sources, including random
fluctuations, external shocks, and measurement errors. Time series data often exhibit
autocorrelation, indicating that observations are correlated with themselves over time.
Stationarity is a fundamental concept in time series analysis, referring to the stability of
statistical properties over time. A stationary time series has a constant mean, variance, and
autocovariance structure over time, making it easier to model and analyze. Time series data
are analyzed using various statistical techniques, including time series models, spectral
analysis, cointegration, and forecasting methods. These methods allow researchers to identify
patterns, estimate parameters, make predictions, and infer causal relationships from time
series data. Table 2 shows the example for time-series data, i.e., it presents the data for per
acre yield, fertilizers, and seed over a period of time for a single farm.
Table 2. Per acre yield, quantity use of seed and fertilizers in paddy production for the single
farm over a period of time
Year Farms Yield in quintal Fertilizers in kg Seed in kg
2004-05 Farm1 22.19 9.32 65.74
2005-06 Farm1 25.17 9.78 65.05
2006-07 Farm1 16.71 10.17 67.16
2007-08 Farm1 25.38 10.75 64.14
2008-09 Farm1 26.75 8.95 63.22
2009-10 Farm1 25.83 12.61 64.62
2010-11 Farm1 29.58 15.67 62.2
2011-12 Farm1 26.51 16.6 60.98
2012-13 Farm1 31.41 16.04 58.43
2013-14 Farm1 31.84 16.66 58.02
2014-15 Farm1 32.45 23.96 61.8
2015-16 Farm1 32.82 21.76 60.62
2016-17 Farm1 32.69 24.7 59.38
2017-18 Farm1 33.44 24.01 56.55
2018-19 Farm1 34.52 24 56.74
2019-20 Farm1 34.79 24.68 56.02
2020-21 Farm1 35.52 26.56 52.15
The general form of the equation for the above time series data can be expressed as follows:
(2) Yt = ∝ + β1 Fertt + β2 Seedt + ut
where Yt is the value of the dependent variable, yield of paddy, at the tth period (year) for a single farm; Fert and Seed represent the amount of fertilizers and seed used in the tth year;
and ∝, 𝛽𝑠 and 𝑢 represent the intercept, slope coefficients, and error term of the equations,
respectively. This equation represents a linear regression model for time series data, where
the goal is to estimate the coefficients ∝ and βs that best describe the relationship between
the independent variables and the dependent variable for all the years in the sample.
Panel data
The organization of data in a panel format involves recording individual observations for each
variable across different time points. The temporal units can vary, spanning years, months,
weeks, days, and even shorter intervals such as hours, minutes, and seconds. The choice of
time units depends on the anticipated behavior of the variable over time. Researchers may
explore various time expressions, including lagged, linear, squared, and quadratic
representations. Each case within the panel signifies an individual observation of a specific
variable from panels such as individuals, groups, firms, organizations, cities, states, countries,
etc., and an identifier for each case is essential. In principle, it is feasible to estimate a time series regression for each case or a cross-sectional regression for each time unit using the corresponding equations (1) and (2). When a single equation of this form is applied to the panel data as a whole (equation (3)), it is referred to as pooled OLS regression.
The pooled panel data approach aggregates observations for each case over time without
distinguishing between cases, thereby neglecting the effects across individuals and time.
Consequently, this estimation may distort the true relationships among variables studied
across cases and over time. Table 3 shows the example for panel data, i.e., it presents the data
for per acre yield, fertilizers, and seed across two different individual farms and over a three-
month period of time.
Table 3. Per acre yield, quantity use of seed and fertilizers in paddy production across
different farms over different time periods
The general form of the equation for the above panel data can be expressed as follows:
(3) Yit = ∝ + β1 Fertit + β2 Seedit + uit
where Yit represents the value of the dependent variable for the ith farm at the tth time period; Fert and Seed represent the amount of fertilizers and seed used in the ith farm at the tth time period; and ∝, βs and u represent the intercept, slope coefficients and error term of the equation, respectively.
Types of Panel Data
In general, the observations within a sample remain consistent across all time periods.
However, there are instances, especially in random surveys, where the observations in one
period's sample differ from those in another. This distinction leads to what is known as a
balanced panel dataset for the former (Table 3) and an unbalanced panel dataset for the
latter (Table 4). Generally, an unbalanced panel dataset arises due to missing observations
for certain variables over specific time periods during the data collection process. Apart from
these, there are other forms of panel data called short, long, and dynamic panel data sets.
Short panels have a limited number of time periods relative to the number of cross-sectional units observed. For example, in Table 5, the number of farms exceeds the number of time periods observed per farm (2 years). In contrast, long panels span a large number of time periods, allowing for extensive longitudinal analysis. For example, in Table 6, the number of time periods (6 years) is greater than the number of panels (2 farms).
variables to capture temporal dependencies and serial correlation, enabling researchers to
analyze dynamic processes over time. For example, in Table 7, one year lagged value of yield
(yit-1) is taken as one of the independent variables in the data set.
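For readers working in Stata, the structure of a panel data set (balanced or unbalanced, short or long) can be inspected after the data are declared as a panel; the sketch below assumes the panel and time identifiers state1 and year1 created in the Stata illustration later in this chapter.
* Declare the panel structure and summarise the participation pattern of each unit
xtset state1 year1
xtdescribe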
Table 5. Short panel data (Micro panel)
Year Farms Yield in quintal Seed in kg Fertilizers in kg
2020-21 Farm 1 60.32 62.42 267.14
2019-20 Farm 1 63.71 59.65 277.24
2019-20 Farm 2 34.79 56.02 24.68
2020-21 Farm 2 35.52 52.15 26.56
2018-19 Farm 3 29.68 43.88 131.71
2020-21 Farm 3 27.68 37.25 117.19
2018-19 Farm 4 42.58 21.18 195.87
2019-20 Farm 4 45.97 21.35 196
Why Panel Data?
Panel data offer several advantages over other types of datasets. Here are some reasons why
panel data are valuable. First, since panel data track entities (individuals, firms, states,
countries, etc.) over time, there is inherent heterogeneity among these units. Each unit may
have unique characteristics, behaviors or responses to changes over time. Panel data allow
researchers to account for the heterogeneity that exists across different individuals or panels.
Second, panel data provide more informative data compared to cross-sectional or time series
data alone. By observing units over time, researchers can capture both within-unit and
between-unit variations. This leads to less collinearity among variables, as the inclusion of
time-series and cross-sectional variations helps to separate the effects of different variables.
Third, panel data are particularly well-suited for studying the dynamics of change because
they capture how individuals, firms etc., evolve over time. For example, panel data can
effectively analyze phenomena such as spells of unemployment, job turnover and labor
mobility, providing insights into how these dynamics unfold over time and how various factors
influence them. Fourth, panel data can better detect and measure effects that cannot be
observed using pure cross-sectional or pure time series data. For instance, the effects of
policies like minimum wage laws on employment and earnings can be accurately studied by
incorporating successive waves of minimum wage increases over time, which is possible with
panel data. Fifth, panel data enable the study of more complex behavioral models that involve
interactions between individual units and changes over time. Compared to simpler cross-
sectional or time series data, phenomena such as economies of scale and technological change
can be better understood and modeled using panel data. Sixth, by providing data for several
thousand units over time, panel data can minimize biases from aggregating individuals or
firms into broad aggregates. This large sample size allows for more robust statistical analyses
and reduces the risk of biased estimates.
In the Pooled OLS model, the relationship between the dependent variable yield (Y) and the independent variables (Fertilizers, Seed) can be represented as follows:
(4) Yit = ∝ + β1 Fertit + β2 Seedit + uit
where 𝑌𝑖𝑡 is the value of the dependent variable yield of paddy for the i th farm at tth time;
Fert and Seed represent the amount of fertilizers and seed used in the ith farm at the tth time;
and ∝, 𝛽𝑠 and 𝑢 represent the intercept, slope coefficients and error term of the equations,
respectively. This equation represents a linear regression model for cross-sectional and time-
series data, where the goal is to estimate the coefficients ∝ and βs that best describe the
relationship between the independent variables and the dependent variable for all individual
farms in the sample. The estimated results of the pooled OLS regression data given in Table
8 are presented in Table 9.
Table 8. Panel data set used for the pooled OLS regression (Year, Farms, Yield in quintal, Fertilizers in kg, Seed in kg)
Table 9. Pooled OLS regression results
Regression Statistics
Multiple R 0.64
R Square 0.40
Adjusted R Square 0.38
Standard Error 5.310
Observations 51.00
ANOVA
Df SS MS F Significance F
Regression 2.000 920.031 460.016 16.314 0.000
Residual 48.000 1353.460 28.197
Total 50.000 2273.492
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 40.550 6.679 6.072 0.000 27.122 53.978
Fertilizers -0.003 0.029 -0.110 0.913 -0.061 0.055
Seed -0.221 0.097 -2.287 0.027 -0.415 -0.027
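As a rough sketch, the pooled OLS model in equation (4) corresponds to an ordinary regression on the stacked panel; in Stata this could be run as follows, using the variable names that appear in the Stata illustration later in this chapter.
* Pooled OLS on the stacked panel data, ignoring the farm and time structure
regress yield_qtl fert_kg seed_kg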
In the above estimated model, it is assumed that the intercept and slope coefficients are
consistently uniform across cases or over time, but the panel dataset may not support this
assumption. Actually the above estimated model does not distinguish between various farms
and does not tell us whether the response of yield of paddy to the input variables over time
is the same for all the farms. Here, we are assuming the regression coefficients are the same
for all the farms – no distinction between farms. By lumping together the effect of different
farms at different times into one coefficient, we camouflage the heterogeneity (individuality
or uniqueness) that may exist among the farms (Gujarati, 2008). The individuality of each
farm (unobserved) is subsumed in the disturbance term (uit). This can cause the error term
to correlate with some of the regressors included in the model. As a result, the outcomes
could yield biased estimates of the variances for each estimated coefficient, rendering
statistical tests and confidence intervals inaccurate (Baltagi, 2008; Gujarati, 2003; Pindyck
and Rubinfeld, 1998; Wooldridge, 2002).
Suppose we consider an unobserved, heterogeneous variable such as the managerial skills of farmers, for which no data are observed, in the panel equation (4). We can write it as follows:
(5) Yit = ∝ + β1 Fertit + β2 Seedit + β3 Mi + uit
where the additional variable M = management skills of farmers. Of the variables included in
the equation, only the variable M is time-invariant (or time constant) because it varies among
farmers but is constant over time for a given farmer. Although it is time-invariant, the variable
M is not directly observable, and therefore, we cannot measure its contribution to the
production function. We can do this indirectly if we write the equation as:
(6) 𝑌𝑖𝑡 =∝ +𝛽1 𝐹𝑒𝑟𝑡𝑖𝑡 + 𝛽2 𝑆𝑒𝑒𝑑𝑖𝑡 + 𝑒𝑖 + 𝑢𝑖𝑡
where 𝑒𝑖 is called the unobserved or heterogeneity effect, reflecting the impact of M on yield.
In reality, there may be more such unobserved effects, such as the location of the farm, nature
of ownership, gender of the farmers, etc. Although such variables may differ among the
farmers, they will probably remain the same for any given farmer over the sample period.
Since 𝑒𝑖 is not directly observable, we can consider it as an unobserved random variable and
include it in the error term 𝑢𝑖𝑡 and thereby consider the composite error term 𝑤𝑖𝑡 = 𝑒𝑖 + 𝑢𝑖𝑡
and we can write the equation as:
(7) 𝑌𝑖𝑡 =∝ +𝛽1 𝐹𝑒𝑟𝑡𝑖𝑡 + 𝛽2 𝑆𝑒𝑒𝑑𝑖𝑡 + 𝑤𝑖𝑡
but if the 𝑒𝑖 term included in the error term 𝑤𝑖𝑡 is correlated with any of the regressors in
the previous equation (i.e., Cov (Xit , wit ) ≠ 0 ), we have a violation of one of the key
assumptions of the OLS regression model that the error term is not correlated with the
regressors (i.e., Cov (Xit , wit ) = 0 ). As we know, in this situation, the OLS estimates are not
only biased, but they are also inconsistent. As there is a real possibility that the unobservable ei is correlated with one or more of the regressors, autocorrelation also arises in the composite error term: Cov(wit, wis) = σe² for t ≠ s, where t and s are different time periods. Since σe² is non-zero whenever individual heterogeneity is present, the (unobserved) heterogeneity induces autocorrelation, and we will have to pay attention to it.
2. Endogeneity: Endogeneity arises when the independent variables are correlated with
the error term. Pooled regression can suffer from endogeneity issues, especially if
there are time-varying factors that are omitted from the model.
3. Serial Correlation: Pooled regression assumes that observations are independent,
but in panel data, observations for the same individual over time may be correlated.
Therefore, ignoring serial correlation can lead to inefficient standard errors and
biased hypothesis testing.
4. Time Trends: Pooled regression does not account for time-specific trends or
changes. If there are time-varying factors that affect the dependent variable,
neglecting them can result in biased parameter estimates.
5. Dynamic Panel Bias: If lagged dependent variables are included as regressors in a
pooled regression with panel data, dynamic panel bias may occur. This bias arises
due to correlation (autocorrelation) between the lagged dependent variable and
unobserved individual-specific effects.
6. Inefficiency: Pooled regression may be less efficient compared to models that
account for individual-specific effects, such as fixed effects or random effects
models. Inefficiency can result in imprecise parameter estimates.
(8) 𝑌𝑖𝑡 =∝ +𝛽1 𝐹𝑒𝑟𝑡𝑖𝑡 + 𝛽2 𝑆𝑒𝑒𝑑𝑖𝑡 + 𝑒i + 𝑢𝑖𝑡
where 𝑒i is an unobserved heterogeneity (farm dependent error-term). It is fixed over time
and varies across farms. The term “fixed effects” is because, although the intercept may differ
across farms, each farm’s intercept does not vary over time (i.e., it is time-invariant). So it can
be expressed as follows:
(9) Yit = ∝1i + β1 Fertit + β2 Seedit + uit
If we were instead to write the intercept as ∝1it, the intercept of each farm would also vary over time. Also, the Fixed
Effect model above assumes that the slope coefficients of the regressors do not vary across
individuals or over time. Now, we can allow for the (fixed effect) intercept to vary among the
farms as:
(10) 𝑌𝑖𝑡 =∝0 +∝1 D1i +∝2 D2i + 𝛽1 𝐹𝑒𝑟𝑡𝑖𝑡 + 𝛽2 𝑆𝑒𝑒𝑑𝑖𝑡 + 𝑢𝑖𝑡
where D1i = 1 for farm 2, 0 otherwise; D2i = 1 for farm 3, 0 otherwise; and so on. Since we
have 3 farms, we have introduced only 2 dummy variables to avoid falling into the dummy-
variable trap (i.e., the situation of perfect collinearity). Here, we treat farm 1 as the base or
reference category, and its effect is captured in the model's intercept. Example data structure
and estimated results are presented in Tables 10 and 11, respectively.
Table 10. Data set for the LSDV fixed effect model
Year Farms Yield in quintal Seed in kg Fertilizers in kg D1 for farm2 D2 for farm3
2020-21 Farm1 35.52 52.15 26.56 0 0
2004-05 Farm2 22.82 60.44 86.3 1 0
2005-06 Farm2 25.78 53.47 84.84 1 0
2006-07 Farm2 25.08 52.68 81.15 1 0
2007-08 Farm2 29 51.95 87.92 1 0
2008-09 Farm2 26.65 52.34 80.59 1 0
2009-10 Farm2 18.97 52.85 65.32 1 0
2010-11 Farm2 19.29 52.14 76.68 1 0
2011-12 Farm2 27.58 47.42 97.39 1 0
2012-13 Farm2 24.26 47.47 97.36 1 0
2013-14 Farm2 25.2 46.31 98.92 1 0
2014-15 Farm2 30.69 43.85 103.99 1 0
2015-16 Farm2 27.49 45.26 99.79 1 0
2016-17 Farm2 30.81 44.56 104.08 1 0
2017-18 Farm2 31.06 46.85 106.61 1 0
2018-19 Farm2 29.68 43.88 131.71 1 0
2019-20 Farm2 30.05 42.1 128.64 1 0
2020-21 Farm2 27.68 37.25 117.19 1 0
2004-05 Farm3 34.78 13.3475 147.71 0 1
2005-06 Farm3 33.2 14.94 143.48 0 1
2006-07 Farm3 32.77 24.06 117.01 0 1
2007-08 Farm3 35 12.81 140.96 0 1
2008-09 Farm3 38.15 1.58 189.39 0 1
2009-10 Farm3 37 10.2 201.84 0 1
2010-11 Farm3 42.69 13 209.06 0 1
2011-12 Farm3 26.98 7.93 207.31 0 1
2012-13 Farm3 27.89 10.3 160.15 0 1
2013-14 Farm3 32.73 10.55 153.69 0 1
2014-15 Farm3 42.33 15 175.61 0 1
2015-16 Farm3 43.17 14 175.46 0 1
2016-17 Farm3 44.02 18 168.82 0 1
2017-18 Farm3 40.17 20 186.25 0 1
2018-19 Farm3 42.58 21.18 195.87 0 1
2019-20 Farm3 45.97 21.35 196 0 1
2020-21 Farm3 39.24 19.82 223.16 0 1
Table 11. Estimated results of LSDV fixed effect model
Regression Statistics
Multiple R 0.78
R Square 0.61
Adjusted R Square 0.57
Standard Error 4.41
Observations 51.00
ANOVA
Df SS MS F Significance F
Regression 4.000 1380.541 345.135 17.780 0.000
Residual 46.000 892.950 19.412
Total 50.000 2273.492
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept
(base farm1) 30.592 8.440 3.625 0.001 13.603 47.580
Seed -0.054 0.134 -0.403 0.689 -0.323 0.215
Fertilizers 0.112 0.034 3.246 0.002 0.043 0.181
D1 for farm2 -12.256 3.010 -4.072 0.000 -18.315 -6.198
D2 for farm 3 -11.945 6.650 -1.796 0.079 -25.329 1.440
As a result, the intercept ∝0 is the intercept value of farm 1, where D1=D2=0 i.e., 𝐸(𝑌1𝑖 ) =
∝0 + 𝛽1 𝐹𝑒𝑟𝑡𝑖𝑡 + 𝛽2 𝑆𝑒𝑒𝑑𝑖𝑡 + 𝑢𝑖𝑡 . The other ∝ coefficients represent how much the intercept
values of the other farms differ from the intercept value of the first farm. For example, ∝1
tells by how much the intercept value of the second farm differs from ∝0. The sum (∝0+∝1)
gives the actual value of the intercept for farm 2. We can write it as 𝐸(𝑌2𝑖 ) =∝0 +
∝1 (1) + 𝛽1 𝐹𝑒𝑟𝑡𝑖𝑡 + 𝛽2 𝑆𝑒𝑒𝑑𝑖𝑡 + 𝑢𝑖𝑡 for farm 2 and 𝐸(𝑌3𝑖 ) =∝0 +∝2 (1) + 𝛽1 𝐹𝑒𝑟𝑡𝑖𝑡 +
𝛽2 𝑆𝑒𝑒𝑑𝑖𝑡 + 𝑢𝑖𝑡 for farm 3.
Thus, given ∝0, the other dummy variable coefficients ∝1 and ∝2 tell us by how much the
intercept values of farm 2 and 3 differ from that of farm 1. The coefficients from the fixed
effect model produce estimators known as fixed effect estimators. This is called a one-way
fixed effect model as intercepts vary only across farms (to account for heterogeneity) but
not across time. We can also allow for time effect if we believe that the yield function changes
over time because of technological changes, changes in government regulation and/or tax
policies, and other such effects. Such a time effect can be easily accounted for if we introduce
time dummies, one for each year from 2004-05 to 2020-21. We can also consider the two-
way fixed effects model if we allow for both time periods and farms.
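A minimal Stata sketch of the LSDV estimation, assuming the encoded farm and year identifiers state1 and year1 and the variable names used in the Stata illustration later in this chapter; the i. prefix creates the dummy variables automatically, with one category omitted as the base.
* One-way LSDV fixed effects: farm dummies only
regress yield_qtl fert_kg seed_kg i.state1
* Two-way fixed effects: add year dummies to allow for time effects
regress yield_qtl fert_kg seed_kg i.state1 i.year1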
instance, suppose we want to estimate a wage function for a group of workers using
panel data. Besides wage, a wage function may include age, experience, and
education as explanatory variables. We can also add gender category, color, and
ethnicity as additional variables in the model, and these variables will not change
over time for an individual subject; the LSDV approach may not be able to identify
the impact of such time-invariant variables on wages.
4) Omitted variable bias: If time-varying omitted variables correlate with both the
independent and dependent variables, the LSDV model may suffer from omitted
variable bias. While including individual-specific fixed effects helps control for time-
invariant omitted variables, it does not address the bias introduced by time-varying
omitted variables.
5) Heterogeneity in slopes ignored: The LSDV model assumes that the coefficients of
the independent variables are constant across individuals. However, in many cases,
there may be heterogeneity in the slopes of the relationships between the
independent and dependent variables across individuals. The LSDV model does not
allow for such heterogeneity in slopes.
6) Biased estimates for Time-Invariant variables with perfect collinearity: In the
presence of perfect collinearity between time-invariant independent variables and
individual fixed effects, the LSDV model produces biased estimates for the
coefficients of those variables. This issue arises because the individual fixed effects
absorb all the variation in the time-invariant variables, making it impossible to
identify their effects separately.
When deciding between a Pooled OLS regression and a Fixed Effects model for panel data
analysis, the Restricted or Partial F-test and the Wald test of differential intercept can be
useful tools to assess which model is better suited for the data.
a) Restricted (Partial) F-test of Differential Intercepts
Null hypothesis (H0): all the differential intercepts = 0
or
H0: ∝𝟏=∝𝟐= 0
As in the context of choosing between a Pooled OLS and Fixed Effects model for panel data
analysis, the F-test formula for comparing the fit of two different regression models can be
expressed as follows:
Suppose we have two models:
Restricted Model (Pooled OLS): Yit = ∝0 + β1 Fertit + β2 Seedit + uit
Full Model (Fixed Effects): Yit = ∝0 + ∝1 D1i + ∝2 D2i + β1 Fertit + β2 Seedit + uit
To conduct the F-test, we estimate both the restricted (Pooled OLS) and full (Fixed Effects)
regression models using Ordinary Least Squares (OLS) regression. The F-statistic can be
calculated as:
(11) F = [(SSER − SSEC) / k*C] / [SSEC / (n − kC)]
Where,
𝑆𝑆𝐸𝑅 = Error Sum of Square of the restricted model (pooled OLS)
𝑆𝑆𝐸𝐶 = Error Sum of Square of Complete model (FE-LSDV)
𝑘𝐶∗ = number of additional coefficients in the complete model
𝑘𝐶 = number of coefficients in the complete model
𝑛 = sample size
𝑅𝐶2 = R2 from the complete model
𝑅𝑅2 = R2 from the restricted model
If the estimated F-statistic exceeds the F-table value at a chosen significance level (1%, 5%, or 10%), we reject the null hypothesis that all the differential intercepts are equal to zero (i.e., that there is no individual heterogeneity) and accept that an individual effect is present. In that case, the inclusion of the differential intercepts significantly improves the model, and the FEM is preferred. This means that accounting for heterogeneity in the model is important.
From the estimated regression results presented in Tables 9 and 11 for the pooled OLS and fixed effect models, respectively, we can calculate the F-value and compare it with the table value for the decision.
F = [(1353.46 − 892.9504) / 2] / [892.9504 / (51 − 5)] = 11.861
Since the estimated F-statistic (11.861) is greater than the F-table value (3.18) at the 5% level of significance, we reject the null hypothesis that all the differential intercepts are equal to zero (i.e., that there is no individual heterogeneity) and accept that an individual effect is present. The inclusion of the differential intercepts significantly improved the model, and therefore the FEM model is accepted. This means that accounting for heterogeneity in the model is important in this case.
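In Stata, the same restricted F-test can be obtained after the LSDV regression by jointly testing the farm dummies; this is a sketch using the variable names from the Stata illustration later in this chapter (testparm reports a joint F-test of the listed coefficients).
* Joint test that all farm differential intercepts are zero
regress yield_qtl fert_kg seed_kg i.state1
testparm i.state1
* After xtreg ..., fe Stata reports an equivalent "F test that all u_i=0" at the foot of the output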
b) Wald Test of Differential Intercept
The Wald test is used to test whether certain variables' coefficients significantly differ across
different groups or categories. In the context of panel data analysis, it can be used to test
whether the intercepts (or individual-specific effects) are significantly different across
individuals. Specifically, the test assesses whether the individual-specific intercepts in the
Fixed Effects model are jointly equal to zero. If the null hypothesis is rejected, it indicates that
there are significant differences in intercepts across individuals, supporting the use of the
Fixed Effects model. In the context of model selection, if the p-value associated with the Wald
test is below the chosen significance level, it suggests that the Fixed Effects model is
preferred over the Pooled OLS model, as it captures individual-specific effects that are not
accounted for in the Pooled OLS model.
The Fixed-Effect Within-Group (WG) Estimator is also a Fixed Effect method used to control
for unobserved individual-specific heterogeneity. This approach removes individual-specific
effects by demeaning the data within each group (farm). This process needs to calculate
the mean values of the dependent and explanatory variables for each farm and then subtract
these means from each individual value of all the variables. These adjusted values are
commonly referred to as "de-meaned" or mean-corrected values. This procedure is repeated
for each farm, resulting in a set of de-meaned values for each variable. Subsequently, all the
de-meaned values across all farms are pooled together and an OLS regression is performed
on the combined dataset, consisting of the pooled mean-corrected values from farms.
We express each variable as a deviation from its time-mean to remove this heterogeneity i.e.,
by differencing values of the variables around their sample mean, we effectively eliminate the
heterogeneity in the data set. Let us take the time average of equation (6),
(12) Ȳi = ∝ + β1 F̄i + β2 S̄i + ei + ūi
where Ȳi, F̄i, S̄i and ūi denote the over-time means of yield, fertilizer use, seed use and the error term for farm i, and subtract it from equation (6) itself. Since ei is constant over time, it cancels out, leaving the de-meaned (within) equation:
(Yit − Ȳi) = β1(Fertit − F̄i) + β2(Seedit − S̄i) + (uit − ūi)
Once the data are demeaned, the model is estimated using OLS regression. The key assumption needed here is that the idiosyncratic error uit is uncorrelated with the regressors, i.e., Cov(Xit, uit) = 0; the farm-specific effect ei may be correlated with the regressors, because it has already been removed by the demeaning. The model typically includes the demeaned independent variables and a constant term. Only this assumption is necessary for consistent
estimators. In the Fixed-Effect Within-Group (WG) Estimator, the coefficient estimates
represent the within-group effects of the independent variables on the dependent variable.
These coefficients capture the relationship between the variables after controlling for time-
invariant individual-specific effects. Statistical inference, such as hypothesis testing and
confidence interval estimation, can be performed based on standard OLS procedures applied
to the demeaned data.
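A minimal sketch of the within (de-meaning) transformation in Stata, assuming the variable names and identifiers used in the Stata illustration later in this chapter; in practice, xtreg with the fe option performs this transformation automatically.
* Compute farm-level means and de-mean each variable
bysort state1: egen ybar = mean(yield_qtl)
bysort state1: egen fbar = mean(fert_kg)
bysort state1: egen sbar = mean(seed_kg)
gen d_y = yield_qtl - ybar
gen d_fert = fert_kg - fbar
gen d_seed = seed_kg - sbar
* OLS on the pooled de-meaned (mean-corrected) values
regress d_y d_fert d_seed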
The Fixed-Effect Within-Group (WG) Estimator has several advantages. It effectively
controls for unobserved individual-specific heterogeneity. It allows for estimating the effects
of time-varying independent variables on the dependent variable while controlling for
individual-specific effects. It is computationally efficient and relatively straightforward to
implement. However, it is important to note that the Fixed-Effect Within-Group (WG)
Estimator also has limitations. It assumes that individual-specific effects are time-invariant.
It does not allow for the estimation of individual-specific effects, which may be of interest in
some cases. It may suffer from bias if the time-varying independent variables are correlated
with the individual-specific effects. The calculation of de-meaned values for the variables and
data set for the Fixed Effect Within Group model and its estimated OLS regression results
are presented in Tables 12 and 13, respectively.
Table 12. Data set for the Fixed-Effect Within-Group (WG) model
Year Farms Yield in quintal Fertilizers in kg Seed in kg D_Y = Yit − Ȳi D_Fert = Fertit − F̄i D_Seed = Seedit − S̄i
Table 13. Results of estimated Fixed-Effect Within-Group (WG) model
Regression Statistics
Multiple R 0.957
R Square 0.916
Adjusted R Square 0.913
Standard Error 18.388
Observations 51.000
ANOVA
Df SS MS F Significance F
Regression 2.000 177422.950 88711.475 262.372 0.000
Residual 48.000 16229.425 338.113
Total 50.000 193652.375
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept -41.100 21.396 -1.921 0.061 -84.119 1.919
D_Seed -0.630 0.093 -6.797 0.000 -0.816 -0.444
D_Fertilizers 0.254 0.396 0.642 0.524 -0.542 1.049
The first difference method in panel data analysis involves taking the first difference of each
variable within the panel model. The first difference method subtracts the value of each
variable in the current period from its value in the previous period for each individual in the
panel. This helps in eliminating individual-specific effects because they are differenced out. It
is particularly useful when dealing with unobserved individual-specific heterogeneity that is
constant over time. Mathematically, applying the first-difference transformation to equation (6) gives:
ΔYit = β1 ΔFertit + β2 ΔSeedit + Δuit
where Δ denotes the change from period t−1 to period t (for example, ΔYit = Yit − Yi,t−1); the time-invariant effect ei drops out of the differenced equation.
Table 14. Data set for first difference panel data model
Year Farms Yield (Y) in quintal Fertilizers in kg Seed in kg ∆Y ∆Fertilizers ∆Seed
Table 15. Results of the estimated first difference panel model
Regression Statistics
Multiple R 0.176
R Square 0.031
Adjusted R Square -0.012
Standard Error 4.452
Observations 48.000
ANOVA
Df SS MS F Significance F
Regression 2.000 28.580 14.290 0.721 0.492
Residual 45.000 891.935 19.821
Total 47.000 920.515
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 0.330 0.657 0.502 0.618 -0.993 1.653
∆ Fertilizers 0.056 0.050 1.102 0.276 -0.046 0.157
∆ Seed 0.001 0.187 0.006 0.995 -0.376 0.379
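A first-difference regression of this kind could be run in Stata along the following lines; this is a sketch, assuming the identifiers and variable names used in the Stata illustration later in this chapter. Once the data are declared as a panel with xtset, the D. operator takes within-farm first differences.
* First-difference regression; the first observation of each farm is dropped automatically
xtset state1 year1
regress D.yield_qtl D.fert_kg D.seed_kg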
The Random Effects Model is a statistical technique used in panel data analysis to account
for both within-group and between-group variations. The REM extends the basic pooled OLS
model by allowing for entity-specific effects that are not directly observed but are assumed
to follow a specific distribution. Consider the fixed effect model:
(10) 𝑌𝑖𝑡 =∝0 +∝1 D1i +∝2 D2i + 𝛽1 𝐹𝑒𝑟𝑡𝑖𝑡 + 𝛽2 𝑆𝑒𝑒𝑑𝑖𝑡 + 𝑢𝑖𝑡
If farm heterogeneity (ei) is incorporated within the error term (uit) rather than specified through dummy variables, and a common intercept is allowed, the model can be considered a REM.
Instead of treating ∝1i as fixed, we assume it to be a random variable with mean ∝1 and a random farm-specific error term (ei) with a mean value of zero and variance of σe², expressed as follows:
∝1i = ∝1 + ei
By replacing ∝1i with ∝1 + ei in the above equation (10), we have the error components or random effects model:
Yit = ∝1 + β1 Fertit + β2 Seedit + ei + uit
Unlike the fixed effect model, where each farm has its (fixed effect) intercept value, the ∝1
in the random effect model is a common intercept, meaning that it is the average of all
intercepts of all farms. The farm-specific error component ei measures the random deviation
of each farm’s intercept from the common intercept ∝1.
2. The independent variables are assumed to be exogenous, meaning they are not correlated
with the error term. However, it is expected that 𝑤it and 𝑤is (t≠s) are correlated; that is, the
error terms of a given cross-sectional unit at two different points in time are correlated.
(21) corr(wit, wis) = σe² / (σe² + σu²),  t ≠ s
If we do not take this correlation structure into account and use OLS, the resulting estimators
will be inefficient. The appropriate method is the Generalized Least Squares (GLS) method.
REM determines the degree to which serial correlation is a problem and then uses some
weighted estimation approach (e.g., GLS) to fix it.
Assumptions on the composite error term in the error components model:
ei ~ N(0, σe²)
uit ~ N(0, σu²)
E(ei uit) = 0; E(ei ej) = 0 (i ≠ j)
E(uit uis) = E(uit ujt) = E(uit ujs) = 0 (i ≠ j; t ≠ s)
The individual error components are not correlated with each other and are not auto-correlated across the cross-section and time-series units. Under these assumptions, Var(wit) = σe² + σu². If σe² = 0, there is no difference between pooled regression and the error components model, and we can go for pooled regression.
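In Stata, whether σe² = 0 (and hence whether pooled OLS is adequate) is commonly examined with the Breusch and Pagan Lagrange multiplier test, available through xttest0 after a random effects estimation; the sketch below assumes the identifiers and variable names used in the Stata illustration later in this chapter.
* Random effects estimation followed by the Breusch-Pagan LM test that the variance of the farm-specific effect is zero
xtset state1 year1
xtreg yield_qtl fert_kg seed_kg, re
xttest0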
In the error components (or RE) model Yit = ∝1 + β1 Fertit + β2 Seedit + ei + uit, the farm effect and the idiosyncratic error are combined into the composite error term wit = ei + uit. The corresponding mean-corrected WG-FE estimator is based on Yit − Ȳi = β1(Fertit − F̄i) + β2(Seedit − S̄i) + (wit − w̄i), with Ȳi, F̄i and S̄i denoting the farm-level means as before. The RE GLS transformation is constructed by pre-multiplying each of these means by the GLS parameter λ, so that the estimating equation becomes
Yit − λȲi = ∝1(1 − λ) + β1(Fertit − λF̄i) + β2(Seedit − λS̄i) + (wit − λw̄i)
REM is therefore a quasi-demeaned model, because the means (Ȳi, F̄i and S̄i) are weighted by the GLS parameter λ, with 0 ≤ λ ≤ 1. If λ = 0, the REM estimator reduces to pooled OLS, Yit = ∝1 + β1 Fertit + β2 Seedit + wit; if λ = 1, it reduces to the fixed effects (within) model, Yit − Ȳi = β1(Fertit − F̄i) + β2(Seedit − S̄i) + (wit − w̄i). Thus, REM equals FEM only if the model is fully demeaned (λ = 1), and for 0 < λ < 1 the REM estimator is equal to neither pooled OLS nor FEM.
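A standard textbook expression for the GLS parameter (see, e.g., Wooldridge, 2010) is λ = 1 − [σu² / (σu² + T σe²)]^(1/2), with T denoting the number of time periods observed for each farm,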
where σe² is the variance of the farm-specific error term, ei, and σu² is the variance of the idiosyncratic error term, uit.
Fixed Effect model vs Random Effect model
The Hausman Test is a statistical test used to determine whether the random effects (RE)
assumptions are valid and whether the random effects model is preferable to the fixed effects
(FE) model in panel data analysis. It tests the consistency of the estimators under the null
hypothesis that both the FE and RE estimators are consistent, but the random effects model
is more efficient. If the null hypothesis is rejected, it suggests that the random effects
assumptions may be violated, and the fixed effects model is preferred.
Statement of hypothesis
Null hypothesis H0 : REM is the appropriate estimator
or
H0 : Cov(ei, Xit) = 0
or
H0 : FEM and REM estimators do not differ substantially
Alternate hypothesis Ha: FEM is the appropriate estimator
or
Ha : FEM and REM estimators differ substantially
If H0 is rejected, we conclude that the REM is inappropriate because the random effect is
probably correlated with the Xit, i.e., Cov(ei, Xit) ≠ 0. In other words, if the calculated test
statistic is greater than the critical value, reject the null hypothesis, indicating that the
random effects model is inconsistent and the fixed effects model is preferred. If the calculated
test statistic is less than the critical value, it fails to reject the null hypothesis, suggesting
that the random effects model is consistent and more efficient than the fixed effects model.
In the data set, the time and panel variables are in string format, so we have to convert these variables into non-string (numeric) format. For this, use the following commands:
encode year, gene(year1)
encode state, gene(state1)
Before estimating the fixed and random effects models, we have to let Stata know that our data set is panel data. For that, the following command is used:
xtset state1 year1
where xtset is the command for declaring the data set as a panel; state1 and year1 are the cross-sectional (panel) and time variables created above.
Estimate Fixed Effects Model: Use the xtreg command with the fe option to estimate the fixed effects model.
xtreg yield_qtl fert_kg seed_kg, fe
estimates store fixed
Estimate Random Effects Model: Use the same xtreg command with the re option to estimate
the random effects model.
xtreg yield_qtl fert_kg seed_kg , re
estimates store random
Perform Hausman Test: After estimating both models, use the hausman command to
perform the Hausman test.
hausman fixed
Interpret Results
Stata will give the output of the Hausman test statistic and its associated p-value. If the p-
value is less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis
of no difference between the fixed and random effects estimators. In this case, the fixed
effects model may be preferred. If the p-value exceeds your chosen significance level, you fail
to reject the null hypothesis, indicating that the random effects model may be more
appropriate.
Stata Results
Fixed effect model
. xtreg yield_qtl fert_kg seed_kg , fe
F(2,45) = 6.60
corr(u_i, Xb) = -0.7911 Prob > F = 0.0031
sigma_u 6.1146911
sigma_e 4.4300808
rho .65578192 (fraction of variance due to u_i)
F test that all u_i=0: F(3, 45) = 7.99 Prob > F = 0.0002
.
. estimates store fixed
Random effect model
sigma_u 0
sigma_e 4.4300808
rho 0 (fraction of variance due to u_i)
.
. estimates store random
Hausman test
. hausman fixed
Coefficients
(b) (B) (b-B) sqrt(diag(V_b-V_B))
fixed random Difference S.E.
chi2(2) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 58.19
Prob>chi2 = 0.0000
Since the p-value (0.0000) is below the conventional 5% significance level, the null hypothesis is rejected: the random effects estimator is not consistent for these data, and the fixed effects model is preferred.
Bibliography
Baltagi, B.H., 2008. Econometric Analysis of Panel Data (Vol. 4). Chichester: Wiley.
Bickel, R., 2007. Multilevel Analysis for Applied Research: It's Just Regression!. Guilford Press.
Gefen, D., Straub, D. and Boudreau, M.C., 2000. Structural equation modeling and regression: Guidelines for research practice. Communications of the Association for Information Systems, 4(1), p.7.
Gil-Garcia, J.R., 2008. Using partial least squares in digital government research. In Handbook of Research on Public Information Technology (pp. 239-253). IGI Global.
Greene, W.H., 2012. Econometric Analysis (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Gujarati, D.N. and Porter, D.C., 2009. Basic Econometrics. McGraw-Hill.
Hair, J.F., Black, W.C., Babin, B.J., Anderson, R.E. and Tatham, R.L., 1998. Multivariate Data Analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Wooldridge, J.M., 2010. Econometric Analysis of Cross Section and Panel Data. MIT Press.
Chapter 8
Estimation of Total Factor Productivity by using Malmquist Total Factor Productivity Approach: Case of Rice in India
A. Suresh
ICAR-Central Institute of Fisheries Technology, Cochin, India.
Introduction
The Green Revolution (GR) has significantly contributed to achieving the self-sufficiency of
foodgrain production in India, primarily through increased production of rice and wheat. This
remarkable achievement was realised largely through the faster spread of modern varieties
(MVs) and input intensification. The yield increase in the case of rice during the initial phase
of MV introduction was not as miraculous as has happened in the case of wheat. This was
because the diffusion of MVs in the case of rice was not as fast as it was elsewhere. This can
be better gauged by the fact that by around the mid-1980s, some Asian countries like
Indonesia and The Philippines had reached the ceiling for MV adoption of 70-90 percent,
while in the case of India, it was around 30 percent during the same time (Otsuka, 2000).
However, the diffusion of MVs has continuously improved over the years.
The MVs introduced during the Green Revolution period have quickly exhausted the yield
potential, not only in India but across the globe (Hayami and Kikuchi, 1999). Also, some
symptoms of the unsustainability of modern cultivation practices emerged over the course
of time. Some visible symptoms of this unsustainability were nutrient imbalances, depletion
of soil micro-nutrients, over-exploitation of the groundwater, land degradation, more
frequent emergence of pests and diseases, and diminishing returns to inputs (Chand et al.,
2011). This has created apprehension about the ability of the approach to ensure future
food security. In this context an important debate emerged in policy circles- whether the
slowdown of agricultural performance is due to technology fatigue or policy fatigue (Planning
Commission, 2007; Narayanamoorthy, 2007). One major bottom line of the debate was that
given the high impacts of agricultural income in eliminating rural poverty, ensuring TFP
growth is critical to reducing rural poverty. In this context, the present chapter examines the
TFP growth in rice cultivation in India, taking into consideration the change in technical
change and efficiency. In light of the results, the study also discusses whether the slowdown
in yield growth is due to technology fatigue or sluggishness in input intensification.
TFP Studies in India and in Other Developing Countries
The TFP has attracted the attention of many scholars in India and other developing countries.
One common generalization that can be gauged from these studies is that the TFP has been
deteriorating even during the heyday of the green revolution in developing countries. For
example, Kawagoe et al. (1985) estimated the cross-country production functions for 22 less
developed countries and 21 developed countries using data for two decades between 1960
and 1980. They reported technological deterioration in developing countries and progress in
developed countries. Using cross-country analysis, some other studies also reported negative
productivity growth for developing country agriculture since the 1960s and 1970s
(Chaudhary, 2012). Nkamleu et al. (2003), analyzing data set for 10 Sub-Saharan African
countries for the period of 1972-1999, reported a deterioration of TFP growth. This
deterioration was identified to be more on account of regress in technical change. As far as
Chinese Agriculture is concerned, Li et al. (2011) noted significant productivity growth since
1980s, although the growth rates varied considerably among the subsectors. The productivity
growth emanated from either technological progress or efficiency gains, not from both
of them simultaneously. In an early study on the TFP in India, Kumar and Mruthyunjaya
(1992) reported growth in TFP of wheat in India during 1970-89 to be to the tune of 1.9
per cent in Punjab, 2.7 per cent in Haryana and Rajasthan, 2.6 per cent in Uttar Pradesh and
0.4 per cent in Madhya Pradesh. Kalirajan and Shand (1997) noticed a declining trend of TFP
growth in agriculture by the end of the 1980s. Joshi et al. (2003) and Kumar and Mittal
(2006) reported positive TFP growth for both rice and wheat during the period of 1980-
2000, but the TFP growth posted a reduction during the second decade compared to the
first decade. In a study of various crops and states for the period of 1975-2005, Chand et al
(2011) have observed that the TFP growth has shown considerable variation across crops
and regions. During the entire period under analysis, rice has posted a TFP growth of 0.67
per cent, while that of wheat has been at the rate of 1.92 per cent.
‡
Some part of this chapter is published by the author in the research paper cited as follows:
Suresh A (2013). Technical change and efficiency in rice production in India: A Malmquist Total Factor Productivity
approach, Agricultural Economics Research Review, 26 (Conference Issue): 109-18.
The TFP index can be constructed by dividing the index of total output by an index of total
inputs. In that sense, a growth in the TFP can be attributed to that part of the growth that
is not accounted for by the growth in input use. The most popular form of estimating TFP in
the past is the Tornquist- Theil Index method. This index estimates the TFP growth based
on information concerning price and uses cost/ revenue shares as weights to aggregate
inputs/ outputs (Bhushan, 2005). However, this method has one inherent weakness: it
assumes the observed outputs as frontier outputs. One important consequence of this
assumption is that the decomposition of the TFP growth into its constituent components,
viz., movement towards a production frontier and shift in the production frontier, cannot be
carried out. The Tornquist- Theil Index attributes the TFP growth entirely to the technical
change. The Malmquist productivity index (MPI) overcomes some of these problems.
The MPI was introduced by Caves et al (1982) based on distance functions. The output
oriented Malmquist TFP index measures the maximum level of outputs that can be produced
using a given level of input vector and a given production technology relative to the observed
level of outputs (Coelli et al, 2005). It measures the radial distance of the observed output
vectors in period t and t+1 relative to a reference technology. The Malmquist productivity
index for the period t is represented by,
M^t = D_0^t(x^{t+1}, y^{t+1}) / D_0^t(x^t, y^t) … (1)
which is defined as the ratio of two output distance functions concerning reference
technology at the period t. It is also possible to construct another productivity index by using
period t+1’s technology as the reference technology, which can be depicted as,
M^{t+1} = D_0^{t+1}(x^{t+1}, y^{t+1}) / D_0^{t+1}(x^t, y^t) … (2)
Thus, there exists an arbitrariness in the choice of the benchmark technology depending
on the time period t or t+1. Fare et al. (1994) have attempted to remove this arbitrariness by specifying the MPI as the geometric mean of the two-period indices, defined as:

$M_0(x^{t+1}, y^{t+1}, x^{t}, y^{t}) = \left[\dfrac{D_0^{t}(x^{t+1}, y^{t+1})}{D_0^{t}(x^{t}, y^{t})} \cdot \dfrac{D_0^{t+1}(x^{t+1}, y^{t+1})}{D_0^{t+1}(x^{t}, y^{t})}\right]^{1/2}$  … (3)
where the notations x and y represent the vectors of inputs and outputs, $D_0$ represents the output distance functions, and $M_0$ represents the Malmquist index. Fare et al. (1994), using simple arithmetic manipulations, have shown the MPI to be the product of two distinct components,
viz. technical change and efficiency change, as indicated below:

$M_0(x^{t+1}, y^{t+1}, x^{t}, y^{t}) = \dfrac{D_0^{t+1}(x^{t+1}, y^{t+1})}{D_0^{t}(x^{t}, y^{t})}\left[\dfrac{D_0^{t}(x^{t+1}, y^{t+1})}{D_0^{t+1}(x^{t+1}, y^{t+1})} \cdot \dfrac{D_0^{t}(x^{t}, y^{t})}{D_0^{t+1}(x^{t}, y^{t})}\right]^{1/2}$  … (4)

where,

Efficiency change $= \dfrac{D_0^{t+1}(x^{t+1}, y^{t+1})}{D_0^{t}(x^{t}, y^{t})}$  … (5)

Technical change $= \left[\dfrac{D_0^{t}(x^{t+1}, y^{t+1})}{D_0^{t+1}(x^{t+1}, y^{t+1})} \cdot \dfrac{D_0^{t}(x^{t}, y^{t})}{D_0^{t+1}(x^{t}, y^{t})}\right]^{1/2}$  … (6)
The efficiency change can be further decomposed into pure efficiency change and scale
efficiency change. A detailed account of the MPI can be found in Fare et al. (1994), Coelli et al. (2005), Bhushan (2005), and Chaudhary (2012). The introduction of linear programming-based Data Envelopment Analysis (DEA) popularised the Malmquist index of productivity measurement. DEA involves the construction of a piece-wise linear frontier from the input and output data of various entities/decision-making units (DMUs) using a linear programming framework. This frontier constructs a piecewise surface
over the data such that the observed data lies on or below the constructed production
frontier (Coelli et al., 2005). The efficiency measure for each DMU is calculated relative to
this production frontier. Fare et al. (1994) identify four important advantages of using the Malmquist Productivity Index compared to other approaches: (1) the approach requires data only on quantities and not on prices, and price information is generally not available for every input and output in many countries; (2) the linear programming-based approach does not assume an underlying production function, and therefore makes no assumptions about the stochastic properties of an error term; (3) it requires no prior assumption regarding the optimising behaviour of the DMUs; and (4) since the approach allows for both movement towards the frontier and shifts in the frontier, it is possible to decompose the TFP into its components, viz. technical change and efficiency change.
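The distance functions in equations (1)–(6) can be computed with any linear-programming routine, not only DEAP 2.1. As a hedged illustration, the Python sketch below sets up the output-oriented, constant-returns-to-scale DEA problem with SciPy and combines the four distance functions into the Fare et al. (1994) index and its two components; the array names (X_t, Y_t, and so on) and the CRS specification are assumptions for illustration, not the exact routine used in this study.

```python
# A minimal sketch (not the DEAP 2.1 implementation) of the output-oriented,
# constant-returns-to-scale Malmquist TFP index; data and names are illustrative.
import numpy as np
from scipy.optimize import linprog

def output_distance(x, y, X_ref, Y_ref):
    """D_0(x, y) relative to the CRS frontier built from (X_ref, Y_ref).

    x : (K,) input vector of the evaluated unit
    y : (M,) output vector of the evaluated unit
    X_ref : (J, K) inputs of the reference-period units
    Y_ref : (J, M) outputs of the reference-period units
    Returns 1/phi*, where phi* is the maximal radial output expansion.
    """
    J = X_ref.shape[0]
    # Decision variables: [phi, lambda_1, ..., lambda_J]; maximise phi.
    c = np.r_[-1.0, np.zeros(J)]
    # Output constraints: phi*y_m - sum_j lambda_j * Y_ref[j, m] <= 0
    A_out = np.column_stack([y, -Y_ref.T])
    b_out = np.zeros(len(y))
    # Input constraints: sum_j lambda_j * X_ref[j, k] <= x_k
    A_in = np.column_stack([np.zeros(len(x)), X_ref.T])
    b_in = np.asarray(x, dtype=float)
    res = linprog(c, A_ub=np.vstack([A_out, A_in]), b_ub=np.r_[b_out, b_in],
                  bounds=[(0, None)] * (J + 1), method="highs")
    if res.status != 0:          # flag infeasible/unbounded cases as missing
        return np.nan
    phi = -res.fun
    return 1.0 / phi

def malmquist(x_t, y_t, x_t1, y_t1, X_t, Y_t, X_t1, Y_t1):
    """Fare et al. (1994) geometric-mean Malmquist index for one unit."""
    d_t_t   = output_distance(x_t,  y_t,  X_t,  Y_t)    # D_0^t(x^t, y^t)
    d_t_t1  = output_distance(x_t1, y_t1, X_t,  Y_t)    # D_0^t(x^{t+1}, y^{t+1})
    d_t1_t  = output_distance(x_t,  y_t,  X_t1, Y_t1)   # D_0^{t+1}(x^t, y^t)
    d_t1_t1 = output_distance(x_t1, y_t1, X_t1, Y_t1)   # D_0^{t+1}(x^{t+1}, y^{t+1})
    eff_change  = d_t1_t1 / d_t_t                                  # equation (5)
    tech_change = np.sqrt((d_t_t1 / d_t1_t1) * (d_t_t / d_t1_t))   # equation (6)
    return eff_change * tech_change, eff_change, tech_change       # equation (4)
```

For a state-level panel such as the one analysed here, each state would be treated as a DMU and the function called for every pair of adjacent years.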
Data
The basic input data for the estimation was collected from the reports of “Comprehensive
Scheme for Cost of Cultivation of Principal Crops” carried out by the Directorate of Economics
and Statistics, Ministry of Agriculture, New Delhi. The data for the missing years were
approximated by interpolations based on the trend growth. The output variable was yield per
hectare (kg/ha) reported by the Ministry of Agriculture. Six input variables were used in the
analysis. They included usage of chemical nutrients (NPK), manure (q/ha), animal labour (pair hours/ha), human labour (man-hours/ha), and the real costs of machine labour and irrigation§.
§
The real cost was derived by deflating with price index for diesel and respectively.
The analysis was carried out for the overall period of 1980-81 to 2009-10. The overall period
under analysis has been divided into two sub-periods of equal length of 15 years, 1980-81
to 1994-95 (period I) and 1995-96 to 2009-10 (period II). These periods broadly correspond
to the period before the macroeconomic reforms and the post-reform period, respectively.
To avoid extreme variations, triennial ending averages were used. The analysis was done using
the software DEAP 2.1 (Coelli, 1996).
Figure 1: Malmquist TFP and efficiency indices of paddy cultivation, 1980-81 to 2009-10
The result suggests that the mean TFP change for rice has been 0.2 per cent per year during
the overall period under consideration (Table 2). The decomposition of the TFP change
indicated that the change in the TFP was associated with technological progress of 0.3 per cent and a deterioration of technical efficiency of -0.1 per cent. This underscores that technical efficiency could not keep pace with technical progress and is pulling down the TFP growth. In the case of wheat in India from 1982-83 to 1999-2000, Bhushan (2005) indicated that the major source of productivity growth was technical change rather than efficiency change. Efficiency change was also not a major source of growth for rice in other major rice-growing countries such as the Philippines (Umetsu et al., 2003).
Table 2 also depicts the growth in TFP and its constituent components across states for the
overall period under analysis. The TFP change varied considerably across states, with four
states (Andhra Pradesh, Punjab, Tamil Nadu and Uttar Pradesh) out of the total nine states
under consideration posting positive trends and the remaining five states posting negative
trends. The highest change in the TFP among states has been noted in case of Andhra
Pradesh (5.1 per cent), followed by Punjab (4.6 per cent). On the other hand, the negative
TFP growth ranged between -4.6 per cent in cases of Madhya Pradesh to -1.3 per cent in
case of Karnataka. The table reveals that the TFP change is associated more with technical
change than with efficiency change at state level also. A positive growth in both efficiency
and technical change could be noted only in case of Andhra Pradesh and Uttar Pradesh.
For Punjab, positive technical change was associated with no-change in efficiency, while for
Tamil Nadu, a technical change of 2.8 per cent is coupled with an efficiency change of -0.9 per
cent. It is noteworthy that Karnataka and Madhya Pradesh posted a decline of technical
change, efficiency change and TFP during the overall period. The change in efficiency has
been decomposed into its components, viz. pure efficiency change and scale efficiency change
as well. Pure efficiency has remained unchanged at the national level and in most of the
states, except in Andhra Pradesh and Tamil Nadu. An increase in pure efficiency has been
observed in the case of Andhra Pradesh and Uttar Pradesh. The results suggest that the
agricultural development strategy has to pay increased attention to the factors that could
influence efficiency as well as the factors that result in technical progress.
Table 2: Trend in the total factor productivity and its components, 1980-81 to 2009-10

| State | Efficiency change | Technical change | Pure efficiency change | Scale efficiency change | TFP change |
|---|---|---|---|---|---|
| Andhra Pradesh | 100.7 | 104.4 | 100.5 | 100.2 | 105.1 |
| Bihar | 100 | 97.7 | 100 | 100 | 97.7 |
| Karnataka | 99.9 | 98.8 | 100 | 99.9 | 98.7 |
| Madhya Pradesh | 98.7 | 96.7 | 100 | 98.7 | 95.4 |
| Odisha | 100 | 96.3 | 100 | 100 | 96.3 |
| Punjab | 100 | 104.6 | 100 | 100 | 104.6 |
| Tamil Nadu | 99.1 | 102.8 | 99.3 | 99.8 | 101.8 |
| Uttar Pradesh | 100.5 | 103.2 | 100 | 100.5 | 103.7 |
| West Bengal | 100 | 98.6 | 100 | 100 | 98.6 |
| Mean | 99.9 | 100.3 | 100 | 99.9 | 100.2 |
The sub-period analysis throws up some interesting results (Table 3). It turned out that at
national level, the mean TFP growth increased from -1.3 per cent in the period I (first period)
to 1.8 per cent during period II (second period). This TFP change was associated with an
improvement in the technical change (from -1.6 per cent to 2.1 per cent) and a decline in
efficiency (from 0.3 per cent to -0.2 per cent). It is observed that some of the early green revolution states like Punjab, Tamil Nadu and Uttar Pradesh, which posted high rates of TFP growth during the first period, have exhibited a deterioration during the second period, while states like Karnataka, Madhya Pradesh, Odisha and West Bengal, where the TFP trend was deteriorating during the first period, have shown a revival.
The results also suggest that the TFP changes of the latter group of states improved by wide margins between the two periods, the highest absolute increase being in the case of Odisha (by 12.2 percentage points). The decline in the TFP of Punjab, Tamil Nadu and Uttar Pradesh was mainly due to a deterioration of the rate of technical progress rather than a decline in efficiency growth. The revival of TFP growth in the case of Karnataka, Madhya Pradesh, Odisha and West Bengal is due to a high level of technological progress. A picture of contrasting
performance has been noted in case of Andhra Pradesh and Bihar. In Andhra Pradesh an
already increasing TFP growth has increased further during the second period (from 4.0 per
cent to 7.5 per cent), while in Bihar the already deteriorating TFP growth during the first
period has further deteriorated (from -0.7 per cent to -4.4 per cent). This contrasting performance of the two states owes to the contrasting performance of their technical progress. In the case of Andhra Pradesh, the increase in technical progress from 2.5 per cent to 6.6 per cent could surpass the deterioration in efficiency growth, effecting a positive TFP growth. On the other hand, the deterioration of technical change from -0.7 per cent to -4.4 per cent, while efficiency remained unchanged, has pulled down the TFP growth in the case of Bihar. The increase in TFP growth with practically unaltered efficiency levels points to an upward shift of the production frontier. In that sense, it can be presumed that the low-performing states of the first period have been catching up with the already progressive states. On the other hand, the results suggest that the rate of shift in the production frontier is declining in the already well-performing states, except Andhra Pradesh.
Table 3: The trend in technical change, efficiency change and total factor productivity

| State | Efficiency change (Period I) | Efficiency change (Period II) | Technical change (Period I) | Technical change (Period II) | TFP change (Period I) | TFP change (Period II) |
|---|---|---|---|---|---|---|
| Andhra Pradesh | 101.5 | 100.8 | 102.5 | 106.6 | 104.0 | 104.4 |
| Bihar | 100.0 | 100.0 | 99.3 | 95.6 | 99.3 | 97.7 |
| Karnataka | 100.0 | 100.3 | 95.3 | 102.1 | 95.3 | 98.8 |
| Madhya Pradesh | 99.7 | 98.8 | 91.4 | 101.8 | 91.2 | 96.7 |
| Odisha | 100.0 | 100.0 | 90.0 | 102.2 | 90.0 | 96.3 |
| Punjab | 100.0 | 100.0 | 105.6 | 104.0 | 105.6 | 104.6 |
| Tamil Nadu | 100.0 | 98.0 | 103.6 | 102.3 | 103.6 | 102.8 |
| Uttar Pradesh | 101.1 | 100.0 | 103.4 | 103.2 | 104.6 | 103.2 |
| West Bengal | 100.0 | 100.0 | 96.0 | 101.1 | 96.0 | 98.6 |
| Mean | 100.3 | 99.8 | 98.4 | 102.1 | 98.7 | 100.3 |
Technology Fatigue or Sluggishness in Input Intensification?
The above results help shed light on the debate on whether the declining productivity is due to technology fatigue or policy fatigue. The foregoing analysis has clearly shown that
TFP growth in rice has acquired greater geographical spread during recent periods. In this
context, it would be worthwhile to analyze the trend in use of inputs in rice cultivation. Table
4 provides the trend growth of application of four major inputs, viz. irrigation, fertilizer,
manures and human labour. It clearly indicates that the rate of use of inputs has declined in
most of the states, with a few exceptions.
The decline has been sharp in the case of labour, fertilizer and manure. All the states with the
exception of Punjab posted a decline in the rate of application of fertilizers. In case of labour,
all the states except Odisha and West Bengal have posted negative growths. This trend has
been broadly reflected in the cost of cultivation as well (Appendix). At national level, the cost
of cultivation increased at the rate of 9.2 per cent per year during the overall period under
analysis. On a disaggregated analysis, the second period exhibited a growth rate of 7.3 per cent per year, compared to 10.9 per cent during the first period. This decline in expenditure growth (despite a higher level of input prices during the second period) might be due to reduced rates of input application.
Table 4: Growth in use of irrigation (real price), fertilizer nutrients (kg/ha) and human labour (labour hours) in paddy cultivation across states, between two periods (% per year)
States Irrigation Fertilizer Labour
The above trend is vividly reflected in the change in the cost structure and factor shares
(Table 5). For analytical purposes, the entire expenditure on rice cultivation has been grouped into four input groups, viz. current inputs, capital inputs, labour and land. Current
inputs are seed, fertilizer, manure, insecticides, interest on variable cost; Capital inputs are
draft animal, irrigation, machinery, depreciation, interest on fixed capital; labour input is
human labour. The land revenue involves the value of land resources (both owned and hired)
as well as other charges on land. The table provides three specific pieces of information: the share of inputs in the total cost of cultivation (cost share), the trend growth of (nominal) expenditure on these input
groups, and their share in total value of output (factor share). The expenditure of the current
inputs has grown at the rate of 8.0 per cent per year, capital inputs at the rate of 8.8, labour
at the rate of 10.5 and land at the rate of 8.5 per cent for the overall period under analysis.
Period II has depicted a reduction in the expenditure growth for all the input groups, most noticeably in the case of current inputs (from 9.7 per cent to 5.4 per cent). The growth in expenditure on capital inputs, which more or less reflects long-term farm investment, has reduced from 10.3 per cent to 7.2 per cent. This is a cause for concern, as the reduction in capital investment has long-term implications for farm income growth.
Table 5: Trend in the cost share, factor share and growth rate of various input groups in paddy cultivation, national level

| Input group | Cost share 1980-81 (%) | Cost share 1994-95 (%) | Cost share 2009-10 (%) | Trend growth, Period I (% per year) | Trend growth, Period II (% per year) | Trend growth, Overall (% per year) | Factor share 1980-81 (%) | Factor share 1994-95 (%) | Factor share 2009-10 (%) |
|---|---|---|---|---|---|---|---|---|---|
| Current | 18.9 | 17.0 | 13.0 | 9.7 | 5.4 | 8.0 | 17.2 | 14.4 | 12.4 |
| Capital | 24.4 | 20.8 | 17.9 | 10.3 | 7.2 | 8.8 | 22.3 | 17.6 | 17.1 |
| Labour | 28.9 | 32.3 | 42.3 | 12.1 | 8.9 | 10.5 | 26.4 | 27.5 | 40.3 |
| Land | 27.8 | 29.9 | 26.8 | 11.1 | 6.2 | 8.7 | 25.4 | 25.4 | 25.6 |
Basic Data Source: Cost of cultivation reports of CACP
Corresponding to the relative growth of expenditure, the structure of costs has also depicted a sharp change over time. While the shares of current inputs, capital inputs and land in the cost of cultivation have registered a decline over the years, that of labour has increased by 13 percentage points between 1980-81 and 2009-10. The spurt in labour expenditure has to be explained in the light of the high rate of increase in agricultural wages in recent times rather than a physical increase in labour absorption in rice cultivation. The results broadly suggest that
it is the sluggishness in input intensification that is resulting in the yield decline rather than a reduction in TFP or technical change. This indicates that farm policies should favour sustainable intensification of inputs so as to increase the yield. The trend in the cost share has been broadly reflected in the factor share as well. While the share of current and capital inputs declined over the years, the share of labour and land has increased. A close observation also reveals that technical change in rice cultivation has not resulted in a significant percolation of benefits to the entrepreneur/farmer in the form of an increased share in the value of output during the second period under analysis.
The study has estimated the TFP growth for rice in India and in major states and has
decomposed the TFP growth into its constituent components viz technical change and
efficiency change. In the light of the above results the study has discussed whether the recent
slowdown in yield growth is due to technology fatigue or sluggishness in the input
intensification.
The study identifies that during the overall period under analysis, the TFP growth has been
at a moderate rate of 0.2 per cent per year, with large inter-state variations. The positive
change in the TFP has been associated with a mean technical change of 0.3 per cent and a
deterioration of mean efficiency by -0.1 per cent. The technical change turned out to be the
main driver of the TFP change. Among the states, Andhra Pradesh, Punjab, Tamil Nadu and Uttar Pradesh exhibited positive TFP change during the entire period under analysis. The sub-period analysis indicates that the second period witnessed a revival of the mean TFP to the level of 1.8 per cent per year, compared to a negative TFP change of -1.3 per cent during the previous period. This revival has been effected mainly by positive technical change during the second period. However, a matter of concern is the decline in technical efficiency. It is also observed that the TFP growth has become more widespread with the passage of time. The states that were less progressive with respect to TFP growth during the first period, viz. Karnataka, Madhya Pradesh, Odisha and West Bengal, have caught up with the initially progressive states during the second period, mainly propelled by a high rate of technical progress. It is also noted that the TFP growth of the progressive states, except Andhra Pradesh, has deteriorated during the second period, mainly due to the regress in technical change. One state that needs special mention is Bihar, where both technical change and efficiency change deteriorated over the years.
The study throws up some important policy observations. It establishes that in case of rice,
there is no conclusive evidence for a technology regress; rather there is evidence of
technological progress over the years. However, the rate of growth of input application has been declining over the years. Therefore, rather than technological fatigue, it might be sluggish input intensification that is contributing to the decline in yield growth of rice in recent periods. Hence, farm policies need to be aligned towards sustainable resource intensification, notably of capital inputs, as they have long-term implications for farm income growth. Along with technical progress, the policies should be aligned to improve the technical efficiency of cultivation. In the light of the existing evidence on the positive role of research investment in technical progress and of extension expenditure on efficiency change, agrarian policies need to favour an increased flow of resources towards the research and extension system so as to effect TFP growth through both technical and efficiency changes.
Bibliography
Bhushan, S. (2005) Total factor productivity growth of wheat in India: A Malmquist
Approach. Indian Journal of Agricultural Economics, 60(1):32-48.
Caves, D.W., Christensen, L.R. and Diewert, W.E. (1982) The economic theory of index
numbers and the measurement of input, output and productivity, Econometrica:
1393-1414.
Chand, R., Kumar, P and Kumar, S. (2011) Total factor productivity and contribution of
research investment to agricultural growth in India, Policy Paper 25, New Delhi,
National Centre for Agricultural Economics and Policy Research.
Chaudhary, S. (2012) Trend in total factor productivity in Indian agriculture: State level
evidence using non-parametric sequential Malmquist Index, Working Paper No 215,
New Delhi, Centre for Development Economics, Delhi School of Economics.
Coelli, T.J. (1996) A guide to DEAP Version 2.1: A Data Envelopment Analysis (Computer)
Program, Centre for Efficiency and Productivity Analysis, University of New England,
Australia.
Coelli, T.J., Rao, D.S.P., O’Donnell, C.J. and Battese, G.E. (2005) An introduction to efficiency
and productivity analysis, Springer.
Fare, R., Grosskopf, S., Norris, M. and Zhang, Z. (1994) Productivity growth, technical
progress, and efficiency change in industrialised countries. The American Economic
Review. 66-83.
Hayami, Y and Kikuchi, M. (1999) The three decades of green revolution in a Philippine village.
Japanese Journal of rural economics, 1: (10-24).
Kalirajan, K.P. and Shand, R.T. (1997) Sources of output growth in Indian Agriculture, Indian
Journal of Agricultural Economics, 52(4), 693-706.
Kawagoe T., Hayami Y., Ruttan V. (1985), The inter-country agricultural production function
and productivity differences among countries. Journal of Development Economics,
Vol. 19, p113-32.
Kumar P. and Mittal Surabhi (2006) Agricultural Productivity Trends in India: Sustainability
Issues. Agricultural Economics Research Review. 19 (Conference No.) pp 71-88.
Kumar, P. and Mruthyunjaya (1992) Measurement and analysis of total factor productivity
growth in wheat. Indian Journal of Agricultural Economics, 47 (7): 451-458.
Kumar, P., Joshi, P.K., Johansen, C and Asokan, M. (1998) Sustainability of rice-wheat based
cropping system in India. Economic and Political Weekly, 33: A152-A158.
Li, G., You, L. and Feng, Z. (2011) The sources of total factor productivity growth in Chinese
agriculture: Technological progress or efficiency gain. Journal of Chinese Economic
and Business Studies, 9(2): 181-203.
Narayanamoorthy, A. (2007). Deceleration in agricultural growth: Technology or policy
fatigue. Economic and Political Weekly, 42(25):2375-79.
Nkamleu, G.B., Gokowski, J and Kazianga, H. (2003) Explaining the failure of agricultural
production in sub-saharan Africa. Proceedings of the 25th International Conference
of Agricultural Economists, Durban, South Africa, 16-22 August 2003.
Otsuka, Keijiro (2000) Role of agricultural research in poverty reduction: lesson from the
Asian Experience. Food Policy, 25: 445-462.
Planning Commission (2007) Report of the steering committee on Agriculture for Eleventh
Five Year Plan (2007-2012), New Delhi, Government of India.
Umetsu, C., Lekprichakul, T and Charavorty, U (2003). Efficiency and technical change in the
Philippine rice sector: A Malmquist total factor productivity analysis. American
Journal of Agricultural Economics. 85(4): 943-963.
Chapter 9
Forecasting Methods – An Overview

Ramadas Sendhil1, V Chandrasekar2, L Lian Muan Sang1, Jyothimol Joseph1 and Akhilraj M1

1 Department of Economics, Pondicherry University (A Central University), Puducherry, India.
2 ICAR-Central Institute of Fisheries Technology, Cochin, India.
Introduction
This chapter is primarily adapted from ‘Forecasting of Paddy Prices: A Comparison of Forecasting Techniques’ (2007) authored
by Nasurudeen P, Thimmappa K, Anil Kuruvila, Sendhil R and V Chandrasekar from the Market Forecasting Centre, Department of
Agricultural Economics, PJN College of Agriculture and Research Institute, Karaikal. Available at:
https://www.researchgate.net/publication/329446012
Types of Forecasting Methods
Forecasting is a process that can be done based on subjective factors using personal
judgment, intuition, and commercial knowledge, and also through an objective approach using
statistical analysis of past data. Sometimes, a blend of both is also used. Broadly, the various
forecasting methods can be grouped into qualitative and quantitative approaches.
A. Qualitative Forecasting Methods: Subjective judgments or opinions are used in
qualitative methods of forecasting. These methods do not include mathematical
computations. This technique is employed when the past data for the variable being
forecast is unavailable, when there is limited time to gather data or utilize
quantitative techniques, or when the situation is evolving so rapidly that a statistical
forecast would offer fewer insights.
Time Series
Time series data is a collection of ordered observations on a quantitative attribute of a
variable gathered at various time intervals. Typically, these observations occur sequentially
and are evenly distributed over time. In mathematical terms, a time series is characterized by X1, X2, …, Xn, the values of a variable X (such as the gross domestic product, sales, commodity price, height, weight, etc.) at specific time points t1, t2, …, tn. Therefore, X is a function of time,
denoted as X = F(t).
4. Control: Effective forecasts enable proactive control of a process or variable, aligning
with the concept of what-if-forecasting.
Time Series Forecasting: Time series forecasting, a subset of quantitative forecasting models,
involves analyzing data for trend, seasonality, and cycle patterns in a single variable.
Understanding these patterns is crucial before conducting the analysis. The initial step in forecasting a time series variable is to generate sequence plots of the data to visually evaluate the characteristics of the time series; a plot of the series' autocorrelations, known as a correlogram, complements this visual assessment. This visualization aids in identifying behavioral components within the time
series and guides the selection of the most suitable forecasting model. Various conventional
methods for forecasting time series are the naïve method, mean model, moving averages
method, linear regression with time, exponential smoothing models, auto-regressive moving average (ARMA), and auto-regressive integrated moving average (ARIMA). In this age of
artificial intelligence, more powerful, robust, and precise models of forecasting, such as
artificial neural networks (ANN), are developed and used by econometricians. Once a model
is selected based on the data pattern, the next step is its specification. This process entails
the identification of variables to be incorporated, the selection of the relationship equation's
form, and the estimation of the equation's parameters. The model's effectiveness is validated
by comparing its forecasts with historical data for the targeted forecasting process. Typical
error metrics like Mean Absolute Percentage Error (MAPE), Relative Absolute Error (RAE),
and Mean Square Error (MSE) are frequently employed for model validation. The objective
here is to distinguish the trend from the disturbance and observe the trend in its lagged
values to determine the long-term changes and the factors behind seasonal fluctuations. Several computer packages, including R, SPSS, FORECASTX, STATA, SHAZAM, SAS, and EVIEWS,
can perform time series forecasting. These tools allow analysts to effectively conduct and
validate time series forecasting analyses.
• Trend refers to the gradual and long-term increase or decrease of the variable over
time.
• Seasonal effects capture the recurring influences that impact the variable on an
annual basis.
• Cyclical effects measure the broad, irregular waves that affect the variable,
potentially stemming from general business cycles, demographic shifts, and other
factors.
• The irregular effect encompasses the variations that cannot be ascribed to trend,
seasonality, or cyclical patterns, essentially representing the residual fluctuations.
| Forecasting Method | Description |
|---|---|
| Simple Arithmetic Mean | Utilizes the mean of all historical data as a forecast. |
| Adaptive Response Rate Exponential Smoothing (ARRES) | Similar to the basic exponential smoothing model, but with adaptive smoothing parameter (alpha) adjustments based on varying errors over time. |
| Artificial Neural Network (ANN) | A powerful tool for forecasting and modeling when the underlying relationship in the data is not known. |
2. Forecasting by Simple Arithmetic Average Method
In this method, the forecast for the subsequent day (i.e., Xt+1) is calculated as the mean of all
the past values or historical data. In this context, daily forecasting is initiated from day 2 (as there was no pre-existing data available to form a forecast for the first day, other means of making a prediction have to be relied upon) to determine the value for each day, followed by forecasting for the subsequent days.
5. Forecasting by Exponential Smoothing Method
It is a weighted average technique wherein the weights decline exponentially as data ages. In
this method, the forecast for the next day (i.e., Ft+1) is determined using the following formula:
Ft+1 = α At + (1-α) Ft (Eq. 1)
where At is the actual time series, Ft is the forecast series, and ‘α’ represents a smoothing
coefficient ranging between ‘0 and 1’. Although the exponential smoothing method relies on
just two observations for making future predictions (the latest actual observation and the
most recent forecast), it effectively integrates a portion of all historical data. In this approach,
past values are assigned varying weights, with older data receiving less weight. This concept
can be illustrated by extending the formula mentioned above. The method used to generate
the forecast for the last day (Ft) is as follows.
Ft = α At-1 + (1-α) Ft-1 (Eq. 2)
Substituting eq. 2 into eq. 1:
Ft+1 = α At + (1-α) [α At-1 + (1-α) Ft-1]
Modifying the above equation,
Ft+1 = α At + α (1-α) At-1 + (1-α)^2 Ft-1 (Eq. 3)
Writing eq. 2 for the previous period:
Ft-1 = α At-2 + (1-α) Ft-2 (Eq. 4)
Substituting eq. 4 into eq. 3:
Ft+1 = α At + α (1-α) At-1 + (1-α)^2 [α At-2 + (1-α) Ft-2] (Eq. 5)
Modifying the above equation,
Ft+1 = α At + α (1-α) At-1 + α (1-α)^2 At-2 + (1-α)^3 Ft-2 (Eq. 6)
Continuing this process, the forecast can be written as:
Ft+1 = α At + α (1-α) At-1 + α (1-α)^2 At-2 + α (1-α)^3 At-3 + α (1-α)^4 At-4 + … (Eq. 7)
As the decimal weights are raised to increasing powers, their values diminish. In the absence
of initial data, a guess value can be used for the day one forecast. Subsequently, the
exponential smoothing model shall be employed to forecast each subsequent day, starting
from day two. However, there are some principles to determine the value of ‘α’ which are
given below:
● To handle data that is random and shows erratic behavior without a clear pattern, a
larger value of ‘α’ should be employed.
● Conversely, for random walk time series data characterized by random and smooth
fluctuations without repetitive patterns, a smaller value of ‘α’ is recommended.
● When higher degree smoothing is required, a long-run moving average should be
utilized, corresponding to a smaller ‘α’ value.
● Conversely, when a lesser degree of smoothing is required, a short-run moving
average should be employed, corresponding to a higher ‘α’ value.
● Experimenting with different values of ‘α’ to fit the model and selecting the optimal
‘α’ based on minimal error is advisable.
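As a concrete illustration of Eq. 1, the short Python sketch below computes the simple-average forecast described in section 2 and the exponentially smoothed forecast for a small assumed price series; the series and the value α = 0.3 are purely illustrative assumptions.

```python
# A minimal sketch of the simple-average and exponential-smoothing forecasts;
# the price series and the alpha value are illustrative assumptions.
import numpy as np

prices = np.array([102.0, 105.0, 101.0, 108.0, 110.0, 107.0])  # hypothetical daily prices
alpha = 0.3

# Simple arithmetic average: the forecast for the next day is the mean of all past values.
mean_forecast_next = prices.mean()

# Simple exponential smoothing, Eq. 1: F[t+1] = alpha*A[t] + (1-alpha)*F[t].
forecasts = np.empty(len(prices) + 1)
forecasts[0] = prices[0]            # a guess value seeds the day-one forecast
for t in range(len(prices)):
    forecasts[t + 1] = alpha * prices[t] + (1 - alpha) * forecasts[t]

print("Mean forecast for next day:", round(mean_forecast_next, 2))
print("SES forecast for next day :", round(forecasts[-1], 2))
```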
S''t = α S't + (1-α) S''t-1
Then,
at = S't + (S't − S''t) = 2 S't − S''t
bt = [α / (1-α)] (S't − S''t)
Ft+m = at + bt m
where, m is the number of periods ahead to be forecast.
(d) Logarithmic Function: This method uses an alternate logarithmic model, and is expressed as:
Yt = a + b ln(t)
(e) Gompertz Function: This method attempts to fit a 'Gompertz' or 'S' curve and is expressed as:
Yt = e^[a + (b/t)] or ln(Yt) = a + (b/t)
(f) Logistic Function: This method attempts to fit a 'Logistic' curve, expressed as:
Yt = 1 / [(1/u) + a b^t] or ln[(1/Yt) − (1/u)] = ln(a) + t ln(b)
where 'u' is the upper boundary value.
(g) Parabola or Quadratic Function: This technique aims to fit a 'Parabolic' curve to forecast a damped data series, expressed as:
Yt = a + bt + ct^2
(h) Compound Function: This approach generates a forecasting curve that experiences compound growth or decline, expressed as:
Yt = a b^t or ln(Yt) = ln(a) + t ln(b)
(i) Growth Function: This approach generates a forecasting curve based on an estimated growth rate, expressed as:
Yt = e^(a + bt) or ln(Yt) = a + bt
(j) Cubic Function: This approach seeks to fit a 'Cubic' curve, expressed as:
Yt = a + bt + ct^2 + dt^3
(k) Inverse Function: This method attempts to fit an 'Inverse' curve, expressed as:
Yt = a + (b/t)
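Several of these forms become linear in their parameters after a simple transformation and can therefore be fitted by ordinary least squares. The Python sketch below fits the growth function (via ln Yt = a + bt) and the logarithmic function to an assumed series; the data and the choice of forms are illustrative only.

```python
# A minimal sketch of fitting two trend forms by OLS after transformation;
# the series is an illustrative assumption.
import numpy as np

y = np.array([120.0, 135.0, 151.0, 170.0, 190.0, 214.0, 240.0])  # hypothetical values
t = np.arange(1, len(y) + 1)

# Growth function: ln(Yt) = a + b*t  (equivalently Yt = e^(a + b*t))
b_g, a_g = np.polyfit(t, np.log(y), 1)
print("Implied growth rate per period:", round(np.exp(b_g) - 1, 4))
print("Growth-curve forecast for t =", len(y) + 1, ":",
      round(np.exp(a_g + b_g * (len(y) + 1)), 1))

# Logarithmic function: Yt = a + b*ln(t)
b_l, a_l = np.polyfit(np.log(t), y, 1)
print("Logarithmic-curve forecast for t =", len(y) + 1, ":",
      round(a_l + b_l * np.log(len(y) + 1), 1))
```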
The widespread adoption of ARIMA models is credited to Box and Jenkins (1970), who
introduced a diverse range of ARIMA models, with the general non-seasonal model denoted
as ARIMA (p,d,q).
Here, AR (p) denotes the order of the auto-regressive part, I (d) denotes the degree of differencing involved, and MA (q) denotes the order of the moving average part.
A. Identification
The potential existence of a wide range of ARIMA models can sometimes pose challenges in
determining the most suitable model and the following steps will address this challenge:
• Plot the data and detect any anomalies. Begin by plotting the data to identify
anomalies and assess if a transformation is required to stabilize the variability in the
time series. If necessary, apply a transformation to ensure stationarity in the series.
• After transforming the data (if needed), evaluate whether the data exhibit
stationarity by examining the time series plot, Autocorrelation Function (ACF), and
Partial Autocorrelation Function (PACF). A time series is likely stationary if the plot
shows data scattered around a constant mean, indicating the mean-reverting
property. Additionally, stationarity is suggested if the ACF and PACF values drop to
near zero. Conversely, non-stationarity is implied if the time series plot is not
horizontal or if the ACF and PACF do not decline toward zero.
• If the data remains non-stationary, consider applying techniques such as
differencing or detrending to achieve stationarity. For seasonal data, apply seasonal
differencing to the already differenced data. Typically, no more than two differencing
operations are needed to achieve a stationary time series.
• Once stationarity is achieved, examine autocorrelations to identify any remaining
patterns. Consider the following possibilities:
a. Seasonality may be indicated by large autocorrelations and/or partial
autocorrelations at the seasonal lags significantly different from zero.
b. Patterns in autocorrelations and partial autocorrelations may indicate the
potential for AR or MA models. If the ACF shows no significant
autocorrelations after lag q, this could indicate the suitability of an MA (q)
model. Similarly, if no significant partial autocorrelations remain after lag
p, an AR (p) model might be appropriate.
c. Without a clear indication of an MA or AR model, a mixed ARMA or ARIMA
model may be required.
Applying the Box-Jenkins methodology requires experience and sound judgment, with guiding
principles in mind.
• Establishing Stationarity: A preliminary analysis of the raw data helps
determine whether the time series is stationary in both its mean and
variance. Non-stationarity can often be addressed using differencing
(seasonal or non-seasonal) and transformations such as logarithmic or
power transformations.
• Considering Non-Seasonal Aspects: After achieving stationarity, examine
the ACF and PACF plots to evaluate the possibility of an MA or AR model
for non-seasonal data.
• Considering Seasonal Aspects: For seasonal aspects, the ACF and PACF
plots at seasonal lags help identify potential seasonal AR or MA models.
However, identifying seasonal components can be more complex and less
obvious compared to non-seasonal patterns.
B. Estimation
Once a tentative model identification has been made, the AR and MA parameters, both
seasonal and non-seasonal, must be determined most effectively. For instance, consider a
class of model identified as ARIMA (0,1,1), which is a family of models dependent on one MA
coefficient θ1:
(1-B)Yt = (1- θ1 B) et
The objective is to obtain the best estimate of θ1 to fit the time series being modeled
effectively. While the least squares method can be utilized for ARIMA models, similar to regression, models involving an MA component (i.e., where q > 0) do not have a simple formula for estimating the coefficients. Instead, an iterative method must be
employed. The general ARIMA model's statistical assumptions enable the computation of
useful summary statistics once the optimum coefficient values have been estimated. Each
coefficient can be associated with a standard error, enabling the conduct of a significance
test based on the parameter estimate and its standard error. An ARIMA (3,1,0) model will be of the form:
Y't = Ø1 Y't-1 + Ø2 Y't-2 + Ø3 Y't-3 + et, where Y't = Yt − Yt-1
C. Diagnostic Checking
The diagnostic examination of the selected model is essential to ensure its adequacy. This
involves studying the residuals to identify any unaccounted patterns. Although calculating
the errors in an ARIMA model is more complex than in an ordinary least squares (OLS) model,
these errors are automatically generated as part of the ARIMA model estimation process. For
the model to be considered reliable for forecasting, the residuals left after fitting the model
should resemble white noise. A white noise model is characterized by residuals with no
significant autocorrelations and partial autocorrelations. One way to assess the correctness
of the model fit is by examining the residuals. Usually, the count of residuals will be n - d -
sD, where n denotes the number of observations, d, and D are the degrees of non-seasonal
and seasonal differencing, respectively, and s represents the number of observations per
season. Standardizing the residuals in plots is a common practice to ensure the variance
equals one, which helps in identifying potential outliers more easily. Any residuals smaller
than -3 or larger than 3 are regarded as outliers and may require further scrutiny. The residual
series is white noise if no outliers exist and the ACF or PACF is within the limits. Once this
step is confirmed, the next stage is actual forecasting.
To employ an identified model for forecasting, it is crucial to extend the equation and present
it in a conventional regression equation format. In the specified model, the equation is
expressed as follows:
Yt = Yt-1 + Yt-12 − Yt-13 + et − θ1 et-1 − Θ1 et-12 + θ1 Θ1 et-13
To forecast one period ahead, i.e., Yt+1, the subscripts are incremented by one throughout the equation:
Yt+1 = Yt + Yt-11 − Yt-12 + et+1 − θ1 et − Θ1 et-11 + θ1 Θ1 et-12
While the term et+1 will not be known, the fitted model allows for replacing et, et-11, and et-12 with their empirically determined values, which are the residuals for times t, t-11, and t-12, respectively. As the forecasting extends further into the future, there will be no empirical
values for the error terms, and their expected values will be zero. Initially, the Y values in the
equation will be known as past values (Yt, Yt-11 and Yt-12 ). However, as the forecasting
progresses, these Y values will transition to forecasted values rather than known past values
(Makridakis et al., 1998).
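The full identification–estimation–diagnostic checking–forecasting cycle can be carried out in any of the packages mentioned earlier. The Python sketch below uses statsmodels as one possibility; the simulated series and the ARIMA(1,1,1) order are assumptions standing in for whatever order the ACF/PACF analysis of real data would suggest.

```python
# A minimal sketch of the Box-Jenkins cycle with statsmodels; the simulated
# series and the (1,1,1) order are illustrative assumptions, not the chapter's data.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(42)
y = pd.Series(100 + np.cumsum(rng.normal(0.5, 2.0, 120)))  # hypothetical non-stationary series

# Estimation: fit a tentative ARIMA(p,d,q) identified from the ACF/PACF.
model = ARIMA(y, order=(1, 1, 1))
result = model.fit()
print(result.summary())

# Diagnostic checking: residuals should resemble white noise
# (large Ljung-Box p-values indicate no remaining autocorrelation).
print(acorr_ljungbox(result.resid, lags=[10]))

# Forecasting: project the next 12 periods from the fitted model.
print(result.forecast(steps=12))
```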
Probabilistic forecasting in the Markov chain model starts with defining the state. The
probabilities of transition from one state to another have to be determined. These
probabilities are often represented in a transition matrix. Each entry in the matrix represents
the probability of transitioning from one state to another. Once the system's initial state is
specified, transition probabilities can be used to simulate sequences of states over time. This
can be done iteratively, where the current state determines the next state based on the
transition probabilities. A distribution of possible future states can be built by simulating
multiple state sequences. This distribution provides a probabilistic forecast of the system's
future behavior (Paul, 2012).
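A minimal sketch of this procedure is given below, using a hypothetical three-state chain (say, 'low', 'medium' and 'high' price states): the transition matrix is iterated forward by simulation and, equivalently, by raising it to a power. The states and probabilities are assumptions for illustration.

```python
# A minimal sketch of probabilistic forecasting with a Markov chain;
# the states and transition probabilities are illustrative assumptions.
import numpy as np

states = ["low", "medium", "high"]
P = np.array([[0.6, 0.3, 0.1],      # row i: probabilities of moving from state i
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

rng = np.random.default_rng(0)

def simulate(start_state, n_steps):
    """Simulate one sequence of states using the transition matrix P."""
    s = start_state
    path = [s]
    for _ in range(n_steps):
        s = rng.choice(len(states), p=P[s])
        path.append(s)
    return path

# Build a distribution of the state 5 steps ahead from many simulated paths.
final_states = [simulate(start_state=0, n_steps=5)[-1] for _ in range(10_000)]
counts = np.bincount(final_states, minlength=len(states)) / len(final_states)
print(dict(zip(states, counts.round(3))))

# The same distribution follows analytically from the 5-step transition matrix.
print(dict(zip(states, np.linalg.matrix_power(P, 5)[0].round(3))))
```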
In time series analysis, the ANN technique, including the ANN (p,d,q) model, is employed for forecasting, aiming at greater precision and addressing the intricacies of time-dependent data. One of the key benefits of ANN models compared to other non-linear
models is their capability as universal approximators, allowing them to effectively approximate
a wide range of functions with high precision. Evaluating Artificial Neural Networks (ANN)
against alternative forecasting methods, such as linear regression and exponential smoothing,
has provided valuable insights into the comparative efficacy of diverse techniques within
specific domains, such as stock market and sales prediction. Compared with traditional statistical methods such as ARIMA, ANN has been shown in several such evaluations to improve forecasting accuracy. As a result, ANN
models have become increasingly utilized in time series forecasting to enhance predictive
capabilities and improve the accuracy of forecasts. This approach underscores the
adaptability of ANN in capturing and predicting the complexities of time-dependent data,
demonstrating its flexibility and effectiveness in capturing complex temporal patterns, making
them valuable in time series analysis. The combination of ANN with genetic algorithms and
deep belief networks has been explored to optimize forecasting models, addressing challenges
such as model overfitting and lack of interpretability.
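As one illustrative (and deliberately simple) sketch of the idea, the Python code below trains a small multi-layer perceptron on lagged values of an assumed series with scikit-learn; in practice the number of lags, the architecture, and data scaling would all need to be tuned, and this is not a prescription for any particular ANN specification.

```python
# A minimal sketch of ANN-based time series forecasting with lagged inputs;
# the series, lag length and network size are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
t = np.arange(200)
series = 50 + 0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

n_lags = 12
# Each row of X holds the 12 preceding observations; y holds the next value.
X = np.column_stack([series[i:i + len(series) - n_lags] for i in range(n_lags)])
y = series[n_lags:]

model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000, random_state=1)
model.fit(X[:-12], y[:-12])                 # hold out the last 12 points

preds = model.predict(X[-12:])              # one-step-ahead predictions on the hold-out
mape = np.mean(np.abs((y[-12:] - preds) / y[-12:])) * 100
print(f"Hold-out MAPE: {mape:.2f}%")
```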
Measuring Accuracy of Forecast
Forecast Errors: Forecast error refers to the disparity between forecasted and actual values
(test data). The precision of the aforementioned forecasting models can be enhanced by
minimizing specific criteria, such as:

Percent Error: $PE_t = \dfrac{A_t - F_t}{A_t} \times 100$

Mean Percent Error: $MPE = \dfrac{1}{n}\sum_{t=1}^{n} PE_t$

Mean Absolute Percent Error: $MAPE = \dfrac{1}{n}\sum_{t=1}^{n} \left|PE_t\right|$

Theil’s U-statistic (out-of-sample forecast): $U = \sqrt{\dfrac{\sum_{t=1}^{n-1}\left(\dfrac{F_{t+1}-A_{t+1}}{A_t}\right)^{2}}{\sum_{t=1}^{n-1}\left(\dfrac{A_{t+1}-A_t}{A_t}\right)^{2}}}$
For Theil’s statistics, if U equals 1, it indicates that the naïve method is as effective as the
forecasting technique being evaluated. If U is less than 1, the forecasting technique is
considered to perform better than the naïve method, with smaller U values suggesting greater
superiority. On the other hand, if U is greater than 1, the formal forecasting method does not
provide any benefit, as the naïve method would yield better results. While there are numerous
criteria for assessing forecast accuracy, a few are elaborated in the subsequent section.
1. Forecast Error: The forecast error serves as a metric for evaluating the accuracy of a
forecast at a specific point in time. It is computed as the difference between actual and
forecast values. It is represented as:
et = At - Ft
However, analyzing forecast errors for individual periods may not provide comprehensive
insights. Therefore, it is essential to examine the accumulation of errors over time. Merely
observing the cumulative et values may not provide meaningful insights, as positive and
negative errors offset each other. Relying solely on these values could lead to a false sense
of confidence. For instance, when comparing the original data and the associated pair of
forecasts generated by two different methods, it becomes evident that only a particular
method has produced superior forecasts based on the accumulated forecast errors over time.
2. Mean Absolute Deviation (MAD): To address the issue of positive errors offsetting
negative errors, a straightforward approach involves considering the absolute value of the
error, disregarding its sign. This yields the absolute deviation, which represents the size of
the deviation irrespective of its direction. Subsequently, the mean absolute deviation (MAD)
is computed by determining the average value of these accumulated absolute deviations.
3. Mean Absolute Percent Error (MAPE): The mean absolute percentage error (MAPE) is
computed by averaging the percentage difference between the fitted (forecast) data and the
original data. If the best-fit method yields a high MAPE (e.g., 40 per cent or more), it indicates
that the forecast may not be particularly reliable for various reasons.
MAPE = [Σ |et / At| × 100] / n
where 'A' represents the original series, 'e' represents the original series minus the forecast,
and 'n' denotes the number of observations.
4. Root Mean Squared Error (RMSE): The root mean square error (RMSE) is calculated by
taking the square root of the average of the squared errors. It provides a measure of how
much the forecast deviates from the actual data.
RMSE = √(Σ et^2 / n)
predictions. Tracking signals are crucial for detecting any bias in the forecasting process. Bias
occurs when the forecast consistently overestimates or underestimates the actual data
values. The tracking signal is computed as the running sum of forecast errors divided by the mean absolute deviation:
Tracking signal = Σ (At − Ft) / MAD
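Once the actual and forecast series are aligned, all of these measures are one-line computations. The Python sketch below evaluates MAD, MAPE, RMSE, Theil's U and the tracking signal for an assumed pair of series.

```python
# A minimal sketch of the forecast-accuracy measures; the actual and forecast
# series are illustrative assumptions.
import numpy as np

A = np.array([100.0, 104.0, 103.0, 108.0, 112.0, 115.0])   # actuals (test data)
F = np.array([ 98.0, 105.0, 101.0, 109.0, 110.0, 117.0])   # forecasts

e = A - F
mad = np.mean(np.abs(e))
mape = np.mean(np.abs(e / A)) * 100
rmse = np.sqrt(np.mean(e ** 2))
tracking_signal = e.sum() / mad                 # running sum of errors / MAD

# Theil's U compares the method's relative errors with those of the naive forecast.
u_num = np.sum(((F[1:] - A[1:]) / A[:-1]) ** 2)
u_den = np.sum(((A[1:] - A[:-1]) / A[:-1]) ** 2)
theils_u = np.sqrt(u_num / u_den)

print(f"MAD={mad:.2f}  MAPE={mape:.2f}%  RMSE={rmse:.2f}  "
      f"Tracking signal={tracking_signal:.2f}  Theil's U={theils_u:.2f}")
```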
Conclusions
While numerous forecasting methods and approaches are available, it is evident that there is
no universal single technique suitable for all situations. The selection of a forecasting method
depends on numerous factors, including the pattern of data, desired accuracy, time
constraints, complexity of the situation, the projection period, available resources, and the
forecaster's experience. These factors are interconnected, as a shorter forecasting time may
compromise accuracy, while a longer time frame may enhance accuracy and increase costs.
The key to a precise forecast is finding the right balance among these factors. Generally, the
best forecasts are derived from straightforward and uncomplicated methods. Research
indicates that combining individual forecasts can improve accuracy, while adding quantitative
forecasts to qualitative forecasts may reduce accuracy. However, the optimal combinations
of forecasts and the conditions for their effectiveness have not been fully elucidated.
Combining forecasting techniques typically yields higher-quality forecasts than relying on a
single method, as it allows for compensating for the weaknesses of any particular technique.
By choosing complementary approaches, the shortcomings of one method can be offset by
the strengths of another. Even when quantitative methods are used, they can be combined
with or supplemented by qualitative judgments, and forecasts can be reviewed or adjusted
based on qualitative assessments. It is essential to recognize that the forecasts made by data
analysts are intended for 'decision support,' rather than direct ‘decision-making.’
Bibliography
Armstrong, J.S. 2004. “Principles of Forecasting”, Kluwer Academic Publishers.
Box, G.E. and G. Jenkins. 1970. “Time Series Analysis, Forecasting and Control”, San Francisco: Holden-Day.
Chatfield, C. 2000. “Time-Series Forecasting”, Chapman & Hall/Crc.
Eğrioğlu, E., Yolcu, U., Aladağ, Ç. H., & Baş, E. (2014). Recurrent multiplicative neuron model
artificial neural network for non-linear time series forecasting. Neural Processing
Letters, 41(2), 249-258.
Gautam, N., Ghanta, S. N., Mueller, J. L., Mansour, M., Chen, Z., Puente, C., … & Al’Aref, S. J.
(2022). Artificial intelligence, wearables and remote monitoring for heart failure:
current and future applications. Diagnostics, 12(12), 2964.
Gaynor, E.P and R.C.Kirkpatrick. 1994. “Introduction to Time Series Modeling and Forecasting
in Business and Economics”, McGraw-Hill, Inc.
Gujarati, N.D. and Sangeetha. 2007. “Basic Econometrics”, Tata McGraw-Hill Publishing
Company Limited, New Delhi.
Hamzaçebi, Ç. (2008). Improving artificial neural networks’ performance in seasonal time
series forecasting. Information Sciences, 178(23), 4550-4559.
Hanke, E.J, Dean W.Wichern and Arther G.Reitsch. 2005. “Business Forecasting”, Pearson
Education.
Jain, R., & Agarwal, R. (1992). Probability Model for Crop Yield Forecast. Biometrical Journal,
34(4), 501-11.
Jha, G. K. (2007). Artificial neural network and its applications in agriculture. New Delhi: IARI.
Khashei, M. and Bijari, M. (2010). An artificial neural network (p,d,q) model for time series
forecasting. Expert Systems with Applications, 37(1), 479-489.
Makridakis, S., S.C.Wheelwright and R.J.Hyndman. 1998. “Forecasting - Methods and
Applications”, New York: John Wiley and Sons, Inc.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). Statistical and machine learning
forecasting methods: concerns and ways forward. PlosOne, 13(3), e0194889.
Nasurudeen P, Thimmappa K, Anil Kuruvila, Sendhil R and V Chandrasekar, ‘Forecasting of
Paddy Prices: A Comparison of Forecasting Techniques’, Market Forecasting Centre,
Department of Agricultural Economics, PJN College of Agriculture and Research
Institute, Karaikal, 2007.
Paul, R. K. (2012). Forecasting Using Markov Chain. New Delhi: Indian Agricultural Statistics
Research Institute.
Ramasubramanian, V. (2003). Forecasting Techniques in Agriculture. Agricultural and Food
Sciences, 1-15.
Yaseen, Z. M., El-Shafie, A., Afan, H. A., Hameed, M. M., Mohtar, W. H. M. W., & Hussain, A.
(2015). Rbfnn versus ffnn for daily river flow forecasting at Johor River, Malaysia.
Neural Computing and Applications, 27(6), 1533-1542.
Chapter-10
Emerging Trends and Technology for Data
Driven Market Research
R. Narayana Kumar
Principal Scientist and SIC, Madras Regional Station of ICAR-CMFRI, Chennai
Introduction
Market research has undergone a profound transformation over the decades, evolving from
rudimentary tabular analyses to the application of sophisticated econometric and statistical
models. This evolution reflects the growing complexity of markets and the need for more
nuanced insights into economic behaviors and trends. This chapter explores emerging trends
and technologies in data-driven market research, focusing on fisheries. This study area
provides a unique lens to examine advancements in market research methodologies and their
applications. Historically, market research relied heavily on basic tabular data and
straightforward statistical techniques to analyze market trends and consumer behavior. While
these methods offered valuable insights, they were often limited in capturing the intricacies
of market dynamics and price fluctuations. As markets became more complex and data more
abundant, the need for advanced analytical techniques became increasingly apparent. Today,
sophisticated econometric models and analytical tools enable researchers to conduct more
in-depth analyses and generate more accurate forecasts.
In the context of fisheries, understanding price behavior and market efficiency is crucial. Price
behavior encompasses how fish prices fluctuate in response to various factors such as supply,
demand, and external market conditions. Analyzing price behavior helps stakeholders in the
fisheries sector make informed decisions about pricing strategies, market-entry, and
inventory management. On the other hand, market efficiency refers to how well markets
adjust to changes and allocate resources effectively. It involves assessing the per cent share
of producers in the consumer rupee and analyzing market price indices across wholesale and
retail markets. The development of Fish Market Information Systems (FMIS) and Fish Price
Information Systems (FPIS) represents a significant advancement in the field. FMIS provides
real-time data on market arrivals, sales, and prices, facilitating better decision-making for
fishermen, wholesalers, and retailers. Similarly, FPIS offers detailed information on fish prices,
enabling stakeholders to track price trends and make informed trading decisions. These
systems enhance transparency and efficiency in the fish market, ultimately benefiting all
participants.
Emerging technologies have also played a crucial role in advancing market research
methodologies. For instance, ARIMA models and time series analysis offer powerful tools for
forecasting and understanding long-term trends in market data. ARIMA models, with their
capacity to account for autocorrelation and seasonal variations, provide valuable insights into
future price movements and market dynamics. Time series analysis allows researchers to
dissect data into its parts—trend, seasonal, cyclical, and irregular elements—enabling more
precise predictions and a better understanding of market behavior. Conjoint analysis is
another advanced technique that has gained prominence in market research. By examining
consumer preferences and choices, conjoint analysis helps researchers understand how
different product or service attributes influence consumer decisions. This method is
particularly useful for identifying factors driving demand and tailoring products or services to
meet consumer needs.
Integrating these advanced analytical methods with technologies like Fish Trade Platforms
(FTP) and E-Auction Platforms is transforming the fisheries sector. FTPs facilitate online
trading and auctioning of fish, enhancing market access and efficiency. By providing a real-
time transaction and price discovery platform, these systems support more effective
distribution and consumption of fish products.
Market research has evolved dramatically, driven by advancements in econometric models
and analytical technologies. Applying these methods in fisheries offers valuable insights into
price behavior, market efficiency, and consumer preferences. By leveraging technologies such
as FMIS, FPIS, and advanced analytical tools, stakeholders in the fisheries sector can gain a
deeper understanding of market dynamics and make more informed decisions. As market
research advances, integrating these emerging trends and technologies will play a crucial role
in shaping the future of economic analysis and decision-making.
immediate demand from wholesalers. Market research at the landing center involves
monitoring these factors to predict price trends and manage supply efficiently. Effective price
behavior analysis at this level helps in setting baseline prices and ensures that fishermen get
a fair initial return for their catch.
Wholesale Market (Points of First Sales)
The wholesale market is the next critical point in the supply chain where the first sales occur.
Here, wholesalers buy fish in bulk, distributing them to retailers or other intermediaries. At
this stage, prices are influenced by transportation costs, storage conditions, and bulk
purchase agreements. Wholesale price behavior is essential for understanding how bulk
purchases and logistical considerations impact the overall pricing structure. Market research
in this area focuses on optimizing supply chain efficiencies and identifying opportunities for
reducing costs to maximize profitability.
Retail Market (Points of Last Sales)
The retail market represents the final stage where fish reach the consumers. A range of
factors including consumer demand, competition, marketing strategies, and value-added
services determines prices here. Retail price behavior analysis helps in understanding
consumer preferences and buying patterns, which are crucial for setting competitive prices
and enhancing customer satisfaction. By studying retail market dynamics, businesses can
tailor their offerings to meet consumer needs more effectively and improve market share.
Marketing Efficiency
Marketing efficiency refers to how well the market functions in distributing products from
producers to consumers. It involves analyzing the percentage share of producers in the
consumer rupee and market price indices at both wholesale and retail levels.
Share of Producer in the Consumer Rupee
This metric indicates the proportion of the final retail price that goes back to the producers.
A higher share suggests a more efficient market where producers receive a fair return for
their products. Market research aims to maximize this share by identifying and eliminating
inefficiencies in the supply chain.
Market Price Indices
Market price indices provide a benchmark for tracking price movements over time at both
wholesale and retail levels. These indices help in comparing current prices with historical data
to identify trends, forecast future prices, and make informed decisions.
Wholesale Market Price Index
This index measures the price changes at the wholesale level, reflecting the cost dynamics involved in bulk purchasing and distribution.
Retail Market Price Index
This index tracks the price variations at the retail level, providing insights into consumer price
sensitivity and purchasing behavior. By analyzing these indices, market researchers can
develop strategies to stabilize prices, ensure fair returns for producers, and maintain
competitive pricing for consumers. This comprehensive understanding of price behavior and
marketing efficiency is essential for creating sustainable and profitable market systems in the
fisheries sector.
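Both measures are simple ratios once the underlying price series are available. The Python sketch below computes fixed-base wholesale and retail price indices and the producer's share in the consumer rupee from assumed monthly prices (all figures are illustrative).

```python
# A minimal sketch of fixed-base market price indices and the producer's share
# in the consumer rupee; all prices are illustrative assumptions (Rs/kg).
import pandas as pd

prices = pd.DataFrame({
    "producer":  [80, 82, 85, 90, 88],
    "wholesale": [100, 104, 108, 115, 112],
    "retail":    [140, 144, 150, 160, 158],
})

base = prices.iloc[0]                               # first period as the base (index = 100)
wholesale_index = prices["wholesale"] / base["wholesale"] * 100
retail_index = prices["retail"] / base["retail"] * 100

# Producer's share in the consumer rupee (%): producer price / retail price * 100.
producer_share = prices["producer"] / prices["retail"] * 100

print(pd.DataFrame({"wholesale_index": wholesale_index.round(1),
                    "retail_index": retail_index.round(1),
                    "producer_share_%": producer_share.round(1)}))
```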
conditions and demand. These platforms enhance the efficiency of fish distribution and
consumption by providing a transparent and competitive trading environment. They also
offer added utilities such as secure payment processing, logistics coordination, and
traceability features, which help maintain the quality and safety of fish products.
Development of a Fish Marketing Grid
Developing a fish marketing grid involves creating a comprehensive network that maps out
the fish flow from landing centers to final consumers. This grid includes detailed information
on market arrivals, sales, and prices on specific dates, providing a holistic view of the market
dynamics. The fish marketing grid helps stakeholders understand the supply chain's
bottlenecks and optimize their operations accordingly. It also aids in forecasting demand and
supply trends, allowing for better resource allocation and planning.
Market Arrivals
Market arrivals refer to the quantity of fish that is brought to market at any given time.
Tracking market arrivals is essential for understanding supply patterns and anticipating
potential surpluses or shortages. An efficient FMIS can provide real-time data on market
arrivals, enabling stakeholders to make strategic decisions about harvesting, purchasing, and
pricing.
Market Sales
Market sales data provides insights into the volume of fish sold at various points in the supply
chain, from wholesale to retail markets. This information is crucial for assessing market
demand and performance. By analyzing market sales data, stakeholders can identify trends,
adjust their marketing strategies, and improve their sales outcomes. An FMIS that includes
detailed sales data helps in creating a more responsive and adaptive market.
Price on the Date
Having accurate price information on specific dates is vital for making informed trading
decisions. The FPIS component of the FMIS ensures that stakeholders have access to up-to-
date price data, which reflects the current market conditions. This data helps set competitive
prices, negotiate deals, and plan future transactions.
Online Marketing
Online marketing is an integral part of modernizing the fisheries sector. By leveraging digital
platforms, stakeholders can reach a wider audience, engage with customers more effectively,
and enhance their sales channels. Online marketing strategies include promoting products through digital channels such as e-commerce platforms and social media.
Analytical Methods in Market Research
Market research employs various analytical methods to understand market dynamics,
forecast trends, and make informed decisions. These methods range from statistical models
to valuation techniques, providing unique insights into consumer behavior, market trends,
and economic value.
ARIMA Models
ARIMA (Auto-Regressive Integrated Moving Average) models are used for forecasting time
series data. They are particularly useful in market research for predicting future values based
on past trends. ARIMA models combine autoregression (AR), differencing (I), and moving
average (MA) to provide a comprehensive analysis of time series data. By identifying patterns
and making accurate forecasts, ARIMA models help businesses plan inventory, set prices, and
develop marketing strategies. In the fisheries market, for example, ARIMA models can predict future fish prices from historical data, helping stakeholders make informed decisions about production and sales.
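As an illustration of how such a forecast might be produced, the sketch below fits an ARIMA(1,1,1) model to a hypothetical monthly fish-price series using the statsmodels library; in practice the model order would be selected from ACF/PACF plots or information criteria rather than assumed.
```python
# Minimal ARIMA forecasting sketch; the price series is hypothetical.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

prices = pd.Series(
    [182, 190, 205, 198, 210, 225, 240, 232, 228, 245, 260, 255,
     250, 262, 278, 270, 284, 300, 315, 305, 298, 320, 335, 330],
    index=pd.date_range("2022-01-01", periods=24, freq="MS"),
    name="fish_price",
)

model = ARIMA(prices, order=(1, 1, 1))   # AR(1), first differencing, MA(1)
fit = model.fit()
print(fit.forecast(steps=6))             # price forecasts for the next six months
```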
Time Series Analysis
Time series analysis involves examining data points collected or recorded at specific time
intervals. This method helps identify trends, seasonal patterns, and cyclical movements in the
data. In market research, time series analysis is crucial for understanding how market variables
change over time. Researchers can gain insights into the underlying factors driving market
behavior by decomposing time series data into its constituent components (trend,
seasonality, and irregular variations). For instance, analyzing fish market sales data over
several years can reveal seasonal peaks and troughs, guiding marketing and production
planning.
Decomposition Analysis
Decomposition analysis breaks down time series data into trend, seasonal, cyclical, and
irregular components. This method helps isolate and understand the effects of different
factors on the overall data pattern. In market research, decomposition analysis is valuable for
identifying long-term trends and seasonal variations. For example, in the fish market, this
analysis can separate the impact of annual fish migrations (seasonal) from overall market
growth (trend), enabling more accurate forecasting and better strategic planning.
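A minimal sketch of such a decomposition, applied with statsmodels to a simulated monthly sales series (the data are generated for illustration, not drawn from an actual market), is shown below.
```python
# Classical decomposition sketch (trend + seasonal + irregular) using statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
dates = pd.date_range("2019-01-01", periods=48, freq="MS")
monthly_sales = pd.Series(
    100 + 0.8 * np.arange(48)                        # upward trend
    + 15 * np.sin(2 * np.pi * np.arange(48) / 12)    # 12-month seasonal cycle
    + rng.normal(0, 3, 48),                          # irregular component
    index=dates,
)

result = seasonal_decompose(monthly_sales, model="additive", period=12)
print(result.seasonal.head(12))       # estimated seasonal pattern for one year
print(result.trend.dropna().head())   # estimated trend (undefined at the series ends)
```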
Conjoint Analysis
Conjoint analysis is a survey-based statistical technique used to determine how people value
different product or service attributes. In market research, it helps identify the most
important features influencing consumer choices. By presenting respondents with different
product configurations and asking them to rank or choose between them, researchers can
determine the relative importance of each attribute. In the fisheries market, conjoint analysis
can reveal preferences for fish species, freshness, price, and packaging, helping businesses
tailor their offerings to meet consumer demands.
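The sketch below illustrates one simple variant, rating-based conjoint, in which part-worth utilities are estimated by ordinary least squares on dummy-coded attributes. The profiles, ratings, and attribute levels are hypothetical, and a real study would use a designed set of profiles and many respondents.
```python
# Rating-based conjoint sketch: part-worths estimated by OLS on dummy-coded attributes.
import pandas as pd
import statsmodels.formula.api as smf

profiles = pd.DataFrame({
    "species":   ["sardine", "mackerel", "shrimp", "sardine", "shrimp", "mackerel"],
    "freshness": ["fresh", "frozen", "fresh", "frozen", "frozen", "fresh"],
    "price":     [150, 180, 400, 140, 380, 200],   # Rs./kg
    "rating":    [7, 5, 9, 4, 6, 8],               # respondent's preference score
})

fit = smf.ols("rating ~ C(species) + C(freshness) + price", data=profiles).fit()
print(fit.params)   # part-worths for each attribute level and the price coefficient
```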
Consumer Choices/Preferences
Understanding consumer choices and preferences is fundamental to market research. This
involves studying how consumers make purchasing decisions, what factors influence their
choices, and how their preferences change over time. Analyzing consumer behavior helps
businesses develop products that meet market demand, create effective marketing
campaigns, and improve customer satisfaction. For example, by studying consumer
preferences in the fish market, businesses can identify popular fish species, preferred
packaging methods, and optimal price points, leading to more targeted and successful
marketing efforts.
Choice of Markets
The choice of markets refers to selecting target markets based on consumer demographics,
purchasing power, and market potential. Market research helps businesses identify the most
lucrative product or service markets. By analyzing market conditions, competition, and
consumer behavior, researchers can recommend which markets to enter or expand into. In
fisheries, choosing the right market involves understanding regional preferences,
consumption patterns, and market accessibility, ensuring that products reach the most
profitable and receptive audiences.
Choice of Marketing Channels
The choice of marketing channels involves selecting the most effective ways to reach and
engage with target customers. Market research identifies the channels that best match the
preferences and behaviors of the target audience. This can include traditional channels like
retail stores and wholesale markets, as well as digital channels like e-commerce platforms and
social media. In the fisheries market, choosing the right marketing channels ensures that fish
products are marketed effectively, reaching consumers through the most convenient and
accessible means.
Markov Chain Analysis
Markov Chain Analysis is a statistical method used to model random processes where future
states depend only on the current state. In market research, it captures shifting patterns in
consumption, sales, exports, and related parameters. This method is useful for predicting
customer behavior, such as brand switching and purchase frequency. In the fisheries market,
Markov Chain Analysis can track how consumer preferences shift between different fish
species over time, helping businesses anticipate changes in demand and adjust their
strategies accordingly.
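The mechanics can be illustrated with a small numerical sketch: a hypothetical transition matrix for three fish species is used to project current market shares one period ahead and to approximate the long-run shares.
```python
# Markov chain sketch with hypothetical switching probabilities and market shares.
import numpy as np

# P[i, j] = probability that a buyer of species i switches to species j next period
P = np.array([
    [0.70, 0.20, 0.10],   # sardine  -> sardine, mackerel, shrimp
    [0.15, 0.75, 0.10],   # mackerel -> ...
    [0.10, 0.10, 0.80],   # shrimp   -> ...
])
shares = np.array([0.40, 0.35, 0.25])     # current market shares

print("Next period shares:", shares @ P)
print("Long-run shares:", np.linalg.matrix_power(P, 50)[0])  # rows converge to the steady state
```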
Contingent Valuation Methods (WTP and WTA)
Contingent Valuation Methods (CVM) are survey-based techniques used to estimate the
economic value of non-market goods and services by asking people their willingness to pay
(WTP) for a benefit or willingness to accept (WTA) compensation for a loss. In market
research, CVM helps assess the value consumers place on environmental benefits, public
goods, or market changes. For instance, in the fisheries market, CVM can estimate consumers'
WTP for sustainably sourced fish or their WTA for the inconvenience of reduced fishing
during conservation periods.
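One common way to analyse dichotomous-choice CVM responses is to fit a logit model of acceptance against the bid amount and derive mean WTP from the estimated coefficients. The sketch below assumes hypothetical survey responses and the simple linear-in-bid specification.
```python
# Single-bounded dichotomous-choice CVM sketch: yes/no responses to randomly assigned
# bids for sustainably sourced fish. All responses are hypothetical.
import numpy as np
import statsmodels.api as sm

bids = np.array([20, 20, 40, 40, 60, 60, 80, 80, 100, 100, 120, 120])   # Rs. premium asked
yes  = np.array([1,  1,  1,  1,  1,  0,  1,  0,  0,   1,   0,   0])     # 1 = willing to pay

X = sm.add_constant(bids)
logit = sm.Logit(yes, X).fit(disp=0)
alpha, beta = logit.params          # intercept and bid coefficient (beta expected < 0)
mean_wtp = -alpha / beta            # mean WTP under the linear-in-bid logit model
print(f"Estimated mean WTP: Rs. {mean_wtp:.0f}")
```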
Hedonic Pricing
Hedonic pricing is an econometric method used to estimate the value of a good or service by
breaking down its price into constituent attributes. This method is commonly used in real
estate to value properties based on location, size, and amenities. In market research, hedonic
pricing helps determine how different product attributes contribute to overall price. In the
fisheries market, this could involve analyzing how factors like fish species, freshness, size, and
region of origin impact market prices, providing insights into what consumers value most and
how to price products competitively.
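A minimal sketch of a hedonic regression is shown below: the log of price is regressed on product attributes, so the coefficients approximate percentage price premiums. The transaction records and attribute names are hypothetical.
```python
# Hedonic pricing sketch: implicit attribute values recovered from a log-price regression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

sales = pd.DataFrame({
    "price":   [220, 180, 450, 400, 160, 500, 240, 210],   # Rs./kg
    "species": ["seer", "sardine", "shrimp", "shrimp", "sardine", "shrimp", "seer", "seer"],
    "fresh":   [1, 1, 1, 0, 0, 1, 0, 1],                    # 1 = fresh, 0 = frozen
    "size_g":  [800, 120, 30, 28, 110, 35, 750, 820],       # average piece size in grams
})

fit = smf.ols("np.log(price) ~ C(species) + fresh + size_g", data=sales).fit()
print(fit.params)   # approximate percentage premiums associated with each attribute
```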
These analytical methods in market research provide the tools and techniques necessary to
understand complex market dynamics, forecast trends, and make data-driven decisions. From
statistical models like ARIMA and Markov Chain Analysis to valuation methods like
contingent valuation and hedonic pricing, each method offers unique insights that help
businesses optimize their strategies and achieve better market outcomes. By leveraging these
analytical techniques, stakeholders in the fisheries market can enhance their understanding
of consumer behavior, improve market efficiency, and ensure sustainable and profitable
operations.
Market Price Method
The market price method uses the prices at which ecosystem products are bought and sold to estimate their economic value, providing insights into the value of these resources and the impacts of market changes or interventions. It is particularly useful for valuing changes in the
quantity or quality of a good or service. For instance, it can assess the economic impact of
environmental policies, such as seasonal closures of fishing areas, on both consumers and
producers. By understanding these impacts, policymakers can make informed decisions that
balance ecological sustainability with economic viability.
Steps in Market Price Method
The market price method involves a series of steps to estimate the economic value of
ecosystem products and assess the impacts of market interventions:
1. Estimation of Market Demand Function and Consumer Surplus Before Closure: The first
step involves calculating the market demand function and consumer surplus before any
intervention or market change. This requires analyzing historical market data to determine
how much consumers are willing to pay for a given quantity of fish.
2. Estimation of Demand Function and Consumer Surplus After Closure: Next, the demand
function and consumer surplus are recalculated after the intervention or market change. For
example, if a fishing area is closed for environmental restoration, the new demand function
and consumer surplus reflect the post-intervention market conditions.
3. Estimate the Loss in Economic Surplus to Consumer: The difference in consumer surplus
before and after the intervention is then determined. This loss represents the economic
impact on consumers due to reduced availability or increased fish costs.
4. Producers’ Surplus Before and After Closure: The producers’ surplus is calculated before
and after the intervention. This involves assessing changes in production costs, market prices,
and the quantity of fish sold.
5. Economic Loss Due to Closure: The consumer and producer surplus losses are summed to
estimate the total economic loss. This comprehensive measure provides a holistic view of the
economic impact of the market intervention, capturing both consumer and producer
perspectives.
A Hypothetical Situation
To illustrate the market price method, consider a hypothetical situation where a commercial
fishing area is closed seasonally to clean up pollution. The closure aims to improve
environmental conditions and, consequently, the quality and quantity of fish available in the
future. Here’s how the market price method would be applied in this context:
1. Estimation of Market Demand Function and Consumer Surplus Before Closure (A): Analyze
historical market data to determine the demand function and consumer surplus before the
closure. This reflects the market conditions when fishing activities are ongoing.
2. Estimation of Demand Function and Consumer Surplus After Closure (B): Recalculate the
demand function and consumer surplus after the closure, considering the expected
improvements in fish quality and availability.
3. Estimate the Loss in Economic Surplus to Consumer (D): Calculate the difference in
consumer surplus before and after the closure. This loss (D) represents the economic impact
on consumers due to the temporary reduction in fish supply.
4. Producers’ Surplus Before Closure (E): Assess the producers’ surplus before the closure by
analyzing production costs, market prices, and quantities sold under normal conditions.
5. Producers’ Surplus After Closure (F): Recalculate the producers’ surplus after the closure,
considering changes in production costs and market prices due to the temporary halt in
fishing activities.
6. Loss in Producers’ Surplus (G): Determine the difference in producers’ surplus before and
after the closure. This loss (G) captures the economic impact on producers due to the
intervention.
7. Economic Loss Due to Closure (H): Sum the losses in consumer and producer surplus (D
+ G) to estimate the total economic loss due to the closure. This comprehensive measure
helps policymakers evaluate the trade-offs involved in environmental interventions.
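A minimal numerical sketch of these steps, assuming a linear demand curve and a constant unit harvesting cost (all figures hypothetical), is given below; in this simple setup the closure sharply reduces the quantity marketed, which raises the price along the demand curve.
```python
# Numerical sketch of the surplus calculations above, assuming linear demand P = a - b*Q
# and a constant unit harvesting cost. All figures are hypothetical, in illustrative units.

def consumer_surplus(a, b, q):
    # Area between the linear demand curve and the market price, up to quantity q
    price = a - b * q
    return 0.5 * (a - price) * q

def producer_surplus(price, unit_cost, q):
    # Margin over a constant unit cost, multiplied by the quantity sold
    return (price - unit_cost) * q

a, b, unit_cost = 400.0, 2.0, 150.0      # demand intercept, demand slope, unit cost

q_before, q_after = 100.0, 20.0          # quantity marketed before and during the closure
p_before, p_after = a - b * q_before, a - b * q_after

cs_loss = consumer_surplus(a, b, q_before) - consumer_surplus(a, b, q_after)   # step 3 (D)
ps_loss = (producer_surplus(p_before, unit_cost, q_before)
           - producer_surplus(p_after, unit_cost, q_after))                    # step 6 (G)
print("Loss in consumer surplus (D):", cs_loss)               # 9600.0
print("Loss in producer surplus (G):", ps_loss)               # 800.0
print("Total economic loss (H = D + G):", cs_loss + ps_loss)  # 10400.0
```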
Interpretation
The final value obtained from the market price method helps compare the benefits of actions
that would allow the area to be re-opened against the costs of such actions. For instance, if
the economic loss due to the closure is significant, it may justify investments in pollution
control and environmental restoration to reopen the area and resume fishing activities. A
practical analogy can be drawn from maintaining a swimming pool in an apartment complex.
Suppose the cost of maintaining the pool is approximately ₹2 lakh per annum, but only a few
residents use it. The question arises of whether to continue maintaining the pool or collect
separate maintenance fees from those who use it. The market price method can help estimate
the economic value of the pool to the residents, considering their willingness to pay for its
use. If the economic benefits (willingness to pay) exceed the maintenance costs, continuing
the pool's operation would be justified. Otherwise, alternative arrangements may be
considered.
Finally, the market price method is a robust tool for estimating the economic value of
ecosystem products and assessing the impact of market interventions. By analyzing market
demand, consumer surplus, and producer surplus, this method provides a comprehensive view
of the economic implications of changes in the market, guiding informed decision-making for
sustainable and efficient resource management in the fisheries sector.
6. Estimate Demand Function for Visits to the Site: Develop a demand function that relates
the number of visits to the site with the travel costs and other influencing factors. This
function helps predict how changes in travel costs or site characteristics will impact visitation
rates.
7. Estimate the Economic Benefit to the Site (Consumer Surplus): Calculate the consumer
surplus, the area under the demand curve. The consumer surplus represents the total
economic benefit visitors derive from the site beyond what they pay.
Interpretation
The economic benefit estimated using the travel cost method is a benchmark for assessing
the site's value. If the costs of maintaining the site are lower than the estimated economic
benefits, it is worthwhile to continue investing in it. Conversely, if maintenance costs exceed
the benefits, it may be necessary to reconsider the site's management or explore additional
factors that could influence its value. For instance, if a recreational site incurs significant
maintenance costs but attracts many visitors willing to pay substantial travel expenses to
access it, the travel cost method would justify the site's continued operation. However, if the
site is underutilized and the economic benefits are minimal, alternative management
strategies or site improvements may be needed to enhance its value and appeal.
Estimating the Economic Value of Recreational Sites
The travel cost method estimates the economic value of ecosystems or sites used for
recreation by considering the travel expenses of visitors as a proxy for their willingness to pay
(WTP). This method can be used to evaluate the effects of changes in visiting fees, site closures, or environmental quality.
Steps in Travel Cost Method
1. Define Zones Surrounding the Site: Establish zones based on the distance from the site.
2. Number of Visitors and Visits from Each Zone: Collect data on the number of visitors and
visits from each zone.
3. Estimate Visitation Rates: Calculate visitation rates per 1,000 population in each zone.
4. Calculate Travel Distance and Time: Determine each zone's round-trip travel distance and
time.
5. Variables Influencing Travel Costs: Identify variables affecting per capita travel costs.
6. Estimate Demand Function for Visits: Develop a demand function for site visits.
7. Estimate Economic Benefit to the Site: Calculate the consumer surplus as the area under
the demand curve.
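A compact numerical sketch of the zonal version of these steps is shown below; zone populations, visit counts, and travel costs are hypothetical, and the consumer surplus is approximated as the area under the simulated visit-demand curve.
```python
# Zonal travel cost sketch: visitation rates are regressed on travel cost, and consumer
# surplus is approximated as the area under the simulated demand curve for visits.
import numpy as np

population  = np.array([50_000, 80_000, 120_000, 200_000])   # people living in zones 1-4
visits      = np.array([6_000, 6_400, 4_800, 4_000])         # annual visits from each zone
travel_cost = np.array([100.0, 250.0, 450.0, 700.0])         # round-trip cost in Rs.

rate = 1000 * visits / population                    # visits per 1,000 population (step 3)
slope, intercept = np.polyfit(travel_cost, rate, 1)  # linear visitation-rate function (step 6)

# Trace the demand curve for visits by adding hypothetical entry fees to travel costs,
# then approximate consumer surplus as the area under that curve (step 7).
fees = np.arange(0, 801, 50)
total_visits = np.array([
    np.clip(intercept + slope * (travel_cost + fee), 0, None) @ (population / 1000)
    for fee in fees
])
consumer_surplus = np.sum((total_visits[:-1] + total_visits[1:]) / 2 * np.diff(fees))
print(f"Approximate annual consumer surplus: Rs. {consumer_surplus:,.0f}")
```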
Interpretation
The economic benefit derived from the travel cost method serves as a benchmark for
assessing the site's value. If maintenance costs are lower than the benefits, it justifies
continuing the site's operation.
Case Studies
Willingness to Pay (WTP) for Clam Fisheries Management Programme (CFMP)
A survey was conducted to evaluate the economic value and effectiveness of the Clam
Fisheries Management Programme (CFMP) and estimate stakeholders' willingness to pay
(WTP). This survey aimed to understand how much clam fishers and associated stakeholders
are willing to invest in the CFMP and to assess the program's impact on their livelihoods and
the sustainability of the fishery.
Respondents reported a range of benefits from the programme:
• Consistent Market: Approximately 32.5% of respondents experienced access to a
more consistent market for their produce. This stability facilitated better planning
and financial security.
• Increased Net Operating Income per Trip: Around 25% of respondents saw an
increase in net operating income per trip, reflecting improved efficiency and
profitability due to better management practices.
• Enhanced Domestic Savings: The program led to a 22.5% increase in domestic
savings among stakeholders, allowing them to meet planned needs and invest in
other areas of their lives.
• Premium Prices for Produce: About 20% of respondents received premium prices
for their produce, enhancing their revenue and market competitiveness.
• Sustainable Income: Approximately 18% of stakeholders reported achieving a
sustainable income post-CFMP, underscoring the program's effectiveness in
stabilizing their financial situation.
Location | Craft / stakeholder category | Ban days: 30 | 45 | 60 | 90 | 120
Mangalore | Motorised gillnet boat owner | NA | 5,500 | 15,000 | 4,663 | NA
Mangalore | Non-motorised boat owner | NA | NA | 1,333 | 1,430 | 3,000
Rameswaram | Traditional (in Rs.) | 333 | 606 | 976 | 1,278 | 1,725
Rameswaram | Motorised (in Rs.) | 652 | 1,265 | 1,838 | 2,394 | 3,352
Chennai | Motorised boat owner (in Rs.) | NW | NW | NW | NW | NW
Chennai | Motorised boat crew (in Rs.) | NW | NW | NW | NW | NW
Kakinada | Motorised (in Rs.) | 175 | 281 | 484 | 5,678 | 6,177
Nizampatnam | Non-motorised (in Rs.) | NW | NW | NW | NW | NW
Mangalore | Mech. purse-seiners (in Rs.) | NW | 1,500* | NW | 9,000 | NW
Mangalore | Trawl boat owner (in Rs.) | NW | 23,500* | NW | 10,000 | NW
Mangalore | Trawl boat labour (in Rs.) | NW | 2,692 | 4,500 | 1,000 | NW
Conclusion
Recent advancements in econometric models and analytical tools have significantly
transformed the landscape of economic research. Historically, researchers relied on basic
tabular analysis to understand various phenomena. However, the evolution of sophisticated
analytical methods, such as ARIMA models and time series analysis, has greatly enhanced our
ability to interpret complex data. These advancements have broadened the scope of economic
inquiry and increased the data requirements needed for applying these methods effectively.
Modern applications like R, SPSS, and SAS offer powerful statistical analysis and
econometrics capabilities. While these tools provide advanced functionalities and insights,
users must develop specific skills for effective utilization. Some software applications are
designed to be user-friendly, whereas others necessitate more intensive practice and
expertise. This shift highlights the need for a deeper understanding of analytical techniques
and the ability to navigate complex software environments.
Furthermore, developing comprehensive questionnaires and data collection methods has kept
pace with these analytical advancements. Accurate data collection is essential for leveraging
these advanced tools effectively. Synchronizing data analysis technologies with data
collection processes ensures that research findings are robust and reliable.
Ultimately, interpreting results from these advanced analyses demands meticulous attention
and expertise. Integrating sophisticated analytical tools with careful data collection and
interpretation practices will be crucial for producing meaningful and actionable economic
insights as the field progresses. This holistic approach will enable researchers to tackle
complex economic questions with greater precision and relevance.
Acknowledgment
The author acknowledges the invaluable contribution of Dr. V. Chandrasekar, Senior Scientist
at ICAR-CIFT, Kochi-29, for his involvement in editing this book chapter titled "Emerging
Trends and Technology for Data-Driven Market Research."
Chapter-11
Data Visualization for Data Science
Chandrasekar V 1, Ramadas Sendhil 2 and V. Geethalakshmi 3
1&3 ICAR-Central Institute of Fisheries Technology, Cochin, India
2 Department of Economics, Pondicherry University (A Central University), Puducherry, India.
Introduction
Data visualization is a valuable tool that converts raw data into graphical formats, making
complex information easier to comprehend and access. Well-designed visualizations uncover
trends, patterns, and insights that might not be obvious from raw data. In today’s data-driven
environment, proficiency in data visualization is crucial for those involved in data analysis, as
it facilitates decision-making by presenting information clearly and engagingly. This chapter
covers the basics of data visualization, explores different techniques and tools, and offers
practical examples, including how to create dynamic visualizations using dashboards on real-
time web pages through APIs.
Types of Data Visualizations
Charts
Bar Charts: Ideal for comparing discrete categories. For example, a bar chart showing the
number of products sold by different departments allows for easy comparison of sales
performance across departments.
Line Charts: Ideal for illustrating trends over time. For example, a line chart showing monthly
website traffic over the course of a year can highlight patterns such as seasonal increases or
decreases.
Pie Charts: Effective for displaying proportions. For instance, a pie chart depicting the market
share of different smartphone brands visualizes each brand's contribution to the overall
market.
Scatter Plots: Excellent for illustrating relationships between two variables. For example, a
scatter plot comparing advertising spend to sales revenue can show whether a correlation
exists between the two variables.
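For readers working in Python, the short sketch below draws two of these chart types, a bar chart and a scatter plot, with matplotlib using made-up figures.
```python
# Quick matplotlib sketch of a bar chart and a scatter plot with hypothetical data.
import matplotlib.pyplot as plt

departments = ["Electronics", "Clothing", "Grocery", "Toys"]
units_sold = [420, 310, 880, 150]

ad_spend = [10, 15, 20, 25, 30, 35]        # advertising spend (Rs. lakh)
revenue = [95, 130, 150, 185, 210, 228]    # sales revenue (Rs. lakh)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(departments, units_sold)           # bar chart: comparison of discrete categories
ax1.set_title("Units sold by department")
ax2.scatter(ad_spend, revenue)             # scatter plot: relationship between two variables
ax2.set_xlabel("Advertising spend")
ax2.set_ylabel("Sales revenue")
ax2.set_title("Ad spend vs revenue")
plt.tight_layout()
plt.show()
```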
Graphs
Histograms: Display the distribution of data. For example, a histogram showing the
distribution of customer ages in a retail store can reveal the most frequent age groups among
customers.
Box Plots: Visualize the distribution based on quartiles. For instance, a box plot showing
students' test scores can highlight median performance, variability, and outliers.
Advanced Visualizations
Heat Maps: Represent data values using color gradients. For example, a heat map displaying
website user activity by time and day of the week can highlight peak usage periods.
Bubble Charts: Extend scatter plots by adding a third dimension. For instance, a bubble chart
showing countries’ GDP versus life expectancy, with bubble size representing the population,
can reveal complex relationships between these variables.
Treemaps: Display hierarchical data as nested rectangles. For example, a treemap visualizing
a company's budget allocation across departments can show how resources are distributed.
Tools for Data Visualization
Tableau
Tableau excels in creating interactive and dynamic visualizations. It enables users to link to a
variety of data sources, ranging from spreadsheets to databases, and convert this data into
meaningful visual representations. For example, you can create a dashboard displaying real-
time sales data with interactive filters, allowing users to drill down into specific regions or
periods.
Example: Imagine a retail company tracking its sales performance across different regions.
Using Tableau, you can create a dashboard that includes:
• A map highlighting sales volume by region.
• A line chart showing monthly sales trends.
• Bar charts comparing product performance across regions.
Interactive filters can be added to allow users to focus on specific time frames or product
categories, enabling a deeper understanding of sales dynamics.
Power BI
Power BI integrates with various data sources and provides robust visualization capabilities.
It's particularly strong in creating comprehensive reports that include various types of
visualizations, including bar charts, pie charts, and maps, all connected to a central dataset.
This integration allows for a seamless data flow and consistent updates.
Example: Consider an organization that needs to report on marketing campaign effectiveness.
With Power BI, you can create a report that includes:
• A pie chart showing the distribution of marketing spend across different channels.
• Bar charts comparing the number of leads generated by each campaign.
• A map indicating the geographic distribution of customer acquisition.
These visualizations can be dynamically updated as new data comes in, providing real-time
insights into campaign performance.
R and Python
R and Python are powerful for advanced visualization tools. These programming languages
offer extensive libraries and packages that enable complex data manipulation and
visualization.
R: Using R's `ggplot2` package, you can create a complex scatter plot with multiple layers,
including trend lines and error bands, to provide a detailed data analysis.
Example: A data scientist studying the relationship between advertising spend and sales
revenue might use `ggplot2` to create a scatter plot. The plot could include:
- Points representing individual data points.
- A smooth trend line to indicate the general relationship.
- Error bands to show the variability around the trend line.
Python: Python's `matplotlib` and `seaborn` libraries are equally powerful for creating
intricate visualizations.
Example: In Python, a data analyst might use `seaborn` to create a heatmap showing
correlations between different financial metrics. This heatmap could:
• Use color gradients to indicate the strength of correlations.
• Include annotations for exact correlation values.
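A minimal sketch of such a heatmap, using simulated financial metrics rather than real data, might look like this:
```python
# Correlation heatmap sketch with seaborn; the financial metrics are simulated.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["revenue", "costs", "margin", "ad_spend"])

sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between financial metrics")
plt.show()
```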
Leveraging Interactivity
Interactive visualizations engage users and allow them to explore data. For example, a
dashboard with interactive filters lets users view data for specific periods or regions.
Ensuring Accessibility
Design visualizations for accessibility. For instance, use color-blind-friendly palettes and
provide alternative text descriptions to make visualizations accessible to all users.
Creating Real-Time Dashboards Using APIs
Incorporating real-time data into visualizations can significantly enhance their value by
providing up-to-date insights. This can be achieved by creating dashboards that integrate
with APIs to pull live data.
Example: Real-Time Sales Dashboard
1. Data Source: Connect to a real-time data source, such as a sales database or an online
sales platform's API, to fetch current sales data.
2. Dashboard Design: Use a tool like Tableau, Power BI, or a custom web application with
libraries like D3.js to design a dashboard that includes:
• Real-time sales figures.
• Interactive charts and graphs (e.g., bar charts, line charts).
• Filters to drill down into specific products, regions, or periods.
3. API Integration: Integrate the dashboard with the API to fetch live data regularly. For
example, using a web application framework like Flask (Python) or Express (Node.js), you can
set up endpoints that query the sales database/API and return the latest data to the
dashboard.
4. Visualization: Update the visualizations dynamically as new data is fetched. Use JavaScript
libraries like D3.js or Chart.js to render the visualizations on a web page and ensure they
update in real time without reloading the page.
5. Example:
```python
# Example using Flask and D3.js
from flask import Flask, jsonify, render_template
import requests

app = Flask(__name__)

@app.route('/api/sales')
def get_sales_data():
    # Query the sales platform's API and pass the latest figures to the browser
    response = requests.get('https://api.salesplatform.com/latest-sales')
    return jsonify(response.json())

@app.route('/')
def index():
    # Serve the dashboard page, which polls /api/sales from the client side
    return render_template('dashboard.html')

if __name__ == '__main__':
    app.run(debug=True)
```
```html
<!-- dashboard.html -->
<!DOCTYPE html>
<html>
<head>
  <title>Real-Time Sales Dashboard</title>
  <script src="https://d3js.org/d3.v5.min.js"></script>
</head>
<body>
  <div id="sales-chart"></div>
  <script>
    async function fetchSalesData() {
      const response = await fetch('/api/sales');
      const data = await response.json();
      updateChart(data);
    }

    function updateChart(data) {
      // Use D3.js to create or update the chart with the new data
    }

    setInterval(fetchSalesData, 60000); // Update every minute
    fetchSalesData();                   // Initial load
  </script>
</body>
</html>
```
This example demonstrates creating a real-time sales dashboard using a web framework and
JavaScript library. Integrating with an API, the dashboard continuously updates, providing the
latest insights into sales performance.
Conclusion
Data visualization is a crucial tool for interpreting and communicating data. By applying
principles of clarity, accuracy, simplicity, and consistency and utilizing appropriate tools and
techniques, individuals can create compelling visualizations that provide valuable insights.
Mastery of data visualization enhances data analysis and supports effective decision-making
in various fields. Through tools like Tableau, Power BI, R, and Python, and integrating real-
time data using APIs, data can be transformed into meaningful visual stories that drive better
understanding and action.
Extension Information & Statistics Division
ICAR - Central Institute of Fisheries Technology
(Indian Council of Agricultural Research, New Delhi)
Ministry of Agriculture and Farmers Welfare, Government of India