Econometrics Notes

The document provides an introduction to econometrics, detailing its significance in linking economic theory with empirical analysis through statistical methods. It covers various econometric concepts including linear and multiple regression models, assumptions of classical linear regression, and the use of econometric software for data analysis. Additionally, it outlines the methodology of econometrics, including hypothesis formulation, data collection, and model estimation, along with practical applications in forecasting and policy analysis.

College adda

WhatsApp no:- 9667836336 YouTube channel:-College adda


Lesson:-1. Introduction to Econometrics
Lesson:-2. Linear Regression model
Lesson:-3. Multiple Regression model
Lesson:-4. Dummy Variables
Lesson:-5. Multicollinearity and Heteroskedasticity
Lesson:-6. Autocorrelation
Lesson:-7. Specification Bias

Lesson:-1, Introduction to Econometrics


Learning objectives
1. Comprehend the significance of econometrics in bridging
economic theory with empirical analysis through statistical
inference.
2. Learn to assess and interpret the statistical models utilized in
econometrics to analyze economic phenomena.
3. Acquire the skills to apply econometric tools for forecasting and
policy analysis in real-world economic scenarios.
Introduction to Econometrics
Econometrics, initially grounded in the concept of “economic
measurement,” goes beyond simple quantification. It represents a
synthesis of mathematical statistics, economic theory, and the
analysis of empirical data.
This interdisciplinary field employs mathematical and statistical
methods to validate economic models, interpret real-world
phenomena, and extract quantitative insights.
Rationale Behind Having a Separate Discipline
Econometrics, a combination of economics, mathematics, and statistics, emerges as a distinct field with its own importance. Economic theory elucidates the functioning of the economy in broad strokes, offering insights such as increased consumer activity when prices decline. However, it falls short of providing precise numerical values. In this void, econometrics assumes a pivotal role by employing mathematical and statistical methodologies to ascertain these exact numerical parameters.
Mathematical economics translates economic theories into precise equations. Subsequently, econometrics assumes the responsibility of scrutinizing and validating these equations through the examination of authentic data. Economic statistics serves as the repository of this data, meticulously gathering and arranging it.
Methodology of Econometrics
1. Formulating a Theory or Hypothesis: We begin by articulating a theory or hypothesis about an economic phenomenon. According to Keynes's theory of consumption, individuals tend to increase their spending as their income rises, but not to the same extent as the increase in income itself. He essentially posited that the marginal propensity to consume (MPC), representing the rate at which spending changes for every unit increase in income (e.g., a dollar), stands above zero yet falls short of 1.
2. Defining the Mathematical Representation of the Theory: Although Keynes suggested a positive link between consumption and income, he did not specify the exact nature of their functional relationship. To simplify, a mathematical economist might propose the following form of the Keynesian consumption function:
Y = β1 + β2X, where 0 < β2 < 1   (Equation 1.1)
In this equation, Y represents consumption expenditure, X stands for income, and the parameters β1 and β2 act as the intercept and slope coefficients, respectively. The slope coefficient β2 serves as a measure of the MPC. Equation 1.1 can be depicted as shown in Figure 1.1, illustrating a linear relationship between consumption and income. This equation, known as the consumption function in economics, presents a mathematical model demonstrating the connection between consumption and income.
3. Developing the Statistical or Econometric Model: The consumption function represented purely by the mathematical equation in Equation 1.1 holds limited interest for the econometrician. This model assumes an exact or deterministic link between consumption and income. However, real-world economic relationships tend to be imprecise. If data were collected on consumption expenditure and disposable income from, say, 500 American families and plotted on a graph, they would not perfectly align with the straight line described by Equation 1.1. This divergence occurs because factors beyond income, such as family size, ages of family members, religious beliefs, and more, also influence consumption. The econometric model therefore becomes:
Y = β1 + β2X + u   (Equation 1.2)
Here, u, termed the disturbance or error term, is a random (stochastic) variable possessing well-defined probabilistic properties. This error term u encapsulates various unaccounted factors impacting consumption.
4. Collecting Relevant Data: To derive the numerical values of β1 and β2 in the econometric model illustrated by Equation 1.2, data become imperative. We will delve deeper into the significance of data for economic analysis in the following chapter.
5. Estimating the Parameters within the Econometric Model:
With the data in hand, our subsequent objective is to calculate the
parameters within the consumption function. These numerical
values attributed to the parameters offer empirical substance to
the consumption function.
6. Testing the Formulated Hypotheses: Supposing that the
fitted model provides a reasonably accurate representation of
reality, it becomes crucial to establish appropriate criteria for
assessing whether the estimates, as presented in Equation 1.3,
align with the expectations of the theory being tested. According
to “positive” economists like Milton Friedman, a theory or
hypothesis that lacks verifiability through empirical evidence
might not be considered a viable component of scientific inquiry.
The process of confirming or disproving economic theories based
on sample evidence relies on a branch of statistical theory termed
statistical inference (hypothesis testing). Throughout this book,
we will delve into the actual mechanics of this inference process.
7. Making Forecasts or Predictions: If the selected model does not disprove the hypothesis or theory in consideration, we can employ it to forecast future values of the dependent variable, denoted as Y, based on known or expected future values of the predictor variable, represented as X.
Another use of the estimated model (Equation 1.3) comes into play when considering the potential impact of policy changes. For instance, suppose the president proposes an income tax reduction. If, in response to this policy change, investment expenditure increases, macroeconomic theory indicates that the change in income resulting from a dollar change in investment expenditure is given by the income multiplier (M):
M = 1/(1 - MPC)
The estimation of the MPC plays a pivotal role in this calculation, as the multiplier hinges upon it. Regression models like Equation 1.3
facilitate obtaining this MPC estimate, offering valuable insights
for policy considerations. With knowledge of MPC, one can
forecast the future trajectory of income, consumption expenditure, and employment following alterations in the government's fiscal policies.
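For illustration only (these numbers are hypothetical and not from the original notes): if the estimated MPC were 0.7, the income multiplier would be M = 1/(1 - 0.7) ≈ 3.33, so a one-dollar increase in investment expenditure would ultimately raise income by roughly 3.33 dollars.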
8. Applying the Model for Control or Policy Purposes: In the final step, consider the hypothetical scenario in which we possess the estimated consumption function as presented in Equation 1.3. Such computations suggest that an estimated model holds utility for control or policy purposes. Through a suitable blend of fiscal and monetary policy strategies, the government can manipulate the control variable X to achieve the desired level of the target variable Y.
The Nature of Regression Analysis
Regression Analysis: It plays an important role in econometrics, serving as a central tool for analyzing, quantifying, and interpreting relationships between economic variables. Its primary function lies in modeling and understanding interactions between dependent and independent variables within economic systems. By employing regression models, econometricians can assess how changes in one or more economic factors affect outcomes such as prices, demand, GDP, employment, and more.
Single Regression Equation: The following model represents the simplest form of a regression equation with two variables, that is, one dependent and one independent variable:
Y = β0 + β1X + ε
where Y is the dependent variable, X is the independent variable, β0 is the intercept (the value of Y when X is zero), and β1 is the coefficient indicating the change in Y for a one-unit change in X.
Multiple Linear Regression Equation: Multiple linear regression is a statistical technique used to analyze the relationship between a dependent variable and two or more independent variables.
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
1. Y represents the dependent variable being predicted or explained.
2. X1, X2, …, Xn denote the independent variables or predictors.
3. β0 is the intercept, the expected value of Y when all predictors are zero.
4. β1, β2, …, βn are the coefficients that quantify the impact of each independent variable on Y.
5. ε denotes the error term, representing the difference between the observed and predicted values not explained by the predictors.
Regression equation terms
Dependent variable (Y) is the variable being predicted or explained by the independent variables. It is influenced by changes in the independent variables. Independent variables (X) are the
variables that influence or predict changes in the dependent
variable. In a simple regression, there is one independent variable
(X), while in multiple regressions, there are multiple independent
variables (X1, X2, X3, etc.).
Parameters/Coefficients: These are the values estimated by the
regression model to explain the relationship between the
dependent and independent variables.
1. β0 (intercept): It represents the value of the dependent variable
when all independent variables are zero.
2. β1 (slope coefficient): It signifies the change in the dependent
variable for a unit change in the independent variable.
3. For multiple regression with more independent variables, β2,β3,
and so forth represent additional coefficients corresponding to
each independent variable.
Error term:
The error term embodies various sources of variability or factors beyond those included in the model. These unobserved or
unaccounted-for elements might include measurement errors,
unmeasured variables, randomness, or inherent variability in the
phenomenon being studied.
Introduction to Econometric software
Econometric software provides a gateway to the practical
application of statistical methods in economics, offering powerful
tools for data analysis, model estimation, and hypothesis testing.
Eviews
It stands as a comprehensive econometric software package renowned for its strength in time-series analysis and econometric modeling. It offers a user-friendly interface, facilitating data management, estimation of various econometric models, and insightful interpretation of results.
Stata
STATA is a powerful statistical software extensively used for data
analysis, statistical modeling, and visualization in various fields,
including economics. Renowned for its versatility, STATA supports
a wide range of statistical techniques, from basic descriptive
analysis to complex econometric modeling and panel data
analysis. Its user-friendly interface, command-driven structure,
and extensive capabilities for data management and
manipulation make it a preferred choice for
researchers,economists, and social scientists aiming to conduct
sophisticated statistical analyses and generate publication quality
results.
GRETL
GRETL (Gnu Regression, Econometrics, and Time-series Library) is
a user-friendly econometrics software designed for statistical
analysis, econometric modeling, and time-series analysis. It offers
an intuitive graphical interface combined with a powerful
scripting language, enabling users to perform various statistical
tasks, including data processing, regression analysis, time-series
modeling, and forecasting.
GRETL’s simplicity and extensive range of econometric tools make
it suitable for students, researchers, and practitioners seeking a
user-friendly platform for econometric analysis and modeling.
R Software
R is an open-source programming language and software environment primarily used for statistical computing and graphics. It provides a wide range of statistical techniques, data analysis tools, and visualization capabilities, making it popular among researchers, statisticians, and analysts for conducting diverse statistical analyses and generating visual representations of data.
Lesson:-2, Linear regression model
Learning objectives
Define linear regression and understand the basic
concepts.
Understand the properties of various estimators.
Identify the assumptions of linear regression and understand their importance.
Linear regression is a fundamental statistical method used for
modeling the relationship between a dependent variable and one
or more independent variables. It is a supervised learning
algorithm that aims to establish a linear relationship between the
input features and the target variable.
Applications
Linear regression finds applications in various fields, including
Economics: Modeling the relationship between variables like
income and expenditure.
Finance: Predicting stock prices based on various factors.
Biology: Analyzing the impact of variables on biological processes.
Social Sciences: Studying factors influencing human behavior
OLS Estimation
Ordinary least squares (OLS) is a foundational statistical tool used
to uncover unknown parameters within a linear regression model.
By minimizing the sum of squared vertical distances between
observed responses and those predicted by a linear
approximation, OLS derives estimations. This method often yields
a straightforward formula, particularly in scenarios with a single
regressor on the right-hand side.When the regressors remain
exogenous and multicollinearity is absent, the OLS estimator
maintains consistency.
The classical linear regression model (CLRM) is the foundation for ordinary least squares (OLS) estimation in econometrics. It
provides the theoretical framework upon which OLS is built,
allowing economists and researchers to estimate and interpret
relationships between variables. CLRM underlies the assumptions
and structure necessary for applying OLS to empirical data.
Assumptions of CLRM
1. Linearity: CLRM assumes a linear relationship between the dependent variable and the explanatory variables. It is expressed as:
Y = β0 + β1X1 + β2X2 + ... + βpXp + ε
2. Assumptions about Errors
(a) Zero Conditional Mean (Expectation of Errors): E(ε | X1, X2, ..., Xk) = 0
This means that, on average, the errors have an expected value of zero given the values of the explanatory variables.
(b) Homoscedasticity of Errors (Constant Variance): Var(ε | X1, X2, ..., Xk) = σ²
The variance of the errors remains constant across all levels of the explanatory variables.
(c) Independence of Errors: Cov(εi , εj)=0 where i≠j
Errors in one observation are not related to errors in other
observations.
3.No Perfect Multicollinearity: CLRM requires that the explanatory
variables are not perfectly linearly related to each other.
4. Random Sampling: Data used in the analysis are obtained
through a random sampling process, ensuring that the sample
represents the population of interest.
5. Normally Distributed Errors (Optional): While not strictly
required, assuming normally distributed errors allows for specific
statistical properties and hypothesis testing under normality
assumptions.
Properties of estimation
Estimation in the context of linear regression involves finding the
values for the coefficients that best fit the model to the observed
data. The most common method for this purpose is the method of
least squares. This method minimizes the sum of the squared
differences between the observed values and the values predicted
by the regression equation.
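As an illustration of how this least-squares estimation is carried out in practice, the sketch below (not part of the original notes) fits a simple consumption-style regression in Python with the statsmodels library; the data, variable names, and coefficient values are invented for demonstration only.

```python
# A minimal sketch (not from the notes): OLS estimation of a consumption
# function with statsmodels. All numbers and names are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
income = rng.uniform(2000, 8000, size=100)                  # hypothetical income data
consumption = 500 + 0.7 * income + rng.normal(0, 300, 100)  # true MPC set to 0.7

X = sm.add_constant(income)        # adds the intercept column
results = sm.OLS(consumption, X).fit()

print(results.params)              # estimated intercept and slope (MPC)
print(results.summary())           # t-statistics, F-statistic, R-squared, etc.
```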

F-test
The F-test in linear regression is used to assess the overall significance of the model. It compares the fit of the estimated model with a model that has no independent variables. The null hypothesis for the F-test is that all coefficients of the independent variables are zero, meaning the model has no explanatory power. The test statistic is
F = (ESS/k) / (RSS/(n - k - 1))
where:
1) RSS is the residual sum of squares from the model estimated with the independent variables.
2) ESS is the explained sum of squares, the variation in the dependent variable accounted for by the independent variables (relative to the null model with only an intercept).
3) k is the number of independent variables in the model.
4) n is the number of observations.
Steps for Conducting an F-test in Linear Regression
1. Formulate Hypotheses
Null hypothesis (H0): All coefficients are zero (model has no
explanatory power).
Alternative Hypothesis (H1): At least one coefficient is nonzero
(model is significant).
2. Calculate the F-statistic: Obtain the residual sum of squares (RSS) and the explained sum of squares (ESS) from the regression output and compute F as above.
3. Determine Degrees of Freedom
Degrees of freedom for the numerator is k (number of coefficients
being tested).
Degrees of freedom for the denominator is n- k-1
(Total sample size minus the number of coefficients and 1).
4. Compare with Critical Value or P-value
Use the F-statistic to look up a critical value from an F-distribution
table or compare it with a significance level.
Alternatively, obtain the P-value associated with the F-statistic. If
the P-value is less than the chosen significance level (commonly
0.05), you reject the null hypothesis.
If the F-test is significant, it indicates that at least one
independent variable contributes significantly to explaining the
variability in the dependent variable.
It is crucial for assessing the overall significance of a regression
model. If the F-statistic is significant, it suggests that the model
explains a significant amount of variance in the dependent
variable, and at least one predictor variable is contributing to the
model’s explanatory power.
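To connect these steps with software output, the short sketch below (an illustration, not from the notes) rebuilds the F-statistic from the sums of squares reported by a fitted statsmodels OLS result; it assumes a `results` object such as the one estimated in the earlier sketch.

```python
# Sketch: recomputing the overall F-test from a fitted statsmodels OLS result.
ess = results.ess              # explained sum of squares
rss = results.ssr              # residual sum of squares
k = results.df_model           # number of slope coefficients (numerator df)
df_resid = results.df_resid    # n - k - 1 (denominator df)

F = (ess / k) / (rss / df_resid)
print(F, results.fvalue)       # the manual value matches the reported F-statistic
print(results.f_pvalue)        # reject H0 if below the chosen significance level
```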
Goodness of Fit
R-squared (R2)
R-squared (R²) is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variables in a linear regression model. It ranges from 0 to 1, where a higher R² indicates a better fit of the model to the data.
For linear regression:
R² = 1 - RSS/TSS
RSS is the residual sum of squares (the sum of the squared differences between the observed and predicted values of the dependent variable).
TSS is the total sum of squares, which measures the total variance of the dependent variable.
Interpreting R²
1. R2 = 0: The model does not explain any variability in the
dependent variable.
2. R2 = 1: The model perfectly explains the variability in the
dependent variable.
3. 0 < R2 < 1: Indicates the proportion of variability explained by
the model. For example, an R2 of 0.75 means that 75% of the
variance in the dependent variable is explained by the
independent variables.
Limitations of R2
1.R2 does not indicate whether the coefficients are statistically
significant.
2. It does not provide information about the goodness of fit for
models with different numbers of independent variables.
3. High R2 does not necessarily mean a causal relationship.
4. R2 may increase with addition of irrelevant variables (overfitting)
Testing of Hypothesis
Hypothesis testing in linear regression, whether simple or
multiple, involves assessing the significance of the regression
coefficients and overall model fit.
The two primary hypotheses commonly tested are related to the
individual coefficients (slope parameters) and the overall
significance of the model.
Hypothesis Testing for Simple Linear Regression
1. Testing Individual Coefficients (β0 and β1)
Null Hypotheses
1. H0: β0 = 0 (the intercept is equal to zero)
2. H0: β1 = 0 (the slope is equal to zero)
Alternative Hypotheses
1. H1: β0 ≠ 0 (or H1: β0 > 0 or H1: β0 < 0)
2. H1: β1 ≠ 0 (or H1: β1 > 0 or H1: β1 < 0)
Test Statistic: The t-statistic is used for testing individual
coefficients.
Decision Rule: Reject the null hypothesis if the P-value is less
than the chosen significance level (e.g., 0.05).Below is an example
of hypothesis testing for simple linear regression.
2. Testing Overall Model Significance (ANOVA)
Null Hypothesis
H0: β1 = β2 = ...=βκ = 0 (none of the coefficients are significant)
Alternative Hypothesis
H1: at least βi ≠0 ( at least one coefficient is significant )
Test Statistic: F-statistic is used for testing overall model
significance.
Decision Rule: Reject the null hypothesis if the P-value is less
than the chosen significance level.
Practical Application Using Econometric Software
Estimating Simple Regression in EViews
Step 1: Open EViews on your computer.
Step 2: Click on File, then select New, and choose Workfile to
create a new file.
Step 3: Specify the frequency of the data for time series data or
select undated/irregular for cross-sectional data. Define the start
and end of your data set. EViews will open a new window
automatically containing a constant (c) and a residual (resid)
series.
Step 4: On the command line, type the following: genr x = 0 (press
enter). genr y = 0 (press enter)
This creates two new series named x and y with zeros for every
observation. Open x and y as a group by selecting them and
doubleclicking with your mouse.
Step 5: Either type the data into EViews or copy/paste it from Excel. To edit the data or paste anything into EViews cells, press the Edit +/- button; after editing, press the button again to close and secure the data.
Step 6: Once the data is entered into EViews, estimate the regression line (to obtain alpha and beta) either by typing ls y c x (press enter) on the command line, or by clicking on Quick/Estimate Equation and writing your equation (i.e., y c x) in the new window. Note that EViews automatically chooses OLS (ordinary least squares) as the estimation method, and the sample is automatically set to the maximum possible.
Lesson:-3, Multiple Regression model


Learning objectives
1. Understand the fundamental concepts of multiple regression analysis, including predictor variables, the dependent variable, and the assumptions underlying the model.
2. Learn how to systematically select variables, assess multicollinearity, and apply variable transformations when constructing robust multiple regression models.
3. Interpret regression coefficients, significance tests, and
goodness-of-fit measures to assess the model’s explanatory
power and the significance of individual predictors.
4. Explore techniques to verify the assumptions of multiple
regression, such as normality, linearity, homoscedasticity, and
independence of errors, and employ diagnostic tools to validate
model adequacy.
Introduction
Multiple linear regression is a powerful statistical technique used
to understand the relationship between multiple independent
variables and a single dependent variable. Unlike simple linear
regression, which deals with just one predictor, multiple linear
regression involves two or more predictors to model this
relationship.
Consider, for example, a model in which the dependent variable is gross domestic product (GDP) and the independent variables include factors like:
1. Government Spending: The total expenditure by the
government on goods and services.
2. Investment: The amount businesses invest in equipment,
structures, and other capital.
3. Exports: The value of goods and services a country sells to other
nations.
4. Inflation Rate: The rate at which the general level of prices for
goods and services is rising.
GDP = β0 + β1 × Government spending + β2 × Investment + β3 × Exports + β4 × Inflation rate + ε
Multiple Linear regression
Multiple linear regression is an extension of simple linear
regression, allowing for the modeling of relationships between a
dependent variable and multiple independent variables.
Y = β0 + β1X1 + β2X2 +....+ βnXn + ε
Y: The dependent variable (response variable).
X1, X2, …, Xn: Multiple independent variables (predictors).
β0: Intercept or constant term.
β1,β2, …, βn coefficients for the respective independent variables.
They represent the change in the expected value of Y for a one-
unit change in the corresponding X, assuming all other variables
are held constant.
ε: The error term.
Key Components of a Multiple Linear Regression Model
1. Intercept (β0): Represents the expected value of Y when all
predictor variables (X1, X2, …, Xn) are zero. Interpretation can be
meaningful or not, depending on the context of the data.
2. Slope Coefficients (β1, β2, …, βn): Represent the change in the
expected value of Y for a one-unit change in the corresponding
predictor, holding other predictors constant.
3. Multiple Predictors: X1, X2, …, Xn are the independent variables
contributing to the model. Each variable provides additional
information to predict the dependent variable.
4. Residuals: Differences between the observed values of Y and the
values predicted by the multiple regression equation. Residual
analysis helps assess the model’s fit and assumptions.
5. Assumptions: Multiple linear regression assumes linearity,
independence of residuals, homoscedasticity (constant variance
of residuals), and normality of residuals, similar to simple linear
regression.
Estimation in Multiple Linear Regression
Estimation in the context of linear regression involves finding the
values for the coefficients that best fit the model to the observed
data. The most common method for this purpose is the method of
least squares.
The matrix form of the normal equations is
XᵀXb = XᵀY
where X is the matrix of independent variables, Y is the vector of dependent variable values, and b is the vector of coefficient estimates. The solution for b is given by
b = (XᵀX)⁻¹XᵀY
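A small sketch (not from the notes) of this matrix solution using NumPy; the data are randomly generated and the dimensions are arbitrary assumptions.

```python
# Sketch: solving the normal equations X'Xb = X'y for the OLS coefficients.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n),            # intercept column
                     rng.normal(size=n),    # X1
                     rng.normal(size=n)])   # X2
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solving the linear system is numerically safer than forming (X'X)^(-1) explicitly.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)    # estimates close to beta_true
```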
Hypothesis Testing
1. Testing Individual Coefficients (β0, β1,β2,...βn)
Null Hypotheses
H0: βi = 0 (the coefficient is equal to zero), tested separately for each coefficient.
Alternative Hypotheses
H1: βi ≠ 0 (the coefficient is not equal to zero).
Test Statistic: t-Statistic is used for testing individual coefficients.
Decision Rule: Reject the null hypothesis for a particular
coefficient if the P-value is less than the chosen significance level.
2. Testing Overall Model Significance (ANOVA)
Null Hypothesis
H0 : β1 = β2= … = βκ = 0 (none of the coefficients are significant).
Alternative Hypothesis
H1: At least one βi ≠ 0 (at least one coefficient is significant).
Test Statistic: F-statistic is used for testing overall model
significance.
Decision Rule: Reject the null hypothesis if the P-value is less than
the chosen significance level.
Interpretation
P-value: In hypothesis testing, the P-value indicates the
probability of observing the test statistic (or more extreme values)
under the assumption that the null hypothesis is true. A small P-
value (typically less than 0.05) leads to the rejection of the null
hypothesis.
Significance Level (α): The chosen level of significance, often
denoted as α (e.g., 0.05), represents the threshold below which
the null hypothesis is rejected.
Test Statistics (t-Statistic, F-Statistic): These statistics measure
how far the estimated coefficients are from what would be
expected under the null hypothesis. Large values of these
statistics contribute to the rejection of the null hypothesis.
Measures of fit
Measures of fit for multiple regression models are essential to
assess how well the model represents the relationships between
the predictors and the outcome variable.
R-Squared (R2): Represents the proportion of variation in
the dependent variable explained by the independent variables in
the model. In multiple regression, it is called the coefficient of
determination.
R-squared ranges from 0 to 1, where higher values indicate a
better fit.
However, it does not account for overfitting and might not reflect
the model’s predictive accuracy.
Adjusted R-Squared: Unlike R-squared, adjusted R-squared
penalizes the addition of irrelevant predictors, adjusting for the
number of predictors in the model. It helps prevent overfitting
and is a more conservative estimate of model’s goodness-of-fit.
Root Mean Square Error (RMSE) or Mean Squared Error (MSE):
These measures represent the average difference between
predicted values and actual values. Lower RMSE or MSE values
indicate better model fit, highlighting how close the predicted
values are to the observed data.
Akaike Information Criterion (AIC) and Bayesian Information
Criterion (BIC): These are measures used in model selection that
balance goodness-of-fit with model complexity. Lower AIC or BIC
values indicate a better trade-off between fit and complexity. AIC
tends to penalize less for complexity compared to BIC.
F-Statistic: This statistic assesses the overall significance of the
regression model by comparing the variance explained by the
model to the variance not explained. A higher F-statistic with a
low associated P-value suggests that the model is statistically
significant.
Standard Error of the Regression (S): It estimates the standard
deviation of the residuals or errors. A lower standard error
indicates that the model’s predictions are closer to the observed
values.
Durbin–Watson Statistic: This measures the autocorrelation of
residuals. It ranges from 0 to 4, with values around 2 suggesting
no significant autocorrelation. Values substantially different from
2 might indicate a problem with the residuals.
Note:- It is crucial to consider a combination of these measures
when assessing the fit of multiple regression models. While R-
squared and adjusted R-squared provide insights into explained
variance, RMSE, AIC, BIC, F-statistic, and other measures offer
additional perspectives on model accuracy, complexity, and
significance.
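For reference, most of the measures listed above can be read directly off a fitted statsmodels OLS result; the sketch below (illustrative only) assumes a `results` object like the one from the earlier regression sketches.

```python
# Sketch: extracting common fit measures from a fitted statsmodels OLS result.
from statsmodels.stats.stattools import durbin_watson

print(results.rsquared)                   # R-squared
print(results.rsquared_adj)               # adjusted R-squared
print(results.aic, results.bic)           # information criteria
print(results.fvalue, results.f_pvalue)   # overall F-test and its p-value
print(results.mse_resid ** 0.5)           # standard error of the regression
print(durbin_watson(results.resid))       # Durbin-Watson statistic (around 2 is good)
```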
Lesson:-4, Dummy Variables
Learning Objectives
1. To determine how well the observed data fits with the
model’s expected value.
2. Students will learn about different
measures/tests/statistics used for goodness of fit.
3. The usefulness of R-square and adjusted R-square.
Students will learn about the standard error of the model
with respect to goodness of fit.
4. Conceptual understanding, comparison, and calculation
of AIC, BIC, and SIC will be provided in this chapter.
Concept of Dummy Variables
Dummy variables, also known as indicator variables or binary
variables, are used in statistical modeling and econometrics to
represent categorical data numerically. They are called “dummy”
because they take on the values of 0 or 1 to indicate the absence
or presence of a particular categorical attribute.
Dummy variables are especially useful when dealing with
qualitative data that cannot be directly used in mathematical
models. They are commonly used in regression analysis to
represent categorical variables that have more than two levels,
such as education level or occupation.
How dummy variables work:
1. Binary Representation
For a categorical variable with two categories (e.g., yes/no, male/
female), one dummy variable is sufficient. It takes the value of 0
for one category and 1 for the other.
Example: Gender
Suppose you have a dataset with a “gender” variable, and you
want to include it in a regression model. You could create a
dummy variable, say “male,” which takes the value 1 if the person
is male and 0 if the person is female.

2. Handling Multiple Categories


For a categorical variable with more than two categories, you can
create multiple dummy variables.
If there are n categories, you typically create n - 1 dummy variables. The excluded category becomes the reference category, and the other dummy variables represent whether an observation belongs to a specific category or not.
Example: Education level
Suppose you have an “education level” variable with categories
like “high school,” “Bachelor’s,” and “Master’s.” You can create
two dummy variables, say “Bachelor’s” and “Master’s,” and use
“high school” as the reference category.

3. Avoiding Multicollinearity
In regression analysis, including dummy variables for all
categories without omitting one can lead to multicollinearity
issues. Omitting one category helps prevent perfect correlation
among the dummy variables.
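A brief sketch (invented data, pandas assumed) of how k - 1 dummies are typically generated so that one category is left out as the reference and the dummy variable trap is avoided.

```python
# Sketch: creating k-1 dummy variables from a categorical column with pandas.
import pandas as pd

df = pd.DataFrame({"education": ["high school", "Bachelor's", "Master's",
                                 "Bachelor's", "high school"]})

# drop_first=True omits one category (the first in sorted order), which then
# serves as the reference category and prevents perfect multicollinearity.
dummies = pd.get_dummies(df["education"], prefix="edu", drop_first=True)
print(dummies)
```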
Types of Dummy Variables
1. Binary Dummy Variables: This is the simplest type of dummy
variable and is used for a categorical variable with two categories
(binary). One category is chosen as the reference, and a single
dummy variable is created to represent the other category. The
dummy variable takes the value of 0 or 1, indicating the absence
or presence of the category
2. Multicategory Dummy Variables: For categorical variables with
more than two categories, multiple dummy variables are created.
If a variable has k categories, k-1 dummy variables are typically
created, with one category chosen as the reference
3. Interaction Dummy Variables: Interaction dummy variables are
used when there are potential interactions between two or more
categorical variables. These variables are created by taking the
product of the dummy variables representing the individual
categories.
Intercept dummy Variables
1. Reference Category: One category of the categorical variable is
chosen as the reference or baseline category. This category is not
explicitly represented by a dummy variable.
2. Dummy Variables: For a categorical variable with k categories, k - 1
dummy variables are created. Each dummy variable represents
one of the nonreference categories. If the variable has k
categories (including the reference category), then k - 1 dummy
variables are created.
3.Intercept: The intercept in the regression equation represents
the expected value of the dependent variable when all predictor
variables (including dummy variables) are set to zero. In the
context of dummy variables, this means when the observation
belongs to the reference category.
Dependent variable = β0 + β1 × Green dummy + β2 × Blue dummy + ε
β0 represents the expected value when the color is red (the reference category).
β1 represents the change in the expected value when the color is green.
β2 represents the change in the expected value when the color is blue.
Slope Dummy Variables: These allow the relationship between the independent variable (predictor) and the dependent variable (response) to differ between two groups.
Interaction Term: The slope dummy variable is often used in
interaction with the original independent variable.The interaction
term is the product of the slope dummy variable and the original
independent variable. This interaction term is added to the
regression model to allow for different slopes for the two groups.
Yi = β0 + β1Xi + β2Di + β3(Xi × Di) + εi
Yi = dependent variable, Xi = independent variable,
Di = slope dummy variable (1 for Group B, 0 for Group A),
εi = error term, β0 = the intercept,
β1 = the slope for Group A,
β2 = the difference in intercepts between Group B and Group A, and
β3 = the difference in slopes between Group B and Group A.
The presence of the interaction term Xi × Di allows the slope of the
regression line to vary between the two groups.
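The sketch below (invented data; statsmodels formula interface assumed) estimates exactly this kind of slope-dummy specification, with separate intercepts and slopes for the two groups.

```python
# Sketch: a regression with an intercept dummy and an interaction (slope) term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
d = rng.integers(0, 2, size=n)                       # 1 for Group B, 0 for Group A
y = 1.0 + 2.0 * x + 0.5 * d + 1.5 * x * d + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x": x, "d": d})

# "x:d" is the interaction term, allowing Group B to have a different slope.
res = smf.ols("y ~ x + d + x:d", data=df).fit()
print(res.params)    # intercept, Group A slope, intercept difference, slope difference
```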
Use of Dummy Variables to Model Qualitative/Binary/Structural Changes
1. Binary variables
Dummy Variable: Create a dummy variable (also known as an
indicator variable) that takes the value 1 for one category (e.g.,
female) and 0 for the other category (e.g., male).
Regression Model: Include this dummy variable in your regression
model.The coefficient associated with the dummy variable
indicates the average change in the dependent variable when
moving from one category to the other.
Dependent variable = β0 + β1 × X1 + β2 × Gender dummy + ε
β2 represents the average difference in the dependent variable between the two gender categories.
2. Structural changes
Example: Imagine you have data for a period before and after the
implementation of a new policy.
Dummy Variable: Create a dummy variable that takes the value 0
for the period before the policy change and 1 for the period after
the policy change.
Regression Model: Include this dummy variable in your regression
model to account for the structural change in the data.
Dependent variable = β0 + β1 x X1 + β2 x Policy dummy + ε
The coefficient β2 now captures the average change in dependent
variable associated with the policy change.
3. Interactions
You can create interaction terms by multiplying two (or more) dummy variables, for example, an interaction between the policy change dummy and a regional dummy.
Yi = β0 + β1 × X1 + β2 × Policy dummy + β3 × Regional dummy + β4 × (Policy dummy × Regional dummy) + ε
The interaction term (β4) captures how the effect of the policy
change differs across regions.
Other Functional Forms of Dummy Variables
1. Interaction effects
Dummy variables can be used to model interaction effects
between different categorical variables. For example, if you have
two categorical variables A and B, you can create dummy
variables for each category and an interaction term by multiplying
the two dummy variables.
Yi = β0 + β1 × Dummy_A + β2 × Dummy_B + β3 × (Dummy_A × Dummy_B) + εi
The interaction term (β3) captures the joint effect of categories A and B.
2. Piecewise linear regression
Dummy variables can be used to model piecewise linear relationships. This is helpful when you expect different slopes or intercepts for different ranges of your independent variable.
Yi = β0 + β1 × X1 + β2 × Dummy_Indicator + β3 × (X1 × Dummy_Indicator) + ε
Here, Dummy_Indicator takes the value 1 for observations within a specific range and 0 otherwise.
3. Polynomial regression
Dummy variables can be used to model polynomial relationships.
For instance, you might use dummy variables to represent
different polynomial degrees.
Yi = β0 + β1 × X + β2 × X² + β3 × Dummy_Indicator + β4 × (X × Dummy_Indicator) + β5 × (X² × Dummy_Indicator) + εi
Dummy_Indicator takes the value 1 for observations where the polynomial term is relevant and 0 otherwise.
4. Seasonal dummy variables
In time series analysis, dummy variables are often used to model seasonal effects. Each dummy variable represents a different season.
Yi = β0 + β1 × X + β2 × Dummy_Winter + β3 × Dummy_Spring + β4 × Dummy_Summer + β5 × Dummy_Fall + εi
The coefficients associated with seasonal dummy variables
capture the average change in the dependent variable during each
season.
Qualitative Response Regression Models
Qualitative response regression models, also known as binary choice models or binary response models, are used when the dependent variable is categorical and takes on only two possible outcomes. The most common example is a binary outcome, such as “success” or “failure,” “yes” or “no,” or “1” or “0.”
1. Logistic Regression
Logistic regression is a statistical method used for predicting the probability of a binary outcome. It is commonly used when the dependent variable is categorical and represents two classes, such as 0 or 1, yes or no, true or false. Logistic regression models the probability that an instance belongs to a particular category.
The logistic function (sigmoid function) is at the core of logistic regression, ensuring that the predicted probabilities lie between 0 and 1:
P(y=1) = 1 / (1 + e^-(β0 + β1X1 + β2X2 + ... + βnXn))

Characteristics and Considerations of Logistic Regression
Interpretation of Coefficients: The coefficients (β) represent the change in the log-odds of the dependent variable for a one-unit change in the corresponding independent variable.
Odds Ratio:- The odds ratio is derived from the coefficients and
represents the change in odds for a one-unit change in the
independent variable.
Binary Outcome:- Logistic regression is suitable for binary
outcomes, but it can be extended to handle multinomial or
ordinal outcomes in the form of multinomial logistic regression or
ordered logistic regression
Assumption of Linearity in Log-Odds
Logistic regression assumes a linear relationship between the
independent variables and the log-odds of the dependent variable.
Diagnostic Measures
Model fit can be assessed using measures like the deviance,
likelihood ratio tests, and the Hosmer–Lemeshow goodness-of-fit
test. Logistic regression is widely used in various fields, including medicine (predicting disease presence or absence), marketing (predicting customer churn), and social sciences (predicting voting behavior). It is a powerful tool for binary classification problems when the relationship between the independent variables and the log-odds of the dependent variable is linear.
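A compact sketch (invented data; statsmodels assumed) of fitting a logit model and reading the coefficients as log-odds effects and odds ratios, as described above.

```python
# Sketch: logistic (logit) regression on simulated binary data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.2 * x)))   # true sigmoid probabilities
y = rng.binomial(1, p)                         # binary outcome (0/1)

X = sm.add_constant(x)
res = sm.Logit(y, X).fit()
print(res.params)           # coefficients: change in log-odds per unit of x
print(np.exp(res.params))   # odds ratios
```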
2. Probit Regression
Probit regression is another statistical method used for modeling
the probability of a binary outcome, like logistic regression. It is
particularly common in econometrics and social sciences. Like
logistic regression, probit regression is used when the dependent
variable is binary and follows a probit link function.
P(y=1) = Φ(β0 + β1X1 + β2X2 + ... + βnXn)
where P(y=1) is the probability of the dependent variable being 1,
Φ is the cumulative distribution function of the standard normal distribution,
β0 is the intercept, and β1, β2, ..., βn are the coefficients associated with the independent variables X1, X2, ..., Xn.
Key Characteristics and Considerations of Probit Regression
Interpretation of Coefficients:- Similar to logistic regression, the coefficients (β) in probit regression are not direct changes in probability; they represent the change in the z-score (the probit index) of the dependent variable being 1 for a one-unit change in the corresponding independent variable.
Odds Ratio:- While logistic regression provides odds ratios, probit
regression typically focuses on marginal effects, which represent
the change in the probability of the dependent variable for a small
change in the independent variable.
Link Function:- The probit link function assumes a normal
distribution for the latent variable, and the model is based on the
standard normal distribution.
Nonlinear Relationship
Probit regression assumes a linear relationship between the
independent variables and the latent variable, but the observed
probabilities follow a nonlinear relationship.
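For comparison, a probit version of the earlier sketch (reusing the simulated y and X from the logit example, so it is illustrative only); the marginal effects summarize the change in P(y = 1) for small changes in the regressors.

```python
# Sketch: probit regression and average marginal effects.
res_probit = sm.Probit(y, X).fit()
print(res_probit.params)                     # probit index coefficients
print(res_probit.get_margeff().summary())    # average marginal effects on P(y=1)
```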

Lesson:-5, Multicollinearity and Heteroskedasticity
Learning objectives
1. Understand the nature of problems due to violation of the assumptions of no multicollinearity and homoscedasticity in the classical linear regression model (CLRM).
2. Learn how to diagnose these issues through statistical tests.
3. Understand the impact on the reliability and validity of inferences in case of violation of these assumptions.
4. Examine the remedies available to alleviate these problems.
Introduction
The classical linear regression model estimates the parameters of a model by the method of least squares. The model
makes some crucial assumptions like linearity of the parameters,
nonstochastic regressors, zero conditional mean of errors, no
heteroskedasticity (homoscedasticity), no autocorrelation, no
multicollinearity, and no specification error.
Multicollinearity
Nature of Multicollinearity
If two or more variables in a classical linear regression model have
an exact linear relationship between them, then it is a case of
perfect multicollinearity. This violates the assumption of CLRM.
θ1X1 + θ2X2 + θ3X3+...+θκXK =0
If the explanatory variables are intercorrelated (even if not
perfectly), then it is considered to be a problem of imperfect (or
near) multicollinearity. Equation 5.1 can be modified as follows to
allow for a stochastic error term, ui in the case of imperfect
multicollinearity
θ1X1 + θ2X2 + θ3X3 + ...+θκXK + ui=0
There are various sources of multicollinearity in data.
Some of the factors causing multicollinearity are
explained below:
1. The inclusion of the same information twice can lead to a
perfect correlation between the explanatory variables. For
example, if the weight of an object is measured in pounds in
variable X1 and kilograms in variable X2, then both variables will
be perfectly correlated.
2. Falling into a dummy variable trap by including all the
categories of the dummy variable simultaneously in a single
equation.
3. Common trend in the data can also make the explanatory
variables highly collinear. For example, if the performance of
students is regressed on school expenditure on instructional
materials (X2) and expenditure on athletics (X3), then there is a
possibility of a high correlation between X2 and X3 because
wealthier schools tend to spend more on both the aspects in
comparison to poorer schools.
4. An overdetermined model can also violate the assumption of no
multicollinearity as the number of explanatory variables would be
more than the number of observations.
5. Model specification with various polynomial terms of the
explanatory variables can also make these variables highly
collinear.
Impact of Multicollinearity on Estimation and Inference
If two or more variables are highly correlated in a dataset, then
intuitively the estimator fails to distinguish between the
individual effects of the explanatory variables. In applied
econometrics, the regression analysis facilitates the separation of
partial effects of each explanatory variable (X) upon the
dependent variable. But with multicollinearity, the explanatory
variables become indistinguishable and the partial effects of each
X cannot be estimated.
The inflated standard errors of the coefficients lead to
the following:
1. Confidence intervals are very large making the estimates less
reliable.
2. The estimated t-statistics are very small making the variables
statistically insignificant.
3. Although the t-statistics of the explanatory variables are
insignificant, the overall R2 of the model can be very high which
can give an incorrect measure for the goodness of fit.
4. Any small change in the dataset can affect the OLS estimators and their standard errors by a large magnitude.
Mathematically, the variances and covariance of the estimated coefficients (β2 and β3) in a regression of Y on X2 and X3 depend on the correlation between the regressors; for example,
var(β2) = σ² / [Σx2i² (1 - r23²)]
where x2i denotes the deviation of X2 from its mean, r23 refers to the coefficient of correlation between X2 and X3, and σ² is the variance of the disturbance term. If there is perfect multicollinearity, then r23 is unity and the variances and covariance are infinite. In the case of imperfect multicollinearity, a high value of r23 inflates the variances and covariances. The speed at which these values increase with respect to the correlation coefficient of the explanatory variables is defined as the variance-inflating factor, denoted by
VIF = 1/(1 - r23²)
Detection of Multicollinearity
Multicollinearity refers to a high correlation among the
explanatory variables and is, therefore, a sample phenomenon.
Moreover, the degree of correlation among the variables is an
important factor while identifying and rectifying multicollinearity.
Note that a small degree of multicollinearity may exist in every
equation; it is the severe correlation among the variables which
makes the inferences invalid.
There is no unique statistical method to detect multicollinearity
but there are a few rules of thumb to facilitate its detection. These
rules are explained below:
1. High R2 and Insignificant t-Statistics: As the presence of
multicollinearity inflates the standard error of the estimators,
the t-statistics are decreased in magnitude, thereby making them
statistically insignificant. If, simultaneously, the R2 is very high (in excess of 0.8), then this is a classic symptom of multicollinearity.
2. High Correlation among the Explanatory Variables: Before
running a regression model, it is advisable to check pair-wise
correlations among the various independent variables along with
the significance of the correlation coefficient. If these coefficients are high (in excess of 0.8), then multicollinearity may exist.
3. High Value of VIF: The variance inflation factor (VIF) discussed
above can also help detect multicollinearity. If the calculated VIF
is higher than 10, the variables are said to be highly correlated.
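The rule of thumb in point 3 can be checked directly in software; the sketch below (invented, nearly collinear data; statsmodels assumed) computes the VIF for each explanatory variable.

```python
# Sketch: computing variance inflation factors for the explanatory variables.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)        # nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))  # design matrix with a constant

vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)    # values above 10 suggest severe multicollinearity
```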
Remedial measures of multicollinearity
1. Dropping a Variable(s): One of the most common ways to deal
with multicollinearity is to drop one or more of the variables from
the regression equation which shows high degree of correlation.
2. A Priori Information: It refers to the information that the
researcher has about the regression coefficients based on the
previous empirical work or relevant theoretical constructs. Then,
such information can be used to modify the regression equation
and deal with multicollinearity.
3. Increasing the Sample Size: Another way to deal with
multicollinearity is to combine and increase the sample size of the
dataset. This can rectify the problem of multicollinearity because
it is primarily a sample phenomenon and increasing the sample
size reduces the width of the confidence interval making the
inferences more reliable.
4.Transforming the Data: An appropriate transformation of the
explanatory variables can help deal with multicollinearity. Two of
such transformations discussed here are, first differencing and
ratio transformation
Heteroskedasticity
The assumption of a constant variance (σ²) of each disturbance term ui in a classical linear regression model is termed homoscedasticity. It implies equal (homo) spread (scedasticity) of the disturbances:
E(ui²) = σ²
When this assumption is violated, the errors are heteroskedastic. Some of the situations in which heteroskedasticity arises are listed below:
1. Error-Learning Model: These models suggest that as people
learn, the errors arising in behavior can be reduced dramatically.
For example, typing speed can improve with practice over time. Therefore, σi² is expected to fall as the value of the independent variable (time in this example) increases.
2. Improved Data Collection Technique: As data collection techniques improve, the variance of errors can fall over time.
3. Presence of Outliers: Outliers are observations in the data which are quite different (either small or large) from the rest of the observations. Such observations can create heteroskedasticity in the data.
4. Misspecification of Regression Model: In case some variables
are omitted from the model, it can lead to specification error. The
error variance in such models may not be constant, thereby
causing heteroskedasticity
The consequences of using OLS estimators in the presence of heteroskedasticity are listed below:
1. The estimated standard errors of the coefficients would be
biased.
2. As the estimators are no longer efficient, the estimated
confidence intervals would be unnecessarily large.
3. As a result, the estimated t and F-statistics would be smaller
than what is appropriate and, therefore, would turn out to be
insignificant.
4. The usual testing procedures would provide misleading
inferences in the presence of heteroskedasticity.
Detection of Heteroskedasticity
Remedial Measures for Heteroskedasticity
1. Redefining the Variables: The variance of the observations can
be reduced by redefining the variables.
2. Generalized Least Squares: If σi2 is known, the simplest way to
correct for heteroskedasticity is by the means of generalized least
squares. Consider a two-variable regression model.
3. White’s Heteroskedasticity-Consistent Standard Errors: In case
the σi2 is not known, the researcher is suggested to use White’s
heteroskedasticity-consistent variances and standard errors
which are available in almost every statistical package. These
standard errors are also known as robust standard errors.
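A minimal sketch (invented heteroskedastic data; statsmodels assumed) of requesting these White-type robust standard errors alongside the conventional ones.

```python
# Sketch: conventional vs. heteroskedasticity-robust (White/HC1) standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(1, 10, size=n)
u = rng.normal(scale=0.5 * x)          # error variance grows with x (heteroskedastic)
y = 2.0 + 1.5 * x + u
X = sm.add_constant(x)

plain = sm.OLS(y, X).fit()                   # usual OLS standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")    # White-type robust standard errors
print(plain.bse)
print(robust.bse)
```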
Lesson:-6, Autocorrelation
Learning objectives
1. Understand the nature of problems due to violation of the
assumptions of no autocorrelation and no specification errors in
the classical linear regression model (CLRM).
2. Learn how to diagnose these issues through statistical tests.
3. Understand the impact on the reliability and validity of
inferences in case of violation of these assumptions.
4. Examine the remedies available to alleviate these problems.
Introduction
The classical linear regression model estimates the parameters of a model by the method of least squares. The model
makes some crucial assumptions like linearity of the parameters,
nonstochastic regressors, zero-conditional mean of errors, no
heteroskedasticity (homoscedasticity), no autocorrelation, no
multicollinearity, and no specification error.
Autocorrelation
Autocorrelation refers to the correlation between the observations in a dataset ordered according to time or space. Consider the following linear regression model:
Yi = β1 + β2Xi + ui
The assumption of no autocorrelation means:
E(uiuj) = 0 where i ≠ j
Patterns of autocorrelation
Some of the possible causes of autocorrelation are discussed below:
1. Model Misspecification: The problem of autocorrelation can be caused by an incorrect functional form of the regression equation or the omission of certain variables from the model. Consider the following model as the correct specification of consumption expenditure:
Consumptioni = β1 + β2 Incomei + β3 Incomei² + ui
2. Manipulation of Data: In the time series analysis, the raw data
are often averaged to smooth out the fluctuation in the data. For
example, the quarterly data is converted to yearly data or
monthly observations are converted to quarterly data. In such
cases, a systematic pattern emerges in the disturbances due to
the smoothening of data.
3. Data Transformation: Using of lagged values of the variables
to construct a first difference model can lead to the problem of
autocorrelation.

4. Non-stationarity: A time series is said to be stationary if its characteristics like mean, variance, and covariance do not depend on time, that is, they are time-invariant. These characteristics do not change with time and the errors are not correlated. However, if the series is nonstationary, then its errors are autocorrelated.
Impact of Autocorrelation on Estimation and Inference
Remedial Measures for Autocorrelation

If the data has autocorrelation based on the detection methods discussed in the previous sub-section, the following remedial methods can be used to deal with it.
1. Correct Functional Form: The omission of a key independent
variable from the model can result in autocorrelation. Therefore, a
simple way is to check the functional form of the regression model
carefully and add a predictor variable which is probably missed. If
adding such an independent variable can reduce or eliminate the
autocorrelation in the model, then it should be added to the
model.
2. Add Lagged Dependent Variable: One way to deal with
autocorrelation in the data is to add the lagged dependent
variable among the set of explanatory variables. This will control
for the dynamics in the model and the autocorrelation can be
corrected if the degree of serial correlation is low.
3. Transformation of the Model: This solution is similar to the one suggested for the presence of heteroskedasticity, wherein the model is appropriately transformed to nullify the effect of the autocorrelation. The generalized least squares method is discussed here.
4. First-Difference Form: As the estimated ρ is subject to sampling
errors, a simpler way is to run the regression model with first-
differenced dependent and independent variables. This measure
is useful when the estimated d-statistic is small. The implicit
assumption of this technique is that the first difference of the
errors is independent.
5. White’s Robust Standard Errors: The autocorrelation in the data
does not affect the unbiasedness or consistency of the estimates
but distorts the standard errors of the estimators which make the
inferences based on statistical tests invalid or unreliable.
Therefore, one way to deal with this is to keep the estimated
coefficients and adjust the standard errors by taking White’s
robust standard errors. It is similar to the technique used for
correcting heteroskedasticity as well.
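The notes refer to these as White's robust standard errors; in most software the version that is robust to autocorrelation (as well as heteroskedasticity) is the HAC, or Newey-West, estimator. A sketch under that assumption, with simulated serially correlated errors:

```python
# Sketch: autocorrelation-robust (HAC / Newey-West) standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                  # AR(1)-type serially correlated errors
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u
X = sm.add_constant(x)

res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(res.bse)    # standard errors adjusted for serial correlation
```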
Lesson:-7, Specification Bias
Learning objectives
1. Understand the nature of problems due to violation of
the assumptions of no autocorrelation and no
specification errors in the classical linear regression
model (CLRM).
2. Learn how to diagnose these issues through statistical
tests.
3. Understand the impact on the reliability and validity of
inferences in case of violation of these assumptions.
4. Examine the remedies available to alleviate these
problems.
Introduction
Specification bias occurs when the model or analysis used in
research or decision-making is not adequately designed to
represent the real-world phenomenon under study. This bias
arises due to flaws in the structure of the model, variables,
assumptions, or functional form. This, in turn, can have a
significant impact on the accuracy and validity of conclusions
drawn from the economic analysis.
Model selection criteria
1. Data Admissibility: The model’s predictions must align with
logical possibilities within the data.
2. Theoretical Consistency: The model should make economic
sense based on established theories. For instance, in line with
Milton Friedman’s permanent income hypothesis, the regression
of permanent consumption on permanent income should ideally
yield an intercept value of zero.
3. Weakly Exogenous Regressors: The explanatory variables
(regressors) should have minimal correlation with the error term.
4. Parameter Constancy: Stable parameter values are crucial for
reliable forecasting. As Friedman highlights, a hypothesis or
model’s validity is best assessed by comparing its predictions with
real-world outcomes. Without stable parameters, these
predictions lose reliability.
5. Data Coherency: The residuals derived from the model should
resemble purely random noise (technically known as white noise).
Any deviation from this randomness indicates potential
specification errors within the model.
6. Encompassing Nature: The chosen model should encompass or
explain the outcomes of rival models. Essentially, it should be
comprehensive enough that other models cannot significantly
improve upon it.
Nature of Specification Errors
One of the assumptions of the classical linear regression model
(CLRM) is that the model is correctly specified. The violation of
this assumption leads to a problem of model specification bias or
model specification error. There are various types of specification errors that can be encountered in regression modeling. Suppose that the correctly specified (true) model is:
Yi = β1 + β2Xi + β3Xi² + ui
1. Inclusion of an Irrelevant Variable: Suppose that instead of running the regression model above, the following model is run:
Yi = β1 + β2Xi + β3Xi² + β4Xi³ + ui
In this model, the variable Xi³ is not needed. It is an irrelevant variable, and adding it to the model is a specification error.
2. Omission of a Relevant Variable: Suppose that instead of the model above, the following model is run:
Yi = β1 + β2Xi + vi
In this model, the variable Xi² is omitted, and the error term vi will include β3Xi² + ui. Therefore, the covariance between the error and the independent variable is no longer zero as assumed under CLRM. This creates a problem of endogeneity in the model.
3. Wrong Functional Form: Suppose that instead of the correct regression model above, the following model is run:
lnYi = β1 + β2Xi + β3Xi² + ui
In this model, the natural logarithm of Yi, instead of its level value, is regressed on the independent variables.
4. Measurement Errors in the Data: In case the dependent and independent variables are not measured correctly, there are errors of measurement. Consider the following model with measurement errors:
Yi′ = β1 + β2Xi′ + ui′
where Yi′ and Xi′ are the dependent and independent variables measured with error, respectively.
5. Incorrect Specification of the Stochastic Error Term: A different form of specification error involves how the stochastic error ui (or ut) enters the regression model. For instance, the following bivariate regression model is formulated with the intercept and error term introduced wrongly:
Yi = βXiui
Nature of Specification Errors
1. Underfitting a Model: The underfitting of a model occurs when a relevant variable is omitted from the regression equation. For example, suppose the true model is as follows:
Yi = β1 + β2X2i + β3X3i + ui
2. Overfitting a Model: The overfitting of the model implies the
inclusion of an irrelevant variable in the model. Suppose the true
model is Yi = β1 + β2X2i + ui
Detection of Specification Errors
1. Detecting Overfitting of the Model: A model has the problem of
overfitting if irrelevant variables are included in the model. Let us
take for example that in the following model, the variable X2
is irrelevant
Yi = β1 + β2X1 + β3X2 + ui
2. Detecting Underfitting of Model: A model has a problem of
underfitting if the relevant variables are omitted from the model.
To start with, after the initial estimation of the model, the broad features of the estimation output should be checked carefully for any unwarranted outcomes.
3. R2 and Adjusted R2 Test: The values of R2 and adjusted R2 can indicate specification errors. R2 tends to increase as more explanatory variables are added to the model; therefore, a model that includes irrelevant variables may show an inflated R2.

4. Information Criterion: Another way of identifying specification errors in the model is to check the information criteria for the different models. Some of the commonly used information criteria include the Akaike information criterion (AIC) and the Schwarz information criterion (SIC).
5. Ramsey’s Regression Specification Error Test (RESET): This is a formal test for specification error in the model. It checks whether the overall fit of the regression model can be improved by adding polynomials of the fitted values of the dependent variable, Ŷ. The steps of the RESET test are as follows:
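In practice the test is usually run directly in software rather than by hand; the sketch below (invented data; it assumes a recent version of statsmodels, which provides linear_reset) adds powers of the fitted values and reports the F-test, mirroring the idea described above.

```python
# Sketch: Ramsey RESET test on a deliberately misspecified linear model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(size=n)   # true relationship is quadratic

res = sm.OLS(y, sm.add_constant(x)).fit()             # fitted as if it were linear
reset = linear_reset(res, power=2, use_f=True)        # add squared fitted values
print(reset)    # a small p-value signals functional-form misspecification
```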
