

Getting Started

A Guide to Panel Data Regression: Theoretics and Implementation with Python

Bernhard Brugger · Jan 6, 2021 · 12 min read

Panel data regression is a powerful way to control for unobserved independent variables whose influence on a dependent variable can lead to biased estimators in traditional linear regression models. In this article, I want to share the most important theory behind this topic and show how to build a panel data regression model with Python, step by step.

My intention in writing this post is twofold: First, in my opinion, it is hard to find an easy and comprehensible explanation of an integrated panel data regression model. Second, performing panel data regression in Python is not as straightforward as in R, for example, which doesn't mean that it is less effective. So, I decided to share the knowledge I gained during a recent project in order to make future panel data analyses a bit easier ;-)

Enough talk! Let's dive into the topic by describing what panel data is and why it is so powerful!

What is Panel Data?

“Panel data is a two-dimensional concept, where the same individuals are observed repeatedly over different periods in time.”

In general, panel data can be seen as a combination of cross-sectional and time-series data. Cross-sectional data is one observation of multiple objects and their corresponding variables at a specific point in time (i.e. the observation is taken once). Time-series data observes only one object recurrently over time. Panel data combines the characteristics of both into one model by collecting data from the same set of objects over time.

In a nutshell, we can think of it as a timeline along which we periodically observe the same individuals.

[Figure: Illustration of the panel data design]

Let's go a step further by breaking down the definition above and explaining it step by step on a sample panel dataset:

[Table: sample panel dataset with nine rows and four columns — person, year, x, y]

“Panel data is a two-dimensional concept […]”: Panel data is commonly stored in a two-dimensional way with rows and columns (our sample dataset has nine rows and four columns). It is important to note that we always need one column to identify the individuals under observation (column person) and one column to document the points in time the data was collected (column year). Those two columns should be treated as a multi-index.
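
This is straightforward to set up in pandas. The following sketch builds the sample dataset from above with a (person, year) multi-index; the values for x and y are made up purely for illustration:

# Build the sample panel dataset with a (person, year) multi-index
# (x and y values are made up for illustration)
import pandas as pd

sample = pd.DataFrame({
    'person': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'year':   [2018, 2019, 2020] * 3,
    'x':      [1.0, 1.2, 1.1, 0.8, 0.9, 1.0, 1.5, 1.4, 1.6],
    'y':      [2.1, 2.3, 2.2, 1.7, 1.9, 2.0, 2.8, 2.7, 2.9],
}).set_index(['person', 'year'])
print(sample)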

“[…] where the same individuals […]”: We have the individuals person A, person B, and person C, from whom we collect the variables x and y. The individuals and the observed variables always stay the same.

Note: This peculiarity is also the main difference from another, often mixed-up data concept, namely pooled cross sections. While both can be seen as cross-sectional data summarized over time, the main difference is that panel data always observes the same individuals, while this cannot be guaranteed for pooled cross sections.

Example of pooled-cross sections:

[Figure: Pooled cross-sectional data]

“[…] are observed repeatedly over different periods in time.”: We collect data from 2018, 2019, and 2020.

So far so good… now we understand what panel data is. But what is the idea behind this data concept, and why should we use it?

The answer is… heterogeneity and the endogeneity that results from it! Maybe you have already heard about this issue in traditional linear regression models, in which heterogeneity often leads to biased results. Panel data is able to deal with that problem.

Since heterogeneity and endogeneity are crucial for understanding why we use panel data models, I will try to explain the problem in a straightforward way in the next section.

The Problem of Endogeneity caused by unobserved Heterogeneity

“The unobserved influence of other independent variable(s) is called unobserved heterogeneity, and the correlation between the independent variable(s) and the error term (i.e. the unobserved independent variables) is called endogeneity.”

Let's say we want to analyze how coffee consumption affects the level of concentration. A simple linear regression model would look like this:

Concentration_Level = β0 + β1 · Coffee_Consumption + ɛ

where:

Concentration_Level is the dependent variable (DV)
β0 is the intercept
β1 is the regression coefficient
Coffee_Consumption is the independent variable (IV)
ɛ is the error term

The goal of this model is to explore the relationship of Coffee_Consumption (IV) with Concentration_Level (DV). Assuming that IV and DV are positively correlated, an increase in IV comes with an increase in DV. Let's add this fact to our formula:

[Figure: Coffee_Consumption ↑ → Concentration_Level ↑ (positive relationship between IV and DV)]

But what if there is another variable that affects the existing IV(s) and is not included in the model? For example, Tiredness very likely affects Coffee_Consumption (if you are tired, you will obviously drink coffee ;-) ). If you remember the first sentences of this article, such variables are called unobserved independent variables. They are “hidden” behind the error term, and if, e.g., Coffee_Consumption is positively related to such a variable, the error term increases as Coffee_Consumption increases:

[Figure: as Coffee_Consumption increases, the error term ɛ increases (correlation between IV and error term)]

This, in turn, leads to an inflated estimate of the DV Concentration_Level. The estimated DV is therefore biased and will lead to inaccurate inferences. In our example, the bias is the red overshoot at Concentration_Level shown in the figure.

[Figure: biased estimator of Concentration_Level caused by unobserved heterogeneity]
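
To make this bias tangible, here is a small simulation of my own (synthetic data, not from the original article). Tiredness drives both Coffee_Consumption and Concentration_Level, but we omit it from the regression, so its effect hides in the error term and distorts the coffee coefficient:

# Omitted-variable bias on synthetic data (illustrative sketch)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
tiredness = rng.normal(size=n)                     # unobserved variable
coffee = 0.8 * tiredness + rng.normal(size=n)      # tiredness raises coffee consumption
concentration = 1.0 * coffee - 0.5 * tiredness + rng.normal(size=n)

# Regression that omits tiredness: the fitted coffee coefficient is
# biased away from the true value of 1.0 (here downward; with other
# signs of the correlations it can just as well be inflated)
biased = sm.OLS(concentration, sm.add_constant(coffee)).fit()
print(biased.params)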

Luckily, there is a way to deal with this problem… maybe you already guessed it: panel data regression! The advantage of panel data is that we can control for heterogeneity in our regression model by treating it as fixed or random. But more on that in the next section!

Types of Panel Data Regression

The following explanations are built on this notation:

y_it = β · X_it + α_i + μ_it

where:

y = DV
X = IV(s)
β = Coefficients
α = Individual Effects
μ = Idiosyncratic Error

Basically, there are three types of regression for panel data:

1) PooledOLS: PooledOLS can be described as a simple OLS (Ordinary Least Squares) model that is performed on panel data. It ignores time and individual characteristics and simply pools all observations. However, simple OLS requires that there is no correlation between the unobserved independent variable(s) and the IVs (i.e. exogeneity). Let's write this down:

Cov(X_it, α_i + μ_it) = 0   (Exogeneity Assumption)

The problem with PooledOLS is that even if the assumption above holds, alpha might show serial correlation over time. Consequently, PooledOLS is mostly inappropriate for panel data.

Cov(α_i + μ_it, α_i + μ_is) = σ²_α ≠ 0  for t ≠ s   (serial correlation through alpha)

Note: To counter this problem, there is another estimation approach called FGLS (Feasible Generalized Least Squares), which is also used in the random-effects models described below.

2) Fixed-Effects (FE) Model: The FE-model treats the individual effects of unobserved independent variables as constant (“fixed”) over time. Within FE-models, a relationship between the unobserved independent variables and the IVs (i.e. endogeneity) is allowed to exist:

Cov(X_it, α_i) ≠ 0   (endogeneity allowed)

The trick in an FE-model is this: if we assume alpha to be constant and subtract the individual means from each term of the equation, alpha (i.e. the unobserved heterogeneity) becomes zero and can therefore be neglected:

y_it − ȳ_i = β · (X_it − X̄_i) + (α_i − α_i) + (μ_it − μ̄_i),   where α_i − α_i = 0

Only the idiosyncratic error (μ = unobserved factors that change over time and across units) remains, and it has to be exogenous and non-collinear.

However, because the heterogeneity can be controlled this way, the model tolerates its existence. Unfortunately, since the individual effects are fixed, dependencies can only be observed within the individuals.
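
As a minimal sketch of this within transformation (my own illustration, reusing the toy `sample` DataFrame built earlier):

# Within transformation: subtract each person's time-mean,
# so the constant individual effect alpha drops out
demeaned = sample - sample.groupby(level='person').transform('mean')
print(demeaned)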

Note: An alternative to the FE-model is the LSDV-model (Least Squares Dummy Variables), in which the (fixed) individual effects are represented by dummy variables. It leads to exactly the same results, but has one main disadvantage: it needs a lot more computing power if the regression model is big.

3) Random-Effects (RE) Model: RE-models treat the individual effects of unobserved independent variables as random variables over time. They are able to “switch” between OLS and FE and can hence capture both dependencies between and within individuals. The idea behind RE-models is the following:

Let's say we have the same notation as above:

y_it = β · X_it + α_i + μ_it

In order to include between- as well as within-estimators, we first need to define when to use which estimator. In general, if the covariance between alpha and the IV(s) is zero (or very small), there is no correlation between them and an OLS-model is preferred. If that covariance is not zero, there is a relationship that should be eliminated by using an FE-model:

Cov(α_i, X_it) = 0 → OLS preferred;   Cov(α_i, X_it) ≠ 0 → FE preferred

The problem with using OLS, as stated above, is the serial correlation of alpha over time. Hence, RE-models determine which model to lean towards according to the serial correlation of the error terms. To do so, they use the term lambda. In short, lambda measures how big the variance of alpha is. If it is zero, there is no variance in alpha, which in turn means that PooledOLS is the preferred choice. On the other hand, if the variance of alpha becomes very big, lambda tends towards one, and it makes sense to eliminate alpha and go with the FE-model.

λ = 1 − √( σ²_μ / (σ²_μ + T · σ²_α) )   (λ = 0 → PooledOLS; λ → 1 → FE-model)
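
To make the role of lambda concrete, here is a tiny sketch based on the standard quasi-demeaning formula above (the function name and the numbers are my own):

# Lambda close to 0 -> behave like PooledOLS; close to 1 -> behave like FE
import numpy as np

def re_lambda(var_mu, var_alpha, T):
    return 1 - np.sqrt(var_mu / (var_mu + T * var_alpha))

print(re_lambda(1.0, 0.0, 5))    # 0.0   -> no variance in alpha, PooledOLS
print(re_lambda(1.0, 100.0, 5))  # ~0.96 -> large variance in alpha, FE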

Now that we know the common models, how do we decide which one to take? Let's have a look at that…

How to decide which Model is appropriate?

Choosing between PooledOLS and FE/RE: Basically, there are five assumptions for simple linear regression models that must be fulfilled. Two of them can help us choose between PooledOLS and FE/RE.

These assumptions are (1) Linearity, (2) Exogeneity, (3a) Homoskedasticity and (3b) Non-autocorrelation, (4) the independent variables are not stochastic, and (5) no multicollinearity.

If assumption (2) or (3) (or both) are violated, then FE or RE might be more suitable.

Choosing between FE and RE: The answer depends on your assumption about whether the individual, unobserved heterogeneity is a constant or a random effect. The question can also be settled by performing the Hausman-Test.

Hausman-Test: In simple terms, the Hausman-Test is a test for endogeneity. Its null hypothesis is that the covariance between the IV(s) and alpha is zero. If that is the case, RE is preferred over FE. If the null hypothesis is rejected, we must go with the FE-model.

So, we now understand the theory behind panel data regression. Let's get to the fun stuff and build the model in Python step by step:

Implementing a Panel Data Model in Python

Step 1: Import dataset and transform it into the right format.

I will use the “Guns.csv” dataset, which normally ships with R. As stated in the description of this dataset: “Guns is a balanced panel of data on 50 US states, plus the District of Columbia (for a total of 51 states), by year for 1977–1999.” (Note: a panel dataset is called “balanced” if there are no missing values within the dataset; otherwise it is called “unbalanced”.)

For the sake of simplicity, I will only use the following columns of the dataset:

state: identifies our individuals under observation.
year: documents the periods in which the data was collected (1977–1999).
income: our IV, the per capita personal income.
violent: our DV, the violent crime rate (incidents per 100,000 inhabitants).

Our “research” question is: How does income affect the crime rate?
# Import and preprocess data
import pandas as pd

dataset = pd.read_csv('Guns.csv', usecols=['state', 'year', 'income', 'violent'],
                      index_col=['state', 'year'])
years = dataset.index.get_level_values('year').to_list()
dataset['year'] = pd.Categorical(years)
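
As a quick sanity check (my addition, not part of the original walkthrough), we can verify that the panel is indeed balanced, i.e. every state contributes one row per year from 1977 to 1999:

# Balanced panel check: every state should appear the same number of times (23 years)
counts = dataset.groupby(level='state').size()
print(counts.nunique() == 1, counts.iloc[0])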

Step 2: Start with PooledOLS and check required assumptions

I would recommend starting with PooledOLS. Since it can be seen as a simple OLS model, it has to fulfill certain assumptions (those in the chapter “How to decide which Model is appropriate?”). As stated above, if condition 2 or 3 (or both) is violated, FE-/RE-models are likely more suitable. Since condition 2 can only be tested further down with the Hausman-Test, we will stick to checking condition 3 for now.

Perform PooledOLS:

# Perform PooledOLS
from linearmodels import PooledOLS
import statsmodels.api as sm

exog = sm.tools.tools.add_constant(dataset['income'])
endog = dataset['violent']
mod = PooledOLS(endog, exog)
pooledOLS_res = mod.fit(cov_type='clustered', cluster_entity=True)

# Store values for checking homoskedasticity graphically
fittedvals_pooled_OLS = pooledOLS_res.predict().fitted_values
residuals_pooled_OLS = pooledOLS_res.resids

Check condition 3:

Condition 3 is split into 3a (Homoskedasticity) and 3b (Non-Autocorrelation). These assumptions can be tested with a number of different tests. For condition 3a, I will show you how to identify heteroskedasticity graphically as well as how to perform the White-Test and the Breusch-Pagan-Test (both are similar). For condition 3b, I will show you the Durbin-Watson-Test.
# 3A. Homoskedasticity
import matplotlib.pyplot as plt

# 3A.1 Residuals plot for detecting growing variance
fig, ax = plt.subplots()
ax.scatter(fittedvals_pooled_OLS, residuals_pooled_OLS, color='blue')
ax.axhline(0, color='r', ls='--')
ax.set_xlabel('Predicted Values', fontsize=15)
ax.set_ylabel('Residuals', fontsize=15)
ax.set_title('Homoskedasticity Test', fontsize=30)
plt.show()

[Figure: Residuals plot for heteroskedasticity detection]

Basically, a residuals plot shows the predicted values (x-axis) vs. the residuals (y-axis). If the plotted data points fan out, this is an indicator of growing variance and thus of heteroskedasticity. Since this seems to be the case in our example, we might have a first violation. But let's check this with the White- and the Breusch-Pagan-Test:
# 3A.2 White-Test
from statsmodels.stats.diagnostic import het_white, het_breuschpagan

pooled_OLS_dataset = pd.concat([dataset, residuals_pooled_OLS], axis=1)
pooled_OLS_dataset = pooled_OLS_dataset.drop(['year'], axis=1).fillna(0)
exog = sm.tools.tools.add_constant(dataset['income']).fillna(0)

white_test_results = het_white(pooled_OLS_dataset['residual'], exog)
labels = ['LM-Stat', 'LM p-val', 'F-Stat', 'F p-val']
print(dict(zip(labels, white_test_results)))

# 3A.3 Breusch-Pagan-Test
breusch_pagan_test_results = het_breuschpagan(pooled_OLS_dataset['residual'], exog)
labels = ['LM-Stat', 'LM p-val', 'F-Stat', 'F p-val']
print(dict(zip(labels, breusch_pagan_test_results)))

In simple terms: if p < 0.05, heteroskedasticity is indicated. Both tests return very small p-values (White-Test: 3.442621728589391e-44, Breusch-Pagan-Test: 6.032616972194746e-26).

Therefore, we have confirmed our first violation! Let's now test assumption 3b:

# 3B. Non-Autocorrelation
# Durbin-Watson-Test
from statsmodels.stats.stattools import durbin_watson

durbin_watson_test_results = durbin_watson(pooled_OLS_dataset['residual'])
print(durbin_watson_test_results)

The Durbin-Watson-Test produces a single value between 0 and 4. The midpoint (= 2) indicates that no autocorrelation was identified; 0–2 means positive autocorrelation (the nearer to zero, the stronger the correlation), and 2–4 means negative autocorrelation (the nearer to four, the stronger the correlation). In our example, the result is 0.08937264851640213, which clearly indicates strong positive autocorrelation.

As a consequence, assumption 3b is also violated, so an FE-/RE-model seems more suitable.

So, let's build the models!

Step 3: Perform FE- and RE-model


# FE and RE model
from linearmodels import PanelOLS
from linearmodels import RandomEffects

exog = sm.tools.tools.add_constant(dataset['income'])
endog = dataset['violent']

# Random-effects model
model_re = RandomEffects(endog, exog)
re_res = model_re.fit()

# Fixed-effects model
model_fe = PanelOLS(endog, exog, entity_effects=True)
fe_res = model_fe.fit()

# Print results
print(re_res)
print(fe_res)

Results of the FE-model:

[Figure: FE-model regression output]

Results of the RE-model:

[Figure: RE-model regression output]

In this example, both models perform similarly (although the FE-model seems to perform slightly better). So, in order to test which model should be preferred, we finally perform the Hausman-Test.

Step 4: Perform Hausman-Test

Note: Since I had problems with the hausman function provided in the econtools package (the covariance part was not working), I slightly changed the function. You are welcome to use this version if you are following this guide.
import numpy as np
import numpy.linalg as la
from scipy import stats

def hausman(fe, re):
    b = fe.params
    B = re.params
    v_b = fe.cov
    v_B = re.cov
    df = b[np.abs(b) < 1e8].size
    chi2 = np.dot((b - B).T, la.inv(v_b - v_B).dot(b - B))
    pval = stats.chi2.sf(chi2, df)
    return chi2, df, pval

hausman_results = hausman(fe_res, re_res)
print('chi-Squared: ' + str(hausman_results[0]))
print('degrees of freedom: ' + str(hausman_results[1]))
print('p-Value: ' + str(hausman_results[2]))

Since the p-value is very small (0.008976136961544689), the null hypothesis can be rejected. Accordingly, the FE-model seems to be the most suitable, because we
clearly have endogeneity in our model.

In order to model the endogeneity explicitly, we could now perform regression models like 2SLS (Two-Stage Least Squares), in which instrumental variables help to deal with endogeneity, but that is material for another article ;-)

I really hope you liked this article and that it helps you overcome the common problems of panel data regression. And of course, please don't be too critical, since this is my first post on this platform :-)

Bernhard Brugger
Data Analyst. Knowledge Creator.
