A Guide to Panel Data Regression: Theoretics and Implementation with Python
Bernhard Brugger
Panel data regression is a powerful way to control for the influence of unobserved independent variables on a dependent variable, an influence that can lead to biased estimators in traditional linear regression models. In this article, I want to share the most important theory behind this topic and show how to build a panel data regression model with Python in a step-by-step manner.
My intention in writing this post is twofold: First, in my opinion, it is hard to find an easy and comprehensible explanation of an integrated panel data regression model. Second, performing panel data regression in Python is not as straightforward as in R, for example, which doesn't mean that it is less effective. So, I decided to share the knowledge I gained during a recent project in order to make future panel data analyses a bit easier ;-)
Enough talk! Let's dive into the topic by describing what panel data is and why it is so powerful!
“Panel data is a two-dimensional concept, where the same individuals are observed repeatedly over different periods in time.”
In general, panel data can be seen as a combination of cross-sectional and time-series data. Cross-sectional data is one observation of multiple objects and their corresponding variables at a specific point in time (i.e. the observation is taken once). Time-series data observes a single object recurrently over time. Panel data combines characteristics of both in one model by collecting data from the same, multiple objects over time.
In a nutshell, we can think of it like a timeline along which we periodically observe the same individuals.
Let's go a step further by breaking down the definition above and explaining it step by step on a sample panel dataset (sketched in code below):
“Panel data is a two-dimensional concept […]”: Panel data is commonly stored in a two-dimensional way with rows and columns (here, a dataset with nine rows and four columns). It is important to note that we always need one column to identify the individuals under observation (column person) and one column to document the points in time the data was collected (column year). Those two columns should be seen as a multi-index.
“[…] where the same individuals […]”: We have the individuals person A, person B, and person C, from which we collect the variables x and y. The individuals and the observed variables always stay the same.
Note: This peculiarity is also the main difference to another, often mixed-up data concept, namely pooled cross sections. While both can be seen as cross-sectional data collected over time, the main difference is that panel data always observes the same individuals, while this cannot be guaranteed in pooled cross sections.
“[…] are observed repeatedly over different periods in time.”: We collect data from 2018, 2019, and 2020.
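The sample table in the original article is an image; as a minimal sketch, such a panel dataset could be built in pandas like this (the x and y values are invented for illustration):
import pandas as pd

# Hypothetical sample panel: 3 individuals x 3 years = 9 rows
sample = pd.DataFrame({
    'person': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'year': [2018, 2019, 2020] * 3,
    'x': [1.0, 1.2, 1.1, 2.3, 2.1, 2.5, 0.8, 0.9, 1.0],  # invented values
    'y': [3.1, 3.4, 3.3, 5.0, 4.8, 5.2, 2.0, 2.1, 2.3],  # invented values
})

# The identifying columns form the two-dimensional multi-index
sample = sample.set_index(['person', 'year'])
print(sample)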
So far so good… now we understand what panel data is. But what is the meaning behind this data concept, and why should we use it?
The answer is… heterogeneity and the resulting endogeneity! Maybe you have already heard about this issue in traditional linear regression models, in which heterogeneity often leads to biased results. Panel data is able to deal with that problem.
Since heterogeneity and endogeneity are crucial for understanding why we use panel data models, I will try to explain this problem in a straightforward way in the next section.
“The unobserved dependency on other independent variable(s) is called unobserved heterogeneity, and the correlation between the independent variable(s) and the error term (i.e. the unobserved independent variables) is called endogeneity.”
Let's say we want to analyze how coffee consumption affects the level of concentration. A simple linear regression model would look like this:
Concentration_Level = β₀ + β₁ · Coffee_Consumption + ε
where Concentration_Level is the DV, Coffee_Consumption is the IV, β₀ and β₁ are the coefficients, and ε is the error term.
The goal of this model is to explore the effect of Coffee_Consumption (IV) on Concentration_Level (DV). Assuming that IV and DV are positively correlated, this means that if IV increases, DV also increases:
Coffee_Consumption ↑ ⟹ Concentration_Level ↑
But what if there is another variable that affects the existing IV(s) and is not included in the model? For example, Tiredness has a high chance of affecting Coffee_Consumption (if you are tired, you will obviously drink coffee ;-) ). If you remember the first sentence of this article, such variables are called unobserved independent variables. They are “hidden” behind the error term, and if, e.g., Coffee_Consumption is positively related to such a variable, the error term increases as Coffee_Consumption increases:
Coffee_Consumption ↑ ⟹ ε ↑
This, in turn, leads to an inflated estimate of the effect on the DV Concentration_Level. The estimated coefficient is therefore biased and will lead to inaccurate inferences. In our example, the bias is the artificial extra increase in Concentration_Level.
Luckily, there is a way to deal with this problem… maybe you already guessed it: panel data regression! The advantage of panel data is that we can control for heterogeneity in our regression model by treating heterogeneity as fixed or random. But more on that in the next section!
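To see this bias in numbers, here is a small simulation sketch. It is not from the original article; the data-generating process and all coefficients are invented for illustration. An unobserved Tiredness variable drives both Coffee_Consumption and the error term, so OLS overestimates the coffee effect:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data-generating process (all numbers invented)
tiredness = rng.normal(size=n)                  # unobserved variable
coffee = 0.8 * tiredness + rng.normal(size=n)   # tired people drink more coffee
concentration = 1.0 * coffee + 0.5 * tiredness + rng.normal(size=n)

# OLS omitting Tiredness: the coffee coefficient absorbs part of the
# tiredness effect and comes out above the true value of 1.0
biased = sm.OLS(concentration, sm.add_constant(coffee)).fit()

# OLS including Tiredness: the coffee coefficient is close to 1.0
unbiased = sm.OLS(concentration, sm.add_constant(np.column_stack([coffee, tiredness]))).fit()

print(biased.params[1], unbiased.params[1])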
Notation
All of the common panel data regression models build on the following equation:
y_it = β·X_it + α_i + μ_it
where:
y = DV
X = IV(s)
β = Coefficients
α = Individual Effects
μ = Idiosyncratic Error
1) PooledOLS: PooledOLS can be described as a simple OLS (Ordinary Least Squares) model that is performed on panel data. It ignores time and individual characteristics and focuses only on dependencies between the individuals. However, simple OLS requires that there is no correlation between the unobserved, independent variable(s) and the IVs (i.e. exogeneity). Let's write this down:
Exogeneity assumption: Cov(X_it, α_i + μ_it) = 0
The problem with PooledOLS is that even if the assumption above holds true, alpha might be serially correlated over time. Consequently, PooledOLS is mostly inappropriate for panel data.
Note: To counter this problem, there is another regression model called FGLS (Feasible Generalized Least Squares), which is also used in the random-effects model described below.
2) Fixed-Effects (FE) Model: The FE-model treats the individual effects of unobserved, independent variables as constant (“fixed”) over time. Within FE-models, a relationship between the unobserved, independent variables and the IVs (i.e. endogeneity) is allowed to exist:
Endogeneity allowed: Cov(X_it, α_i) ≠ 0
The trick in a FE-model is this: if we assume alpha to be constant and subtract the individual means from each term of the equation, alpha (i.e. the unobserved heterogeneity) becomes zero and can therefore be neglected:
y_it − ȳ_i = β·(X_it − X̄_i) + (μ_it − μ̄_i)
Only the idiosyncratic error (represented by μ = unobserved factors that change over time and across units) remains, and it has to be exogenous and non-collinear. However, because heterogeneity is controlled this way, the model allows heterogeneity to be present. Unfortunately, due to the fact that the individual effects are fixed, dependencies can only be observed within the individuals.
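To make the demeaning trick concrete, here is a small sketch (my own addition, assuming the sample DataFrame from the sketch further above): subtracting each individual's mean removes anything that is constant per individual:
# Within-transformation: subtract each person's mean from their observations
demeaned = sample - sample.groupby(level='person').transform('mean')

# A column that is constant per person (e.g. a fixed individual effect)
# would be exactly zero after this transformation
print(demeaned)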
Note: An alternative to the FE-model is the LSDV-model (Least Squares Dummy Variables), in which the (fixed) individual effects are represented by dummy variables. This model leads to the exact same results, but has one main disadvantage: it needs a lot more computation power if the regression model is big.
3) Random-Effects (RE) Model: RE-models treat the individual effects of unobserved, independent variables as random variables over time. They are able to “switch” between OLS and FE and hence can focus on both dependencies between and within individuals. The idea behind RE-models is the following quasi-demeaned notation:
Notation: y_it − λ·ȳ_i = β·(X_it − λ·X̄_i) + (v_it − λ·v̄_i), where v_it = α_i + μ_it and λ = 1 − √(σ²_μ / (σ²_μ + T·σ²_α))
In order to include between- as well as within-estimators, we first need to define when to use which estimator. In general, if the covariance between alpha and the IV(s) is zero (or very small), there is no correlation between them and an OLS-model is preferred. If that covariance is not zero, there is a relationship that should be eliminated by using a FE-model:
Cov(α_i, X_it) = 0 → OLS preferred; Cov(α_i, X_it) ≠ 0 → FE preferred
The problem with using OLS, as stated above, is the possible serial correlation of alpha over time. Hence, RE-models decide which model to take according to the serial correlation of the error terms. To do so, the model uses the term lambda (λ). In short, lambda captures how big the variance of alpha is. If it is zero, there is no variance of alpha, which, in turn, means that PooledOLS is the preferred choice. On the other hand, if the variance of alpha becomes very big, lambda tends toward one, and it therefore makes sense to eliminate alpha and go with the FE-model.
Decision-making Process
Now that we know the common models, how do we decide which model to take? Let's have a look at that…
Choosing between PooledOLS and FE/RE: Basically, there are five assumptions for simple linear regression models that must be fulfilled. Two of them can help us in choosing between PooledOLS and FE/RE.
These assumptions are: (1) Linearity, (2) Exogeneity, (3a) Homoskedasticity and (3b) Non-autocorrelation, (4) Independent variables are not stochastic, and (5) No multicollinearity.
If assumption (2) or (3) (or both) is violated, then FE or RE might be more suitable.
Choosing between FE and RE: The answer depends on whether you assume the individual, unobserved heterogeneity to be a constant or a random effect. But this question can also be answered by performing the Hausman-Test.
Hausman-Test: In simple terms, the Hausman-Test is a test for endogeneity. By running the Hausman-Test, the null hypothesis is that the covariance between the IV(s) and alpha is zero. If this is the case, then RE is preferred over FE. If the null hypothesis does not hold, we must go with the FE-model.
So, we now understand the theory behind panel data regression. Let's get to the fun stuff and build the model in Python step by step:
I will use the “Guns.csv” dataset, which is normally provided in R. As stated in the description of this dataset: “Guns is a balanced panel of data on 50 US states,
plus the District of Columbia (for a total of 51 states), by year for 1977–1999.” (Note: a panel dataset is called “balanced” if there are no missing values within the
dataset, otherwise, it would be called “unbalanced”).
For the sake of simplicity, I will only use the following columns of the dataset: state, year, income (per capita personal income), and violent (the violent crime rate).
Our “research” question is: How does income affect the crime rate?
# Import and preprocess data
import pandas as pd

dataset = pd.read_csv('Guns.csv', usecols=['state', 'year', 'income', 'violent'],
                      index_col=['state', 'year'])
years = dataset.index.get_level_values('year').to_list()
dataset['year'] = pd.Categorical(years)
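As a quick side check (my own addition, not from the original article), the “balanced” property mentioned above can be verified by confirming that every state appears equally often:
# Balanced panel check: every state should have the same number of years
obs_per_state = dataset.groupby(level='state').size()
print(obs_per_state.nunique() == 1)  # True for a balanced panel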
I would recommend starting by performing PooledOLS. Since it can be seen as a simple OLS model, it has to fulfill certain assumptions (those listed in the decision-making section above). As stated above, if condition 2 or 3 (or both) is violated, then FE-/RE-models are likely more suitable. Since condition 2 can only be tested further down with the Hausman-Test, we will stick to checking condition 3 for now.
Perform PooledOLS:
# Perform PooledOLS
from linearmodels import PooledOLS
import statsmodels.api as sm

exog = sm.tools.tools.add_constant(dataset['income'])
endog = dataset['violent']
mod = PooledOLS(endog, exog)
pooledOLS_res = mod.fit(cov_type='clustered', cluster_entity=True)

# Store values for checking homoskedasticity graphically
fittedvals_pooled_OLS = pooledOLS_res.predict().fitted_values
residuals_pooled_OLS = pooledOLS_res.resids
Check condition 3:
Condition 3 is split into 3a (Homoskedasticity) and 3b (Non-autocorrelation). These assumptions can be tested with a number of different tests. For condition 3a, I will show you how to identify heteroskedasticity graphically as well as how to perform the White-Test and the Breusch-Pagan-Test (both are similar). For condition 3b, I will show you the Durbin-Watson-Test.
# 3A. Homoskedasticity
import matplotlib.pyplot as plt

# 3A.1 Residuals plot for detecting growing variance
fig, ax = plt.subplots()
ax.scatter(fittedvals_pooled_OLS, residuals_pooled_OLS, color='blue')
ax.axhline(0, color='r', ls='--')
ax.set_xlabel('Predicted Values', fontsize=15)
ax.set_ylabel('Residuals', fontsize=15)
ax.set_title('Homoskedasticity Test', fontsize=30)
plt.show()
Basically, a residuals plot shows predicted values (x-axis) vs. residuals (y-axis). If the plotted data points spread out, this is an indicator of growing variance and thus of heteroskedasticity. Since this seems to be the case in our example, we might have the first violation. But let's check this with the White- and the Breusch-Pagan-Test:
# 3A.2 White-Test
from statsmodels.stats.diagnostic import het_white, het_breuschpagan

pooled_OLS_dataset = pd.concat([dataset, residuals_pooled_OLS], axis=1)
pooled_OLS_dataset = pooled_OLS_dataset.drop(['year'], axis=1).fillna(0)
exog = sm.tools.tools.add_constant(dataset['income']).fillna(0)

white_test_results = het_white(pooled_OLS_dataset['residual'], exog)
labels = ['LM-Stat', 'LM p-val', 'F-Stat', 'F p-val']
print(dict(zip(labels, white_test_results)))

# 3A.3 Breusch-Pagan-Test
breusch_pagan_test_results = het_breuschpagan(pooled_OLS_dataset['residual'], exog)
labels = ['LM-Stat', 'LM p-val', 'F-Stat', 'F p-val']
print(dict(zip(labels, breusch_pagan_test_results)))
In simple terms: if p < 0.05, heteroskedasticity is indicated. Both tests give very small p-values (White-Test: 3.442621728589391e-44, Breusch-Pagan-Test: 6.032616972194746e-26).
Therefore, we have found our first violation! Let's test assumption 3b:
# 3B. Non-Autocorrelation
# Durbin-Watson-Test
from statsmodels.stats.stattools import durbin_watson

durbin_watson_test_results = durbin_watson(pooled_OLS_dataset['residual'])
print(durbin_watson_test_results)
The Durbin-Watson-Test produces a statistic between 0 and 4. The midpoint (= 2) indicates that no autocorrelation was identified, 0 to 2 means positive autocorrelation (the closer to zero, the stronger the correlation), and 2 to 4 means negative autocorrelation (the closer to four, the stronger the correlation). In our example, the result is 0.08937264851640213, which clearly indicates strong positive autocorrelation.
As a consequence, assumption 3b is also violated, so it seems that a FE-/RE-model will be more suitable.
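The original article presents the fitted FE- and RE-models only as result tables (images). As a minimal sketch of how they can be estimated with linearmodels, reusing endog and exog from the PooledOLS step (the exact specification is my assumption, not confirmed by the extracted text):
# FE-model: entity effects absorb time-constant state heterogeneity
from linearmodels import PanelOLS, RandomEffects

model_fe = PanelOLS(endog, exog, entity_effects=True)
fe_res = model_fe.fit()
print(fe_res)

# RE-model
model_re = RandomEffects(endog, exog)
re_res = model_re.fit()
print(re_res)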
Results FE-model: [FE-model regression output shown as a table in the original article]
Results RE-model: [RE-model regression output shown as a table in the original article]
In this example, both perform similarly (although FE seems to perform slightly better). So, in order to test which model should be preferred, we will finally perform the Hausman-Test.
Note: Since I had problems with the hausman function provided in the econtools package (the covariance part was not working), I slightly changed the function. You are welcome to use this function if you are following this guide.
import numpy as np
import numpy.linalg as la
from scipy import stats

def hausman(fe, re):
    # Difference of FE and RE coefficient vectors and covariance matrices
    b = fe.params
    B = re.params
    v_b = fe.cov
    v_B = re.cov
    df = b[np.abs(b) < 1e8].size
    chi2 = np.dot((b - B).T, la.inv(v_b - v_B).dot(b - B))
    pval = stats.chi2.sf(chi2, df)
    return chi2, df, pval
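Calling the function could then look like this (assuming fe_res and re_res are the fitted results from the estimation sketch above):
hausman_results = hausman(fe_res, re_res)
print('chi-Squared:', hausman_results[0])
print('degrees of freedom:', hausman_results[1])
print('p-Value:', hausman_results[2])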
Since the p-value is very small (0.008976136961544689), the null hypothesis can be rejected. Accordingly, the FE-model seems to be the most suitable, because we
clearly have endogeneity in our model.
In order to model this endogeneity, we could now move on to models like 2SLS (Two-Stage Least Squares), in which instrumental variables help to deal with endogeneity, but that is stuff for another article ;-)
I really hope you liked this article and that it helps you overcome the common problems with panel data regression. And of course, please don't be too critical, since this is my first post on this platform :-)
Bernhard Brugger
Data Analyst. Knowledge Creator.