A Guide to Panel Data Regression: Theoretics and Implementation with Python
Enough talk! Let's dive into the topic by describing what panel data is and why it is so powerful!
Let's go a step further by breaking down the definition above and explaining it step by step with a sample panel dataset:
“[…] where the same individuals […]”: We have the individuals person A, person B, and person C, from which we collect the variables x and y. The individuals and the observed variables will always stay the same.
Note: This peculiarity is also the main difference to another, often mixed-up data concept, namely pooled cross-sections. While both can be seen as cross-sectional data collected over time, the main difference is that panel data always observes the same individuals, whereas this is not guaranteed for pooled cross-sections.
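To make the structure concrete, here is a minimal illustration of such a panel in pandas (the persons, years, and values are hypothetical and only serve as an example):

import pandas as pd

# Hypothetical balanced panel: persons A, B, C observed in 2020 and 2021
panel = pd.DataFrame({
    'person': ['A', 'A', 'B', 'B', 'C', 'C'],
    'year':   [2020, 2021, 2020, 2021, 2020, 2021],
    'x':      [1.0, 1.2, 0.8, 0.9, 1.5, 1.4],
    'y':      [10, 12, 8, 9, 15, 14],
}).set_index(['person', 'year'])

print(panel)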
So far so good… now we understand what panel data is. But what is the idea behind this data concept, and why should we use it?
Since heterogeneity and endogeneity are crucial for understanding why we use panel data models, I will try to explain this problem in a straightforward way in the next section.
Let's say we want to analyze how coffee consumption affects the level of concentration. A simple linear regression model would look like this:
Concentration = β0 + β1 · Coffee_Consumption + u

where:
β0 is the intercept,
β1 is the coefficient of Coffee_Consumption, and
u is the error term.
But what if there is another variable that affects the existing IV(s) and is not included in the model? For example, Tiredness has a high chance of affecting Coffee_Consumption (if you are tired, you will obviously drink coffee ;-) ). If you remember the first sentence of this article, such variables are called unobserved independent variables. They are “hidden” behind the error term, and if, e.g., Coffee_Consumption is positively related to such a variable, the error term would increase as Coffee_Consumption increases:
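To make this concrete, here is a small simulation (not from the original article; all numbers are made up) in which Tiredness drives Coffee_Consumption but is left out of the regression, so its influence ends up in the error term and biases the estimated coffee coefficient:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000

# Hypothetical data: tiredness increases coffee consumption and lowers concentration
tiredness = rng.normal(size=n)
coffee_consumption = 0.8 * tiredness + rng.normal(size=n)
concentration = 2.0 * coffee_consumption - 1.5 * tiredness + rng.normal(size=n)

# Short model that omits tiredness: its effect is hidden in the error term
X_short = sm.add_constant(coffee_consumption)
print(sm.OLS(concentration, X_short).fit().params)  # coffee coefficient is biased away from 2.0

# Full model that includes tiredness: the bias disappears
X_full = sm.add_constant(np.column_stack([coffee_consumption, tiredness]))
print(sm.OLS(concentration, X_full).fit().params)   # coffee coefficient is close to 2.0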
Luckily, there is a way to deal with this problem… maybe you already guessed it: panel data regression! The advantage of panel data is that we can control for this heterogeneity in our regression model by treating it as fixed or random. But more on that in the next section!
PooledOLS
Notation

y_it = β · X_it + α_i + μ_it (i = individual, t = time period)

where:
y = DV
X = IV(s)
β = Coefficients
α = Individual Effects
μ = Idiosyncratic Error
Exogeneity Assumption

Cov(X_it, α_i + μ_it) = 0, i.e. the regressors must be uncorrelated with the composite error term (the individual effects plus the idiosyncratic error).

The problem with PooledOLS is that even if the assumption above holds true, alpha might be serially correlated over time. Consequently, PooledOLS is mostly not the best choice.
Note: To counter this problem, there is another regression model called FGLS
(Feasible Generalized Least Squares), which is also used in random effects models
described below.
Fixed-Effects (FE-) Model
Endogeneity allowed
The trick in an FE-model is that if we treat alpha as constant and subtract the individual mean values from each term of the equation, alpha (i.e. the unobserved heterogeneity) becomes zero and can therefore be neglected:
Notation

(y_it − ȳ_i) = β · (X_it − X̄_i) + (μ_it − μ̄_i)

Since α_i is constant over time, it equals its own mean and drops out of the demeaned equation.
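As a quick illustration of this within transformation (a sketch on a hypothetical mini-panel, not code from the article), the demeaning can be done with a pandas groupby over the entities:

import pandas as pd

# Hypothetical panel: two persons observed over three years
df = pd.DataFrame({
    'person': ['A'] * 3 + ['B'] * 3,
    'year':   [2019, 2020, 2021] * 2,
    'x':      [1.0, 1.5, 2.0, 3.0, 3.5, 4.0],
    'y':      [10.0, 12.0, 13.0, 20.0, 21.0, 23.0],
}).set_index(['person', 'year'])

# Within transformation: subtract each person's mean from their observations,
# which removes everything that is constant per person (the alpha term)
demeaned = df - df.groupby(level='person').transform('mean')
print(demeaned)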
Random-Effects (RE-) Model
The problem with using OLS, as stated above, is the serial correlation of alpha over time. Hence, RE-models determine which model to take according to the serial correlation of the error terms. To do so, the model uses the term lambda. In short, lambda measures how big the variance of alpha is. If it is zero, there is no variance of alpha, which, in turn, means that PooledOLS is the preferred choice. On the other hand, if the variance of alpha becomes very big, lambda tends towards one, and it therefore makes sense to eliminate alpha and go with the FE-model.
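For reference, a common textbook form of this weighting parameter (not shown in the article itself) is:

λ = 1 − sqrt( σ²_μ / (σ²_μ + T · σ²_α) )

where σ²_α is the variance of the individual effects, σ²_μ is the variance of the idiosyncratic error, and T is the number of time periods. With σ²_α = 0, λ = 0 and the RE-model collapses to PooledOLS; as T · σ²_α grows large relative to σ²_μ, λ approaches one and the RE transformation approaches the FE (within) transformation.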
Decision-making Process
Now that we know the common models, how do we decide which model to take? Let's have a look at that…
Choosing between PooledOLS and FE/RE: Basically, there are five assumptions for
simple linear regression models that must be fulfilled. Two of them can help us in
choosing between PooledOLS and FE/RE.
These assumptions are (1) Linearity, (2) Exogeneity, (3a) Homoskedasticity and (3b) Non-autocorrelation, (4) Independent variables are not stochastic, and (5) No multicollinearity.
If assumption (2) or (3) (or both) are violated, then FE or RE might be more
suitable.
So, we now understand the theory behind panel data regression. Let's get to the fun stuff and build the model in Python step by step:
I will use the “Guns.csv” dataset, which is normally provided in R. As stated in the
description of this dataset: “Guns is a balanced panel of data on 50 US states, plus
the District of Columbia (for a total of 51 states), by year for 1977–1999.” (Note: a
panel dataset is called “balanced” if there are no missing values within the dataset,
otherwise, it would be called “unbalanced”).
For the sake of simplicity, I will only use the following columns provided by the dataset:
Year: The column Year contains our periodically collected data (1977–1999).
Income: Income is our IV and represents per capita personal income.
Violent: Violent is our DV and contains the violent crime rate (incidents per 100,000 inhabitants).
Our “research” question would be: How does income affect the crime rate?
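The full preprocessing is not reproduced in this copy; a minimal loading sketch (assuming a local Guns.csv with the columns state, year, income, and violent) could look like this:

import pandas as pd

# Load the panel and set the (entity, time) MultiIndex expected by linearmodels
dataset = pd.read_csv('Guns.csv', usecols=['state', 'year', 'income', 'violent'])
dataset = dataset.set_index(['state', 'year'])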
years = dataset.index.get_level_values('year').to_list()
dataset['year'] = pd.Categorical(years)
Perform PooledOLS:
# Perform PooledOLS
from linearmodels import PooledOLS
import statsmodels.api as sm
exog = sm.tools.tools.add_constant(dataset['income'])
endog = dataset['violent']
mod = PooledOLS(endog, exog)
pooledOLS_res = mod.fit(cov_type='clustered', cluster_entity=True)
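The snippet that stores the fitted values and residuals for the following diagnostics is not visible in this copy; a sketch using the fitted_values and resids attributes of the linearmodels results object, and building a combined DataFrame with a 'residual' column for the tests below, might look like this:

# Store fitted values and residuals of the PooledOLS model for the diagnostics below
fittedvals_pooled_OLS = pooledOLS_res.fitted_values
residuals_pooled_OLS = pooledOLS_res.resids

pooled_OLS_dataset = dataset.copy()
pooled_OLS_dataset['residual'] = residuals_pooled_OLS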
Check condition 3:
For condition 3a, I will plot the residuals and run the White-Test and the Breusch-Pagan-Test (both are similar). For condition 3b, I will show you the Durbin-Watson-Test.
# 3A. Homoskedasticity
import matplotlib.pyplot as plt

# 3A.1 Residuals-Plot for growing Variance Detection
fig, ax = plt.subplots()
ax.scatter(fittedvals_pooled_OLS, residuals_pooled_OLS, color='blue')
ax.axhline(0, color='r', ls='--')
ax.set_xlabel('Predicted Values', fontsize=15)
ax.set_ylabel('Residuals', fontsize=15)
ax.set_title('Homoskedasticity Test', fontsize=30)
plt.show()
# 3A.2 White-Test
from statsmodels.stats.diagnostic import het_white, het_breuschpagan

white_test_results = het_white(pooled_OLS_dataset['residual'], exog)
labels = ['LM-Stat', 'LM p-val', 'F-Stat', 'F p-val']
print(dict(zip(labels, white_test_results)))

# 3A.3 Breusch-Pagan-Test
breusch_pagan_test_results = het_breuschpagan(pooled_OLS_dataset['residual'], exog)
print(dict(zip(labels, breusch_pagan_test_results)))
In simple terms, if p < 0.05, then heteroskedasticity is indicated. Both tests give very
small p-values (White-test: 3.442621728589391e-44, Breusch-Pagan-test:
6.032616972194746e-26).
Therefore, we have found our first violation! Let's check assumption 3b:
# 3.B Non-Autocorrelation
# Durbin-Watson-Test
from statsmodels.stats.stattools import durbin_watson

durbin_watson_test_results = durbin_watson(pooled_OLS_dataset['residual'])
print(durbin_watson_test_results)
The Durbin-Watson-Test produces a single value between 0 and 4. A value around the mean of 2 indicates no autocorrelation, values between 0 and 2 indicate positive autocorrelation (the closer to zero, the stronger the correlation), and values between 2 and 4 indicate negative autocorrelation (the closer to four, the stronger the correlation). In our example, the result is 0.08937264851640213, which clearly indicates strong positive autocorrelation. Since conditions 3a and 3b are both violated, let's perform the FE- and RE-models:
# FE and RE model
from linearmodels import PanelOLS
from linearmodels import RandomEffects
exog = sm.tools.tools.add_constant(dataset['income'])
endog = dataset['violent']
# random effects model
model_re = RandomEffects(endog, exog)
re_res = model_re.fit()
# fixed effects model
model_fe = PanelOLS(endog, exog, entity_effects = True)
fe_res = model_fe.fit()
#print results
print(re_res)
print(fe_res)
Results FE-model:
FE-model results
Results RE-model:
RE-model results
import numpy.linalg as la
from scipy import stats
import numpy as np
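These imports belong to a test that compares the FE- and RE-estimates; the actual code is not visible in this copy. A common way to run such a (Durbin-Wu-)Hausman test by hand, sketched here using the imports above and assuming fe_res and re_res are the fitted results from the previous step, is:

def hausman(fe, re):
    # Differences of the coefficient vectors and of their covariance matrices
    b = fe.params
    B = re.params
    v_b = fe.cov
    v_B = re.cov
    # Degrees of freedom = number of compared coefficients
    df = b.shape[0]
    # Hausman statistic: (b - B)' [Var(b) - Var(B)]^(-1) (b - B)
    chi2 = np.dot((b - B).T, la.inv(v_b - v_B).dot(b - B))
    pval = stats.chi2.sf(chi2, df)
    return chi2, df, pval

chi2, df, pval = hausman(fe_res, re_res)
print('chi-Squared:', chi2)
print('degrees of freedom:', df)
print('p-Value:', pval)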
Since the p-value is very small (0.008976136961544689), the null hypothesis can
be rejected. Accordingly, the FE-model seems to be the most suitable, because we
clearly have endogeneity in our model.
In order to model endogeneity, we could now perform regression models like 2SLS (2-Stage Least Squares), in which instrumental variables help to deal with endogeneity, but this is stuff for another article ;-)
I really hope you liked this article and that it helps you overcome the common problems with panel data regression. And of course, please don't be too critical, since this is my first post on this platform :-)