1 Introduction
Regression analysis with the StatsModels package for Python.
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration. The description of the library is available on the PyPI page, the repository that lists the tools and packages devoted to Python. (The French version of this tutorial was written in September 2015; we were using the version 0.6.1.)
In this tutorial, we will try to identify the potentialities of StatsModels by conducting a case study in multiple linear regression. We will discuss: the estimation of the model parameters using the ordinary least squares method, the implementation of some statistical tests, the checking of the model assumptions by analyzing the residuals, the detection of outliers and influential points, the analysis of multicollinearity, and the calculation of the prediction interval for a new instance.
Regarding regression analysis, an excellent reference is the online course available on the PennState Eberly College of Science website: "STAT 501 - Regression Methods".
2 Dataset
We use the "vehicules_1.txt" data file for the construction of the model. It describes n = 26 vehicles through their characteristics: engine size (cylindrée), horsepower (puissance), weight (poids) and consumption (conso, liters per 100 km).
We want to explain / predict the consumption of the cars (y: CONSO) from p = 3 input (explanatory, predictor) variables (X1: cylindrée, X2: puissance, X3: poids).
$$y_i = a_0 + a_1 x_{i1} + a_2 x_{i2} + a_3 x_{i3} + \varepsilon_i, \quad i = 1, \ldots, n$$
The aim of the regression analysis is to estimate the values of the coefficients (𝑎0 , 𝑎1 , 𝑎2 , 𝑎3 )
using the available dataset.
3 Data importation
We use the Pandas package for importing the data file into a data frame structure.
#modifying the default directory
import os
os.chdir("… directory containing the data file …")
In the read_table() command, we specify that the first row contains the names of the variables (header = 0) and that the first column (n°0) contains the labels of the instances (index_col = 0). We then display the dimensions of the dataset and retrieve the values of n (number of instances) and p (number of input variables for the regression).
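A sketch of the corresponding instruction (the tab separator is an assumption about the file format):
#importing the data file into a data frame
import pandas
cars = pandas.read_table("vehicules_1.txt", sep="\t", header=0, index_col=0)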
#dimensions
print(cars.shape)
#number of instances
n = cars.shape[0]
#number of explanatory variables
p = cars.shape[1] - 1
All the variables are numeric (integer or float). The first column (n°0), which contains the labels of the instances, is not counted here.
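The types can be displayed with the dtypes property:
#type of each column
print(cars.dtypes)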
cylindree int64
puissance int64
poids int64
conso float64
dtype: object
#labels of the instances
print(cars.index)
We obtain:
Index(['Maserati Ghibli GT', 'Daihatsu Cuore', 'Toyota Corolla',
'Fort Escort 1.4i PT', 'Mazda Hachtback V', 'Volvo 960 Kombi aut',
'Toyota Previa salon', 'Renault Safrane 2.2. V',
'Honda Civic Joker 1.4', 'VW Golt 2.0 GTI', 'Suzuki Swift 1.0 GLS',
'Lancia K 3.0 LS', 'Mercedes S 600', 'Volvo 850 2.5', 'VW Polo 1.4 60',
'Hyundai Sonata 3000', 'Opel Corsa 1.2i Eco', 'Opel Astra 1.6i 16V',
'Peugeot 306 XS 108', 'Mitsubishi Galant', 'Citroen ZX Volcane',
'Peugeot 806 2.0', 'Fiat Panda Mambo L', 'Seat Alhambra 2.0',
'Ford Fiesta 1.2 Zetec'],
dtype='object', name='modele')
4 Regression analysis
4.1 Launching the analysis
We can perform the modelling step by importing the Statsmodels package. We have two options. (1) The first consists of dividing the data into two parts: a vector containing the values of the target variable CONSO, and a matrix with the explanatory variables CYLINDREE, PUISSANCE, POIDS. Then, we pass them to the OLS tool. This implies some manipulation of the data; in particular, we must convert the data frame structure into a numpy vector and matrix. (2) The second is based on a specific tool (ols) which directly recognizes formulas similar to the ones used under R [e.g. the lm() function]. The Pandas data frame structure can be used directly in this case. We prefer this second solution.
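For reference, a minimal sketch of the first option (the variable names y, X and res1 are ours):
#option (1): passing a vector and a matrix to the OLS tool
import statsmodels.api as sm
#vector of the target variable
y = cars['conso'].values
#matrix of the explanatory variables, with a column of 1 for the intercept
X = sm.add_constant(cars[['cylindree', 'puissance', 'poids']].values)
#estimation
res1 = sm.OLS(y, X).fit()
print(res1.params)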
#regression with formula
import statsmodels.formula.api as smf
#instantiation
reg = smf.ols('conso ~ cylindree + puissance + poids', data = cars)
reg is an instance of the ols class. We can list its members (properties and methods) with the dir() command.
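For instance:
#members of the reg object
print(dir(reg))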
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__',
'__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__',
'__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
'__subclasshook__', '__weakref__', '_data_attr', '_df_model', '_df_resid',
'_get_init_kwds', '_handle_data', '_init_keys', 'data', 'df_model', 'df_resid',
We use the fit() command to launch the modelling process on the dataset.
#launching the modelling process
res = reg.fit()
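The full table of results can be displayed with the summary() method; the end of this output includes the warnings below.
#detailed results of the regression
print(res.summary())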
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.14e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
The object res enables us to perform intermediate calculations. Some are relatively simple. Below, we display some important results (estimated coefficients, R²). We also try to calculate the F-statistic manually; of course, we obtain the same value as the one provided by the res object.
#estimated coefficients
print(res.params)
#R2
print(res.rsquared)
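The F statistic can be recomputed manually from the R² and compared with the value stored in the res object (a small sketch):
#manual computation of the F-statistic from R2
F = (res.rsquared / p) / ((1 - res.rsquared) / (n - p - 1))
print(F)
#value provided by statsmodels
print(res.fvalue)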
Other calculations are possible, giving us access to more sophisticated procedures such as the test that a subset of the coefficients is null.
We test here that all the slopes are 0. This is a special case of testing that a subset of coefficients is null. The null hypothesis may be written in matrix form.
$$H_0: Ra = 0$$
where
$$R = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\quad \text{and} \quad
a = \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix},$$
that is,
$$H_0: \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
\quad \Longleftrightarrow \quad
H_0: \begin{cases} a_1 = 0 \\ a_2 = 0 \\ a_3 = 0 \end{cases}$$
#matrix R
import numpy as np
R = np.array([[0,1,0,0],[0,0,1,0],[0,0,0,1]])
#F test of H0: Ra = 0
print(res.f_test(R))
5 Regression diagnostics
5.1 Model assumptions – Test for error normality
One of the main assumptions for the inferential part of the regression (OLS, ordinary least squares) is that the errors follow a normal distribution. A first important verification is to check the compatibility of the residuals (the errors observed on the sample) with this assumption.
Jarque-Bera Test. We use the stattools module in order to perform the Jarque-Bera test. This test checks whether the observed skewness and kurtosis match those of a normal distribution.
#Jarque-Bera normality test
import statsmodels.api as sm
JB, JBpv,skw,kurt = sm.stats.stattools.jarque_bera(res.resid)
print(JB,JBpv,skw,kurt)
We have respectively: the statistic JB, the p-value of the test JBpv, the skewness skw and the
kurtosis kurt.
0.7721503004927122 0.679719442677 -0.3693040742424057 2.5575948785729956
We observe that the values obtained (JB, JBpv) are consistent with those provided by the summary() command above. Here, we can consider that the errors are compatible with a normal distribution at the 5% level.
Normal probability plot. The normal probability plot is a graphical technique to identify
substantive departures from normality. It is based on the comparison between the observed
distribution and the theoretical distribution under the normal assumption. The null
hypothesis (normal distribution) is rejected if the points are not aligned on a straight line.
We use the qqplot() procedure.
#qqplot vs. normal distribution
sm.qqplot(res.resid)
The graph confirms the Jarque-Bera test. The points are approximately aligned. But there seem to be some problems with the high values of the residuals. This suggests the existence of atypical points in our data.
5.2 Detection of outliers and influential points
We use a specific object provided by the regression result to analyze the influential points.
#object for the analysis of influential points
infl = res.get_influence()
#members
print(dir(infl))
Knowing how to code allows us to verify by calculation the results proposed by the different procedures. For instance, we can obtain the internally studentized residuals $t_i$ from the residuals $\hat{\varepsilon}_i = y_i - \hat{y}_i$, the leverages $h_i$, and the regression standard error $\hat{\sigma}$ (the square root of the scale property of the regression result, res.scale):
$$t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma} \sqrt{1 - h_i}}$$
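A sketch of this manual computation (residus is our own name; leviers and res_stds are reused later in this tutorial):
#manual computation of the internally studentized residuals
import numpy as np
#raw residuals
residus = res.resid
#leverages, i.e. the diagonal of the hat matrix
leviers = infl.hat_matrix_diag
#internally studentized residuals
res_stds = residus / np.sqrt(res.scale * (1.0 - leviers))
print(res_stds)
#comparison with the values provided by statsmodels
print(infl.resid_studentized_internal)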
The values are consistent with those provided directly by the resid_studentized_internal property. Fortunately, we would have been quite disturbed otherwise.
Externally studentized residuals. We repeat the approach for the externally studentized
residuals, using the following formula:
$$t_i^* = t_i \sqrt{\frac{n-p-2}{n-p-1-t_i^2}}$$
We compare the values provided by the property resid_studentized_external and the values
obtained by the formula above.
#values provided by the property of the object
print(infl.resid_studentized_external)
#checking with the formula
res_studs = res_stds*np.sqrt((n-p-2)/(n-p-1-res_stds**2))
print(res_studs)
The problematic observations are the Mitsubishi Galant, Hyundai Sonata, Mercedes S 600 and Maserati Ghibli GT.
Leverage. An observation is suspicious if its leverage exceeds the threshold value:
$$h_i > s_h, \quad \text{where } s_h = 2 \times \frac{p+1}{n}$$
#threshold leverage
seuil_levier = 2*(p+1)/n
print(seuil_levier)
#identification
atyp_levier = leviers > seuil_levier
print(atyp_levier)
Python displays a vector of Boolean values. True represents instances with a leverage higher
than the threshold value.
[ True False False False False False False False False False False False
True False False False False False False False False False False False False]
The reading is not easy. It is more convenient to display the index of the vehicles.
#which vehicles?
print(cars.index[atyp_levier],leviers[atyp_levier])
We have both the car model and the values of the leverage.
Index(['Maserati Ghibli GT', 'Mercedes S 600'], dtype='object', name='modele') [
0.72183931 0.72325833]
For the externally studentized residuals, an observation is suspicious if
$$|t_i^*| > s_t, \quad \text{where } s_t = t_{1-\frac{0.05}{2}}(n-p-2)$$
is the quantile of the Student distribution with $(n-p-2)$ degrees of freedom.
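A possible computation of the threshold and of the boolean vector used below (a sketch; seuil_stud is our own name):
#threshold of the Student distribution (n-p-2 degrees of freedom)
import numpy as np
import scipy.stats
seuil_stud = scipy.stats.t.ppf(0.975, df=n-p-2)
print(seuil_stud)
#identification of the suspicious observations
atyp_stud = np.abs(res_studs) > seuil_stud
print(atyp_stud)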
#which ones?
print(cars.index[atyp_stud],res_studs[atyp_stud])
These are,
Index(['Hyundai Sonata 3000', 'Mitsubishi Galant'], dtype='object', name='modele')
[ 2.33908893 -2.57180996]
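To list all the vehicles flagged by at least one of the two rules, a possible combination (a sketch; atyp_all is our own name):
#combining the leverage and studentized residual criteria
import numpy as np
atyp_all = np.logical_or(atyp_levier, atyp_stud)
#which vehicles?
print(cars.index[atyp_all])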
We identify 4 vehicles overall.
Index(['Maserati Ghibli GT', 'Mercedes S 600', 'Hyundai Sonata 3000',
'Mitsubishi Galant'], dtype='object', name='modele')
Other criteria. The tool provides other criteria such as DFFITS, Cook’s distance, etc. We can
present them in a tabular form.
#Other criteria for detecting influential points
print(infl.summary_frame().filter(["hat_diag","student_resid","dffits","cooks_d"]))
The usual cut-off values are:
• $|DFFITS_i| > 2\sqrt{\frac{p+1}{n}}$ for the DFFITS;
• $D_i > \frac{4}{n-p-1}$ for the Cook's distance.
hat_diag student_resid dffits cooks_d
modele
Maserati Ghibli GT 0.721839 1.298823 2.092292 1.059755
Daihatsu Cuore 0.174940 0.701344 0.322948 0.026720
Toyota Corolla 0.060524 -0.708326 -0.179786 0.008277
Fort Escort 1.4i PT 0.059843 0.804633 0.203004 0.010479
Mazda Hachtback V 0.058335 0.105469 0.026251 0.000181
Volvo 960 Kombi aut 0.098346 0.933571 0.308322 0.023912
Toyota Previa salon 0.306938 0.603915 0.401898 0.041640
Renault Safrane 2.2. V 0.094349 0.845397 0.272865 0.018870
Honda Civic Joker 1.4 0.072773 -1.303472 -0.365168 0.032263
VW Golt 2.0 GTI 0.052438 0.815140 0.191757 0.009342
Suzuki Swift 1.0 GLS 0.112343 -0.473508 -0.168453 0.007366
Lancia K 3.0 LS 0.080713 -0.888366 -0.263232 0.017498
Mercedes S 600 0.723258 -1.525140 -2.465581 1.429505
Volvo 850 2.5 0.053154 0.460959 0.109217 0.003098
VW Polo 1.4 60 0.088934 -0.648581 -0.202639 0.010557
Hyundai Sonata 3000 0.250186 2.339089 1.351145 0.376280
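As an illustration, these cut-off values can be applied programmatically (a sketch; seuil_dffits and seuil_cook are our own variable names):
#applying the DFFITS and Cook's distance thresholds
import numpy as np
seuil_dffits = 2 * np.sqrt((p + 1) / n)
seuil_cook = 4 / (n - p - 1)
frame = infl.summary_frame()
#vehicles flagged by at least one of the two criteria
print(cars.index[(np.abs(frame['dffits']) > seuil_dffits) | (frame['cooks_d'] > seuil_cook)])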
Note: Detecting outliers or influential points is one thing, dealing with them is another. Indeed, we cannot remove them systematically. It is necessary to identify why an observation is problematic and thus to determine the most appropriate solution, which may be deletion, but not systematically. For instance, let us take a simple situation. A point can be atypical because it takes an unusual value on a variable. If the variable selection process leads to the exclusion of this variable from the model, what should be done then? Reintroduce the point? Leave it as is? There is no pre-determined solution. The modelling process is exploratory in nature.
6 Multicollinearity problem
The multicollinearity problem disturbs the statistical inference, in part because it inflates the
estimated standard error of coefficients. There are different ways of identifying the
multicollinearity. We study a few detection techniques based on the analysis of the
correlation matrix in this section.
Correlation matrix. A rule of thumb is to compare the absolute value of the correlation between each pair of variables with the threshold value 0.8. In Python, we copy the explanatory variables into a matrix. Then, we use the corrcoef() procedure from the scipy library.
#correlation matrix
import scipy
#matrix of the explanatory variables
cars_exog = cars[['cylindree','puissance','poids']]
mc = scipy.corrcoef(cars_exog,rowvar=0)
print(mc)
Klein’s rule of thumb. It consists in comparing the square of the correlation between the
pairs of predictors with the overall R2 (R² = 0.957) of the regression. It is interesting because
it takes into account the characteristics of the regression.
#Klein’s rule of thumb
mc2 = mc**2
print(mc2)
None of the values exceed the R², but there are uncomfortable similarities nonetheless.
[[ 1. 0.89686872 0.76530582]
[ 0.89686872 1. 0.68552706]
[ 0.76530582 0.68552706 1. ]]
Variance Inflation Factor (VIF). This criterion makes it possible to evaluate the relationship
of one predictor with all other explanatory variables. We can read its value on the diagonal
of the inverse of the correlation matrix.
#VIF criterion
vif = np.linalg.inv(mc)
print(vif)
A possible rule for multicollinearity detection is (VIF > 4). Here also, we note that the
multicollinearity problem affects our regression.
[[ 12.992577 -9.20146328 -3.7476397 ]
[ -9.20146328 9.6964853 0.02124552]
[ -3.7476397 0.02124552 4.26091058]]
Possible solutions. Regularized least squares or variable selection approaches are possible solutions for the multicollinearity problem. It seems that they are not available in the Statsmodels package. But this is not really a problem: Python is a powerful programming language, and it would be easy for us to program additional calculations from the objects provided by Statsmodels.
7 Prediction interval for new instances
In our case, this is above all an exercise. We apply the model to a set of observations that were not used during the modelling process. We therefore have the values of both the predictors and the target variable. This allows us to verify the reliability of our model by comparing the predicted and the observed values. It is a kind of holdout validation scheme. This approach is widely used in classification problems.
For these n* = 6 vehicles, we calculate the predictions and we compare them with the observed values of the target attribute.
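The loading of this second file might look like the following sketch (the file name "vehicules_2.txt" is an assumption; the options mirror those used for the first file):
#importing the second data file (file name assumed)
import pandas
cars2 = pandas.read_table("vehicules_2.txt", sep="\t", header=0, index_col=0)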
#number of instances
n_pred = cars2.shape[0]
The structure of the data file (index, columns) is identical to the first dataset.
cylindree puissance poids conso
modele
Nissan Primera 2.0 1997 92 1240 9.2
Fiat Tempra 1.6 Liberty 1580 65 1080 9.3
Opel Omega 2.5i V6 2496 125 1670 11.3
Subaru Vivio 4WD 658 32 740 6.8
Seat Ibiza 2.0 GTI 1983 85 1075 9.5
Ferrari 456 GT 5474 325 1690 21.3
The predicted value of the response is obtained by applying the model on the predictor
values. Using the matrix form is better when the number of instances to process is high:
$$\hat{y}^* = X^* \hat{a}$$
Where â is the vector of estimated coefficients, including the intercept. So that the
calculation is consistent, we must add to the X* matrix a column of 1. The dimension of the
matrix X* is (n*, p+1).
In Python, to create the matrix with this additional column, we use the add_constant() procedure.
#predictor columns
cars2_exog = cars2[['cylindree','puissance','poids']]
#add a column of 1
cars2_exog = sm.add_constant(cars2_exog)
print(cars2_exog)
The constant value 1 is in the first column. The target variable CONSO is not included in the
X* matrix of course.
const cylindree puissance poids
modele
Nissan Primera 2.0 1 1997 92 1240
Fiat Tempra 1.6 Liberty 1 1580 65 1080
Opel Omega 2.5i V6 1 2496 125 1670
Subaru Vivio 4WD 1 658 32 740
Seat Ibiza 2.0 GTI 1 1983 85 1075
Ferrari 456 GT 1 5474 325 1690
Then, we apply the coefficients of the regression with the predict() command…
#point prediction by applying the regression coefficients
pred_conso = reg.predict(res.params,cars2_exog)
print(pred_conso)
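As a check of the formula above, the same values can be obtained with an explicit matrix product (a small sketch):
#prediction as the matrix product of X* and the estimated coefficients
import numpy as np
print(np.dot(cars2_exog.values, res.params))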
A scatterplot enables us to check the quality of the prediction. We set on the x axis the observed values of the response variable, and on the y axis the predicted values. If the prediction were perfect, the plotted points would be aligned along the diagonal. We use the matplotlib package for Python.
#comparison observed vs. predicted
import matplotlib.pyplot as plt
plt.scatter(cars2['conso'],pred_conso)
plt.plot(np.arange(5,23),np.arange(5,23))
plt.xlabel('Observed values')
plt.ylabel('Predicted values')
plt.xlim(5,22)
plt.ylim(5,22)
plt.show()
The prediction is acceptable for 5 points. For the 6th, the Ferrari, the regression model
strongly underestimates the consumption (17.33 vs. 21.3).
A point prediction is a first step. A prediction interval is always more interesting because we can associate a probability of error with the result provided.
Standard error of the prediction. The calculation is quite complex. We need to compute the standard error of the prediction for each individual, using the following formula (squared value of the standard error of the prediction) for a new instance $i^*$:
$$\hat{\sigma}^2_{\hat{\varepsilon}_{i^*}} = \hat{\sigma}^2 \left[ 1 + X_{i^*} (X'X)^{-1} X'_{i^*} \right]$$
Some of the required information is provided by the Python objects reg and res (section 4.1): $\hat{\sigma}^2$ is the square of the residual standard error (the scale property of res), and $(X'X)^{-1}$ is obtained from the matrix of the explanatory variables. The rest is related to the new individual: $X_{i^*}$ contains the values of its input attributes, including the constant 1. For instance,
for (Nissan Primera 2.0), we have the vector (1 ; 1997 ; 92 ; 1240).
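A possible way to obtain these values (a sketch; X, xtx_inv and X_new are our own variable names):
#squared standard error of the prediction for each new instance
import numpy as np
import statsmodels.api as sm
#matrix of the explanatory variables of the training sample, with the constant
X = sm.add_constant(cars[['cylindree', 'puissance', 'poids']]).values
#(X'X)^(-1)
xtx_inv = np.linalg.inv(np.dot(X.T, X))
#values of the new instances, with the constant
X_new = cars2_exog.values
#squared standard errors of the prediction
var_err = res.scale * (1 + np.sum(np.dot(X_new, xtx_inv) * X_new, axis=1))
print(var_err)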
We obtain:
[ 0.51089413 0.51560511 0.56515759 0.56494062 0.53000431 1.11282003]
Prediction interval. Now we calculate the lower and upper bounds of the interval for a 95% confidence level, using the quantile of the Student distribution and the point prediction.
#quantile of the Student distribution (0.975)
import scipy.stats
qt = scipy.stats.t.ppf(0.975,df=n-p-1)
#lower bound
yb = pred_conso - qt * np.sqrt(var_err)
print(yb)
#upper bound
yh = pred_conso + qt * np.sqrt(var_err)
print(yh)
Checking of the prediction interval. The evaluation takes on a concrete dimension here since we check, for a given confidence level, whether the prediction interval contains the true (observed) value of the response variable. We organize the presentation in such a way that the observed values are surrounded by the lower and upper bounds of the intervals.
#matrix with the various values (lower bound, observed, upper bound)
a = np.resize(yb,new_shape=(n_pred,1))
y_obs = cars2['conso']
a = np.append(a,np.resize(y_obs,new_shape=(n_pred,1)),axis=1)
a = np.append(a,np.resize(yh,new_shape=(n_pred,1)),axis=1)
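To display this matrix with the car models as row labels and explicit column names, a possible sketch (df_interval is our own name; B.Basse and B.Haute stand for the lower and upper bounds, as in the output below):
#presentation as a data frame (lower bound, observed value, upper bound)
import pandas
df_interval = pandas.DataFrame(a, index=cars2.index, columns=['B.Basse', 'Y.Obs', 'B.Haute'])
print(df_interval)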
It is confirmed: the Ferrari crosses the limits (if I may say so).
B.Basse Y.Obs B.Haute
modele
Nissan Primera 2.0 8.100926 9.2 11.073811
Fiat Tempra 1.6 Liberty 6.642306 9.3 9.628866
Opel Omega 2.5i V6 11.022800 11.3 14.149581
Subaru Vivio 4WD 4.242270 6.8 7.368451
Seat Ibiza 2.0 GTI 7.009954 9.5 10.037930
Ferrari 456 GT 15.138681 21.3 19.526262
8 Conclusion
The StatsModels package provides interesting features for statistical modeling. Coupled with
the Python language and other packages (numpy, scipy, pandas, etc.), the possibilities
become immense. The skills that we have been able to develop under R are very easy to
transpose here.