3.1. Statistics in Python - Scipy Lecture Notes
Requirements
Standard scientific Python environment (numpy, scipy, matplotlib)
Pandas
Statsmodels
Seaborn
To install Python and these dependencies, we recommend that you download Anaconda Python or
Enthought Canopy, or preferably use the package manager if you are on Ubuntu or another Linux distribution.
See also:
Bayesian statistics in Python: This chapter does not cover tools for Bayesian statistics. Of particular
interest for Bayesian modelling is PyMC, which implements a probabilistic programming
language in Python.
Read a statistics book: The Think Stats book is available as a free PDF or in print and is a great
introduction to statistics.
Contents
Data representation and interaction
Data as a table
The pandas data-frame
Hypothesis testing: comparing two groups
Student’s t-test: the simplest statistical test
Paired tests: repeated measurements on the same individuals
Linear models, multiple factors, and analysis of variance
“formulas” to specify statistical models in Python
Multiple Regression: including multiple factors
Post-hoc hypothesis testing: analysis of variance (ANOVA)
More visualization: seaborn for statistical exploration
Pairplot: scatter matrices
https://scipy-lectures.org/packages/statistics/index.html 1/20
12/3/21, 7:12 PM 3.1. Statistics in Python — Scipy lecture notes
In this document, the Python inputs are represented with the sign “>>>”.
The setting that we consider for statistical analysis is that of multiple observations or samples described
by a set of different attributes or features. The data can then be seen as a 2D table, or matrix, with columns
giving the different attributes of the data, and rows the observations. For instance, the data contained
in examples/brain_size.csv:
"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545
We will store and manipulate this data in a pandas.DataFrame, from the pandas module. It is the Python
equivalent of the spreadsheet table. It is different from a 2D numpy array as it has named columns, can contain
a mixture of different data types by column, and has elaborate selection and pivoting mechanisms.
Reading from a CSV file: The above CSV file gives observations of brain size, weight and
IQ (Willerman et al. 1991); the data are a mixture of numerical and categorical values.
Separator
It is a CSV file, but the separator is “;” rather than a comma.
Missing values
The weight of the second individual is missing in the CSV file, where it is recorded as “.”. If we don’t
specify the missing value (NA = not available) marker, the Weight column cannot be read as numbers and
we will not be able to do statistical analysis on it.
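The two points above can be sketched on the excerpt of examples/brain_size.csv shown earlier, parsed here from an in-memory string. pandas.read_csv takes the separator through sep and the missing-value marker through na_values. Only the five rows printed above are used, not the full file, and the variable names are illustrative.

```python
import io

import pandas as pd

# The five rows of examples/brain_size.csv shown in the excerpt above
csv_text = '''"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545
'''

# Without na_values, the "." entry prevents Weight from being parsed as numbers
raw = pd.read_csv(io.StringIO(csv_text), sep=";")

# Declaring "." as the NA marker turns the missing weight into NaN
data = pd.read_csv(io.StringIO(csv_text), sep=";", na_values=".")

print(raw["Weight"].dtype)     # a non-numeric dtype
print(data["Weight"].dtype)    # float64
print(data["Weight"].mean())   # NaN is skipped when averaging
```

With the marker declared, statistics such as the mean ignore the missing entry instead of failing on a text column.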
Creating from arrays: A pandas.DataFrame can also be seen as a dictionary of 1D ‘series’, e.g. arrays
or lists. If we have 3 numpy arrays:
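A minimal sketch of this construction follows. The three arrays (a time axis with its sine and cosine) and the column names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Three 1D arrays of equal length (illustrative values)
t = np.linspace(-6, 6, 20)
sin_t = np.sin(t)
cos_t = np.cos(t)

# Each dictionary key becomes a named column of the dataframe
frame = pd.DataFrame({"t": t, "sin": sin_t, "cos": cos_t})
print(frame.head())
```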
Other inputs: pandas can input data from SQL, Excel files, or other formats. See the pandas
documentation.
Manipulating data
Note: For a quick view of a large dataframe, use its describe method:
pandas.DataFrame.describe().
groupby_gender is a powerful object that exposes many operations on the resulting group of dataframes:
Use tab-completion on groupby_gender to find more. Other common grouping functions are median, count
(useful for checking the number of missing values in different subsets) or sum. Groupby evaluation is
lazy: no work is done until an aggregation function is applied.
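The following sketch assumes that groupby_gender is obtained with data.groupby('Gender'). It uses a small frame built from the five rows of the CSV excerpt above, so the numbers are not those of the full dataset.

```python
import pandas as pd

# Columns taken from the five-row excerpt of brain_size.csv shown above
data = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Male", "Female"],
    "VIQ": [132, 150, 123, 129, 132],
    "Weight": [118.0, None, 143.0, 172.0, 147.0],
})

# Quick overview of the numerical columns
print(data.describe())

# Creating the groupby object does no computation yet
groupby_gender = data.groupby("Gender")

# Iterating yields one (key, sub-dataframe) pair per group
for gender, frame in groupby_gender:
    print(gender, frame["VIQ"].mean())

# Aggregations are evaluated when called
viq_means = groupby_gender["VIQ"].mean()
counts = groupby_gender.count()   # non-missing values per column and group
print(viq_means)
print(counts)
```

Comparing the count of Weight with the count of VIQ in each group shows where values are missing.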
Exercise
Note: groupby_gender.boxplot is used for the plots above (see this example).
Plotting data
Pandas comes with some plotting tools (pandas.plotting, called pandas.tools.plotting in older pandas
versions, using matplotlib behind the scenes) to display statistics of the data in dataframes:
Scatter matrices:
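A hedged sketch of a scatter matrix with pandas.plotting.scatter_matrix follows. The data are synthetic stand-ins for the Weight, Height and MRI_Count columns, generated only so that the example runs on its own; they are not the brain size measurements.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic, illustrative data (not the real measurements)
rng = np.random.default_rng(0)
n = 40
height = rng.normal(68, 4, size=n)
weight = 2.5 * height - 20 + rng.normal(0, 15, size=n)
mri_count = 9e5 + 5e3 * (height - 68) + rng.normal(0, 5e4, size=n)
data = pd.DataFrame({"Weight": weight, "Height": height, "MRI_Count": mri_count})

# One scatter plot per pair of columns, histograms on the diagonal
axes = scatter_matrix(data[["Weight", "Height", "MRI_Count"]])
print(axes.shape)
```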
Two populations
Exercise
Plot the scatter matrix for males only, and for females only. Do you think that the 2 sub-populations
correspond to gender?
For simple statistical tests, we will use the scipy.stats sub-module of scipy:
See also: Scipy is a vast library. For a quick summary of the whole library, see the scipy chapter.
scipy.stats.ttest_1samp() tests if the population mean of data is likely to be equal to a given value
(technically if observations are drawn from a Gaussian distribution of given population mean). It returns
the T statistic, and the p-value (see the function’s help):
With a p-value of 10^-28 we can claim that the population mean for the
IQ (VIQ measure) is not 0.
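The 1-sample test can be sketched on the VIQ values of the five rows shown in the CSV excerpt. Because only five observations are used, the p-value differs from the one quoted above for the full dataset, but the conclusion is the same.

```python
from scipy import stats

# VIQ of the five individuals in the CSV excerpt above
viq = [132, 150, 123, 129, 132]

# Null hypothesis: the population mean of VIQ is 0
t_statistic, p_value = stats.ttest_1samp(viq, 0)
print(t_statistic, p_value)
```

A very small p-value means that a population mean of 0 is implausible given the observations.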
We have seen above that the mean VIQ in the male and female populations was different. To test if this
difference is significant, we do a 2-sample t-test with scipy.stats.ttest_ind():
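A minimal sketch of the call follows. The two samples are synthetic, drawn from Gaussians whose means and spread are arbitrary assumptions, and stand in for the VIQ of the female and male groups.

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative samples (not the brain size data)
rng = np.random.default_rng(0)
female_viq = rng.normal(loc=110, scale=20, size=20)
male_viq = rng.normal(loc=115, scale=20, size=20)

# Null hypothesis: both groups have the same population mean
# (equal variances are assumed by default)
t_statistic, p_value = stats.ttest_ind(female_viq, male_viq)
print(t_statistic, p_value)
```

With the real data, the two arguments would be the VIQ column restricted to each gender.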
Exercise
Test the difference between weights in males and females.
Use non-parametric statistics to test the difference between VIQ in males and females.
Conclusion: we find that the data does not support the hypothesis that males and females have different
VIQ.
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Terminology:
Statsmodels uses a statistical terminology: the y variable in statsmodels is called ‘endogenous’ while
the x variable is called ‘exogenous’. This is discussed in more detail here.
To simplify, y (endogenous) is the value you are trying to predict, while x (exogenous) represents the
features you are using to make the prediction.
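To make the terminology concrete, here is a sketch that fits an ordinary least squares model with a formula and reads the endogenous and exogenous names back from the model object. The data are synthetic (a noisy line whose coefficients are arbitrary assumptions).

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: y depends linearly on x, plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 20)
y = -5 + 3 * x + 4 * rng.normal(size=x.shape)
data = pd.DataFrame({"x": x, "y": y})

# "y ~ x" reads: y is explained by x (an intercept is added by default)
model = ols("y ~ x", data).fit()

# The left-hand side of the formula is the endogenous variable,
# the right-hand side terms are the exogenous variables
print(model.model.endog_names)
print(model.model.exog_names)
```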
Exercise
Retrieve the estimated parameters from the model above. Hint: use tab-completion to find the relevant
attribute.
We can write a comparison between the IQ of males and females using a linear model:
                                                                        0.5969
Date:                             ...   Prob (F-statistic):              0.445
Time:                             ...   Log-Likelihood:                -182.42
No. Observations:                  40   AIC:                             368.8
Df Residuals:                      38   BIC:                             372.2
Df Model:                           1
Covariance Type:            nonrobust
==========================...
                     coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------...
Intercept        109.4500      5.308     20.619      0.000      98.704     120.196
Gender[T.Male]     5.8000      7.507      0.773      0.445      -9.397      20.997
==========================...
Omnibus:                       26.188   Durbin-Watson:                   1.709
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                3.703
Skew:                           0.010   Prob(JB):                        0.157
Kurtosis:                       1.510   Cond. No.                         2.62
==========================...
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Forcing categorical: the ‘Gender’ column is automatically detected as a categorical variable, and thus each of
its different values is treated as a different entity.
Intercept: We can remove the intercept using - 1 in the formula, or force the use of an intercept using
+ 1.
By default, statsmodels treats a categorical variable with K possible values as K-1 ‘dummy’ boolean variables
(the first level being absorbed into the intercept term, as Female is in the Gender[T.Male] coefficient
above). This is almost always a good default choice; however, it is possible to specify different
encodings for categorical variables (http://statsmodels.sourceforge.net/devel/contrasts.html).
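The tips above can be sketched on the Gender and VIQ columns of the five-row CSV excerpt. The integer column Sex is a hypothetical recoding of Gender, added only to illustrate C(). The exact parameter labels mentioned in the comments depend on the formula backend and are given as an indication.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Five rows from the CSV excerpt; "Sex" is a hypothetical integer recoding
data = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Male", "Female"],
    "Sex": [0, 1, 1, 1, 0],
    "VIQ": [132, 150, 123, 129, 132],
})

# Default coding: an intercept (mean of the reference level, Female)
# plus one dummy variable giving the Male minus Female difference
with_intercept = ols("VIQ ~ Gender + 1", data).fit()
print(with_intercept.params)

# "- 1" removes the intercept: one parameter per level, i.e. the group means
without_intercept = ols("VIQ ~ Gender - 1", data).fit()
print(without_intercept.params)

# C() forces an integer column to be treated as categorical
forced = ols("VIQ ~ C(Sex)", data).fit()
print(forced.params)
```

Without C(), the integer column Sex would be treated as a continuous regressor. With only two levels the fit is the same, but with more levels the models differ.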
To compare different types of IQ, we need to create a “long-form” table, listing IQs, where the type of
IQ is indicated by a categorical variable:
We can see that we retrieve the same values for the t-test and corresponding p-values for the effect of the
type of IQ as in the previous t-test:
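A sketch of the long-form construction and of the equivalence with the t-test, using the FSIQ and PIQ values of the five rows in the CSV excerpt. The column names iq and type are illustrative choices.

```python
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols

# FSIQ and PIQ of the five individuals in the CSV excerpt
data = pd.DataFrame({
    "FSIQ": [133, 140, 139, 133, 137],
    "PIQ": [124, 124, 150, 128, 134],
})

# Long form: one row per measurement, with a column naming the type of IQ
data_fsiq = pd.DataFrame({"iq": data["FSIQ"], "type": "fsiq"})
data_piq = pd.DataFrame({"iq": data["PIQ"], "type": "piq"})
data_long = pd.concat((data_fsiq, data_piq))
print(data_long)

# Linear model with the type of IQ as a categorical factor
model = ols("iq ~ type", data_long).fit()
t_ols = model.tvalues.iloc[1]   # t value of the type effect (piq vs fsiq)
p_ols = model.pvalues.iloc[1]

# Equal-variance 2-sample t-test on the same values
t_test, p_test = stats.ttest_ind(data["PIQ"], data["FSIQ"])
print(t_ols, p_ols)
print(t_test, p_test)
```

The sign of the t value depends on which group is taken as the reference, which is why PIQ is passed first to ttest_ind here.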
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In the above iris example, we wish to test if the petal length is different between versicolor and virginica,
after removing the effect of sepal width. This can be formulated as testing the difference between the
coefficients associated with versicolor and virginica in the linear model estimated above (it is an Analysis of
Variance, ANOVA). For this, we write a vector of ‘contrast’ on the parameters estimated: we want to test
"name[T.versicolor] - name[T.virginica]", with an F-test:
Exercise
Going back to the brain size + IQ data, test if the VIQ of males and females is different after removing
the effect of brain size, height and weight.
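The contrast test described above for the iris example can be sketched as follows. The data are synthetic stand-ins for the iris measurements (the group means, slope and noise level are arbitrary assumptions), and the model formula follows the description in the text. The contrast vector is built from the parameter names, so it does not rely on their order.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic three-species data, only shaped like the iris dataset
rng = np.random.default_rng(0)
n = 30
species_offsets = {"setosa": 1.5, "versicolor": 4.3, "virginica": 5.5}
frames = []
for name, offset in species_offsets.items():
    sepal_width = rng.normal(3.0, 0.4, size=n)
    petal_length = offset + 0.5 * sepal_width + rng.normal(0, 0.4, size=n)
    frames.append(pd.DataFrame({
        "name": name,
        "sepal_width": sepal_width,
        "petal_length": petal_length,
    }))
data = pd.concat(frames, ignore_index=True)

model = ols("petal_length ~ name + sepal_width", data).fit()

# Contrast: +1 on the versicolor coefficient, -1 on the virginica one
param_names = list(model.params.index)
contrast = np.zeros(len(param_names))
contrast[param_names.index("name[T.versicolor]")] = 1
contrast[param_names.index("name[T.virginica]")] = -1

# F-test of the null hypothesis that the contrast is zero
result = model.f_test(contrast)
p_value = float(np.ravel(result.pvalue)[0])
print(result)
```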
Let us consider a dataset giving wages and other personal information on 500 individuals (Berndt, ER.
The Practice of Econometrics. 1991. NY: Addison-Wesley).
The full code for loading and plotting the wages data is found in the corresponding example.
>>> print(data)
EDUCATION SOUTH SEX EXPERIENCE UNION WAGE AGE RACE \
0 8 0 1 21 0 0.707570 35 2
1 9 0 1 42 0 0.694605 57 3
2 12 0 0 1 0 0.824126 19 3
3 12 0 0 4 0 0.602060 22 3
...
We can easily get an intuition for the interactions between continuous variables using
seaborn.pairplot() to display a scatter matrix:
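Seaborn changes the default of matplotlib figures to achieve a more “modern”, “excel-like” look. In seaborn
versions before 0.8 it does that upon import; in later versions the seaborn style is applied only when
seaborn.set_theme() (or the older seaborn.set()) is called. You can reset the default using: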
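A minimal sketch using matplotlib's own rcdefaults() follows. The line width change only simulates a modified style, so the example does not depend on seaborn.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
from matplotlib import pyplot as plt

# Value shipped with matplotlib for this setting
default_linewidth = matplotlib.rcParamsDefault["lines.linewidth"]

# Simulate a style change such as the one a styling library would make
plt.rcParams["lines.linewidth"] = 5.0
changed_linewidth = plt.rcParams["lines.linewidth"]

# Restore matplotlib's built-in defaults
plt.rcdefaults()
restored_linewidth = plt.rcParams["lines.linewidth"]
print(changed_linewidth, restored_linewidth)
```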
To switch back to seaborn settings, or to better understand styling in seaborn, see the relevant section of the
seaborn documentation.
A regression capturing the relation between one variable and another, e.g. wage and education, can be
plotted using seaborn.lmplot():
Robust regression
In the above plot, there seem to be a couple of data points that are outside of the main cloud to
the right. They might be outliers, not representative of the population, but driving the regression.
To compute a regression that is less sensitive to outliers, one must use a robust model. This is done in
seaborn using robust=True in the plotting functions, or in statsmodels by replacing the use of the
OLS by a “Robust Linear Model”, statsmodels.formula.api.rlm().
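The statsmodels route can be sketched by fitting the same formula with ols and with rlm. The data are synthetic: a linear relation with a known slope (an arbitrary assumption), contaminated by a few artificial outliers on the right of the cloud.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: wage grows linearly with education (true slope 0.05)
rng = np.random.default_rng(42)
true_slope = 0.05
education = np.tile(np.arange(8, 18), 10).astype(float)
wage = 0.2 + true_slope * education + rng.normal(scale=0.05, size=education.size)

# Five artificial outliers at the right edge, far above the main cloud
education = np.concatenate([education, np.full(5, 18.0)])
wage = np.concatenate([wage, np.full(5, 3.0)])
data = pd.DataFrame({"education": education, "wage": wage})

# Ordinary least squares is pulled towards the outliers
ols_fit = smf.ols("wage ~ education", data).fit()

# The robust linear model down-weights observations with large residuals
rlm_fit = smf.rlm("wage ~ education", data).fit()

print(ols_fit.params["education"], rlm_fit.params["education"])
```

Comparing the two slopes with the value used to generate the data shows how strongly the outliers drive the ordinary fit.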
Boxplots and paired differences
Plotting simple quantities of a pandas dataframe
Analysis of Iris petal and sepal sizes