3.1. Statistics in Python - Scipy Lecture Notes

12/3/21, 7:12 PM 3.1.

Statistics in Python — Scipy lecture notes

3.1. Statistics in Python

Author: Gaël Varoquaux

Requirements
Standard scientific Python environment (numpy, scipy, matplotlib)
Pandas
Statsmodels
Seaborn

To install Python and these dependencies, we recommend that you download Anaconda Python or
Enthought Canopy, or, preferably, use the package manager if you are on Ubuntu or another Linux distribution.

See also:
Bayesian statistics in Python: This chapter does not cover tools for Bayesian statistics. Of particular
interest for Bayesian modelling is PyMC, which implements a probabilistic programming
language in Python.
Read a statistics book: The Think stats book is available as free PDF or in print and is a great
introduction to statistics.

Why Python for statistics?

R is a language dedicated to statistics. Python is a general-purpose language with statistics modules. R has
more statistical analysis features than Python, and specialized syntaxes. However, when it comes to building
complex analysis pipelines that mix statistics with e.g. image analysis, text mining, or control of a physical
experiment, the richness of Python is an invaluable asset.

Contents
Data representation and interaction
Data as a table
The pandas data-frame
Hypothesis testing: comparing two groups
Student’s t-test: the simplest statistical test
Paired tests: repeated measurements on the same individuals
Linear models, multiple factors, and analysis of variance
“formulas” to specify statistical models in Python
Multiple Regression: including multiple factors
Post-hoc hypothesis testing: analysis of variance (ANOVA)
More visualization: seaborn for statistical exploration
Pairplot: scatter matrices

https://scipy-lectures.org/packages/statistics/index.html 1/20

lmplot: plotting a univariate regression

Testing for interactions
Full code for the figures
Solutions to this chapter’s exercises

In this document, the Python inputs are represented with the sign “>>>”.

Disclaimer: Gender questions

Some of the examples of this tutorial are chosen around gender questions. The reason is that on such
questions controlling the truth of a claim actually matters to many people.

3.1.1. Data representation and interaction

3.1.1.1. Data as a table

The setting that we consider for statistical analysis is that of multiple observations or samples described
by a set of different attributes or features. The data can then be seen as a 2D table, or matrix, with
columns giving the different attributes of the data, and rows the observations. For instance, the data
contained in examples/brain_size.csv:

"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545

3.1.1.2. The pandas data-frame

We will store and manipulate this data in a pandas.DataFrame, from the pandas module. It is the Python
equivalent of the spreadsheet table. It is different from a 2D numpy array as it has named columns, can contain
a mixture of different data types by column, and has elaborate selection and pivotal mechanisms.

Creating dataframes: reading data files or converting arrays

Separator

It is a CSV file, but the separator is “;”


Reading from a CSV file: Using the above CSV file that gives observations of brain size and weight and
IQ (Willerman et al. 1991), the data are a mixture of numerical and categorical values:

>>> import pandas
>>> data = pandas.read_csv('examples/brain_size.csv', sep=';', na_values=".")
>>> data
   Unnamed: 0  Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
0           1  Female   133  132  124   118.0    64.5     816932
1           2    Male   140  150  124     NaN    72.5    1001121
2           3    Male   139  123  150   143.0    73.3    1038437
3           4    Male   133  129  128   172.0    68.8     965353
4           5  Female   137  132  134   147.0    65.0     951545
...

 Missing values
The weight of the second individual is missing in the CSV file. If we don’t specify the missing value (NA
= not available) marker, the Weight column is read as text rather than as numbers, and we will not be able
to do statistical analysis on it.
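
To make the role of the marker concrete, here is a minimal sketch that parses the five rows shown earlier with and without na_values. Holding the rows in an in-memory string (csv_text) is an assumption made so that the snippet runs without examples/brain_size.csv; the variable names raw and clean are illustrative only.

```python
import io

import pandas
from pandas.api.types import is_numeric_dtype

# The first rows of the file shown above, held in memory for illustration
csv_text = '''"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545
'''

# Without na_values, "." is kept as text and the whole column stays non-numeric
raw = pandas.read_csv(io.StringIO(csv_text), sep=';')
print(is_numeric_dtype(raw['Weight']))

# With na_values=".", the marker becomes NaN and the column is numeric
clean = pandas.read_csv(io.StringIO(csv_text), sep=';', na_values=".")
print(is_numeric_dtype(clean['Weight']))
print(clean['Weight'].mean())   # NaN entries are skipped by default
```

Once the column is numeric, aggregations such as mean() simply ignore the missing entry instead of failing.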

Creating from arrays: A pandas.DataFrame can also be seen as a dictionary of 1D ‘series’, e.g. arrays
or lists. If we have 3 numpy arrays:

>>> import numpy as np
>>> t = np.linspace(-6, 6, 20)
>>> sin_t = np.sin(t)
>>> cos_t = np.cos(t)

We can expose them as a pandas.DataFrame:

>>> pandas.DataFrame({'t': t, 'sin': sin_t, 'cos': cos_t})
           t       sin       cos
0  -6.000000  0.279415  0.960170
1  -5.368421  0.792419  0.609977
2  -4.736842  0.999701  0.024451
3  -4.105263  0.821291 -0.570509
4  -3.473684  0.326021 -0.945363
5  -2.842105 -0.295030 -0.955488
6  -2.210526 -0.802257 -0.596979
7  -1.578947 -0.999967 -0.008151
8  -0.947368 -0.811882  0.583822
...

Other inputs: pandas can input data from SQL, excel files, or other formats. See the pandas
documentation.


Manipulating data

data is a pandas.DataFrame, that resembles R’s dataframe:

>>> data.shape    # 40 rows and 8 columns
(40, 8)

>>> data.columns  # It has columns
Index([u'Unnamed: 0', u'Gender', u'FSIQ', u'VIQ', u'PIQ', u'Weight',
       u'Height', u'MRI_Count'], dtype='object')

>>> print(data['Gender'])  # Columns can be addressed by name
0     Female
1       Male
2       Male
3       Male
4     Female
...

>>> # Simpler selector
>>> data[data['Gender'] == 'Female']['VIQ'].mean()
109.45

Note: For a quick view on a large dataframe, use its describe method:
pandas.DataFrame.describe().

groupby: splitting a dataframe on values of categorical variables:

>>> groupby_gender = data.groupby('Gender')
>>> for gender, value in groupby_gender['VIQ']:
...     print((gender, value.mean()))
('Female', 109.45)
('Male', 115.25)

groupby_gender is a powerful object that exposes many operations on the resulting group of dataframes:

>>> groupby_gender.mean()
        Unnamed: 0   FSIQ     VIQ     PIQ      Weight     Height  MRI_Count
Gender
Female       19.65  111.9  109.45  110.45  137.200000  65.765000   862654.6
Male         21.35  115.0  115.25  111.60  166.444444  71.431579   954855.4

Use tab-completion on groupby_gender to find more. Other common grouping functions are median, count
(useful for checking the number of missing values in different subsets) and sum. Groupby evaluation is
lazy: no work is done until an aggregation function is applied.
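
As a small illustration of these grouping functions, the sketch below rebuilds the first five rows of the table by hand (an assumption, so that it runs without the CSV file) and compares size, which counts rows, with count, which skips missing values.

```python
import numpy as np
import pandas

# First five rows of the brain-size table shown earlier, typed in by hand
data = pandas.DataFrame({
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'VIQ': [132, 150, 123, 129, 132],
    'Weight': [118.0, np.nan, 143.0, 172.0, 147.0],
})

groupby_gender = data.groupby('Gender')     # nothing is computed yet

sizes = groupby_gender.size()               # rows per group, NaN included
counts = groupby_gender['Weight'].count()   # non-missing weights per group
print(sizes - counts)                       # missing weights per group
print(groupby_gender['VIQ'].median())
```

The difference between the two results gives the number of missing weights in each group, which is the check mentioned above.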

Exercise

What is the mean value for VIQ for the full population?
How many males/females were included in this study?
    Hint: use ‘tab completion’ to find out the methods that can be called, instead of ‘mean’ in the
    above example.
What is the average value of MRI counts expressed in log units, for males and females?

Note: groupby_gender.boxplot is used for the plots above (see this example).

Plotting data

Pandas comes with some plotting tools (pandas.plotting, named pandas.tools.plotting in old pandas
releases, using matplotlib behind the scene) to display statistics of the data in dataframes:

Scatter matrices:

>>> from pandas import plotting
>>> plotting.scatter_matrix(data[['Weight', 'Height', 'MRI_Count']])

>>> plotting.scatter_matrix(data[['PIQ', 'VIQ', 'FSIQ']])

Two populations

The IQ metrics are bimodal, as if there are 2 sub-populations.

Exercise

Plot the scatter matrix for males only, and for females only. Do you think that the 2 sub-populations
correspond to gender?

3.1.2. Hypothesis testing: comparing two groups

For simple statistical tests, we will use the scipy.stats sub-module of scipy:

>>> from scipy import stats

See also: Scipy is a vast library. For a quick summary of the whole library, see the scipy chapter.

3.1.2.1. Student’s t-test: the simplest statistical test

1-sample t-test: testing the value of a population mean

scipy.stats.ttest_1samp() tests if the population mean of data is likely to be equal to a given value
(technically if observations are drawn from a Gaussian distribution of given population mean). It returns
the T statistic, and the p-value (see the function’s help):

>>> stats.ttest_1samp(data['VIQ'], 0)
Ttest_1sampResult(statistic=30.088099970..., pvalue=1.32891964...e-28)

With a p-value of 10^-28 we can claim that the population mean for the
IQ (VIQ measure) is not 0.
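
To show what the T statistic and p-value stand for, here is a minimal sketch that recomputes them by hand and compares them with scipy.stats.ttest_1samp(). The sample is simulated (it is not the VIQ column), and the names sample, popmean, t_manual and p_manual are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=110, scale=20, size=40)   # simulated scores, not the real data
popmean = 100

# t = (sample mean - hypothesised mean) / standard error of the mean
n = sample.size
t_manual = (sample.mean() - popmean) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from Student's t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(sample, popmean)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Both lines should print matching pairs: the test is the sample mean's distance from the hypothesised value, measured in standard errors.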

2-sample t-test: testing for difference across populations

We have seen above that the mean VIQ values in the male and female populations were different. To test
if this difference is significant, we do a 2-sample t-test with scipy.stats.ttest_ind():

>>> female_viq = data[data['Gender'] == 'Female']['VIQ']
>>> male_viq = data[data['Gender'] == 'Male']['VIQ']
>>> stats.ttest_ind(female_viq, male_viq)
Ttest_indResult(statistic=-0.77261617232..., pvalue=0.4445287677858...)

3.1.2.2. Paired tests: repeated measurements on the same individuals

PIQ, VIQ, and FSIQ give 3 measures of IQ. Let us test if FSIQ and PIQ are significantly different. We
can use a 2-sample test:

>>> stats.ttest_ind(data['FSIQ'], data['PIQ'])
Ttest_indResult(statistic=0.46563759638..., pvalue=0.64277250...)

The problem with this approach is that it forgets that there are links between observations: FSIQ and
PIQ are measured on the same individuals. Thus the variance due to inter-subject variability is
confounding, and can be removed, using a “paired test”, or “repeated measures test”:

>>> stats.ttest_rel(data['FSIQ'], data['PIQ'])
Ttest_relResult(statistic=1.784201940..., pvalue=0.082172638183...)

This is equivalent to a 1-sample test on the difference:

>>> stats.ttest_1samp(data['FSIQ'] - data['PIQ'], 0)
Ttest_1sampResult(statistic=1.784201940..., pvalue=0.082172638...)
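
The following sketch illustrates why removing inter-subject variability matters. It uses simulated measurements (an assumption, not the IQ data): each subject has a widely varying baseline, and the second measurement is shifted by a small constant. The names subject, first and second are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
subject = rng.normal(loc=100, scale=15, size=40)       # large between-subject spread
first = subject + rng.normal(scale=2, size=40)
second = subject + 3 + rng.normal(scale=2, size=40)    # small systematic shift of 3

unpaired = stats.ttest_ind(second, first)   # ignores who was measured twice
paired = stats.ttest_rel(second, first)     # works on per-subject differences
print(unpaired.pvalue, paired.pvalue)
```

The paired test works on the per-subject differences, so the baseline spread cancels out and the same shift yields a much smaller p-value.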

T-tests assume Gaussian errors. We can use a Wilcoxon signed-rank test, that relaxes this assumption:

>>> stats.wilcoxon(data['FSIQ'], data['PIQ'])
WilcoxonResult(statistic=274.5, pvalue=0.106594927...)

Note: The corresponding test in the non-paired case is the Mann–Whitney U test,
scipy.stats.mannwhitneyu().
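
A minimal usage sketch for the non-paired case, on simulated and deliberately skewed samples (group_a and group_b are hypothetical names, not columns of the brain-size data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.exponential(scale=1.0, size=30)   # skewed, clearly non-Gaussian
group_b = rng.exponential(scale=2.0, size=30)

# Two independent samples: no pairing between the observations is assumed
result = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(result.statistic, result.pvalue)
```

Passing alternative='two-sided' explicitly keeps the call unambiguous across scipy versions.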

Exercise
Test the difference between weights in males and females.
Use non parametric statistics to test the difference between VIQ in males and females.

Conclusion: we find that the data does not support the hypothesis that males and females have
different VIQ.

3.1.3. Linear models, multiple factors, and analysis of variance

3.1.3.1. “formulas” to specify statistical models in Python

A simple linear regression

Given two sets of observations, x and y, we want to test the hypothesis that y is a linear function of x.
In other terms:

    y = x * coef + intercept + e

where e is observation noise. We will use the statsmodels module to:

1. Fit a linear model. We will use the simplest strategy, ordinary least squares (OLS).
2. Test that coef is non zero.

First, we generate simulated data according to the model:

>>> import numpy as np
>>> x = np.linspace(-5, 5, 20)
>>> np.random.seed(1)
>>> # normal distributed noise
>>> y = -5 + 3*x + 4 * np.random.normal(size=x.shape)
>>> # Create a data frame containing all the relevant variables
>>> data = pandas.DataFrame({'x': x, 'y': y})

“formulas” for statistics in Python

See the statsmodels documentation

Then we specify an OLS model and fit it:

>>> from statsmodels.formula.api import ols
>>> model = ols("y ~ x", data).fit()

We can inspect the various statistics derived from the fit:

>>> print(model.summary())
                            OLS Regression Results
==========================...
Dep. Variable:                      y   R-squared:                       0.804
Model:                            OLS   Adj. R-squared:                  0.794
Method:                 Least Squares   F-statistic:                     74.03
Date:                             ...   Prob (F-statistic):           8.56e-08
Time:                             ...   Log-Likelihood:                -57.988
No. Observations:                  20   AIC:                             120.0
Df Residuals:                      18   BIC:                             122.0
Df Model:                           1
Covariance Type:            nonrobust
==========================...
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------...
Intercept     -5.5335      1.036     -5.342      0.000      -7.710      -3.357
x              2.9369      0.341      8.604      0.000       2.220       3.654
==========================...
Omnibus:                        0.100   Durbin-Watson:                   2.956
Prob(Omnibus):                  0.951   Jarque-Bera (JB):                0.322
Skew:                          -0.058   Prob(JB):                        0.851
Kurtosis:                       2.390   Cond. No.                         3.03
==========================...

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
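
To clarify where the coef, std err, t and P>|t| columns of the summary come from, here is a sketch that recomputes them with plain numpy and scipy on the same simulated data. It is an illustration of ordinary least squares under the usual textbook formulas, not the statsmodels implementation; the names X, beta, std_err, t_values and p_values are assumptions.

```python
import numpy as np
from scipy import stats

# Same simulated data as above
x = np.linspace(-5, 5, 20)
np.random.seed(1)
y = -5 + 3*x + 4 * np.random.normal(size=x.shape)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # [intercept, slope]

# Standard errors from the residual variance and (X'X)^-1
residuals = y - X @ beta
dof = len(y) - X.shape[1]                          # "Df Residuals"
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

# t = coef / std err; two-sided p-value from Student's t distribution
t_values = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_values), df=dof)
for name, row in zip(['Intercept', 'x'], zip(beta, std_err, t_values, p_values)):
    print(name, *np.round(row, 3))
```

The printed rows should land close to the corresponding rows of the summary table; testing that coef is non zero amounts to reading the p-value on the x row.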

Terminology:

Statsmodels uses a statistical terminology: the y variable in statsmodels is called ‘endogenous’ while
the x variable is called ‘exogenous’. This is discussed in more detail here.

To simplify, y (endogenous) is the value you are trying to predict, while x (exogenous) represents the
features you are using to make the prediction.

Exercise

Retrieve the estimated parameters from the model above. Hint: use tab-completion to find the relevant
attribute.

Categorical variables: comparing groups or multiple categories

Let us go back to the data on brain size:

>>> data = pandas.read_csv('examples/brain_size.csv', sep=';', na_values=".")

We can write a comparison between IQ of male and female using a linear model:

>>> model = ols("VIQ ~ Gender + 1", data).fit()
>>> print(model.summary())
                            OLS Regression Results
==========================...
Dep. Variable:                    VIQ   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                    0.5969
Date:                             ...   Prob (F-statistic):              0.445
Time:                             ...   Log-Likelihood:                -182.42
No. Observations:                  40   AIC:                             368.8
Df Residuals:                      38   BIC:                             372.2
Df Model:                           1
Covariance Type:            nonrobust
==========================...
                     coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------...
Intercept        109.4500      5.308     20.619      0.000      98.704     120.196
Gender[T.Male]     5.8000      7.507      0.773      0.445      -9.397      20.997
==========================...
Omnibus:                       26.188   Durbin-Watson:                   1.709
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                3.703
Skew:                           0.010   Prob(JB):                        0.157
Kurtosis:                       1.510   Cond. No.                         2.62
==========================...

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Tips on specifying model

Forcing categorical: the ‘Gender’ is automatically detected as a categorical variable, and thus each of
its different values is treated as a different entity.

An integer column can be forced to be treated as categorical using:

>>> model = ols('VIQ ~ C(Gender)', data).fit()

Intercept: We can remove the intercept using - 1 in the formula, or force the use of an intercept using
+ 1.
By default, statsmodels treats a categorical variable with K possible values as K-1 ‘dummy’ boolean
variables (the first level, used as the reference, being absorbed into the intercept term, as Female is in
the summary above). This is almost always a good default choice; however, it is possible to specify
different encodings for categorical variables
(http://statsmodels.sourceforge.net/devel/contrasts.html).
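
The dummy coding can be sketched without statsmodels. The example below uses pandas.get_dummies as a stand-in for the formula machinery (an assumption: it is not what statsmodels calls internally) on the first five rows of the table, typed in by hand. With K = 2 levels, one dummy column remains, and the dropped level acts as the reference.

```python
import numpy as np
import pandas

# First five rows of the brain-size table, typed in by hand for illustration
data = pandas.DataFrame({
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'VIQ': [132, 150, 123, 129, 132],
})

# K = 2 levels give K - 1 = 1 dummy column; the dropped level (Female) is the reference
dummies = pandas.get_dummies(data['Gender'], drop_first=True).astype(float)
print(list(dummies.columns))

# Least-squares fit of VIQ on an intercept plus the Male dummy
X = np.column_stack([np.ones(len(data)), dummies['Male']])
(intercept, male_effect), *_ = np.linalg.lstsq(
    X, data['VIQ'].to_numpy(dtype=float), rcond=None)

means = data.groupby('Gender')['VIQ'].mean()
print(intercept, means['Female'])                     # intercept = reference-group mean
print(male_effect, means['Male'] - means['Female'])   # coefficient = difference of means
```

This is why the Intercept row of such a model reports the mean of the reference group, and the Gender[T.Male] row reports the difference between the two group means.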

Link to t-tests between different FSIQ and PIQ


To compare different types of IQ, we need to create a “long-form” table, listing IQs, where the type of
IQ is indicated by a categorical variable:

>>> data_fisq = pandas.DataFrame({'iq': data['FSIQ'], 'type': 'fsiq'})
>>> data_piq = pandas.DataFrame({'iq': data['PIQ'], 'type': 'piq'})
>>> data_long = pandas.concat((data_fisq, data_piq))
>>> print(data_long)
     iq  type
0   133  fsiq
1   140  fsiq
2   139  fsiq
...
31  137   piq
32  110   piq
33   86   piq
...

>>> model = ols("iq ~ type", data_long).fit()
>>> print(model.summary())
                            OLS Regression Results
...
==========================...
                  coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------...
Intercept     113.4500      3.683     30.807      0.000     106.119     120.781
type[T.piq]    -2.4250      5.208     -0.466      0.643     -12.793       7.943
...

We can see that we retrieve the same t statistic (up to its sign, which depends on the order of the
groups) and the same p-value for the effect of the type of IQ as in the previous t-test:

>>> stats.ttest_ind(data['FSIQ'], data['PIQ'])
Ttest_indResult(statistic=0.46563759638..., pvalue=0.64277250...)

3.1.3.2. Multiple Regression: including multiple factors

Consider a linear model explaining a variable z (the dependent variable) with 2 variables x and y:

    z = x * coef_x + y * coef_y + intercept + e

where e is observation noise. Such a model can be seen in 3D as fitting a plane to a cloud of (x, y, z)
points.
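
As a sketch of the plane-fitting view, the snippet below simulates a cloud of points around a known plane and recovers its parameters by least squares. The ground-truth values (2, -1 and 0.5) and all variable names are hypothetical choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = rng.uniform(-1, 1, size=n)
# Hypothetical ground truth: z = 2*x - 1*y + 0.5 plus small Gaussian noise
z = 2.0 * x - 1.0 * y + 0.5 + rng.normal(scale=0.1, size=n)

# One column per explanatory variable, plus a column of ones for the intercept
X = np.column_stack([x, y, np.ones(n)])
coefs, *_ = np.linalg.lstsq(X, z, rcond=None)
print(coefs)   # estimated [slope along x, slope along y, intercept]
```

The two slopes give the tilt of the plane along each axis, and the intercept its height at the origin.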


Example: the iris data (examples/iris.csv)

Sepal and petal size tend to be related: bigger flowers are bigger! But is there in addition a systematic effect of
species?

>>> data = pandas.read_csv('examples/iris.csv')
>>> model = ols('sepal_width ~ name + petal_length', data).fit()
>>> print(model.summary())
                            OLS Regression Results
==========================...
Dep. Variable:            sepal_width   R-squared:                       0.478
Model:                            OLS   Adj. R-squared:                  0.468
Method:                 Least Squares   F-statistic:                     44.63
Date:                             ...   Prob (F-statistic):           1.58e-20
Time:                             ...   Log-Likelihood:                -38.185
No. Observations:                 150   AIC:                             84.37
Df Residuals:                     146   BIC:                             96.41
Df Model:                           3
Covariance Type:            nonrobust
==========================...
                         coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------...
Intercept              2.9813      0.099     29.989      0.000       2.785       3.178
name[T.versicolor]    -1.4821      0.181     -8.190      0.000      -1.840      -1.124
name[T.virginica]     -1.6635      0.256     -6.502      0.000      -2.169      -1.158
petal_length           0.2983      0.061      4.920      0.000       0.178       0.418
==========================...
Omnibus:                        2.868   Durbin-Watson:                   1.753
Prob(Omnibus):                  0.238   Jarque-Bera (JB):                2.885
Skew:                          -0.082   Prob(JB):                        0.236
Kurtosis:                       3.659   Cond. No.                         54.0
==========================...

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

3.1.3.3. Post-hoc hypothesis testing: analysis of variance (ANOVA)

In the above iris example, we wish to test if the sepal width is different between versicolor and virginica,
after removing the effect of petal length. This can be formulated as testing the difference between the
coefficients associated with versicolor and virginica in the linear model estimated above (it is an Analysis
of Variance, ANOVA). For this, we write a vector of ‘contrast’ on the parameters estimated: we want to test
"name[T.versicolor] - name[T.virginica]", with an F-test:

>>> print(model.f_test([0, 1, -1, 0]))

Is this difference significant?
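
To make the contrast vector less abstract, here is a hand-written sketch of a single-contrast F-test on simulated data with three groups and one covariate. It follows the standard formula F = (c'b)^2 / (c' Cov(b) c) and is an illustration, not the statsmodels implementation; the group means, the covariate and all names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
group = np.repeat([0, 1, 2], n)          # three hypothetical categories
x = rng.normal(size=3 * n)               # a continuous covariate
y = (np.array([0.0, 1.0, 3.0])[group] + 0.5 * x
     + rng.normal(scale=0.5, size=3 * n))

# Design matrix: intercept, dummy for group 1, dummy for group 2, covariate
X = np.column_stack([np.ones(3 * n), group == 1, group == 2, x]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = X.shape[0] - X.shape[1]
cov_beta = (resid @ resid / dof) * np.linalg.inv(X.T @ X)

# Contrast: coefficient of group 1 minus coefficient of group 2
c = np.array([0.0, 1.0, -1.0, 0.0])
effect = c @ beta
f_value = effect ** 2 / (c @ cov_beta @ c)
p_value = stats.f.sf(f_value, 1, dof)    # F distribution with (1, dof) degrees of freedom
print(effect, f_value, p_value)
```

Each entry of the contrast vector weights one row of the coefficient table, in order, so [0, 1, -1, 0] compares the second and third coefficients and ignores the intercept and the covariate.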

Exercise

Going back to the brain size + IQ data, test if the VIQ of male and female are different after removing
the effect of brain size, height and weight.

3.1.4. More visualization: seaborn for statistical exploration


Seaborn combines simple statistical fits with plotting on pandas dataframes.

Let us consider a dataset giving wages and other personal information on 500 individuals (Berndt, ER.
The Practice of Econometrics. 1991. NY: Addison-Wesley).
The full code for loading and plotting the wages data is found in the corresponding example.

>>> print(data)
EDUCATION SOUTH SEX EXPERIENCE UNION WAGE AGE RACE \
0 8 0 1 21 0 0.707570 35 2
1 9 0 1 42 0 0.694605 57 3
2 12 0 0 1 0 0.824126 19 3
3 12 0 0 4 0 0.602060 22 3
...

3.1.4.1. Pairplot: scatter matrices

We can easily get an intuition about the interactions between continuous variables using
seaborn.pairplot() to display a scatter matrix:

>>> import seaborn
>>> seaborn.pairplot(data, vars=['WAGE', 'AGE', 'EDUCATION'],
...                  kind='reg')

Categorical variables can be plotted as the hue:

>>> seaborn.pairplot(data, vars=['WAGE', 'AGE', 'EDUCATION'],
...                  kind='reg', hue='SEX')

Look and feel and matplotlib settings

Seaborn changes the default of matplotlib figures to achieve a more “modern”, “excel-like” look. It does
that upon import. You can reset the default using:

>>> from matplotlib import pyplot as plt
>>> plt.rcdefaults()

To switch back to seaborn settings, or understand better styling in seaborn, see the relevant section of the
seaborn documentation.

3.1.4.2. lmplot: plotting a univariate regression

A regression capturing the relation between one variable and another, e.g. wage and education, can be
plotted using seaborn.lmplot():

>>> seaborn.lmplot(y='WAGE', x='EDUCATION', data=data)

Robust regression
In the above plot, there seem to be a couple of data points outside of the main cloud to the right. They
might be outliers, not representative of the population, but driving the regression.

To compute a regression that is less sensitive to outliers, one must use a robust model. This is done in
seaborn using robust=True in the plotting functions, or in statsmodels by replacing the use of the
OLS by a “Robust Linear Model”, statsmodels.formula.api.rlm().

3.1.5. Testing for interactions


Do wages increase more with education for males than females?

The plot above is made of two different fits. We need to formulate a single model that tests for a
difference of slope across the two populations. This is done via an “interaction”.

>>> import statsmodels.formula.api as sm
>>> result = sm.ols(formula='wage ~ education + gender + education * gender',
...                 data=data).fit()
>>> print(result.summary())
...
                              coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept                   0.2998      0.072      4.173      0.000       0.159       0.441
gender[T.male]              0.2750      0.093      2.972      0.003       0.093       0.457
education                   0.0415      0.005      7.647      0.000       0.031       0.052
education:gender[T.male]   -0.0134      0.007     -1.919      0.056      -0.027       0.000
==========================...
...

Can we conclude that education benefits males more than females?
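
To see what the interaction term adds to the model, the sketch below builds the design matrix by hand on simulated data. The data and the ground-truth coefficients are hypothetical (chosen only to resemble the table above), and the fit uses plain numpy least squares rather than statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
education = rng.integers(8, 19, size=n).astype(float)   # hypothetical years of schooling
male = rng.integers(0, 2, size=n).astype(float)         # 1 for male, 0 for female

# Hypothetical ground truth with a different education slope per group
wage = (0.3 + 0.04 * education + 0.25 * male
        - 0.015 * education * male
        + rng.normal(scale=0.05, size=n))

# "education * gender" expands to both main effects plus their product
X = np.column_stack([np.ones(n), male, education, education * male])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
names = ['Intercept', 'male', 'education', 'education:male']
for name, value in zip(names, beta):
    print(name, round(value, 3))
```

The education coefficient is the slope for the reference group (females here), and the interaction coefficient is how much the slope changes for males; a slope difference between the groups is therefore read from that last row and its p-value.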

Take home messages

Hypothesis testing and p-values give you the significance of an effect / difference.
Formulas (with categorical variables) enable you to express rich links in your data.
Visualizing your data and fitting simple models give insight into the data.
Conditioning (adding factors that can explain all or part of the variation) is an important
modeling aspect that changes the interpretation.

3.1.6. Full code for the figures

Code examples for the statistics chapter.

Boxplots and paired differences
Plotting simple quantities of a pandas dataframe
Analysis of Iris petal and sepal sizes
Simple Regression
Multiple Regression
Test for an education/gender interaction in wages
Visualizing factors influencing wages
Air fares before and after 9/11

3.1.7. Solutions to this chapter’s exercises

Relating Gender and IQ

Download all examples in Python source code: auto_examples_python.zip


Download all examples in Jupyter notebooks: auto_examples_jupyter.zip
